
Processing and Standardising the Combined Project Dataset (Part 2)

Part 2 of a multi-step data preparation pipeline, focusing on processing and standardising a combined project dataset to ensure consistency, correctness, and readiness for downstream analysis.

4 min read

By Davina Leong

🧱 Introduction: Consistency Comes Early

After joining multiple CSV trackers into a single dataset, the next critical step is making that data consistent.

This notebook represents Part 2 of the pipeline, where the combined dataset is processed and standardised. Before adding enrichment or advanced features, it’s important to ensure that formats, values, and structures behave predictably.

Clean data starts with consistency.


🎯 Purpose: Stabilising the Dataset

The goal of this step is to:

  • Standardise column formats and naming
  • Normalise values across records
  • Resolve inconsistencies introduced by multiple data sources
  • Ensure the dataset behaves reliably during analysis and ML preparation

This step reduces risk before moving deeper into the pipeline.


🧠 How It Works: Processing as a Pipeline Stage

At a high level, this notebook performs the following:

  1. Load the combined dataset from Part 1
  2. Inspect columns for formatting and value inconsistencies
  3. Apply standardisation rules consistently
  4. Convert data types explicitly
  5. Validate that transformations behave as expected

Each operation is deliberate and repeatable.
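
As a rough illustration of this stage, the sketch below loads the combined file, inspects it, converts types explicitly, and validates the result. The file name combined_projects.csv and the columns start_date and hours_logged are assumptions for the example, not the notebook's actual schema.

import pandas as pd

# 1. Load the combined dataset produced in Part 1 (filename assumed)
df = pd.read_csv("combined_projects.csv")

# 2. Inspect columns for formatting and value inconsistencies
print(df.dtypes)
print(df["status"].unique())

# 3–4. Apply standardisation rules and convert types explicitly (columns assumed)
df["start_date"] = pd.to_datetime(df["start_date"], errors="coerce")
df["hours_logged"] = pd.to_numeric(df["hours_logged"], errors="coerce")

# 5. Validate that transformations behave as expected
assert df["start_date"].dtype == "datetime64[ns]"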


🧩 The Technical Part: Standardisation in Practice

A simplified example of processing logic looks like this:

df["project_name"] = df["project_name"].str.strip().str.lower()
df["status"] = df["status"].str.upper()

Other standardisation techniques demonstrated include:

  • 🧹 Trimming whitespace
  • 🔄 Explicit type conversion
  • 📏 Normalising categorical values
  • 🧠 Applying consistent transformation rules

These steps ensure uniform behaviour across the dataset.
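
For instance, a minimal sketch of categorical normalisation and explicit type conversion might look like the following. The status spellings, the mapping, and the priority column are illustrative assumptions rather than the notebook's real data.

# map free-form status spellings onto a fixed label set (values assumed)
status_map = {
    "in progress": "IN_PROGRESS",
    "done": "COMPLETED",
    "complete": "COMPLETED",
}
df["status"] = (
    df["status"].str.strip().str.lower().map(status_map).fillna("UNKNOWN")
)

# convert an assumed priority column to an ordered categorical dtype
df["priority"] = pd.Categorical(
    df["priority"], categories=["LOW", "MEDIUM", "HIGH"], ordered=True
)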


💡 Key Takeaways: Predictable Data Enables Progress

This notebook reinforces several important lessons:

  • 🧱 Consistency matters before enrichment
  • πŸ” Standardisation prevents downstream bugs
  • 🧠 Clean formats enable reliable joins and features
  • 🛠 Processing is a core data-engineering skill

Stable data makes every later step easier.


🏁 Conclusion: Preparing for Enrichment

Processing and Standardising the Combined Project Dataset (Part 2) is a stabilisation milestone:

If Part 1 unified the data, Part 2 makes it trustworthy.

With a consistent dataset in place, the pipeline can now move on to:

  • Enrichment and derived fields
  • Handling missing values
  • Feature engineering and embeddings

This notebook lays the groundwork for everything that follows.


🔗 Link to Notebook

Notebook link: Coming Soon

Python · Jupyter Notebook · Data Preparation · Data Processing · Data Standardisation · Machine Learning · Pandas