Dav/Devs

Preparing and Enriching the Dataset for ML & AI (Part 3)

Part 3 of a data preparation pipeline, focusing on preparing and enriching a cleaned dataset with ML- and AI-relevant fields to support downstream modelling and embeddings.

4 min read

By Davina Leong

🤖 Introduction: Shaping Data for Learning Systems

By this stage in the pipeline, the dataset is already:

  • Unified from multiple sources
  • Processed and standardised for consistency

However, machine learning and AI systems require more than clean data; they require informative data.

This notebook represents Part 3, where the dataset is prepared and enriched with ML- and AI-relevant structure, making it suitable for feature engineering, embeddings, and modelling.

This is where data becomes learnable.


🎯 Purpose: Making Data Useful for ML & AI

The goal of this step is to:

  • Prepare the dataset specifically for ML/AI workflows
  • Add or refine fields that improve signal quality
  • Align data formats with downstream modelling needs
  • Reduce ambiguity in features and labels

This step bridges data engineering and machine learning.


🧠 How It Works: ML-Oriented Dataset Preparation

At a high level, this notebook performs the following:

  1. Load the processed and standardised dataset
  2. Identify fields relevant for ML and AI tasks
  3. Refine or derive features from existing data
  4. Remove noise or non-informative columns
  5. Ensure the dataset structure supports learning workflows

Every transformation is driven by model readiness, not just cleanliness.
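
The five steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the inline CSV, the column names other than `duration_minutes`, the `prepare_for_ml` helper, and the "single unique value" rule for spotting non-informative columns are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical stand-in for the processed, standardised dataset.
RAW_CSV = """task_id,title,duration_minutes,source,status
1,Write report,45,app,done
2,Plan sprint,150,app,done
3,Review code,30,app,open
"""


def prepare_for_ml(csv_text: str) -> pd.DataFrame:
    # 1. Load the processed and standardised dataset.
    df = pd.read_csv(io.StringIO(csv_text))

    # 2. Identify the fields relevant to the task (an assumed selection).
    relevant = ["title", "duration_minutes", "source", "status"]
    df = df[relevant].copy()

    # 3. Derive a feature from existing data.
    df["duration_hours"] = df["duration_minutes"] / 60

    # 4. Remove non-informative columns: here, any column with one unique value.
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    df = df.drop(columns=constant)

    # 5. Check that the structure matches what later steps expect.
    expected = {"title", "duration_hours", "status"}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return df


prepared = prepare_for_ml(RAW_CSV)
print(prepared.columns.tolist())
```

In this sample the `source` column holds the same value in every row, so step 4 drops it. A real pipeline would likely use a more careful definition of "noise" than a constant-value check.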


🧩 The Technical Part: Preparing Features for Learning

A simplified example of preparation logic might look like this:

# Convert minutes to hours, then flag tasks that run longer than two hours.
df["duration_hours"] = df["duration_minutes"] / 60
df["is_long_task"] = df["duration_hours"] > 2

Other preparation techniques demonstrated include:

  • 🧮 Deriving numeric features
  • 🏷 Creating categorical or boolean indicators
  • 🧠 Aligning feature naming for clarity
  • Selecting ML-relevant columns only

These steps reduce friction in later ML stages.
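
As a rough sketch of the four techniques listed above, the example below applies each one to a tiny hand-made DataFrame. Everything except `duration_minutes`, `duration_hours`, and `is_long_task` is hypothetical: the sample rows, the `desc` and `task_id` columns, the bucket boundaries, and the renamed column are assumptions chosen only to make the example runnable.

```python
import pandas as pd

# Hypothetical sample; column names other than duration_minutes are assumptions.
df = pd.DataFrame(
    {
        "task_id": [101, 102, 103],
        "desc": ["Write report", "Plan sprint", "Review code"],
        "duration_minutes": [45, 150, 30],
    }
)

# Deriving numeric features.
df["duration_hours"] = df["duration_minutes"] / 60

# Creating a boolean indicator and a categorical bucket.
df["is_long_task"] = df["duration_hours"] > 2
df["duration_bucket"] = pd.cut(
    df["duration_minutes"],
    bins=[0, 30, 120, float("inf")],  # assumed cut-offs, right-inclusive
    labels=["short", "medium", "long"],
)

# Aligning feature naming for clarity.
df = df.rename(columns={"desc": "task_description"})

# Selecting ML-relevant columns only (the identifier is left out).
feature_columns = [
    "task_description",
    "duration_hours",
    "is_long_task",
    "duration_bucket",
]
features = df[feature_columns]
print(features)
```

Leaving `task_id` out of `features` reflects a common choice: row identifiers rarely carry signal and can encourage a model to memorise rows. Whether that holds for a given dataset is a judgment call.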


💡 Key Takeaways: ML Preparation Is Intentional

This notebook reinforces several important ideas:

  • 🤖 ML datasets require deliberate feature thinking
  • 🧠 Not all clean data is useful data
  • Preparation improves model performance downstream
  • 🛠 Feature readiness is as important as model choice

Well-prepared datasets simplify everything that follows.


🏁 Conclusion: Ready for the Next ML Steps

Preparing and Enriching the Dataset for ML & AI (Part 3) marks a clear transition point:

The dataset is no longer just correct; it is now useful for learning systems.

With this foundation, the pipeline can confidently proceed to:

  • Handling missing values
  • Generating embeddings
  • Model training and validation

This notebook sets the stage for true ML work.


🔗 Link to Notebook

Notebook link: Coming Soon

Python, Jupyter Notebook, Machine Learning, AI, Data Preparation, Feature Engineering, Pandas
Dav/Devs - Full Stack Developer Portfolio