Dav/Devs

Handling Missing Values and Data Gaps in the Dataset (Part 4)

Part 4 of a multi-step data preparation pipeline, focusing on identifying, handling, and resolving missing values and data gaps to improve dataset reliability for machine learning.

4 min read

By Davina Leong

๐Ÿ•ณ๏ธ Introduction: Real Data Has Gaps

No real-world dataset is complete.

Even after joining, processing, and preparing data for ML and AI, missing values and gaps inevitably remain. Ignoring them can lead to biased models, runtime errors, or misleading insights.

This notebook represents Part 4 of the pipeline, focusing on identifying and handling missing data deliberately and safely.


🎯 Purpose: Making the Dataset Robust

The goal of this step is to:

  • Identify missing or incomplete values
  • Understand why data is missing
  • Apply appropriate filling or replacement strategies
  • Preserve data integrity while reducing noise
  • Ensure the dataset remains usable for downstream ML tasks

Handling missing data is about judgement, not just filling blanks.


🧠 How It Works: Missing Data as a First-Class Concern

At a high level, the notebook follows this approach:

  1. Inspect the dataset for missing values
  2. Identify patterns of missingness
  3. Decide whether to fill, replace, or leave values untouched
  4. Apply consistent strategies across the dataset
  5. Validate that the resulting data behaves as expected

Each choice is intentional, not automatic.
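
The first two steps can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the sample rows, the `owner` column, and the `missing_report` helper are hypothetical, and only the `duration_hours` and `status` column names come from the example later in this article.

```python
import pandas as pd

# Hypothetical sample data; only the first two column names follow the article.
df = pd.DataFrame(
    {
        "duration_hours": [1.5, None, 3.0, None],
        "status": ["DONE", None, "OPEN", "DONE"],
        "owner": ["ana", "ben", "cy", "dee"],
    }
)


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, worst column first."""
    report = pd.DataFrame(
        {
            "missing_count": frame.isna().sum(),
            "missing_fraction": frame.isna().mean(),
        }
    )
    return report.sort_values("missing_count", ascending=False)


report = missing_report(df)

# Rows where both columns are missing together may point to a shared cause,
# for example a record that was never completed upstream.
both_missing = df["duration_hours"].isna() & df["status"].isna()

print(report)
print("rows missing both:", int(both_missing.sum()))
```

A report like this helps with step 3: a column that is almost entirely empty calls for a different decision than one with a handful of gaps.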


🧩 The Technical Part: Filling and Managing Gaps

A simplified example of missing-value handling looks like this:

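# Numeric gap: treat a missing duration as zero hours
df["duration_hours"] = df["duration_hours"].fillna(0)
# Categorical gap: make the missing status explicit with a placeholder
df["status"] = df["status"].fillna("UNKNOWN")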

Other techniques demonstrated include:

  • 🧮 Filling numeric fields with defaults or computed values
  • 🏷 Filling categorical fields with placeholders
  • 🔍 Checking for nulls using isna() / notna()
  • 🧠 Ensuring fills don't distort downstream analysis

The notebook treats missing data as a data-quality problem, not a syntax issue.
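
To make "computed values" and "fills don't distort downstream analysis" concrete, here is a small sketch. It is one possible approach, not necessarily the one the notebook uses: the `task_type` column, the sample values, and the choice of a per-group median are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: two task types, each with one missing duration.
df = pd.DataFrame(
    {
        "task_type": ["build", "build", "build", "review", "review"],
        "duration_hours": [4.0, 6.0, None, 1.0, None],
    }
)

# Record which rows were missing before filling, so the fill stays visible
# to later analysis and to the model.
df["duration_was_missing"] = df["duration_hours"].isna()

# Computed fill: the median of the observed values within each task type.
group_median = df.groupby("task_type")["duration_hours"].transform("median")
df["duration_filled"] = df["duration_hours"].fillna(group_median)

# Compare how each strategy moves the overall mean.
mean_observed = df["duration_hours"].mean()  # mean() skips NaN by default
mean_zero_fill = df["duration_hours"].fillna(0).mean()
mean_group_fill = df["duration_filled"].mean()

print(f"observed only: {mean_observed:.2f}")
print(f"zero fill:     {mean_zero_fill:.2f}")
print(f"group median:  {mean_group_fill:.2f}")
```

In this toy data the zero fill pulls the mean well below the observed average, while the group median stays closer to it. Neither is correct in general; the point is to measure the effect of a fill before committing to it.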


💡 Key Takeaways: Missing Data Is a Design Choice

This notebook reinforces several important lessons:

  • ๐Ÿ•ณ๏ธ Missing data is expected, not exceptional
  • ๐Ÿง  Filling strategies should match intent
  • ๐Ÿ” Consistency matters more than perfection
  • ๐Ÿ›  Thoughtful handling improves model reliability

Poor missing-value handling is a common source of ML bugs, because a badly chosen fill raises no error and only shows up later as skewed results.
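
One way to keep fills consistent and to validate the result is to record every decision in a single place and check for leftover nulls afterwards. The sketch below assumes this pattern; `FILL_POLICY`, `apply_fill_policy`, `assert_no_gaps`, and the `notes` column are hypothetical names, not taken from the notebook.

```python
import pandas as pd

# Hypothetical fill policy: one place that records the decision per column.
FILL_POLICY = {"duration_hours": 0, "status": "UNKNOWN"}


def apply_fill_policy(frame: pd.DataFrame, policy: dict) -> pd.DataFrame:
    """Return a filled copy; columns outside the policy keep their gaps."""
    unknown = set(policy) - set(frame.columns)
    if unknown:
        raise KeyError(f"policy refers to unknown columns: {sorted(unknown)}")
    # A dict passed to fillna is applied column by column.
    return frame.fillna(value=policy)


def assert_no_gaps(frame: pd.DataFrame, columns) -> None:
    """Fail loudly if any of the given columns still contains nulls."""
    remaining = frame[list(columns)].isna().sum()
    bad = remaining[remaining > 0]
    if not bad.empty:
        raise ValueError(f"nulls remain: {bad.to_dict()}")


raw = pd.DataFrame(
    {
        "duration_hours": [2.0, None],
        "status": [None, "OPEN"],
        "notes": [None, "left as-is on purpose"],
    }
)

clean = apply_fill_policy(raw, FILL_POLICY)
assert_no_gaps(clean, FILL_POLICY)  # passes: both policy columns are filled
print(clean)
```

Keeping the policy in one dictionary means the same choices can be reused on new data, and columns deliberately left untouched (here, `notes`) stay visibly outside the policy.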


๐Ÿ Conclusion: Closing the Gaps Before Learning

Handling Missing Values and Data Gaps in the Dataset (Part 4) is a stabilising step in the pipeline:

Clean structure enables learning, but robust handling enables trust.

With missing data addressed, the dataset is now ready to move into:

  • Embeddings and representation learning
  • Feature vector generation
  • Final validation and modelling

This notebook ensures the dataset won't fall apart later.


🔗 Link to Notebook

Notebook link: Coming Soon

Python, Jupyter Notebook, Data Preparation, Missing Data, Data Cleaning, Machine Learning, Pandas