
The Fundamentals of Machine Learning

Chapter 2

End-to-End Machine Learning Project


Chapter 2 of Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow walks through a complete machine learning project from start to finish using a real-world housing dataset. Aurélien Géron’s goal is to demonstrate the practical workflow that ML practitioners follow, emphasizing that success comes from disciplined process, not just model selection.

Big Picture First

The chapter begins by stressing the importance of understanding the business objective before touching the data. Géron frames the example problem as predicting housing prices in California to support a company’s decision-making.

Key ideas:

  • Define the problem clearly (supervised regression in this case)
  • Identify performance metrics (e.g., RMSE)
  • Understand how the model will be used in production
  • Establish a baseline expectation

Takeaway: Machine learning projects start with problem framing, not coding.
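The RMSE metric mentioned above is simple enough to write out directly — a minimal sketch (the function name and sample values are illustrative, not from the book):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: the square root of the mean squared residual."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# One prediction off by 2, one exact: sqrt((2^2 + 0^2) / 2) = sqrt(2)
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

Because errors are squared before averaging, RMSE penalizes large misses more heavily than small ones, which is why Géron picks it for the housing-price task.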

Get the Data

Next, the dataset is loaded and explored. Géron introduces best practices for data acquisition and versioning.

Important steps:

  • Download and store the dataset
  • Take an initial look at structure and features
  • Identify target variable
  • Note potential data issues

He emphasizes creating reproducible data pipelines early.
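A reproducible loading step can be sketched as a small function (illustrative only — the book downloads the real California housing tarball; here a tiny inline CSV stands in for it, and the column names mirror the book's dataset):

```python
import io
import pandas as pd

# Stand-in for the downloaded housing CSV (three rows, three columns)
CSV = """median_income,housing_median_age,median_house_value
8.3252,41,452600
8.3014,21,358500
7.2574,52,352100
"""

def load_housing_data(source) -> pd.DataFrame:
    """Load the housing dataset from a path or file-like source."""
    return pd.read_csv(source)

housing = load_housing_data(io.StringIO(CSV))
print(housing.shape)             # initial look at structure
print(housing.columns.tolist())  # identify the target (median_house_value)
```

Wrapping acquisition in a function means every teammate, notebook, and CI run loads identical data — the reproducibility Géron stresses.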

Explore the Data (EDA)

Exploratory Data Analysis (EDA) helps build intuition about the dataset.

Key activities:

  • Visualize distributions with histograms
  • Use scatter plots to find correlations
  • Identify geographical patterns
  • Detect anomalies and skewed features

A major insight is that visualization often reveals problems that statistics alone miss.
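The numeric side of EDA can be sketched as follows (synthetic data standing in for the housing set; the plotting call is commented out so the sketch stays dependency-light):

```python
import numpy as np
import pandas as pd

# Fabricate income/price data with a known positive relationship
rng = np.random.default_rng(42)
income = rng.lognormal(mean=1.0, sigma=0.5, size=500)
value = 50_000 * income + rng.normal(0, 20_000, size=500)
housing = pd.DataFrame({"median_income": income, "median_house_value": value})

print(housing.describe())  # scale, spread, and skew at a glance

# Correlation with the target highlights promising predictors
corr = housing.corr(numeric_only=True)
print(corr["median_house_value"].sort_values(ascending=False))

# housing.hist(bins=50)  # with matplotlib installed, histograms reveal skew
```

Numbers like these complement the plots: a strong linear correlation shows up in `corr`, but capped values or odd clusters only show up visually.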

Create a Test Set

One of the most critical lessons in the chapter is to split the data early to avoid data leakage.

Géron demonstrates:

  • Random train/test splitting
  • Stratified sampling based on income categories
  • Why naive random splits can bias results

Takeaway: Protect the test set to ensure honest evaluation.
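Stratified splitting can be sketched with scikit-learn's `train_test_split` and its `stratify` parameter (synthetic income data; the bin edges follow the book's income categories):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
housing = pd.DataFrame({"median_income": rng.lognormal(1.0, 0.5, 1000)})

# Bin continuous income into categories so sampling can preserve its distribution
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42
)

# Each income category appears in nearly identical proportions in both splits
print(test_set["income_cat"].value_counts(normalize=True).sort_index())
```

A purely random split can over- or under-sample a small category by chance; stratifying guarantees the test set mirrors the population on the chosen attribute.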

Data Preparation

This section shows how raw data becomes model-ready.

Major steps include:

Handling Missing Values

Options discussed:

  • Remove rows
  • Remove features
  • Impute values (median is commonly used)
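Median imputation is one line with scikit-learn's `SimpleImputer` — a minimal sketch on made-up bedroom counts:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

housing = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, np.nan, 235.0]})

# fit() learns the median from the observed values; transform() fills the gaps
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(housing)

print(imputer.statistics_)  # the learned median per column
```

Because the imputer stores its statistics, the same median learned on training data is reused on validation and production data — avoiding subtle leakage.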

Feature Engineering

Géron creates new features that improve predictive power, such as:

  • Rooms per household
  • Bedrooms per room
  • Population per household

Key insight: Good features often matter more than complex models.
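The ratio features above are straightforward column arithmetic — a sketch using two fabricated districts with the book's column names:

```python
import pandas as pd

housing = pd.DataFrame({
    "total_rooms": [880, 7099],
    "total_bedrooms": [129, 1106],
    "population": [322, 2401],
    "households": [126, 1138],
})

# Per-district totals are misleading on their own; ratios normalize by size
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

print(housing[["rooms_per_household", "bedrooms_per_room"]])
```

In the book, `bedrooms_per_room` turns out to correlate with house value more strongly than either raw count — exactly the point that good features beat raw inputs.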

Encoding Categorical Variables

Techniques covered:

  • Ordinal encoding
  • One-hot encoding (preferred in many cases)

Feature Scaling

Many algorithms are sensitive to feature scale, so two common approaches are covered:

  • Standardization (zero mean, unit variance)
  • Min-max scaling (rescaling to a fixed range, typically [0, 1])
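Both scalers follow the same fit/transform pattern — a minimal sketch on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

std = StandardScaler().fit_transform(X)  # zero mean, unit variance
mm = MinMaxScaler().fit_transform(X)     # rescaled into [0, 1]

print(std.ravel())
print(mm.ravel())
```

Standardization is less affected by outliers (no fixed output range); min-max scaling is useful when an algorithm expects bounded inputs.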

Build a Training Pipeline

A major best practice introduced is using Scikit-Learn pipelines to automate preprocessing and ensure consistency.

Benefits:

  • Reproducibility
  • Cleaner code
  • Reduced risk of data leakage
  • Easier experimentation

This is one of the most practically important lessons in the chapter.

Select and Train Models

Géron trains several models to establish baselines:

  • Linear Regression
  • Decision Tree
  • Random Forest

He demonstrates that:

  • Simple models provide useful baselines
  • Complex models can overfit
  • Performance must be validated properly
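The baseline comparison can be sketched on synthetic regression data (standing in for the prepared housing features — not the book's exact numbers):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Noisy linear data: a known ground truth plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0.0, 0.5, size=200)

rmses = {}
for model in (
    LinearRegression(),
    DecisionTreeRegressor(random_state=42),
    RandomForestRegressor(n_estimators=50, random_state=42),
):
    model.fit(X, y)
    rmses[type(model).__name__] = np.sqrt(mean_squared_error(y, model.predict(X)))

# An unconstrained tree memorizes its training set (RMSE near 0) — the classic
# overfitting warning sign that training error alone cannot expose
for name, score in rmses.items():
    print(f"{name}: {score:.3f}")
```

The suspiciously perfect tree score here mirrors the book's result, and motivates the cross-validation step that follows.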

Better Evaluation with Cross-Validation

Instead of relying on a single train/test split, the chapter introduces cross-validation.

Benefits:

  • More reliable performance estimates
  • Better model comparison
  • Reduced variance in evaluation

This step reveals overfitting in the Decision Tree model.
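Cross-validation can be sketched with scikit-learn's `cross_val_score` (same kind of synthetic data as above, not the book's dataset):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0.0, 0.5, size=200)

# 10-fold CV: train on 9 folds, score on the held-out fold, rotate
tree = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree, X, y, scoring="neg_root_mean_squared_error", cv=10)
cv_rmse = -scores.mean()

print(f"CV RMSE: {cv_rmse:.3f}")  # far worse than the tree's ~0 training RMSE
```

The gap between training RMSE (essentially zero) and cross-validated RMSE is the overfitting signal the chapter uses to eliminate the unconstrained tree.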

Fine-Tune the Model

Hyperparameter tuning is performed using:

  • Grid Search
  • Randomized Search

Géron shows how systematic tuning improves performance and why blind guessing is inefficient.
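Grid search can be sketched as follows (synthetic data; the parameter grid is illustrative, smaller than the book's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0.0, 0.5, size=200)

# Every combination in the grid is cross-validated; the best one wins
param_grid = {"n_estimators": [10, 50], "max_features": [1, 2, 3]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

print(search.best_params_)
```

`RandomizedSearchCV` has the same interface but samples parameter combinations instead of enumerating them, which scales better when the grid is large.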

Analyze the Best Model

After selecting the best model (Random Forest), the chapter demonstrates:

  • Feature importance analysis
  • Error analysis
  • Model interpretation

This helps explain why the model works, not just how well it performs.
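Feature importance analysis is built into the fitted forest — a sketch where one synthetic feature carries nearly all the signal (feature names are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Feature 0 drives the target; feature 1 is pure noise; feature 2 barely matters
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(0.0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

for name, imp in zip(["strong_signal", "noise", "weak_signal"],
                     forest.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

In the book, this ranking justifies pruning low-value features — another place where understanding the model feeds back into the data-preparation step.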

Evaluate on the Test Set

Only after all tuning is complete is the model evaluated on the held-out test set.

This provides an unbiased estimate of real-world performance.

Critical principle: 👉 Never touch the test set until the very end.

Key Takeaways

  • Machine learning is an end-to-end engineering process
  • Problem framing comes before modeling
  • Always split data early to avoid leakage
  • Feature engineering is extremely powerful
  • Pipelines improve reliability and reproducibility
  • Cross-validation gives more trustworthy evaluation
  • Hyperparameter tuning should be systematic
  • Final evaluation must use untouched test data

Bottom line: Chapter 2 teaches that successful machine learning depends far more on disciplined workflow and data preparation than on choosing sophisticated algorithms.