The Practical Machine Learning Workflow

Every idea in this course, assembled into one repeatable, disciplined process — the order you do things in, and why that order is the whole game.

You have now met every piece of the machine learning puzzle separately: framing a problem, splitting data, baselines, models, metrics, pipelines, cross-validation, tuning, interpretation. This page is where the pieces snap together. Because here is the secret that no single earlier page could teach you: in machine learning, the order you do things in matters as much as the things themselves. Do the right steps in the wrong order and you will fool yourself spectacularly — a great score that evaporates the moment real data arrives.

A practical workflow is not a checklist you memorize. It is a discipline: a way of working that keeps you honest, makes your results reproducible, and fails loudly instead of silently. Let us walk the whole loop, then run it end to end on a real dataset.

The workflow at a glance

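The workflow runs in ten steps: frame the problem, get and inspect the data, split and seal the test set, establish a baseline, build a pipeline, cross-validate, tune, evaluate once on the test set, interpret, and ship and monitor. Read that sequence top to bottom and notice three things. The test set is sealed at step 3 and not opened until step 8; every fit and every decision in between uses training data alone. There is a loop: steps 5 through 7 repeat as you experiment, but the loop never reaches into the test set. And interpretation comes after you trust the model's accuracy, not before. We will take the steps one at a time.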

Step 1 — Frame the problem

Before a single line of code, answer: what are we predicting, and why? Is the target a number (regression), a category (classification), or is there no target at all (clustering)? What would a useful result look like, and how will you measure it? Crucially — is this even a machine learning problem? Plenty of questions are better answered by a simple query, a chart, or a domain expert than by a model.

The choice of metric belongs here too, decided before you see results so you cannot rationalize a flattering one after the fact. Predicting a rare disease? Accuracy will lie to you (the imbalance chapter showed why); you likely want recall, precision, or the area under the ROC curve.
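To see why accuracy misleads on a rare target, here is a minimal sketch. The labels are made up for illustration (10 positives in 1,000 cases, an assumption rather than data from this course), and the "model" simply predicts the majority class every time.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening data: 10 sick patients out of 1,000 (1 = sick).
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that always predicts healthy and never finds a single case.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print(f"accuracy: {acc:.2f}")  # accuracy: 0.99
print(f"recall:   {rec:.2f}")  # recall:   0.00
```

The accuracy looks excellent while the recall shows the model caught nobody, which is why the metric has to be chosen to match the question before any results exist.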

The most expensive mistakes happen before any modeling

A model that brilliantly answers the wrong question is worthless. Framing — nailing down the target, the metric, and whether ML is even the right tool — is the cheapest step to get right and the most expensive to get wrong. Spend real time here.

Step 2 — Get and inspect the data

Load the data and look at it before you model it. How many rows and columns? What types? Are there missing values, obvious errors, or absurd outliers? How is the target distributed — balanced or lopsided? This is the Pandas-and-plots work you already know, and it is not optional: every modeling decision downstream depends on what you find here.

Inspection also catches disasters early. A target that is 99% one class, a feature that is secretly a copy of the answer, a column that is all one value — you want to discover these now, by looking, not three weeks later by debugging a model that "works too well."
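A first inspection pass might look like the sketch below. It assumes the breast-cancer data used later on this page is scikit-learn's built-in load_breast_cancer dataset; the specific checks (missing values, constant columns, correlation with the target) are illustrative choices, not a complete audit.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: the course's breast-cancer data is scikit-learn's built-in copy.
data = load_breast_cancer(as_frame=True)
df = data.frame  # 30 feature columns plus a "target" column

print("shape:", df.shape)
print("dtypes:")
print(df.dtypes.value_counts())

# Missing values across the whole table.
n_missing = int(df.isna().sum().sum())
print("missing values:", n_missing)

# How lopsided is the target?
class_share = df["target"].value_counts(normalize=True)
print(class_share)

# Columns that hold a single value carry no information.
constant_cols = [c for c in df.columns if df[c].nunique() <= 1]
print("constant columns:", constant_cols)

# A feature that tracks the target almost perfectly deserves a second look:
# it may be a disguised copy of the answer.
corr_with_target = (
    df.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
print(corr_with_target.head())
```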

Step 3 — Split, and lock the test set away

The moment you understand the data, split it — before you scale, encode, impute, or engineer anything. Set aside a test set with train_test_split (stratified, for classification) and then forget it exists until the very end. The test set is your one honest measurement of the future, and it is honest only as long as it stays sealed.

This early-split discipline is the front-line defense against data leakage: if you compute anything from the whole dataset before splitting — a mean for scaling, a category list for encoding — information from the test set has already seeped into training, and your evaluation is quietly compromised.

Split before you do anything that learns from data

Every transformation that learns from data — scaling, encoding, imputing, feature selection — must be fitted on the training set only, after the split. Fit it on the full dataset first and you leak the test set into training. The split comes early for exactly this reason, and the `Pipeline` (step 5) makes doing it right automatic.
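A minimal sketch of the split, again assuming scikit-learn's load_breast_cancer data. The test_size and random_state values are illustrative assumptions, not settings prescribed by this page.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Split first, before any scaling, encoding, or imputing.
# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train rows:", len(X_train), "test rows:", len(X_test))
print("positive share, train:", round(y_train.mean(), 3))
print("positive share, test: ", round(y_test.mean(), 3))

# From here on, X_test and y_test stay untouched until the final evaluation.
```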

Step 4 — Establish a simple baseline

Before any fancy model, build the dumbest reasonable one and measure it. For classification, scikit-learn's DummyClassifier always predicts the most common class; for regression, DummyRegressor always predicts the mean. These are deliberately stupid — and that is the point.

A baseline answers the question "is my real model actually adding value, or just clearing a bar that a model with no skill could clear?" If your dataset is 90% one class, a baseline scores 90% accuracy by doing nothing. A "great" 91% model suddenly looks far less great. Without a baseline, you cannot tell skill from the appearance of skill.

A baseline turns a score into a verdict

A bare accuracy number means nothing in isolation. "0.91" is fantastic against a 0.50 baseline and embarrassing against a 0.93 one. Always know your baseline before you celebrate — it is what converts a raw score into an actual judgment of skill.
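As a sketch, the baseline can be measured like this. The dataset loader and split settings are assumptions carried over for illustration, and the baseline is scored with cross-validation on the training data so the test set stays sealed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Always predicts the most common class seen during fit; ignores the features.
baseline = DummyClassifier(strategy="most_frequent")

# Score it on the training data only, using the same 5-fold scheme
# the real model will face later.
baseline_scores = cross_val_score(baseline, X_train, y_train, cv=5)
baseline_acc = baseline_scores.mean()

print(f"baseline accuracy: {baseline_acc:.3f}")
```

Whatever the real model scores later, this is the number it has to beat.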

Step 5 — Build preprocessing and model in a Pipeline

Now the real model. The non-negotiable rule from the pipelines chapter: bundle preprocessing and the estimator into a single Pipeline. A pipeline is not a convenience; it is the mechanism that makes leak-free evaluation automatic. When you cross-validate a pipeline, scikit-learn refits every preprocessing step on each fold's training portion only, so statistics from the validation fold never leak into the fit.

For the breast-cancer data, the model is a LogisticRegression, which benefits from scaled features, so the pipeline is a StandardScaler followed by the model.
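A minimal sketch of that pipeline follows. The step names "scaler" and "model", the max_iter value, and the split settings are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocessing and estimator travel together as one object.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Fitting the pipeline fits the scaler on the rows passed in, and nothing else.
pipe.fit(X_train, y_train)

rows_seen = int(pipe.named_steps["scaler"].n_samples_seen_)
print("rows the scaler learned from:", rows_seen)
print("training rows:               ", len(X_train))
```

The scaler's row count matches the training set, which is the leak-free behavior the pipeline is there to guarantee.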

Step 6 — Evaluate with cross-validation

A single train/validation split gives one noisy number that can swing with luck, especially on small data. Cross-validation rotates the validation role through the training data and averages, giving a far steadier estimate of how the pipeline generalizes — and a spread (a standard deviation) that tells you how much the estimate itself wobbles.

Run it on the training data only. The test set is still sealed.
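A sketch of that evaluation, assuming the same illustrative dataset loader, split settings, and pipeline as a five-fold run:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Five folds, all carved out of the training data.
# The scaler is refit inside every fold on that fold's training portion.
scores = cross_val_score(pipe, X_train, y_train, cv=5)

cv_mean = scores.mean()
cv_std = scores.std()
print("fold accuracies:", scores.round(3))
print(f"mean: {cv_mean:.3f}  std: {cv_std:.3f}")
```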

This mean is your working estimate of quality, and the one you compare across experiments. It already beats the baseline from step 4 by a wide margin — evidence that the model is doing real work.

Step 7 — Tune a little

With a solid pipeline and an honest CV estimate, you can tune hyperparameters — gently. Use GridSearchCV with a small grid (from the tuning chapter), and remember it cross-validates on the training data and never touches the test set. Resist the urge to tune endlessly: a few values of the one or two hyperparameters that matter, and stop. Tuning hard against the CV score risks overfitting the validation folds.
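A light-touch search might look like the sketch below. The grid over the regularization strength C is an assumption for illustration; the "model__C" key follows scikit-learn's step-name, double-underscore, parameter-name convention and depends on the step being called "model".

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# One hyperparameter, four values: a deliberately small grid.
param_grid = {"model__C": [0.01, 0.1, 1, 10]}

# GridSearchCV cross-validates every candidate on the training data
# and never sees X_test.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```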

If the answer to "good enough?" is no, you loop back to step 5: engineer a feature, try a different model, or gather more data. Every iteration of that loop stays inside the training data. The test set waits.

The iteration loop must never touch the test set

Steps 5–7 form a loop you may run many times as you experiment. Each lap is judged by cross-validation on the training data. The instant you use the test set to decide what to try next, it stops being unseen and its final number becomes a fiction. Discipline here is the whole point of the workflow.

Step 8 — Evaluate ONCE on the test set

Every decision is now locked: the model, the preprocessing, the hyperparameters. Now you unlock the test set and evaluate the chosen model on it — a single time. This number is your honest estimate of real-world performance, the one you report and the one you can defend.

The CV score and the test score should be in the same neighborhood. If the test score is much lower, that is a warning: you may have overfit the validation folds during tuning, or leaked information somewhere. The gap between these two numbers is one of the most informative diagnostics in all of applied machine learning.
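The comparison can be sketched as follows. The locked-in setting C=1.0 stands in for whatever tuning selected and is an assumption, as are the dataset loader and split settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# All choices are locked. C=1.0 is a placeholder for the tuned value.
final_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(C=1.0, max_iter=1000)),
])

# The working estimate from the experimentation phase.
cv_mean = cross_val_score(final_pipe, X_train, y_train, cv=5).mean()

# Refit on all training data, then open the test set exactly once.
final_pipe.fit(X_train, y_train)
test_acc = final_pipe.score(X_test, y_test)

gap = cv_mean - test_acc
print(f"CV accuracy:   {cv_mean:.3f}")
print(f"test accuracy: {test_acc:.3f}")
print(f"gap (CV - test): {gap:+.3f}")
```

A small gap in either direction is ordinary sampling noise; a large positive gap is the warning sign described above.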

The test set is single-use

Once you have looked at the test score and used it to make any decision — "let me just try one more thing" — the test set is spent. Its honesty comes entirely from having played no part in your choices. If you genuinely need another round of experimentation after peeking, the right move is fresh data, or to have set aside a second held-out set from the start.

Step 9 — Interpret the model

A trusted, accurate model is not the finish line for most real projects, because people need to know why it predicts what it does. Now apply the interpretation chapter: read the coefficients of a linear model (comparable only when the features were scaled), pull feature_importances_ from a forest, or run permutation_importance on the held-out data. Check that the model leans on features that make domain sense, and brief your stakeholders in the language of features, not weights. And keep the discipline from that chapter front of mind: importance explains prediction, never causation.
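Two of those tools can be sketched together. This assumes the illustrative logistic-regression pipeline and split settings used for the breast-cancer data; n_repeats=10 is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# as_frame=True keeps the feature names attached to the columns.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Coefficients are comparable because every feature was standardized first.
coefs = pd.Series(pipe.named_steps["model"].coef_[0], index=X.columns)
top_coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(5)
print(top_coefs)

# Permutation importance: how much does held-out accuracy drop
# when one feature's values are shuffled?
perm = permutation_importance(
    pipe, X_test, y_test, n_repeats=10, random_state=0
)
perm_means = pd.Series(perm.importances_mean, index=X.columns)
print(perm_means.sort_values(ascending=False).head(5))
```

Both rankings describe what the model relies on to predict, not what causes the outcome.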

Step 10 — Ship, monitor, revisit

A model is never truly "done." The world drifts: customer behavior changes, sensors age, last year's patterns fade. A model that was excellent at launch can quietly decay. So the workflow loops at the largest scale too — deploy, monitor performance on fresh data, and when it slips, return to the top and work through the steps again with new data in hand.

The complete worked example, start to finish

Here is the entire workflow in one runnable block — frame (binary tumor classification, scored by accuracy against a baseline), inspect, split, baseline, pipeline, cross-validate, tune, and the single final test evaluation. This is the shape of nearly every honest scikit-learn project.
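What follows is a minimal sketch of such a block. It assumes scikit-learn's load_breast_cancer data, and the split size, random_state, max_iter, and the grid over C are illustrative values rather than settings given on this page.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: frame (binary classification, accuracy) and inspect.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
print("shape:", X.shape)
print("missing values:", int(X.isna().sum().sum()))
print("class share:", y.value_counts(normalize=True).round(3).to_dict())

# Step 3: split early and seal the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: baseline, scored on training data only.
baseline = DummyClassifier(strategy="most_frequent")
baseline_cv = cross_val_score(baseline, X_train, y_train, cv=5).mean()
print(f"baseline CV accuracy: {baseline_cv:.3f}")

# Step 5: preprocessing and model in one pipeline.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Step 6: cross-validate on training data.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"pipeline CV accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")

# Step 7: tune a little, still on training data.
search = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print(f"tuned CV accuracy: {search.best_score_:.3f}")

# Step 8: open the test set once, on the final line.
test_acc = search.score(X_test, y_test)
print(f"test accuracy: {test_acc:.3f}")
```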

Read the output as a story: the data is mildly imbalanced, the baseline sits in the low-sixties, the pipeline leaps far above it under cross-validation, a tiny grid search nudges it a touch further, and a single test-set evaluation confirms the gain is real. Every number was earned honestly, because the test set sat untouched until the final line. That is the workflow.

The shape never changes

Frame, inspect, split, baseline, pipeline, cross-validate, tune, test once, interpret, iterate. From a one-line linear regression to a hundred-tree forest, this is the loop. Internalize the order — especially "split early, test once" — and you will avoid the mistakes that quietly sink most beginner projects.

When the workflow bends

The ten steps are a default, not a law. Real projects adapt:

Time-series data replaces the random split with a time-ordered one and cross-validation with forward-chaining splits, so you always train on the past and test on the future.
Tiny datasets may not afford both cross-validation folds and a generous test set; you lean harder on cross-validation and tune very lightly to avoid overfitting the validation data.
Pure-exploration projects (unsupervised clustering to understand a dataset) have no target and no test split in the usual sense — you judge with the cluster-evaluation tools instead.
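The time-series adaptation can be sketched with scikit-learn's TimeSeriesSplit. The twelve-row array is a stand-in for time-ordered data and is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations, already sorted from oldest to newest.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, val_idx in tscv.split(X):
    folds.append((train_idx.tolist(), val_idx.tolist()))
    print("train:", train_idx.tolist(), "validate:", val_idx.tolist())

# train: [0, 1, 2] validate: [3, 4, 5]
# train: [0, 1, 2, 3, 4, 5] validate: [6, 7, 8]
# train: [0, 1, 2, 3, 4, 5, 6, 7, 8] validate: [9, 10, 11]
```

Each fold trains on everything before its validation window, so the model is never scored on data older than what it learned from.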

What never bends is the principle underneath: be honest about what your model has and has not seen. Every adaptation above is just that principle applied to a new situation.

Real-world applications

This loop is the daily rhythm of applied machine learning everywhere. A fraud team frames the problem (catch fraud, scored by recall at a fixed false-positive rate), splits by time, baselines against "flag nothing," builds a pipeline, cross-validates, tunes lightly, and confirms on a held-out period before shipping. A hospital follows the same arc for a diagnostic aid, with interpretation elevated to a first-class requirement. The datasets and metrics change; the disciplined order does not.

Your turn

Walk the core of the workflow on the wine dataset, in order.

The data is loaded as X, y. Split first: 25% test, random_state=0, stratified on y, into X_train, X_test, y_train, y_test.
Build a Pipeline called pipe with two steps named exactly "scaler" (a StandardScaler) and "model" (a LogisticRegression(max_iter=5000)).
Cross-validate pipe on the training data with cv=5 and store the mean accuracy in cv_mean (use cross_val_score(...).mean()).
Fit pipe on the full training data, then evaluate it once on the test set with .score(...), storing the result in test_score.

The hidden tests check the split sizes, that pipe is a two-step pipeline with the right step names, that cv_mean is a believable CV accuracy, and that test_score is a valid accuracy.

Check your understanding

Question (select one)

In the workflow, when should you perform the train/test split?

After scaling and encoding all the features

Right after inspecting the data, before any transformation that learns from the data

Only at the very end, just before reporting results

It does not matter when you split, as long as you split eventually

Question (select one)

Why establish a simple baseline (such as DummyClassifier) before building a real model?

Because scikit-learn requires a baseline before fitting other models

Because the baseline usually outperforms real models

Because a raw score is meaningless without a reference point, and the baseline reveals whether the real model adds genuine skill

Because baselines are faster to interpret than real models

Question (select one)

What is the main reason to bundle preprocessing and the estimator into a single Pipeline before cross-validating?

It makes the code shorter to type

So that each cross-validation fold refits the preprocessing on its own training portion only, preventing leakage from the validation fold

Because pipelines train faster than separate steps

Because only pipelines can be tuned with GridSearchCV

Question (select one)

During steps 5–7 you iterate: try a feature, swap a model, retune. Where must all of this experimentation be judged?

On the test set, to get the most accurate feedback

With cross-validation on the training data, leaving the test set untouched

On the full dataset including the test set, for maximum data

It does not need to be judged until after deployment

Question (select one)

You finish tuning, evaluate on the test set, and the test accuracy is much lower than your best cross-validated score. What is the most reasonable concern?

The test set must be defective and should be discarded

Nothing is wrong; the two numbers are unrelated

You may have overfit the validation folds while tuning, or leaked information, so the optimistic CV score did not hold up on truly unseen data

The model needs to be retrained directly on the test set
