ML 101
A broad introduction to machine learning fundamentals. Covers the ML landscape, gradient descent, the bias-variance tradeoff, linear regression, logistic regression, decision trees, ensembles, SVMs, neural networks, clustering, dimensionality reduction, and model evaluation metrics.
Reading time
~90 min
Structure
Single track
The ML Landscape
What machine learning actually is, and how its three paradigms are fundamentally different.
What is Machine Learning?
Traditional programming is rules + data → output. You encode the logic explicitly. Machine learning flips this: you give it data + desired outputs → it infers the rules. The "rules" are the parameters of a mathematical model, and "learning" is optimizing those parameters to minimize error.
Arthur Samuel's classic definition: "Field of study that gives computers the ability to learn without being explicitly programmed." Tom Mitchell made this precise: a program learns from experience E with respect to task T and performance measure P if its performance on T, as measured by P, improves with experience E.
- Features (X): Input variables — the raw data your model sees.
- Labels / Targets (y): What you're trying to predict (in supervised learning).
- Model: A mathematical function f(X; θ) parameterized by θ that maps X → ŷ.
- Training: Adjusting θ to minimize the gap between ŷ and y on a training set.
- Inference: Running the trained model on new, unseen data.
- Generalization: How well the model performs on data it hasn't seen — the real goal.
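The terms above can be made concrete with a toy example. The sketch below (synthetic data, plain NumPy, a hypothetical "true rule" of y = 3x + 1) fits the model f(X; θ) = wX + b by gradient descent — the mechanics of gradient descent are covered later; here it just plays the role of "training."

```python
import numpy as np

# Synthetic data: the hidden "true rule" is y = 3x + 1, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)           # features
y = 3 * X + 1 + rng.normal(0, 0.1, 100)    # labels

# Model: f(X; theta) = w*X + b, with parameters theta = (w, b).
w, b = 0.0, 0.0

# Training: adjust theta to shrink the gap between predictions and labels.
lr = 0.1
for _ in range(500):
    y_hat = w * X + b                      # predictions on the training set
    grad_w = 2 * np.mean((y_hat - y) * X)  # gradient of mean squared error
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w
    b -= lr * grad_b

# Inference: apply the trained model to new, unseen inputs.
X_new = np.array([0.5, -0.2])
predictions = w * X_new + b
```

After training, `w` and `b` land close to the true 3 and 1 — and generalization is exactly the question of whether those predictions on `X_new` stay accurate.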
Three Paradigms
Supervised Learning
Learn a mapping X→y from labeled examples. Error signal is explicit — you know the right answer.
Unsupervised Learning
Find structure in unlabeled data. No ground truth — the model discovers patterns, clusters, or representations.
Reinforcement Learning
An agent takes actions in an environment to maximize cumulative reward. No labeled data — learns by trial and error.
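To contrast supervised learning's explicit error signal with unsupervised learning's lack of one: the sketch below runs a few iterations of k-means (an unsupervised algorithm covered later) on unlabeled 1-D data. The group means (0 and 5) and cluster count are assumptions of this toy setup — the algorithm never sees labels, yet recovers the structure.

```python
import numpy as np

# Unlabeled 1-D data drawn from two groups; the model is never told which is which.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 0.5, 50), rng.normal(5, 0.5, 50)])

# A few iterations of k-means with k=2: no ground truth, just structure discovery.
centers = np.array([data.min(), data.max()])   # crude initialization
for _ in range(10):
    # Assign each point to its nearest center, then move centers to cluster means.
    assign = np.abs(data[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([data[assign == k].mean() for k in range(2)])
```

The recovered `centers` sit near the true group means even though no label was ever provided — that absence of an explicit error signal is the defining trait of the unsupervised paradigm.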
The Data Split — Always
- Training set (~70%): Model sees and learns from this. Parameters are optimized here.
- Validation set (~15%): Hyperparameter tuning and model selection. Model never trains on this, but you peek at it to make decisions — so it's "contaminated" for final evaluation.
- Test set (~15%): Touched exactly ONCE at the very end. This is your unbiased estimate of real-world performance.
- Data leakage: If test-set information bleeds into training (e.g., scaling using the full dataset's mean), your evaluation is optimistically biased. Fit scalers/preprocessors on train, apply to val/test.
- i.i.d. assumption: Most ML theory assumes data is independently and identically distributed. Time-series, geo-clustered, or grouped data violates this — use temporal/grouped splits instead of random splits.
- More data beats better algorithms more often than you'd think. Before tuning models, verify your data pipeline is clean.
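One way to implement the split and the leakage-safe scaling described above, sketched in NumPy (the dataset, sizes, and 70/15/15 proportions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(10, 3, size=(1000, 4))   # placeholder dataset: 1000 rows, 4 features

# Shuffle once, then carve out ~70% / 15% / 15% splits.
idx = rng.permutation(len(X))
n_train, n_val = 700, 150
train = X[idx[:n_train]]
val = X[idx[n_train:n_train + n_val]]
test = X[idx[n_train + n_val:]]

# Fit the scaler on TRAIN ONLY, then apply the same statistics everywhere.
# Computing mu/sigma on the full dataset would leak val/test information into training.
mu = train.mean(axis=0)
sigma = train.std(axis=0)
train_s = (train - mu) / sigma
val_s = (val - mu) / sigma    # transformed, never refit
test_s = (test - mu) / sigma  # transformed, never refit
```

Note that `val_s` and `test_s` will not have exactly zero mean and unit variance — that slight mismatch is correct, because their statistics were never allowed to influence the preprocessing. For time-series or grouped data, replace the random permutation with a temporal or group-aware split, per the i.i.d. caveat above.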