
ML 101

A comprehensive introduction to machine learning fundamentals. Covers the ML landscape, gradient descent, the bias-variance tradeoff, linear regression, logistic regression, decision trees, ensembles, SVMs, neural networks, clustering, dimensionality reduction, and model evaluation metrics.


Module 01 / Foundations

The ML Landscape

What machine learning actually is, and how its three paradigms are fundamentally different.

What is Machine Learning?

Traditional programming is rules → data → output. You encode logic explicitly. Machine learning flips this: you give it data + desired outputs → it infers the rules. The "rules" are parameters of a mathematical model, and "learning" is optimizing those parameters to minimize error.

Arthur Samuel's classic definition: "Field of study that gives computers the ability to learn without being explicitly programmed." Tom Mitchell's more precise formulation: a program learns from experience E with respect to task T and performance measure P if its performance at T, as measured by P, improves with E.

Core Vocabulary
  • Features (X): Input variables — the raw data your model sees.
  • Labels / Targets (y): What you're trying to predict (in supervised learning).
  • Model: A mathematical function f(X; θ) parameterized by θ that maps X → ŷ.
  • Training: Adjusting θ to minimize the gap between ŷ and y on a training set.
  • Inference: Running the trained model on new, unseen data.
  • Generalization: How well the model performs on data it hasn't seen — the real goal.
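The vocabulary above maps directly onto code. Here is a minimal sketch, using NumPy and entirely made-up toy data, that exercises each term: features X, labels y, a model f(X; θ) = wX + b, training via gradient descent, and inference on the learned parameters.

```python
import numpy as np

# Toy data: y = 3x + 1 plus noise. X is the feature, y the label.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3 * X + 1 + rng.normal(0, 0.1, size=100)

# Model: f(X; theta) = w*X + b, with parameters theta = (w, b).
w, b = 0.0, 0.0
lr = 0.1

# Training: gradient descent on mean squared error between y_hat and y.
for _ in range(500):
    y_hat = w * X + b          # model predictions
    err = y_hat - y            # gap between prediction and label
    w -= lr * 2 * np.mean(err * X)  # dMSE/dw
    b -= lr * 2 * np.mean(err)      # dMSE/db

# Inference: apply the trained model to a new, unseen input.
print(w, b)  # close to 3 and 1
print(w * 0.5 + b)  # prediction for x = 0.5
```

"Learning" here is nothing more than the loop adjusting θ to shrink the error; generalization is whether the recovered (w, b) also predicts well on inputs outside the training set.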

Three Paradigms

🎯

Supervised Learning

Learn a mapping X→y from labeled examples. Error signal is explicit — you know the right answer.

🔍

Unsupervised Learning

Find structure in unlabeled data. No ground truth — the model discovers patterns, clusters, or representations.

🎮

Reinforcement Learning

An agent takes actions in an environment to maximize cumulative reward. No labeled data — learns by trial and error.

The Data Split — Always

Why you need three splits
  • Training set (~70%): Model sees and learns from this. Parameters are optimized here.
  • Validation set (~15%): Hyperparameter tuning and model selection. Model never trains on this, but you peek at it to make decisions — so it's "contaminated" for final evaluation.
  • Test set (~15%): Touched exactly ONCE at the very end. This is your unbiased estimate of real-world performance.
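The three-way split above can be sketched in a few lines. This is a plain-NumPy version with assumed names and a hypothetical helper (`three_way_split`); in practice a library utility like scikit-learn's `train_test_split` applied twice does the same job.

```python
import numpy as np

def three_way_split(n, train=0.70, val=0.15, seed=0):
    """Return shuffled index arrays for a ~70/15/15 split of n samples."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(n * train)
    n_val = int(n * val)
    return (idx[:n_train],                    # training set: optimize parameters
            idx[n_train:n_train + n_val],     # validation set: tune hyperparameters
            idx[n_train + n_val:])            # test set: touch exactly once

tr, va, te = three_way_split(1000)
print(len(tr), len(va), len(te))  # 700 150 150
```

Shuffling before slicing is what makes this a random split; for time-series or grouped data (see the gotchas below) you would slice chronologically or by group instead.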
Key Gotchas
  • Data leakage: If test-set information bleeds into training (e.g., scaling using the full dataset's mean), your evaluation is optimistically biased. Fit scalers/preprocessors on train, apply to val/test.
  • i.i.d. assumption: Most ML theory assumes data is independently and identically distributed. Time-series, geo-clustered, or grouped data violates this — use temporal/grouped splits instead of random splits.
  • More data beats better algorithms more often than you'd think. Before tuning models, verify your data pipeline is clean.
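The leakage gotcha is easiest to see with standardization. A minimal sketch, using invented data: the mean and standard deviation are computed from the training set only, then reused to transform the test set — never recomputed on the full dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(5.0, 2.0, size=(700, 3))
X_test = rng.normal(5.0, 2.0, size=(300, 3))

# Correct: scaling statistics come from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma  # same mu/sigma — no peeking at test data
```

Computing `mu` and `sigma` over the concatenated train+test data would leak test-set statistics into preprocessing and bias the final evaluation optimistically; the same fit-on-train rule applies to any preprocessor (imputers, encoders, PCA).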

The ML Workflow

1. Problem framing: Is it regression, classification, clustering? Define the loss you actually care about, not just what's easy to optimize.
2. Data collection & EDA: Understand distributions, check for nulls, outliers, class imbalance. Garbage in = garbage out.
3. Feature engineering: Transform raw data into a form the model can use. Often the highest-leverage activity in classical ML.
4. Model selection & training: Start simple (linear models, trees). Add complexity only if needed.
5. Evaluation: On held-out data. Use the right metric for your problem, not just accuracy.
6. Iteration: Diagnose errors, improve data or model, repeat. Most time is spent here.