Regression Analysis for Beginners: A Practical Intro
The simplest possible regression is a line drawn through a cloud of points so that the line is "as close as possible" to every point. That sentence describes most of what people actually need from regression. Everything else (polynomial terms, interaction effects, regularization, GLMs) is the same basic idea with more decoration. Understanding the simple case deeply makes all the decoration easier to learn when you need it.
This post teaches the simple case, deeply. We will cover what "as close as possible" actually means (least squares), what the resulting line is telling you about your data, what R-squared is and is not, the diagnostic checks that catch bad fits, and when you should absolutely not use linear regression even though the calculator will happily give you a line.
A line through points
Start with a scatterplot — say, daily advertising spend on the x-axis and daily revenue on the y-axis. You can eyeball a line through the points. The question regression answers is: of all possible lines, which one minimizes the overall error between the line and the points? The answer depends on how you define error. Squaring the vertical distance from each point to the line and adding them up — and then finding the line that minimizes that sum — is called ordinary least squares, and it is what "linear regression" almost always means in practice.
Why squared error rather than absolute error? Two reasons. First, squaring makes the optimization problem solvable with a closed-form formula, which matters when you are computing by hand or on a slow machine. Second, squared error penalizes big misses more than small ones, which is usually what you want: under squared error, one prediction that is off by ten costs 100, twice as much as two predictions that are each off by five (25 + 25 = 50).
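As a small illustration of the second reason, the sketch below scores a single miss of ten against two misses of five under both error definitions. The function names are made up for this example.

```python
def sum_squared_error(errors):
    """Sum of squared errors, the quantity least squares minimizes."""
    return sum(e ** 2 for e in errors)


def sum_absolute_error(errors):
    """Sum of absolute errors, shown for comparison."""
    return sum(abs(e) for e in errors)


one_big_miss = [10]        # a single prediction off by ten
two_small_misses = [5, 5]  # two predictions each off by five

# Absolute error treats the two scenarios as equally bad (10 vs 10).
print(sum_absolute_error(one_big_miss), sum_absolute_error(two_small_misses))

# Squared error charges the single big miss twice as much (100 vs 50).
print(sum_squared_error(one_big_miss), sum_squared_error(two_small_misses))
```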
The quick path to fitting a line is Linear Regression Calculator. Paste two columns of data and it returns the slope, intercept, R-squared, and residuals without sending your data to any server. Everything happens in the browser.
Least squares in plain language
The least squares line is the one where the sum of squared vertical distances from every point to the line is as small as possible. Two parameters define the line: slope (how much y changes when x increases by one) and intercept (what y equals when x is zero). The math gives you clean formulas for both, expressed in terms of the means and variances of x and y, plus their covariance.
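For reference, those formulas can be written as follows. The symbols are notation chosen for this sketch: $\hat{\beta}_1$ is the slope, $\hat{\beta}_0$ is the intercept, and $\bar{x}$, $\bar{y}$ are the sample means.

```latex
\hat{\beta}_1
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

The second formula also says that the fitted line always passes through the point $(\bar{x}, \bar{y})$.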
You can compute least squares by hand for small datasets, which is worth doing once just to build intuition. The mean of x, the mean of y, the variance of x, and the covariance of x and y are the only ingredients. Statistics Calculator computes means and variances from raw data, and Standard Deviation Calculator handles the dispersion side when you want to report the residual standard error alongside the fit.
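If you would rather script the hand calculation, here is a minimal sketch in plain Python. The helper name `fit_line` and the five data points are invented for illustration; they are not output from any of the calculators mentioned here.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means. Dividing both
    # by n - 1 would give var(x) and cov(x, y); the divisor cancels out.
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: ad spend (x) and revenue (y), in arbitrary units.
spend = [1, 2, 3, 4, 5]
revenue = [2, 4, 5, 4, 5]

slope, intercept = fit_line(spend, revenue)
print(slope, intercept)  # roughly 0.6 and 2.2
```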
For the theoretical grounding, MIT OpenCourseWare hosts complete introductory statistics lectures that cover least squares from first principles, and Stanford's CS229 course notes work through the linear algebra formulation used in modern machine learning frameworks.
Interpreting the slope and intercept
Once you have a fitted line, the slope tells you: for every one-unit increase in x, y changes by this much on average. "On average" is doing work there. The slope summarizes the trend across the data you showed the model; it is not a promise about any single observation, and with only one predictor nothing else is being held constant. If you fit revenue versus ad spend and get a slope of 3.5, you can say that historically, each additional dollar of ad spend has been associated with $3.50 in additional revenue on average.
That sentence contains the word "associated" rather than "caused" for a reason. Regression describes association, not causation. The line could reflect a real causal relationship, a confounding variable you did not measure, reverse causation, or pure coincidence. Without an experiment or a carefully designed observational study, the regression does not tell you why the relationship exists. It tells you only that the relationship is present in your data.
The intercept is the predicted y when x is zero. For many real datasets that prediction is meaningless — if your ad spend data ranges from $100 to $10,000, the "predicted revenue at $0 spend" is extrapolation outside your sample and should not be taken seriously. The intercept matters mathematically (the line has to pass through some y-axis value) but often has no business interpretation.
R-squared and what it does not tell you
R-squared is the proportion of variance in y that the line explains. R-squared of 0 means the line explains nothing: x and y have no linear relationship in the sample, although they may still be related in a non-linear way. R-squared of 1 means the line explains everything (every point is exactly on the line). In between, you have partial explanation.
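One way to make "proportion of variance explained" concrete is to compute it as one minus the ratio of leftover squared error to total squared variation around the mean. The sketch below does that in plain Python; the function name and data are hypothetical.

```python
def r_squared(xs, ys):
    """R-squared of the least squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Squared error left over after using the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Squared error of the "always predict the mean" baseline.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


noisy = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])     # about 0.6
perfect = r_squared([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])  # points exactly on a line
print(noisy, perfect)
```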
Here is what R-squared does not tell you. First, it does not tell you whether the linear model is appropriate. You can have a high R-squared for a relationship that is obviously curved (the line fits the average trend but misses the shape). Second, it does not tell you whether the coefficients are statistically reliable: a line fitted to just two points passes through both exactly, so R-squared is 1 no matter how noisy the underlying process is, and with only a handful of points a high R-squared can arise by chance. Third, it does not tell you the fit is good out-of-sample; a high R-squared on training data can still produce terrible predictions on new data if you overfit.
Always look at the residuals alongside R-squared. A good fit has residuals scattered randomly around zero with constant variance. A bad fit has residuals that show a pattern — curvature, funnel shape, or drift over time. The pattern tells you what assumption is failing and what to do next.
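To see why the residuals matter, consider this sketch. It fits a straight line to data that is exactly quadratic (a made-up example), then prints R-squared and the sign of each residual in x order.

```python
xs = list(range(11))       # 0, 1, ..., 10
ys = [x ** 2 for x in xs]  # a curved relationship, not a linear one

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Residual = observed y minus the y predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot

signs = "".join("+" if r > 0 else "-" for r in residuals)
print(round(r2, 3))  # above 0.9 even though the model is the wrong shape
print(signs)         # positive at both ends, negative in the middle
```

R-squared looks respectable, but the residuals are positive at both ends and negative in the middle. That U-shaped pattern is the signature of curvature that a straight line cannot capture.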
Diagnostic checks before you trust a fit
Four checks every linear regression should pass before you use it for anything:
1. Scatterplot the raw data
Before fitting anything, look at x versus y. Is the relationship roughly linear? If it curves, hits a plateau, or has obvious outliers, deal with that before fitting. Anscombe's quartet is the classic demonstration that four datasets with nearly identical summary statistics can have wildly different shapes. Always look at the picture.
2. Residuals versus fitted values
Plot the residuals on the y-axis and the fitted values on the x-axis. A good fit shows a featureless cloud centered on zero. Patterns in this plot diagnose problems. A funnel shape means heteroscedasticity: the spread of the errors changes with the fitted value, so the usual standard errors and confidence intervals on your slope are unreliable. A curve means you need a non-linear term.
3. Distribution of residuals
Histogram the residuals. They should look roughly normal if you want the usual confidence intervals and p-values on the slope to be trustworthy. Heavy skew means your inference on the coefficients is not reliable even if the point estimate is fine.
4. Check for outliers and influential points
One or two extreme points can drag the line around. Refit the regression without the suspected outlier and see how much the slope changes. If the answer is "a lot," you have an influential point and you need to decide whether it belongs in your dataset.
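The refit-and-compare check takes only a few lines. In this sketch the data is invented: five points lie exactly on y = 2x, and a sixth has been pushed far above the trend.

```python
def ols_slope(xs, ys):
    """Slope of the least squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / s_xx


xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 30]  # the last point is the suspected outlier

slope_all = ols_slope(xs, ys)
slope_trimmed = ols_slope(xs[:-1], ys[:-1])  # refit without the last point

# One point more than doubles the slope here: from 2.0 to about 4.57.
print(slope_all, slope_trimmed)
```

A change of that size is the "a lot" case: the point is influential. Whether to keep it is a judgment about the data, not something the arithmetic can decide.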
For the percentage-based summaries that often accompany regression reporting ("revenue grew by X% per $1 of spend"), Percentage Calculator handles the conversions quickly. For testing whether a slope is statistically different from zero, the machinery reduces to a t-test on the coefficient — pipe the numbers through T-Test Calculator or report the standard error directly.
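As a rough sketch of that t-test, the standard error of the slope in simple linear regression is the square root of the residual variance (with n - 2 degrees of freedom) divided by the sum of squared deviations of x. The function name and data below are hypothetical, and the code stops at the t statistic rather than computing a p-value.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (slope, standard_error, t) for a one-predictor OLS fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Two parameters were estimated, so n - 2 degrees of freedom remain.
    residual_variance = ss_res / (n - 2)
    std_err = math.sqrt(residual_variance / s_xx)
    return slope, std_err, slope / std_err


slope, std_err, t = slope_t_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(slope, std_err, t)  # roughly 0.6, 0.283, 2.12
```

The t value is then compared against a t distribution with n - 2 degrees of freedom (3 in this example) to get a p-value.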
When linear regression is wrong
Linear regression assumes (among other things) that y is a linear function of x, that the errors have constant variance, that observations are independent, and that the errors are roughly normal. When those assumptions fail in ways the diagnostics reveal, the answers linear regression gives you are misleading.
Common situations where you should reach for something else:
- Count data. Use Poisson or negative binomial regression.
- Binary outcomes. Use logistic regression.
- Time series with autocorrelation. Use ARIMA, state-space models, or proper time series methods.
- Heavy-tailed data with occasional huge values. Consider log-transforming y or using robust regression.
- Relationships that flatten out. Use polynomial or spline terms, or a non-linear model entirely.
The NIST/SEMATECH e-Handbook covers each of these alternatives with worked examples. Khan Academy's probability and statistics track is a gentler introduction if you are learning the concepts for the first time.
Adjacent tools worth bookmarking
Tools that pair naturally with regression analysis: Z-Score Calculator for checking individual observations against a fitted model, Confidence Interval Calculator for reporting uncertainty around slope estimates, Sample Size Calculator for planning the minimum data needed to detect a meaningful relationship, and Chi-Square Calculator when both variables are categorical and a regression line is the wrong tool entirely.
Related pillar guide
This cluster post is part of the comprehensive tools track. For the broader foundation on choosing and using free online tools, see The Complete Guide to Free Online Tools.
FAQ
What is the difference between correlation and regression?
Correlation is a single number summarizing how tightly two variables move together. Regression fits a line (or surface, in the multivariate case) that lets you predict one variable from another. Correlation is symmetric; regression treats one variable as the predictor and the other as the outcome.
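The asymmetry is easy to check numerically. In this sketch (made-up data, hypothetical function names), swapping x and y leaves the correlation unchanged but gives a different regression slope.

```python
import math


def _sums(xs, ys):
    """Centered sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xx, s_yy, s_xy


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    s_xx, s_yy, s_xy = _sums(xs, ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def ols_slope(xs, ys):
    """Slope from regressing ys (outcome) on xs (predictor)."""
    s_xx, _, s_xy = _sums(xs, ys)
    return s_xy / s_xx


spend = [1, 2, 3, 4, 5]
revenue = [2, 4, 5, 4, 5]

# Same number either way round (about 0.775).
print(correlation(spend, revenue), correlation(revenue, spend))

# Different slopes: about 0.6 for revenue on spend, 1.0 for spend on revenue.
print(ols_slope(spend, revenue), ols_slope(revenue, spend))
```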
How many data points do I need?
A rough rule of thumb is at least 10 observations per predictor variable, and more if you want reliable p-values or narrow confidence intervals on the coefficients. With only a handful of points, a "best fit" line is almost meaningless because many different lines fit almost equally well.
Should I standardize my variables before regression?
For basic linear regression with one predictor, no — the slope is interpretable in the original units. For multi-variable regression where you want to compare the relative importance of predictors, yes — standardizing lets you compare coefficients directly.
What does a negative R-squared mean?
For an ordinary least squares line with an intercept, evaluated on the same data it was fitted to, R-squared cannot be negative, so a negative value there points to a bug in your calculation. Otherwise, you are probably computing an out-of-sample R-squared, or fitting a line without an intercept, and the model is doing worse than just predicting the mean every time.
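Here is a small sketch of the out-of-sample case, with invented data: the line is fitted on one set of points and then scored on new points where the trend has reversed.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def r_squared_on(xs, ys, slope, intercept):
    """Score an already-fitted line on (xs, ys) against the mean of ys."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


train_x, train_y = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
new_x, new_y = [6, 7, 8], [3, 2, 1]  # new data where the trend has reversed

slope, intercept = fit_line(train_x, train_y)
in_sample = r_squared_on(train_x, train_y, slope, intercept)
out_of_sample = r_squared_on(new_x, new_y, slope, intercept)

print(in_sample)      # between 0 and 1, as expected for the training data
print(out_of_sample)  # negative: worse than predicting the mean of new_y
```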
Can I use regression to prove causation?
Regression alone cannot establish causation. You need experimental design (random assignment), quasi-experimental methods (natural experiments, instrumental variables, regression discontinuity), or strong domain knowledge about the data-generating process. Treating regression coefficients as causal without this scaffolding is a common and costly error.
Closing thought
Fit a line. Look at the residuals. Ask whether the slope makes sense in context. That loop, run honestly, will catch more regression errors than any formula memorization. The tool gives you the line; the interpretation is the part that matters.