Math & Stats Cluster

Statistical Analysis with Free Tools: A Practical Guide

Published April 11, 2026 · 11 min read

Statistics has a reputation problem. People who took one class in college memorized the formulas, forgot the intuition, and left with the vague sense that p-values were a trick the instructor wanted them to recite. That is unfortunate, because the daily questions you ask of data (Is this change really better? Did the marketing test actually work? Are these two groups actually different?) are exactly the questions statistics was invented to answer. You do not need a degree. You need a working knowledge of five or six tests, an honest sense of when each applies, and a calculator that will not mangle the arithmetic.

This guide walks through the tests you will actually use: one-sample and two-sample t-tests, chi-square for categorical data, one-way ANOVA for comparing three or more groups, and confidence intervals for reporting uncertainty. We will cover when each one is appropriate, how to interpret the result without falling into the p-value trap, and how to run them without installing Python or R.

A mental model for hypothesis testing

Every hypothesis test answers the same question: if nothing interesting were actually going on, how surprising is the data I am seeing? The "nothing interesting" scenario is called the null hypothesis. The test computes a probability — the p-value — of seeing data at least this extreme under the null. A small p-value means the data would be surprising if the null were true, which gives you grounds to reject the null. A large p-value means the data is consistent with the null, which does not prove the null but does mean your data has not given you evidence against it.
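If you happen to have Python at hand, the definition can be made concrete with a minimal sketch that uses only the standard library. The scenario (60 heads in 100 flips of a supposedly fair coin) and the function name are illustrative assumptions, not taken from any particular calculator.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value: the total probability, under the null, of
    every outcome that is no more likely than the one actually observed."""
    probs = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = probs[successes]
    # Small relative tolerance so floating-point ties still count as "as extreme".
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


# Hypothetical data: 60 heads in 100 flips; null hypothesis "the coin is fair".
coin_p_value = binomial_two_sided_p(60, 100)
print(f"p = {coin_p_value:.4f}")
```

The result lands a little above 0.05: surprising under a fair coin, but not overwhelmingly so, which is exactly the kind of graded evidence a p-value is meant to express.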

That framing saves you from two common errors. First, a large p-value does not mean "no effect." It means "your sample did not detect an effect this size." With a bigger sample, you might have. Second, a small p-value does not mean "the effect is big." It means "the effect is detectable." Effect size and p-value are different things, and a tiny effect with a huge sample can produce a "significant" p-value that does not matter practically. The NIST/SEMATECH e-Handbook of Statistical Methods is the most authoritative free reference for the underlying theory, and Wolfram MathWorld gives concise mathematical definitions when you want the formulas.
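The gap between "detectable" and "big" is easy to see numerically. The sketch below assumes a normal approximation and assumes the observed difference happens to equal a true effect of 0.01 standard deviations; the function name and sample sizes are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_for_effect(effect_in_sd: float, n_per_group: int) -> float:
    """Normal-approximation p-value for a two-group difference, with the
    effect expressed in standard deviations and equal group sizes."""
    z = effect_in_sd / sqrt(2 / n_per_group)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same tiny effect (0.01 SD), two very different sample sizes.
p_small_sample = two_sided_p_for_effect(0.01, 100)
p_huge_sample = two_sided_p_for_effect(0.01, 1_000_000)
print(f"n = 100 per group:       p = {p_small_sample:.3f}")
print(f"n = 1,000,000 per group: p = {p_huge_sample:.2e}")
```

The effect is identical in both runs. Only the sample size changed, and with it the p-value moved from unremarkable to vanishingly small.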

T-tests: comparing means

The t-test answers: is this mean actually different from a reference value or from another group's mean, or could I be seeing a difference by chance? There are three flavors.

One-sample t-test

You have a sample and a reference value. "Is the average response time in our app different from the 200ms SLA we committed to?" You compute the sample mean, compare it against 200, and get a p-value for how surprising the difference is under the null hypothesis that the true mean is 200.
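As a rough sketch of the arithmetic, here is the one-sample t statistic computed with Python's standard library. The response times are invented, and the comparison uses the tabulated two-sided 5% critical value for 9 degrees of freedom (2.262) rather than an exact p-value, since the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample: list[float], reference: float) -> tuple[float, int]:
    """Return the t statistic and degrees of freedom for H0: true mean == reference."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(sample) - reference) / standard_error, n - 1


# Hypothetical response times in milliseconds, tested against the 200 ms SLA.
response_ms = [210, 195, 205, 220, 198, 215, 202, 208, 212, 205]
t_one, df_one = one_sample_t(response_ms, 200)

# 2.262 is the tabulated two-sided 5% critical value for 9 degrees of freedom.
print(f"t = {t_one:.2f}, df = {df_one}, beyond critical value: {abs(t_one) > 2.262}")
```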

Two-sample (independent) t-test

You have two groups drawn independently. "Did users in the treatment group spend more time on the page than users in the control?" The test looks at the difference of means relative to the pooled standard error and asks whether that difference is large enough to be unlikely by chance. Pooling assumes the two groups have roughly equal variances; when they clearly do not, Welch's version of the test is the usual choice.
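A minimal sketch of the pooled (Student's) version follows, again standard library only. The minutes-on-page figures are hypothetical, and the pooled variance assumes the two groups share roughly the same spread.

```python
from math import sqrt
from statistics import mean, variance


def pooled_two_sample_t(group_a: list[float], group_b: list[float]) -> tuple[float, int]:
    """Student's two-sample t statistic (equal variances assumed) and its df."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_b) - mean(group_a)) / standard_error, df


# Hypothetical minutes on page for five control and five treatment users.
control = [12.0, 15.0, 11.0, 14.0, 13.0]
treatment = [16.0, 18.0, 15.0, 17.0, 19.0]
t_two, df_two = pooled_two_sample_t(control, treatment)

# 2.306 is the tabulated two-sided 5% critical value for 8 degrees of freedom.
print(f"t = {t_two:.2f}, df = {df_two}, beyond critical value: {abs(t_two) > 2.306}")
```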

Paired t-test

You have two measurements on the same subjects. "For the same 30 users, did their conversion rate improve after the redesign?" Because the measurements are paired, you analyze the within-subject differences rather than treating the groups as independent, which gives you more statistical power.
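The paired version reduces to a one-sample test on the differences, which the sketch below makes explicit. The six before/after scores are hypothetical stand-ins for whatever per-user metric you measured.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before: list[float], after: list[float]) -> tuple[float, int]:
    """Paired t statistic: a one-sample t-test of the within-subject
    differences against zero."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


# Hypothetical per-user scores for the same six users, before and after a redesign.
before_scores = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0]
after_scores = [12.0, 13.0, 10.0, 14.0, 13.0, 12.0]
t_paired, df_paired = paired_t(before_scores, after_scores)

# 2.571 is the tabulated two-sided 5% critical value for 5 degrees of freedom.
print(f"t = {t_paired:.2f}, df = {df_paired}, beyond critical value: {abs(t_paired) > 2.571}")
```

Because each user serves as their own control, the user-to-user variation drops out of the standard error, which is where the extra power comes from.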

Feed your group data into T-Test Calculator for a browser-only computation — nothing leaves the page. For the descriptive statistics that feed into the test (mean, standard deviation, variance, count), Statistics Calculator handles the arithmetic and Standard Deviation Calculator focuses on the dispersion measure t-tests depend on. For quick sanity-checking a single observation, Z-Score Calculator tells you how many standard deviations it sits from a reference mean.

Chi-square: comparing frequencies

Chi-square tests are for categorical data — counts, not measurements. Two flavors cover most real uses.

Chi-square goodness-of-fit

You have observed counts in several categories and a theoretical distribution you expect. "Our survey had 40% respondents from the US, 30% from Europe, 20% from Asia, 10% from elsewhere — does that match our user base distribution?" The test compares observed versus expected counts and returns a p-value.
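Here is a sketch of the goodness-of-fit arithmetic. The survey counts match the 40/30/20/10 split above for a hypothetical 200 respondents, and the user-base proportions are invented for illustration.

```python
def chi_square_gof(observed: list[int], expected_props: list[float]) -> tuple[float, int]:
    """Chi-square goodness-of-fit statistic and degrees of freedom."""
    total = sum(observed)
    expected = [total * p for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Hypothetical survey of 200 respondents: US, Europe, Asia, elsewhere.
survey_counts = [80, 60, 40, 20]
# Assumed user-base proportions (made up for this example).
user_base_props = [0.50, 0.25, 0.15, 0.10]
gof_stat, gof_df = chi_square_gof(survey_counts, user_base_props)

# 7.815 is the tabulated 5% critical value for 3 degrees of freedom.
print(f"chi-square = {gof_stat:.2f}, df = {gof_df}, beyond critical value: {gof_stat > 7.815}")
```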

Chi-square test of independence

You have a two-way contingency table and want to know whether the row variable is associated with the column variable. "Is there a relationship between device type and conversion?" You lay out observed counts, compute expected counts under independence, and the test tells you how far observed diverges from expected.
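The expected-count step is the part people most often get wrong by hand, so a sketch may help. The 2x2 table is hypothetical, no continuity correction is applied, and the closed-form p-value used at the end is valid only for exactly 1 degree of freedom.

```python
from math import erfc, sqrt


def chi_square_independence(table: list[list[int]]) -> tuple[float, int]:
    """Chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat, (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical counts. Rows: mobile, desktop. Columns: converted, did not convert.
device_table = [[30, 170], [50, 150]]
ind_stat, ind_df = chi_square_independence(device_table)

# For df == 1 only, the chi-square tail probability equals erfc(sqrt(stat / 2)).
ind_p = erfc(sqrt(ind_stat / 2))
print(f"chi-square = {ind_stat:.2f}, df = {ind_df}, p = {ind_p:.4f}")
```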

Assumption check: the chi-square approximation is only reliable when expected counts are reasonably large, and the usual rule of thumb is at least 5 in each cell. If cells are too sparse, combine categories or use Fisher's exact test instead. Run the computation in Chi-Square Calculator, which handles both flavors and shows the degrees-of-freedom calculation so you can verify by hand.

ANOVA: more than two groups

When you have three or more groups to compare, running multiple t-tests inflates your false positive rate. Run ten pairwise t-tests at α = 0.05 and, even if no group truly differs, you have roughly a 40% chance of at least one "significant" result by pure chance (1 - 0.95^10 ≈ 0.40, treating the tests as independent). ANOVA (analysis of variance) solves this by asking a single question: is the variance between groups large relative to the variance within groups? If yes, at least one group mean differs from at least one other. You then follow up with a post-hoc test (Tukey HSD is the classic) to find which pairs differ.
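The inflation is simple to compute. This sketch treats the tests as independent, which pairwise comparisons on shared groups are not exactly, so read the number as an approximation.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


alpha_level, n_pairwise = 0.05, 10
expected_false_positives = alpha_level * n_pairwise
fwer = familywise_error_rate(alpha_level, n_pairwise)

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"chance of at least one:   {fwer:.2f}")
```

With ten tests at 0.05 you expect half a false positive on average, and the chance of at least one is about 0.40.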

One-way ANOVA is the place to start. You have one factor with several levels — for example, four versions of a landing page and a conversion metric per visitor. The test returns an F-statistic and a p-value. If the p-value is small, do the post-hoc comparisons. If not, you do not have evidence that any version differs from any other with this sample.
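The between-versus-within comparison can be sketched in a few lines. The three groups below are hypothetical and deliberately tiny so the sums of squares are easy to check by hand; the p-value is left to a calculator or table.

```python
from statistics import mean


def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return the F statistic with its between-group and within-group df."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical metric values for three page versions, four visitors each.
page_groups = [[4, 5, 6, 5], [6, 7, 8, 7], [9, 10, 11, 10]]
f_stat, df_b, df_w = one_way_anova_f(page_groups)

# 4.26 is the tabulated 5% critical value for F with (2, 9) degrees of freedom.
print(f"F = {f_stat:.1f}, df = ({df_b}, {df_w}), beyond critical value: {f_stat > 4.26}")
```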

For a conceptual walkthrough with worked examples, the Khan Academy statistics course covers one-way ANOVA without requiring a stats background. MIT OpenCourseWare's introductory statistics lectures are another solid free resource.

Confidence intervals: what people want

A confidence interval is often more useful than a p-value. Instead of answering "is there an effect?" it answers "how large is the effect, with uncertainty?" A 95% confidence interval for a mean difference comes from a procedure that, across repeated samples drawn the same way, produces a range containing the true difference 95% of the time. Narrow intervals indicate a precise estimate. Wide intervals indicate uncertainty, and the right response is usually "collect more data."

Reporting confidence intervals alongside p-values fixes the effect-size blind spot. "The treatment increased conversion by 0.1% (95% CI: 0.05% to 0.15%)" is much more actionable than "p < 0.05." The first tells you the effect is real and tiny; the second tells you only that the effect is detectable. Use Confidence Interval Calculator for means and proportions, and Sample Size Calculator to plan studies that will achieve the precision you want before you start collecting data.
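For a sense of where such an interval comes from, here is a sketch of a normal-approximation (Wald) interval for a difference in conversion rates. The counts are hypothetical, and the approximation assumes large samples with rates not too close to 0 or 1.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_proportions_ci(
    conv_a: int, n_a: int, conv_b: int, n_b: int, confidence: float = 0.95
) -> tuple[float, float, float]:
    """Return (difference, lower, upper) for rate_b - rate_a using the
    normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    standard_error = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    diff = p_b - p_a
    return diff, diff - z * standard_error, diff + z * standard_error


# Hypothetical test: 5,000 of 100,000 control visitors converted, 5,300 of 100,000 treated.
ci_diff, ci_low, ci_high = diff_in_proportions_ci(5_000, 100_000, 5_300, 100_000)
print(f"lift = {ci_diff:.2%} (95% CI: {ci_low:.2%} to {ci_high:.2%})")
```

The interval excludes zero, so the lift is detectable, and its width tells you how precisely it has been pinned down.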

For percentages and ratios that feed into effect-size calculations, Percentage Calculator handles the common conversions cleanly — useful when you are communicating results to a non-technical audience who wants "a 15% lift" instead of "the effect size was 0.12."

The p-value pitfalls

P-hacking

Running many tests and reporting the one that happened to be "significant" is called p-hacking, and it is the most common way people fool themselves and others with statistics. Decide your test in advance, report all tests you ran, and correct for multiple comparisons when you must.

Treating p=0.05 as a bright line

A p-value of 0.049 and a p-value of 0.051 are essentially identical as evidence. Treating them as qualitatively different because they fall on opposite sides of 0.05 is a human error the math does not support. The American Statistical Association's 2016 statement on p-values explicitly warns against this.

Ignoring effect size

Statistical significance is not practical significance. A very large sample can produce a significant p-value for an effect so small you would not ship it. Always report and judge the effect size alongside the p-value.

Ignoring assumptions

T-tests assume roughly normal data, or samples large enough that the mean behaves normally. Chi-square needs minimum expected counts. ANOVA assumes roughly equal variances across groups. When the assumptions fail, the p-values are not what they claim to be. Non-parametric alternatives exist for exactly this reason: the Wilcoxon signed-rank test for paired data, Mann-Whitney for two independent groups, and Kruskal-Wallis for three or more.

Related pillar guide

This cluster post is part of the comprehensive tools track. For the broader foundation on choosing and using free online tools, see The Complete Guide to Free Online Tools.

FAQ

What sample size do I need?

It depends on the effect size you want to detect and the statistical power you want. For a typical two-group comparison targeting a medium effect (Cohen's d = 0.5) with 80% power and a two-sided α = 0.05, you need about 64 subjects per group. Run a proper sample size calculation before collecting data.
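A back-of-the-envelope version of that calculation uses the normal approximation n = 2 * ((z_alpha + z_power) / d)^2 per group. The sketch below is a planning aid under that approximation, not a replacement for a t-based power calculation; the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects per group for a two-sided, two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


approx_n = n_per_group(0.5)
print(f"about {approx_n} per group (normal approximation)")
```

The approximation gives 63. Calculations based on the t distribution come out about one subject higher, which is where the 64 above comes from.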

When do I use a non-parametric test?

When your data is clearly not normally distributed (ordinal data, heavy skew, small samples where you cannot check normality) and the assumption-based tests would give you misleading p-values. The cost of non-parametric tests is slightly lower power, but that is better than a wrong answer.

What is Bonferroni correction and when do I need it?

When you run multiple tests at the same time, the chance of at least one false positive grows. Bonferroni divides your alpha by the number of tests to compensate. It is conservative (too cautious), but it is simple and keeps you from claiming results that are just chance.
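In code the correction is a single division. The p-values in this sketch are made up to show how results that clear 0.05 individually can fail the adjusted threshold.

```python
def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag which p-values stay significant after dividing alpha by the
    number of tests."""
    adjusted_alpha = alpha / len(p_values)
    return [p < adjusted_alpha for p in p_values]


# Hypothetical p-values from four tests run on the same experiment.
raw_p_values = [0.003, 0.02, 0.04, 0.30]
still_significant = bonferroni_significant(raw_p_values)

# The adjusted threshold is 0.05 / 4 = 0.0125, so only the first test survives.
print(still_significant)
```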

Can I report "p was almost significant"?

Please do not. The p-value is what it is. If the pre-registered alpha is 0.05 and you got 0.07, your evidence did not clear the bar. Report the actual p-value and the effect size and let the reader decide.

Why report confidence intervals if I already have a p-value?

Because the confidence interval tells you the effect size and its uncertainty, which is what the business actually cares about. The p-value only tells you whether an effect is detectable. Both together give a complete picture; either alone is misleading in common cases.

Closing thought

Statistics is not about memorizing formulas. It is about knowing which question you are asking, picking the test that answers it, and reporting the result with honest uncertainty. Start with two or three tests you actually understand, use them well, and the rest of the field opens up when you need it. The calculators do the arithmetic; your job is the interpretation.