Applied Statistics — STAT 2 · CvSU · BSIT
Nine units, stripped to what matters. Every idea gets a plain definition, the formula, a memory hook, and an example that everyone could follow. Obsessed with what's on the exam. Allergic to filler. The course funnels toward hypothesis testing (the engine) and ANOVA (the final boss) — pace yourself there.
01 / Foundations
Statistics is the science of collecting, organizing, analyzing, and interpreting data to make decisions under uncertainty. Two halves: descriptive (summarize what you have) and inferential (use a sample to make claims about a population).
Population = the whole Pizza 🍕; Sample = one Slice. Parameter ↔ Population, Statistic ↔ Sample. Greek letters (μ, σ) = the truth; Roman letters (x̄, s) = your guess.
Levels = NOIR (Nominal, Ordinal, Interval, Ratio). Discrete = Dots you count; Continuous = a Curve you measure.
You want the average height of all Grade-6 kids in the Philippines — the population, too many to measure. So you measure 100 kids (your sample). The true average of every kid = parameter. Your 100-kid average = statistic. Statistics is the art of trusting the 100 to speak for the millions.
02 / Handling Data
Garbage in, garbage out — how you collect data decides whether any of the later math means anything.
Raw scores are noise. An FDT groups them into classes so a pattern appears.
Build an FDT in order: R-C-W-T — find the Range, decide Classes, get the Width, then Tally.
Thirty kids took a test. Listing all 30 scores is a wall of numbers. Instead group them: 0–10 → 2, 11–20 → 5, 21–30 → 9… Now you can see the class bunches up in the middle. Draw bars over those groups and you've got a histogram — same data, readable at a glance.
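The R-C-W-T steps can be sketched in a few lines of Python. This is a minimal illustration, not the only way to build an FDT: the 30 scores are made up, the choice of 5 classes is an assumption, and the width uses (range + 1) so that whole-number classes cover every score.

```python
import math

# Hypothetical scores for 30 students (made up for illustration).
scores = [1, 8, 12, 15, 17, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30,
          31, 32, 34, 35, 36, 38, 39, 40, 41, 43, 45, 46, 48, 50]

low, high = min(scores), max(scores)
data_range = high - low                            # R: the Range
num_classes = 5                                    # C: number of Classes (assumed)
width = math.ceil((data_range + 1) / num_classes)  # W: class Width

# T: Tally each score into its class.
tally = [0] * num_classes
for x in scores:
    tally[(x - low) // width] += 1

# Print the table with a text bar per class (a sideways histogram).
for i, count in enumerate(tally):
    class_low = low + i * width
    class_high = class_low + width - 1
    print(f"{class_low:>2}-{class_high:<2} | {'#' * count} ({count})")
```

The bars printed next to each class are the histogram turned on its side: same data, readable at a glance.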
03 / The Center
One number to represent the "center." Three flavors — they agree when data is symmetric, and disagree when it's skewed (the interesting case).
All start with M: Mean = add & divide · Median = the Middle (sort first!) · Mode = the Most. And: the Mean is a people-pleaser — one billionaire drags it up. The Median doesn't care.
Five kids' scores: 3, 5, 5, 7, 10.
Mean = (3+5+5+7+10) ÷ 5 = 30 ÷ 5 = 6
Median = middle of sorted list = 5
Mode = appears most often = 5
The "average" student is ~5–6. If a sixth kid scored 100, the mean jumps to 130 ÷ 6 ≈ 21.7 (misleading!) but the median barely moves, to 6. That's when you report the median.
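The same five scores can be checked with Python's built-in `statistics` module. A quick sketch; the variable names are just for illustration.

```python
import statistics

scores = [3, 5, 5, 7, 10]

mean = statistics.mean(scores)      # add & divide
median = statistics.median(scores)  # the middle (sorting is done for you)
mode = statistics.mode(scores)      # the most frequent value

# Add the sixth kid who scored 100 and watch which measure moves.
with_outlier = scores + [100]
mean_out = statistics.mean(with_outlier)      # 130 / 6
median_out = statistics.median(with_outlier)  # average of the two middle values

print(f"original:     mean={mean}, median={median}, mode={mode}")
print(f"with outlier: mean={mean_out:.2f}, median={median_out}")
```

The mean chases the outlier; the median only shifts from 5 to 6.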
04 / The Spread
The center tells you where; dispersion tells you how spread out. Two datasets can share a mean and tell completely different stories.
Variance Vexes (weird squared units), SD Saves it (square-root → real units). Divide by n−1 because Samples are Shy by one. CV lets you Compare Variability across different things (₱ vs kg).
The tail tells the tale. Skew is named after the long tail's direction — and the mean chases the tail (pulled toward the outliers).
Two classes both average 80. Class A: everyone 78–82 (tight, tiny SD — average is trustworthy). Class B: 50 to 100 (huge SD — half lost, half bored). Same mean, opposite reality. That's why SD is never optional.
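To see "same mean, opposite reality" in numbers, here is a small sketch using the computational variance formula from the formula sheet. The two score lists are hypothetical, chosen only so that both classes average 80.

```python
import math
import statistics

def sample_variance(xs):
    """Computational form: s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Hypothetical classes, both with mean 80.
class_a = [78, 79, 80, 81, 82]    # tight
class_b = [50, 65, 85, 100, 100]  # spread out

results = {}
for name, xs in (("A", class_a), ("B", class_b)):
    mean = statistics.mean(xs)
    var = sample_variance(xs)  # squared units
    sd = math.sqrt(var)        # back to real units
    cv = sd / mean * 100       # percent, comparable across things
    results[name] = (mean, var, sd, cv)
    print(f"Class {name}: mean={mean}, s^2={var:.1f}, s={sd:.2f}, CV={cv:.1f}%")
```

Dividing by n − 1 here matches `statistics.variance`, which also computes the sample (not population) variance.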
05 / The Keystone
The keystone unit. Understand why sample averages behave predictably and every test in Units VI–IX stops being magic.
Really Smart Stats Cluster → Random, Systematic, Stratified, Cluster. Stratified = slice into LAYERS then sample each. Cluster = pick whole GROUPS (entire classrooms).
The distribution of sample averages goes bell-shaped, whatever the shape of the original data. n ≥ 30 is the usual rule of thumb for "large enough." SE = σ/√n shrinks as n grows.
Roll one die: any number 1–6, totally flat, no bell. Now roll five dice and write the average, hundreds of times. Those averages pile up around 3.5 in a bell shape — all-1s or all-6s are rare. The original was flat, yet the averages went bell-curve. That's the CLT, and it's why a sample mean is trustworthy.
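The dice experiment is easy to simulate. A rough sketch, assuming fair dice; the seed and the 10,000 repetitions are arbitrary choices so the run is repeatable.

```python
import random
import statistics
from collections import Counter

random.seed(42)  # fixed seed so the run is repeatable (arbitrary choice)

def average_of_dice(n_dice):
    """Roll n_dice fair dice and return their average."""
    return sum(random.randint(1, 6) for _ in range(n_dice)) / n_dice

# Repeat "roll five dice, write down the average" many times.
averages = [average_of_dice(5) for _ in range(10_000)]

center = statistics.mean(averages)    # should land near 3.5
spread = statistics.pstdev(averages)  # should land near sigma / sqrt(5)

# Crude text histogram: round each average to the nearest whole number.
bins = Counter(round(a) for a in averages)
for value in sorted(bins):
    print(f"{value} | {'#' * (bins[value] // 100)}")

print(f"center = {center:.2f}, spread = {spread:.2f}")
```

One die has σ = √(35/12) ≈ 1.71, so the standard error for five dice is about 1.71 / √5 ≈ 0.76. The simulated spread should sit close to that, and the bars should bulge in the middle.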
06 / The Engine
A courtroom for data. Assume "nothing's going on," then check whether the evidence is strong enough to overturn that assumption.
Steps = H-A-T-C-D (Hypotheses, Alpha, Test-stat, Compare, Decide). p Low → null must Go; p High → null gets by. Tails: ≠ → two-tailed, < or > → one-tailed.
Type I = cry wolf when there's none (false alarm). Type II = miss the wolf that's really there. And: statistically significant ≠ important — a tiny effect looks "significant" if n is huge.
A candy company swears each bag holds 50 pieces → H₀: μ = 50. You suspect shorting → H₁: μ < 50. You count 30 bags; average is 47. The question: is 47 "far enough" below 50 to count as strong evidence of cheating, or just random bag luck? Deep in the tail → reject H₀ → guilty. Type I = accuse an honest company. Type II = let real cheaters walk.
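The candy case can be run through H-A-T-C-D as a one-sample t-test. The story gives no sample standard deviation, so s = 6 below is purely an assumption, as is α = 0.05; the critical value −1.699 is the standard t-table entry for df = 29, one tail, α = 0.05.

```python
import math

# H: hypotheses
mu_0 = 50            # H0: mu = 50 (company's claim); H1: mu < 50
# A: alpha
alpha = 0.05         # ASSUMED significance level
t_critical = -1.699  # t-table, df = 29, lower tail, alpha = 0.05

# Sample facts from the story, plus one assumption.
n = 30
sample_mean = 47
sample_sd = 6.0      # ASSUMPTION: not given in the example

# T: test statistic, t = (x_bar - mu) / (s / sqrt(n))
standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - mu_0) / standard_error

# C + D: compare with the critical value, then decide.
reject_h0 = t_stat < t_critical
print(f"t = {t_stat:.2f}, critical = {t_critical}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

With these assumed numbers t is about −2.74, deeper in the tail than −1.699, so H₀ is rejected. A larger s would pull t toward zero and could flip the verdict, which is why the spread matters as much as the gap.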
07 / The Line
Correlation measures how tightly two things move together. Regression draws the best straight line so you can predict.
ŷ = a + bx is just y = mx + b in a lab coat. b is the boost — how much y jumps per +1 of x. Square r to get the share of variation explained (r = 0.9 → r² = 0.81 → 81%).
Correlation ≠ causation. Ice-cream sales and drownings rise together — but the SUN causes both, not each other. Always hunt for the hidden third factor.
Plot hours studied (x) against test score (y). The dots drift up-right — more study, higher score: positive correlation. Draw the single best straight line (regression) and predict: "study 5 hours → expect ~85." The causation trap: kids with bigger feet read better — feet don't cause reading, age does (older kids have both).
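Here is a small sketch of the hours-vs-score example using the sum-based formulas from the formula sheet. The six data points are invented for illustration; only the formulas come from this guide.

```python
import math

# Hypothetical data: hours studied (x) and test score (y).
hours = [1, 2, 3, 4, 5, 6]
scores = [62, 68, 75, 79, 86, 90]

n = len(hours)
sum_x, sum_y = sum(hours), sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in scores)

# Shared numerator of r and b: n*sum(xy) - sum(x)*sum(y)
numerator = n * sum_xy - sum_x * sum_y

# Correlation r and regression line y_hat = a + b*x
r = numerator / math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
b = numerator / (n * sum_x2 - sum_x ** 2)  # the boost per +1 hour
a = sum_y / n - b * sum_x / n              # a = y_bar - b * x_bar

def predict(x):
    """Predicted score for x hours of study."""
    return a + b * x

print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"y_hat = {a:.2f} + {b:.2f}x, predict(5) = {predict(5):.1f}")
```

With this made-up data the line predicts roughly 85 for 5 hours of study. Only predict inside the range of x you actually observed; the line says nothing reliable about 20 hours.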
08 / The Setup
Before you can analyze an experiment (Unit IX), you have to design it so the results actually mean something.
Randomize to be fair, Replicate to be sure, Block to be smart. Roles: Treatment = the WHAT-IF · Unit = WHO-GETS-IT · Response = WHAT-HAPPENS.
Which fertilizer grows the tallest plants? Treatments = fertilizers A, B, C. Units = the pots. Response = height. Replication: many pots per fertilizer (one could be lucky). Randomization: don't put all of A on the sunny sill — assign spots by chance. Blocking: group pots by sunlight first, then compare fertilizers within each sun-group. Now it's fair.
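Randomize, Replicate, Block can be sketched as a tiny layout generator. The block names, pot labels, and seed are all hypothetical; the point is only that each sun-group gets every fertilizer once, in chance order.

```python
import random

random.seed(7)  # arbitrary seed so the layout is repeatable

treatments = ["A", "B", "C"]  # the fertilizers (the WHAT-IF)

# Blocking: pots grouped by sunlight first (hypothetical labels).
blocks = {
    "sunny":   ["pot1", "pot2", "pot3"],
    "partial": ["pot4", "pot5", "pot6"],
    "shade":   ["pot7", "pot8", "pot9"],
}

layout = {}
for block, pots in blocks.items():
    order = treatments[:]  # copy, so the master list stays untouched
    random.shuffle(order)  # Randomization: chance decides who gets what
    for pot, treatment in zip(pots, order):
        layout[pot] = (block, treatment)

for pot, (block, treatment) in layout.items():
    print(f"{pot}: block={block}, fertilizer={treatment}")
```

Replication comes for free here: every fertilizer appears once in each of the three blocks, so each is tested on three pots.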
09 / The Final Boss
The biggest block in your syllabus and almost certainly your entire Final. Good news: every design (one-way, two-way, RCBD, factorial, split-plot) is the same template with the Sum of Squares sliced differently.
Comparing 3+ group means with a pile of pairwise t-tests is messy, and every extra test adds another chance of a false alarm (Type I error). ANOVA compares them all at once by asking: is the difference BETWEEN the groups bigger than the random wobble WITHIN each group?
The pipeline is SS → df → MS → F. Logic: Between bigger than Within → big F → groups really differ (reject H₀). One factor = one-way; two factors (and do they team up?) = two-way.
Interaction = "it depends." Coffee helps you focus — but coffee + no sleep = jitters. The combo matters, not just each factor alone.
Three brands of plant food, several pots each. A averages 30 cm, B 32, C 31. Are they really different, or normal plant-to-plant variation? ANOVA stacks the between-brand gap against the within-brand wobble:
Brands differ a lot, plants inside each barely vary → between ≫ within → BIG F → brands genuinely differ ✓
Brands differ a little, plants inside each vary wildly → between ≈ within → small F → it's just noise ✗
One F-value, one verdict, no messy pile of t-tests.
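The SS → df → MS → F pipeline for a one-way design can be sketched by hand. The plant heights are hypothetical, picked so the brand means come out to 30, 32, and 31 cm; the critical value 4.26 is the F-table entry for df = (2, 9) at α = 0.05, and that α is an assumption.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical heights (cm), four pots per brand.
groups = {
    "A": [29, 30, 31, 30],  # mean 30
    "B": [31, 32, 33, 32],  # mean 32
    "C": [30, 31, 32, 31],  # mean 31
}

all_values = [x for xs in groups.values() for x in xs]
grand_mean = mean(all_values)

# SS: between = gap of each group mean from the grand mean,
#     within  = wobble of each plant around its own group mean.
ss_between = sum(len(xs) * (mean(xs) - grand_mean) ** 2 for xs in groups.values())
ss_within = sum((x - mean(xs)) ** 2 for xs in groups.values() for x in xs)

# df
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# MS = SS / df, then F = MS_between / MS_within
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

f_critical = 4.26  # F-table, df = (2, 9), alpha = 0.05 (assumed alpha)
print(f"SS_between={ss_between}, SS_within={ss_within}")
print(f"F = {f_stat:.2f} vs critical {f_critical}")
print("Reject H0" if f_stat > f_critical else "Fail to reject H0")
```

With this data F comes out at about 6, above 4.26, so the brands differ. Make the plants inside each brand wobble more and SS_within grows, F shrinks, and the same 30/32/31 means stop being convincing.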
★ / Exam Day
| Your goal | Data type | Test | Table |
|---|---|---|---|
| Compare 1 mean to a target number | numeric | z (σ known / large n) or t (σ unknown / small n) | z / t |
| Compare 2 group means | numeric, 2 groups | 2-sample t | t |
| Compare 3+ group means | numeric, 3+ groups | ANOVA (F) | F |
| Test one proportion | yes / no | z for proportion | z |
| Compare two proportions | yes / no, 2 groups | z for two proportions | z |
| Are two categories related? | categorical | Chi-square (independence) | χ² |
| Relationship between two numerics | numeric pairs | correlation / regression | t |

| Quantity | Formula | Note |
|---|---|---|
| Mean | x̄ = Σx / n | — |
| Median position | (n + 1) / 2 | seat, not score |
| Sample variance | s² = [Σx² − (Σx)²/n] / (n−1) | computational form |
| Standard deviation | s = √s² | real units |
| Coeff. of variation | CV = (s / x̄) × 100% | compare across things |
| Standard error | SE = σ / √n | SD of the mean |
| z-score | z = (x − μ) / σ | standardize a value |
| One-sample z | z = (x̄ − μ) / (σ/√n) | σ known / large n |
| One-sample t | t = (x̄ − μ) / (s/√n) | df = n − 1 |
| Correlation r | r = [nΣxy − ΣxΣy] / √([nΣx²−(Σx)²][nΣy²−(Σy)²]) | −1 to +1 |
| Regression slope | b = [nΣxy − ΣxΣy] / [nΣx²−(Σx)²] | a = ȳ − bx̄ |
| Chi-square | χ² = Σ[(O − E)² / E] | O = observed, E = expected |
| ANOVA F | F = MS_between / MS_within | MS = SS / df |
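The chi-square row is the one formula above without a worked example, so here is a minimal sketch of a test of independence. The 2×2 counts are invented (say, study style vs pass/fail); expected counts use the usual row total × column total ÷ grand total, and 3.841 is the χ² table value for df = 1 at an assumed α = 0.05.

```python
# Hypothetical observed counts: rows = study style, columns = passed / failed.
observed = [
    [30, 20],  # studied in a group
    [20, 30],  # studied alone
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# chi^2 = sum of (O - E)^2 / E over every cell,
# where E = row total * column total / grand total.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
chi_critical = 3.841  # chi-square table, df = 1, alpha = 0.05 (assumed alpha)

print(f"chi^2 = {chi_square:.2f}, df = {df}, critical = {chi_critical}")
print("Related" if chi_square > chi_critical else "No evidence of a relationship")
```

Every expected count here is 25, so χ² works out to 4.0, just past 3.841.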