PCA — Problems & Solutions

5 real-world problems, explained step by step

- E-Commerce (customer segmentation): 8 → 3 features
- Image (compression): 256 → 50
- Medical (heart disease): 15 → 5
- Stocks (portfolio analysis): 20 → 5
- Spam (email detection): 10k → 100
Step 1 → Standardize data (mean=0, std=1)
Step 2 → Compute Covariance Matrix
Step 3 → Get Eigenvalues & Eigenvectors
Step 4 → Select top k components (≥ 85–95% variance)
Step 5 → Transform data into new k-dimensional space
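The five steps map directly onto a few lines of NumPy. A minimal sketch; the function name, the toy data, and the 90% variance threshold are illustrative, not taken from the problems below:

```python
import numpy as np

def pca_fit_transform(X, var_threshold=0.90):
    # Step 1: standardize each feature to mean 0, std 1
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: covariance matrix of the standardized data
    cov = np.cov(X_std, rowvar=False)

    # Step 3: eigendecomposition (eigh, since the covariance matrix is
    # symmetric); eigh returns ascending eigenvalues, so sort descending
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 4: smallest k whose cumulative variance ratio clears the threshold
    ratios = eigvals / eigvals.sum()
    k = int(np.searchsorted(np.cumsum(ratios), var_threshold)) + 1

    # Step 5: project the data onto the top-k eigenvectors
    return X_std @ eigvecs[:, :k], ratios[:k]

Z, explained = pca_fit_transform(np.random.rand(200, 8))
print(Z.shape, explained.sum())
```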
Results at a glance:
- E-Commerce: 85.9% variance retained (8 → 3)
- Image: 88.7% (256 → 50)
- Medical: 85.1% (15 → 5)
- Stocks: 74.9% (20 → 5)
- Spam: 93.1% accuracy (10k → 100)
E-Commerce / Marketing
Problem 1: Customer Segmentation
A company has 8 features per customer. Clustering is slow and the features overlap heavily; we need fewer, cleaner dimensions.
The 8 features and their ranges:
- age: 18–65
- annual income: 20k–150k
- spending score: 1–100
- purchases: 0–500
- avg order value: $10–500
- days since last: 0–365
- email open rate: 0–100%
- page views: 0–1000
1. Standardize all 8 features
   Scale to mean=0, std=1. Without this, page_views (0–1000) would dominate age (18–65).
2. Find correlations
   annual_income ↔ avg_order_value: r=0.89 (very high); num_purchases ↔ spending_score: r=0.76. These pairs carry redundant information, which PCA will merge.
3. Compute eigenvalues
   PC1 = 47.8%, PC2 = 24.2%, PC3 = 13.9% of variance. Together: 85.9% captured.
4. Keep 3 components (85.9%)
   Reduce from 8 features to 3 principal components, each with a clear business meaning (see the table and sketch below).
Component | Driven by | Business meaning
PC1 | income, order_value, spending_score | Purchasing Power (wealth)
PC2 | page_views, email_open_rate | Digital Engagement
PC3 | age, days_since_last | Customer Lifecycle stage
Result
- Dimensions reduced: 8 → 3 (62.5% fewer features)
- Information retained: 85.9%
- Clustering runs 4× faster
- 3 clear segments found: VIP customers, Deal hunters, At-risk customers
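A minimal scikit-learn sketch of the whole workflow; the customer matrix here is random stand-in data, and the cluster count of 3 simply mirrors the segments above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in data: 1,000 customers x 8 features (age, annual_income,
# spending_score, purchases, avg_order_value, days_since_last,
# email_open_rate, page_views)
X = np.random.default_rng(0).random((1000, 8))

# Standardize, then keep the top 3 components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)    # variance share per component

# Cluster in the 3-D PCA space instead of the raw 8-D space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)
print(np.bincount(labels))              # customers per segment

# Loadings: which original features drive each component
for i, comp in enumerate(pca.components_):
    top = np.argsort(np.abs(comp))[::-1][:3]
    print(f"PC{i + 1} strongest features (by column index):", top)
```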
Computer Vision
Problem 2: Image Compression
A 256×256 grayscale image, with each row treated as a sample, has 256 features (one per pixel column). Adjacent pixels are highly correlated, and PCA exploits that redundancy to compress.
- Original size: 65,536 pixel values
- Features: 256 pixel columns
- Pixel correlation: ~90% (highly redundant)
k | Variance retained | Compression
k=256 (full) | 100% | 1×
k=100 | 95.2% | 1.3×
k=50 ← best | 88.7% | 2.6×
k=20 | 76.3% | 6.4×
k=5 | 45% (blurry) | 25.6×
1. Treat each row as a data sample
   256 rows = 256 samples, each with 256 pixel-intensity features. Normalize pixel values from 0–255 to 0–1.
2. Apply PCA with k=50
   Transform 256×256 → 256×50. Only 50 components are needed to represent 88.7% of the visual information.
3. Reconstruct the image
   Use inverse_transform: 256×50 → back to 256×256. The result looks nearly identical to the original.
Sweet spot: k = 50
- 2.6× smaller file size
- 88.7% variance retained; looks nearly identical to the original
- Only 50 numbers per row instead of 256
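The compress-and-reconstruct loop in scikit-learn. The image below is a random array standing in for a real photo; on a real image, where adjacent pixels correlate, k=50 retains far more variance than it does on noise:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a real 256x256 grayscale image, already scaled to 0-1
img = np.random.default_rng(0).random((256, 256))

# Each of the 256 rows is a sample with 256 pixel-column features
pca = PCA(n_components=50)
compressed = pca.fit_transform(img)            # 256 x 50
print(pca.explained_variance_ratio_.sum())     # variance retained at k=50

# Reconstruct an approximation of the original image
reconstructed = pca.inverse_transform(compressed)   # back to 256 x 256
print(np.abs(img - reconstructed).mean())           # mean reconstruction error
```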
Healthcare
Problem 3: Medical Diagnosis
Each patient gets 15 lab tests to predict heart-disease risk. Many of the tests are correlated; PCA removes the redundancy and actually improves accuracy.
Feature pair | Correlation | What PCA does
total_cholesterol + hdl + ldl | r = 0.95 | Merges into 1 component
blood_pressure + heart_rate | r = 0.78 | Partially merges
bmi + waist_hip_ratio | r = 0.82 | Merges into 1 component
glucose + triglycerides | r = 0.69 | Partially merges
PC1: Cholesterol profile (31.2% of variance). Driven by total_cholesterol, HDL, LDL; represents overall lipid health.
PC2: Cardiovascular stress (19.8%). Driven by blood_pressure and heart_rate; how hard the heart is working.
PC3: Metabolic syndrome (14.1%). Driven by BMI, waist_hip_ratio, glucose; obesity and diabetes risk factors.
PC4: Blood composition (11.3%). Driven by hemoglobin and white blood cells.
PC5: Kidney markers (8.7%). Driven by creatinine and uric acid levels.
- Accuracy (raw 15 features): 71%
- Accuracy (5 PCA components): 79%
- AUC-ROC (raw): 0.74
- AUC-ROC (PCA): 0.83
Key insight: removing noise improves accuracy
- Accuracy improved by 8 percentage points after dropping the redundant tests
- Training is 4.7× faster (4.2s → 0.9s)
- The 5 components map to clinically meaningful health dimensions
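A sketch of the raw-vs-PCA comparison using scikit-learn pipelines; the patient matrix and labels are random stand-ins, so the printed scores will not match the numbers above:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data: 500 patients x 15 lab tests, binary heart-disease label
rng = np.random.default_rng(0)
X = rng.random((500, 15))
y = rng.integers(0, 2, size=500)

# Same classifier, with and without the PCA step
raw_model = make_pipeline(StandardScaler(),
                          LogisticRegression(max_iter=1000))
pca_model = make_pipeline(StandardScaler(), PCA(n_components=5),
                          LogisticRegression(max_iter=1000))

print("AUC raw:", cross_val_score(raw_model, X, y, cv=5, scoring="roc_auc").mean())
print("AUC pca:", cross_val_score(pca_model, X, y, cv=5, scoring="roc_auc").mean())
```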
Finance
Problem 4: Stock Market Factors
20 stocks over 252 trading days produce 190 pairwise correlations, far too many to analyze by hand. PCA reveals the hidden forces driving prices.
F1: Market-wide movement (38.5% of variance). All 20 stocks load positively; this is the "market beta", the whole market rising or falling together.
F2: Tech vs Energy (14.2%). AAPL/MSFT load +0.82, XOM/CVX load -0.79. When tech rises, energy often falls, and vice versa.
F3: Growth vs Value (9.1%). AMZN/NFLX (high growth) vs JPM/PG (stable value). Captures growth-value rotation cycles.
F4: Healthcare vs Consumer (7.3%). JNJ/PFE vs KO/PEP. Defensive sector rotation during market uncertainty.
F5: Cap size factor (5.8%). Small-cap vs large-cap performance divergence over time.
Factor 2 loadings:
- AAPL / MSFT: +0.82 (tech up)
- AMZN / NFLX: +0.71
- JNJ / KO: ~0 (neutral)
- CAT / GE: -0.65
- XOM / CVX: -0.79 (energy down)
Financial insight
- Only 5 truly independent risk factors exist across the 20 stocks, not 190 separate pairs
- Factor 2 reveals the Tech↑ / Energy↓ trade-off, invisible in the raw data
- True diversification means picking stocks from different factors, not just different sectors
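A sketch of factor extraction from a return matrix; the returns below are simulated noise, so the factor structure described above will not appear, but the mechanics are the same:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 252 trading days x 20 stocks of daily returns
returns = np.random.default_rng(0).normal(0.0, 0.01, size=(252, 20))

# The principal components of the return matrix are the risk factors
pca = PCA(n_components=5)
pca.fit(returns)
print(pca.explained_variance_ratio_)   # variance explained per factor

# Loadings: how strongly each stock moves with each factor.
# A factor where all 20 loadings are positive is the market-wide factor;
# mixed signs (tech positive, energy negative) mark a rotation factor.
loadings = pca.components_             # shape (5, 20)
print(loadings[1])                     # factor 2 loadings, one per stock
```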
NLP / Text
Problem 5: Spam Email Detection
5,000 emails → TF-IDF → 10,000 word features. Too many, too sparse, too slow. PCA groups similar words into spam "topics".
- Raw features: 10,000 unique words
- Matrix density: < 2% (very sparse)
- Training time (raw): 3m 12s (far too slow)
Component | Top words | Spam topic
PC1 | free, win, prize, click, cash, winner | PRIZE_SPAM
PC2 | buy, cheap, discount, sale, deal, save | COMMERCIAL_SPAM
PC3 | viagra, pills, pharmacy, prescription, drug | PHARMA_SPAM
PC4 | bank, account, verify, password, login | PHISHING
PC5 | million, inheritance, nigeria, transfer, funds | SCAM_SPAM
k | Accuracy | Training time
k=10,000 (raw) | 89.2% | 3m 12s
k=500 | 91.4% | 48s
k=200 | 92.7% | 19s
k=100 ← best | 93.1% | 9s
k=50 | 91.8% | 5s
Surprising result: fewer features = better accuracy
- 10,000 → 100 features: accuracy rose from 89.2% to 93.1%
- Training is 21× faster (3m 12s → 9s)
- Noise words are removed, so the model learns cleaner spam patterns
- PCA on text = LSA (Latent Semantic Analysis), widely used in NLP (see the sketch below)
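In practice, plain PCA would densify the sparse TF-IDF matrix because it has to center the data, so scikit-learn's TruncatedSVD, which is exactly LSA, is the standard choice here. A minimal sketch; the six-email corpus and k=3 are toy stand-ins for the 5,000 emails and k=100 above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the 5,000 labeled emails
emails = ["win a free prize now", "cheap pills discount sale",
          "meeting moved to friday", "please verify your bank account",
          "lunch tomorrow?", "claim your cash prize winner"]
labels = [1, 1, 0, 1, 0, 1]            # 1 = spam, 0 = ham

# TruncatedSVD works directly on the sparse TF-IDF matrix (this is LSA)
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),   # ~100 on the real data
    LogisticRegression(max_iter=1000),
)
model.fit(emails, labels)
print(model.predict(["free cash winner click here"]))   # expect spam (1)
```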