PCA — Problems & Solutions

5 real-world problems, explained step by step

- E-Commerce (customer segmentation): 8 → 3 features
- Image (compression): 256 → 50
- Medical (heart disease): 15 → 5
- Stocks (portfolio analysis): 20 → 5
- Spam (email detection): 10k → 100
Step 1 → Standardize data (mean=0, std=1)
Step 2 → Compute Covariance Matrix
Step 3 → Get Eigenvalues & Eigenvectors
Step 4 → Select top k components (≥ 85–95% variance)
Step 5 → Transform data into new k-dimensional space
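The five steps map directly onto a few lines of NumPy. A minimal sketch; the function name, the toy data, and the 90% variance threshold are illustrative, not taken from the problems below:

```python
import numpy as np

def pca_fit_transform(X, var_threshold=0.90):
    # Step 1: standardize each feature to mean 0, std 1
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: covariance matrix of the standardized data
    cov = np.cov(X_std, rowvar=False)

    # Step 3: eigendecomposition (eigh, since the covariance matrix is
    # symmetric); eigh returns ascending eigenvalues, so sort descending
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 4: smallest k whose cumulative variance ratio clears the threshold
    ratios = eigvals / eigvals.sum()
    k = int(np.searchsorted(np.cumsum(ratios), var_threshold)) + 1

    # Step 5: project the data onto the top-k eigenvectors
    return X_std @ eigvecs[:, :k], ratios[:k]

Z, explained = pca_fit_transform(np.random.rand(200, 8))
print(Z.shape, explained.sum())
```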
Results at a glance:
- E-Commerce: 85.9% variance retained (8 → 3)
- Image: 88.7% (256 → 50)
- Medical: 85.1% (15 → 5)
- Stocks: 74.9% (20 → 5)
- Spam: 93.1% accuracy (10k → 100)
E-Commerce / Marketing
Problem 1: Customer Segmentation
A company has 8 features per customer. Clustering is slow and the features overlap heavily; we need fewer, cleaner dimensions.
The 8 features and their ranges:
- age: 18–65
- annual income: 20k–150k
- spending score: 1–100
- purchases: 0–500
- avg order value: $10–500
- days since last: 0–365
- email open rate: 0–100%
- page views: 0–1000
1. Standardize all 8 features
   Scale to mean=0, std=1. Without this, page_views (0–1000) would dominate age (18–65).
2. Find correlations
   annual_income ↔ avg_order_value: r=0.89 (very high); num_purchases ↔ spending_score: r=0.76. These pairs carry redundant information, which PCA will merge.
3. Compute eigenvalues
   PC1 = 47.8%, PC2 = 24.2%, PC3 = 13.9% of variance. Together: 85.9% captured.
4. Keep 3 components (85.9%)
   Reduce from 8 features to 3 principal components, each with a clear business meaning (see the table and sketch below).
Component | Driven by | Business meaning
PC1 | income, order_value, spending_score | Purchasing Power (wealth)
PC2 | page_views, email_open_rate | Digital Engagement
PC3 | age, days_since_last | Customer Lifecycle stage
Result
- Dimensions reduced: 8 → 3 (62.5% fewer features)
- Information retained: 85.9%
- Clustering runs 4× faster
- 3 clear segments found: VIP customers, Deal hunters, At-risk customers
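A minimal scikit-learn sketch of the whole workflow; the customer matrix here is random stand-in data, and the cluster count of 3 simply mirrors the segments above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in data: 1,000 customers x 8 features (age, annual_income,
# spending_score, purchases, avg_order_value, days_since_last,
# email_open_rate, page_views)
X = np.random.default_rng(0).random((1000, 8))

# Standardize, then keep the top 3 components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)    # variance share per component

# Cluster in the 3-D PCA space instead of the raw 8-D space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)
print(np.bincount(labels))              # customers per segment

# Loadings: which original features drive each component
for i, comp in enumerate(pca.components_):
    top = np.argsort(np.abs(comp))[::-1][:3]
    print(f"PC{i + 1} strongest features (by column index):", top)
```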
Computer Vision
Problem 2: Image Compression
A 256×256 grayscale image, with each row treated as a sample, has 256 features (one per pixel column). Adjacent pixels are highly correlated, and PCA exploits that redundancy to compress.
- Original size: 65,536 pixel values
- Features: 256 pixel columns
- Pixel correlation: ~90% (highly redundant)
k | Variance retained | Compression
k=256 (full) | 100% | 1×
k=100 | 95.2% | 1.3×
k=50 ← best | 88.7% | 2.6×
k=20 | 76.3% | 6.4×
k=5 | 45% (blurry) | 25.6×
1. Treat each row as a data sample
   256 rows = 256 samples, each with 256 pixel-intensity features. Normalize pixel values from 0–255 to 0–1.
2. Apply PCA with k=50
   Transform 256×256 → 256×50. Only 50 components are needed to represent 88.7% of the visual information.
3. Reconstruct the image
   Use inverse_transform: 256×50 → back to 256×256. The result looks nearly identical to the original.
Sweet spot: k = 50
- 2.6× smaller file size
- 88.7% variance retained; looks nearly identical to the original
- Only 50 numbers per row instead of 256
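The compress-and-reconstruct loop in scikit-learn. The image below is a random array standing in for a real photo; on a real image, where adjacent pixels correlate, k=50 retains far more variance than it does on noise:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a real 256x256 grayscale image, already scaled to 0-1
img = np.random.default_rng(0).random((256, 256))

# Each of the 256 rows is a sample with 256 pixel-column features
pca = PCA(n_components=50)
compressed = pca.fit_transform(img)            # 256 x 50
print(pca.explained_variance_ratio_.sum())     # variance retained at k=50

# Reconstruct an approximation of the original image
reconstructed = pca.inverse_transform(compressed)   # back to 256 x 256
print(np.abs(img - reconstructed).mean())           # mean reconstruction error
```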
Healthcare
Problem 3: Medical Diagnosis
Each patient gets 15 lab tests to predict heart-disease risk. Many of the tests are correlated; PCA removes the redundancy and actually improves accuracy.
Feature pair | Correlation | What PCA does
total_cholesterol + hdl + ldl | r = 0.95 | Merges into 1 component
blood_pressure + heart_rate | r = 0.78 | Partially merges
bmi + waist_hip_ratio | r = 0.82 | Merges into 1 component
glucose + triglycerides | r = 0.69 | Partially merges
PC1: Cholesterol profile (31.2% of variance). Driven by total_cholesterol, HDL, LDL; represents overall lipid health.
PC2: Cardiovascular stress (19.8%). Driven by blood_pressure and heart_rate; how hard the heart is working.
PC3: Metabolic syndrome (14.1%). Driven by BMI, waist_hip_ratio, glucose; obesity and diabetes risk factors.
PC4: Blood composition (11.3%). Driven by hemoglobin and white blood cells.
PC5: Kidney markers (8.7%). Driven by creatinine and uric acid levels.
- Accuracy (raw 15 features): 71%
- Accuracy (5 PCA components): 79%
- AUC-ROC (raw): 0.74
- AUC-ROC (PCA): 0.83
Key insight: removing noise improves accuracy
- Accuracy improved by 8 percentage points after dropping the redundant tests
- Training is 4.7× faster (4.2s → 0.9s)
- The 5 components map to clinically meaningful health dimensions
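A sketch of the raw-vs-PCA comparison using scikit-learn pipelines; the patient matrix and labels are random stand-ins, so the printed scores will not match the numbers above:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data: 500 patients x 15 lab tests, binary heart-disease label
rng = np.random.default_rng(0)
X = rng.random((500, 15))
y = rng.integers(0, 2, size=500)

# Same classifier, with and without the PCA step
raw_model = make_pipeline(StandardScaler(),
                          LogisticRegression(max_iter=1000))
pca_model = make_pipeline(StandardScaler(), PCA(n_components=5),
                          LogisticRegression(max_iter=1000))

print("AUC raw:", cross_val_score(raw_model, X, y, cv=5, scoring="roc_auc").mean())
print("AUC pca:", cross_val_score(pca_model, X, y, cv=5, scoring="roc_auc").mean())
```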
Finance
Problem 4: Stock Market Factors
20 stocks over 252 trading days produce 190 pairwise correlations, far too many to analyze by hand. PCA reveals the hidden forces driving prices.
F1: Market-wide movement (38.5% of variance). All 20 stocks load positively; this is the "market beta", the whole market rising or falling together.
F2: Tech vs Energy (14.2%). AAPL/MSFT load +0.82, XOM/CVX load -0.79. When tech rises, energy often falls, and vice versa.
F3: Growth vs Value (9.1%). AMZN/NFLX (high growth) vs JPM/PG (stable value). Captures growth-value rotation cycles.
F4: Healthcare vs Consumer (7.3%). JNJ/PFE vs KO/PEP. Defensive sector rotation during market uncertainty.
F5: Cap size factor (5.8%). Small-cap vs large-cap performance divergence over time.
Factor 2 loadings:
- AAPL / MSFT: +0.82 (tech up)
- AMZN / NFLX: +0.71
- JNJ / KO: ~0 (neutral)
- CAT / GE: -0.65
- XOM / CVX: -0.79 (energy down)
Financial insight
- Only 5 truly independent risk factors exist across the 20 stocks, not 190 separate pairs
- Factor 2 reveals the Tech↑ / Energy↓ trade-off, invisible in the raw data
- True diversification means picking stocks from different factors, not just different sectors
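A sketch of factor extraction from a return matrix; the returns below are simulated noise, so the factor structure described above will not appear, but the mechanics are the same:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 252 trading days x 20 stocks of daily returns
returns = np.random.default_rng(0).normal(0.0, 0.01, size=(252, 20))

# The principal components of the return matrix are the risk factors
pca = PCA(n_components=5)
pca.fit(returns)
print(pca.explained_variance_ratio_)   # variance explained per factor

# Loadings: how strongly each stock moves with each factor.
# A factor where all 20 loadings are positive is the market-wide factor;
# mixed signs (tech positive, energy negative) mark a rotation factor.
loadings = pca.components_             # shape (5, 20)
print(loadings[1])                     # factor 2 loadings, one per stock
```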
NLP / Text
Problem 5: Spam Email Detection
5,000 emails → TF-IDF → 10,000 word features. Too many, too sparse, too slow. PCA groups similar words into spam "topics".
- Raw features: 10,000 unique words
- Matrix density: < 2% (very sparse)
- Training time (raw): 3m 12s (far too slow)
Component | Top words | Spam topic
PC1 | free, win, prize, click, cash, winner | PRIZE_SPAM
PC2 | buy, cheap, discount, sale, deal, save | COMMERCIAL_SPAM
PC3 | viagra, pills, pharmacy, prescription, drug | PHARMA_SPAM
PC4 | bank, account, verify, password, login | PHISHING
PC5 | million, inheritance, nigeria, transfer, funds | SCAM_SPAM
k | Accuracy | Training time
k=10,000 (raw) | 89.2% | 3m 12s
k=500 | 91.4% | 48s
k=200 | 92.7% | 19s
k=100 ← best | 93.1% | 9s
k=50 | 91.8% | 5s
Surprising result: fewer features = better accuracy
- 10,000 → 100 features: accuracy rose from 89.2% to 93.1%
- Training is 21× faster (3m 12s → 9s)
- Noise words are removed, so the model learns cleaner spam patterns
- PCA on text = LSA (Latent Semantic Analysis), widely used in NLP (see the sketch below)
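In practice, plain PCA would densify the sparse TF-IDF matrix because it has to center the data, so scikit-learn's TruncatedSVD, which is exactly LSA, is the standard choice here. A minimal sketch; the six-email corpus and k=3 are toy stand-ins for the 5,000 emails and k=100 above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the 5,000 labeled emails
emails = ["win a free prize now", "cheap pills discount sale",
          "meeting moved to friday", "please verify your bank account",
          "lunch tomorrow?", "claim your cash prize winner"]
labels = [1, 1, 0, 1, 0, 1]            # 1 = spam, 0 = ham

# TruncatedSVD works directly on the sparse TF-IDF matrix (this is LSA)
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),   # ~100 on the real data
    LogisticRegression(max_iter=1000),
)
model.fit(emails, labels)
print(model.predict(["free cash winner click here"]))   # expect spam (1)
```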