Capstone Project: E-Commerce Analytics
Project Overview
You’re a data scientist at an e-commerce company. The CEO wants answers to critical business questions:
- Descriptive: What does our sales data look like? Any patterns?
- Inferential: Is our new checkout flow actually better?
- Predictive: What will sales be next month?
- Causal: What factors drive customer lifetime value?
Estimated Time: 6-8 hours
Difficulty: Intermediate
Prerequisites: All previous modules
What You’ll Deliver: Complete analysis report with actionable insights
Part 1: Data Exploration & Descriptive Statistics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from datetime import datetime, timedelta
np.random.seed(42)
# Generate realistic e-commerce data
n_customers = 5000
n_transactions = 20000
# Customer data
customers = pd.DataFrame({
    'customer_id': range(n_customers),
    'signup_date': pd.date_range('2022-01-01', periods=n_customers, freq='2H'),
    'age': np.random.normal(35, 10, n_customers).clip(18, 70).astype(int),
    'gender': np.random.choice(['M', 'F', 'Other'], n_customers, p=[0.48, 0.48, 0.04]),
    'city': np.random.choice(['New York', 'LA', 'Chicago', 'Houston', 'Phoenix'], n_customers),
    'acquisition_source': np.random.choice(['Organic', 'Paid Ads', 'Referral', 'Email'], n_customers, p=[0.3, 0.4, 0.2, 0.1]),
})
# Transaction data (customers make 1-10 purchases)
transactions = []
for cust_id in range(n_customers):
    n_purchases = np.random.poisson(4) + 1
    for _ in range(min(n_purchases, 10)):
        transaction = {
            'customer_id': cust_id,
            'date': customers.loc[cust_id, 'signup_date'] + timedelta(days=np.random.randint(0, 365)),
            'amount': np.random.lognormal(3.5, 1) + 10,  # roughly $10-$500+ range
            'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books', 'Food'],
                                         p=[0.25, 0.30, 0.20, 0.15, 0.10]),
            'discount_used': np.random.random() < 0.3,
        }
        transactions.append(transaction)
transactions = pd.DataFrame(transactions)
transactions = transactions.sample(n_transactions).reset_index(drop=True)
print("Dataset Overview")
print("=" * 50)
print(f"Customers: {len(customers):,}")
print(f"Transactions: {len(transactions):,}")
print(f"Date range: {transactions['date'].min().date()} to {transactions['date'].max().date()}")
# Merge for analysis
df = transactions.merge(customers, on='customer_id')
print(f"\nTransaction Summary:")
print(df['amount'].describe())
Exploratory Visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
# Distribution of purchase amounts
axes[0, 0].hist(df['amount'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].axvline(df['amount'].median(), color='red', linestyle='--', label=f'Median: ${df["amount"].median():.2f}')
axes[0, 0].axvline(df['amount'].mean(), color='green', linestyle='--', label=f'Mean: ${df["amount"].mean():.2f}')
axes[0, 0].set_xlabel('Purchase Amount ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Purchase Amounts')
axes[0, 0].legend()
# Sales by category
category_sales = df.groupby('category')['amount'].sum().sort_values(ascending=True)
axes[0, 1].barh(category_sales.index, category_sales.values)
axes[0, 1].set_xlabel('Total Sales ($)')
axes[0, 1].set_title('Sales by Category')
# Daily transactions over time
daily_sales = df.groupby(df['date'].dt.date)['amount'].sum()
axes[0, 2].plot(daily_sales.index, daily_sales.values, alpha=0.7)
axes[0, 2].set_xlabel('Date')
axes[0, 2].set_ylabel('Daily Sales ($)')
axes[0, 2].set_title('Daily Sales Over Time')
axes[0, 2].tick_params(axis='x', rotation=45)
# Customer age distribution
axes[1, 0].hist(customers['age'], bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Customer Age Distribution')
# Purchases by acquisition source
source_counts = df['acquisition_source'].value_counts()
axes[1, 1].pie(source_counts.values, labels=source_counts.index, autopct='%1.1f%%')
axes[1, 1].set_title('Transactions by Acquisition Source')
# Average spend by age group
df['age_group'] = pd.cut(df['age'], bins=[18, 25, 35, 45, 55, 70], labels=['18-25', '26-35', '36-45', '46-55', '56+'], include_lowest=True)
age_spend = df.groupby('age_group')['amount'].mean()
axes[1, 2].bar(age_spend.index, age_spend.values)
axes[1, 2].set_xlabel('Age Group')
axes[1, 2].set_ylabel('Average Spend ($)')
axes[1, 2].set_title('Average Spend by Age Group')
plt.tight_layout()
plt.show()
# Key insights
print("\nKey Descriptive Insights:")
print(f" - Median purchase: ${df['amount'].median():.2f} (mean: ${df['amount'].mean():.2f}) - Right-skewed!")
print(f" - Top category: {category_sales.idxmax()} (${category_sales.max():,.0f} total)")
print(f" - {df['discount_used'].mean()*100:.1f}% of purchases used a discount")
Part 2: Inferential Statistics - A/B Test Analysis
The marketing team ran an A/B test on a new checkout flow. Let’s analyze the results.
# Simulate A/B test data
np.random.seed(42)
# Control group (old checkout)
n_control = 5000
control_visitors = n_control
control_conversions = int(n_control * 0.032) # 3.2% conversion rate
control_revenue = np.random.lognormal(3.8, 0.8, control_conversions)
# Treatment group (new checkout)
n_treatment = 5000
treatment_visitors = n_treatment
treatment_conversions = int(n_treatment * 0.038) # 3.8% conversion rate
treatment_revenue = np.random.lognormal(3.9, 0.8, treatment_conversions)
print("A/B Test: New Checkout Flow")
print("=" * 50)
print(f"Control (Old):")
print(f" Visitors: {control_visitors:,}")
print(f" Conversions: {control_conversions} ({control_conversions/control_visitors:.2%})")
print(f" Avg Revenue: ${np.mean(control_revenue):.2f}")
print(f"\nTreatment (New):")
print(f" Visitors: {treatment_visitors:,}")
print(f" Conversions: {treatment_conversions} ({treatment_conversions/treatment_visitors:.2%})")
print(f" Avg Revenue: ${np.mean(treatment_revenue):.2f}")
# Statistical tests
# 1. Chi-square test for conversion rate
contingency_table = np.array([
    [control_conversions, control_visitors - control_conversions],
    [treatment_conversions, treatment_visitors - treatment_conversions]
])
chi2, p_value_chi, _, _ = stats.chi2_contingency(contingency_table)
print(f"\nStatistical Tests:")
print(f"1. Chi-Square Test for Conversion Rate:")
print(f" χ² = {chi2:.4f}, p-value = {p_value_chi:.4f}")
print(f" {'Significant!' if p_value_chi < 0.05 else 'Not significant'} at α = 0.05")
# 2. Two-sample t-test for revenue (among converters)
t_stat, p_value_t = stats.ttest_ind(treatment_revenue, control_revenue)
print(f"\n2. T-Test for Revenue per Conversion:")
print(f" t = {t_stat:.4f}, p-value = {p_value_t:.4f}")
print(f" {'Significant!' if p_value_t < 0.05 else 'Not significant'} at α = 0.05")
# 3. Bayesian analysis
from scipy.stats import beta
# Posterior distributions
posterior_control = beta(1 + control_conversions, 1 + control_visitors - control_conversions)
posterior_treatment = beta(1 + treatment_conversions, 1 + treatment_visitors - treatment_conversions)
# Sample from posteriors
samples_control = posterior_control.rvs(100000)
samples_treatment = posterior_treatment.rvs(100000)
prob_treatment_better = np.mean(samples_treatment > samples_control)
expected_lift = np.mean((samples_treatment - samples_control) / samples_control) * 100
print(f"\n3. Bayesian Analysis:")
print(f" P(Treatment > Control) = {prob_treatment_better:.1%}")
print(f" Expected Lift = {expected_lift:.2f}%")
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Posterior distributions
x = np.linspace(0.025, 0.045, 500)
axes[0].plot(x, posterior_control.pdf(x), 'b-', linewidth=2, label='Control')
axes[0].plot(x, posterior_treatment.pdf(x), 'r-', linewidth=2, label='Treatment')
axes[0].fill_between(x, posterior_control.pdf(x), alpha=0.3, color='blue')
axes[0].fill_between(x, posterior_treatment.pdf(x), alpha=0.3, color='red')
axes[0].set_xlabel('Conversion Rate')
axes[0].set_ylabel('Probability Density')
axes[0].set_title('Posterior Distributions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Lift distribution
lift = (samples_treatment - samples_control) / samples_control * 100
axes[1].hist(lift, bins=100, density=True, alpha=0.7, edgecolor='black')
axes[1].axvline(x=0, color='black', linestyle='--', linewidth=2)
axes[1].axvline(x=np.mean(lift), color='red', linestyle='-', linewidth=2, label=f'Mean: {np.mean(lift):.1f}%')
axes[1].set_xlabel('Lift (%)')
axes[1].set_ylabel('Density')
axes[1].set_title('Distribution of Lift')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
# Revenue comparison
axes[2].hist(control_revenue, bins=30, alpha=0.5, label='Control', density=True)
axes[2].hist(treatment_revenue, bins=30, alpha=0.5, label='Treatment', density=True)
axes[2].set_xlabel('Revenue per Conversion ($)')
axes[2].set_ylabel('Density')
axes[2].set_title('Revenue Distribution per Conversion')
axes[2].legend()
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Business recommendation
print("\n" + "=" * 50)
print("BUSINESS RECOMMENDATION:")
if prob_treatment_better > 0.95:
    # Annualized impact: visitors per test period x 12 x difference in expected revenue per visitor
    # (this treats the test's 5,000 visitors as roughly one month of traffic)
    revenue_impact = treatment_visitors * 12 * (np.mean(samples_treatment) * np.mean(treatment_revenue) -
                                                np.mean(samples_control) * np.mean(control_revenue))
    print(f" SHIP IT! The new checkout flow is {prob_treatment_better:.1%} likely to be better.")
    print(f" Estimated annual revenue impact: ${revenue_impact:,.0f}")
else:
    print(" WAIT - Need more data; the evidence isn't conclusive yet.")
Part 3: Regression - Predicting Customer Lifetime Value
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
# Calculate customer-level metrics
customer_metrics = df.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'count'],
    'discount_used': 'mean',
    'date': ['min', 'max']
}).reset_index()
customer_metrics.columns = ['customer_id', 'total_spend', 'avg_order', 'n_orders',
                            'discount_rate', 'first_purchase', 'last_purchase']
# Merge with customer demographics
customer_data = customer_metrics.merge(customers, on='customer_id')
# Customer tenure: days from signup to most recent purchase
customer_data['tenure_days'] = (customer_data['last_purchase'] - customer_data['signup_date']).dt.days
# Target: total_spend (CLV proxy)
# Features: age, n_orders, avg_order, discount_rate, tenure_days, acquisition_source
feature_cols = ['age', 'n_orders', 'avg_order', 'discount_rate', 'tenure_days']
cat_cols = ['acquisition_source', 'city']
X = customer_data[feature_cols + cat_cols]
y = customer_data['total_spend']
# Preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), feature_cols),
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols)
])
# Model pipeline
model = Pipeline([
    ('prep', preprocessor),
    ('reg', Ridge(alpha=1.0))
])
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print("CLV Prediction Model")
print("=" * 50)
print(f"R² Score: {r2:.4f}")
print(f"RMSE: ${rmse:.2f}")
print(f"Cross-validation R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# Feature importance (from Ridge coefficients)
feature_names = feature_cols + list(model.named_steps['prep'].named_transformers_['cat'].get_feature_names_out(cat_cols))
coefficients = model.named_steps['reg'].coef_
importance_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values('coefficient', key=abs, ascending=True)
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Predicted vs Actual
axes[0].scatter(y_test, y_pred, alpha=0.3, s=10)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
axes[0].set_xlabel('Actual CLV ($)')
axes[0].set_ylabel('Predicted CLV ($)')
axes[0].set_title(f'Predicted vs Actual (R² = {r2:.3f})')
axes[0].grid(True, alpha=0.3)
# Residuals
residuals = y_test - y_pred
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residual ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
axes[1].grid(True, alpha=0.3)
# Feature importance
axes[2].barh(importance_df['feature'], importance_df['coefficient'])
axes[2].axvline(x=0, color='black', linewidth=0.5)
axes[2].set_xlabel('Coefficient')
axes[2].set_title('Feature Importance (Ridge Coefficients)')
plt.tight_layout()
plt.show()
# Key drivers
print("\nKey CLV Drivers:")
for _, row in importance_df.tail(5).iterrows():
    # Numeric features are standardized, so coefficients are per standard deviation;
    # one-hot coefficients are relative to the dropped reference category.
    direction = "higher" if row['coefficient'] > 0 else "lower"
    print(f" - {row['feature']}: associated with ${abs(row['coefficient']):.2f} {direction} CLV")
Part 4: Time Series - Sales Forecasting
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
# Aggregate to daily sales
daily_sales = df.groupby(df['date'].dt.date)['amount'].sum()
daily_sales.index = pd.to_datetime(daily_sales.index)
daily_sales = daily_sales.sort_index()
# Fill any missing dates with the median daily sales
date_range = pd.date_range(daily_sales.index.min(), daily_sales.index.max())
daily_sales = daily_sales.reindex(date_range, fill_value=daily_sales.median())
print("Sales Time Series")
print("=" * 50)
print(f"Period: {daily_sales.index.min().date()} to {daily_sales.index.max().date()}")
print(f"Average daily sales: ${daily_sales.mean():,.2f}")
print(f"Std dev: ${daily_sales.std():,.2f}")
# Decomposition
decomposition = seasonal_decompose(daily_sales, period=7) # Weekly seasonality
fig, axes = plt.subplots(4, 1, figsize=(14, 12))
axes[0].plot(daily_sales)
axes[0].set_title('Original Sales')
axes[0].set_ylabel('Sales ($)')
axes[0].grid(True, alpha=0.3)
axes[1].plot(decomposition.trend)
axes[1].set_title('Trend')
axes[1].set_ylabel('Sales ($)')
axes[1].grid(True, alpha=0.3)
axes[2].plot(decomposition.seasonal)
axes[2].set_title('Seasonal (Weekly)')
axes[2].set_ylabel('Sales ($)')
axes[2].grid(True, alpha=0.3)
axes[3].plot(decomposition.resid)
axes[3].set_title('Residual')
axes[3].set_ylabel('Sales ($)')
axes[3].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ARIMA Forecasting
train = daily_sales[:-30]
test = daily_sales[-30:]
model = ARIMA(train, order=(2, 1, 2))
fitted = model.fit()
forecast = fitted.forecast(steps=30)
forecast_ci = fitted.get_forecast(steps=30).conf_int()
# Plot forecast
plt.figure(figsize=(14, 6))
plt.plot(train[-60:], label='Historical', linewidth=2)
plt.plot(test.index, test, label='Actual', color='green', linewidth=2)
plt.plot(test.index, forecast, label='Forecast', color='red', linewidth=2, linestyle='--')
plt.fill_between(test.index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], alpha=0.2, color='red')
plt.axvline(x=train.index[-1], color='black', linestyle=':', label='Forecast Start')
plt.xlabel('Date')
plt.ylabel('Daily Sales ($)')
plt.title('30-Day Sales Forecast with 95% Confidence Interval')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Accuracy metrics
mape = np.mean(np.abs((test - forecast) / test)) * 100
rmse = np.sqrt(mean_squared_error(test, forecast))
print(f"\nForecast Accuracy (30-day):")
print(f" RMSE: ${rmse:,.2f}")
print(f" MAPE: {mape:.1f}%")
# Monthly projection
monthly_forecast = forecast.sum()
monthly_actual = test.sum()
print(f"\n📅 Next Month Projection:")
print(f" Forecasted: ${monthly_forecast:,.2f}")
print(f" Actual: ${monthly_actual:,.2f}")
print(f" Accuracy: {(1 - abs(monthly_forecast - monthly_actual)/monthly_actual)*100:.1f}%")
Part 5: Executive Summary
print("=" * 60)
print("EXECUTIVE SUMMARY: E-COMMERCE ANALYTICS")
print("=" * 60)
print("\n1. DESCRIPTIVE INSIGHTS:")
print(f" - Average transaction: ${df['amount'].mean():.2f} (median: ${df['amount'].median():.2f})")
print(f" - Top revenue category: {category_sales.idxmax()}")
print(f" - Discount usage: {df['discount_used'].mean()*100:.1f}% of transactions")
print(f" - Highest spending age group: {age_spend.idxmax()}")
print("\n2. A/B TEST RESULTS:")
print(f" - New checkout improves conversion by ~{expected_lift:.1f}%")
print(f" - {prob_treatment_better:.1%} probability of being better")
print(f" - Recommendation: {'SHIP IT' if prob_treatment_better > 0.95 else 'GATHER MORE DATA'}")
print("\n3. CLV PREDICTION MODEL:")
print(f" - Model R-squared: {r2:.3f}")
print(f" - Key drivers: {', '.join(importance_df.tail(3)['feature'].values)}")
print(f" - Use for: Targeting high-value customers, personalization")
print("\n4. SALES FORECAST:")
print(f" - Next month forecast: ${monthly_forecast:,.0f}")
print(f" - Forecast accuracy: {(1 - mape/100)*100:.1f}%")
print(f" - Trend: {'Upward' if decomposition.trend.iloc[-1] > decomposition.trend.iloc[-30] else 'Stable'}")
print("\n5. RECOMMENDED ACTIONS:")
print(" [+] Ship new checkout flow (strong statistical evidence)")
print(" [+] Focus marketing on 36-45 age group (highest spend)")
print(" [+] Increase Electronics category inventory (top revenue)")
print(" [+] Use CLV model for customer acquisition targeting")
print("\n" + "=" * 60)
Submission Checklist
Deliverables
- Jupyter Notebook with all analysis code
- Executive Summary (1 page) for non-technical stakeholders
- Technical Report documenting methodology and assumptions
- Slide Deck (5-10 slides) for presentation
Analysis Quality
- All statistical assumptions checked and documented (see the sketch after this list)
- Confidence/credible intervals provided for all estimates
- Visualizations are clear and properly labeled
- Business implications clearly stated
- Limitations acknowledged
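For the first item, here is a minimal sketch of what checking the t-test assumptions from Part 2 might look like (variable names assume that section has been run; the specific checks are illustrative, not required):
# Illustrative assumption checks for the two-sample t-test on revenue (Part 2)
from scipy import stats

# Rough normality check per group (t-tests are fairly robust at these sample sizes)
print("Shapiro-Wilk p-values:",
      stats.shapiro(control_revenue).pvalue,
      stats.shapiro(treatment_revenue).pvalue)

# Equal-variance check; if it fails, rerun with stats.ttest_ind(..., equal_var=False)
print("Levene p-value:", stats.levene(control_revenue, treatment_revenue).pvalue)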
Code Quality
- Well-commented and readable
- Functions documented with docstrings (see the example after this checklist)
- Reproducible (random seeds set)
- Error handling for edge cases
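And a minimal sketch of the last three items together: a documented, seeded, defensive helper function (the function itself is illustrative and not part of the project code):
# Illustrative example of a documented, reproducible, defensive helper function
import numpy as np
import pandas as pd

def summarize_amounts(amounts: pd.Series, seed: int = 42) -> dict:
    """Return basic summary statistics for a series of purchase amounts.

    Parameters
    ----------
    amounts : pd.Series
        Transaction amounts in dollars.
    seed : int
        Random seed, kept as a parameter so the bootstrap stays reproducible.

    Raises
    ------
    ValueError
        If the series is empty or contains negative amounts.
    """
    if amounts.empty:
        raise ValueError("amounts is empty - nothing to summarize")
    if (amounts < 0).any():
        raise ValueError("amounts contains negative values - check the data pipeline")
    rng = np.random.default_rng(seed)  # reproducible bootstrap for a rough 95% CI on the mean
    boot_means = [rng.choice(amounts, size=len(amounts), replace=True).mean() for _ in range(200)]
    return {
        "mean": amounts.mean(),
        "median": amounts.median(),
        "boot_ci_95": (np.percentile(boot_means, 2.5), np.percentile(boot_means, 97.5)),
    }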
Summary
| Module | What You Applied |
|---|---|
| Descriptive Statistics | Summarized data distributions |
| Probability | Calculated conditional probabilities |
| Hypothesis Testing | A/B test significance testing |
| Bayesian Statistics | Posterior distributions for conversion |
| Regression | CLV prediction model |
| Time Series | Sales forecasting |
Congratulations! You’ve completed a full statistical analysis pipeline. These skills, from exploratory analysis to predictive modeling, are exactly what data scientists use every day. You’re now ready to tackle real-world problems with confidence!