Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Statistics Capstone Project

Capstone Project: E-Commerce Analytics

Project Overview

You’re a data scientist at an e-commerce company. The CEO wants answers to critical business questions:
  1. Descriptive: What does our sales data look like? Any patterns?
  2. Inferential: Is our new checkout flow actually better?
  3. Predictive: What will sales be next month?
  4. Causal: What factors drive customer lifetime value?
This project integrates everything from the course! Think of this capstone as a dress rehearsal for real data science work. In practice, you never use just one statistical tool — you chain them together. Descriptive statistics tell you what to investigate, inference tells you what to trust, hypothesis testing tells you what to ship, regression tells you what drives outcomes, and time series tells you what comes next. The skill is knowing which tool to reach for at each decision point.
Estimated Time: 6-8 hours
Difficulty: Intermediate
Prerequisites: All previous modules
What You’ll Deliver: Complete analysis report with actionable insights

Part 1: Data Exploration & Descriptive Statistics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from datetime import datetime, timedelta

np.random.seed(42)

# Generate realistic e-commerce data
n_customers = 5000
n_transactions = 20000

# Customer data
customers = pd.DataFrame({
    'customer_id': range(n_customers),
    'signup_date': pd.date_range('2022-01-01', periods=n_customers, freq='2H'),
    'age': np.random.normal(35, 10, n_customers).clip(18, 70).astype(int),
    'gender': np.random.choice(['M', 'F', 'Other'], n_customers, p=[0.48, 0.48, 0.04]),
    'city': np.random.choice(['New York', 'LA', 'Chicago', 'Houston', 'Phoenix'], n_customers),
    'acquisition_source': np.random.choice(['Organic', 'Paid Ads', 'Referral', 'Email'], n_customers, p=[0.3, 0.4, 0.2, 0.1]),
})

# Transaction data (customers make 1-10 purchases)
transactions = []
for cust_id in range(n_customers):
    n_purchases = np.random.poisson(4) + 1
    for _ in range(min(n_purchases, 10)):
        transaction = {
            'customer_id': cust_id,
            'date': customers.loc[cust_id, 'signup_date'] + timedelta(days=np.random.randint(0, 365)),
            'amount': np.random.lognormal(3.5, 1) + 10,  # $10-$500+ range
            'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books', 'Food'], 
                                        p=[0.25, 0.30, 0.20, 0.15, 0.10]),
            'discount_used': np.random.random() < 0.3,
        }
        transactions.append(transaction)

transactions = pd.DataFrame(transactions)
transactions = transactions.sample(n_transactions).reset_index(drop=True)

print("Dataset Overview")
print("=" * 50)
print(f"Customers: {len(customers):,}")
print(f"Transactions: {len(transactions):,}")
print(f"Date range: {transactions['date'].min().date()} to {transactions['date'].max().date()}")

# Merge for analysis
df = transactions.merge(customers, on='customer_id')

print(f"\nTransaction Summary:")
print(df['amount'].describe())

Exploratory Visualizations

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Distribution of purchase amounts
axes[0, 0].hist(df['amount'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].axvline(df['amount'].median(), color='red', linestyle='--', label=f'Median: ${df["amount"].median():.2f}')
axes[0, 0].axvline(df['amount'].mean(), color='green', linestyle='--', label=f'Mean: ${df["amount"].mean():.2f}')
axes[0, 0].set_xlabel('Purchase Amount ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Purchase Amounts')
axes[0, 0].legend()

# Sales by category
category_sales = df.groupby('category')['amount'].sum().sort_values(ascending=True)
axes[0, 1].barh(category_sales.index, category_sales.values)
axes[0, 1].set_xlabel('Total Sales ($)')
axes[0, 1].set_title('Sales by Category')

# Daily transactions over time
daily_sales = df.groupby(df['date'].dt.date)['amount'].sum()
axes[0, 2].plot(daily_sales.index, daily_sales.values, alpha=0.7)
axes[0, 2].set_xlabel('Date')
axes[0, 2].set_ylabel('Daily Sales ($)')
axes[0, 2].set_title('Daily Sales Over Time')
axes[0, 2].tick_params(axis='x', rotation=45)

# Customer age distribution
axes[1, 0].hist(customers['age'], bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Customer Age Distribution')

# Purchases by acquisition source
source_counts = df['acquisition_source'].value_counts()
axes[1, 1].pie(source_counts.values, labels=source_counts.index, autopct='%1.1f%%')
axes[1, 1].set_title('Transactions by Acquisition Source')

# Average spend by age group
df['age_group'] = pd.cut(df['age'], bins=[18, 25, 35, 45, 55, 70], labels=['18-25', '26-35', '36-45', '46-55', '56+'])
age_spend = df.groupby('age_group')['amount'].mean()
axes[1, 2].bar(age_spend.index, age_spend.values)
axes[1, 2].set_xlabel('Age Group')
axes[1, 2].set_ylabel('Average Spend ($)')
axes[1, 2].set_title('Average Spend by Age Group')

plt.tight_layout()
plt.show()

# Key insights
print("\nKey Descriptive Insights:")
print(f"  - Median purchase: ${df['amount'].median():.2f} (mean: ${df['amount'].mean():.2f}) - Right-skewed!")
print(f"  - Top category: {category_sales.idxmax()} (${category_sales.max():,.0f} total)")
print(f"  - {df['discount_used'].mean()*100:.1f}% of purchases used a discount")

Part 2: Inferential Statistics - A/B Test Analysis

The marketing team ran an A/B test on a new checkout flow. Let’s analyze the results.
# Simulate A/B test data
np.random.seed(42)

# Control group (old checkout)
n_control = 5000
control_visitors = n_control
control_conversions = int(n_control * 0.032)  # 3.2% conversion rate
control_revenue = np.random.lognormal(3.8, 0.8, control_conversions)

# Treatment group (new checkout)
n_treatment = 5000
treatment_visitors = n_treatment
treatment_conversions = int(n_treatment * 0.038)  # 3.8% conversion rate
treatment_revenue = np.random.lognormal(3.9, 0.8, treatment_conversions)

print("A/B Test: New Checkout Flow")
print("=" * 50)
print(f"Control (Old):")
print(f"  Visitors: {control_visitors:,}")
print(f"  Conversions: {control_conversions} ({control_conversions/control_visitors:.2%})")
print(f"  Avg Revenue: ${np.mean(control_revenue):.2f}")

print(f"\nTreatment (New):")
print(f"  Visitors: {treatment_visitors:,}")
print(f"  Conversions: {treatment_conversions} ({treatment_conversions/treatment_visitors:.2%})")
print(f"  Avg Revenue: ${np.mean(treatment_revenue):.2f}")

# Statistical tests
# 1. Chi-square test for conversion rate
contingency_table = np.array([
    [control_conversions, control_visitors - control_conversions],
    [treatment_conversions, treatment_visitors - treatment_conversions]
])
chi2, p_value_chi, _, _ = stats.chi2_contingency(contingency_table)

print(f"\nStatistical Tests:")
print(f"1. Chi-Square Test for Conversion Rate:")
print(f"   χ² = {chi2:.4f}, p-value = {p_value_chi:.4f}")
print(f"   {'Significant!' if p_value_chi < 0.05 else 'Not significant'} at α = 0.05")

# 2. Two-sample t-test for revenue (among converters)
t_stat, p_value_t = stats.ttest_ind(treatment_revenue, control_revenue)
print(f"\n2. T-Test for Revenue per Conversion:")
print(f"   t = {t_stat:.4f}, p-value = {p_value_t:.4f}")
print(f"   {'Significant!' if p_value_t < 0.05 else 'Not significant'} at α = 0.05")

# 3. Bayesian analysis
from scipy.stats import beta

# Posterior distributions
posterior_control = beta(1 + control_conversions, 1 + control_visitors - control_conversions)
posterior_treatment = beta(1 + treatment_conversions, 1 + treatment_visitors - treatment_conversions)

# Sample from posteriors
samples_control = posterior_control.rvs(100000)
samples_treatment = posterior_treatment.rvs(100000)

prob_treatment_better = np.mean(samples_treatment > samples_control)
expected_lift = np.mean((samples_treatment - samples_control) / samples_control) * 100

print(f"\n3. Bayesian Analysis:")
print(f"   P(Treatment > Control) = {prob_treatment_better:.1%}")
print(f"   Expected Lift = {expected_lift:.2f}%")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Posterior distributions
x = np.linspace(0.025, 0.045, 500)
axes[0].plot(x, posterior_control.pdf(x), 'b-', linewidth=2, label='Control')
axes[0].plot(x, posterior_treatment.pdf(x), 'r-', linewidth=2, label='Treatment')
axes[0].fill_between(x, posterior_control.pdf(x), alpha=0.3, color='blue')
axes[0].fill_between(x, posterior_treatment.pdf(x), alpha=0.3, color='red')
axes[0].set_xlabel('Conversion Rate')
axes[0].set_ylabel('Probability Density')
axes[0].set_title('Posterior Distributions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Lift distribution
lift = (samples_treatment - samples_control) / samples_control * 100
axes[1].hist(lift, bins=100, density=True, alpha=0.7, edgecolor='black')
axes[1].axvline(x=0, color='black', linestyle='--', linewidth=2)
axes[1].axvline(x=np.mean(lift), color='red', linestyle='-', linewidth=2, label=f'Mean: {np.mean(lift):.1f}%')
axes[1].set_xlabel('Lift (%)')
axes[1].set_ylabel('Density')
axes[1].set_title('Distribution of Lift')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Revenue comparison
axes[2].hist(control_revenue, bins=30, alpha=0.5, label='Control', density=True)
axes[2].hist(treatment_revenue, bins=30, alpha=0.5, label='Treatment', density=True)
axes[2].set_xlabel('Revenue per Conversion ($)')
axes[2].set_ylabel('Density')
axes[2].set_title('Revenue Distribution per Conversion')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Business recommendation
print("\n" + "=" * 50)
print("BUSINESS RECOMMENDATION:")
if prob_treatment_better > 0.95:
    revenue_impact = treatment_visitors * 12 * (np.mean(samples_treatment) * np.mean(treatment_revenue) - 
                                                  np.mean(samples_control) * np.mean(control_revenue))
    print(f"   SHIP IT! The new checkout flow is {prob_treatment_better:.1%} likely to be better.")
    print(f"   Estimated annual revenue impact: ${revenue_impact:,.0f}")
else:
    print("   WAIT - Need more data. The difference isn't statistically significant yet.")

Part 3: Regression - Predicting Customer Lifetime Value

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Calculate customer-level metrics
customer_metrics = df.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'count'],
    'discount_used': 'mean',
    'date': ['min', 'max']
}).reset_index()

customer_metrics.columns = ['customer_id', 'total_spend', 'avg_order', 'n_orders', 
                            'discount_rate', 'first_purchase', 'last_purchase']

# Merge with customer demographics
customer_data = customer_metrics.merge(customers, on='customer_id')

# Calculate customer tenure (days since signup)
customer_data['tenure_days'] = (customer_data['last_purchase'] - customer_data['signup_date']).dt.days

# Target: total_spend (CLV proxy)
# Features: age, n_orders, avg_order, discount_rate, tenure_days, acquisition_source

feature_cols = ['age', 'n_orders', 'avg_order', 'discount_rate', 'tenure_days']
cat_cols = ['acquisition_source', 'city']

X = customer_data[feature_cols + cat_cols]
y = customer_data['total_spend']

# Preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), feature_cols),
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols)
])

# Model pipeline
model = Pipeline([
    ('prep', preprocessor),
    ('reg', Ridge(alpha=1.0))
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print("CLV Prediction Model")
print("=" * 50)
print(f"R² Score: {r2:.4f}")
print(f"RMSE: ${rmse:.2f}")
print(f"Cross-validation R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Feature importance (from Ridge coefficients)
feature_names = feature_cols + list(model.named_steps['prep'].named_transformers_['cat'].get_feature_names_out(cat_cols))
coefficients = model.named_steps['reg'].coef_

importance_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values('coefficient', key=abs, ascending=True)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Predicted vs Actual
axes[0].scatter(y_test, y_pred, alpha=0.3, s=10)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
axes[0].set_xlabel('Actual CLV ($)')
axes[0].set_ylabel('Predicted CLV ($)')
axes[0].set_title(f'Predicted vs Actual (R² = {r2:.3f})')
axes[0].grid(True, alpha=0.3)

# Residuals
residuals = y_test - y_pred
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residual ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
axes[1].grid(True, alpha=0.3)

# Feature importance
axes[2].barh(importance_df['feature'], importance_df['coefficient'])
axes[2].axvline(x=0, color='black', linewidth=0.5)
axes[2].set_xlabel('Coefficient')
axes[2].set_title('Feature Importance (Ridge Coefficients)')

plt.tight_layout()
plt.show()

# Key drivers
print("\nKey CLV Drivers:")
for _, row in importance_df.tail(5).iterrows():
    direction = "increases" if row['coefficient'] > 0 else "decreases"
    print(f"  - {row['feature']}: {direction} CLV by ${abs(row['coefficient']):.2f} per unit")

Part 4: Time Series - Sales Forecasting

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

# Aggregate to daily sales
daily_sales = df.groupby(df['date'].dt.date)['amount'].sum()
daily_sales.index = pd.to_datetime(daily_sales.index)
daily_sales = daily_sales.sort_index()

# Fill missing dates with 0
date_range = pd.date_range(daily_sales.index.min(), daily_sales.index.max())
daily_sales = daily_sales.reindex(date_range, fill_value=daily_sales.median())

print("Sales Time Series")
print("=" * 50)
print(f"Period: {daily_sales.index.min().date()} to {daily_sales.index.max().date()}")
print(f"Average daily sales: ${daily_sales.mean():,.2f}")
print(f"Std dev: ${daily_sales.std():,.2f}")

# Decomposition
decomposition = seasonal_decompose(daily_sales, period=7)  # Weekly seasonality

fig, axes = plt.subplots(4, 1, figsize=(14, 12))

axes[0].plot(daily_sales)
axes[0].set_title('Original Sales')
axes[0].set_ylabel('Sales ($)')
axes[0].grid(True, alpha=0.3)

axes[1].plot(decomposition.trend)
axes[1].set_title('Trend')
axes[1].set_ylabel('Sales ($)')
axes[1].grid(True, alpha=0.3)

axes[2].plot(decomposition.seasonal)
axes[2].set_title('Seasonal (Weekly)')
axes[2].set_ylabel('Sales ($)')
axes[2].grid(True, alpha=0.3)

axes[3].plot(decomposition.resid)
axes[3].set_title('Residual')
axes[3].set_ylabel('Sales ($)')
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# ARIMA Forecasting
train = daily_sales[:-30]
test = daily_sales[-30:]

model = ARIMA(train, order=(2, 1, 2))
fitted = model.fit()

forecast = fitted.forecast(steps=30)
forecast_ci = fitted.get_forecast(steps=30).conf_int()

# Plot forecast
plt.figure(figsize=(14, 6))
plt.plot(train[-60:], label='Historical', linewidth=2)
plt.plot(test.index, test, label='Actual', color='green', linewidth=2)
plt.plot(test.index, forecast, label='Forecast', color='red', linewidth=2, linestyle='--')
plt.fill_between(test.index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], alpha=0.2, color='red')
plt.axvline(x=train.index[-1], color='black', linestyle=':', label='Forecast Start')
plt.xlabel('Date')
plt.ylabel('Daily Sales ($)')
plt.title('30-Day Sales Forecast with 95% Confidence Interval')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Accuracy metrics
mape = np.mean(np.abs((test - forecast) / test)) * 100
rmse = np.sqrt(mean_squared_error(test, forecast))

print(f"\nForecast Accuracy (30-day):")
print(f"  RMSE: ${rmse:,.2f}")
print(f"  MAPE: {mape:.1f}%")

# Monthly projection
monthly_forecast = forecast.sum()
monthly_actual = test.sum()
print(f"\n📅 Next Month Projection:")
print(f"  Forecasted: ${monthly_forecast:,.2f}")
print(f"  Actual: ${monthly_actual:,.2f}")
print(f"  Accuracy: {(1 - abs(monthly_forecast - monthly_actual)/monthly_actual)*100:.1f}%")

Part 5: Executive Summary

print("=" * 60)
print("EXECUTIVE SUMMARY: E-COMMERCE ANALYTICS")
print("=" * 60)

print("\n1. DESCRIPTIVE INSIGHTS:")
print(f"   - Average transaction: ${df['amount'].mean():.2f} (median: ${df['amount'].median():.2f})")
print(f"   - Top revenue category: {category_sales.idxmax()}")
print(f"   - Discount usage: {df['discount_used'].mean()*100:.1f}% of transactions")
print(f"   - Highest spending age group: {age_spend.idxmax()}")

print("\n2. A/B TEST RESULTS:")
print(f"   - New checkout improves conversion by ~{expected_lift:.1f}%")
print(f"   - {prob_treatment_better:.1%} probability of being better")
print(f"   - Recommendation: {'SHIP IT' if prob_treatment_better > 0.95 else 'GATHER MORE DATA'}")

print("\n3. CLV PREDICTION MODEL:")
print(f"   - Model R-squared: {r2:.3f}")
print(f"   - Key drivers: {', '.join(importance_df.tail(3)['feature'].values)}")
print(f"   - Use for: Targeting high-value customers, personalization")

print("\n4. SALES FORECAST:")
print(f"   - Next month forecast: ${monthly_forecast:,.0f}")
print(f"   - Forecast accuracy: {(1 - mape/100)*100:.1f}%")
print(f"   - Trend: {'Upward' if decomposition.trend.iloc[-1] > decomposition.trend.iloc[-30] else 'Stable'}")

print("\n5. RECOMMENDED ACTIONS:")
print("   [+] Ship new checkout flow (strong statistical evidence)")
print("   [+] Focus marketing on 36-45 age group (highest spend)")
print("   [+] Increase Electronics category inventory (top revenue)")
print("   [+] Use CLV model for customer acquisition targeting")

print("\n" + "=" * 60)

Submission Checklist

  1. Jupyter Notebook with all analysis code
  2. Executive Summary (1 page) for non-technical stakeholders
  3. Technical Report documenting methodology and assumptions
  4. Slide Deck (5-10 slides) for presentation
  • All statistical assumptions checked and documented
  • Confidence/credible intervals provided for all estimates
  • Visualizations are clear and properly labeled
  • Business implications clearly stated
  • Limitations acknowledged
  • Well-commented and readable
  • Functions documented with docstrings
  • Reproducible (random seeds set)
  • Error handling for edge cases

Summary

ModuleWhat You Applied
Descriptive StatisticsSummarized data distributions
ProbabilityCalculated conditional probabilities
Hypothesis TestingA/B test significance testing
Bayesian StatisticsPosterior distributions for conversion
RegressionCLV prediction model
Time SeriesSales forecasting
Congratulations! You’ve completed a full statistical analysis pipeline. These skills - from exploratory analysis to predictive modeling - are exactly what data scientists do every day. You’re now ready to tackle real-world problems with confidence!