Capstone Project: E-Commerce Analytics
Project Overview
You’re a data scientist at an e-commerce company. The CEO wants answers to critical business questions:
- Descriptive: What does our sales data look like? Any patterns?
- Inferential: Is our new checkout flow actually better?
- Predictive: What will sales be next month?
- Causal: What factors drive customer lifetime value?
Estimated Time: 6-8 hours
Difficulty: Intermediate
Prerequisites: All previous modules
What You’ll Deliver: Complete analysis report with actionable insights
Part 1: Data Exploration & Descriptive Statistics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from datetime import datetime, timedelta
np.random.seed(42)
# Generate realistic e-commerce data
n_customers = 5000
n_transactions = 20000
# Customer data
customers = pd.DataFrame({
    'customer_id': range(n_customers),
    'signup_date': pd.date_range('2022-01-01', periods=n_customers, freq='2H'),
    'age': np.random.normal(35, 10, n_customers).clip(18, 70).astype(int),
    'gender': np.random.choice(['M', 'F', 'Other'], n_customers, p=[0.48, 0.48, 0.04]),
    'city': np.random.choice(['New York', 'LA', 'Chicago', 'Houston', 'Phoenix'], n_customers),
    'acquisition_source': np.random.choice(['Organic', 'Paid Ads', 'Referral', 'Email'], n_customers, p=[0.3, 0.4, 0.2, 0.1]),
})
# Transaction data (customers make 1-10 purchases)
transactions = []
for cust_id in range(n_customers):
    n_purchases = np.random.poisson(4) + 1
    for _ in range(min(n_purchases, 10)):
        transaction = {
            'customer_id': cust_id,
            'date': customers.loc[cust_id, 'signup_date'] + timedelta(days=np.random.randint(0, 365)),
            'amount': np.random.lognormal(3.5, 1) + 10,  # roughly $10-$500+ range
            'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books', 'Food'],
                                         p=[0.25, 0.30, 0.20, 0.15, 0.10]),
            'discount_used': np.random.random() < 0.3,
        }
        transactions.append(transaction)
transactions = pd.DataFrame(transactions)
transactions = transactions.sample(n_transactions).reset_index(drop=True)
print("Dataset Overview")
print("=" * 50)
print(f"Customers: {len(customers):,}")
print(f"Transactions: {len(transactions):,}")
print(f"Date range: {transactions['date'].min().date()} to {transactions['date'].max().date()}")
# Merge for analysis
df = transactions.merge(customers, on='customer_id')
print(f"\nTransaction Summary:")
print(df['amount'].describe())
Exploratory Visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
# Distribution of purchase amounts
axes[0, 0].hist(df['amount'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].axvline(df['amount'].median(), color='red', linestyle='--', label=f'Median: ${df["amount"].median():.2f}')
axes[0, 0].axvline(df['amount'].mean(), color='green', linestyle='--', label=f'Mean: ${df["amount"].mean():.2f}')
axes[0, 0].set_xlabel('Purchase Amount ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Purchase Amounts')
axes[0, 0].legend()
# Sales by category
category_sales = df.groupby('category')['amount'].sum().sort_values(ascending=True)
axes[0, 1].barh(category_sales.index, category_sales.values)
axes[0, 1].set_xlabel('Total Sales ($)')
axes[0, 1].set_title('Sales by Category')
# Daily transactions over time
daily_sales = df.groupby(df['date'].dt.date)['amount'].sum()
axes[0, 2].plot(daily_sales.index, daily_sales.values, alpha=0.7)
axes[0, 2].set_xlabel('Date')
axes[0, 2].set_ylabel('Daily Sales ($)')
axes[0, 2].set_title('Daily Sales Over Time')
axes[0, 2].tick_params(axis='x', rotation=45)
# Customer age distribution
axes[1, 0].hist(customers['age'], bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Customer Age Distribution')
# Purchases by acquisition source
source_counts = df['acquisition_source'].value_counts()
axes[1, 1].pie(source_counts.values, labels=source_counts.index, autopct='%1.1f%%')
axes[1, 1].set_title('Transactions by Acquisition Source')
# Average spend by age group
df['age_group'] = pd.cut(df['age'], bins=[18, 25, 35, 45, 55, 70], labels=['18-25', '26-35', '36-45', '46-55', '56+'], include_lowest=True)
age_spend = df.groupby('age_group')['amount'].mean()
axes[1, 2].bar(age_spend.index, age_spend.values)
axes[1, 2].set_xlabel('Age Group')
axes[1, 2].set_ylabel('Average Spend ($)')
axes[1, 2].set_title('Average Spend by Age Group')
plt.tight_layout()
plt.show()
# Key insights
print("\nKey Descriptive Insights:")
print(f" - Median purchase: ${df['amount'].median():.2f} (mean: ${df['amount'].mean():.2f}) - Right-skewed!")
print(f" - Top category: {category_sales.idxmax()} (${category_sales.max():,.0f} total)")
print(f" - {df['discount_used'].mean()*100:.1f}% of purchases used a discount")
Part 2: Inferential Statistics - A/B Test Analysis
The marketing team ran an A/B test on a new checkout flow. Let’s analyze the results.
# Simulate A/B test data
np.random.seed(42)
# Control group (old checkout)
n_control = 5000
control_visitors = n_control
control_conversions = int(n_control * 0.032) # 3.2% conversion rate
control_revenue = np.random.lognormal(3.8, 0.8, control_conversions)
# Treatment group (new checkout)
n_treatment = 5000
treatment_visitors = n_treatment
treatment_conversions = int(n_treatment * 0.038) # 3.8% conversion rate
treatment_revenue = np.random.lognormal(3.9, 0.8, treatment_conversions)
print("A/B Test: New Checkout Flow")
print("=" * 50)
print(f"Control (Old):")
print(f" Visitors: {control_visitors:,}")
print(f" Conversions: {control_conversions} ({control_conversions/control_visitors:.2%})")
print(f" Avg Revenue: ${np.mean(control_revenue):.2f}")
print(f"\nTreatment (New):")
print(f" Visitors: {treatment_visitors:,}")
print(f" Conversions: {treatment_conversions} ({treatment_conversions/treatment_visitors:.2%})")
print(f" Avg Revenue: ${np.mean(treatment_revenue):.2f}")
# Statistical tests
# 1. Chi-square test for conversion rate
contingency_table = np.array([
    [control_conversions, control_visitors - control_conversions],
    [treatment_conversions, treatment_visitors - treatment_conversions]
])
chi2, p_value_chi, _, _ = stats.chi2_contingency(contingency_table)
print(f"\nStatistical Tests:")
print(f"1. Chi-Square Test for Conversion Rate:")
print(f" χ² = {chi2:.4f}, p-value = {p_value_chi:.4f}")
print(f" {'Significant!' if p_value_chi < 0.05 else 'Not significant'} at α = 0.05")
# 2. Two-sample t-test for revenue (among converters)
t_stat, p_value_t = stats.ttest_ind(treatment_revenue, control_revenue)
print(f"\n2. T-Test for Revenue per Conversion:")
print(f" t = {t_stat:.4f}, p-value = {p_value_t:.4f}")
print(f" {'Significant!' if p_value_t < 0.05 else 'Not significant'} at α = 0.05")
# 3. Bayesian analysis
from scipy.stats import beta
# Posterior distributions
posterior_control = beta(1 + control_conversions, 1 + control_visitors - control_conversions)
posterior_treatment = beta(1 + treatment_conversions, 1 + treatment_visitors - treatment_conversions)
# Sample from posteriors
samples_control = posterior_control.rvs(100000)
samples_treatment = posterior_treatment.rvs(100000)
prob_treatment_better = np.mean(samples_treatment > samples_control)
expected_lift = np.mean((samples_treatment - samples_control) / samples_control) * 100
print(f"\n3. Bayesian Analysis:")
print(f" P(Treatment > Control) = {prob_treatment_better:.1%}")
print(f" Expected Lift = {expected_lift:.2f}%")
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Posterior distributions
x = np.linspace(0.025, 0.045, 500)
axes[0].plot(x, posterior_control.pdf(x), 'b-', linewidth=2, label='Control')
axes[0].plot(x, posterior_treatment.pdf(x), 'r-', linewidth=2, label='Treatment')
axes[0].fill_between(x, posterior_control.pdf(x), alpha=0.3, color='blue')
axes[0].fill_between(x, posterior_treatment.pdf(x), alpha=0.3, color='red')
axes[0].set_xlabel('Conversion Rate')
axes[0].set_ylabel('Probability Density')
axes[0].set_title('Posterior Distributions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Lift distribution
lift = (samples_treatment - samples_control) / samples_control * 100
axes[1].hist(lift, bins=100, density=True, alpha=0.7, edgecolor='black')
axes[1].axvline(x=0, color='black', linestyle='--', linewidth=2)
axes[1].axvline(x=np.mean(lift), color='red', linestyle='-', linewidth=2, label=f'Mean: {np.mean(lift):.1f}%')
axes[1].set_xlabel('Lift (%)')
axes[1].set_ylabel('Density')
axes[1].set_title('Distribution of Lift')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
# Revenue comparison
axes[2].hist(control_revenue, bins=30, alpha=0.5, label='Control', density=True)
axes[2].hist(treatment_revenue, bins=30, alpha=0.5, label='Treatment', density=True)
axes[2].set_xlabel('Revenue per Conversion ($)')
axes[2].set_ylabel('Density')
axes[2].set_title('Revenue Distribution per Conversion')
axes[2].legend()
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Business recommendation
print("\n" + "=" * 50)
print("BUSINESS RECOMMENDATION:")
if prob_treatment_better > 0.95:
    # Annualized impact: visitors per test period x 12 x difference in expected revenue per visitor
    # (this treats the test's 5,000 visitors as roughly one month of traffic)
    revenue_impact = treatment_visitors * 12 * (np.mean(samples_treatment) * np.mean(treatment_revenue) -
                                                np.mean(samples_control) * np.mean(control_revenue))
    print(f" SHIP IT! The new checkout flow is {prob_treatment_better:.1%} likely to be better.")
    print(f" Estimated annual revenue impact: ${revenue_impact:,.0f}")
else:
    print(" WAIT - Need more data; the evidence isn't conclusive yet.")
Part 3: Regression - Predicting Customer Lifetime Value
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
# Calculate customer-level metrics
customer_metrics = df.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'count'],
    'discount_used': 'mean',
    'date': ['min', 'max']
}).reset_index()
customer_metrics.columns = ['customer_id', 'total_spend', 'avg_order', 'n_orders',
                            'discount_rate', 'first_purchase', 'last_purchase']
# Merge with customer demographics
customer_data = customer_metrics.merge(customers, on='customer_id')
# Customer tenure: days from signup to most recent purchase
customer_data['tenure_days'] = (customer_data['last_purchase'] - customer_data['signup_date']).dt.days
# Target: total_spend (CLV proxy)
# Features: age, n_orders, avg_order, discount_rate, tenure_days, acquisition_source
feature_cols = ['age', 'n_orders', 'avg_order', 'discount_rate', 'tenure_days']
cat_cols = ['acquisition_source', 'city']
X = customer_data[feature_cols + cat_cols]
y = customer_data['total_spend']
# Preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), feature_cols),
    ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols)
])
# Model pipeline
model = Pipeline([
    ('prep', preprocessor),
    ('reg', Ridge(alpha=1.0))
])
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print("CLV Prediction Model")
print("=" * 50)
print(f"R² Score: {r2:.4f}")
print(f"RMSE: ${rmse:.2f}")
print(f"Cross-validation R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# Feature importance (from Ridge coefficients)
feature_names = feature_cols + list(model.named_steps['prep'].named_transformers_['cat'].get_feature_names_out(cat_cols))
coefficients = model.named_steps['reg'].coef_
importance_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values('coefficient', key=abs, ascending=True)
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Predicted vs Actual
axes[0].scatter(y_test, y_pred, alpha=0.3, s=10)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
axes[0].set_xlabel('Actual CLV ($)')
axes[0].set_ylabel('Predicted CLV ($)')
axes[0].set_title(f'Predicted vs Actual (R² = {r2:.3f})')
axes[0].grid(True, alpha=0.3)
# Residuals
residuals = y_test - y_pred
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residual ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
axes[1].grid(True, alpha=0.3)
# Feature importance
axes[2].barh(importance_df['feature'], importance_df['coefficient'])
axes[2].axvline(x=0, color='black', linewidth=0.5)
axes[2].set_xlabel('Coefficient')
axes[2].set_title('Feature Importance (Ridge Coefficients)')
plt.tight_layout()
plt.show()
# Key drivers
print("\nKey CLV Drivers:")
for _, row in importance_df.tail(5).iterrows():
    # Numeric features are standardized, so coefficients are per standard deviation;
    # one-hot coefficients are relative to the dropped reference category.
    direction = "higher" if row['coefficient'] > 0 else "lower"
    print(f" - {row['feature']}: associated with ${abs(row['coefficient']):.2f} {direction} CLV")
Part 4: Time Series - Sales Forecasting
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
# Aggregate to daily sales
daily_sales = df.groupby(df['date'].dt.date)['amount'].sum()
daily_sales.index = pd.to_datetime(daily_sales.index)
daily_sales = daily_sales.sort_index()
# Fill any missing dates with the median daily sales
date_range = pd.date_range(daily_sales.index.min(), daily_sales.index.max())
daily_sales = daily_sales.reindex(date_range, fill_value=daily_sales.median())
print("Sales Time Series")
print("=" * 50)
print(f"Period: {daily_sales.index.min().date()} to {daily_sales.index.max().date()}")
print(f"Average daily sales: ${daily_sales.mean():,.2f}")
print(f"Std dev: ${daily_sales.std():,.2f}")
# Decomposition
decomposition = seasonal_decompose(daily_sales, period=7) # Weekly seasonality
fig, axes = plt.subplots(4, 1, figsize=(14, 12))
axes[0].plot(daily_sales)
axes[0].set_title('Original Sales')
axes[0].set_ylabel('Sales ($)')
axes[0].grid(True, alpha=0.3)
axes[1].plot(decomposition.trend)
axes[1].set_title('Trend')
axes[1].set_ylabel('Sales ($)')
axes[1].grid(True, alpha=0.3)
axes[2].plot(decomposition.seasonal)
axes[2].set_title('Seasonal (Weekly)')
axes[2].set_ylabel('Sales ($)')
axes[2].grid(True, alpha=0.3)
axes[3].plot(decomposition.resid)
axes[3].set_title('Residual')
axes[3].set_ylabel('Sales ($)')
axes[3].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ARIMA Forecasting
train = daily_sales[:-30]
test = daily_sales[-30:]
model = ARIMA(train, order=(2, 1, 2))
fitted = model.fit()
forecast = fitted.forecast(steps=30)
forecast_ci = fitted.get_forecast(steps=30).conf_int()
# Plot forecast
plt.figure(figsize=(14, 6))
plt.plot(train[-60:], label='Historical', linewidth=2)
plt.plot(test.index, test, label='Actual', color='green', linewidth=2)
plt.plot(test.index, forecast, label='Forecast', color='red', linewidth=2, linestyle='--')
plt.fill_between(test.index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], alpha=0.2, color='red')
plt.axvline(x=train.index[-1], color='black', linestyle=':', label='Forecast Start')
plt.xlabel('Date')
plt.ylabel('Daily Sales ($)')
plt.title('30-Day Sales Forecast with 95% Confidence Interval')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Accuracy metrics
mape = np.mean(np.abs((test - forecast) / test)) * 100
rmse = np.sqrt(mean_squared_error(test, forecast))
print(f"\nForecast Accuracy (30-day):")
print(f" RMSE: ${rmse:,.2f}")
print(f" MAPE: {mape:.1f}%")
# Monthly projection
monthly_forecast = forecast.sum()
monthly_actual = test.sum()
print(f"\n📅 Next Month Projection:")
print(f" Forecasted: ${monthly_forecast:,.2f}")
print(f" Actual: ${monthly_actual:,.2f}")
print(f" Accuracy: {(1 - abs(monthly_forecast - monthly_actual)/monthly_actual)*100:.1f}%")
Part 5: Executive Summary
print("=" * 60)
print("EXECUTIVE SUMMARY: E-COMMERCE ANALYTICS")
print("=" * 60)
print("\n1. DESCRIPTIVE INSIGHTS:")
print(f" - Average transaction: ${df['amount'].mean():.2f} (median: ${df['amount'].median():.2f})")
print(f" - Top revenue category: {category_sales.idxmax()}")
print(f" - Discount usage: {df['discount_used'].mean()*100:.1f}% of transactions")
print(f" - Highest spending age group: {age_spend.idxmax()}")
print("\n2. A/B TEST RESULTS:")
print(f" - New checkout improves conversion by ~{expected_lift:.1f}%")
print(f" - {prob_treatment_better:.1%} probability of being better")
print(f" - Recommendation: {'SHIP IT' if prob_treatment_better > 0.95 else 'GATHER MORE DATA'}")
print("\n3. CLV PREDICTION MODEL:")
print(f" - Model R-squared: {r2:.3f}")
print(f" - Key drivers: {', '.join(importance_df.tail(3)['feature'].values)}")
print(f" - Use for: Targeting high-value customers, personalization")
print("\n4. SALES FORECAST:")
print(f" - Next month forecast: ${monthly_forecast:,.0f}")
print(f" - Forecast accuracy: {(1 - mape/100)*100:.1f}%")
print(f" - Trend: {'Upward' if decomposition.trend.iloc[-1] > decomposition.trend.iloc[-30] else 'Stable'}")
print("\n5. RECOMMENDED ACTIONS:")
print(" [+] Ship new checkout flow (strong statistical evidence)")
print(" [+] Focus marketing on 36-45 age group (highest spend)")
print(" [+] Increase Electronics category inventory (top revenue)")
print(" [+] Use CLV model for customer acquisition targeting")
print("\n" + "=" * 60)
Submission Checklist
Deliverables
- Jupyter Notebook with all analysis code
- Executive Summary (1 page) for non-technical stakeholders
- Technical Report documenting methodology and assumptions
- Slide Deck (5-10 slides) for presentation
Analysis Quality
- All statistical assumptions checked and documented (see the sketch after this list)
- Confidence/credible intervals provided for all estimates
- Visualizations are clear and properly labeled
- Business implications clearly stated
- Limitations acknowledged
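For the first item, here is a minimal sketch of what checking the t-test assumptions from Part 2 might look like (variable names assume that section has been run; the specific checks are illustrative, not required):
# Illustrative assumption checks for the two-sample t-test on revenue (Part 2)
from scipy import stats

# Rough normality check per group (t-tests are fairly robust at these sample sizes)
print("Shapiro-Wilk p-values:",
      stats.shapiro(control_revenue).pvalue,
      stats.shapiro(treatment_revenue).pvalue)

# Equal-variance check; if it fails, rerun with stats.ttest_ind(..., equal_var=False)
print("Levene p-value:", stats.levene(control_revenue, treatment_revenue).pvalue)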
Code Quality
- Well-commented and readable
- Functions documented with docstrings (see the example after this checklist)
- Reproducible (random seeds set)
- Error handling for edge cases
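And a minimal sketch of the last three items together: a documented, seeded, defensive helper function (the function itself is illustrative and not part of the project code):
# Illustrative example of a documented, reproducible, defensive helper function
import numpy as np
import pandas as pd

def summarize_amounts(amounts: pd.Series, seed: int = 42) -> dict:
    """Return basic summary statistics for a series of purchase amounts.

    Parameters
    ----------
    amounts : pd.Series
        Transaction amounts in dollars.
    seed : int
        Random seed, kept as a parameter so the bootstrap stays reproducible.

    Raises
    ------
    ValueError
        If the series is empty or contains negative amounts.
    """
    if amounts.empty:
        raise ValueError("amounts is empty - nothing to summarize")
    if (amounts < 0).any():
        raise ValueError("amounts contains negative values - check the data pipeline")
    rng = np.random.default_rng(seed)  # reproducible bootstrap for a rough 95% CI on the mean
    boot_means = [rng.choice(amounts, size=len(amounts), replace=True).mean() for _ in range(200)]
    return {
        "mean": amounts.mean(),
        "median": amounts.median(),
        "boot_ci_95": (np.percentile(boot_means, 2.5), np.percentile(boot_means, 97.5)),
    }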
Summary
| Module | What You Applied |
|---|---|
| Descriptive Statistics | Summarized data distributions |
| Probability | Calculated conditional probabilities |
| Hypothesis Testing | A/B test significance testing |
| Bayesian Statistics | Posterior distributions for conversion |
| Regression | CLV prediction model |
| Time Series | Sales forecasting |
Congratulations! You’ve completed a full statistical analysis pipeline. These skills, from exploratory analysis to predictive modeling, are exactly what data scientists use every day. You’re now ready to tackle real-world problems with confidence!