# Feature Engineering

> Transform raw data into features that models can learn from

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/feature-engineering-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=a1bb657fd51ab2ec1066fcdb45135ac3" alt="Feature Engineering Pipeline" width="1080" height="1080" data-path="images/courses/ml-mastery/feature-engineering-concept.svg" />
</Frame>

## Data Scientists Spend 80% of Time Here

Raw data is messy. Models need clean, meaningful numbers.

**Feature engineering** is the art of transforming raw data into features that help models learn. It's the difference between feeding a model "March 15, 1995" (a string it can't use) and feeding it "30 years old, built pre-2000, winter construction" (numbers that carry meaning).

Here's a truth that surprises beginners: a simple model with great features often beats a complex model with raw features. Feature engineering is where domain knowledge meets data science, and it's one of the highest-leverage activities in most ML projects.

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/feature-engineering-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=df18f0a0958f55a64a218d88f394ebc4" alt="E-commerce Feature Extraction" width="1080" height="1080" data-path="images/courses/ml-mastery/feature-engineering-real-world.svg" />
</Frame>

***

## The House Price Example

Raw data:

```
Address: "123 Main St, New York, NY 10001"
Built: "March 15, 1995"  
Description: "Cozy 3BR, renovated kitchen, near subway"
Price: $850,000
```

What a model needs:

```python theme={null}
{
    'bedrooms': 3,
    'city_encoded': 45,        # New York
    'year_built': 1995,
    'building_age': 30,
    'is_renovated': 1,
    'near_transit': 1,
    'zip_price_tier': 3        # Expensive area
}
```

***

## Handling Missing Values

```python theme={null}
import pandas as pd
import numpy as np

# Sample data with missing values
df = pd.DataFrame({
    'age': [25, np.nan, 35, 40, np.nan],
    'income': [50000, 60000, np.nan, 80000, 90000],
    'education': ['Bachelor', 'Master', np.nan, 'PhD', 'Bachelor']
})

print("Missing values:")
print(df.isnull().sum())
```

### Strategy 1: Drop Missing Values

```python theme={null}
# Drop rows with ANY missing values -- the nuclear option.
# Only safe when: (1) you have plenty of data, (2) missingness is random,
# and (3) the missing rows aren't systematically different from the rest.
df_clean = df.dropna()

# Safer: drop rows missing only in critical columns
df_clean = df.dropna(subset=['age'])
```

<Warning>
  Only use when you have lots of data and missingness is **truly random** (called MCAR -- Missing Completely At Random). If high-income people tend to skip the income question, dropping those rows biases your model toward low-income profiles. Check this by comparing the distributions of other features between "has missing" and "no missing" groups.
</Warning>
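One rough way to run that check is sketched below on a small made-up frame (the `survey` data and its values are illustrative assumptions, not a real dataset). It splits rows by whether `income` is missing and compares summary statistics of another feature.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing mostly for older respondents.
survey = pd.DataFrame({
    'age':    [22, 25, 31, 38, 45, 52, 58, 63],
    'income': [30000, 35000, 42000, 50000, np.nan, np.nan, 61000, np.nan],
})

# Group rows by whether income is missing, then summarize another feature.
income_missing = survey['income'].isnull()
age_by_missingness = survey.groupby(income_missing)['age'].agg(['count', 'mean', 'median'])
age_by_missingness.index = ['income present', 'income missing']  # False sorts before True
print(age_by_missingness)

# A large gap suggests the data is NOT missing completely at random,
# so dropping those rows would skew the age distribution.
age_gap = (
    age_by_missingness.loc['income missing', 'mean']
    - age_by_missingness.loc['income present', 'mean']
)
print(f"Mean age gap: {age_gap:.1f} years")
```

On real data a statistical test or a plot of the two distributions would be more convincing than a difference of means, but the grouping step stays the same.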

### Strategy 2: Imputation

```python theme={null}
from sklearn.impute import SimpleImputer

# Numeric: fill with mean, median, or most frequent value
numeric_imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
# fit_transform returns a 2-D array of shape (n_rows, 1); ravel() flattens it to one column
df['age'] = numeric_imputer.fit_transform(df[['age']]).ravel()

# Categorical: fill with mode, or use strategy='constant', fill_value='Unknown'
categorical_imputer = SimpleImputer(strategy='most_frequent')
df['education'] = categorical_imputer.fit_transform(df[['education']]).ravel()
```

### Strategy 3: Indicator Variables

```python theme={null}
# Create a flag for missing values (can be informative!)
# Why? Because "missing" is often not random -- it carries signal.
# For example, if income is missing, the person might have refused
# to share it, which itself correlates with certain behaviors.
# Order matters: record the flag first, then fill the gaps.
df['age_missing'] = df['age'].isnull().astype(int)
df['age'] = df['age'].fillna(df['age'].median())
```

<Tip>
  **Which imputation strategy should you use?** Use **median** for numeric features with outliers (median is robust to extreme values). Use **mean** for normally distributed features. Use **mode** (most frequent) for categorical features. And always create a missingness indicator -- it's free information and tree-based models will use it if it's predictive.
</Tip>
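As a small sketch of that advice, the snippet below imputes with the median and creates the missingness flag in one step through `SimpleImputer`'s `add_indicator` option. The toy `ages` column and the output column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one extreme value and two gaps (illustrative numbers).
ages = pd.DataFrame({'age': [25.0, np.nan, 35.0, 40.0, np.nan, 95.0]})

# The outlier (95) pulls the mean up; the median stays near the bulk of the data.
print(f"mean:   {ages['age'].mean():.2f}")
print(f"median: {ages['age'].median():.2f}")

# add_indicator=True appends one 0/1 column per feature that had missing values.
imputer = SimpleImputer(strategy='median', add_indicator=True)
imputed = imputer.fit_transform(ages)

result = pd.DataFrame(imputed, columns=['age', 'age_missing'])
print(result)
```

Here the two gaps are filled with 37.5 (the median of the observed ages) rather than 48.75 (the mean), and `age_missing` marks which rows were filled.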

***

## Encoding Categorical Variables

### Label Encoding (for ordinal categories)

Use this when categories have a natural order -- like education levels, satisfaction ratings, or T-shirt sizes. The numbers you assign should reflect the ranking.

```python theme={null}
# Ordinal: order matters -- PhD > Master > Bachelor > High School
# The numeric values (0,1,2,3) encode this ranking, which means
# the model can learn "higher education = different outcome."
education_order = ['High School', 'Bachelor', 'Master', 'PhD']

# Build the mapping from the ordered list so the two can't drift apart.
education_rank = {level: rank for rank, level in enumerate(education_order)}
df['education_encoded'] = df['education'].map(education_rank)

# Common mistake: using label encoding for nominal categories (like color).
# The model would think red=0 < blue=1 < green=2, which is meaningless.
```
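For pipelines, scikit-learn's `OrdinalEncoder` can carry the same ranking. The sketch below (on a made-up `education` frame) contrasts it with `LabelEncoder`, which sorts classes alphabetically and is intended for target labels rather than input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

education = pd.DataFrame({'education': ['Bachelor', 'PhD', 'High School', 'Master']})

# LabelEncoder sorts classes alphabetically, so the codes ignore the real ranking.
alphabetical = LabelEncoder().fit_transform(education['education'])
print("LabelEncoder:  ", dict(zip(education['education'], alphabetical)))

# OrdinalEncoder accepts the intended order explicitly, one list per column.
order = ['High School', 'Bachelor', 'Master', 'PhD']
ordinal = OrdinalEncoder(categories=[order])
ranked = ordinal.fit_transform(education[['education']]).ravel().astype(int)
print("OrdinalEncoder:", dict(zip(education['education'], ranked)))
```

With alphabetical codes, 'Bachelor' (0) ranks below 'High School' (1), which is the opposite of the intended order.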

### One-Hot Encoding (for nominal categories)

Use this when categories have no natural order -- like colors, countries, or product types. Each category becomes its own binary column.

```python theme={null}
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Using pandas -- simpler for exploration
# (assumes df has a nominal 'color' column)
df_encoded = pd.get_dummies(df, columns=['color'], prefix='color')

# Using sklearn -- better for pipelines (remembers categories from training)
# sparse_output requires scikit-learn >= 1.2
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# handle_unknown='ignore' is critical: if test data has a category
# the encoder never saw during training, it won't crash -- it just
# sets all one-hot columns for that row to 0.
encoded = encoder.fit_transform(df[['color']])
```

**Before**:

```
id  color
1   red
2   blue
3   green
```

**After**:

```
id  color_red  color_blue  color_green
1   1          0           0
2   0          1           0
3   0          0           1
```

### Target Encoding (for high-cardinality categories)

When a categorical feature has hundreds or thousands of unique values (like zip codes or product IDs), one-hot encoding creates an explosion of columns. Target encoding replaces each category with the average target value for that category -- essentially asking "what's the typical outcome for this group?"

```python theme={null}
# Replace category with mean target value.
# For a city like "San Francisco," this becomes the average house price
# in San Francisco -- a single number that captures location value.
city_means = df.groupby('city')['price'].mean()
df['city_encoded'] = df['city'].map(city_means)
```

<Note>
  **Data leakage warning**: Target encoding uses the target variable to create features, which can leak the label into the inputs and inflate validation scores. Always compute means on training data only, and consider using smoothed target encoding (blending category mean with global mean) to reduce overfitting on rare categories. Libraries like `category_encoders` handle this correctly.
</Note>
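A minimal sketch of smoothed target encoding fitted on training rows only follows. The helper `fit_smoothed_target_encoding`, the `smoothing` weight of 5, and the toy data are all assumptions made for illustration; this is not the exact formula any particular library uses.

```python
import pandas as pd

# Hypothetical training and test data.
train = pd.DataFrame({
    'city':  ['A', 'A', 'A', 'A', 'B', 'C'],
    'price': [100.0, 120.0, 110.0, 130.0, 300.0, 50.0],
})
test = pd.DataFrame({'city': ['A', 'B', 'D']})  # 'D' never appears in training

def fit_smoothed_target_encoding(df, column, target, smoothing=5.0):
    """Blend each category's mean with the global mean, weighted by category size."""
    global_mean = df[target].mean()
    stats = df.groupby(column)[target].agg(['mean', 'count'])
    smoothed = (
        (stats['count'] * stats['mean'] + smoothing * global_mean)
        / (stats['count'] + smoothing)
    )
    return smoothed, global_mean

# Fit on training rows only, then apply the same mapping to both splits.
encoding, global_mean = fit_smoothed_target_encoding(train, 'city', 'price')
train['city_encoded'] = train['city'].map(encoding)
# Categories unseen in training fall back to the global mean.
test['city_encoded'] = test['city'].map(encoding).fillna(global_mean)

print(encoding.round(2))
print(test)
```

City B has a single training row at 300, but its encoding is pulled toward the global mean of 135 and lands at 162.5. Each training row's own target still contributes to its encoding here, so computing encodings out-of-fold is a stricter option.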

***

## Scaling Numerical Features

### Why Scale?

Many algorithms (SVM, KNN, neural networks) are sensitive to scale:

* Age: 0-100
* Income: 0-1,000,000

Without scaling, income would dominate!
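To see the effect on a distance-based method, the sketch below compares Euclidean distances before and after standardizing. The five people and their values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: [age, income]
people = np.array([
    [25.0, 50_000.0],   # person 0
    [65.0, 51_000.0],   # person 1: 40 years older, nearly the same income
    [26.0, 55_000.0],   # person 2: same age group, slightly higher income
    [45.0, 90_000.0],
    [35.0, 30_000.0],
])

def distance(a, b):
    return float(np.linalg.norm(a - b))

# Raw units: a 5,000 income gap outweighs a 40-year age gap.
raw_01 = distance(people[0], people[1])
raw_02 = distance(people[0], people[2])
print(f"raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")

# After standardizing, both features contribute on a comparable scale.
scaled = StandardScaler().fit_transform(people)
scaled_01 = distance(scaled[0], scaled[1])
scaled_02 = distance(scaled[0], scaled[2])
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

In raw units, person 1 looks like the closer neighbor of person 0; after scaling, person 2 does. A KNN model would make different predictions depending on which version it sees.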

### StandardScaler (Z-score normalization)

$x_{scaled} = \frac{x - \mu}{\sigma}$

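Centers each feature at 0 and scales it to unit variance. This is the most common choice for algorithms that are sensitive to feature scale (regularized logistic regression, SVM, neural networks). It shifts and rescales each feature but does not change the shape of its distribution, so skewed features stay skewed.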

```python theme={null}
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # in practice: fit on X_train only, then transform train and test

# Result: mean=0, std=1
print(f"Mean: {X_scaled.mean():.4f}")
print(f"Std:  {X_scaled.std():.4f}")
```

### MinMaxScaler (0-1 normalization)

$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$

Maps every feature to \[0, 1]. Best when you need bounded values (e.g., neural network inputs, or when features are already uniformly distributed). Sensitive to outliers -- one extreme value can squash everything else into a narrow range.

```python theme={null}
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Result: values between 0 and 1
print(f"Min: {X_scaled.min():.4f}")
print(f"Max: {X_scaled.max():.4f}")
```

### RobustScaler (for outliers)

Uses the median and interquartile range (IQR) instead of the mean and standard deviation. If your data has outliers that you don't want to remove, this is your best bet -- the median and IQR are barely affected by a few extreme values.

```python theme={null}
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
```
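The sketch below compares how the three scalers treat the same column when it contains one extreme value. The income figures are made up; the point is how much spread the nine ordinary values keep after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Nine ordinary incomes plus one extreme value (made-up numbers).
incomes = np.array(
    [30_000, 35_000, 40_000, 45_000, 50_000,
     55_000, 60_000, 65_000, 70_000, 5_000_000],
    dtype=float,
).reshape(-1, 1)

spreads = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(incomes).ravel()
    ordinary = scaled[:-1]  # everything except the outlier
    spreads[type(scaler).__name__] = ordinary.max() - ordinary.min()

for name, spread in spreads.items():
    print(f"{name:15s} spread of ordinary values: {spread:.4f}")
```

With `MinMaxScaler`, the nine ordinary values are squeezed into less than 1% of the [0, 1] range, while `RobustScaler` keeps them spread over more than one IQR unit.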

<Note>
  **Quick decision guide for scaling:**

  | Scaler             | Use When                                                      | Avoid When                                    |
  | ------------------ | ------------------------------------------------------------- | --------------------------------------------- |
  | **StandardScaler** | Default choice; features are roughly Gaussian                 | Features have many outliers                   |
  | **MinMaxScaler**   | You need bounded \[0,1] values; data is uniformly distributed | Outliers exist (they'll distort the range)    |
  | **RobustScaler**   | Outliers are present but meaningful                           | Your data is already clean                    |
  | **No scaling**     | Using tree-based models (Random Forest, XGBoost)              | Using distance-based or gradient-based models |
</Note>

***

## Creating New Features

### Mathematical Transformations

```python theme={null}
# For skewed distributions (right-skewed data like income, prices)
# Log transform compresses the long tail, making the distribution
# more symmetric. This often helps linear models, whose fit can be
# dominated by a handful of very large values.
df['income_log'] = np.log1p(df['income'])  # log(1+x) to handle zeros safely

# Square root -- a milder compression than log, good for count data
df['rooms_sqrt'] = np.sqrt(df['rooms'])

# Polynomial features -- captures non-linear relationships.
# If the relationship between age and target is U-shaped (young and old
# are both high-risk), a linear model can't capture this with age alone.
# Adding age^2 gives it the curvature it needs.
df['age_squared'] = df['age'] ** 2
```
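A quick way to check whether a log transform is doing its job is to compare skewness before and after. The sketch below uses simulated log-normal incomes; the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed incomes (log-normal draw; parameters are arbitrary).
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=5_000), name='income')

income_log = np.log1p(income)

# Skewness near 0 means roughly symmetric; large positive means a long right tail.
print(f"skew before: {income.skew():.2f}")
print(f"skew after:  {income_log.skew():.2f}")

# np.expm1 inverts log1p, so predictions made on the log scale can be mapped back.
restored = np.expm1(income_log)
print("round trip ok:", np.allclose(restored, income))
```

If the skewness after the transform is still far from zero, a different transform (or none) may suit the feature better.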

### Interaction Features

These capture relationships between features that the model might not discover on its own. They encode domain knowledge: "the *combination* of these two things matters, not just each one individually."

```python theme={null}
# Combine features -- each ratio tells a specific story
df['price_per_sqft'] = df['price'] / df['sqft']           # Property value density
df['income_per_person'] = df['household_income'] / df['household_size']  # Individual buying power
df['age_income_ratio'] = df['age'] / df['income']          # Career progression proxy
```

### Date Features

```python theme={null}
df['date'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter

# Time since event
df['days_since_purchase'] = (pd.Timestamp.now() - df['date']).dt.days
```
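Integer components such as `month` treat December (12) and January (1) as far apart even though they are adjacent on the calendar. A common remedy is cyclical encoding with sine and cosine, sketched here on three made-up dates; the helper `circle_distance` exists only for this illustration.

```python
import numpy as np
import pandas as pd

dates = pd.DataFrame({'date': pd.to_datetime(['2024-01-15', '2024-06-15', '2024-12-15'])})
dates['month'] = dates['date'].dt.month

# Map the month onto a circle so December and January end up adjacent.
angle = 2 * np.pi * dates['month'] / 12
dates['month_sin'] = np.sin(angle)
dates['month_cos'] = np.cos(angle)

def circle_distance(row_a, row_b):
    """Euclidean distance between two rows in (sin, cos) space."""
    a = dates.loc[row_a, ['month_sin', 'month_cos']].to_numpy(dtype=float)
    b = dates.loc[row_b, ['month_sin', 'month_cos']].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))

print(f"Jan vs Dec (raw month gap = 11): {circle_distance(0, 2):.3f}")
print(f"Jan vs Jun (raw month gap = 5):  {circle_distance(0, 1):.3f}")
```

In the encoded space January sits much closer to December than to June, which matches the calendar. The same pattern applies to hour of day (divide by 24) and day of week (divide by 7).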

### Text Features

```python theme={null}
# Length
df['description_length'] = df['description'].str.len()
df['word_count'] = df['description'].str.split().str.len()

# Contains specific words
# na=False treats missing descriptions as "no match"; without it, NaN breaks astype(int)
df['has_discount'] = df['description'].str.contains('discount|sale|offer', case=False, na=False).astype(int)
```

***

## Binning Continuous Variables

```python theme={null}
# Age groups
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 50, 65, 100],
    labels=['youth', 'young_adult', 'middle_age', 'senior', 'elderly']
)

# Equal-frequency binning (quantiles)
df['income_quantile'] = pd.qcut(
    df['income'],
    q=4,
    labels=['low', 'medium', 'high', 'very_high']
)
```

***

## Handling Outliers

```python theme={null}
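import numpy as np

def detect_outliers_iqr(data):
    """Flag values more than 1.5 * IQR outside the quartiles (Tukey's rule)."""
    # nanpercentile ignores missing values; np.percentile would return NaN
    Q1 = np.nanpercentile(data, 25)
    Q3 = np.nanpercentile(data, 75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return (data < lower) | (data > upper)

# Cap outliers (winsorization)
def cap_outliers(data, lower_percentile=1, upper_percentile=99):
    """Clip values to the given lower and upper percentiles."""
    lower = np.nanpercentile(data, lower_percentile)
    upper = np.nanpercentile(data, upper_percentile)
    return np.clip(data, lower, upper)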

df['income_capped'] = cap_outliers(df['income'])
```
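As a usage sketch, the snippet below applies the IQR rule to a small made-up income series and then caps at the IQR fences. Capping at the fences is a variant chosen for this example; the function above caps at percentiles instead.

```python
import pandas as pd

incomes = pd.Series([42_000, 48_000, 51_000, 55_000, 58_000, 61_000, 64_000, 900_000])

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = incomes.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (incomes < lower) | (incomes > upper)

print(f"bounds: [{lower:,.0f}, {upper:,.0f}]")
print("flagged:", incomes[is_outlier].tolist())

# Capping keeps the row but limits its influence.
capped = incomes.clip(lower=lower, upper=upper)
print("max before:", incomes.max(), "max after:", capped.max())
```

Only the 900,000 value falls outside the fences of 33,000 and 79,000, and after capping it becomes 79,000.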

***

## Feature Selection

More features is not always better. Think of it like packing for a trip: bringing everything "just in case" makes your suitcase impossibly heavy and you can never find what you need. Feature selection is choosing to pack only what you'll actually wear. Irrelevant features add noise, slow training, and can even hurt accuracy by diluting the signal.

### Correlation Analysis

```python theme={null}
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix -- the first thing to check.
# Look for: (1) features highly correlated with target (good!)
# and (2) features highly correlated with each other (redundant -- drop one).
# numeric_only=True skips string columns, which would otherwise raise in pandas 2.x.
corr = df.corr(numeric_only=True)

# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()

# Drop highly correlated features
def drop_correlated_features(df, threshold=0.9):
    corr_matrix = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    return df.drop(columns=to_drop)
```

### Univariate Selection (SelectKBest)

```python theme={null}
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier

# Select top k features by ANOVA F-score.
# f_classif tests whether each feature's mean differs across classes.
# It's fast but only catches linear relationships -- a feature with
# a U-shaped relationship to the target might score low.
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected features:", list(selected_features))
```

### Recursive Feature Elimination

RFE works like a talent show elimination: train a model, eliminate the weakest feature, retrain, repeat. It's slower but catches feature interactions that univariate tests miss.

```python theme={null}
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

model = RandomForestClassifier(n_estimators=50, random_state=42)
rfe = RFE(model, n_features_to_select=10)
rfe.fit(X, y)

# Get rankings
for name, rank in zip(X.columns, rfe.ranking_):
    print(f"{name}: Rank {rank}")
```
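The elimination loop itself is short. The sketch below reimplements the idea by hand on synthetic data to show what happens at each round. It is an illustration of the procedure, not scikit-learn's internal implementation, and the dataset settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 8 features, 3 informative (shuffle=False puts those in columns 0-2).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
remaining = list(range(X.shape[1]))
eliminated = []

# Repeat: fit, find the least important remaining feature, drop it.
while len(remaining) > 3:
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X[:, remaining], y)
    weakest = remaining[int(np.argmin(model.feature_importances_))]
    remaining.remove(weakest)
    eliminated.append(weakest)

print("eliminated (in order):", eliminated)
print("kept:", remaining)
```

Because importances are recomputed after every removal, a feature that looked weak next to a correlated partner can rise in rank once that partner is gone. This is the main difference from ranking all features once.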

***

## Feature Engineering Pipeline

```python theme={null}
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'occupation', 'city']

# Create transformers
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline with model -- this is the professional way to build ML systems.
# The pipeline guarantees that preprocessing steps are applied identically
# during training and prediction, eliminating a whole class of production bugs.
from sklearn.ensemble import RandomForestClassifier

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Use it -- raw data goes in, predictions come out.
# Cross-validation, grid search, and joblib.dump all work
# seamlessly with pipelines because they're a single object.
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)
```
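Because the pipeline is a single estimator, it can be passed straight to `cross_val_score`. The sketch below does that on a small synthetic frame (the column names, values, and the rule generating `y` are made up) with a trimmed-down version of the preprocessing above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic frame with gaps (illustrative only).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'age': rng.integers(18, 70, n).astype(float),
    'income': rng.normal(60_000, 15_000, n),
    'education': rng.choice(['Bachelor', 'Master', 'PhD'], n),
})
X.loc[rng.choice(n, 20, replace=False), 'age'] = np.nan
y = (X['income'] > 60_000).astype(int)

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['education']),
])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each fold refits the imputer, scaler, and encoder on that fold's training rows only.
scores = cross_val_score(pipeline, X, y, cv=5)
print("fold accuracies:", scores.round(3))
```

Preprocessing the whole frame first and cross-validating only the classifier would let each fold's held-out rows influence the imputation medians and scaling statistics, which is the leakage the pipeline avoids.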

***

## Common Mistakes

<CardGroup cols={2}>
  <Card title="Data Leakage" icon="droplet">
    **Problem**: Using test data info during training

    **Fix**: Always fit transformers on train data only

    ```python theme={null}
    # Wrong
    scaler.fit(X)  # Uses all data

    # Right
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    ```
  </Card>

  <Card title="Scaling Before Split" icon="arrows-split-up-and-left">
    **Problem**: Scaling before train-test split

    **Fix**: Split first, then scale

    ```python theme={null}
    # Right order:
    # 1. Train-test split
    # 2. Fit scaler on train
    # 3. Transform both
    ```
  </Card>
</CardGroup>

***

## 🚀 Mini Projects

<CardGroup cols={2}>
  <Card title="Project 1: E-commerce Feature Engineer" icon="cart-shopping">
    Transform raw transaction data into predictive features
  </Card>

  <Card title="Project 2: Date-Time Feature Factory" icon="calendar">
    Extract powerful temporal features from timestamps
  </Card>

  <Card title="Project 3: Text Feature Extractor" icon="file-lines">
    Convert text data into numerical features
  </Card>

  <Card title="Project 4: Automated Feature Pipeline" icon="gears">
    Build an end-to-end feature engineering pipeline
  </Card>
</CardGroup>

### Project 1: E-commerce Feature Engineer

Transform raw e-commerce transaction data into features that predict customer churn.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import pandas as pd
  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import classification_report

  # Step 1: Create sample e-commerce data
  np.random.seed(42)
  n_customers = 1000

  data = {
      'customer_id': range(1, n_customers + 1),
      'total_orders': np.random.poisson(5, n_customers),
      'total_spent': np.random.exponential(500, n_customers),
      'days_since_first_order': np.random.randint(30, 730, n_customers),
      'days_since_last_order': np.random.randint(1, 180, n_customers),
      'avg_order_value': np.random.exponential(100, n_customers),
      'num_returns': np.random.poisson(0.5, n_customers),
      'num_complaints': np.random.poisson(0.2, n_customers),
      'email_opens': np.random.poisson(10, n_customers),
      'loyalty_points': np.random.exponential(200, n_customers)
  }
  df = pd.DataFrame(data)

  # Simulate churn (customers with recent inactivity or complaints more likely to churn)
  churn_prob = (
      0.3 * (df['days_since_last_order'] > 60).astype(int) +
      0.2 * (df['num_complaints'] > 0).astype(int) +
      0.1 * (df['num_returns'] > 2).astype(int) +
      0.1 * (df['total_orders'] < 3).astype(int)
  )
  df['churned'] = (np.random.random(n_customers) < churn_prob).astype(int)

  print("Raw data shape:", df.shape)
  print(df.head())

  # Step 2: Feature engineering
  def engineer_features(df):
      """Create meaningful features from raw data"""
      features = pd.DataFrame()
      
      # Recency features
      features['recency'] = df['days_since_last_order']
      features['customer_age'] = df['days_since_first_order']
      features['recency_ratio'] = df['days_since_last_order'] / (df['days_since_first_order'] + 1)
      
      # Frequency features
      features['order_frequency'] = df['total_orders'] / (df['days_since_first_order'] / 30 + 1)
      features['is_frequent_buyer'] = (features['order_frequency'] > 1).astype(int)
      
      # Monetary features
      features['total_spent'] = df['total_spent']
      features['avg_order_value'] = df['avg_order_value']
      features['value_per_day'] = df['total_spent'] / (df['days_since_first_order'] + 1)
      
      # Customer lifetime value estimate
      features['estimated_clv'] = (
          features['order_frequency'] * 
          df['avg_order_value'] * 
          12  # Annualized
      )
      
      # Engagement features
      features['return_rate'] = df['num_returns'] / (df['total_orders'] + 1)
      features['complaint_rate'] = df['num_complaints'] / (df['total_orders'] + 1)
      # customer_age lives in `features`, not in the raw df
      features['email_engagement'] = df['email_opens'] / (features['customer_age'] / 7 + 1)
      
      # Loyalty features
      features['points_per_order'] = df['loyalty_points'] / (df['total_orders'] + 1)
      features['points_per_dollar'] = df['loyalty_points'] / (df['total_spent'] + 1)
      
      # Risk indicators
      features['is_at_risk'] = (
          (df['days_since_last_order'] > 60) | 
          (df['num_complaints'] > 1)
      ).astype(int)
      
      # Segment features (binning)
      features['spending_tier'] = pd.cut(
          df['total_spent'], 
          bins=[0, 200, 500, 1000, np.inf],
          labels=[0, 1, 2, 3]
      ).astype(int)
      
      return features

  # Step 3: Apply feature engineering
  X = engineer_features(df)
  y = df['churned']

  print(f"\nEngineered features: {X.shape[1]}")
  print("Features:", list(X.columns))

  # Step 4: Train model and evaluate
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  model = RandomForestClassifier(n_estimators=100, random_state=42)
  model.fit(X_train, y_train)

  print("\n📊 Model Performance:")
  print(classification_report(y_test, model.predict(X_test)))

  # Step 5: Feature importance
  importance = pd.DataFrame({
      'feature': X.columns,
      'importance': model.feature_importances_
  }).sort_values('importance', ascending=False)

  print("\n🏆 Top 10 Most Important Features:")
  for i, row in importance.head(10).iterrows():
      print(f"  {row['feature']:25s}: {row['importance']:.4f}")
  ```

  **What you learned:**

  * RFM (Recency, Frequency, Monetary) features are powerful for churn prediction
  * Ratio features often outperform raw counts
  * Feature engineering can dramatically improve model performance
</details>

### Project 2: Date-Time Feature Factory

Extract powerful temporal features from timestamp data.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import pandas as pd
  import numpy as np
  from sklearn.ensemble import RandomForestRegressor
  from sklearn.model_selection import train_test_split
  from sklearn.metrics import mean_absolute_error

  # Step 1: Create sample time-series data (sales data)
  np.random.seed(42)
  dates = pd.date_range('2022-01-01', '2023-12-31', freq='h')  # pandas < 2.2 spells this alias 'H'
  n = len(dates)

  # Create base sales with patterns
  base_sales = 100
  hourly_pattern = np.sin(np.arange(n) * 2 * np.pi / 24) * 20           # repeats every 24 hours
  weekly_pattern = np.sin(np.arange(n) * 2 * np.pi / (24*7)) * 30       # repeats every 7 days
  seasonal_pattern = np.sin(np.arange(n) * 2 * np.pi / (24*365)) * 50   # repeats every 365 days
  noise = np.random.normal(0, 10, n)

  sales = base_sales + hourly_pattern + weekly_pattern + seasonal_pattern + noise
  sales = np.maximum(sales, 0)  # No negative sales

  df = pd.DataFrame({'timestamp': dates, 'sales': sales})
  print("Raw data:")
  print(df.head(10))

  # Step 2: Extract datetime features
  def extract_datetime_features(df, timestamp_col='timestamp'):
      """Extract comprehensive datetime features"""
      ts = df[timestamp_col]
      features = pd.DataFrame()
      
      # Basic time components
      features['hour'] = ts.dt.hour
      features['day'] = ts.dt.day
      features['month'] = ts.dt.month
      features['year'] = ts.dt.year
      features['dayofweek'] = ts.dt.dayofweek
      features['dayofyear'] = ts.dt.dayofyear
      features['weekofyear'] = ts.dt.isocalendar().week.astype(int)
      features['quarter'] = ts.dt.quarter
      
      # Cyclical encoding (important for time!)
      features['hour_sin'] = np.sin(2 * np.pi * features['hour'] / 24)
      features['hour_cos'] = np.cos(2 * np.pi * features['hour'] / 24)
      features['day_sin'] = np.sin(2 * np.pi * features['dayofweek'] / 7)
      features['day_cos'] = np.cos(2 * np.pi * features['dayofweek'] / 7)
      features['month_sin'] = np.sin(2 * np.pi * features['month'] / 12)
      features['month_cos'] = np.cos(2 * np.pi * features['month'] / 12)
      
      # Boolean features
      features['is_weekend'] = (features['dayofweek'] >= 5).astype(int)
      features['is_month_start'] = ts.dt.is_month_start.astype(int)
      features['is_month_end'] = ts.dt.is_month_end.astype(int)
      features['is_quarter_start'] = ts.dt.is_quarter_start.astype(int)
      features['is_quarter_end'] = ts.dt.is_quarter_end.astype(int)
      
      # Time of day categories
      features['is_morning'] = ((features['hour'] >= 6) & (features['hour'] < 12)).astype(int)
      features['is_afternoon'] = ((features['hour'] >= 12) & (features['hour'] < 18)).astype(int)
      features['is_evening'] = ((features['hour'] >= 18) & (features['hour'] < 22)).astype(int)
      features['is_night'] = ((features['hour'] >= 22) | (features['hour'] < 6)).astype(int)
      
      # Business hours
      features['is_business_hours'] = (
          (features['hour'] >= 9) & 
          (features['hour'] < 17) & 
          (features['is_weekend'] == 0)
      ).astype(int)
      
      # Lag features (previous values)
      features['sales_lag_1h'] = df['sales'].shift(1)
      features['sales_lag_24h'] = df['sales'].shift(24)
      features['sales_lag_168h'] = df['sales'].shift(168)  # 1 week
      
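      # Rolling statistics over past values only.
      # shift(1) excludes the current hour, whose sales value is the prediction target;
      # without it the window would leak the target into the features.
      past_sales = df['sales'].shift(1)
      features['sales_rolling_mean_24h'] = past_sales.rolling(24).mean()
      features['sales_rolling_std_24h'] = past_sales.rolling(24).std()
      features['sales_rolling_mean_168h'] = past_sales.rolling(168).mean()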
      
      return features

  # Step 3: Apply feature extraction
  features = extract_datetime_features(df)
  features = features.dropna()  # Remove rows with NaN from lag features

  print(f"\nExtracted {features.shape[1]} datetime features")
  print("Features:", list(features.columns))

  # Step 4: Train model to predict sales
  y = df['sales'].loc[features.index]  # align by index label after dropna
  X_train, X_test, y_train, y_test = train_test_split(
      features, y, test_size=0.2, shuffle=False  # Time series: no shuffle
  )

  model = RandomForestRegressor(n_estimators=100, random_state=42)
  model.fit(X_train, y_train)

  y_pred = model.predict(X_test)
  mae = mean_absolute_error(y_test, y_pred)
  print(f"\nModel MAE: {mae:.2f}")

  # Step 5: Feature importance
  importance = pd.DataFrame({
      'feature': features.columns,
      'importance': model.feature_importances_
  }).sort_values('importance', ascending=False)

  print("\n🏆 Top 10 Most Important Time Features:")
  for i, row in importance.head(10).iterrows():
      print(f"  {row['feature']:25s}: {row['importance']:.4f}")

  # Compare with baseline (just using hour)
  baseline = RandomForestRegressor(n_estimators=100, random_state=42)
  baseline.fit(X_train[['hour']], y_train)
  baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test[['hour']]))
  print(f"\n📈 Improvement over baseline (hour only):")
  print(f"   Baseline MAE: {baseline_mae:.2f}")
  print(f"   Full features MAE: {mae:.2f}")
  print(f"   Improvement: {(baseline_mae - mae) / baseline_mae * 100:.1f}%")
  ```

  **What you learned:**

  * Cyclical encoding prevents the model from seeing hour 23 and hour 0 as far apart
  * Lag features capture temporal dependencies
  * Rolling statistics smooth out noise and capture trends
</details>

### Project 3: Text Feature Extractor

Convert text data into numerical features for machine learning.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import pandas as pd
  import numpy as np
  from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
  from sklearn.model_selection import train_test_split
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import classification_report
  import re

  # Step 1: Create sample text data (product reviews)
  reviews = [
      ("This product is amazing! Works perfectly.", 1),
      ("Terrible quality, broke after one day.", 0),
      ("Great value for money, highly recommend!", 1),
      ("Waste of money, very disappointed.", 0),
      ("Excellent product, exceeded expectations!", 1),
      ("Poor customer service, will not buy again.", 0),
      ("Love it! Best purchase ever.", 1),
      ("Defective item, had to return it.", 0),
      ("Fantastic quality, worth every penny.", 1),
      ("Cheap materials, not worth it.", 0),
      ("Perfect gift, arrived quickly!", 1),
      ("Scam product, nothing like the picture.", 0),
      ("Super happy with this purchase!", 1),
      ("Horrible experience, avoid this seller.", 0),
      ("Outstanding product, will buy again.", 1),
      ("Complete garbage, want my money back.", 0),
  ]

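  # Expand dataset. Note: the train/test split below will contain copies of the
  # same 16 sentences on both sides, so the reported accuracy is optimistic.
  np.random.seed(42)  # make the shuffle reproducible
  expanded_reviews = reviews * 20
  np.random.shuffle(expanded_reviews)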

  df = pd.DataFrame(expanded_reviews, columns=['review', 'sentiment'])
  print(f"Dataset size: {len(df)}")

  # Step 2: Text preprocessing
  def preprocess_text(text):
      """Clean and preprocess text"""
      # Lowercase
      text = text.lower()
      # Remove special characters
      text = re.sub(r'[^a-z\s]', '', text)
      # Remove extra whitespace
      text = ' '.join(text.split())
      return text

  df['clean_review'] = df['review'].apply(preprocess_text)

  # Step 3: Extract text features
  def extract_text_features(texts):
      """Extract multiple types of text features"""
      features = pd.DataFrame()
      
      # Basic statistics
      features['char_count'] = [len(t) for t in texts]
      features['word_count'] = [len(t.split()) for t in texts]
      features['avg_word_length'] = [
          np.mean([len(w) for w in t.split()]) if t.split() else 0 
          for t in texts
      ]
      features['unique_words'] = [len(set(t.split())) for t in texts]
      features['unique_ratio'] = features['unique_words'] / (features['word_count'] + 1)
      
      # Punctuation features (from original text).
      # Note: this reads the global df, so `texts` must be row-aligned with df['review'].
      original = df['review'].values
      features['exclamation_count'] = [t.count('!') for t in original]
      features['question_count'] = [t.count('?') for t in original]
      features['caps_ratio'] = [
          sum(1 for c in t if c.isupper()) / (len(t) + 1) 
          for t in original
      ]
      
      # Sentiment lexicon features
      positive_words = {'great', 'amazing', 'excellent', 'love', 'perfect', 
                       'best', 'fantastic', 'outstanding', 'happy', 'recommend'}
      negative_words = {'terrible', 'poor', 'waste', 'horrible', 'defective',
                       'scam', 'garbage', 'disappointed', 'worst', 'avoid'}
      
      features['positive_words'] = [
          sum(1 for w in t.split() if w in positive_words) for t in texts
      ]
      features['negative_words'] = [
          sum(1 for w in t.split() if w in negative_words) for t in texts
      ]
      features['sentiment_score'] = features['positive_words'] - features['negative_words']
      
      return features

  # Step 4: Combine manual features with TF-IDF
  manual_features = extract_text_features(df['clean_review'])

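  # TF-IDF features
  # Note: fitting on the full dataset is a shortcut for this demo. In a real project,
  # fit the vectorizer on the training split only (e.g. inside a Pipeline).
  tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
  tfidf_matrix = tfidf.fit_transform(df['clean_review'])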
  tfidf_df = pd.DataFrame(
      tfidf_matrix.toarray(),
      columns=[f'tfidf_{w}' for w in tfidf.get_feature_names_out()]
  )

  # Combine all features
  X = pd.concat([manual_features.reset_index(drop=True), 
                 tfidf_df.reset_index(drop=True)], axis=1)
  y = df['sentiment']

  print(f"\nTotal features: {X.shape[1]}")
  print(f"  - Manual features: {manual_features.shape[1]}")
  print(f"  - TF-IDF features: {tfidf_df.shape[1]}")

  # Step 5: Train and evaluate
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42
  )

  model = RandomForestClassifier(n_estimators=100, random_state=42)
  model.fit(X_train, y_train)

  print("\n📊 Classification Report:")
  print(classification_report(y_test, model.predict(X_test)))

  # Step 6: Compare feature sets
  print("\n📈 Comparing Feature Sets:")

  # Manual features only
  model_manual = RandomForestClassifier(n_estimators=100, random_state=42)
  model_manual.fit(X_train[manual_features.columns], y_train)
  acc_manual = model_manual.score(X_test[manual_features.columns], y_test)

  # TF-IDF only
  model_tfidf = RandomForestClassifier(n_estimators=100, random_state=42)
  model_tfidf.fit(X_train[tfidf_df.columns], y_train)
  acc_tfidf = model_tfidf.score(X_test[tfidf_df.columns], y_test)

  # Combined
  acc_combined = model.score(X_test, y_test)

  print(f"  Manual features only: {acc_manual:.4f}")
  print(f"  TF-IDF only:          {acc_tfidf:.4f}")
  print(f"  Combined:             {acc_combined:.4f}")

  # Top features
  importance = pd.DataFrame({
      'feature': X.columns,
      'importance': model.feature_importances_
  }).sort_values('importance', ascending=False)

  print("\n🏆 Top 10 Most Important Features:")
  for i, row in importance.head(10).iterrows():
      print(f"  {row['feature']:30s}: {row['importance']:.4f}")
  ```

  **What you learned:**

  * Combining manual features with TF-IDF often works better than either alone
  * Simple features like word count and sentiment lexicons are powerful
  * Feature engineering captures domain knowledge that pure ML can miss
</details>
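
One caveat about the Project 3 solution: it fits `TfidfVectorizer` on every review before the train/test split, so the vocabulary and IDF weights see the test rows. A minimal sketch of one way to avoid that is to place the vectorizer inside a `Pipeline`, so it is fit on training text only. The `reviews` and `labels` lists are made up for illustration and are not the project's dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, repeated so the split has enough rows.
# The repeats make the accuracy meaningless; only the mechanics matter here.
reviews = [
    "great product love it", "excellent quality highly recommend",
    "amazing value works perfectly", "best purchase this year",
    "terrible quality broke fast", "waste of money avoid",
    "horrible support very disappointed", "worst purchase ever made",
] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is a pipeline step, so fit() only ever sees X_train
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
text_clf.fit(X_train, y_train)

vocab_size = len(text_clf.named_steps["tfidf"].vocabulary_)
test_accuracy = text_clf.score(X_test, y_test)
print(f"Vocabulary learned from training text only: {vocab_size} terms")
print(f"Test accuracy: {test_accuracy:.4f}")
```

Because the whole pipeline is one estimator, it can also be passed to `cross_val_score`, which refits the vectorizer inside each fold.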

### Project 4: Automated Feature Pipeline

Build an end-to-end feature engineering pipeline that handles multiple data types.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import pandas as pd
  import numpy as np
  from sklearn.base import BaseEstimator, TransformerMixin
  from sklearn.pipeline import Pipeline, FeatureUnion
  from sklearn.compose import ColumnTransformer
  from sklearn.preprocessing import StandardScaler, OneHotEncoder
  from sklearn.impute import SimpleImputer
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split, cross_val_score
  from sklearn.metrics import classification_report

  # Step 1: Create realistic mixed-type dataset
  np.random.seed(42)
  n = 1000

  data = {
      # Numeric features
      'age': np.random.randint(18, 80, n),
      'income': np.random.exponential(50000, n),
      'credit_score': np.random.normal(700, 50, n).clip(300, 850),
      'years_employed': np.random.exponential(5, n),
      
      # Categorical features
      'education': np.random.choice(['high_school', 'bachelor', 'master', 'phd'], n),
      'occupation': np.random.choice(['engineer', 'teacher', 'doctor', 'other'], n),
      'marital_status': np.random.choice(['single', 'married', 'divorced'], n),
      
      # Features with missing values
      'savings': np.where(np.random.random(n) > 0.2, np.random.exponential(10000, n), np.nan),
      'num_dependents': np.where(np.random.random(n) > 0.1, np.random.poisson(1, n), np.nan),
  }

  df = pd.DataFrame(data)

  # Target: loan approval
  approval_prob = (
      0.3 * (df['income'] > 50000).astype(int) +
      0.2 * (df['credit_score'] > 700).astype(int) +
      0.2 * (df['education'].isin(['master', 'phd'])).astype(int) +
      0.1 * (df['years_employed'] > 3).astype(int)
  )
  df['approved'] = (np.random.random(n) < approval_prob / 1.5).astype(int)

  print("Dataset info:")
  print(df.info())
  print("\nMissing values:")
  print(df.isnull().sum())

  # Step 2: Define custom transformers
  class FeatureInteractionCreator(BaseEstimator, TransformerMixin):
      """Create interaction features between numeric columns"""
      def __init__(self, columns):
          self.columns = columns
          
      def fit(self, X, y=None):
          return self
      
      def transform(self, X):
          X = pd.DataFrame(X, columns=self.columns)
          result = pd.DataFrame()
          
          # Ratios
          result['income_per_year_employed'] = X['income'] / (X['years_employed'] + 1)
          result['savings_to_income'] = X['savings'] / (X['income'] + 1)
          result['credit_age_ratio'] = X['credit_score'] / (X['age'] + 1)
          
          # Products
          result['income_credit_product'] = X['income'] * X['credit_score'] / 100000
          
          return result.fillna(0).values

  class BinningTransformer(BaseEstimator, TransformerMixin):
      """Bin continuous variables"""
      def __init__(self, column_idx, n_bins=5):
          self.column_idx = column_idx
          self.n_bins = n_bins
          self.bins_ = None
          
      def fit(self, X, y=None):
          col = X[:, self.column_idx]
          self.bins_ = np.percentile(col[~np.isnan(col)], 
                                     np.linspace(0, 100, self.n_bins + 1))
          return self
      
      def transform(self, X):
          col = X[:, self.column_idx]
          binned = np.digitize(col, self.bins_[1:-1])
          return binned.reshape(-1, 1)

  # Step 3: Define column types
  numeric_cols = ['age', 'income', 'credit_score', 'years_employed', 'savings', 'num_dependents']
  categorical_cols = ['education', 'occupation', 'marital_status']

  # Step 4: Build preprocessing pipelines
  numeric_pipeline = Pipeline([
      ('imputer', SimpleImputer(strategy='median')),
      ('scaler', StandardScaler())
  ])

  categorical_pipeline = Pipeline([
      ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
      ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
  ])

  # Column transformer for base features
  preprocessor = ColumnTransformer([
      ('numeric', numeric_pipeline, numeric_cols),
      ('categorical', categorical_pipeline, categorical_cols)
  ])

  # Step 5: Build full pipeline with feature engineering
  # First, fit a simple preprocessor to get numeric features for interactions
  simple_imputer = ColumnTransformer([
      ('numeric', SimpleImputer(strategy='median'), numeric_cols)
  ])

  # Prepare data
  X = df.drop('approved', axis=1)
  y = df['approved']
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  # Fit the simple imputer
  simple_imputer.fit(X_train)
  X_train_imputed = simple_imputer.transform(X_train)

  # Create interaction features
  interaction_creator = FeatureInteractionCreator(numeric_cols)
  X_train_interactions = interaction_creator.transform(X_train_imputed)

  # Build final preprocessor
  X_train_base = preprocessor.fit_transform(X_train)
  X_test_base = preprocessor.transform(X_test)

  X_test_imputed = simple_imputer.transform(X_test)
  X_test_interactions = interaction_creator.transform(X_test_imputed)

  # Combine base and interaction features
  X_train_final = np.hstack([X_train_base, X_train_interactions])
  X_test_final = np.hstack([X_test_base, X_test_interactions])

  print(f"\nOriginal features: {X.shape[1]}")
  print(f"After preprocessing: {X_train_base.shape[1]}")
  print(f"After adding interactions: {X_train_final.shape[1]}")

  # Step 6: Train and evaluate
  model = RandomForestClassifier(n_estimators=100, random_state=42)
  model.fit(X_train_final, y_train)

  print("\n📊 Model Performance:")
  print(classification_report(y_test, model.predict(X_test_final)))

  # Step 7: Compare with baseline (no feature engineering)
  baseline_model = RandomForestClassifier(n_estimators=100, random_state=42)
  baseline_model.fit(X_train_base, y_train)

  print("\n📈 Comparison:")
  print(f"  Baseline accuracy:          {baseline_model.score(X_test_base, y_test):.4f}")
  print(f"  With interactions accuracy: {model.score(X_test_final, y_test):.4f}")

  # Step 8: Cross-validation for robustness
  cv_scores_base = cross_val_score(
      RandomForestClassifier(n_estimators=100, random_state=42),
      X_train_base, y_train, cv=5
  )
  print(f"\n  Baseline CV: {cv_scores_base.mean():.4f} (+/- {cv_scores_base.std():.4f})")

  cv_scores_full = cross_val_score(
      RandomForestClassifier(n_estimators=100, random_state=42),
      X_train_final, y_train, cv=5
  )
  print(f"  Full features CV: {cv_scores_full.mean():.4f} (+/- {cv_scores_full.std():.4f})")
  ```

  **What you learned:**

  * Sklearn pipelines ensure consistent preprocessing between train and test
  * Custom transformers let you integrate domain knowledge
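  * Stacking base and interaction features (done here with `np.hstack`; `FeatureUnion` does the same job inside a pipeline) combines multiple feature engineering strategies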
  * Always compare against a baseline to measure improvement
</details>
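
The Project 4 solution stacks base and interaction features by hand with `np.hstack`. As a rough sketch of how the same idea could live inside a single scikit-learn estimator, `FeatureUnion` can run a scaling branch and a ratio branch side by side. The synthetic frame below keeps only `income` and `years_employed`, and the helper `income_per_year` is a hypothetical name, not part of the project code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "income": rng.exponential(50_000, n),
    "years_employed": rng.exponential(5, n),
})
y = (X["income"] > 50_000).astype(int)         # toy target for illustration
X.loc[rng.random(n) < 0.1, "income"] = np.nan  # inject some missing values

def income_per_year(arr):
    """Ratio feature; expects array columns [income, years_employed]."""
    return (arr[:, 0] / (arr[:, 1] + 1)).reshape(-1, 1)

# Both branches receive the imputed array; their outputs are stacked column-wise
features = FeatureUnion([
    ("scaled", StandardScaler()),
    ("ratio", FunctionTransformer(income_per_year)),
])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # returns a NumPy array
    ("features", features),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Every step is refit inside each fold, so no statistics leak across folds
cv_scores = cross_val_score(pipe, X, y, cv=5)
n_features = pipe[:-1].fit_transform(X).shape[1]

print(f"Features after union: {n_features}")
print(f"CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
```

The design choice here is that imputation, scaling, and the ratio feature are all learned per fold, which the manual `np.hstack` approach does not guarantee during cross-validation.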

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Handle Missing Data" icon="circle-question">
    Impute or create indicator variables
  </Card>

  <Card title="Encode Categories" icon="code">
    One-hot for nominal, ordinal for ordered
  </Card>

  <Card title="Scale Features" icon="arrows-left-right-to-line">
    StandardScaler or MinMaxScaler for distance- and gradient-based models; tree-based models don't need scaling
  </Card>

  <Card title="Create Features" icon="plus">
    Domain knowledge creates the best features
  </Card>
</CardGroup>

***

## 🧹 Real-World Messy Data: Complete Guide

<Accordion title="Handling Every Type of Data Problem" icon="broom">
  ### Missing Values Decision Tree

  ```
  Is data missing?
  ├── < 5% missing
  │   └── Safe to drop rows OR simple imputation (mean/median)
  ├── 5-30% missing
  │   ├── Is missingness random?
  │   │   ├── Yes → Impute with mean/median/mode
  │   │   └── No (informative) → Create "is_missing" indicator + impute
  │   └── Consider multiple imputation for important analyses
  └── > 30% missing
      ├── Is the feature important?
      │   ├── Yes → Advanced imputation (KNN, iterative)
      │   └── No → Consider dropping the feature
      └── Investigate WHY data is missing
  ```

  ```python theme={null}
  # Reusable missing value handler
  import pandas as pd

  def handle_missing_values(df, missing_threshold=0.3):
      """
      Handle missing values with best practices.

      Returns a modified copy of df and a list of report strings.
      """
      df = df.copy()  # don't mutate the caller's DataFrame
      report = []
      
      # Snapshot the columns: indicator columns are added inside the loop
      for col in list(df.columns):
          missing_pct = df[col].isnull().mean()
          
          if missing_pct == 0:
              continue
          elif missing_pct > missing_threshold:
              report.append(f"⚠️ {col}: {missing_pct:.1%} missing - consider dropping")
          elif pd.api.types.is_numeric_dtype(df[col]):
              # Numeric: impute with median (robust to outliers)
              df[f'{col}_missing'] = df[col].isnull().astype(int)
              # Assign back instead of inplace=True, which is unreliable on a column selection
              df[col] = df[col].fillna(df[col].median())
              report.append(f"✓ {col}: imputed with median, created indicator")
          else:
              # Categorical: impute with mode or 'Unknown'
              mode = df[col].mode()
              df[col] = df[col].fillna(mode[0] if len(mode) > 0 else 'Unknown')
              report.append(f"✓ {col}: imputed with mode/Unknown")
      
      return df, report
  ```

  ### Outlier Detection & Treatment

  ```python theme={null}
  def detect_and_handle_outliers(df, columns, method='iqr', action='cap'):
      """
      Detect outliers using IQR or Z-score, then handle them.
      
      Parameters:
      - method: 'iqr' (robust) or 'zscore' (assumes normality)
      - action: 'cap' (winsorize), 'remove', or 'flag'
      """
      for col in columns:
          if method == 'iqr':
              Q1, Q3 = df[col].quantile([0.25, 0.75])
              IQR = Q3 - Q1
              lower = Q1 - 1.5 * IQR
              upper = Q3 + 1.5 * IQR
          else:  # zscore: |z| > 3 is the same as falling outside mean ± 3 std
              mean, std = df[col].mean(), df[col].std()
              lower = mean - 3 * std
              upper = mean + 3 * std
          
          outliers = (df[col] < lower) | (df[col] > upper)
          n_outliers = outliers.sum()
          
          if n_outliers > 0:
              if action == 'cap':
                  df[col] = df[col].clip(lower=lower, upper=upper)
                  print(f"  {col}: capped {n_outliers} outliers to [{lower:.2f}, {upper:.2f}]")
              elif action == 'remove':
                  df = df[~outliers]
                  print(f"  {col}: removed {n_outliers} outlier rows")
              else:  # flag
                  df[f'{col}_is_outlier'] = outliers.astype(int)
                  print(f"  {col}: flagged {n_outliers} outliers")
      
      return df
  ```

  ### Handling Skewed Distributions

  ```python theme={null}
  import numpy as np
  from scipy.stats import skew
  from sklearn.preprocessing import PowerTransformer

  def handle_skewness(df, columns, threshold=1.0):
      """
      Apply log or Yeo-Johnson transform to skewed features.
      """
      
      for col in columns:
          col_skew = skew(df[col].dropna())
          
          if abs(col_skew) > threshold:
              if (df[col] > 0).all():
                  # Log transform for positive data
                  df[f'{col}_log'] = np.log1p(df[col])
                  print(f"  {col}: skew={col_skew:.2f} → log transform applied")
              else:
                  # Yeo-Johnson for any data
                  pt = PowerTransformer(method='yeo-johnson')
                  # fit_transform returns shape (n, 1); flatten it to a 1-D column
                  df[f'{col}_transformed'] = pt.fit_transform(df[[col]]).ravel()
                  print(f"  {col}: skew={col_skew:.2f} → Yeo-Johnson applied")
      
      return df
  ```
</Accordion>
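
The decision tree in the accordion above mentions indicator variables and KNN imputation without showing them. A minimal sketch with scikit-learn's `SimpleImputer(add_indicator=True)` and `KNNImputer` follows. The tiny `[age, income]` arrays are hypothetical, and both imputers are fit on the training rows only and then reused on the test row.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data: columns are [age, income]; income has gaps
X_train = np.array([
    [25.0, 40_000.0],
    [32.0, np.nan],
    [47.0, 85_000.0],
    [51.0, 90_000.0],
    [38.0, 52_000.0],
])
X_test = np.array([[30.0, np.nan]])

# Median imputation plus an "is_missing" indicator for columns that had NaNs
median_imputer = SimpleImputer(strategy="median", add_indicator=True)
train_median = median_imputer.fit_transform(X_train)
test_median = median_imputer.transform(X_test)  # uses the training median

# KNN imputation: fill the gap from the nearest training rows.
# In practice, scale features first so no single column dominates the distance.
knn_imputer = KNNImputer(n_neighbors=2)
knn_imputer.fit(X_train)
test_knn = knn_imputer.transform(X_test)

print("Median + indicator:", test_median)
print("KNN imputed:       ", test_knn)
```

The median version appends one indicator column, so the model can still learn from the fact that a value was missing. The KNN version fills the gap with an average of nearby donors instead of one global statistic.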

***

## 🔗 Math → ML Connection

<Note>
  **Feature engineering connects to these mathematical concepts:**

  | Technique               | Math Concept             | Why It Works                           |
  | ----------------------- | ------------------------ | -------------------------------------- |
  | **Standardization**     | Z-scores from statistics | Makes gradient descent converge faster |
  | **One-hot encoding**    | Orthogonal basis vectors | Each category becomes a dimension      |
  | **Log transforms**      | Properties of logarithms | Linearizes exponential relationships   |
  | **Polynomial features** | Polynomial functions     | Captures nonlinear patterns            |
  | **PCA features**        | Eigenvalue decomposition | Finds directions of max variance       |
  | **Interaction terms**   | Cross-products           | Models combined effects                |

  The [Linear Algebra course](/courses/math-for-ml-linear-algebra/02-vectors) covers why these transformations work geometrically.
</Note>
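
Three rows of the table can be checked numerically. The sketch below uses made-up numbers to confirm that standardized values have mean 0 and standard deviation 1, that one-hot vectors for different categories are orthogonal, and that a log transform turns an exponential relationship into a straight line.

```python
import numpy as np

# Standardization: z = (x - mean) / std
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
z = (values - values.mean()) / values.std()  # population std, as StandardScaler uses

# One-hot encoding: each category is a basis vector
one_hot = np.eye(3)           # rows are the vectors for three categories
gram = one_hot @ one_hot.T    # pairwise dot products

# Log transform: y = 2 * exp(0.5 * x) becomes log(y) = log(2) + 0.5 * x
x = np.linspace(0, 5, 20)
y = 2.0 * np.exp(0.5 * x)
slope, intercept = np.polyfit(x, np.log(y), 1)

print(f"z mean={z.mean():.2f}, z std={z.std():.2f}")
print(f"one-hot dot products:\n{gram}")
print(f"fitted slope={slope:.3f}, intercept={intercept:.3f}")
```

The off-diagonal zeros in `gram` are what "orthogonal" means here: no category is numerically closer to any other.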

***

## 🚀 Going Deeper (Optional)

<Accordion title="Advanced Feature Engineering Techniques" icon="graduation-cap">
  ### Target Encoding (for High-Cardinality Categoricals)

  When a categorical has 1000+ unique values, one-hot encoding creates too many features:

  ```python theme={null}
  import numpy as np
  from sklearn.model_selection import KFold

  def target_encode(df, cat_col, target_col, n_splits=5):
      """
      Replace category with mean target value (using cross-validation to prevent leakage).
      """
      df[f'{cat_col}_target_enc'] = np.nan
      kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
      
      for train_idx, val_idx in kf.split(df):
          # Calculate means only on training fold
          means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
          # Apply to validation fold
          df.loc[df.index[val_idx], f'{cat_col}_target_enc'] = \
              df.iloc[val_idx][cat_col].map(means)
      
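      # Fill any remaining NaN (categories unseen in a training fold) with the global mean.
      # Assign back instead of inplace=True, which is unreliable on a column selection.
      df[f'{cat_col}_target_enc'] = df[f'{cat_col}_target_enc'].fillna(df[target_col].mean())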
      
      return df
  ```

  ### Time-Based Features

  ```python theme={null}
  import numpy as np
  import pandas as pd

  def create_time_features(df, date_col):
      """
      Extract rich features from datetime columns.
      """
      df[date_col] = pd.to_datetime(df[date_col])
      
      # Basic extractions
      df['year'] = df[date_col].dt.year
      df['month'] = df[date_col].dt.month
      df['day'] = df[date_col].dt.day
      df['dayofweek'] = df[date_col].dt.dayofweek  # 0=Monday
      df['hour'] = df[date_col].dt.hour
      
      # Cyclical encoding (preserves continuity: Dec → Jan)
      df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
      df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
      df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
      df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
      
      # Business features
      df['is_weekend'] = (df['dayofweek'] >= 5).astype(int)
      df['is_month_start'] = df[date_col].dt.is_month_start.astype(int)
      df['is_month_end'] = df[date_col].dt.is_month_end.astype(int)
      
      return df
  ```

  ### Automated Feature Engineering

  ```python theme={null}
  # Using featuretools for automated feature generation
  # (assumes transactions_df and customers_df are existing pandas DataFrames)
  import featuretools as ft

  # Define entity set
  es = ft.EntitySet(id='customers')
  es.add_dataframe(dataframe_name='transactions', dataframe=transactions_df,
                   index='transaction_id', time_index='timestamp')
  es.add_dataframe(dataframe_name='customers', dataframe=customers_df,
                   index='customer_id')

  # Create relationship
  es.add_relationship('customers', 'customer_id', 'transactions', 'customer_id')

  # Automatically generate features
  features, feature_names = ft.dfs(
      entityset=es,
      target_dataframe_name='customers',
      agg_primitives=['sum', 'mean', 'count', 'max', 'min', 'std'],
      trans_primitives=['month', 'year', 'weekday'],
      max_depth=2
  )
  ```
</Accordion>
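
The cyclical encoding in the Time-Based Features block above is easier to trust once the distances are measured. This short sketch compares December, January, and June in (sin, cos) space. The helper `encode_month` is a hypothetical name that applies the same formula as `month_sin` and `month_cos`.

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jun = encode_month(12), encode_month(1), encode_month(6)

raw_gap_dec_jan = abs(12 - 1)                # raw month numbers: 11 apart
dist_dec_jan = np.linalg.norm(dec - jan)     # adjacent on the circle
dist_dec_jun = np.linalg.norm(dec - jun)     # opposite sides of the circle

print(f"Raw gap Dec-Jan:      {raw_gap_dec_jan}")
print(f"Encoded dist Dec-Jan: {dist_dec_jan:.3f}")
print(f"Encoded dist Dec-Jun: {dist_dec_jun:.3f}")
```

With raw month numbers, December and January look as far apart as possible. After encoding, they are neighbors and June sits on the opposite side of the circle.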

***

## What's Next?

Now you know how to prepare data. But how do you find the best hyperparameters?

<Card title="Continue to Module 9: Hyperparameter Tuning" icon="arrow-right" href="/courses/ml-mastery/09-hyperparameter-tuning">
  Learn Grid Search, Random Search, and Bayesian Optimization
</Card>
