# Feature Engineering

> Transform raw data into features that models can learn from

## Data Scientists Spend 80% of Time Here

Raw data is messy. Models need clean, meaningful numbers.

**Feature engineering** is the art of transforming raw data into features that help models learn. It's the difference between feeding a model "March 15, 1995" (a string it can't use) and feeding it "30 years old, built pre-2000, winter construction" (numbers that carry meaning).

Here's a truth that surprises beginners: a simple model with great features often beats a complex model with raw features. Feature engineering is where domain knowledge meets data science, and it's one of the highest-leverage activities in most ML projects.

***

## The House Price Example

Raw data:

```
Address: "123 Main St, New York, NY 10001"
Built: "March 15, 1995"
Description: "Cozy 3BR, renovated kitchen, near subway"
Price: $850,000
```

What a model needs:

```python
{
    'bedrooms': 3,
    'city_encoded': 45,       # New York
    'year_built': 1995,
    'building_age': 30,
    'is_renovated': 1,
    'near_transit': 1,
    'zip_price_tier': 3       # Expensive area
}
```

***

## Handling Missing Values

```python
import pandas as pd
import numpy as np

# Sample data with missing values
df = pd.DataFrame({
    'age': [25, np.nan, 35, 40, np.nan],
    'income': [50000, 60000, np.nan, 80000, 90000],
    'education': ['Bachelor', 'Master', np.nan, 'PhD', 'Bachelor']
})

print("Missing values:")
print(df.isnull().sum())
```

### Strategy 1: Drop Missing Values

```python
# Drop rows with ANY missing values -- the nuclear option.
# Only safe when: (1) you have plenty of data, (2) missingness is random,
# and (3) the missing rows aren't systematically different from the rest.
df_clean = df.dropna()

# Safer: drop rows missing only in critical columns
df_clean = df.dropna(subset=['age'])
```

Only use when you have lots of data and missingness is **truly random** (called MCAR -- Missing Completely At Random). If high-income people tend to skip the income question, dropping those rows biases your model toward low-income profiles. Check this by comparing the distributions of other features between "has missing" and "no missing" groups.
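The check described above can be sketched in a few lines. This is a minimal illustration on made-up data, not a formal MCAR test: the `toy` frame, the age-dependent drop probabilities, and the choice of `age` as the comparison feature are all assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: income is missing more often for older respondents.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "age": rng.integers(20, 70, n),
    "income": rng.normal(60_000, 15_000, n),
})

# Make missingness depend on age, so it is NOT completely at random.
drop_prob = np.where(toy["age"] > 50, 0.5, 0.05)
toy.loc[rng.random(n) < drop_prob, "income"] = np.nan

# Compare another feature between the "no missing" and "has missing" groups.
income_missing = toy["income"].isna()
summary = toy.groupby(income_missing)["age"].describe()[["count", "mean", "std"]]
summary.index = ["income present", "income missing"]
print(summary)
```

If the two rows look clearly different (here the "income missing" group is noticeably older), dropping those rows would shift the age distribution of your training data, which is a hint that imputation plus a missingness flag may be safer than `dropna()`.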
### Strategy 2: Imputation

```python
from sklearn.impute import SimpleImputer

# Numeric: fill with mean, median, or constant
numeric_imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
# fit_transform returns a 2D array; ravel() flattens it back to one column
df['age'] = numeric_imputer.fit_transform(df[['age']]).ravel()

# Categorical: fill with mode or 'Unknown'
categorical_imputer = SimpleImputer(strategy='most_frequent')
df['education'] = categorical_imputer.fit_transform(df[['education']]).ravel()
```

### Strategy 3: Indicator Variables

```python
# Create a flag for missing values (can be informative!)
# Why? Because "missing" is often not random -- it carries signal.
# For example, if income is missing, the person might have refused
# to share it, which itself correlates with certain behaviors.
# Create the flag BEFORE imputing, otherwise there is nothing left to flag.
df['age_missing'] = df['age'].isnull().astype(int)
df['age'] = df['age'].fillna(df['age'].median())
```

**Which imputation strategy should you use?** Use **median** for numeric features with outliers (median is robust to extreme values). Use **mean** for roughly symmetric, normally distributed features. Use **mode** (most frequent) for categorical features. And consider adding a missingness indicator -- it's cheap, and tree-based models will use it if it's predictive.

***

## Encoding Categorical Variables

### Label Encoding (for ordinal categories)

Use this when categories have a natural order -- like education levels, satisfaction ratings, or T-shirt sizes. The numbers you assign should reflect the ranking.

```python
# Ordinal: order matters -- PhD > Master > Bachelor > High School
# The numeric values (0,1,2,3) encode this ranking, which means
# the model can learn "higher education = different outcome."
# Note: sklearn's LabelEncoder assigns codes alphabetically and is meant
# for target labels, so an explicit mapping is used here instead.
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
education_rank = {level: rank for rank, level in enumerate(education_order)}
df['education_encoded'] = df['education'].map(education_rank)

# Common mistake: using label encoding for nominal categories (like color).
# The model would think red=0 < blue=1 < green=2, which is meaningless.
```

### One-Hot Encoding (for nominal categories)

Use this when categories have no natural order -- like colors, countries, or product types. Each category becomes its own binary column.

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Using pandas -- simpler for exploration
df_encoded = pd.get_dummies(df, columns=['color'], prefix='color')

# Using sklearn -- better for pipelines (remembers categories from training)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# handle_unknown='ignore' is critical: if test data has a category
# the model never saw during training, it won't crash -- it just
# sets all one-hot columns to 0 (an "unknown" embedding).
encoded = encoder.fit_transform(df[['color']])
```

**Before**:

```
id  color
1   red
2   blue
3   green
```

**After**:

```
id  color_red  color_blue  color_green
1   1          0           0
2   0          1           0
3   0          0           1
```

### Target Encoding (for high-cardinality categories)

When a categorical feature has hundreds or thousands of unique values (like zip codes or product IDs), one-hot encoding creates an explosion of columns. Target encoding replaces each category with the average target value for that category -- essentially asking "what's the typical outcome for this group?"

```python
# Replace category with mean target value.
# For a city like "San Francisco," this becomes the average house price
# in San Francisco -- a single number that captures location value.
# Illustrative only: in practice, compute city_means on the training split.
city_means = df.groupby('city')['price'].mean()
df['city_encoded'] = df['city'].map(city_means)
```

**Data leakage warning**: Target encoding uses the target variable to create features, which can leak target information into the features the model trains on. Always compute means on training data only, and consider using smoothed target encoding (blending category mean with global mean) to reduce overfitting on rare categories.
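The smoothing idea in the warning above can be sketched as follows. This is one common formulation, not the only one: the helper names `fit_smoothed_target_encoding` and `apply_target_encoding`, the smoothing weight `m`, and the toy prices are assumptions made up for illustration.

```python
import pandas as pd

def fit_smoothed_target_encoding(train, cat_col, target_col, m=10.0):
    """Learn per-category encodings from TRAINING rows only.

    encoding = (n * category_mean + m * global_mean) / (n + m)
    where n is the category count and m (an assumed smoothing weight)
    controls how strongly rare categories are pulled toward the global mean.
    """
    global_mean = train[target_col].mean()
    stats = train.groupby(cat_col)[target_col].agg(["mean", "count"])
    encoding = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return encoding, global_mean

def apply_target_encoding(df, cat_col, encoding, global_mean):
    """Map categories to encodings; unseen categories fall back to the global mean."""
    return df[cat_col].map(encoding).fillna(global_mean)

# Hypothetical training and test splits.
train = pd.DataFrame({
    "city": ["SF", "SF", "SF", "NYC", "NYC", "Austin"],
    "price": [900, 1000, 1100, 800, 900, 400],
})
test = pd.DataFrame({"city": ["SF", "Austin", "Boise"]})

encoding, global_mean = fit_smoothed_target_encoding(train, "city", "price", m=2.0)
test["city_encoded"] = apply_target_encoding(test, "city", encoding, global_mean)
print(test)
```

With only one training row, "Austin" is pulled much closer to the global mean than "SF" is, and the unseen "Boise" simply receives the global mean. The test rows never contribute to the statistics.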
Libraries like `category_encoders` implement smoothed target encoding for you.

***

## Scaling Numerical Features

### Why Scale?

Many algorithms (SVM, KNN, neural networks) are sensitive to scale:

* Age: 0-100
* Income: 0-1,000,000

Without scaling, income would dominate!

### StandardScaler (Z-score normalization)

$x_{scaled} = \frac{x - \mu}{\sigma}$

Centers each feature at 0 and scales it to unit variance. The most common choice for algorithms that are sensitive to feature scale or trained with gradient descent (logistic regression, SVM, neural networks).

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# Result on the training data: mean=0, std=1
print(f"Mean: {X_train_scaled.mean():.4f}")
print(f"Std: {X_train_scaled.std():.4f}")
```

### MinMaxScaler (0-1 normalization)

$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$

Maps every feature to [0, 1]. Best when you need bounded values (e.g., neural network inputs, or when features are already uniformly distributed). Sensitive to outliers -- one extreme value can squash everything else into a narrow range.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # test values may fall outside [0, 1]

# Result on the training data: values between 0 and 1
print(f"Min: {X_train_scaled.min():.4f}")
print(f"Max: {X_train_scaled.max():.4f}")
```

### RobustScaler (for outliers)

Uses median and IQR instead of mean and std. If your data has outliers that you don't want to remove, this is your best bet -- the median and IQR are barely affected by a few extreme values.
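To see that robustness concretely, here is a small comparison of the three scalers on a single column with one extreme value. The `incomes` array is made-up data for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical incomes: four typical values and one extreme outlier.
incomes = np.array([[40_000], [50_000], [60_000], [70_000], [5_000_000]], dtype=float)

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(incomes).ravel()
    print(f"{type(scaler).__name__:>14}: {np.round(scaled, 3)}")

# How much spread is left among the four typical values after scaling?
minmax_typical = MinMaxScaler().fit_transform(incomes).ravel()[:4]
robust_typical = RobustScaler().fit_transform(incomes).ravel()[:4]
minmax_spread = minmax_typical.max() - minmax_typical.min()
robust_spread = robust_typical.max() - robust_typical.min()
print(f"MinMax spread of typical values: {minmax_spread:.4f}")
print(f"Robust spread of typical values: {robust_spread:.4f}")
```

With MinMaxScaler the four typical incomes end up squeezed next to 0, while RobustScaler keeps them well separated and leaves the outlier as a large but isolated value.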
```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
```

**Quick decision guide for scaling:**

| Scaler             | Use When                                                     | Avoid When                                    |
| ------------------ | ------------------------------------------------------------ | --------------------------------------------- |
| **StandardScaler** | Default choice; features are roughly Gaussian                | Features have many outliers                   |
| **MinMaxScaler**   | You need bounded [0,1] values; data is uniformly distributed | Outliers exist (they'll distort the range)    |
| **RobustScaler**   | Outliers are present but meaningful                          | Your data is already clean                    |
| **No scaling**     | Using tree-based models (Random Forest, XGBoost)             | Using distance-based or gradient-based models |

***

## Creating New Features

### Mathematical Transformations

```python
# For skewed distributions (right-skewed data like income, prices)
# Log transform compresses the long tail, making the distribution
# more symmetric. This often helps linear models.
df['income_log'] = np.log1p(df['income'])  # log(1+x) to handle zeros safely

# Square root -- a milder compression than log, good for count data
df['rooms_sqrt'] = np.sqrt(df['rooms'])

# Polynomial features -- captures non-linear relationships.
# If the relationship between age and target is U-shaped (young and old
# are both high-risk), a linear model can't capture this with age alone.
# Adding age^2 gives it the curvature it needs.
df['age_squared'] = df['age'] ** 2
```

### Interaction Features

These capture relationships between features that the model might not discover on its own. They encode domain knowledge: "the *combination* of these two things matters, not just each one individually."
```python
# Combine features -- each ratio tells a specific story
df['price_per_sqft'] = df['price'] / df['sqft']                          # Property value density
df['income_per_person'] = df['household_income'] / df['household_size']  # Individual buying power
df['age_income_ratio'] = df['age'] / df['income']                        # Career progression proxy
```

### Date Features

```python
df['date'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter

# Time since event (depends on the current time, so results change between runs)
df['days_since_purchase'] = (pd.Timestamp.now() - df['date']).dt.days
```

### Text Features

```python
# Length
df['description_length'] = df['description'].str.len()
df['word_count'] = df['description'].str.split().str.len()

# Contains specific words (na=False treats missing descriptions as "no match")
df['has_discount'] = df['description'].str.contains(
    'discount|sale|offer', case=False, na=False
).astype(int)
```

***

## Binning Continuous Variables

```python
# Age groups
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 50, 65, 100],
    labels=['youth', 'young_adult', 'middle_age', 'senior', 'elderly']
)

# Equal-frequency binning (quantiles)
df['income_quantile'] = pd.qcut(
    df['income'],
    q=4,
    labels=['low', 'medium', 'high', 'very_high']
)
```

***

## Handling Outliers

```python
import numpy as np

def detect_outliers_iqr(data):
    # nanpercentile ignores missing values instead of returning NaN bounds
    Q1 = np.nanpercentile(data, 25)
    Q3 = np.nanpercentile(data, 75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return (data < lower) | (data > upper)

# Cap outliers (winsorization)
def cap_outliers(data, lower_percentile=1, upper_percentile=99):
    lower = np.nanpercentile(data, lower_percentile)
    upper = np.nanpercentile(data, upper_percentile)
    return np.clip(data, lower, upper)

df['income_capped'] = cap_outliers(df['income'])
```

***

## Feature Selection

More features are not always better.
Think of it like packing for a trip: bringing everything "just in case" makes your suitcase impossibly heavy and you can never find what you need. Feature selection is choosing to pack only what you'll actually wear. Irrelevant features add noise, slow training, and can even hurt accuracy by diluting the signal.

### Correlation Analysis

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix -- the first thing to check.
# Look for: (1) features highly correlated with target (good!)
# and (2) features highly correlated with each other (redundant -- drop one).
# numeric_only=True skips text columns, which would otherwise raise an error.
corr = df.corr(numeric_only=True)

# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()

# Drop highly correlated features
def drop_correlated_features(df, threshold=0.9):
    corr_matrix = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    return df.drop(columns=to_drop)
```

### Model-Based Selection

```python
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier

# Select top k features by ANOVA F-score.
# f_classif tests whether each feature's mean differs across classes.
# It's fast but scores each feature on its own and only looks at class
# means -- a feature with a U-shaped relationship to the target might score low.
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected features:", list(selected_features))
```

### Recursive Feature Elimination

RFE works like a talent show elimination: train a model, eliminate the weakest feature, retrain, repeat. It's slower but catches feature interactions that univariate tests miss.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

model = RandomForestClassifier(n_estimators=50, random_state=42)
rfe = RFE(model, n_features_to_select=10)
rfe.fit(X, y)

# Get rankings (rank 1 = selected)
for name, rank in zip(X.columns, rfe.ranking_):
    print(f"{name}: Rank {rank}")
```

***

## Feature Engineering Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Define column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'occupation', 'city']

# Create transformers
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline with model -- this is the professional way to build ML systems.
# The pipeline guarantees that preprocessing steps are applied identically
# during training and prediction, eliminating a whole class of production bugs.
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Use it -- raw data goes in, predictions come out.
# Cross-validation, grid search, and joblib.dump all work
# seamlessly with pipelines because they're a single object.
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)
```

***

## Common Mistakes

**Problem**: Using test data info during training

**Fix**: Always fit transformers on train data only

```python
# Wrong
scaler.fit(X)  # Uses all data

# Right
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

**Problem**: Scaling before train-test split

**Fix**: Split first, then scale

```python
# Right order:
# 1. Train-test split
# 2. Fit scaler on train
# 3. Transform both
```

***

## 🚀 Mini Projects

* Transform raw transaction data into predictive features
* Extract powerful temporal features from timestamps
* Convert text data into numerical features
* Build an end-to-end feature engineering pipeline

### Project 1: E-commerce Feature Engineer

Transform raw e-commerce transaction data into features that predict customer churn.


```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Step 1: Create sample e-commerce data
np.random.seed(42)
n_customers = 1000

data = {
    'customer_id': range(1, n_customers + 1),
    'total_orders': np.random.poisson(5, n_customers),
    'total_spent': np.random.exponential(500, n_customers),
    'days_since_first_order': np.random.randint(30, 730, n_customers),
    'days_since_last_order': np.random.randint(1, 180, n_customers),
    'avg_order_value': np.random.exponential(100, n_customers),
    'num_returns': np.random.poisson(0.5, n_customers),
    'num_complaints': np.random.poisson(0.2, n_customers),
    'email_opens': np.random.poisson(10, n_customers),
    'loyalty_points': np.random.exponential(200, n_customers)
}
df = pd.DataFrame(data)

# Simulate churn (customers with recent inactivity or complaints more likely to churn)
churn_prob = (
    0.3 * (df['days_since_last_order'] > 60).astype(int) +
    0.2 * (df['num_complaints'] > 0).astype(int) +
    0.1 * (df['num_returns'] > 2).astype(int) +
    0.1 * (df['total_orders'] < 3).astype(int)
)
df['churned'] = (np.random.random(n_customers) < churn_prob).astype(int)

print("Raw data shape:", df.shape)
print(df.head())

# Step 2: Feature engineering
def engineer_features(df):
    """Create meaningful features from raw data"""
    features = pd.DataFrame()

    # Recency features
    features['recency'] = df['days_since_last_order']
    features['customer_age'] = df['days_since_first_order']
    features['recency_ratio'] = df['days_since_last_order'] / (df['days_since_first_order'] + 1)

    # Frequency features (orders per month; +1 avoids division by zero)
    features['order_frequency'] = df['total_orders'] / (df['days_since_first_order'] / 30 + 1)
    features['is_frequent_buyer'] = (features['order_frequency'] > 1).astype(int)

    # Monetary features
    features['total_spent'] = df['total_spent']
    features['avg_order_value'] = df['avg_order_value']
    features['value_per_day'] = df['total_spent'] / (df['days_since_first_order'] + 1)

    # Customer lifetime value estimate
    features['estimated_clv'] = (
        features['order_frequency'] * df['avg_order_value'] * 12  # Annualized
    )

    # Engagement features
    features['return_rate'] = df['num_returns'] / (df['total_orders'] + 1)
    features['complaint_rate'] = df['num_complaints'] / (df['total_orders'] + 1)
    # customer_age lives in `features`, not in the raw df
    features['email_engagement'] = df['email_opens'] / (features['customer_age'] / 7 + 1)

    # Loyalty features
    features['points_per_order'] = df['loyalty_points'] / (df['total_orders'] + 1)
    features['points_per_dollar'] = df['loyalty_points'] / (df['total_spent'] + 1)

    # Risk indicators
    features['is_at_risk'] = (
        (df['days_since_last_order'] > 60) |
        (df['num_complaints'] > 1)
    ).astype(int)

    # Segment features (binning)
    features['spending_tier'] = pd.cut(
        df['total_spent'],
        bins=[0, 200, 500, 1000, np.inf],
        labels=[0, 1, 2, 3]
    ).astype(int)

    return features

# Step 3: Apply feature engineering
X = engineer_features(df)
y = df['churned']

print(f"\nEngineered features: {X.shape[1]}")
print("Features:", list(X.columns))

# Step 4: Train model and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("\n📊 Model Performance:")
print(classification_report(y_test, model.predict(X_test)))

# Step 5: Feature importance
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n🏆 Top 10 Most Important Features:")
for i, row in importance.head(10).iterrows():
    print(f"  {row['feature']:25s}: {row['importance']:.4f}")
```

**What you learned:**

* RFM (Recency, Frequency, Monetary) features are powerful for churn prediction
* Ratio features often outperform raw counts
* Feature engineering can dramatically improve model performance

### Project 2: Date-Time Feature Factory

Extract powerful temporal features from timestamp data.


```python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Step 1: Create sample time-series data (sales data)
np.random.seed(42)
# 'h' replaces the deprecated 'H' alias in pandas 2.2+
dates = pd.date_range('2022-01-01', '2023-12-31', freq='h')
n = len(dates)

# Create base sales with patterns
base_sales = 100
hourly_pattern = np.sin(np.arange(n) * 2 * np.pi / 24) * 20          # daily cycle
weekly_pattern = np.sin(np.arange(n) * 2 * np.pi / (24*7)) * 30      # weekly cycle
seasonal_pattern = np.sin(np.arange(n) * 2 * np.pi / (24*365)) * 50  # yearly cycle
noise = np.random.normal(0, 10, n)

sales = base_sales + hourly_pattern + weekly_pattern + seasonal_pattern + noise
sales = np.maximum(sales, 0)  # No negative sales

df = pd.DataFrame({'timestamp': dates, 'sales': sales})

print("Raw data:")
print(df.head(10))

# Step 2: Extract datetime features
def extract_datetime_features(df, timestamp_col='timestamp'):
    """Extract comprehensive datetime features"""
    ts = df[timestamp_col]
    features = pd.DataFrame()

    # Basic time components
    features['hour'] = ts.dt.hour
    features['day'] = ts.dt.day
    features['month'] = ts.dt.month
    features['year'] = ts.dt.year
    features['dayofweek'] = ts.dt.dayofweek
    features['dayofyear'] = ts.dt.dayofyear
    features['weekofyear'] = ts.dt.isocalendar().week.astype(int)
    features['quarter'] = ts.dt.quarter

    # Cyclical encoding (important for time!)
    features['hour_sin'] = np.sin(2 * np.pi * features['hour'] / 24)
    features['hour_cos'] = np.cos(2 * np.pi * features['hour'] / 24)
    features['day_sin'] = np.sin(2 * np.pi * features['dayofweek'] / 7)
    features['day_cos'] = np.cos(2 * np.pi * features['dayofweek'] / 7)
    features['month_sin'] = np.sin(2 * np.pi * features['month'] / 12)
    features['month_cos'] = np.cos(2 * np.pi * features['month'] / 12)

    # Boolean features
    features['is_weekend'] = (features['dayofweek'] >= 5).astype(int)
    features['is_month_start'] = ts.dt.is_month_start.astype(int)
    features['is_month_end'] = ts.dt.is_month_end.astype(int)
    features['is_quarter_start'] = ts.dt.is_quarter_start.astype(int)
    features['is_quarter_end'] = ts.dt.is_quarter_end.astype(int)

    # Time of day categories
    features['is_morning'] = ((features['hour'] >= 6) & (features['hour'] < 12)).astype(int)
    features['is_afternoon'] = ((features['hour'] >= 12) & (features['hour'] < 18)).astype(int)
    features['is_evening'] = ((features['hour'] >= 18) & (features['hour'] < 22)).astype(int)
    features['is_night'] = ((features['hour'] >= 22) | (features['hour'] < 6)).astype(int)

    # Business hours
    features['is_business_hours'] = (
        (features['hour'] >= 9) &
        (features['hour'] < 17) &
        (features['is_weekend'] == 0)
    ).astype(int)

    # Lag features (previous values)
    features['sales_lag_1h'] = df['sales'].shift(1)
    features['sales_lag_24h'] = df['sales'].shift(24)
    features['sales_lag_168h'] = df['sales'].shift(168)  # 1 week

    # Rolling statistics over PAST values only: shift(1) keeps the current
    # hour's sales (the prediction target) out of its own features.
    past_sales = df['sales'].shift(1)
    features['sales_rolling_mean_24h'] = past_sales.rolling(24).mean()
    features['sales_rolling_std_24h'] = past_sales.rolling(24).std()
    features['sales_rolling_mean_168h'] = past_sales.rolling(168).mean()

    return features

# Step 3: Apply feature extraction
features = extract_datetime_features(df)
features = features.dropna()  # Remove rows with NaN from lag features

print(f"\nExtracted {features.shape[1]} datetime features")
print("Features:", list(features.columns))

# Step 4: Train model to predict sales
y = df['sales'].loc[features.index]

X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.2, shuffle=False  # Time series: no shuffle
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"\nModel MAE: {mae:.2f}")

# Step 5: Feature importance
importance = pd.DataFrame({
    'feature': features.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n🏆 Top 10 Most Important Time Features:")
for i, row in importance.head(10).iterrows():
    print(f"  {row['feature']:25s}: {row['importance']:.4f}")

# Compare with baseline (just using hour)
baseline = RandomForestRegressor(n_estimators=100, random_state=42)
baseline.fit(X_train[['hour']], y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test[['hour']]))

print(f"\n📈 Improvement over baseline (hour only):")
print(f"  Baseline MAE: {baseline_mae:.2f}")
print(f"  Full features MAE: {mae:.2f}")
print(f"  Improvement: {(baseline_mae - mae) / baseline_mae * 100:.1f}%")
```

**What you learned:**

* Cyclical encoding prevents the model from seeing hour 23 and hour 0 as far apart
* Lag features capture temporal dependencies
* Rolling statistics smooth out noise and capture trends
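The first point is easy to verify by hand. The sketch below measures how far apart two hours are after sin/cos encoding; `encode_hour` and `cyclical_distance` are hypothetical helper names introduced only for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def cyclical_distance(h1, h2):
    """Euclidean distance between two hours after sin/cos encoding."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

print(f"raw gap |23 - 0|:      {abs(23 - 0)}")
print(f"encoded 23 vs 0:       {cyclical_distance(23, 0):.3f}")
print(f"encoded 0 vs 1:        {cyclical_distance(0, 1):.3f}")
print(f"encoded 0 vs 12:       {cyclical_distance(0, 12):.3f}")
```

As raw integers, 23 and 0 are 23 units apart. After encoding, 23 vs 0 is exactly as close as 0 vs 1, and midnight vs noon is the largest possible distance, which matches how a clock actually works.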

### Project 3: Text Feature Extractor

Convert text data into numerical features for machine learning.


```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import re

# Step 1: Create sample text data (product reviews)
reviews = [
    ("This product is amazing! Works perfectly.", 1),
    ("Terrible quality, broke after one day.", 0),
    ("Great value for money, highly recommend!", 1),
    ("Waste of money, very disappointed.", 0),
    ("Excellent product, exceeded expectations!", 1),
    ("Poor customer service, will not buy again.", 0),
    ("Love it! Best purchase ever.", 1),
    ("Defective item, had to return it.", 0),
    ("Fantastic quality, worth every penny.", 1),
    ("Cheap materials, not worth it.", 0),
    ("Perfect gift, arrived quickly!", 1),
    ("Scam product, nothing like the picture.", 0),
    ("Super happy with this purchase!", 1),
    ("Horrible experience, avoid this seller.", 0),
    ("Outstanding product, will buy again.", 1),
    ("Complete garbage, want my money back.", 0),
]

# Expand dataset. The same reviews repeat in train and test,
# so the scores below are optimistic -- this is a toy example.
np.random.seed(42)
expanded_reviews = reviews * 20
np.random.shuffle(expanded_reviews)

df = pd.DataFrame(expanded_reviews, columns=['review', 'sentiment'])
print(f"Dataset size: {len(df)}")

# Step 2: Text preprocessing
def preprocess_text(text):
    """Clean and preprocess text"""
    # Lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

df['clean_review'] = df['review'].apply(preprocess_text)

# Step 3: Extract text features
def extract_text_features(texts, original_texts):
    """Extract multiple types of text features.

    texts: cleaned reviews; original_texts: raw reviews (for punctuation and caps).
    """
    features = pd.DataFrame()

    # Basic statistics
    features['char_count'] = [len(t) for t in texts]
    features['word_count'] = [len(t.split()) for t in texts]
    features['avg_word_length'] = [
        np.mean([len(w) for w in t.split()]) if t.split() else 0
        for t in texts
    ]
    features['unique_words'] = [len(set(t.split())) for t in texts]
    features['unique_ratio'] = features['unique_words'] / (features['word_count'] + 1)

    # Punctuation features (from original text)
    features['exclamation_count'] = [t.count('!') for t in original_texts]
    features['question_count'] = [t.count('?') for t in original_texts]
    features['caps_ratio'] = [
        sum(1 for c in t if c.isupper()) / (len(t) + 1)
        for t in original_texts
    ]

    # Sentiment lexicon features
    positive_words = {'great', 'amazing', 'excellent', 'love', 'perfect', 'best',
                      'fantastic', 'outstanding', 'happy', 'recommend'}
    negative_words = {'terrible', 'poor', 'waste', 'horrible', 'defective',
                      'scam', 'garbage', 'disappointed', 'worst', 'avoid'}

    features['positive_words'] = [
        sum(1 for w in t.split() if w in positive_words)
        for t in texts
    ]
    features['negative_words'] = [
        sum(1 for w in t.split() if w in negative_words)
        for t in texts
    ]
    features['sentiment_score'] = features['positive_words'] - features['negative_words']

    return features

# Step 4: Combine manual features with TF-IDF
manual_features = extract_text_features(df['clean_review'], df['review'])

# TF-IDF features
# NOTE: fitted on all rows for brevity; in a real project,
# fit the vectorizer on the training split only.
tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
tfidf_matrix = tfidf.fit_transform(df['clean_review'])
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=[f'tfidf_{w}' for w in tfidf.get_feature_names_out()]
)

# Combine all features
X = pd.concat([manual_features.reset_index(drop=True),
               tfidf_df.reset_index(drop=True)], axis=1)
y = df['sentiment']

print(f"\nTotal features: {X.shape[1]}")
print(f"  - Manual features: {manual_features.shape[1]}")
print(f"  - TF-IDF features: {tfidf_df.shape[1]}")

# Step 5: Train and evaluate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("\n📊 Classification Report:")
print(classification_report(y_test, model.predict(X_test)))

# Step 6: Compare feature sets
print("\n📈 Comparing Feature Sets:")

# Manual features only
model_manual = RandomForestClassifier(n_estimators=100, random_state=42)
model_manual.fit(X_train[manual_features.columns], y_train)
acc_manual = model_manual.score(X_test[manual_features.columns], y_test)

# TF-IDF only
model_tfidf = RandomForestClassifier(n_estimators=100, random_state=42)
model_tfidf.fit(X_train[tfidf_df.columns], y_train)
acc_tfidf = model_tfidf.score(X_test[tfidf_df.columns], y_test)

# Combined
acc_combined = model.score(X_test, y_test)

print(f"  Manual features only: {acc_manual:.4f}")
print(f"  TF-IDF only: {acc_tfidf:.4f}")
print(f"  Combined: {acc_combined:.4f}")

# Top features
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n🏆 Top 10 Most Important Features:")
for i, row in importance.head(10).iterrows():
    print(f"  {row['feature']:30s}: {row['importance']:.4f}")
```

**What you learned:**

* Combining manual features with TF-IDF often works better than either alone
* Simple features like word count and sentiment lexicons are powerful
* Feature engineering captures domain knowledge that pure ML can miss
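One way to combine hand-built features with TF-IDF while fitting the vectorizer on training rows only is to put both inside a single pipeline. The sketch below is one possible arrangement, not the solution's method: the `basic_text_stats` helper, the tiny `toy` dataset, the made-up word "zzzgadget", and the choice of `LogisticRegression` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def basic_text_stats(frame):
    """Hypothetical helper: word count and exclamation count for each review."""
    texts = pd.Series(np.asarray(frame).ravel())
    return np.column_stack([texts.str.split().str.len(), texts.str.count("!")])

toy = pd.DataFrame({
    "review": [
        "Great product, love it!", "Terrible, broke quickly.",
        "Excellent value!", "Waste of money.",
        "Love it, works great!", "Poor quality, very disappointed.",
        "Fantastic zzzgadget!", "Horrible, avoid.",
    ],
    "sentiment": [1, 0, 1, 0, 1, 0, 1, 0],
})
train, test = toy.iloc[:6], toy.iloc[6:]

features = ColumnTransformer([
    # A plain string selects the column as 1D text, which TfidfVectorizer expects.
    ("tfidf", TfidfVectorizer(), "review"),
    # A list selects a 2D frame, which the stats helper flattens itself.
    ("stats", FunctionTransformer(basic_text_stats), ["review"]),
])
text_pipeline = Pipeline([("features", features), ("clf", LogisticRegression())])

# Fitting the pipeline fits the vectorizer on the training rows only.
text_pipeline.fit(train[["review"]], train["sentiment"])
vocabulary = text_pipeline.named_steps["features"].named_transformers_["tfidf"].vocabulary_
predictions = text_pipeline.predict(test[["review"]])

print("test-only word in vocabulary:", "zzzgadget" in vocabulary)
print("predictions:", predictions)
```

Because the word "zzzgadget" appears only in the test rows, it never enters the learned vocabulary, which is the behavior you want when the test set stands in for unseen data.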

### Project 4: Automated Feature Pipeline

Build an end-to-end feature engineering pipeline that handles multiple data types.


```python theme={null} import pandas as pd import numpy as np from sklearn.base import BaseEstimator, TransformerMixin from sklearn.pipeline import Pipeline, FeatureUnion from sklearn.compose import ColumnTransformer from sklearn.preprocessing import StandardScaler, OneHotEncoder from sklearn.impute import SimpleImputer from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split, cross_val_score from sklearn.metrics import classification_report # Step 1: Create realistic mixed-type dataset np.random.seed(42) n = 1000 data = { # Numeric features 'age': np.random.randint(18, 80, n), 'income': np.random.exponential(50000, n), 'credit_score': np.random.normal(700, 50, n).clip(300, 850), 'years_employed': np.random.exponential(5, n), # Categorical features 'education': np.random.choice(['high_school', 'bachelor', 'master', 'phd'], n), 'occupation': np.random.choice(['engineer', 'teacher', 'doctor', 'other'], n), 'marital_status': np.random.choice(['single', 'married', 'divorced'], n), # Features with missing values 'savings': np.where(np.random.random(n) > 0.2, np.random.exponential(10000, n), np.nan), 'num_dependents': np.where(np.random.random(n) > 0.1, np.random.poisson(1, n), np.nan), } df = pd.DataFrame(data) # Target: loan approval approval_prob = ( 0.3 * (df['income'] > 50000).astype(int) + 0.2 * (df['credit_score'] > 700).astype(int) + 0.2 * (df['education'].isin(['master', 'phd'])).astype(int) + 0.1 * (df['years_employed'] > 3).astype(int) ) df['approved'] = (np.random.random(n) < approval_prob / 1.5).astype(int) print("Dataset info:") print(df.info()) print("\nMissing values:") print(df.isnull().sum()) # Step 2: Define custom transformers class FeatureInteractionCreator(BaseEstimator, TransformerMixin): """Create interaction features between numeric columns""" def __init__(self, columns): self.columns = columns def fit(self, X, y=None): return self def transform(self, X): X = pd.DataFrame(X, columns=self.columns) 
        result = pd.DataFrame()

        # Ratios
        result['income_per_year_employed'] = X['income'] / (X['years_employed'] + 1)
        result['savings_to_income'] = X['savings'] / (X['income'] + 1)
        result['credit_age_ratio'] = X['credit_score'] / (X['age'] + 1)

        # Products
        result['income_credit_product'] = X['income'] * X['credit_score'] / 100000

        return result.fillna(0).values


class BinningTransformer(BaseEstimator, TransformerMixin):
    """Bin continuous variables"""

    def __init__(self, column_idx, n_bins=5):
        self.column_idx = column_idx
        self.n_bins = n_bins
        self.bins_ = None

    def fit(self, X, y=None):
        col = X[:, self.column_idx]
        self.bins_ = np.percentile(col[~np.isnan(col)], np.linspace(0, 100, self.n_bins + 1))
        return self

    def transform(self, X):
        col = X[:, self.column_idx]
        binned = np.digitize(col, self.bins_[1:-1])
        return binned.reshape(-1, 1)


# Step 3: Define column types
numeric_cols = ['age', 'income', 'credit_score', 'years_employed', 'savings', 'num_dependents']
categorical_cols = ['education', 'occupation', 'marital_status']

# Step 4: Build preprocessing pipelines
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Column transformer for base features
preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, numeric_cols),
    ('categorical', categorical_pipeline, categorical_cols)
])

# Step 5: Build full pipeline with feature engineering
# First, fit a simple preprocessor to get numeric features for interactions
simple_imputer = ColumnTransformer([
    ('numeric', SimpleImputer(strategy='median'), numeric_cols)
])

# Prepare data
X = df.drop('approved', axis=1)
y = df['approved']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the simple imputer on the training split only
simple_imputer.fit(X_train)
X_train_imputed = simple_imputer.transform(X_train)

# Create interaction features
interaction_creator = FeatureInteractionCreator(numeric_cols)
X_train_interactions = interaction_creator.transform(X_train_imputed)

# Build final preprocessor
X_train_base = preprocessor.fit_transform(X_train)
X_test_base = preprocessor.transform(X_test)

X_test_imputed = simple_imputer.transform(X_test)
X_test_interactions = interaction_creator.transform(X_test_imputed)

# Combine base and interaction features
X_train_final = np.hstack([X_train_base, X_train_interactions])
X_test_final = np.hstack([X_test_base, X_test_interactions])

print(f"\nOriginal features: {X.shape[1]}")
print(f"After preprocessing: {X_train_base.shape[1]}")
print(f"After adding interactions: {X_train_final.shape[1]}")

# Step 6: Train and evaluate
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_final, y_train)

print("\n📊 Model Performance:")
print(classification_report(y_test, model.predict(X_test_final)))

# Step 7: Compare with baseline (no feature engineering)
baseline_model = RandomForestClassifier(n_estimators=100, random_state=42)
baseline_model.fit(X_train_base, y_train)

print("\n📈 Comparison:")
print(f"  Baseline accuracy: {baseline_model.score(X_test_base, y_test):.4f}")
print(f"  With interactions accuracy: {model.score(X_test_final, y_test):.4f}")

# Step 8: Cross-validation for robustness
cv_scores_base = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train_base, y_train, cv=5
)
print(f"\n  Baseline CV: {cv_scores_base.mean():.4f} (+/- {cv_scores_base.std():.4f})")

cv_scores_full = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train_final, y_train, cv=5
)
print(f"  Full features CV: {cv_scores_full.mean():.4f} (+/- {cv_scores_full.std():.4f})")
```

**What you learned:**

* Sklearn pipelines ensure consistent preprocessing between train and test
* Custom transformers let you integrate domain knowledge
* Feature unions (or a manual `np.hstack`, as in this solution) combine multiple feature engineering strategies
* Always compare against a baseline to measure improvement
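
The solution imports `FeatureUnion` but stitches its feature blocks together by hand with `np.hstack`. As a rough sketch of how a feature union could do the same job inside a single pipeline, assuming a hypothetical four-row toy frame and a hypothetical helper `income_per_year` (neither is part of the exercise):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical toy data, not the exercise dataset
toy = pd.DataFrame({
    "income": [40000.0, 65000.0, np.nan, 120000.0],
    "years_employed": [1.0, 4.0, 10.0, np.nan],
})

def income_per_year(X):
    # X is the imputed array: column 0 = income, column 1 = years_employed
    return (X[:, 0] / (X[:, 1] + 1)).reshape(-1, 1)

features = Pipeline([
    # Impute first so both branches of the union see complete data
    ("impute", SimpleImputer(strategy="median")),
    ("union", FeatureUnion([
        ("scaled", StandardScaler()),                     # base features
        ("ratio", FunctionTransformer(income_per_year)),  # interaction feature
    ])),
])

X_out = features.fit_transform(toy)
print(X_out.shape)  # 2 scaled columns + 1 ratio column
```

Because the union lives inside the pipeline, one `fit_transform` / `transform` pair covers both the base and the interaction features, so the two cannot drift apart between train and test.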

***

## Key Takeaways

* Missing values: impute or create indicator variables
* Categorical encoding: one-hot for nominal, ordinal for ordered
* Scaling: StandardScaler or MinMaxScaler for most algorithms
* Feature creation: domain knowledge creates the best features

***

## 🧹 Real-World Messy Data: Complete Guide

### Missing Values Decision Tree

```
Is data missing?
├── < 5% missing
│   └── Safe to drop rows OR simple imputation (mean/median)
├── 5-30% missing
│   ├── Is missingness random?
│   │   ├── Yes → Impute with mean/median/mode
│   │   └── No (informative) → Create "is_missing" indicator + impute
│   └── Consider multiple imputation for important analyses
└── > 30% missing
    ├── Is the feature important?
    │   ├── Yes → Advanced imputation (KNN, iterative)
    │   └── No → Consider dropping the feature
    └── Investigate WHY data is missing
```

```python theme={null}
import pandas as pd

# Production-ready missing value handler
def handle_missing_values(df, missing_threshold=0.3):
    """
    Handle missing values with best practices.
    """
    report = []

    for col in df.columns:
        missing_pct = df[col].isnull().mean()

        if missing_pct == 0:
            continue
        elif missing_pct > missing_threshold:
            report.append(f"⚠️ {col}: {missing_pct:.1%} missing - consider dropping")
        elif pd.api.types.is_numeric_dtype(df[col]):
            # Numeric: impute with median (robust to outliers)
            df[f'{col}_missing'] = df[col].isnull().astype(int)
            # Assign back instead of fillna(inplace=True) on a column selection,
            # which newer pandas versions warn about or ignore
            df[col] = df[col].fillna(df[col].median())
            report.append(f"✓ {col}: imputed with median, created indicator")
        else:
            # Categorical: impute with mode or 'Unknown'
            mode = df[col].mode()
            df[col] = df[col].fillna(mode.iloc[0] if len(mode) > 0 else 'Unknown')
            report.append(f"✓ {col}: imputed with mode/Unknown")

    return df, report
```

### Outlier Detection & Treatment

```python theme={null}
import numpy as np
from scipy import stats

def detect_and_handle_outliers(df, columns, method='iqr', action='cap'):
    """
    Detect outliers using IQR or Z-score, then handle them.
    Parameters:
    - method: 'iqr' (robust) or 'zscore' (assumes normality)
    - action: 'cap' (winsorize), 'remove', or 'flag'
    """
    df = df.copy()  # work on a copy so the caller's frame is not modified

    for col in columns:
        if method == 'iqr':
            Q1, Q3 = df[col].quantile([0.25, 0.75])
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR
        else:  # zscore: bounds at 3 standard deviations from the mean
            mean, std = df[col].mean(), df[col].std()
            lower = mean - 3 * std
            upper = mean + 3 * std

        outliers = (df[col] < lower) | (df[col] > upper)
        n_outliers = outliers.sum()

        if n_outliers > 0:
            if action == 'cap':
                df[col] = df[col].clip(lower=lower, upper=upper)
                print(f"  {col}: capped {n_outliers} outliers to [{lower:.2f}, {upper:.2f}]")
            elif action == 'remove':
                df = df[~outliers].copy()
                print(f"  {col}: removed {n_outliers} outlier rows")
            else:  # flag
                df[f'{col}_is_outlier'] = outliers.astype(int)
                print(f"  {col}: flagged {n_outliers} outliers")

    return df
```

### Handling Skewed Distributions

```python theme={null}
import numpy as np
from sklearn.preprocessing import PowerTransformer

def handle_skewness(df, columns, threshold=1.0):
    """
    Apply log or Yeo-Johnson transform to skewed features.
    """
    from scipy.stats import skew

    for col in columns:
        col_skew = skew(df[col].dropna())

        if abs(col_skew) > threshold:
            if (df[col] > 0).all():
                # Log transform for positive data
                df[f'{col}_log'] = np.log1p(df[col])
                print(f"  {col}: skew={col_skew:.2f} → log transform applied")
            else:
                # Yeo-Johnson for any data
                pt = PowerTransformer(method='yeo-johnson')
                # fit_transform returns shape (n, 1); flatten to a single column
                df[f'{col}_transformed'] = pt.fit_transform(df[[col]]).ravel()
                print(f"  {col}: skew={col_skew:.2f} → Yeo-Johnson applied")

    return df
```

***

## 🔗 Math → ML Connection

**Feature engineering connects to these mathematical concepts:**

| Technique               | Math Concept             | Why It Works                           |
| ----------------------- | ------------------------ | -------------------------------------- |
| **Standardization**     | Z-scores from statistics | Makes gradient descent converge faster |
| **One-hot encoding**    | Orthogonal basis vectors | Each category becomes a dimension      |
| **Log transforms**      | Properties of logarithms | Linearizes exponential relationships   |
| **Polynomial features** | Polynomial functions     | Captures nonlinear patterns            |
| **PCA features**        | Eigenvalue decomposition | Finds directions of max variance       |
| **Interaction terms**   | Cross-products           | Models combined effects                |

The [Linear Algebra course](/courses/math-for-ml-linear-algebra/02-vectors) covers why these transformations work geometrically.

***

## 🚀 Going Deeper (Optional)

### Target Encoding (for High-Cardinality Categoricals)

When a categorical has 1000+ unique values, one-hot encoding creates too many features:

```python theme={null}
import numpy as np
from sklearn.model_selection import KFold

def target_encode(df, cat_col, target_col, n_splits=5):
    """
    Replace category with mean target value
    (using cross-validation to prevent leakage).
    """
    df[f'{cat_col}_target_enc'] = np.nan

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    for train_idx, val_idx in kf.split(df):
        # Calculate means only on training fold
        means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()

        # Apply to validation fold
        df.loc[df.index[val_idx], f'{cat_col}_target_enc'] = \
            df.iloc[val_idx][cat_col].map(means)

    # Fill any remaining NaN (categories unseen in a training fold) with the global mean.
    # Assign back instead of fillna(inplace=True) on a column selection.
    df[f'{cat_col}_target_enc'] = df[f'{cat_col}_target_enc'].fillna(df[target_col].mean())

    return df
```

### Time-Based Features

```python theme={null}
import numpy as np
import pandas as pd

def create_time_features(df, date_col):
    """
    Extract rich features from datetime columns.
    """
    df[date_col] = pd.to_datetime(df[date_col])

    # Basic extractions
    df['year'] = df[date_col].dt.year
    df['month'] = df[date_col].dt.month
    df['day'] = df[date_col].dt.day
    df['dayofweek'] = df[date_col].dt.dayofweek  # 0=Monday
    df['hour'] = df[date_col].dt.hour

    # Cyclical encoding (preserves continuity: Dec → Jan)
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
    df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

    # Business features
    df['is_weekend'] = (df['dayofweek'] >= 5).astype(int)
    df['is_month_start'] = df[date_col].dt.is_month_start.astype(int)
    df['is_month_end'] = df[date_col].dt.is_month_end.astype(int)

    return df
```

### Automated Feature Engineering

```python theme={null}
# Using featuretools for automated feature generation
import featuretools as ft

# transactions_df and customers_df are assumed to be existing DataFrames

# Define entity set
es = ft.EntitySet(id='customers')

es.add_dataframe(dataframe_name='transactions', dataframe=transactions_df,
                 index='transaction_id', time_index='timestamp')
es.add_dataframe(dataframe_name='customers', dataframe=customers_df,
                 index='customer_id')

# Create relationship (parent first, then child)
es.add_relationship('customers', 'customer_id', 'transactions', 'customer_id')

# Automatically generate features
features, feature_names = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    agg_primitives=['sum', 'mean', 'count', 'max', 'min', 'std'],
    trans_primitives=['month', 'year', 'weekday'],
    max_depth=2
)
```

***

## What's Next?

Now you know how to prepare data. But how do you find the best hyperparameters?

Learn Grid Search, Random Search, and Bayesian Optimization