Raw data is messy. Models need clean, meaningful numbers.
Feature engineering is the art of transforming raw data into features that help models learn. It’s the difference between feeding a model “March 15, 1995” (a string it can’t use) and feeding it “30 years old, built pre-2000, winter construction” (numbers that carry meaning).
Here’s a truth that surprises beginners: a simple model with great features almost always beats a complex model with raw features. Feature engineering is where domain knowledge meets data science, and it’s the single highest-leverage activity in most ML projects.
# Drop rows with ANY missing values -- the nuclear option.
# Only safe when: (1) you have plenty of data, (2) missingness is random,
# and (3) the missing rows aren't systematically different from the rest.
df_clean = df.dropna()

# Safer: drop rows missing only in critical columns
df_clean = df.dropna(subset=['age'])
Only drop rows when you have lots of data and missingness is truly random (called MCAR, Missing Completely At Random). If high-income people tend to skip the income question, dropping those rows biases your model toward low-income profiles. Check this by comparing the distributions of other features between the “has missing” and “no missing” groups: if they differ noticeably, the data is not MCAR.
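The group comparison described above can be sketched as follows. This is a minimal illustration on synthetic data, not a formal MCAR test: the column names, the missingness rates, and the rule that older respondents skip the income question more often are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption for illustration): income is more likely
# to be missing for older respondents, so missingness is NOT random.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
})
p_missing = np.where(df["age"] > 45, 0.5, 0.05)
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# Compare another feature (age) between rows with and without income.
is_missing = df["income"].isna()
summary = df.groupby(is_missing)["age"].agg(["count", "mean", "median"])
summary.index = ["income present", "income missing"]
print(summary)

# A large gap between the groups suggests the data is not MCAR.
age_gap = df.loc[is_missing, "age"].mean() - df.loc[~is_missing, "age"].mean()
print(f"Mean age gap (missing - present): {age_gap:.1f} years")
```

On real data you would repeat this for several features, and a clear gap in any of them is a reason to prefer imputation plus an indicator over dropping rows.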
from sklearn.impute import SimpleImputer

# Numeric: fill with mean, median, or constant
numeric_imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
# fit_transform returns a 2D array, so flatten it before assigning to one column
df['age'] = numeric_imputer.fit_transform(df[['age']]).ravel()

# Categorical: fill with mode
# (for a literal 'Unknown', use strategy='constant', fill_value='Unknown')
categorical_imputer = SimpleImputer(strategy='most_frequent')
df['education'] = categorical_imputer.fit_transform(df[['education']]).ravel()
# Create a flag for missing values (can be informative!)
# Why? Because "missing" is often not random -- it carries signal.
# For example, if income is missing, the person might have refused
# to share it, which itself correlates with certain behaviors.
df['age_missing'] = df['age'].isnull().astype(int)
df['age'] = df['age'].fillna(df['age'].median())
Which imputation strategy should you use? Use median for numeric features with outliers (median is robust to extreme values). Use mean for normally distributed features. Use mode (most frequent) for categorical features. And always create a missingness indicator — it’s free information and tree-based models will use it if it’s predictive.
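As a small sketch of the advice above, scikit-learn's `SimpleImputer` can impute and create the missingness indicator in one step through its `add_indicator` parameter. The four-row array is toy data chosen so that the outlier makes the mean and the median disagree.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with a missing value and an outlier (toy data).
X = np.array([[1.0], [np.nan], [3.0], [100.0]])

# add_indicator=True appends a 0/1 column marking which rows were imputed.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)  # column 0: imputed values, column 1: missingness flag

# For comparison: the mean is dragged upward by the outlier.
median_fill = X_out[1, 0]
mean_fill = SimpleImputer(strategy="mean").fit_transform(X)[1, 0]
print(f"median fill: {median_fill:.2f}, mean fill: {mean_fill:.2f}")
```

Here the median fills the gap with 3, while the mean would fill it with roughly 34.67, a value that resembles none of the ordinary rows.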
Use this when categories have a natural order — like education levels, satisfaction ratings, or T-shirt sizes. The numbers you assign should reflect the ranking.
# Ordinal: order matters -- PhD > Master > Bachelor > High School
# The numeric values (0,1,2,3) encode this ranking, which means
# the model can learn "higher education = different outcome."
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
# Build the mapping from the ordered list: {'High School': 0, ..., 'PhD': 3}
education_rank = {level: rank for rank, level in enumerate(education_order)}
df['education_encoded'] = df['education'].map(education_rank)

# Common mistake: using label encoding for nominal categories (like color).
# The model would think red=0 < blue=1 < green=2, which is meaningless.
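The same ordinal mapping can also be expressed with scikit-learn's `OrdinalEncoder`, which may be more convenient inside a pipeline. The sketch below assumes the four education levels from the example above. Sending unseen values to -1 is one possible policy, not the only one, and "Bootcamp" is a made-up unseen category.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"education": ["Master", "High School", "PhD", "Bachelor"]})

# Passing the categories explicitly fixes the ranking; without it the
# encoder would sort the labels alphabetically and lose the real order.
education_order = ["High School", "Bachelor", "Master", "PhD"]
encoder = OrdinalEncoder(
    categories=[education_order],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # assumption: unseen levels are mapped to -1
)
df["education_encoded"] = encoder.fit_transform(df[["education"]]).ravel()
print(df)

# A level never seen during fit is encoded as -1 instead of raising an error.
unseen = encoder.transform(pd.DataFrame({"education": ["Bootcamp"]}))
print(unseen)
```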
Use this when categories have no natural order — like colors, countries, or product types. Each category becomes its own binary column.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Using pandas -- simpler for exploration
df_encoded = pd.get_dummies(df, columns=['color'], prefix='color')

# Using sklearn -- better for pipelines (remembers categories from training)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# handle_unknown='ignore' is critical: if test data has a category
# the model never saw during training, it won't crash -- it just
# sets all one-hot columns to 0 (an "unknown" embedding).
encoded = encoder.fit_transform(df[['color']])
When a categorical feature has hundreds or thousands of unique values (like zip codes or product IDs), one-hot encoding creates an explosion of columns. Target encoding replaces each category with the average target value for that category — essentially asking “what’s the typical outcome for this group?”
# Replace category with mean target value.
# For a city like "San Francisco," this becomes the average house price
# in San Francisco -- a single number that captures location value.
city_means = df.groupby('city')['price'].mean()
df['city_encoded'] = df['city'].map(city_means)
Data leakage warning: Target encoding uses the target variable to create features, so each row’s own label can leak into its feature value. Always compute means on training data only, and consider using smoothed target encoding (blending category mean with global mean) to reduce overfitting on rare categories. Libraries like category_encoders provide smoothed target encoders.
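One common way to blend the category mean with the global mean is a count-weighted average, sketched below. The formula (n * category_mean + m * global_mean) / (n + m), the smoothing strength `m = 2`, and the toy city/price data are assumptions for illustration, and libraries may use a different smoothing scheme. Statistics come from the training frame only and are then mapped onto the test frame.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["A", "A", "A", "A", "B", "C", "C"],
    "price": [100.0, 110.0, 90.0, 100.0, 300.0, 200.0, 220.0],
})
test = pd.DataFrame({"city": ["A", "B", "D"]})  # "D" is unseen in training

m = 2.0  # smoothing strength (hypothetical choice; tune by validation)
global_mean = train["price"].mean()
stats = train.groupby("city")["price"].agg(["mean", "count"])

# Rare categories (small count) are pulled toward the global mean;
# frequent categories stay close to their own mean.
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Unseen categories fall back to the global mean.
test["city_encoded"] = test["city"].map(smoothed).fillna(global_mean)
print(test)
```

City "B" has a single training row at 300, so its encoding is pulled well below 300 toward the global mean of 160, while city "A" with four rows moves much less.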
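$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$
Here $\mu$ is the feature’s mean and $\sigma$ its standard deviation. Centers each feature at 0 and scales it to unit variance. The most common choice for algorithms that are sensitive to feature scale (logistic regression, SVM, neural networks). It does not require normally distributed features, but it works best when they are roughly bell-shaped and free of extreme outliers.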
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Shown on X for brevity. In practice, fit on the training split only,
# then call scaler.transform() on the validation/test split.
X_scaled = scaler.fit_transform(X)

# Result: mean=0, std=1
print(f"Mean: {X_scaled.mean():.4f}")
print(f"Std: {X_scaled.std():.4f}")
$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Maps every feature to [0, 1]. Best when you need bounded values (e.g., neural network inputs, or when features are already uniformly distributed). Sensitive to outliers: one extreme value can squash everything else into a narrow range.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Result: values between 0 and 1
print(f"Min: {X_scaled.min():.4f}")
print(f"Max: {X_scaled.max():.4f}")
Uses median and IQR (interquartile range) instead of mean and std. If your data has outliers that you don’t want to remove, this is your best bet: the median and IQR are barely affected by a few extreme values.
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
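The effect of a single outlier on the three scalers can be seen in a short comparison. This is a toy sketch: the ten values, including the outlier of 1000, are invented for the example. It measures how much of the scaled range the nine ordinary values still occupy.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Nine ordinary values plus one extreme outlier (toy data).
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

spread = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaled = scaler.fit_transform(X).ravel()
    # Range occupied by the nine ordinary values after scaling.
    spread[name] = scaled[:9].max() - scaled[:9].min()
    print(f"{name:>14}: ordinary values span {spread[name]:.3f}")
```

With `MinMaxScaler` and `StandardScaler` the ordinary values collapse into a sliver of the scaled range, while `RobustScaler` keeps them spread over more than one unit because the median and IQR ignore the outlier.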
Quick decision guide for scaling:
| Scaler | Use When | Avoid When |
| --- | --- | --- |
| StandardScaler | Default choice; features are roughly Gaussian | Features have many outliers |
| MinMaxScaler | You need bounded [0,1] values; data is uniformly distributed | |
import numpy as np

# For skewed distributions (right-skewed data like income, prices)
# Log transform compresses the long tail, making the distribution
# more symmetric. This helps linear models that assume normality.
df['income_log'] = np.log1p(df['income'])  # log(1+x) to handle zeros safely

# Square root -- a milder compression than log, good for count data
df['rooms_sqrt'] = np.sqrt(df['rooms'])

# Polynomial features -- captures non-linear relationships.
# If the relationship between age and target is U-shaped (young and old
# are both high-risk), a linear model can't capture this with age alone.
# Adding age^2 gives it the curvature it needs.
df['age_squared'] = df['age'] ** 2
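A quick way to check whether a log transform helped is to compare skewness before and after. The sketch below uses synthetic log-normal "income" data, with distribution parameters that are assumptions for illustration, and pandas' `Series.skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed income (assumption: log-normal distribution).
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=5000))

income_log = np.log1p(income)  # same transform as in the example above

skew_before = income.skew()
skew_after = income_log.skew()
print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```

A skewness near 0 indicates a roughly symmetric distribution. If the value is still far from 0 after the transform, a different transform (or none) may suit the feature better.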
These capture relationships between features that the model might not discover on its own. They encode domain knowledge: “the combination of these two things matters, not just each one individually.”
# Combine features -- each ratio tells a specific story
df['price_per_sqft'] = df['price'] / df['sqft']  # Property value density
df['income_per_person'] = df['household_income'] / df['household_size']  # Individual buying power
df['age_income_ratio'] = df['age'] / df['income']  # Career progression proxy
More features is not always better. Think of it like packing for a trip: bringing everything “just in case” makes your suitcase impossibly heavy and you can never find what you need. Feature selection is choosing to pack only what you’ll actually wear. Irrelevant features add noise, slow training, and can even hurt accuracy by diluting the signal.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix -- the first thing to check.
# Look for: (1) features highly correlated with target (good!)
# and (2) features highly correlated with each other (redundant -- drop one).
corr = df.corr(numeric_only=True)  # skip non-numeric columns

# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()

# Drop highly correlated features.
# Pass feature columns only, so the target itself cannot be dropped.
def drop_correlated_features(df, threshold=0.9):
    corr_matrix = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    return df.drop(columns=to_drop)
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier

# Select top k features by ANOVA F-score.
# f_classif tests whether each feature's mean differs across classes.
# It's fast but only catches linear relationships -- a feature with
# a U-shaped relationship to the target might score low.
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected features:", list(selected_features))
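The U-shaped blind spot mentioned in the comments above can be demonstrated on synthetic data. The sketch compares `f_classif` with `mutual_info_classif`, a different univariate score from scikit-learn that can pick up non-linear dependence. Using mutual information here is a suggestion added for illustration, and the data-generating rule is invented.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
x_u = rng.uniform(-3, 3, n)      # U-shaped link: extreme values -> class 1
x_noise = rng.normal(size=n)     # unrelated to the target
y = (np.abs(x_u) > 1.5).astype(int)
X = np.column_stack([x_u, x_noise])

# ANOVA F-score: compares class means, which are both near 0 for x_u.
f_scores, p_values = f_classif(X, y)
# Mutual information: measures any dependence, linear or not.
mi_scores = mutual_info_classif(X, y, random_state=0)

for name, f, mi in zip(["x_u", "x_noise"], f_scores, mi_scores):
    print(f"{name:>8}: F={f:.2f}  MI={mi:.3f}")
```

The F-score rates `x_u` about as low as pure noise even though it fully determines the target, whereas mutual information separates the two clearly. Mutual information is slower and noisier on small samples, so it complements the F-test rather than replacing it.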
RFE works like a talent show elimination: train a model, eliminate the weakest feature, retrain, repeat. It’s slower but catches feature interactions that univariate tests miss.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

model = RandomForestClassifier(n_estimators=50, random_state=42)
rfe = RFE(model, n_features_to_select=10)
rfe.fit(X, y)

# Get rankings (rank 1 = selected; higher ranks were eliminated earlier)
for name, rank in zip(X.columns, rfe.ranking_):
    print(f"{name}: Rank {rank}")
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Define column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'occupation', 'city']

# Create transformers
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline with model -- this is the professional way to build ML systems.
# The pipeline guarantees that preprocessing steps are applied identically
# during training and prediction, eliminating a whole class of production bugs.
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Use it -- raw data goes in, predictions come out.
# Cross-validation, grid search, and joblib.dump all work
# seamlessly with pipelines because they're a single object.
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)
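To illustrate the point about cross-validation, the sketch below runs `cross_val_score` on a reduced version of such a pipeline. The synthetic columns, the toy target driven by income, and the smaller forest are assumptions chosen so the example is self-contained. Because the whole pipeline is passed in, the imputer, scaler, and encoder are refitted inside each fold, and no statistics from held-out rows reach training.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (assumption for illustration).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
y = (X["income"] > 50_000).astype(int)       # toy target driven by income
X.loc[rng.random(n) < 0.1, "age"] = np.nan   # inject some missing ages

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each fold refits preprocessing + model on its own training portion.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Scaling or imputing the full dataset before splitting would let held-out rows influence the fitted statistics, which is the kind of leak the single-object pipeline avoids.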
Is data missing?
├── < 5% missing
│   └── Safe to drop rows OR simple imputation (mean/median)
├── 5-30% missing
│   ├── Is missingness random?
│   │   ├── Yes → Impute with mean/median/mode
│   │   └── No (informative) → Create "is_missing" indicator + impute
│   └── Consider multiple imputation for important analyses
└── > 30% missing
    ├── Is the feature important?
    │   ├── Yes → Advanced imputation (KNN, iterative)
    │   └── No → Consider dropping the feature
    └── Investigate WHY data is missing
import pandas as pd

# Production-ready missing value handler
def handle_missing_values(df, missing_threshold=0.3):
    """
    Handle missing values with best practices.
    Returns a new DataFrame and a report; the input is left unchanged.
    """
    df = df.copy()
    report = []
    for col in list(df.columns):
        missing_pct = df[col].isnull().mean()
        if missing_pct == 0:
            continue
        elif missing_pct > missing_threshold:
            report.append(f"⚠️ {col}: {missing_pct:.1%} missing - consider dropping")
        elif pd.api.types.is_numeric_dtype(df[col]):
            # Numeric: impute with median (robust to outliers)
            df[f'{col}_missing'] = df[col].isnull().astype(int)
            # Assign back instead of fillna(inplace=True), which is
            # unreliable on a column selection in recent pandas
            df[col] = df[col].fillna(df[col].median())
            report.append(f"✓ {col}: imputed with median, created indicator")
        else:
            # Categorical: impute with mode or 'Unknown'
            mode = df[col].mode()
            fill_value = mode.iloc[0] if len(mode) > 0 else 'Unknown'
            df[col] = df[col].fillna(fill_value)
            report.append(f"✓ {col}: imputed with mode/Unknown")
    return df, report
When a categorical has 1000+ unique values, one-hot encoding creates too many features:
import numpy as np
from sklearn.model_selection import KFold

def target_encode(df, cat_col, target_col, n_splits=5):
    """
    Replace category with mean target value
    (using cross-validation to prevent leakage).
    Assumes df has a unique index.
    """
    enc_col = f'{cat_col}_target_enc'
    df[enc_col] = np.nan
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Calculate means only on training fold
        means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        # Apply to validation fold
        df.loc[df.index[val_idx], enc_col] = df.iloc[val_idx][cat_col].map(means)
    # Fill any remaining NaN (categories absent from a training fold) with global mean
    df[enc_col] = df[enc_col].fillna(df[target_col].mean())
    return df