# The Prediction Game

> Your first machine learning model - no libraries, just logic

## Starting With Something You Already Know

Forget Python. Forget libraries. Forget math notation.

Let's play a game.

***

## Round 1: The House Price Game

You're a real estate agent. A client asks: *"How much is this house worth?"*

They give you some info:

| Feature     | Value |
| ----------- | ----- |
| Bedrooms    | 3     |
| Bathrooms   | 2     |
| Square Feet | 1,500 |
| Age (years) | 10    |
| Has Pool    | No    |

What's your guess?

### Your Brain's Algorithm

Without realizing it, you do this:

1. Think of similar houses you've seen
2. Remember what they sold for
3. Adjust based on differences
4. Make a guess

**That's machine learning.** You learned patterns from past data and applied them to new data.

***

## Round 2: Let's Be More Systematic

What if I told you the average house in your area sells for:

* **Base price**: \$200,000
* Each bedroom adds about **\$25,000**
* Each bathroom adds about **\$15,000**
* Each square foot adds about **\$150**

Now you can compute:

```
Base:                        $200,000
+ 3 bedrooms × $25,000:    +  $75,000
+ 2 bathrooms × $15,000:   +  $30,000
+ 1,500 sq ft × $150:      + $225,000
                           ----------
Predicted price:             $530,000
```

You just built your first **linear model**!

**The formula you just used:**

`price = base + (bedrooms × weight1) + (bathrooms × weight2) + (sqft × weight3)`

Those "weights" (\$25k, \$15k, \$150) are what machine learning learns automatically from data.

***

## Let's Code It (Still No Libraries!)

```python theme={null}
# Your first "model" - just a function!
def predict_house_price(bedrooms, bathrooms, sqft):
    base = 200000
    bedroom_value = 25000
    bathroom_value = 15000
    sqft_value = 150

    predicted = (
        base
        + bedrooms * bedroom_value
        + bathrooms * bathroom_value
        + sqft * sqft_value
    )
    return predicted

# Test it
house1 = predict_house_price(3, 2, 1500)
print(f"House 1 predicted: ${house1:,}")  # $530,000

house2 = predict_house_price(4, 3, 2200)
print(f"House 2 predicted: ${house2:,}")  # $675,000
```

***

## The Million Dollar Question

But wait... how did we know those weights?

* Why \$25,000 per bedroom and not \$30,000?
* Why \$150 per sq ft and not \$200?

**We guessed.** And our guesses might be wrong.

**Machine learning answers this**: Given a bunch of houses with known prices, can we *figure out* the best weights automatically?
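To see why the choice of weights matters, here is a minimal sketch that passes the weights in as a parameter instead of hard-coding them. The function name `predict` and the second guess (`guess_b`) are illustrative assumptions, not part of the lesson's code; the first guess reuses the weights from above.

```python
# Sketch: the same linear formula, with the weights passed in as a tuple.
# `predict`, `guess_a`, and `guess_b` are hypothetical names for illustration.
def predict(bedrooms, bathrooms, sqft, weights):
    base, w_bed, w_bath, w_sqft = weights
    return base + bedrooms * w_bed + bathrooms * w_bath + sqft * w_sqft

guess_a = (200000, 25000, 15000, 150)  # the weights used above
guess_b = (200000, 30000, 15000, 200)  # an equally plausible-looking guess

price_a = predict(3, 2, 1500, guess_a)
price_b = predict(3, 2, 1500, guess_b)

print(f"Guess A: ${price_a:,}")               # $530,000
print(f"Guess B: ${price_b:,}")               # $620,000
print(f"Difference: ${price_b - price_a:,}")  # $90,000
```

Two reasonable-looking guesses disagree by \$90,000 on the same house, which is why we want data to pick the weights for us.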
***

## Real Data, Real Problem

Here's actual data (simplified):

```python theme={null}
# Past house sales (our "training data")
houses = [
    # [bedrooms, bathrooms, sqft] -> actual_price
    {"features": [2, 1, 1000], "price": 250000},
    {"features": [3, 2, 1500], "price": 380000},
    {"features": [4, 2, 1800], "price": 450000},
    {"features": [3, 3, 2000], "price": 520000},
    {"features": [5, 4, 3000], "price": 750000},
]
```

**Our goal**: Find weights that make our predictions match these actual prices as closely as possible.

***

## Step 1: How Wrong Are We?

If we use our guessed weights, let's see how we do:

```python theme={null}
def predict_house_price(features):
    bedrooms, bathrooms, sqft = features
    base = 200000
    return base + bedrooms * 25000 + bathrooms * 15000 + sqft * 150

# Check each house
for house in houses:
    predicted = predict_house_price(house["features"])
    actual = house["price"]
    error = predicted - actual
    print(f"Predicted: ${predicted:,}, Actual: ${actual:,}, Error: ${error:,}")
```

**Output:**

```
Predicted: $415,000, Actual: $250,000, Error: $165,000
Predicted: $530,000, Actual: $380,000, Error: $150,000
Predicted: $600,000, Actual: $450,000, Error: $150,000
Predicted: $620,000, Actual: $520,000, Error: $100,000
Predicted: $835,000, Actual: $750,000, Error: $85,000
```

Every error is positive, so we're consistently too high! Our weights are off.

***

## Step 2: Measure Total "Wrongness"

We need a single number that tells us how wrong we are overall.

**Simple approach**: Sum of all errors

```python theme={null}
total_error = 0
for house in houses:
    predicted = predict_house_price(house["features"])
    actual = house["price"]
    error = predicted - actual
    total_error += error

print(f"Total error: ${total_error:,}")  # $650,000 too high overall
```

**Problem**: What if some errors are positive and some negative? They cancel out!
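A tiny worked example makes the cancellation concrete. The two houses and predictions below are made-up numbers chosen only to illustrate the point; they are not from the training data.

```python
# Hypothetical example: two predictions that are each off by $50,000,
# one too high and one too low.
examples = [
    {"predicted": 350000, "actual": 300000},  # $50,000 too high
    {"predicted": 350000, "actual": 400000},  # $50,000 too low
]

errors = [e["predicted"] - e["actual"] for e in examples]
total_error = sum(errors)

print(errors)       # [50000, -50000]
print(total_error)  # 0
```

A total of 0 suggests a perfect model, yet both predictions missed by \$50,000.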
**Better approach**: Sum of squared errors

```python theme={null}
total_squared_error = 0
for house in houses:
    predicted = predict_house_price(house["features"])
    actual = house["price"]
    error = predicted - actual
    total_squared_error += error ** 2

print(f"Total squared error: {total_squared_error:,.0f}")
```

This is called the **Loss Function** or **Cost Function**. Lower is better!

**Why squared?** Think of it like grading a student's exam:

1. **No negative numbers** -- errors can't cancel out (a +\$50K overshoot shouldn't "forgive" a -\$50K undershoot)
2. **Big errors get penalized more** -- being off by \$100K is more than twice as bad as being off by \$50K. Squaring enforces this: 100,000² = 10 billion vs 50,000² = 2.5 billion (a 4x penalty for a 2x error)
3. **Smooth and differentiable** -- the curve has no sharp corners, so gradient descent can glide smoothly toward the minimum (we'll need this in Module 2)

There are alternatives -- Mean Absolute Error (MAE) weights every dollar of error the same and is more robust to outliers. But Mean Squared Error (MSE, the sum of squared errors divided by the number of examples) is the default starting point because its math is cleaner and it punishes the predictions you're most embarrassingly wrong about.

***

## Step 3: Try Different Weights

What if we try different values?
```python theme={null}
def calculate_total_error(base, bed_weight, bath_weight, sqft_weight):
    total_squared_error = 0
    for house in houses:
        bedrooms, bathrooms, sqft = house["features"]
        predicted = base + bedrooms * bed_weight + bathrooms * bath_weight + sqft * sqft_weight
        actual = house["price"]
        error = predicted - actual
        total_squared_error += error ** 2
    return total_squared_error

# Our original guess
error1 = calculate_total_error(200000, 25000, 15000, 150)
print(f"Original weights error: {error1:,.0f}")

# Try lowering everything
error2 = calculate_total_error(100000, 20000, 10000, 100)
print(f"Lower weights error: {error2:,.0f}")

# Try something else
error3 = calculate_total_error(50000, 15000, 25000, 175)
print(f"Alternative weights error: {error3:,.0f}")
```

**The challenge**: There are infinite combinations of weights. How do we find the best ones?

***

## The Insight: Systematic Search

What if we:

1. Start with random weights
2. Check how wrong we are
3. Slightly adjust weights
4. If error goes down, keep the change
5. Repeat until error stops improving

This is the core idea behind **Gradient Descent** - which we'll explore in the next module! Gradient descent adds one refinement: instead of guessing which way to adjust, it calculates the direction that lowers the error.

Think of it like tuning a radio dial in the dark. Random search is spinning the dial blindly and hoping for a good station. Gradient descent is *listening to the static* -- when it gets quieter, you keep turning that direction.

```python theme={null}
import random

# A simple (but slow) approach: try lots of combinations
# This is "random search" -- the brute force method.
# It works, but it's like trying every combination on a lock
# instead of listening for the click.
best_error = float('inf')
best_weights = None

for _ in range(10000):  # Try 10,000 random combinations
    base = random.randint(0, 200000)
    bed = random.randint(5000, 50000)
    bath = random.randint(5000, 50000)
    sqft = random.randint(50, 300)

    error = calculate_total_error(base, bed, bath, sqft)

    if error < best_error:
        best_error = error
        best_weights = (base, bed, bath, sqft)

print(f"Best weights found: {best_weights}")
print(f"Best error: {best_error:,.0f}")

# Note: with 4 weights and infinite possible values,
# random search has astronomically low odds of finding
# the true optimum. We need something smarter.
```

***

## What You Just Learned

Let's recap with proper ML terminology:

| What You Did                      | ML Term                   |
| --------------------------------- | ------------------------- |
| Used past house sales             | **Training Data**         |
| Features like bedrooms, sqft      | **Input Features** (X)    |
| The actual price                  | **Target/Label** (y)      |
| The weights (\$25k, \$15k, etc.)  | **Model Parameters**      |
| The prediction formula            | **Model**                 |
| How wrong our predictions were    | **Loss/Error**            |
| Sum of squared errors             | **Loss Function**         |
| Trying to minimize error          | **Training/Optimization** |

***

## The Mathematical Connection

When you calculated:

```
price = base + (bedrooms × weight1) + (bathrooms × weight2) + (sqft × weight3)
```

In math notation, this is:

$$
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3
$$

Or in vector form (from our [Linear Algebra course](/courses/math-for-ml-linear-algebra/03-matrices)), where $\mathbf{x} = (1, x_1, x_2, x_3)$ starts with a constant 1 so that $w_0$ plays the role of the base price:

$$
\hat{y} = \mathbf{w} \cdot \mathbf{x}
$$

This is a **dot product** - the same operation you do when calculating weighted grades!

***

## 🚀 Mini Projects

* Build a house price estimator from scratch
* Create a used car valuation tool
* Visualize prediction errors and find patterns

**Project 1: House Price Estimator** - Build your first ML predictor

**Objective**: Create a simple house price predictor and manually tune the weights.

**Tasks**:

1. Implement a prediction function with adjustable weights
2. Calculate the total squared error
3. Manually tune weights to minimize error
4. Predict prices for new houses

```python theme={null}
# Training data: houses with known prices
houses = [
    {"bedrooms": 2, "bathrooms": 1, "sqft": 1000, "price": 250000},
    {"bedrooms": 3, "bathrooms": 2, "sqft": 1500, "price": 380000},
    {"bedrooms": 4, "bathrooms": 2, "sqft": 1800, "price": 450000},
    {"bedrooms": 3, "bathrooms": 3, "sqft": 2000, "price": 520000},
    {"bedrooms": 5, "bathrooms": 4, "sqft": 3000, "price": 750000},
    {"bedrooms": 2, "bathrooms": 1, "sqft": 900, "price": 220000},
    {"bedrooms": 4, "bathrooms": 3, "sqft": 2500, "price": 620000},
]

def predict_price(house, weights):
    """Predict house price using weighted features."""
    base, w_bed, w_bath, w_sqft = weights
    return (base
            + w_bed * house["bedrooms"]
            + w_bath * house["bathrooms"]
            + w_sqft * house["sqft"])

def calculate_error(houses, weights):
    """Calculate total squared error."""
    total_error = 0
    for house in houses:
        predicted = predict_price(house, weights)
        actual = house["price"]
        total_error += (predicted - actual) ** 2
    return total_error

# TODO: Try different weights and minimize the error
# Start with these weights:
initial_weights = [100000, 30000, 20000, 100]  # [base, per_bedroom, per_bathroom, per_sqft]

# Your goal: Find weights that give low total error
# Try adjusting each weight up and down to see the effect
```

**Solution**:

```python theme={null}
# Grid search for best weights (simplified brute force)
best_error = float('inf')
best_weights = None

# Search over a range of weight combinations
for base in range(50000, 150000, 25000):
    for w_bed in range(10000, 50000, 10000):
        for w_bath in range(10000, 40000, 10000):
            for w_sqft in range(50, 200, 25):
                weights = [base, w_bed, w_bath, w_sqft]
                error = calculate_error(houses, weights)
                if error < best_error:
                    best_error = error
                    best_weights = weights

print(f"Best weights: {best_weights}")
# Squared error is in squared dollars, so no "$" prefix here
print(f"Best total squared error: {best_error:,.0f}")

# Verify predictions
print("\n--- Predictions vs Actual ---")
for house in houses:
    pred = predict_price(house, best_weights)
    actual = house["price"]
    print(f"Predicted: ${pred:,.0f}, Actual: ${actual:,}, Error: ${abs(pred - actual):,.0f}")

# Predict new house
new_house = {"bedrooms": 3, "bathrooms": 2, "sqft": 1700}
prediction = predict_price(new_house, best_weights)
print(f"\nNew house prediction: ${prediction:,.0f}")
```
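The five steps from "The Insight: Systematic Search" can also be sketched directly, as an alternative to the grid search. This is a minimal hill-climbing loop, not gradient descent: it nudges one weight at a time and keeps the change only if the error drops. The function names, step sizes, iteration count, and seed are all assumptions chosen for illustration, and the block redefines a small dataset so it runs on its own.

```python
import random

# Five of the training houses, repeated here so the sketch is self-contained.
houses = [
    {"bedrooms": 2, "bathrooms": 1, "sqft": 1000, "price": 250000},
    {"bedrooms": 3, "bathrooms": 2, "sqft": 1500, "price": 380000},
    {"bedrooms": 4, "bathrooms": 2, "sqft": 1800, "price": 450000},
    {"bedrooms": 3, "bathrooms": 3, "sqft": 2000, "price": 520000},
    {"bedrooms": 5, "bathrooms": 4, "sqft": 3000, "price": 750000},
]

def total_squared_error(weights):
    """Sum of squared errors for [base, per_bedroom, per_bathroom, per_sqft]."""
    base, w_bed, w_bath, w_sqft = weights
    total = 0
    for h in houses:
        predicted = (base + w_bed * h["bedrooms"]
                     + w_bath * h["bathrooms"] + w_sqft * h["sqft"])
        total += (predicted - h["price"]) ** 2
    return total

def hill_climb(start, step_sizes, iterations=5000, seed=0):
    """Nudge one weight at a time; keep the nudge only if the error goes down."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    weights = list(start)
    best = total_squared_error(weights)
    for _ in range(iterations):
        i = rng.randrange(len(weights))                       # pick a weight
        candidate = list(weights)
        candidate[i] += rng.choice([-1, 1]) * step_sizes[i]   # small adjustment
        error = total_squared_error(candidate)
        if error < best:                                      # keep improvements only
            weights, best = candidate, error
    return weights, best

start = [100000, 30000, 20000, 100]
start_error = total_squared_error(start)
tuned, tuned_error = hill_climb(start, step_sizes=[1000, 500, 500, 1])

print(f"Starting error: {start_error:,.0f}")
print(f"Tuned weights:  {tuned}")
print(f"Tuned error:    {tuned_error:,.0f}")
```

Because a change is only kept when it lowers the error, the tuned error can never be worse than the starting error. What this loop lacks is any sense of *which* direction is promising, which is the gap gradient descent fills.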

**Project 2: Used Car Valuation Tool** - Handle negative relationships

**Objective**: Build a car price predictor where some features decrease value (age, mileage).

**Key Learning**: Not all features have positive relationships with the target!

```python theme={null}
# Car data: age and mileage should decrease price!
cars = [
    {"age_years": 1, "mileage_k": 10, "horsepower": 200, "price": 35000},
    {"age_years": 3, "mileage_k": 35, "horsepower": 180, "price": 25000},
    {"age_years": 5, "mileage_k": 60, "horsepower": 220, "price": 22000},
    {"age_years": 2, "mileage_k": 25, "horsepower": 300, "price": 45000},
    {"age_years": 7, "mileage_k": 90, "horsepower": 160, "price": 12000},
    {"age_years": 4, "mileage_k": 45, "horsepower": 250, "price": 28000},
    {"age_years": 1, "mileage_k": 8, "horsepower": 350, "price": 55000},
    {"age_years": 10, "mileage_k": 150, "horsepower": 180, "price": 8000},
]

def predict_car_price(car, weights):
    """
    Predict car price.
    Note: age and mileage should have NEGATIVE weights!
    """
    base, w_age, w_mileage, w_hp = weights
    return (base
            + w_age * car["age_years"]
            + w_mileage * car["mileage_k"]
            + w_hp * car["horsepower"])

# TODO: Find weights where age and mileage are negative
# Hint: Age weight might be around -2000 to -4000 per year
# Hint: Mileage weight might be around -100 to -300 per 1000 miles
```

**Solution**:

```python theme={null}
def calculate_car_error(cars, weights):
    """Calculate mean squared error."""
    total = 0
    for car in cars:
        pred = predict_car_price(car, weights)
        total += (pred - car["price"]) ** 2
    return total / len(cars)

# Search with negative weights for age and mileage
best_error = float('inf')
best_weights = None

for base in range(40000, 60000, 5000):
    for w_age in range(-5000, -1000, 500):        # Negative!
        for w_mileage in range(-300, -50, 50):    # Negative!
            for w_hp in range(50, 200, 25):       # Positive
                weights = [base, w_age, w_mileage, w_hp]
                error = calculate_car_error(cars, weights)
                if error < best_error:
                    best_error = error
                    best_weights = weights

print(f"Best weights: {best_weights}")
print(f"  Base price: ${best_weights[0]:,}")
print(f"  Per year age: ${best_weights[1]:,} (negative = older costs less)")
print(f"  Per 1K miles: ${best_weights[2]:,} (negative = more miles costs less)")
print(f"  Per horsepower: ${best_weights[3]:,}")

# Interpret
print("\n--- Interpretation ---")
print(f"Each year of age reduces value by ${abs(best_weights[1]):,}")
print(f"Each 10,000 miles reduces value by ${abs(best_weights[2] * 10):,}")
print(f"Each horsepower adds ${best_weights[3]:,}")

# Test predictions
print("\n--- Validation ---")
for car in cars[:3]:
    pred = predict_car_price(car, best_weights)
    print(f"{car['age_years']}yr old, {car['mileage_k']}k mi, {car['horsepower']}hp")
    print(f"  Predicted: ${pred:,.0f}, Actual: ${car['price']:,}\n")
```

**Project 3: Error Analysis Dashboard** - Visualize and understand errors

**Objective**: Analyze prediction errors to understand model behavior.

**Key Learning**: Visualizing errors reveals patterns and helps improve models.

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt

# House data
houses = [
    {"sqft": 1000, "price": 250000},
    {"sqft": 1200, "price": 290000},
    {"sqft": 1500, "price": 380000},
    {"sqft": 1800, "price": 420000},
    {"sqft": 2000, "price": 500000},
    {"sqft": 2200, "price": 550000},
    {"sqft": 2500, "price": 620000},
    {"sqft": 3000, "price": 780000},
]

# Simple model: price = base + sqft * price_per_sqft
def analyze_model(base, price_per_sqft):
    """Analyze a simple linear model."""
    sqfts = [h["sqft"] for h in houses]
    actuals = [h["price"] for h in houses]
    predictions = [base + price_per_sqft * sqft for sqft in sqfts]
    errors = [pred - actual for pred, actual in zip(predictions, actuals)]

    # Statistics
    mse = np.mean([e**2 for e in errors])
    mae = np.mean([abs(e) for e in errors])

    return sqfts, actuals, predictions, errors, mse, mae

# TODO: Try different base and price_per_sqft values
# Analyze which gives best results
```

**Solution**:

```python theme={null}
# Test multiple models
models = [
    {"base": 50000, "price_per_sqft": 200, "name": "Model A: High per-sqft"},
    {"base": 100000, "price_per_sqft": 150, "name": "Model B: Balanced"},
    {"base": 0, "price_per_sqft": 250, "name": "Model C: No base"},
]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

for i, model in enumerate(models):
    sqfts, actuals, predictions, errors, mse, mae = analyze_model(
        model["base"], model["price_per_sqft"]
    )

    # Top row: Predictions vs Actuals
    ax1 = axes[0, i]
    ax1.scatter(sqfts, actuals, label='Actual', s=100)
    ax1.plot(sqfts, predictions, 'r-', label='Predicted', linewidth=2)
    ax1.set_xlabel('Square Feet')
    ax1.set_ylabel('Price ($)')
    # MSE is in squared dollars, so it is shown without a "$" prefix
    ax1.set_title(f'{model["name"]}\nMSE: {mse:,.0f}')
    ax1.legend()

    # Bottom row: Error per house (green = underpredicted, red = overpredicted)
    ax2 = axes[1, i]
    colors = ['green' if e < 0 else 'red' for e in errors]
    ax2.bar(range(len(errors)), errors, color=colors, alpha=0.7)
    ax2.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    ax2.set_xlabel('House Index')
    ax2.set_ylabel('Error ($)')
    ax2.set_title(f'MAE: ${mae:,.0f}')

plt.tight_layout()
plt.savefig('error_analysis.png', dpi=100)
print("Saved error_analysis.png")

# Find best model (index 4 of the returned tuple is the MSE)
best_model = min(models, key=lambda m: analyze_model(m["base"], m["price_per_sqft"])[4])
print(f"\nBest model: {best_model['name']}")
print(f"Formula: price = ${best_model['base']:,} + sqft × ${best_model['price_per_sqft']}")

# Pattern analysis
print("\n--- Error Pattern Analysis ---")
_, _, _, errors, _, _ = analyze_model(best_model["base"], best_model["price_per_sqft"])
if all(e < 0 for e in errors[:3]) and all(e > 0 for e in errors[-3:]):
    print("Pattern: Underpredicting small houses, overpredicting large houses")
    print("Suggestion: Reduce price_per_sqft, increase base")
elif all(e > 0 for e in errors):
    print("Pattern: Overpredicting all houses")
    print("Suggestion: Reduce base or price_per_sqft")
else:
    print("Errors are mixed - model is reasonably balanced")
```
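Project 3 reports both MSE and MAE, and the two can tell different stories. The sketch below uses made-up error lists (not output from the models above) to show how a single large miss moves each metric; the helper names `mse` and `mae` are just for illustration.

```python
def mse(errors):
    """Mean squared error: average of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)

def mae(errors):
    """Mean absolute error: average size of the errors."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical prediction errors in dollars
typical = [10000, -10000, 10000, -10000]
with_outlier = [10000, -10000, 10000, -100000]  # one prediction is way off

print(f"MAE: {mae(typical):,.0f} -> {mae(with_outlier):,.0f}")
# MAE: 10,000 -> 32,500
print(f"MSE: {mse(typical):,.0f} -> {mse(with_outlier):,.0f}")
# MSE: 100,000,000 -> 2,575,000,000
```

One bad prediction makes MAE 3.25 times larger but MSE 25.75 times larger. When comparing models by MSE, it is worth checking whether one house is driving the result.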

***

## Key Takeaways

* Find patterns in past data, apply to new data
* The learned weights encode what matters
* Lower loss = better predictions
* Find weights that make predictions best match reality

***

## Practice Challenge

Try this on your own:

```python theme={null}
# New dataset: Car prices
cars = [
    # [age_years, mileage_k, horsepower] -> price
    {"features": [2, 15, 200], "price": 35000},
    {"features": [5, 50, 180], "price": 22000},
    {"features": [1, 8, 250], "price": 45000},
    {"features": [8, 100, 150], "price": 12000},
    {"features": [3, 30, 220], "price": 32000},
]

# Your task:
# 1. Create a predict_car_price function with guessed weights
# 2. Calculate total squared error
# 3. Try different weights and find better ones
# 4. What patterns do you notice? (age and mileage should be negative!)
```

**Key insight**: Unlike houses where more is usually better, for cars:

* **Older** cars are worth **less** (negative weight for age)
* **Higher mileage** is worth **less** (negative weight for mileage)
* **More horsepower** is worth **more** (positive weight)

Try something like:

```python theme={null}
price = 50000 - (age * 3000) - (mileage * 200) + (horsepower * 100)
```

***

## Next Up

In the next module, we'll learn:

* How to **systematically** find the best weights (not just random guessing)
* The key insight of **gradient descent** - following the slope downhill
* How this connects to [calculus](/courses/math-for-ml-calculus/01-derivatives)

Next: discover gradient descent - the algorithm behind most modern ML training.

***

## 🔗 Math → ML Connection

**What you learned in this module connects to formal ML:**

| Concept in This Module                     | Formal ML Term       | Where It's Used                               |
| ------------------------------------------ | -------------------- | --------------------------------------------- |
| Guessing weights                           | **Model parameters** | Every ML model has parameters to learn        |
| Formula: `price = base + weight × feature` | **Linear model**     | Neural network layers, linear regression      |
| Measuring "wrongness"                      | **Loss function**    | Training any model (MSE, cross-entropy, etc.) |
| Finding better weights                     | **Optimization**     | Gradient descent, Adam, SGD                   |
| Past data with answers                     | **Training data**    | Supervised learning                           |

**Next module**: We'll replace "random guessing" with a systematic approach called **gradient descent** - the same family of algorithms used to train models like ChatGPT!

***

## 🚀 Going Deeper (Optional)

**For learners who want the formal treatment:**

### Matrix Formulation

What we wrote as:

```
price = base + w1×bedrooms + w2×bathrooms + w3×sqft
```

Can be written in matrix form as:

$\hat{y} = X \mathbf{w}$

Where:

* $X$ is the **feature matrix** (each row is a house, each column is a feature, plus a leading column of ones so the base is learned as $w_0$)
* $\mathbf{w}$ is the **weight vector**
* $\hat{y}$ is the **prediction vector**

### Why Squared Error?

We use squared error (not absolute error) because:

1. It's **differentiable** - we can compute gradients (needed for Module 2)
2. It **penalizes large errors more** - one \$100K error counts as worse than two \$50K errors
3. It leads to **closed-form solutions** in linear regression

### Closed-Form Solution

For linear regression, there's actually a formula that gives optimal weights directly:

$\mathbf{w}^* = (X^T X)^{-1} X^T y$

We'll derive this in the [Linear Regression module](/courses/ml-mastery/03-linear-regression).

### Recommended Resources

* [3Blue1Brown: Essence of Linear Algebra](https://www.3blue1brown.com/topics/linear-algebra) - Visual intuition
* [Our Linear Algebra Course](/courses/math-for-ml-linear-algebra/02-vectors) - Full treatment