> ## Documentation Index
> Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Data Structures

> Lists, Dictionaries, Sets, and Tuples

<img src="https://mintcdn.com/devweeekends/X0Fp4X8lMl-ZftoO/images/courses/python-crash-course/python-data-structures.svg?fit=max&auto=format&n=X0Fp4X8lMl-ZftoO&q=85&s=dfbada9241227b804f83276f02a418fe" alt="Python Data Structures" width="1080" height="1080" data-path="images/courses/python-crash-course/python-data-structures.svg" />

# Data Structures

Python's built-in data structures are incredibly powerful. Unlike C++ or Java where you might need to import special libraries for hash maps or dynamic arrays, Python has them built into the core language syntax.

Choosing the right data structure is like choosing the right container in a kitchen. A spice rack (dictionary) gives you instant lookup by name. A queue at a deli counter (list/deque) maintains order. A cookie cutter set (set) guarantees no duplicates. Using the wrong container does not ruin the meal -- but it makes cooking slower and messier than it needs to be.

***

## 1. Lists (Dynamic Arrays)

A `list` is an ordered, mutable sequence of items. It can hold mixed types (integers, strings, other lists). Under the hood, a Python list is a dynamic array -- a contiguous block of pointers to objects. This means index access is O(1), but inserting at the beginning is O(n) because every pointer must shift.

```python theme={null}
fruits = ["apple", "banana", "cherry"]

# Access -- O(1), instant lookup by index
print(fruits[0])   # apple
print(fruits[-1])  # cherry (Negative indexing: -1 is last, -2 is second-to-last)

# Modify
fruits.append("date")      # O(1) amortized -- add to end
fruits.insert(1, "berry")  # O(n) -- shifts everything after index 1
fruits.pop()               # O(1) -- remove and return last item
fruits.pop(0)              # O(n) -- remove from front shifts everything

# If you need fast insert/remove from both ends, use collections.deque instead
```

<Info>
  **Performance Pitfall**: `list.insert(0, x)` and `list.pop(0)` are O(n) operations. If you are building a queue (FIFO), use `collections.deque` which gives O(1) append and popleft. This is a common interview question and a real production performance trap.

  ```python theme={null}
  from collections import deque
  queue = deque()
  queue.append("first")    # Add to right -- O(1)
  queue.appendleft("zero") # Add to left -- O(1)
  queue.popleft()           # Remove from left -- O(1)
  ```
</Info>

### Slicing `[start:end:step]`

Slicing allows you to extract sub-parts of a list. It creates a **new** list.

```python theme={null}
nums = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(nums[0:5])   # [0, 1, 2, 3, 4] (Start is inclusive, End is exclusive)
print(nums[:5])    # [0, 1, 2, 3, 4] (Start defaults to 0)
print(nums[5:])    # [5, 6, 7, 8, 9] (End defaults to length)
print(nums[::2])   # [0, 2, 4, 6, 8] (Step 2: every second item)
print(nums[::-1])  # [9, 8, ..., 0]  (Reverse the list!)
```

### List Comprehensions

This is the "Pythonic" way to create lists based on existing iterables. It is not just syntactic sugar -- comprehensions are actually faster than the equivalent for-loop with `.append()` because CPython optimizes the bytecode for this pattern.

```python theme={null}
# The Old Way (explicit loop with append)
squares = []
for x in range(10):
    squares.append(x**2)

# The Pythonic Way -- list comprehension
# Pattern: [expression for item in iterable]
squares = [x**2 for x in range(10)]

# With a condition (filter items)
evens = [x for x in range(10) if x % 2 == 0]

# Nested comprehension (flatten a matrix)
matrix = [[1, 2], [3, 4], [5, 6]]
flat = [num for row in matrix for num in row]  # [1, 2, 3, 4, 5, 6]
# Read it as: "for each row in matrix, for each num in row, give me num"

# With if-else (note: goes BEFORE the for, not after)
labels = ["even" if x % 2 == 0 else "odd" for x in range(5)]
# ["even", "odd", "even", "odd", "even"]
```

<Warning>
  **Comprehension Pitfall**: Do not nest more than two levels deep. A triple-nested comprehension is technically valid but nearly unreadable. If you need three loops, use a regular `for` loop -- readability counts.

  ```python theme={null}
  # Bad -- unreadable triple nesting
  result = [f(x, y, z) for x in xs for y in ys for z in zs if condition(x, y, z)]

  # Good -- explicit loops with clear names
  result = []
  for x in xs:
      for y in ys:
          for z in zs:
              if condition(x, y, z):
                  result.append(f(x, y, z))
  ```
</Warning>

***

## 2. Dictionaries (Hash Maps)

A `dict` stores **Key-Value pairs**. Keys must be **hashable** (immutable types: strings, numbers, tuples of immutables) and unique. Lookups are O(1) average case because dictionaries are implemented as hash tables.

Since Python 3.7, dictionaries **preserve insertion order** as a language guarantee (it was an implementation detail in 3.6). This means you can rely on iteration order matching the order you inserted keys.

```python theme={null}
user = {
    "name": "Alice",
    "age": 30,
    "is_admin": True
}

# Access -- O(1) average case
print(user["name"])
# print(user["email"])  # KeyError! Key does not exist.

# Safe Access -- the Pythonic way to handle missing keys
print(user.get("email"))          # Returns None (no crash)
print(user.get("email", "N/A"))   # Returns "N/A" (custom default)

# Modify
user["email"] = "alice@example.com"  # Add or overwrite a key
del user["is_admin"]                  # Remove a key (KeyError if missing)
user.pop("is_admin", None)            # Safer remove -- returns None if missing

# Iterate -- .items() gives (key, value) pairs
for key, value in user.items():
    print(f"{key}: {value}")
```

### Merging Dictionaries

```python theme={null}
defaults = {"color": "blue", "size": "medium"}
overrides = {"size": "large", "weight": "heavy"}

# Python 3.9+ -- the merge operator (cleanest)
merged = defaults | overrides  # {"color": "blue", "size": "large", "weight": "heavy"}

# Python 3.5+ -- unpacking (works everywhere)
merged = {**defaults, **overrides}

# In-place update (mutates defaults)
defaults.update(overrides)
```

### The `setdefault` and Walrus Operator Patterns

```python theme={null}
# setdefault: get a value, or insert a default if missing (one atomic operation)
# Useful for building up groups without checking "is key present?"
groups = {}
for name, dept in [("Alice", "Eng"), ("Bob", "Eng"), ("Carol", "Sales")]:
    groups.setdefault(dept, []).append(name)
# {"Eng": ["Alice", "Bob"], "Sales": ["Carol"]}

# Walrus operator (:=) with dict -- Python 3.8+
# Assign and test in one expression
if (email := user.get("email")) is not None:
    send_notification(email)
```

### Dict Comprehensions

Just like lists, you can build dictionaries dynamically.

```python theme={null}
# Map number to its square
squares = {x: x**2 for x in range(5)}
# Result: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
```

***

## 3. Sets (Unique Collection)

A `set` is an unordered collection of **unique** elements. It is implemented as a hash table without values -- essentially a dictionary where you only care about the keys.

**Use Cases**: Removing duplicates, fast membership testing (O(1) vs O(n) for lists), and mathematical set operations.

```python theme={null}
numbers = {1, 2, 2, 3, 3, 3}
print(numbers)  # {1, 2, 3} (Duplicates removed automatically)

a = {1, 2, 3}
b = {3, 4, 5}

print(a | b)  # Union: {1, 2, 3, 4, 5} (Items in either)
print(a & b)  # Intersection: {3} (Items in both)
print(a - b)  # Difference: {1, 2} (Items in a but not b)
print(a ^ b)  # Symmetric difference: {1, 2, 4, 5} (Items in one but not both)
```

### When to Reach for a Set

```python theme={null}
# Membership testing: set is O(1), list is O(n)
# This matters when checking against large collections in a loop
valid_ids = {1001, 1002, 1003, 1004, 1005}  # Use a set, not a list
if user_id in valid_ids:                      # O(1) lookup
    grant_access()

# Deduplication while preserving order (Python 3.7+)
items = [3, 1, 4, 1, 5, 9, 2, 6, 5]
unique_ordered = list(dict.fromkeys(items))  # [3, 1, 4, 5, 9, 2, 6]
# dict.fromkeys preserves insertion order and drops duplicates (keys are unique)
```

<Warning>
  **Set Pitfall**: You cannot create an empty set with `{}` -- that creates an empty **dictionary**. Use `set()` instead.

  ```python theme={null}
  empty_dict = {}      # This is a dict, not a set!
  empty_set = set()    # This is the correct way to create an empty set
  type({})             # <class 'dict'>
  type(set())          # <class 'set'>
  ```
</Warning>

***

## 4. Tuples (Immutable Lists)

A `tuple` is like a list, but it cannot be changed after creation (immutable). Think of a tuple as a sealed record -- once created, the structure is frozen. This makes tuples ideal for representing fixed collections of related values, like coordinates, database rows, or function return values.

**Why use tuples?**

1. **Safety**: Guarantees the container has not been modified. Useful for function returns and configuration.
2. **Performance**: Tuples are smaller in memory and slightly faster to create than lists because Python can optimize immutable objects.
3. **Hashable**: Can be used as dictionary keys or set elements (lists cannot, because mutable objects cannot be hashed).

```python theme={null}
point = (10, 20)
# point[0] = 30  # TypeError! Tuples are immutable.

# Unpacking -- assign tuple elements to individual variables
x, y = point
print(x, y)  # 10 20

# Unpacking with * (star) to capture the rest
first, *rest = (1, 2, 3, 4, 5)
print(first)  # 1
print(rest)   # [2, 3, 4, 5] -- note: rest is a list, not a tuple

# Named Tuples -- give your tuples readable field names
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(10, 20)
print(p.x, p.y)  # 10, 20 -- much clearer than p[0], p[1]

# Modern alternative: typing.NamedTuple (Python 3.6+)
from typing import NamedTuple

class Point(NamedTuple):
    x: float
    y: float
    z: float = 0.0  # Default values supported

p = Point(1.0, 2.0)
print(p)  # Point(x=1.0, y=2.0, z=0.0)
```

<Warning>
  **Tuple Gotcha**: A tuple containing mutable objects is still "immutable" in the sense that you cannot reassign its slots -- but you can mutate the objects inside. The container is frozen; the contents may not be.

  ```python theme={null}
  t = ([1, 2], [3, 4])
  # t[0] = [5, 6]     # TypeError -- cannot reassign the slot
  t[0].append(99)      # This works! The list inside is still mutable.
  print(t)             # ([1, 2, 99], [3, 4])
  ```
</Warning>

***

## 5. The `collections` Module

The standard library `collections` module provides specialized data structures that go beyond the basics.

### `Counter`

A dictionary subclass for counting hashable objects.

```python theme={null}
from collections import Counter

colors = ["red", "blue", "red", "green", "blue", "blue"]
c = Counter(colors)

print(c) # Counter({'blue': 3, 'red': 2, 'green': 1})
print(c.most_common(1)) # [('blue', 3)]
```

### `defaultdict`

A dictionary that provides a default value for missing keys. This avoids `KeyError` and the pattern of checking `if key in dict` before every access. The argument is a **factory function** (a callable that returns the default value) -- not a value itself.

```python theme={null}
from collections import defaultdict

# Default value is list() -> [] (called each time a missing key is accessed)
groups = defaultdict(list)

groups["A"].append("Alice")  # Key "A" did not exist -- created [], then appended
groups["B"].append("Bob")

# Common patterns:
word_count = defaultdict(int)     # Missing keys default to 0
nested = defaultdict(dict)         # Missing keys default to {}

# Counting pattern -- cleaner than manual if/else
for word in ["apple", "banana", "apple", "cherry", "banana", "apple"]:
    word_count[word] += 1
# defaultdict(<class 'int'>, {'apple': 3, 'banana': 2, 'cherry': 1})
```

<Warning>
  **defaultdict Pitfall**: Accessing a missing key **creates** that key with the default value as a side effect. This can cause subtle bugs when you are just trying to check if a key exists.

  ```python theme={null}
  d = defaultdict(list)
  if d["missing"]:          # This check CREATES the key "missing" with value []
      print("found")
  print(dict(d))            # {"missing": []} -- the key now exists!

  # Use "in" to check without side effects:
  if "missing" in d:
      print("found")
  ```
</Warning>

***

## Choosing the Right Data Structure

| Need                                 | Use                       | Why                                    |
| ------------------------------------ | ------------------------- | -------------------------------------- |
| Ordered collection, frequent appends | `list`                    | O(1) append, O(1) index access         |
| Fast lookup by key                   | `dict`                    | O(1) average lookup by key             |
| Unique elements, membership testing  | `set`                     | O(1) membership test, auto-dedup       |
| Fixed record, dictionary key         | `tuple`                   | Immutable, hashable, memory-efficient  |
| Queue (FIFO)                         | `collections.deque`       | O(1) append and popleft                |
| Count occurrences                    | `collections.Counter`     | Built-in counting with `most_common()` |
| Group items by key                   | `collections.defaultdict` | Auto-creates missing keys              |

## Summary

* **Lists**: Ordered, mutable. Use for sequences. Watch out for O(n) inserts at the front.
* **Dicts**: Key-Value with O(1) lookup. Insertion-ordered since Python 3.7. Use `.get()` for safe access.
* **Sets**: Unique elements with O(1) membership testing. Cannot use `{}` for an empty set.
* **Tuples**: Immutable. Use for fixed records and dictionary keys. Consider `NamedTuple` for readability.
* **Comprehensions**: Faster than manual loops. Keep them to one or two levels of nesting.
* **Shallow vs Deep**: `.copy()` and slicing create shallow copies. Use `copy.deepcopy()` for nested structures.

Next, we'll organize our code using **Object-Oriented Programming**.

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="How are Python dictionaries implemented under the hood, and what changed in Python 3.6+ that made them ordered?">
    **Strong Answer:**

    * Python dictionaries are implemented as hash tables. When you insert a key, Python computes the key's hash (via `__hash__`), uses that hash to determine a slot index in an internal array, and stores both the key and value there. Lookups work the same way: hash the key, jump to the slot, compare for equality (via `__eq__`). This gives O(1) average-case lookups, insertions, and deletions.
    * Hash collisions are inevitable -- two different keys can produce the same slot index. CPython uses open addressing with probing (not chaining with linked lists like Java's HashMap). When a collision occurs, it probes to the next available slot using a perturbation scheme that mixes in higher bits of the hash to distribute entries more evenly.
    * Before Python 3.6, dictionaries did not preserve insertion order. The internal layout was a single sparse array where most slots were empty (hash tables need \~33-66% empty space to maintain O(1) performance). This wasted memory because each empty slot still reserved space for a hash, key pointer, and value pointer.
    * In Python 3.6, the implementation was redesigned into a compact dict. It uses two arrays: a dense array that stores entries in insertion order (hash, key, value), and a sparse indices array that maps hash slots to positions in the dense array. This reduced memory usage by 20-25% and made insertion order preservation a side effect of the implementation. In Python 3.7, this ordering guarantee was elevated from an implementation detail to a language specification.
    * The practical implication is that modern Python dicts are simultaneously a hash map and an ordered sequence, which eliminated most use cases for `collections.OrderedDict`. However, `OrderedDict` still has a niche: it supports `move_to_end()`, its equality comparison considers order (two OrderedDicts with the same keys in different order are not equal, whereas regular dicts with the same key-value pairs are always equal regardless of insertion order), and it is a subclass you can use for type-based dispatch.

    **Follow-up: What makes a good dictionary key, and what happens if you use a mutable object as a key?**

    * A dictionary key must be hashable, which means it must implement `__hash__` and `__eq__`, and the hash must remain constant for the object's lifetime. All immutable built-in types (str, int, float, tuple of immutables, frozenset) are hashable. Mutable types (list, dict, set) are not hashable because their contents can change, which would change their hash after insertion, making the key unfindable in the table.
    * If you define a custom class, instances are hashable by default -- they inherit `__hash__` from `object`, which returns a value derived from `id()` (memory address). But if you define `__eq__`, Python automatically sets `__hash__ = None` (making instances unhashable) because the contract requires that objects which compare equal must have the same hash. You must explicitly define a consistent `__hash__` if you define `__eq__` and want your objects to be usable as dict keys.
    * A subtle production gotcha: using floats as dict keys. `hash(1) == hash(1.0)` and `1 == 1.0`, so `d = {1: "int"}; d[1.0] = "float"` overwrites the entry. This can cause extremely confusing bugs when mixing int and float keys.
  </Accordion>

  <Accordion title="When would you use a list versus a tuple versus a set in production code? Walk me through the performance and semantic differences.">
    **Strong Answer:**

    * The choice is driven by two axes: mutability (do you need to modify it?) and semantics (what does the data represent?).
    * Lists are for ordered, mutable sequences where you might append, remove, or reorder elements. They are backed by a dynamic array (contiguous memory with amortized O(1) append, O(n) insert/delete at arbitrary positions). Use them when the data is a collection of similar things: a list of user IDs, a queue of tasks, search results.
    * Tuples are for ordered, immutable records. They use less memory than lists (no over-allocation for potential growth) and are hashable (so they can be dict keys or set members). Use them when the data is a fixed structure: a coordinate `(x, y)`, a database row, or a function returning multiple values. The immutability is a semantic signal: "this will not change."
    * Sets are for unordered collections of unique elements backed by a hash table. They provide O(1) membership testing (`x in s`), which is dramatically faster than `x in list` (O(n)). Use them for deduplication, membership checks, and mathematical set operations (union, intersection, difference).
    * Performance comparison for membership testing: checking `x in my_list` scans every element -- 1 million elements means up to 1 million comparisons. `x in my_set` does a hash lookup -- constant time regardless of size. I have seen production code with response times drop from 800ms to 2ms by converting a list to a set for a membership check inside a loop. This is one of the most common performance wins in Python code reviews.
    * A nuance: tuples are not just "immutable lists." They have different optimization paths in CPython. Small tuples (length 1-20) are cached in a free list, so creating them is faster than creating lists. Tuple packing/unpacking is also used internally by Python for multiple assignment, function returns, and the iteration protocol, making it a first-class citizen in the bytecode.

    **Follow-up: What is the time complexity of `list.sort()` versus `sorted()`, and when would you pick one over the other?**

    * Both use Timsort, a hybrid merge-sort/insertion-sort algorithm with O(n log n) worst case and O(n) best case (when the data is already partially sorted). Timsort is specifically designed to exploit existing order in real-world data, which is why it was adopted by Java and Android as well.
    * `list.sort()` sorts in place and returns `None`. It is more memory-efficient because it does not allocate a new list. `sorted()` returns a new sorted list, leaving the original unchanged. Use `sort()` when you own the list and do not need the original order. Use `sorted()` when you need to preserve the original or when you are sorting a non-list iterable (generator, set, dict keys).
    * A subtle point: `list.sort()` is slightly faster because it can use an optimized path that temporarily decouples the list from the interpreter (the list is marked as "being sorted" and raises `ValueError` if you try to mutate it during sort). `sorted()` must first materialize the iterable into a list and then sort that.
  </Accordion>

  <Accordion title="Explain list comprehensions versus generator expressions. When does the memory difference actually matter, and when is it negligible?">
    **Strong Answer:**

    * A list comprehension (`[x**2 for x in range(n)]`) eagerly evaluates every element and stores them all in memory as a list. A generator expression (`(x**2 for x in range(n))`) produces elements one at a time, on demand. The generator holds only the current state (roughly the loop variable and the expression to evaluate), not the entire result set.
    * For `range(100)`, the difference is negligible -- 100 integers is roughly 800 bytes. But for `range(100_000_000)`, the list comprehension allocates about 800MB of memory, while the generator expression uses a constant \~100 bytes regardless of how many elements it will produce. This is the difference between your process running normally and getting OOM-killed in production.
    * The memory difference matters when three conditions are met: (1) the input is large, (2) you only need to iterate once, and (3) you do not need random access or the `len()`. Processing lines from a large log file, streaming rows from a database cursor, or piping data through a transformation chain are all generator territory.
    * However, generators have overhead per element (they must suspend and resume the frame on each `yield`/`next()` call), so for small collections, list comprehensions are actually faster. If you are iterating over the result multiple times, a list is also better -- generators are exhausted after one pass.
    * A common pattern in production data pipelines is chaining generators: `filtered = (line for line in lines if "ERROR" in line); parsed = (json.loads(line) for line in filtered); timestamps = (entry["ts"] for entry in parsed)`. Each step adds near-zero memory overhead, and no intermediate lists are created. The entire pipeline processes one line at a time, which means you can process a 50GB log file in constant memory.

    **Follow-up: What is the `yield from` syntax, and how does it differ from iterating and yielding manually?**

    * `yield from iterable` delegates iteration to a sub-generator or any iterable. Instead of writing `for item in iterable: yield item`, you write `yield from iterable`. It is not just syntactic sugar -- it properly propagates `.send()`, `.throw()`, and `.close()` calls to the sub-generator, which is essential for coroutine-based code.
    * Without `yield from`, if you have a recursive generator (like flattening a nested structure), you lose the ability to send values into or throw exceptions into the inner generator. `yield from` creates a transparent bidirectional channel between the caller and the sub-generator.
    * Practically, `yield from` is also slightly faster than a manual loop because it avoids the overhead of repeatedly calling `__next__` on the sub-iterator from Python code -- the delegation happens at the C level in CPython. For deeply nested generators (tree traversal, recursive flattening), this performance difference compounds.
  </Accordion>

  <Accordion title="You have a function that receives a list of 10 million user records and needs to find all users in a specific city. The current implementation is slow. How do you optimize it?">
    **Strong Answer:**

    * The first question is: what does "slow" mean and where is the bottleneck? If the current code is `[u for u in users if u["city"] == target_city]`, that is O(n) -- a single pass over 10 million records. That is about 1-2 seconds in pure Python, which may or may not be acceptable depending on the context.
    * If this query runs frequently with different cities, the answer is to build an index. A `defaultdict(list)` keyed by city, built once: `by_city = defaultdict(list); for u in users: by_city[u["city"]].append(u)`. Now lookups are O(1) for the hash lookup plus O(k) to retrieve k matching records. This is essentially what a database index does.
    * If the data does not fit in memory, this is no longer a data-structure problem -- it is an architecture problem. Move the data to a database (PostgreSQL, SQLite for local use) and let it handle indexing, query optimization, and disk-based storage. Python's `sqlite3` module is built in and handles millions of rows efficiently.
    * If the data is in memory and you need maximum speed, consider Pandas or Polars. `df[df["city"] == target_city]` in Pandas uses vectorized C operations and is 10-50x faster than a Python loop. Polars is even faster because it uses Rust and supports lazy evaluation with query optimization.
    * The meta-lesson: the answer to "how do I make this Python code faster" is almost never "rewrite the loop to be cleverer." It is usually "use the right data structure" (dict/set for lookups), "use a vectorized library" (NumPy/Pandas/Polars), or "use a database." Optimizing the algorithm and data access pattern gives orders-of-magnitude improvements. Micro-optimizing Python syntax gives single-digit percentage improvements.

    **Follow-up: What is the difference between `collections.defaultdict` and using `dict.setdefault()`, and when would you choose one over the other?**

    * `defaultdict(list)` creates the default value automatically on any missing key access, including reads via `d[key]`. `dict.setdefault(key, [])` only creates the default when you explicitly call `setdefault`. The behavioral difference is that `defaultdict` modifies itself on read -- accessing `d[missing_key]` adds that key to the dict, which can be surprising.
    * `setdefault` is better for one-off cases where you want to populate a default and immediately use it: `d.setdefault(key, []).append(value)`. `defaultdict` is better when you are building up a data structure in a loop and want clean syntax throughout.
    * A gotcha with `defaultdict`: if you pass it to code that checks `key in d` after accidentally accessing `d[key]`, the key will now exist (with the default value). This can mask bugs. Converting a `defaultdict` to a regular `dict` before passing it to external code is a defensive practice.
  </Accordion>
</AccordionGroup>