Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Data Structures
Python’s built-in data structures are incredibly powerful. Unlike C++ or Java where you might need to import special libraries for hash maps or dynamic arrays, Python has them built into the core language syntax. Choosing the right data structure is like choosing the right container in a kitchen. A spice rack (dictionary) gives you instant lookup by name. A queue at a deli counter (list/deque) maintains order. A cookie cutter set (set) guarantees no duplicates. Using the wrong container does not ruin the meal — but it makes cooking slower and messier than it needs to be.1. Lists (Dynamic Arrays)
Alist is an ordered, mutable sequence of items. It can hold mixed types (integers, strings, other lists). Under the hood, a Python list is a dynamic array — a contiguous block of pointers to objects. This means index access is O(1), but inserting at the beginning is O(n) because every pointer must shift.
Performance Pitfall:
list.insert(0, x) and list.pop(0) are O(n) operations. If you are building a queue (FIFO), use collections.deque which gives O(1) append and popleft. This is a common interview question and a real production performance trap.Slicing [start:end:step]
Slicing allows you to extract sub-parts of a list. It creates a new list.
List Comprehensions
This is the “Pythonic” way to create lists based on existing iterables. It is not just syntactic sugar — comprehensions are actually faster than the equivalent for-loop with.append() because CPython optimizes the bytecode for this pattern.
2. Dictionaries (Hash Maps)
Adict stores Key-Value pairs. Keys must be hashable (immutable types: strings, numbers, tuples of immutables) and unique. Lookups are O(1) average case because dictionaries are implemented as hash tables.
Since Python 3.7, dictionaries preserve insertion order as a language guarantee (it was an implementation detail in 3.6). This means you can rely on iteration order matching the order you inserted keys.
Merging Dictionaries
The setdefault and Walrus Operator Patterns
Dict Comprehensions
Just like lists, you can build dictionaries dynamically.3. Sets (Unique Collection)
Aset is an unordered collection of unique elements. It is implemented as a hash table without values — essentially a dictionary where you only care about the keys.
Use Cases: Removing duplicates, fast membership testing (O(1) vs O(n) for lists), and mathematical set operations.
When to Reach for a Set
4. Tuples (Immutable Lists)
Atuple is like a list, but it cannot be changed after creation (immutable). Think of a tuple as a sealed record — once created, the structure is frozen. This makes tuples ideal for representing fixed collections of related values, like coordinates, database rows, or function return values.
Why use tuples?
- Safety: Guarantees the container has not been modified. Useful for function returns and configuration.
- Performance: Tuples are smaller in memory and slightly faster to create than lists because Python can optimize immutable objects.
- Hashable: Can be used as dictionary keys or set elements (lists cannot, because mutable objects cannot be hashed).
5. The collections Module
The standard library collections module provides specialized data structures that go beyond the basics.
Counter
A dictionary subclass for counting hashable objects.
defaultdict
A dictionary that provides a default value for missing keys. This avoids KeyError and the pattern of checking if key in dict before every access. The argument is a factory function (a callable that returns the default value) — not a value itself.
Choosing the Right Data Structure
| Need | Use | Why |
|---|---|---|
| Ordered collection, frequent appends | list | O(1) append, O(1) index access |
| Fast lookup by key | dict | O(1) average lookup by key |
| Unique elements, membership testing | set | O(1) membership test, auto-dedup |
| Fixed record, dictionary key | tuple | Immutable, hashable, memory-efficient |
| Queue (FIFO) | collections.deque | O(1) append and popleft |
| Count occurrences | collections.Counter | Built-in counting with most_common() |
| Group items by key | collections.defaultdict | Auto-creates missing keys |
Summary
- Lists: Ordered, mutable. Use for sequences. Watch out for O(n) inserts at the front.
- Dicts: Key-Value with O(1) lookup. Insertion-ordered since Python 3.7. Use
.get()for safe access. - Sets: Unique elements with O(1) membership testing. Cannot use
{}for an empty set. - Tuples: Immutable. Use for fixed records and dictionary keys. Consider
NamedTuplefor readability. - Comprehensions: Faster than manual loops. Keep them to one or two levels of nesting.
- Shallow vs Deep:
.copy()and slicing create shallow copies. Usecopy.deepcopy()for nested structures.
Interview Deep-Dive
How are Python dictionaries implemented under the hood, and what changed in Python 3.6+ that made them ordered?
How are Python dictionaries implemented under the hood, and what changed in Python 3.6+ that made them ordered?
Strong Answer:
- Python dictionaries are implemented as hash tables. When you insert a key, Python computes the key’s hash (via
__hash__), uses that hash to determine a slot index in an internal array, and stores both the key and value there. Lookups work the same way: hash the key, jump to the slot, compare for equality (via__eq__). This gives O(1) average-case lookups, insertions, and deletions. - Hash collisions are inevitable — two different keys can produce the same slot index. CPython uses open addressing with probing (not chaining with linked lists like Java’s HashMap). When a collision occurs, it probes to the next available slot using a perturbation scheme that mixes in higher bits of the hash to distribute entries more evenly.
- Before Python 3.6, dictionaries did not preserve insertion order. The internal layout was a single sparse array where most slots were empty (hash tables need ~33-66% empty space to maintain O(1) performance). This wasted memory because each empty slot still reserved space for a hash, key pointer, and value pointer.
- In Python 3.6, the implementation was redesigned into a compact dict. It uses two arrays: a dense array that stores entries in insertion order (hash, key, value), and a sparse indices array that maps hash slots to positions in the dense array. This reduced memory usage by 20-25% and made insertion order preservation a side effect of the implementation. In Python 3.7, this ordering guarantee was elevated from an implementation detail to a language specification.
- The practical implication is that modern Python dicts are simultaneously a hash map and an ordered sequence, which eliminated most use cases for
collections.OrderedDict. However,OrderedDictstill has a niche: it supportsmove_to_end(), its equality comparison considers order (two OrderedDicts with the same keys in different order are not equal, whereas regular dicts with the same key-value pairs are always equal regardless of insertion order), and it is a subclass you can use for type-based dispatch.
- A dictionary key must be hashable, which means it must implement
__hash__and__eq__, and the hash must remain constant for the object’s lifetime. All immutable built-in types (str, int, float, tuple of immutables, frozenset) are hashable. Mutable types (list, dict, set) are not hashable because their contents can change, which would change their hash after insertion, making the key unfindable in the table. - If you define a custom class, instances are hashable by default — they inherit
__hash__fromobject, which returns a value derived fromid()(memory address). But if you define__eq__, Python automatically sets__hash__ = None(making instances unhashable) because the contract requires that objects which compare equal must have the same hash. You must explicitly define a consistent__hash__if you define__eq__and want your objects to be usable as dict keys. - A subtle production gotcha: using floats as dict keys.
hash(1) == hash(1.0)and1 == 1.0, sod = {1: "int"}; d[1.0] = "float"overwrites the entry. This can cause extremely confusing bugs when mixing int and float keys.
When would you use a list versus a tuple versus a set in production code? Walk me through the performance and semantic differences.
When would you use a list versus a tuple versus a set in production code? Walk me through the performance and semantic differences.
Strong Answer:
- The choice is driven by two axes: mutability (do you need to modify it?) and semantics (what does the data represent?).
- Lists are for ordered, mutable sequences where you might append, remove, or reorder elements. They are backed by a dynamic array (contiguous memory with amortized O(1) append, O(n) insert/delete at arbitrary positions). Use them when the data is a collection of similar things: a list of user IDs, a queue of tasks, search results.
- Tuples are for ordered, immutable records. They use less memory than lists (no over-allocation for potential growth) and are hashable (so they can be dict keys or set members). Use them when the data is a fixed structure: a coordinate
(x, y), a database row, or a function returning multiple values. The immutability is a semantic signal: “this will not change.” - Sets are for unordered collections of unique elements backed by a hash table. They provide O(1) membership testing (
x in s), which is dramatically faster thanx in list(O(n)). Use them for deduplication, membership checks, and mathematical set operations (union, intersection, difference). - Performance comparison for membership testing: checking
x in my_listscans every element — 1 million elements means up to 1 million comparisons.x in my_setdoes a hash lookup — constant time regardless of size. I have seen production code with response times drop from 800ms to 2ms by converting a list to a set for a membership check inside a loop. This is one of the most common performance wins in Python code reviews. - A nuance: tuples are not just “immutable lists.” They have different optimization paths in CPython. Small tuples (length 1-20) are cached in a free list, so creating them is faster than creating lists. Tuple packing/unpacking is also used internally by Python for multiple assignment, function returns, and the iteration protocol, making it a first-class citizen in the bytecode.
list.sort() versus sorted(), and when would you pick one over the other?- Both use Timsort, a hybrid merge-sort/insertion-sort algorithm with O(n log n) worst case and O(n) best case (when the data is already partially sorted). Timsort is specifically designed to exploit existing order in real-world data, which is why it was adopted by Java and Android as well.
list.sort()sorts in place and returnsNone. It is more memory-efficient because it does not allocate a new list.sorted()returns a new sorted list, leaving the original unchanged. Usesort()when you own the list and do not need the original order. Usesorted()when you need to preserve the original or when you are sorting a non-list iterable (generator, set, dict keys).- A subtle point:
list.sort()is slightly faster because it can use an optimized path that temporarily decouples the list from the interpreter (the list is marked as “being sorted” and raisesValueErrorif you try to mutate it during sort).sorted()must first materialize the iterable into a list and then sort that.
Explain list comprehensions versus generator expressions. When does the memory difference actually matter, and when is it negligible?
Explain list comprehensions versus generator expressions. When does the memory difference actually matter, and when is it negligible?
Strong Answer:
- A list comprehension (
[x**2 for x in range(n)]) eagerly evaluates every element and stores them all in memory as a list. A generator expression ((x**2 for x in range(n))) produces elements one at a time, on demand. The generator holds only the current state (roughly the loop variable and the expression to evaluate), not the entire result set. - For
range(100), the difference is negligible — 100 integers is roughly 800 bytes. But forrange(100_000_000), the list comprehension allocates about 800MB of memory, while the generator expression uses a constant ~100 bytes regardless of how many elements it will produce. This is the difference between your process running normally and getting OOM-killed in production. - The memory difference matters when three conditions are met: (1) the input is large, (2) you only need to iterate once, and (3) you do not need random access or the
len(). Processing lines from a large log file, streaming rows from a database cursor, or piping data through a transformation chain are all generator territory. - However, generators have overhead per element (they must suspend and resume the frame on each
yield/next()call), so for small collections, list comprehensions are actually faster. If you are iterating over the result multiple times, a list is also better — generators are exhausted after one pass. - A common pattern in production data pipelines is chaining generators:
filtered = (line for line in lines if "ERROR" in line); parsed = (json.loads(line) for line in filtered); timestamps = (entry["ts"] for entry in parsed). Each step adds near-zero memory overhead, and no intermediate lists are created. The entire pipeline processes one line at a time, which means you can process a 50GB log file in constant memory.
yield from syntax, and how does it differ from iterating and yielding manually?yield from iterabledelegates iteration to a sub-generator or any iterable. Instead of writingfor item in iterable: yield item, you writeyield from iterable. It is not just syntactic sugar — it properly propagates.send(),.throw(), and.close()calls to the sub-generator, which is essential for coroutine-based code.- Without
yield from, if you have a recursive generator (like flattening a nested structure), you lose the ability to send values into or throw exceptions into the inner generator.yield fromcreates a transparent bidirectional channel between the caller and the sub-generator. - Practically,
yield fromis also slightly faster than a manual loop because it avoids the overhead of repeatedly calling__next__on the sub-iterator from Python code — the delegation happens at the C level in CPython. For deeply nested generators (tree traversal, recursive flattening), this performance difference compounds.
You have a function that receives a list of 10 million user records and needs to find all users in a specific city. The current implementation is slow. How do you optimize it?
You have a function that receives a list of 10 million user records and needs to find all users in a specific city. The current implementation is slow. How do you optimize it?
Strong Answer:
- The first question is: what does “slow” mean and where is the bottleneck? If the current code is
[u for u in users if u["city"] == target_city], that is O(n) — a single pass over 10 million records. That is about 1-2 seconds in pure Python, which may or may not be acceptable depending on the context. - If this query runs frequently with different cities, the answer is to build an index. A
defaultdict(list)keyed by city, built once:by_city = defaultdict(list); for u in users: by_city[u["city"]].append(u). Now lookups are O(1) for the hash lookup plus O(k) to retrieve k matching records. This is essentially what a database index does. - If the data does not fit in memory, this is no longer a data-structure problem — it is an architecture problem. Move the data to a database (PostgreSQL, SQLite for local use) and let it handle indexing, query optimization, and disk-based storage. Python’s
sqlite3module is built in and handles millions of rows efficiently. - If the data is in memory and you need maximum speed, consider Pandas or Polars.
df[df["city"] == target_city]in Pandas uses vectorized C operations and is 10-50x faster than a Python loop. Polars is even faster because it uses Rust and supports lazy evaluation with query optimization. - The meta-lesson: the answer to “how do I make this Python code faster” is almost never “rewrite the loop to be cleverer.” It is usually “use the right data structure” (dict/set for lookups), “use a vectorized library” (NumPy/Pandas/Polars), or “use a database.” Optimizing the algorithm and data access pattern gives orders-of-magnitude improvements. Micro-optimizing Python syntax gives single-digit percentage improvements.
collections.defaultdict and using dict.setdefault(), and when would you choose one over the other?defaultdict(list)creates the default value automatically on any missing key access, including reads viad[key].dict.setdefault(key, [])only creates the default when you explicitly callsetdefault. The behavioral difference is thatdefaultdictmodifies itself on read — accessingd[missing_key]adds that key to the dict, which can be surprising.setdefaultis better for one-off cases where you want to populate a default and immediately use it:d.setdefault(key, []).append(value).defaultdictis better when you are building up a data structure in a loop and want clean syntax throughout.- A gotcha with
defaultdict: if you pass it to code that checkskey in dafter accidentally accessingd[key], the key will now exist (with the default value). This can mask bugs. Converting adefaultdictto a regulardictbefore passing it to external code is a defensive practice.