> ## Documentation Index > Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt > Use this file to discover all available pages before exploring further. # String Algorithms > Master hashing, KMP, Z-function, and Trie for pattern matching # String Algorithms ## String Problems in CP String algorithms unlock pattern matching, palindrome detection, and text processing problems. The key is choosing the right tool. **Why specialized algorithms?** Naive string matching (check every position) is O(n\*m). For n = m = 10^6, that is 10^12 operations--far too slow. KMP, Z-function, and hashing each bring this down to O(n + m) by cleverly reusing work from previous comparisons. The intuition behind all of them: when a mismatch occurs, you have already matched some characters, and that partial match tells you where to look next without starting over. **Pattern Recognition Signals**: * "Find pattern in text" → KMP, Z-function, or Hashing * "Count occurrences of pattern" → KMP or Hashing * "Longest palindrome" → Manacher's algorithm * "Prefix queries on dictionary" → Trie * "Compare substrings" → Hashing with binary search *** ## Algorithm Selection Guide | Problem Type | Best Algorithm | Complexity | | ----------------------------------- | ---------------- | -------------- | | Single pattern search | KMP / Z-function | O(n + m) | | Multiple pattern search | Aho-Corasick | O(n + m + k) | | Substring comparison | Rolling Hash | O(1) per query | | Longest palindromic substring | Manacher | O(n) | | Prefix/suffix queries | Trie | O(length) | | Lexicographically smallest rotation | Duval / Booth | O(n) | *** ## String Hashing Convert strings to numbers for fast comparison. **The Problem**: Comparing two strings of length n takes O(n). For many comparisons, this is too slow. **The Insight**: Map each string to a number (hash). If hashes differ, strings differ. If hashes match, strings are *probably* equal. **How it works**: Treat the string as a number in base B: $H(s) = s[0] \cdot B^{n-1} + s[1] \cdot B^{n-2} + ... + s[n-1] \cdot B^0 \pmod{MOD}$ **Example**: "abc" with B=31: $H = 1 \cdot 31^2 + 2 \cdot 31^1 + 3 \cdot 31^0 = 961 + 62 + 3 = 1026$ **Substring Hash in O(1)**: Using prefix hashes: $H(s[l..r]) = H(s[0..r]) - H(s[0..l-1]) \cdot B^{r-l+1}$ **Collision Risk**: Two different strings can have the same hash. Use: * Large prime MOD (10^9 + 7) * **Double hashing**: Two different (BASE, MOD) pairs--collision probability becomes \~1/10^18 **Contest edge case**: On Codeforces, some problem setters specifically add anti-hash tests for single-hash solutions. If your hashing solution gets WA on a string problem, switching to double hashing (or using a different base/mod pair) often fixes it. Always prefer double hashing for problems with large test cases. ```cpp theme={null} class StringHash { public: const long long MOD = 1e9 + 7; const long long BASE = 31; int n; vector hash_, power; StringHash(const string& s) : n(s.size()), hash_(n + 1), power(n + 1) { power[0] = 1; for (int i = 0; i < n; i++) { hash_[i + 1] = (hash_[i] * BASE + (s[i] - 'a' + 1)) % MOD; power[i + 1] = (power[i] * BASE) % MOD; } } // Get hash of s[l..r] (0-indexed, inclusive) long long getHash(int l, int r) { long long h = (hash_[r + 1] - hash_[l] * power[r - l + 1] % MOD + MOD) % MOD; return h; } }; // Usage string s = "abcabc"; StringHash sh(s); cout << (sh.getHash(0, 2) == sh.getHash(3, 5)); // true: "abc" == "abc" ``` ### Double Hashing (Reduces Collision) ```cpp theme={null} class DoubleHash { public: const long long MOD1 = 1e9 + 7, MOD2 = 1e9 + 9; const long long BASE1 = 31, BASE2 = 37; int n; vector h1, h2, p1, p2; DoubleHash(const string& s) : n(s.size()), h1(n + 1), h2(n + 1), p1(n + 1), p2(n + 1) { p1[0] = p2[0] = 1; for (int i = 0; i < n; i++) { int c = s[i] - 'a' + 1; h1[i + 1] = (h1[i] * BASE1 + c) % MOD1; h2[i + 1] = (h2[i] * BASE2 + c) % MOD2; p1[i + 1] = p1[i] * BASE1 % MOD1; p2[i + 1] = p2[i] * BASE2 % MOD2; } } pair getHash(int l, int r) { long long hash1 = (h1[r + 1] - h1[l] * p1[r - l + 1] % MOD1 + MOD1) % MOD1; long long hash2 = (h2[r + 1] - h2[l] * p2[r - l + 1] % MOD2 + MOD2) % MOD2; return {hash1, hash2}; } }; ``` *** ## KMP Algorithm Find all occurrences of pattern in text in O(n + m). **The Problem**: Naive search is O(n·m)—for each position in text, compare entire pattern. **KMP's Insight**: When a mismatch occurs, we've already matched some characters. Use this information to skip ahead instead of starting over. ### Computing Failure Function (LPS Array) **LPS\[i]** = length of the **L**ongest **P**roper **P**refix of pattern\[0..i] which is also a **S**uffix. **Example**: pattern = "ABABAC" ``` i=0: "A" → No proper prefix → LPS[0] = 0 i=1: "AB" → No match → LPS[1] = 0 i=2: "ABA" → "A" = "A" → LPS[2] = 1 i=3: "ABAB" → "AB" = "AB" → LPS[3] = 2 i=4: "ABABA" → "ABA" = "ABA" → LPS[4] = 3 i=5: "ABABAC" → No suffix matches prefix ending with C → LPS[5] = 0 LPS = [0, 0, 1, 2, 3, 0] ``` **Why this helps**: When mismatch at position j, we know pattern\[0..j-1] matched. The LPS tells us the longest prefix we don't need to re-check. ```cpp theme={null} vector computeLPS(const string& pattern) { int m = pattern.size(); vector lps(m, 0); // Longest Proper Prefix which is also Suffix int len = 0; int i = 1; while (i < m) { if (pattern[i] == pattern[len]) { len++; lps[i] = len; i++; } else { if (len != 0) { len = lps[len - 1]; } else { lps[i] = 0; i++; } } } return lps; } ``` ### Pattern Matching ```cpp theme={null} vector kmpSearch(const string& text, const string& pattern) { int n = text.size(), m = pattern.size(); vector lps = computeLPS(pattern); vector matches; int i = 0, j = 0; while (i < n) { if (pattern[j] == text[i]) { i++; j++; } if (j == m) { matches.push_back(i - j); // Match found at index i-j j = lps[j - 1]; } else if (i < n && pattern[j] != text[i]) { if (j != 0) { j = lps[j - 1]; } else { i++; } } } return matches; } ``` *** ## Z-Function **z\[i]** = length of longest substring starting from i that matches a prefix of s. **Example**: s = "aabxaab" ``` i=0: (undefined, by convention z[0]=0 or n) i=1: s[1..] = "abxaab" vs s = "aabxaab" → "a" matches → z[1] = 1 i=2: s[2..] = "bxaab" vs s → No match → z[2] = 0 i=3: s[3..] = "xaab" vs s → No match → z[3] = 0 i=4: s[4..] = "aab" vs s → "aab" matches → z[4] = 3 i=5: s[5..] = "ab" vs s → "a" matches → z[5] = 1 i=6: s[6..] = "b" vs s → No match → z[6] = 0 Z = [0, 1, 0, 0, 3, 1, 0] ``` **The Optimization** (Z-box technique): Maintain \[l, r)—the rightmost segment that matches a prefix. If i \< r, we can reuse previously computed values! **Pattern Matching Trick**: To find pattern P in text T: 1. Create combined = P + "$" + T ($ is a separator not in P or T) 2. Compute Z-function 3. Wherever z\[i] == len(P), we found a match at position i - len(P) - 1 in T ```cpp theme={null} vector zFunction(const string& s) { int n = s.size(); vector z(n, 0); int l = 0, r = 0; for (int i = 1; i < n; i++) { if (i < r) { z[i] = min(r - i, z[i - l]); } while (i + z[i] < n && s[z[i]] == s[i + z[i]]) { z[i]++; } if (i + z[i] > r) { l = i; r = i + z[i]; } } return z; } // Pattern matching using Z-function vector findPattern(const string& text, const string& pattern) { string combined = pattern + "$" + text; vector z = zFunction(combined); vector matches; int m = pattern.size(); for (int i = m + 1; i < combined.size(); i++) { if (z[i] == m) { matches.push_back(i - m - 1); } } return matches; } ``` *** ## Trie (Prefix Tree) Efficient for prefix-based queries on a dictionary. **Structure**: A tree where: * Each edge is labeled with a character * Each node represents a prefix (path from root) * Words sharing a prefix share that path **Visual Example** with words \["cat", "car", "card", "dog"]: ``` (root) / \ c d | | a o / \ | t r g* * * | d* * = word ends here ``` **Operations**: * Insert/Search: O(length of word) * Count words with prefix: O(length of prefix) * Autocomplete: O(prefix length + number of suggestions) **When to use**: * Dictionary with prefix queries * XOR maximization (binary trie) * Counting distinct prefixes ```cpp theme={null} class Trie { public: struct Node { int children[26] = {}; int count = 0; // Words ending here int prefixCount = 0; // Words with this prefix }; vector nodes; Trie() { nodes.push_back(Node()); // Root } void insert(const string& s) { int cur = 0; for (char c : s) { int idx = c - 'a'; if (!nodes[cur].children[idx]) { nodes[cur].children[idx] = nodes.size(); nodes.push_back(Node()); } cur = nodes[cur].children[idx]; nodes[cur].prefixCount++; } nodes[cur].count++; } bool search(const string& s) { int cur = 0; for (char c : s) { int idx = c - 'a'; if (!nodes[cur].children[idx]) return false; cur = nodes[cur].children[idx]; } return nodes[cur].count > 0; } int countPrefix(const string& prefix) { int cur = 0; for (char c : prefix) { int idx = c - 'a'; if (!nodes[cur].children[idx]) return 0; cur = nodes[cur].children[idx]; } return nodes[cur].prefixCount; } }; ``` *** ## Pattern 1: Longest Palindromic Substring ### Using Hashing + Binary Search ```cpp theme={null} int longestPalindrome(const string& s) { int n = s.size(); string rev = s; reverse(rev.begin(), rev.end()); StringHash sh(s); StringHash rh(rev); auto isPalindrome = [&](int l, int r) { // Compare s[l..r] with reverse return sh.getHash(l, r) == rh.getHash(n - 1 - r, n - 1 - l); }; int ans = 1; // Odd length palindromes for (int i = 0; i < n; i++) { int lo = 0, hi = min(i, n - 1 - i); while (lo < hi) { int mid = (lo + hi + 1) / 2; if (isPalindrome(i - mid, i + mid)) { lo = mid; } else { hi = mid - 1; } } ans = max(ans, 2 * lo + 1); } // Even length palindromes (similar) return ans; } ``` ### Manacher's Algorithm (O(n)) **The Problem**: Find all palindromes in O(n) time. **The Trick**: Insert '#' between characters to handle even-length palindromes uniformly. * "abba" becomes "#a#b#b#a#" * Now all palindromes have odd length with a center **The Optimization**: Similar to Z-function, maintain the rightmost palindrome \[c-r, c+r]. If i is inside this palindrome, we can use the mirror position 2c-i to get a starting value for p\[i]. **Visual**: ``` String: #a#b#a# p[i]: 0 1 0 3 0 1 0 ^ center (full palindrome "#a#b#a#" has radius 3) Radius 3 in transformed string = palindrome length 3 in original "aba" ``` **Key Insight**: p\[i] in the transformed string equals the length of the palindrome centered at position i/2 in the original string. ```cpp theme={null} vector manacher(const string& s) { string t = "#"; for (char c : s) { t += c; t += '#'; } int n = t.size(); vector p(n, 0); // p[i] = radius of palindrome centered at i int c = 0, r = 0; // Center and right boundary of rightmost palindrome for (int i = 0; i < n; i++) { if (i < r) { p[i] = min(r - i, p[2 * c - i]); } while (i - p[i] - 1 >= 0 && i + p[i] + 1 < n && t[i - p[i] - 1] == t[i + p[i] + 1]) { p[i]++; } if (i + p[i] > r) { c = i; r = i + p[i]; } } return p; // p[i] is the length of palindrome in original string } ``` *** ## Pattern 2: Distinct Substrings **Problem**: Count number of distinct substrings. ```cpp theme={null} long long countDistinct(const string& s) { int n = s.size(); set> seen; // Use double hash DoubleHash dh(s); for (int i = 0; i < n; i++) { for (int j = i; j < n; j++) { seen.insert(dh.getHash(i, j)); } } return seen.size(); } // More efficient: Use suffix array // Number of distinct substrings = n*(n+1)/2 - sum of LCP values ``` *** ## Pattern 3: Longest Common Substring ```cpp theme={null} int longestCommonSubstring(const string& s, const string& t) { DoubleHash hs(s), ht(t); auto check = [&](int len) -> bool { set> hashes; for (int i = 0; i + len <= s.size(); i++) { hashes.insert(hs.getHash(i, i + len - 1)); } for (int i = 0; i + len <= t.size(); i++) { if (hashes.count(ht.getHash(i, i + len - 1))) { return true; } } return false; }; int lo = 0, hi = min(s.size(), t.size()); while (lo < hi) { int mid = (lo + hi + 1) / 2; if (check(mid)) { lo = mid; } else { hi = mid - 1; } } return lo; } ``` *** ## Pattern 4: Aho-Corasick (Multiple Pattern Matching) For searching multiple patterns simultaneously. ```cpp theme={null} class AhoCorasick { public: struct Node { int children[26] = {}; int fail = 0; int output = -1; // Index of pattern ending here }; vector nodes; AhoCorasick() { nodes.push_back(Node()); } void insert(const string& s, int idx) { int cur = 0; for (char c : s) { int i = c - 'a'; if (!nodes[cur].children[i]) { nodes[cur].children[i] = nodes.size(); nodes.push_back(Node()); } cur = nodes[cur].children[i]; } nodes[cur].output = idx; } void build() { queue q; for (int i = 0; i < 26; i++) { if (nodes[0].children[i]) { q.push(nodes[0].children[i]); } } while (!q.empty()) { int u = q.front(); q.pop(); for (int i = 0; i < 26; i++) { if (nodes[u].children[i]) { int v = nodes[u].children[i]; int f = nodes[u].fail; while (f && !nodes[f].children[i]) { f = nodes[f].fail; } nodes[v].fail = nodes[f].children[i]; if (nodes[v].fail == v) nodes[v].fail = 0; q.push(v); } } } } vector> search(const string& text) { vector> matches; // {position, pattern_idx} int cur = 0; for (int i = 0; i < text.size(); i++) { int c = text[i] - 'a'; while (cur && !nodes[cur].children[c]) { cur = nodes[cur].fail; } cur = nodes[cur].children[c]; int temp = cur; while (temp) { if (nodes[temp].output != -1) { matches.push_back({i, nodes[temp].output}); } temp = nodes[temp].fail; } } return matches; } }; ``` *** ## Common Mistakes **Mistake 1: Hash collision** Use double hashing for important problems. Single hash can collide. On Codeforces, problem setters often add anti-hash tests that specifically target common single-hash parameters. **Mistake 2: Wrong base/mod** * Base should be > alphabet size (26 for lowercase letters, so use 31+) * Mod should be large prime (10^9 + 7 or 10^9 + 9) * Using base = 26 and mapping 'a' to 0 is dangerous: all strings starting with 'a' have hash 0 for the first character, increasing collision risk * Safe mapping: `s[i] - 'a' + 1` (so 'a' maps to 1, not 0) **Mistake 3: Negative hash values** In C++, the `%` operator can return negative values for negative operands. Always add MOD before taking modulo: ```cpp theme={null} // WRONG -- can produce negative hash long long h = (hash_[r+1] - hash_[l] * power[r-l+1]) % MOD; // CORRECT -- add MOD to ensure non-negative long long h = (hash_[r+1] - hash_[l] * power[r-l+1] % MOD + MOD) % MOD; ``` **Mistake 4: Off-by-one in substring hashing** When computing hash of s\[l..r] (inclusive), the formula uses `power[r - l + 1]` (the length of the substring). Double-check whether your getHash function uses inclusive or exclusive bounds -- mismatches cause subtle bugs. *** ## Practice Problems ### Beginner (1000-1300) | Problem | Pattern | Link | | --------------- | ------- | -------------------------------------------- | | String Matching | KMP/Z | [CSES](https://cses.fi/problemset/task/1753) | | Finding Borders | KMP LPS | [CSES](https://cses.fi/problemset/task/1732) | ### Intermediate (1300-1600) | Problem | Pattern | Link | | ------------------ | ---------- | -------------------------------------------- | | Finding Periods | Z-function | [CSES](https://cses.fi/problemset/task/1733) | | Minimal Rotation | Duval | [CSES](https://cses.fi/problemset/task/1110) | | Longest Palindrome | Manacher | [CSES](https://cses.fi/problemset/task/1111) | ### Advanced (1600-1900) | Problem | Pattern | Link | | ------------------- | ------------ | -------------------------------------------- | | Pattern Positions | KMP | [CSES](https://cses.fi/problemset/task/2102) | | Distinct Substrings | Suffix Array | [CSES](https://cses.fi/problemset/task/2105) | *** ## Key Takeaways O(1) substring comparison after O(n) preprocessing. O(n + m) pattern matching without false positives. Efficient prefix queries and autocomplete. O(n) for all palindromic substrings. *** ## Next Up Master bitwise operations, bitmasks, and XOR tricks.