Knuth-Morris-Pratt (KMP) String Matching

24-01-2025string-matching

KMP algorithm efficiently finds all occurrences of a pattern in text in O(n + m) time by using a failure table to avoid redundant comparisons when mismatches occur.

Problem

You need to find all occurrences of a pattern string with a larger text.

For example, finding "ABABC" in "ABABDABABCABAB" should return position 5 (0-indexed).

The naive approach compares the pattern at every position in the text:

  • For each position i in text, check if pattern matches starting at i
  • If a mismatch occurs, move to position i+1 and start over
  • This takes O(n*m) time in the worst case, where n is the text length and m is the pattern length

The problem with the naive approach is that it discards information. When we find "ABABD" but fail at 'D', we've learned that "ABAB" matched, but we throw away this knowledge and restart from scratch.

KMP solves this by remembering what we've matched, achieving O(n + m) time complexity.

The Solution

When a mismatch occurs, we don't need to go back to the beginning. Instead, we can skip ahead based on the pattern's internal structure.

Consider searching for "ABABC" in "ABABDABABC":

Text:    A B A B D A B A B C
Pattern: A B A B C
         ✓ ✓ ✓ ✓ ✗

We matched "ABAB" before failing at 'D'. Notice that "ABAB" has an internal overlap: the prefix "AB" equals the suffix "AB".

Instead of restarting from position 1, we can shift the pattern to align this overlap:

Text:    A B A B D A B A B C
Pattern:     A B A B C
             ↑
         (resume checking here)

We've already verified that positions 2-3 in the text are "AB", so we don't need to check them again. This is the core of KMP's efficiency.

The Failure Table

The failure table (also called LPS - Longest Proper Prefix which is also Suffix) preprocesses the pattern to encode these skip distances.

For each position i in the pattern, b[i] tells us: "If we've matched up to position i-1 and then fail, what's the length of the longest proper prefix of pattern[0...i-1] that's also a suffix?"

Let's build the table for "ABABC":

ipattern[0...i-1]b[i]Explanation
0""-1Empty string, special case
1"A"0No proper prefix equals suffix
2"AB"0No proper prefix equals suffix
3"ABA"1"A" is both prefix and suffix
4"ABAB"2"AB" is both prefix and suffix
5"ABABC"0No proper prefix equals suffix

Key observations:

  • b[i] = j means pattern[0...j-1] matches pattern[i-j...i-1]
  • When we fail at position i, we can resume matching at position b[i]
  • The value -1 indicates we should move to the next text position

Building the Failure Table

The failure table is built using KMP on itself! We match the pattern against itself:

Step-by-step for "ABABC":

i=0, j=-1: b[0]=-1 (base case)

i=1, j=0: pattern[0]='A' vs pattern[0]='A' → match
         b[1]=0

i=2, j=0: pattern[1]='B' vs pattern[0]='A' → mismatch
         j=b[0]=-1, then increment
         b[2]=0

i=3, j=0: pattern[2]='A' vs pattern[0]='A' → match
         b[3]=1

i=4, j=1: pattern[3]='B' vs pattern[1]='B' → match
         b[4]=2

i=5, j=2: pattern[4]='C' vs pattern[2]='A' → mismatch
         j=b[2]=0: pattern[4]='C' vs pattern[0]='A' → mismatch
         j=b[0]=-1, then increment
         b[5]=0

Playground: Failure table builder

Failure table builder

Step through how the failure table (b array) is built, including the fallback jumps after mismatches.

Length: 5Steps: 14i=, j=Step: 1 / 14

Set b[0] = -1. Start with i=0, j=-1.

Pattern
i=0A
i=1B
i=2A
i=3B
i=4C
b[i]
i=0-1
i=1
i=2
i=3
i=4
i=5

Searching with KMP

Once we have the failure table, searching proceeds similarly:

Search for "ABABC" in "ABABDABABC":

i=0, j=0: text[0]='A' vs pattern[0]='A' → match, j=1
i=1, j=1: text[1]='B' vs pattern[1]='B' → match, j=2
i=2, j=2: text[2]='A' vs pattern[2]='A' → match, j=3
i=3, j=3: text[3]='B' vs pattern[3]='B' → match, j=4
i=4, j=4: text[4]='D' vs pattern[4]='C' → mismatch
         j=b[4]=2 (skip to position 2)
i=4, j=2: text[4]='D' vs pattern[2]='A' → mismatch
         j=b[2]=0
i=4, j=0: text[4]='D' vs pattern[0]='A' → mismatch
         j=b[0]=-1
i=5, j=0: text[5]='A' vs pattern[0]='A' → match, j=1
i=6, j=1: text[6]='B' vs pattern[1]='B' → match, j=2
i=7, j=2: text[7]='A' vs pattern[2]='A' → match, j=3
i=8, j=3: text[8]='B' vs pattern[3]='B' → match, j=4
i=9, j=4: text[9]='C' vs pattern[4]='C' → match, j=5
         j==m, found match at position 5!

Playground: KMP search walkthrough

KMP search walkthrough

Walk through the matching phase and watch how the failure table shifts the pattern on mismatches.

Text length: 14Pattern length: 5Matches: 0i=0, j=0

Start at i=0, j=0.

Failure table
i=0-1
i=10
i=20
i=31
i=42
i=50
Text
i=0A
i=1B
i=2A
i=3B
i=4D
i=5A
i=6B
i=7A
i=8B
i=9C
i=10A
i=11B
i=12A
i=13B
Pattern
i=0A
i=1B
i=2A
i=3B
i=4C

Time Complexity Analysis

Building the failure table: O(m)

  • Each iteration either increments i or decrements j
  • i increases from 0 to m, contributing m increments
  • j can only decrease when it's positive, and each increment of i increases j by at most 1
  • Therefore j can decrease at most m times total
  • Total operations: O(m)

Searching: O(n)

  • Similar amortized analysis: i goes from 0 to n, j can decrease at most n times
  • Total operations: O(n)

Total: O(n + m) with O(m) space for the failure table.

Code (C++)

// Returns a vector of all positions in 'text' where 'pattern' begins.
std::vector<int> kmp_search(const std::string &text, const std::string &pattern) {
    int n = text.size(), m = pattern.size();
    if (m == 0) return {};
    // build “failure” table
    std::vector<int> b(m+1);
    b[0] = -1;
    int i = 0, j = -1;
    while (i < m) {
        while (j >= 0 && pattern[i] != pattern[j]) j = b[j];
        ++i; ++j;
        b[i] = j;
    }
    // search
    std::vector<int> matches;
    i = 0; j = 0;
    while (i < n) {
        while (j >= 0 && text[i] != pattern[j]) j = b[j];
        ++i; ++j;
        if (j == m) {
            matches.push_back(i - j);
            j = b[j];
        }
    }
    return matches;
}

Code (Python)

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    
    # Build failure table
    b = [0] * (m + 1)
    b[0] = -1
    i, j = 0, -1
    while i < m:
        while j >= 0 and pattern[i] != pattern[j]:
            j = b[j]
        i += 1
        j += 1
        b[i] = j
    
    # Search
    matches = []
    i, j = 0, 0
    while i < n:
        while j >= 0 and text[i] != pattern[j]:
            j = b[j]
        i += 1
        j += 1
        if j == m:
            matches.append(i - j)
            j = b[j]
    
    return matches