Luke a Pro

Luke Sun

Developer & Marketer

đŸ‡ș🇩

KMP Algorithm

| , 4 minutes reading.

KMP Algorithm: The One-Way Dancer

The Story: The Deja Vu

Imagine you are reading a book, looking for the word “ABABAC”. You read: “A
 B
 A
 B
 A
” Great! You have matched 5 characters. You are excited. Then you read the 6th character
 it’s “D”. “ABABAD”. It doesn’t match “ABABAC”.

In a naive algorithm, you would sigh, go back to the 2nd character of the book, and start all over again. But KMP is smarter. It thinks: “Wait. I just matched ‘ABABA’. I know what I saw. The last part ‘ABA’ could be the start of a new match! I don’t need to go back. I just need to shift my pattern so that the beginning ‘ABA’ aligns with the ending ‘ABA’ I just saw.”

This is KMP (Knuth-Morris-Pratt). It never moves backwards in the text pointer. It uses a pre-calculated table to slide the pattern forward as much as possible.

Why do we need it?

  • Ctrl+F: Searching for a keyword in a massive document.
  • DNA Sequencing: Finding a specific gene pattern in a genome of billions of base pairs.
  • Streaming Data: KMP can search a stream of data (like keyboard input) in real-time because it never needs to “rewind” the stream.

How the Algorithm “Thinks”

The algorithm has two phases:

  1. The Preparation (LPS Table): Before reading the book, KMP analyzes the Pattern itself. It builds an array called LPS (Longest Prefix which is also Suffix).

    • For “ABABAC”:
    • LPS Table says: “If you fail at index 5, you can safely jump to index 3.”
  2. The Dance (Search):

    • It matches characters one by one.
    • If characters match: Move forward.
    • If mismatch: Don’t move the text pointer back. Look at the LPS table to see where to reset the pattern pointer.

Engineering Context: Naive vs. KMP

  • Naive Search: Worst case is O(NimesM)O(N imes M). If you search for “AAAAA” in “AAAAAAAAA”, it’s very slow.
  • KMP: Guaranteed O(N+M)O(N + M). The time depends only on the length of the text and the pattern.

Implementation (Python)

def compute_lps(pattern):
    # Longest Prefix Suffix table
    lps = [0] * len(pattern)
    length = 0 # Length of the previous longest prefix suffix
    i = 1
    
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                # Tricky part: Fall back to previous LPS
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps

def kmp_search(text, pattern):
    if not pattern: return 0
    lps = compute_lps(pattern)
    i = 0 # Index for text
    j = 0 # Index for pattern
    
    while i < len(text):
        if pattern[j] == text[i]:
            i += 1
            j += 1
        
        if j == len(pattern):
            return i - j # Match found at this index
            # j = lps[j-1] # To find multiple matches
        
        elif i < len(text) and pattern[j] != text[i]:
            if j != 0:
                j = lps[j-1] # The KMP Jump!
            else:
                i += 1
    return -1

# Usage
print(kmp_search("ABABDABACDABABCABAB", "ABABCABAB"))

Summary

KMP teaches us that failure contains information. Instead of treating a mismatch as a total loss, KMP uses it as a clue to skip ahead. It reminds us that we don’t always need to start from scratch; if we know ourselves (the pattern), we can recover from errors gracefully.