Luke a Pro

Luke Sun

Developer & Marketer

đŸ‡ș🇩

Boyer-Moore Algorithm

| , 4 minutes reading.

Boyer-Moore: The Prophet Who Looks Backwards

The Story: The Efficiency of Exclusion

Imagine you are looking for the word apple in a long sentence. The standard way is to look at the first letter: a. But in 1977, Robert S. Boyer and J Strother Moore thought of something radical: What if we look at the last letter first?

If the 5th letter of the text is x, and you are looking for apple, you instantly know two things:

  1. x is not e. (Match failed)
  2. x isn’t even in the word apple.

Because of point #2, you don’t need to move forward by 1 character. You can jump forward by 5 characters instantly. You just skipped 4 letters without even looking at them.

This is the Heuristic of Exclusion. It is the reason why Boyer-Moore is often the fastest string-searching algorithm in practice. It is the brain behind the “Find” (Ctrl+F) command in most text editors and browsers.

Why do we need it?

Boyer-Moore is the king of Sub-linear Search. In many cases, it doesn’t even need to read every character of the text to find the pattern.

  • Text Editors: When you search a 50MB file, Boyer-Moore makes it feel instantaneous.
  • Grep: The famous Unix tool uses a simplified version of this logic.
  • Large Alphabets: It works exceptionally well with English text because the probability of a “mismatch” triggering a giant jump is very high.

How the Algorithm “Thinks”

The algorithm is a bold strategist who uses two rules to decide how far to jump:

  1. The Bad Character Rule: If the character in the text doesn’t match the one in the pattern, Boyer-Moore looks at the pattern: “Do I contain this character anywhere else?”

    • If no: Jump the entire length of the pattern.
    • If yes: Align the pattern so the character matches, and try again.
  2. The Good Suffix Rule: If you matched the end of the word (the “suffix”) but failed just before it, the algorithm asks: “Does this suffix appear anywhere else in my word?” It uses this knowledge to skip ahead to the next potential alignment.

By combining these two rules, Boyer-Moore spends its time not reading the text.

Engineering Context: The Real-World Champion

While KMP is theoretically elegant, Boyer-Moore is usually faster in Real-World Text.

  • Complexity: Its worst-case is O(N⋅M)O(N \cdot M), but its average case is O(N/M)O(N / M). Yes, you read that right: the longer your pattern, the faster the search becomes (because your jumps get bigger).
  • Implementation: It is harder to implement than KMP, which is why many engineers use a simplified version called Boyer-Moore-Horspool.

Implementation (Python - Simplified Horspool Version)

def horspool_search(text, pattern):
    n, m = len(text), len(pattern)
    if m > n: return -1
    
    # 1. Precompute the Bad Character Table
    # "How far can I jump if I see this character?"
    skip = {char: m for char in set(text)}
    for i in range(m - 1):
        skip[pattern[i]] = m - i - 1
        
    k = m - 1 # Position in text
    while k < n:
        j = m - 1 # Position in pattern (Right to Left!)
        i = k
        
        while j >= 0 and text[i] == pattern[j]:
            i -= 1
            j -= 1
            
        if j == -1:
            return i + 1 # Match found!
            
        # 2. The Leap of Faith
        k += skip.get(text[k], m)
        
    return -1

# Example Usage
# text = "TRUST_HARD_WORK_AND_LUCK"
# pattern = "WORK"
# print(horspool_search(text, pattern)) # Output: 11

Summary

Boyer-Moore teaches us that knowing what to ignore is the highest form of intelligence. By starting at the end and looking for reasons to skip, it proves that we don’t need to see everything to understand the whole. In the world of data, the fastest way to get to the answer is often the path that skips the most questions.