Luke a Pro

Luke Sun

Developer & Marketer

šŸ‡ŗšŸ‡¦

Levenshtein Distance

| , 4 minutes reading.

Levenshtein Distance: The Error Measure

The Story: The Typos

Imagine you are a teacher grading a spelling test.

  • The student wrote: ā€œKITTENā€
  • The correct word is: ā€œSITTINGā€

How many points should you deduct?

  1. Substitute ā€˜K’ with ā€˜S’ -> SITTEN
  2. Insert ā€˜I’ after ā€˜S’ -> SIITTEN (Wait, that’s wrong…)
  3. Let’s try again.
    • K -> S (Substitute)
    • I -> I (Match)
    • T -> T (Match)
    • T -> T (Match)
    • E -> I (Substitute)
    • N -> N (Match)
    • End -> Add G (Insert)

Total operations: 3. This count, 3, is the Levenshtein Distance. It is the minimum number of Insertions, Deletions, or Substitutions required to transform one string into another.

Why do we need it?

  • Spell Checkers: ā€œDid you mean…?ā€ If you type ā€œGogleā€, the distance to ā€œGoogleā€ is 1. The distance to ā€œAppleā€ is 4. The computer suggests ā€œGoogleā€.
  • DNA Alignment: Comparing two gene sequences to see how closely related two species are.
  • Plagiarism Detection: Measuring how similar two essays are, even if the student changed a few words.

How the Algorithm ā€œThinksā€

The algorithm uses Dynamic Programming. It builds a grid (matrix) where cell[i][j] represents the distance between the first i characters of Word A and the first j characters of Word B.

The Rules of the Grid:

  • If characters match: No cost. cell[i][j] = cell[i-1][j-1]
  • If mismatch: We take the minimum of three options + 1 cost:
    1. Insert: Look Left (cell[i][j-1])
    2. Delete: Look Up (cell[i-1][j])
    3. Replace: Look Diagonal (cell[i-1][j-1])

Engineering Context: The Complexity

  • Time: O(NimesM)O(N imes M) where N and M are string lengths.
  • Space: O(NimesM)O(N imes M) for the full matrix.
  • Optimization: You only need the previous row to calculate the current row. So space can be reduced to O(min(N,M))O(min(N, M)).

Implementation (Python)

def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)
    
    # Create a matrix of size (m+1) x (n+1)
    # dp[i][j] = distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialize first row and column
    for i in range(m + 1):
        dp[i][0] = i # Transforming s1[:i] to "" requires i deletions
    for j in range(n + 1):
        dp[0][j] = j # Transforming "" to s2[:j] requires j insertions

    # Fill the matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                # Characters match, no cost
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # Mismatch, take min of Insert, Delete, Replace
                dp[i][j] = 1 + min(
                    dp[i][j - 1],    # Insert
                    dp[i - 1][j],    # Delete
                    dp[i - 1][j - 1] # Replace
                )
    
    return dp[m][n]

# Usage
print(f"Distance: {levenshtein_distance('kitten', 'sitting')}") # Output: 3

Summary

Levenshtein Distance teaches us that difference can be quantified. By breaking down vague concepts like ā€œsimilarityā€ into concrete atomic actions (insert, delete, replace), we can teach computers to understand human error. It reminds us that in a world full of noise, the ability to measure the gap between expectation and reality is the first step to bridging it.