Luke a Pro

Luke Sun

Developer & Marketer

đŸ‡ș🇩

Locality Sensitive Hashing (LSH)

| , 4 minutes reading.

Locality Sensitive Hashing: Birds of a Feather

The Story: The Logic of the Artist’s Palette

Imagine you are an artist with 1 million paintings. Someone brings you a new sketch and asks: “Which of your paintings is most similar to this?”

In a normal database (a Map), you would hash the sketch. But if the sketch has one extra pencil stroke, the hash changes completely. You would find nothing.

In 1998, Indyk and Motwani designed a “broken” hash function on purpose. Instead of maximizing the difference, they wanted to maximize the Collision. They imagined projecting the paintings onto a low-dimensional wall. If two paintings look similar from 10 different angles, they are probably the same painting.

This is Locality Sensitive Hashing (LSH). It is the art of grouping “Close Enough” items together without checking every pair (O(N2)O(N^2)), which would be impossible at scale.

Why do we need it?

LSH is the engine for Similarity Search.

  • Recommendation Systems: “Find users who have similar movie tastes to Luke.”
  • Copyright Detection: “Is this YouTube video a slightly cropped version of a copyrighted movie?”
  • Plagiarism Check: “Is this essay 90% similar to an existing article?”
  • Image Search: Finding “Visually similar” photos.

How the Algorithm “Thinks”

The algorithm is a creative grouper.

  1. Vectorization: It turns data (text, images) into a list of numbers (Vectors).
  2. Projection (The Shadow): It projects these vectors onto random hyperplanes. Think of it as taking a “snapshot” of the data from a specific angle.
  3. The Signature: It records the result of these projections as a short signature.
  4. The Bucket: Items with the same signature are put into the same “Bucket.”
  5. The Search: To find something similar, you only look inside the same bucket. You ignore the billions of items in other buckets.

Engineering Context: The Precision-Recall Trade-off

  • Speed: Searching a bucket of 1,000 items is much faster than searching a database of 1 billion.
  • Accuracy: You might miss some similar items (Recall) or find some slightly different items (Precision).
  • Optimization: Engineers use multiple “Bands” of LSH. If two items match in any band, they are considered candidates. This is the MinHash + LSH combo used by Google and Amazon.

Implementation (Python Sketch)

import numpy as np

class SimpleLSH:
    def __init__(self, input_dim, num_projections=10):
        # Create random hyperplanes (the 'angles' we look from)
        self.projections = np.random.randn(num_projections, input_dim)
        self.buckets = {}

    def _get_hash(self, vector):
        # Project vector and get a binary signature (above/below plane)
        sig = (np.dot(self.projections, vector) > 0).astype(int)
        return "".join(sig.astype(str))

    def add(self, label, vector):
        h = self._get_hash(vector)
        if h not in self.buckets:
            self.buckets[h] = []
        self.buckets[h].append(label)

    def query(self, vector):
        h = self._get_hash(vector)
        return self.buckets.get(h, [])

# Usage
lsh = SimpleLSH(input_dim=128, num_projections=8)
vec_luke = np.random.randn(128)
vec_similar = vec_luke + 0.1 * np.random.randn(128) # Slightly different

lsh.add("Luke", vec_luke)
print(f"Similar to query: {lsh.query(vec_similar)}") # Likely returns ['Luke']

Summary

LSH teaches us that sometimes, the details don’t matter. By blurring the lines of identity, we gain the ability to find connections in a world of overwhelming complexity. It reminds us that in the digital age, being “Similar” is often more useful than being “Equal.”