Luke a Pro

Luke Sun

Developer & Marketer

🇺🇦
EN||

Inverted Index

| , 4 minutes reading.

Inverted Index: The World Flipped Upside Down

The Story: The Index Cards of the 19th Century

Before computers, historians and researchers faced a nightmare. If you wanted to find every mention of “Liberty” in a library of 10,000 books, you would have to read every single one.

To solve this, librarians began creating Concordances. They would take a book, list every unique word, and write down every page number where that word appeared on a small index card.

Instead of: Book A \to [Word 1, Word 2, Word 3…] They flipped it to: Word 1 \to [Book A, Book B, Book D…]

This “flip” changed the history of human knowledge. It allowed a researcher to find a needle in a haystack in seconds. Today, this 19th-century technique—now called the Inverted Index—is the core technology behind Elasticsearch, Lucene, and every search bar you use.

Why do we need it?

In a standard database (like SQL), you usually search for a row by its ID. But in a search engine, you search for a “Document” by its content.

  • The Problem: If you have 100 million blog posts and you search for “Common Algorithms,” a standard database has to scan every single word of every single post (O(N)O(N)). This takes minutes.
  • The Solution: The Inverted Index. When a document is saved, we break it into words (tokens) and store a map of which word belongs to which document. When you search, we just look at the map (O(1)O(1)).

How the Algorithm “Thinks”

The algorithm is a meticulous librarian who performs a “Reverse Mapping”:

  1. Tokenization (The Breakdown): It takes a sentence like “To be or not to be” and breaks it into individual words: [to, be, or, not]. It ignores capital letters and punctuation.

  2. The Great Flip (Inversion): Instead of remembering “Document 1 has these words,” it builds a giant table where each Word is the key.

    • to \to [Doc 1, Doc 2, Doc 5]
    • be \to [Doc 1, Doc 3]
    • or \to [Doc 1]
  3. The Intersection (The Search): When you search for “to be,” the librarian looks at the lists for to and be and finds the Intersection: the documents that appear in both lists.

Engineering Context: The Cost of Speed

The Inverted Index is why Elasticsearch is so fast at searching but relatively slow at writing.

  • The Trade-off: Every time you add a document, the system has to update thousands of lists in the index. This is computationally expensive and uses a lot of disk space.
  • Compression: Because these lists (Postings Lists) can be massive, engineers use clever bit-packing techniques to keep the index small enough to fit in RAM.

Implementation (Python)

from collections import defaultdict

def create_inverted_index(documents):
    """
    documents: Dictionary {doc_id: "content string"}
    """
    index = defaultdict(set)
    
    for doc_id, content in documents.items():
        # 1. Tokenization: Lowercase and split into words
        words = content.lower().split()
        
        # 2. Inversion: Map word -> doc_id
        for word in words:
            index[word].add(doc_id)
            
    return index

def search(query, index):
    query_words = query.lower().split()
    if not query_words:
        return set()
    
    # 3. Intersection: Find docs that contain ALL query words
    results = index.get(query_words[0], set())
    for word in query_words[1:]:
        results = results.intersection(index.get(word, set()))
        
    return results

# Example Usage
# docs = {
#     1: "Common algorithms for engineers",
#     2: "The soul of a new machine",
#     3: "How to find common ground"
# }
# idx = create_inverted_index(docs)
# print(search("common", idx)) # Output: {1, 3}

Summary

The Inverted Index teaches us that if you want to find something fast, you must stop looking through the lens of “what I have” and start looking through the lens of “what people want.” By flipping the world upside down, we turned the impossible task of reading everything into the simple task of looking at a map.