Luke a Pro

Luke Sun

Developer & Marketer

đŸ‡ș🇩

Hash Functions

| , 3 minutes reading.

Hash Functions: The Identity Thief

The Story: The Logic of the Seal

In ancient times, kings used a wax seal to prove that a letter hadn’t been tampered with. If the seal was broken or different, the message was compromised.

In 1953, Hans Peter Luhn, a researcher at IBM, was looking for a way to search through chemical formulas. He realized that if he could convert complex formulas into small numbers (hashes), he could find them instantly in a table.

A modern Hash Function is like a high-tech version of that wax seal. It takes a book, a photo, or a password, and “digests” it into a short string of characters. If even one pixel in the photo changes, the “Seal” (Hash) changes completely.

Why do we need it?

Hash functions are the “DNA” of the digital world.

  • Data Integrity: How do you know the 2GB file you just downloaded isn’t corrupted? You check its MD5 or SHA256 hash.
  • Security: Databases should NEVER store passwords. They store the hash of the password. If a hacker steals the database, they only see fingerprints, not the actual keys.
  • Efficiency: Before comparing two 100MB files, you compare their 32-byte hashes. If the hashes don’t match, the files definitely don’t match.

How the Algorithm “Thinks”

The algorithm is a mathematical blender.

  1. Absorption: It takes an input of any length.
  2. Churning (Mixing): It subjects the data to a series of mathematical “rounds”—bit shifting, XORing, and modular arithmetic. It mixes the bits so thoroughly that a tiny change in input creates an “Avalanche Effect” in the output.
  3. Compression: It spits out a result of a fixed, predictable length (e.g., 256 bits for SHA-256).

Engineering Context: The Collision War

Since there are infinite possible inputs but only a finite number of hashes (e.g., 22562^{256}), two different inputs could produce the same hash. This is a Collision.

  • Non-Cryptographic (MurmurHash, CityHash): Fast, but “predictable.” Used for HashMaps and Load Balancers where speed is priority.
  • Cryptographic (SHA-2, SHA-3): Slower, but “Collision Resistant.” Used for passwords and digital signatures where security is priority.

Implementation (Python)

import hashlib

def calculate_fingerprint(data):
    # Using SHA-256 for high security and integrity
    hasher = hashlib.sha256()
    
    # We must encode string into bytes
    hasher.update(data.encode('utf-8'))
    
    # Return the hex digest (the readable fingerprint)
    return hasher.hexdigest()

# Example
msg1 = "Hello World"
msg2 = "Hello world" # Only one character changed (capitalization)

print(f"Hash 1: {calculate_fingerprint(msg1)}")
print(f"Hash 2: {calculate_fingerprint(msg2)}")
# Notice how the hashes are completely different (Avalanche Effect)

Summary

Hash functions teach us that identity can be compressed. By turning complexity into a simple fingerprint, we gain the ability to verify, protect, and index the entire world. It reminds us that in a universe of infinite information, a small, reliable signature is the only thing that keeps us sane.