Luke a Pro

Luke Sun

Developer & Marketer


Huffman Coding

4 minute read.

Huffman Coding: The Tree of Brevity

The Story: The Morse Code Problem

In 1836, Samuel Morse invented a code to send messages over wires. He noticed that the letter ‘E’ is the most common letter in English, while ‘Z’ is very rare. So he made ‘E’ a single dot (.). He made ‘Z’ a complex sequence (--..). By making common things short and rare things long, he saved time.

In 1952, David Huffman formalized this idea mathematically. He proved that if you build a specific type of binary tree based on frequency, you can create the optimal prefix-free code for any set of data.

This is Huffman Coding. It is the reason ZIP files are small and JPEG images load fast.

Why do we need it?

  • File Compression: ZIP, GZIP, PNG, JPEG, MP3 all use Huffman coding (or its variants) as a final step to squeeze out bits.
  • Bandwidth Savings: Sending less data over the network means faster websites and lower costs.

How the Algorithm “Thinks”

The algorithm is a Bottom-Up Builder.

  1. Count Frequencies: ‘A’: 5, ‘B’: 9, ‘C’: 12, ‘D’: 13, ‘E’: 16, ‘F’: 45.
  2. The Priority Queue: Put all characters into a Min-Heap sorted by frequency.
  3. Merge:
    • Pick the two smallest nodes (5 and 9).
    • Merge them into a new node (Sum: 14).
    • Put 14 back into the heap.
  4. Repeat: Keep merging the two smallest until only one node (the Root) remains.
  5. Assign Codes:
    • Left branch = ‘0’
    • Right branch = ‘1’

The most frequent characters end up near the top (short codes). The rare characters end up deep at the bottom (long codes).
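The merge loop in steps 2–4 can be traced with a few lines of Python's `heapq`, using the example frequencies from step 1 (the parenthesized labels are just for illustration):

```python
import heapq

# Example frequencies from step 1
freq = {"A": 5, "B": 9, "C": 12, "D": 13, "E": 16, "F": 45}

# Min-heap of (frequency, label) pairs
heap = [(f, ch) for ch, f in freq.items()]
heapq.heapify(heap)

while len(heap) > 1:
    f1, n1 = heapq.heappop(heap)  # smallest
    f2, n2 = heapq.heappop(heap)  # second smallest
    print(f"merge {n1}({f1}) + {n2}({f2}) -> {f1 + f2}")
    heapq.heappush(heap, (f1 + f2, f"({n1}{n2})"))

root_freq, shape = heap[0]
print(root_freq)  # the root's frequency is the sum of all frequencies
```

The first merge combines A(5) and B(9) into a node of weight 14, exactly as in step 3; F(45) is so heavy that it is merged last, which is why it ends up right under the root with a short code.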

Engineering Context: Prefix-Free Codes

A crucial property of Huffman codes is that no code is a prefix of another.

  • If ‘E’ is 0 and ‘A’ is 01, the decoder gets confused when it sees 0… is it ‘E’ or the start of ‘A’?
  • In Huffman, if 0 is taken by ‘E’, no other code starts with 0. This allows for instant, unambiguous decoding without separator markers.
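This is why decoding needs no separator markers: the decoder reads bits one at a time and emits a character the moment the buffer matches a code. A minimal sketch, using a small hypothetical prefix-free table (not one produced by the tree above):

```python
# Hypothetical prefix-free code table for illustration
codes = {"E": "0", "A": "10", "B": "11"}
decode_table = {bits: ch for ch, bits in codes.items()}

def decode(bitstring):
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        # Because no code is a prefix of another, the first match
        # is always the right one -- emit and reset the buffer.
        if buf in decode_table:
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

print(decode("01011"))  # 0 | 10 | 11 -> "EAB"
```

If 'A' were `01` instead, the decoder could not tell after reading `0` whether to emit 'E' or wait for more bits; the prefix-free property removes that ambiguity.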

Implementation (Python)

import heapq
from collections import Counter

class Node:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None
    
    # Comparison operator for Priority Queue
    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(text):
    # 1. Count frequencies
    freq = Counter(text)
    
    # 2. Build Min-Heap
    heap = [Node(char, f) for char, f in freq.items()]
    heapq.heapify(heap)
    
    # 3. Merge nodes
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        
        # Create internal node (char is None)
        merged = Node(None, left.freq + right.freq)
        merged.left = left
        merged.right = right
        
        heapq.heappush(heap, merged)
        
    return heap[0] if heap else None  # Root (None for empty input)

def generate_codes(node, prefix="", code_map=None):
    # Avoid the mutable-default-argument pitfall: a shared default dict
    # would leak codes between separate calls.
    if code_map is None:
        code_map = {}
    if node:
        if node.char is not None:
            code_map[node.char] = prefix
        generate_codes(node.left, prefix + "0", code_map)
        generate_codes(node.right, prefix + "1", code_map)
    return code_map

# Usage
text = "BEEP BOOP BEER!"
root = build_huffman_tree(text)
codes = generate_codes(root)
print(f"Codes: {codes}")
# Common letters like 'E' will have shorter binary strings
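To see the payoff, we can compare the encoded size against plain 8-bit characters. This sketch recomputes the code lengths with a compact tuple-based heap (equivalent to the `Node` version above, just self-contained; the tie-breaker counter keeps tuples comparable):

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {char: code_length} via the same greedy merging as above."""
    freq = Counter(text)
    # (frequency, tie-breaker, {char: depth}) triples
    heap = [(f, i, {ch: 0}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Every symbol in a merged subtree sinks one level deeper
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

text = "BEEP BOOP BEER!"
lengths = huffman_code_lengths(text)
freq = Counter(text)
compressed = sum(freq[ch] * lengths[ch] for ch in freq)
print(f"{compressed} bits vs {len(text) * 8} bits uncompressed")
```

For this 15-character string, the Huffman-coded output needs 40 bits instead of 120: a third of the fixed-width size, before any header overhead for storing the tree.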

Summary

Huffman Coding teaches us that equality is inefficient. Treating every character as equal (8 bits) wastes space. By embracing inequality—giving privileges (short codes) to the frequent and burdens (long codes) to the rare—we achieve the most efficient system possible. It reminds us that context (frequency) defines value.