Luke a Pro

Luke Sun

Developer & Marketer

🇺🇦
EN||

HyperLogLog (HLL)

| , 3 minutes reading.

HyperLogLog (HLL)

Introduction: The “Unique Visitors” Problem

You are the CTO of a social media site. You want to count how many unique users visited a specific page today (Cardinality Estimation).

  • Set Approach: Store every User ID in a Hash Set. 100 million IDs imesimes 8 bytes \approx 800MB. For 10,000 pages, you need 8TB of RAM. Impossible.
  • HyperLogLog: Count those 100 million users using only 1.5 KB of memory with a 2%\approx 2\% error.

HyperLogLog is a probabilistic algorithm for the count-distinct problem. It’s used by Redis, BigQuery, and AWS Redshift to handle massive scale.

What Problem does it solve?

  • Input: A stream of elements.
  • Output: An estimate of how many unique elements have been seen.
  • The Promise: Fixed, tiny memory usage regardless of how many items you count.

Core Idea (Intuition)

Think of Flipping a Coin.

  • If I tell you “I flipped a coin and got 3 Tails in a row at the start (TTT…),” you’d guess I flipped it maybe 8 times.
  • If I tell you “I got 20 Tails in a row (TTTTTTTTTTTTTTTTTTTT…),” you’d guess I flipped it millions of times.

HyperLogLog applies this to data:

  1. Hash an item to a binary string.
  2. Look for the number of leading zeros.
  3. The more zeros you see, the more unique items likely exist in the set.
  4. To avoid luck, it splits the data into many “buckets” and averages the results.

Why use it?

  • ✅ Memory Efficiency: It is the undisputed king of space. It can count billions of items in a space smaller than a single tweet.
  • ✅ Mergeable: You can calculate HLL for “Monday” and HLL for “Tuesday,” and merge them to get the unique count for the “Whole Week” perfectly.

Typical Business Scenarios

  • ✅ DAU/MAU Counting: Estimating Daily/Monthly Active Users in real-time.

  • ✅ Ad-Tech: Counting unique people who saw an ad across billions of impressions.

  • ✅ Network Security: Counting unique IP addresses connecting to a server to detect a DDoS attack.

  • ❌ Exact Counts Needed: If you need the exact number for billing or legal reasons, HLL is not for you. It has a standard error of 1.04/m1.04/\sqrt{m} (about 2% for typical configs).

Performance & Complexity

  • Time: O(1)O(1) to add an item; O(1)O(1) to query.
  • Space: O(loglogN)O(\log \log N) (practically constant, e.g., 1.5KB).

Summary

“HyperLogLog is the ‘Estimation Miracle’. It turns the impossible task of counting billions into a tiny statistical game of counting zeros. Use it when ‘close enough’ is worth 10,000x memory savings.”