Roaring Bitmaps in Inverted Indexes
How roaring bitmaps are a very good way to optimise inverted indexes
· 5 min read
The whole point of this blog is to show that how roaring bitmaps are a very good way to optimise inverted indexes.
Outline #
- What’s an inverted index.
- What are roaring bitmaps
- Will need some research here.
- How roaring bitmaps make it cheaper to store inverted indexes – both in memory and on disk.
Understanding Roaring Bitmaps from First Principles #
The Problem: Storing Sets of Integers #
Let’s start with a simple problem: how do you efficiently store and query sets of integers?
Consider a search engine that needs to track which documents contain specific words:
- Word “database” appears in documents: {1, 5, 42, 100, 5000, 10000}
- Word “index” appears in documents: {2, 5, 99, 100, 4999, 10000}
Traditional Approach: Simple Bitmaps #
A bitmap uses one bit per possible value:
- Bit position = integer value
- Bit value: 1 if present, 0 if absent
Example: Set {1, 3, 5}
Position: 0 1 2 3 4 5 6 7
Bitmap: 0 1 0 1 0 1 0 0
The Problem with Simple Bitmaps #
For large universes, bitmaps waste massive space:
- Universe of 1 billion integers = 1 billion bits = ~125MB
- Even if storing only 10 values!
Storing {1, 1000000000} in a bitmap:
Position 0: 0
Position 1: 1 <-- First value
Position 2-999999999: 0 (wasted space!)
Position 1000000000: 1 <-- Second value
Total: 125MB to store just 2 integers
Enter: Roaring Bitmaps #
Roaring bitmaps solve this by partitioning the universe and using different storage strategies based on density.
Core Insight: Partition by High Bits #
Split 32-bit integers into:
- High 16 bits: Container key
- Low 16 bits: Value within container
Number: 131,075 (0x20003 in hex)
High 16 bits: 0x0002 (container 2)
Low 16 bits: 0x0003 (value 3 in container)
Three Container Types #
Each container can store up to 65,536 values (2^16) using the most efficient format:
Container Decision Tree:
Number of elements in container?
├─ < 4,096 elements ──> Array Container
│ (sorted array of 16-bit integers)
├─ ≥ 4,096 elements ──> Bitmap Container
│ (65,536 bits = 8KB)
└─ All 65,536 values ─> Run Container
(for consecutive sequences)
Visual Example #
Let’s store the set {1, 2, 3, 1000, 65536, 65537, 131072}:
Step 1: Partition by high 16 bits
Container 0 (0x0000): {1, 2, 3, 1000} → Array Container
Container 1 (0x0001): {0, 1} → Array Container
Container 2 (0x0002): {0} → Array Container
Step 2: Roaring Bitmap Structure
┌─────────────────────────────────────────┐
│ Roaring Bitmap Header │
├─────────────────────────────────────────┤
│ Container Count: 3 │
├─────────────────────────────────────────┤
│ Container Keys: [0, 1, 2] │
├─────────────────────────────────────────┤
│ Container Types & Pointers │
└─────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────┐ ┌──────┐
│Array: 4 │ │Array:│ │Array:│
│[1,2,3, │ │ [0,1]│ │ [0] │
│ 1000] │ └──────┘ └──────┘
└──────────┘
Operations Efficiency #
Union (OR) #
Merge containers with same key, copy others:
A = {1, 1000, 65536}
B = {2, 1000, 131072}
Container view:
A: Container 0: [1, 1000], Container 1: [0]
B: Container 0: [2, 1000], Container 2: [0]
Union process:
Container 0: [1, 1000] ∪ [2, 1000] = [1, 2, 1000]
Container 1: [0] (from A only)
Container 2: [0] (from B only)
Result: {1, 2, 1000, 65536, 131072}
Intersection (AND) #
Only process containers that exist in both:
Intersection process:
Container 0: [1, 1000] ∩ [2, 1000] = [1000]
Container 1: No match in B, skip
Container 2: No match in A, skip
Result: {1000}
Memory Savings Example #
Storing web crawler results for 10 million URLs:
- Average 1000 URLs contain each word
- Traditional bitmap: 10M bits = 1.25MB per word
- Roaring bitmap: ~2KB per word (500x smaller!)
Traditional Bitmap (word "database"):
[0|0|1|0|0|0|0|1|0|0|0|0|0|0|1|...] × 10 million = 1.25MB
Roaring Bitmap:
Container 0: Array[15, 97, ...] (8 bytes)
Container 5: Array[200, 847, ...] (12 bytes)
...
Total: ~2KB
Implementation Sketch #
class RoaringBitmap:
def __init__(self):
self.containers = {} # key -> container
def add(self, value):
high = value >> 16
low = value & 0xFFFF
if high not in self.containers:
self.containers[high] = ArrayContainer()
container = self.containers[high]
container.add(low)
# Convert to bitmap if too many elements
if container.cardinality() > 4096:
self.containers[high] = container.to_bitmap()Why “Roaring”? #
The name comes from RoaringBitmap’s ability to efficiently “roar” through large datasets - it’s optimized for:
- Fast sequential access
- Cache-friendly operations
- SIMD instruction support
- Excellent compression ratios
Key Advantages #
- Space efficient: Adapts storage to data density
- Fast operations: Leverages CPU cache and SIMD
- Serialization friendly: Compact on-disk format
- Industry proven: Used by Apache Lucene, Druid, Spark
The combination of partitioning and adaptive containers makes roaring bitmaps ideal for:
- Search engine inverted indexes
- Database bitmap indexes
- Analytics engines
- Time-series databases
I was implementing a domain specific inverted index at work and discovered roaring bitmaps as a very optimised way to store document IDs in the index.
Wanted to write this post talking about roaring bitmaps and how they optimise on-disk storage for bitmaps.
Let me illustrate the problem I was trying to solve.
- “apple” -> 1,4,10
- “banana” -> 1,10,16
documents with “apple banana” = set intersection
Map where keys are tokens and the values are sets,
One cheaper way to represent a set is to use a bitmap where a bit being set indicates that presence of that index id.
But sparse bitmaps with large ranges can quickly become inefficient from a storage pov.
that’s where roaring bitmaps come in, where instead of storing bitmaps in a flat array of bits, we partition the space into different containers and then use different compression techniques to compress those containers
- Partitioning
- Array Container
- BitMap Container
- Run Container (uses run length encoding for compression)
- Some notes in the paper indicate how they use specific x86 instructions to count the number of bits set to 1
- File Layout
- Talk about how the set operations are optimised