High Performance JSONL Processing via Byte Offsets
Processing TB-scale JSONL files efficiently
· 10 min read
This post is a work in progress; I’ll be updating it as I go along.
I work a lot with JSONL files at work. This performance optimisation guide collects the tricks I’ve discovered over the past two years for processing them efficiently.
1. Introduction #
JSONL is a format that stores JSON objects one per line, delimited by newlines. It looks like the following:
{"name": "John", "age": 30}
{"name": "Jane", "age": 25}
{"name": "Jim", "age": 35}
It’s a simple way to store data that is too large to fit in memory. It may not be as efficient as columnar formats like Parquet or ORC, but if you’re dealing with upstream APIs that return JSON data, JSONL is a convenient way to persist it.
There is no RFC for JSONL itself; the JSON Lines convention is documented at jsonlines.org, and the closely related JSON Text Sequences format (which uses a record-separator byte rather than newlines) is defined in RFC 7464.
The primary problem I want to discuss optimisations for is building indexes on top of JSONL files to enable fast random access to the data.
Usually you’ll be dealing with JSONL files that are larger than available memory, so you want an index that lets you jump straight to a specific record without reading everything before it.
You definitely cannot loop over every line in the file to reach a specific record; at terabyte scale that is far too slow.
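To make the idea concrete before diving into the details, here is a minimal sketch of the core approach (the function names are mine, not from any library): one sequential pass records the byte offset where each line starts, and every later lookup is a single seek.

import json

def build_offset_index(path):
    """One sequential pass: record the byte offset where each line starts."""
    offsets = []
    pos = 0
    with open(path, 'rb') as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_record(path, offsets, record_id):
    """Jump straight to a record instead of scanning everything before it."""
    with open(path, 'rb') as f:
        f.seek(offsets[record_id])
        return json.loads(f.readline())

The rest of this post is about making both halves of this sketch fast and memory-efficient at terabyte scale.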
2. Theoretical Foundation #
2.1 File I/O Performance Characteristics #
Understanding file I/O at the system level is crucial for optimization. When Python’s open() reads a file, it doesn’t directly access the disk - multiple layers of abstraction sit between your code and the physical storage.
System Calls and Kernel Buffers #
Every file read operation translates to system calls - typically read() or pread(). These calls have overhead:
- Typical system call overhead: ~100-300ns on modern Linux
- Sequential read throughput: ~3-5 GB/s on NVMe SSD
- Random read IOPS: ~500K-1M on NVMe SSD
The kernel maintains buffers to amortize this cost. When you read 100 bytes, the kernel might fetch 4KB (page size) or more. This prefetching helps sequential access but becomes overhead for random access patterns.
Consider this measurement on a 10GB JSONL file:
# Sequential reading - kernel readahead helps
with open('large.jsonl', 'rb') as f:
while f.read(4096): # 4KB chunks
pass
# Result: ~3.2 GB/s throughput, 180MB kernel buffer usage
# Random seeking - readahead becomes waste
# (random_offsets: a pre-sampled list of byte positions within the file)
with open('large.jsonl', 'rb') as f:
for offset in random_offsets:
f.seek(offset)
f.read(100) # Small read after seek
# Result: ~45 MB/s throughput, 890MB kernel buffer usage
The random access pattern shows 70x throughput degradation despite reading less data. Why? Each seek invalidates the kernel’s readahead buffer, forcing new page faults.
Page Cache Behavior #
Linux’s page cache acts as a transparent buffer between applications and disk. Key behaviors:
- Cache pollution: Sequential scans can evict hot data from cache
- MADV_SEQUENTIAL vs MADV_RANDOM: Hints that alter readahead behavior
- O_DIRECT bypass: Eliminates cache but requires aligned I/O
For JSONL processing, we can exploit page cache predictably:
# fd: a raw file descriptor from os.open(); file_size: os.fstat(fd).st_size
# Force aggressive readahead for a sequential pass
os.posix_fadvise(fd, 0, file_size, os.POSIX_FADV_SEQUENTIAL)
# After a region has been processed, drop its pages so an index-building
# pass doesn't pollute the cache
os.posix_fadvise(fd, 0, file_size, os.POSIX_FADV_DONTNEED)
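When the file is accessed through an mmap rather than plain read() calls, the same kind of hint can be applied per mapping. A minimal sketch, assuming Linux and Python 3.8+ where mmap.madvise() is exposed:

import mmap

with open('large.jsonl', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Tell the kernel we'll be jumping around, so it can skip readahead
    mm.madvise(mmap.MADV_RANDOM)
    # ... offset-based lookups here ...
    # Switch the hint back before a full sequential scan
    mm.madvise(mmap.MADV_SEQUENTIAL)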
Cost Analysis #
| Operation | Time (NVMe) | Time (HDD) | Memory Impact |
|---|---|---|---|
| Sequential 4KB read | ~1.2μs | ~10μs | +4KB cache |
| Random 4KB read | ~70μs | ~4ms | +4KB cache + eviction |
| Seek operation | ~10μs | ~3ms | Readahead buffer lost |
| mmap page fault | ~3μs | ~10μs | +4KB RSS |
The 60x penalty for random reads on NVMe (400x on HDD) drives our offset-based optimization strategy.
2.2 JSONL Format Advantages #
JSONL’s structure makes it uniquely suitable for byte-offset optimization compared to other formats.
Newline Delimited Structure #
Each record boundary is a single byte (\n, 0x0A). This simplicity enables:
# Fast boundary detection using SIMD-optimized memchr
def find_newlines_fast(mmap_obj, start, end):
# On x86_64, glibc's memchr uses SSE2/AVX2
# Processes 32-64 bytes per cycle
newlines = []
pos = start
while pos < end:
nl = mmap_obj.find(b'\n', pos, end)
if nl == -1:
break
newlines.append(nl)
pos = nl + 1
return newlines
# Benchmark: Scanning 1GB for newlines
# Python loop: 2.3 seconds
# memchr-based: 0.08 seconds (28x faster)
Compare with other formats:
- CSV: Variable field counts, escaped newlines complicate boundary detection
- XML/JSON: Nested structures require parsing to find record boundaries
- Parquet: Binary format, requires metadata parsing for row groups
No Parsing Required for Navigation #
JSONL’s independence between records means:
- Valid partial reads: Can start reading from any newline
- Corruption isolation: Bad record doesn’t affect others
- Streaming compatible: No closing tags or brackets needed
# This works with JSONL, fails with regular JSON
import json, os

file_size = os.path.getsize('data.jsonl')
with open('data.jsonl', 'rb') as f:
    f.seek(file_size // 2)             # Jump to the middle of the file
    f.readline()                       # Discard the (likely partial) line
    record = json.loads(f.readline())  # The next line is a complete, valid record
Natural Parallelization Boundaries #
JSONL files can be split at any newline without context:
from typing import List, Tuple

def calculate_split_points(file_size: int, num_workers: int) -> List[Tuple[int, int]]:
    """
    Calculate newline-aligned byte ranges for parallel processing.
    Each worker gets a roughly equal-sized chunk of whole records.
    """
    chunk_size = file_size // num_workers
    splits = []
    start = 0
    with open('data.jsonl', 'rb') as f:
        for i in range(num_workers):
            end = (i + 1) * chunk_size if i < num_workers - 1 else file_size
            # Snap the end forward to the next newline so no record is split
            if end < file_size:
                f.seek(end)
                f.readline()  # consume the remainder of the current line
                end = f.tell()
            splits.append((start, end))
            start = end  # the next chunk begins exactly where this one ended
    return splits
# Perfect work distribution without parsing overhead
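A usage sketch for these splits (the worker function and pool setup are illustrative, not a prescribed API): since each (start, end) range is newline-aligned, every worker opens its own file handle and processes only the whole lines inside its byte range.

import json
import os
from multiprocessing import Pool

def count_records(args):
    path, start, end = args
    n = 0
    with open(path, 'rb') as f:
        f.seek(start)  # start is newline-aligned, so this is a record boundary
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            json.loads(line)  # stand-in for the real per-record work
            n += 1
    return n

if __name__ == '__main__':
    path = 'data.jsonl'
    splits = calculate_split_points(os.path.getsize(path), num_workers=8)
    with Pool(processes=8) as pool:
        counts = pool.map(count_records, [(path, s, e) for s, e in splits])
    print(sum(counts))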
2.3 Offset Index Design #
The offset index is the core data structure enabling fast random access. Design decisions here impact memory usage, lookup speed, and update complexity.
B-tree vs Hash Map for Offset Storage #
Hash Map Approach:
# Simple: record_number -> byte_offset
offsets = {
0: 0,
1: 1247,
2: 2891,
# ...
}
# Pros: O(1) lookup
# Cons: No range queries, high memory overhead (28 bytes/entry in Python)
B-tree Approach:
# Using sorted array for simplicity (B-tree principle)
import numpy as np
offsets = np.array([
(0, 0),
(1, 1247),
(2, 2891),
# ...
], dtype=[('record', 'u4'), ('offset', 'u8')])
# Pros: Cache-friendly, 12 bytes/entry, range queries
# Cons: O(log n) lookup
Benchmark with 10M records:
Hash Map:
- Memory: 268MB
- Lookup: 41ns average
- Construction: 1.8s

NumPy Array:
- Memory: 114MB (2.3x less)
- Lookup: 124ns average (3x slower)
- Construction: 0.3s (6x faster)
- Binary search benefits from CPU cache
For most use cases, the sorted array wins due to memory efficiency and cache locality.
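For completeness, here is what the lookup on the sorted structured array looks like; a small sketch using NumPy's searchsorted (binary search) against the offsets array defined above:

import numpy as np

def lookup_offset(offsets: np.ndarray, record_id: int) -> int:
    """Binary search the sorted structured array for a record's byte offset."""
    idx = np.searchsorted(offsets['record'], record_id)
    if idx == len(offsets) or offsets['record'][idx] != record_id:
        raise KeyError(record_id)
    return int(offsets['offset'][idx])

If record numbers are simply 0..N-1, the 'record' column can be dropped entirely and the array position doubles as the key, cutting the per-entry cost from 12 to 8 bytes.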
Memory Overhead Calculations #
Given a JSONL file with N records averaging S bytes each:
def calculate_index_overhead(num_records: int, avg_record_size: int) -> dict:
"""Calculate memory overhead for different index strategies."""
file_size = num_records * avg_record_size
# Full index: every record
full_index_size = num_records * 8 # 8 bytes per offset
# Sampled index: every Kth record
sample_rate = max(1, avg_record_size // 1024) # Sample if records > 1KB
sampled_index_size = (num_records // sample_rate) * 8
# Hierarchical: two-level index
block_size = 65536 # 64KB blocks
num_blocks = file_size // block_size
l1_size = num_blocks * 8 # Block offsets
l2_size = num_blocks * 16 # Min/max record IDs per block
hierarchical_size = l1_size + l2_size
return {
'full_index_mb': full_index_size / 1024 / 1024,
'overhead_pct': (full_index_size / file_size) * 100,
'sampled_mb': sampled_index_size / 1024 / 1024,
'hierarchical_mb': hierarchical_size / 1024 / 1024
}
# Example: 10GB file, 10M records, ~1KB each
# Output: {'full_index_mb': 76.3, 'overhead_pct': 0.78,
# 'sampled_mb': 76.3, 'hierarchical_mb': 3.6}
Index Granularity Trade-offs #
The granularity of the index determines the trade-off between memory usage and access speed:
Fine-grained (every record):
# 8 bytes per record overhead
# Direct O(1) access to any record
# Best for: Random access patterns, small files
offsets = np.fromfile('full_index.bin', dtype='u8')
record_offset = offsets[record_id] # Direct lookup
Coarse-grained (every Nth record):
# 8 bytes per N records
# Requires scanning up to N records
# Best for: Sequential access, huge files
SAMPLE_RATE = 1000
# checkpoints: array of byte offsets, one entry per SAMPLE_RATE records
checkpoint_id = record_id // SAMPLE_RATE
checkpoint_offset = checkpoints[checkpoint_id]
# Scan forward to exact record
with open('data.jsonl', 'rb') as f:
f.seek(checkpoint_offset)
for _ in range(record_id % SAMPLE_RATE):
f.readline()
record = f.readline()
Adaptive granularity:
from pathlib import Path
import numpy as np

def build_adaptive_index(filepath: Path) -> np.ndarray:
    """
    Build an index with variable sampling based on record size variance:
    dense sampling for variable-size records, sparse for uniform ones.
    """
    sizes = []
    offsets = []
    sample_interval = 1
    record_idx = 0
    with open(filepath, 'rb') as f:
        while line := f.readline():
            sizes.append(len(line))
            # Adjust sampling based on the size variance of the last 100 records
            if len(sizes) >= 100:
                cv = np.std(sizes[-100:]) / np.mean(sizes[-100:])
                sample_interval = 1 if cv > 0.5 else min(1000, sample_interval * 2)
            if record_idx % sample_interval == 0:
                offsets.append(f.tell() - len(line))  # offset of this line's start
            record_idx += 1
    return np.array(offsets, dtype='u8')
Performance comparison:
| Index Type | Memory (10M records) | Lookup Time | Scan Distance |
|---|---|---|---|
| Full | 76MB | 124ns | 0 records |
| Every 100th | 0.76MB | 124ns + 18μs | ≤99 records |
| Every 1000th | 0.076MB | 124ns + 180μs | ≤999 records |
| Adaptive | 2-20MB | 124ns + variable | 0-999 records |
The optimal choice depends on access patterns:
- Random uniform: Full index
- Random clustered: Adaptive index
- Mostly sequential: Coarse index
- Mixed: Hierarchical index with a hot-path cache (sketched below)
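A rough sketch of that last combination, reusing the coarse checkpoints array and SAMPLE_RATE from the example above, with an LRU cache serving the hot path (the function name is illustrative):

from functools import lru_cache

@lru_cache(maxsize=100_000)
def resolve_offset(record_id: int) -> int:
    """Cold path: start at the nearest checkpoint and scan forward to the record."""
    with open('data.jsonl', 'rb') as f:
        f.seek(checkpoints[record_id // SAMPLE_RATE])
        for _ in range(record_id % SAMPLE_RATE):
            f.readline()
        return f.tell()

# Repeated lookups of hot records hit the cache and skip the scan entirely.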
This theoretical foundation establishes why byte offsets provide significant performance gains for JSONL processing. The key insights:
- Random I/O is 60-400x slower than sequential
- JSONL’s newline boundaries enable perfect parallelization
- Index memory overhead is <1% of file size with proper design
The next section will implement these concepts with production-ready code and benchmarks.
3. Basic Implementation: Offset Tracking #
3.1 Building the Offset Index #
# Code: Efficient offset scanner using buffered reads
# Discussion of buffer size optimization
3.2 Index Serialization #
- Binary format design for offset files
- Versioning and compatibility
- Compression considerations (zstd vs lz4)
3.3 Random Access Implementation #
# Code: Seek-based record retrieval
# Handling partial reads and buffer boundaries
Benchmark Set 1: Sequential vs Random Access #
- Dataset: 10GB, 100GB, 1TB JSONL files
- Metrics: Throughput, latency percentiles, memory usage
- Comparison with json.loads() line-by-line
4. Memory-Mapped File Optimization #
4.1 mmap Fundamentals #
- Virtual memory and page faults
- MAP_PRIVATE vs MAP_SHARED implications
- Platform-specific considerations (Linux vs macOS)
4.2 mmap-based Implementation #
# Code: mmap with offset index
# Zero-copy string extraction techniques
4.3 Memory Pressure Handling #
- madvise() strategies
- Working set size management
- OOM killer considerations
Benchmark Set 2: Traditional I/O vs mmap #
- Cold cache vs warm cache performance
- Memory usage patterns
- Impact of file size on performance
5. Parallel Processing Architecture #
5.1 Chunking Strategy #
- Optimal chunk size determination
- Offset-based work distribution
- Load balancing considerations
5.2 Multiprocessing Implementation #
# Code: Process pool with offset chunks
# Shared memory for index data
# Queue management and back-pressure
5.3 Threading vs Multiprocessing #
- GIL implications for I/O bound work
- Memory sharing trade-offs
- Context switching overhead
Benchmark Set 3: Scaling Analysis #
- Linear scalability testing (1-64 cores)
- Amdahl’s Law in practice
- Network storage (NFS/S3) impact
6. Advanced Optimizations #
6.1 Adaptive Index Granularity #
- Dynamic sampling strategies
- Bloom filters for existence checks
- Hierarchical indexing for huge files
6.2 Compression-Aware Processing #
- Reading gzipped JSONL efficiently
- Block-level compression with random access
- Index modification for compressed files
6.3 Incremental Processing #
- Change detection using inotify/FSEvents
- Delta index updates
- Checkpointing and recovery
Benchmark Set 4: Advanced Features #
- Compressed vs uncompressed performance
- Incremental update overhead
- Recovery time objectives
7. Production Implementation #
7.1 Error Handling #
- Corrupt record recovery
- Partial write detection
- Index consistency validation
7.2 Monitoring and Observability #
# Code: Metrics collection integration
# Tracing with OpenTelemetry
# Performance regression detection
7.3 Cache Management #
- LRU cache for hot records
- Distributed caching with Redis
- Cache invalidation strategies
8. Case Study: Data Pipeline Optimization #
8.1 Before: Naive Implementation #
- Architecture diagram
- Bottleneck analysis
- Baseline metrics
8.2 After: Offset-Based Solution #
- Refactored architecture
- Implementation challenges
- Performance improvements
8.3 Lessons Learned #
- Unexpected bottlenecks
- Tuning parameters that mattered
- Maintenance considerations
9. Comparative Analysis #
9.1 Alternative Approaches #
- Pandas read_json(lines=True, chunksize=N)
- Dask DataFrame partitioning
- Apache Arrow memory format
- DuckDB’s JSON reader
9.2 Decision Matrix #
- When to use byte offsets
- When to use alternatives
- Hybrid approaches
10. Performance Results Summary #
10.1 Comprehensive Benchmark Suite #
- Test methodology
- Hardware specifications
- Dataset characteristics
10.2 Results Visualization #
- Throughput vs file size graphs
- Latency distribution plots
- Memory usage over time
- CPU utilization patterns
10.3 Statistical Analysis #
- Confidence intervals
- Variance analysis
- Performance predictability
11. Future Directions #
- GPU-accelerated JSON parsing integration
- SIMD optimizations for offset scanning
- Distributed index management
- Integration with Apache Arrow
Appendix A: Complete Implementation #
- Full production-ready code
- PyPI package structure
- CI/CD configuration
Appendix B: Benchmark Reproduction #
- Dataset generation scripts
- Benchmark harness code
- Environment setup instructions