1BRC in Python
This is my attempt at the 1 Billion Rows Challenge in Python.
Inspiration and Attribution
Some ideas for this came from a few sources, listed below:
- Doug Mercer video
- Discussion from the official repo
While I could simply follow these and end up with a working, fast implementation, that would not help me learn. So I am taking ideas from these sources rather than following them exactly, though I expect my implementation will be fairly similar by the time I am done.
Measurement
I’ll use `hyperfine` to benchmark my scripts: each benchmark does one warm-up run followed by three timed runs, and the reported time is the mean of those three runs.
Data Generation
I am using a `create_measurements.py` file that I built myself, since I do not want to install a Java environment on my system. You can find it in the repo, and I even performed a bit of speed analysis on it.
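At its core the generator just writes `station;temperature` rows. A minimal sketch of that idea, with a placeholder station list and distribution (the real script and its details are in the repo):

```python
import random
import sys

# Placeholder station names for illustration only.
STATIONS = ["Hamburg", "Istanbul", "Nairobi", "Perth", "Reykjavik"]

def write_measurements(path, n_rows):
    # Each row is "<station>;<temperature with one decimal place>".
    with open(path, "w") as f:
        for _ in range(n_rows):
            station = random.choice(STATIONS)
            temp = random.uniform(-99.9, 99.9)
            f.write(f"{station};{temp:.1f}\n")

if __name__ == "__main__":
    # e.g. python3 create_measurements_sketch.py measurements.csv 1000000
    write_measurements(sys.argv[1], int(sys.argv[2]))
```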
Goal / Target
I have created a baseline to work from. It is quite poor, averaging around 25 minutes on my machine. I ran Doug Mercer’s “doug booty v4” version on my machine, and it completed in around 2 minutes. My goal is to get to a similar range.
Some things to note:
- My machine is significantly slower than others in the competition.
- I do not have enough RAM to set up a RAM disk like many other competitors do.
These two factors contribute to my script running much slower than the time Doug gets in his video.
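For reference, the baseline works roughly like the sketch below: read the file line by line, split each line on `;`, and track min/max/sum/count per station. This is a simplified illustration, not the exact `calculate_average_baseline.py`, and the output formatting is abbreviated.

```python
import sys

def main(path):
    # station -> [min, max, total, count]
    stats = {}
    with open(path, "r") as f:
        for line in f:
            station, value = line.rstrip("\n").split(";")
            temp = float(value)
            s = stats.get(station)
            if s is None:
                stats[station] = [temp, temp, temp, 1]
            else:
                if temp < s[0]:
                    s[0] = temp
                if temp > s[1]:
                    s[1] = temp
                s[2] += temp
                s[3] += 1
    # Print "station=min/mean/max" for every station in alphabetical order.
    print(", ".join(
        f"{name}={s[0]:.1f}/{s[2] / s[3]:.1f}/{s[1]:.1f}"
        for name, s in sorted(stats.items())
    ))

if __name__ == "__main__":
    main(sys.argv[1])
```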
Diagnostic Tools
Some tools I used:
- Multiple sizes of measurement files
- `hyperfine`
- `time` (GNU tool)
- `viztracer` (Python profiler)
Optimization Ideas
Some ideas for optimizing this code:
- Parallel Processing
- String parsing improvements
- Using PyPy
- Faster file reading
- Using a flamegraph to identify bottlenecks
PyPy Performance
PyPy3 has been shown to improve performance in many projects, so I’ll run most of my tests using PyPy moving forward. Below are the differences in baseline time using PyPy3 versus CPython3.
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
`pypy3 ./calculate_average_baseline.py ./measurements/1B_Measurements.csv` | 979.168 ± 3.948 | 976.766 | 983.724 | 1.00 |
`python3 ./calculate_average_baseline.py ./measurements/1B_Measurements.csv` | 1733.957 ± 11.034 | 1721.578 | 1742.757 | 1.77 ± 0.01 |
File Reading
I initially tested different file read modes by creating three scripts that only read the file line by line without any processing.
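All three scripts share the same shape; the only difference is the mode passed to `open()`. A minimal sketch:

```python
import sys

def read_lines(path, mode):
    # Iterate over every line in the given mode and do nothing with it,
    # so the benchmark isolates the cost of reading itself.
    with open(path, mode) as f:
        for _ in f:
            pass

if __name__ == "__main__":
    # One script per mode: "r" (text), "rb" (binary), "r+b" (binary, read/write).
    read_lines(sys.argv[1], "rb")
```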
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
`pypy3 ./ca_r.py ./measurements/100M_Measurements.csv` | 16.624 ± 0.565 | 15.996 | 17.091 | 1.88 ± 0.06 |
`pypy3 ./ca_rb.py ./measurements/100M_Measurements.csv` | 8.841 ± 0.024 | 8.817 | 8.864 | 1.00 |
`pypy3 ./ca_r+b.py ./measurements/100M_Measurements.csv` | 15.350 ± 0.422 | 14.982 | 15.810 | 1.74 ± 0.05 |
Binary mode (`rb`) gave a significant speedup, but the extra work needed to parse binary data may negate this improvement, so I tested it with full script modifications.
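The main change in the `rb` version is that each line arrives as a `bytes` object, so the splitting and numeric conversion have to work on bytes. A small sketch of the difference:

```python
def parse_text(line):
    # line is a str such as "Hamburg;12.3\n"
    station, value = line.rstrip("\n").split(";")
    return station, float(value)

def parse_binary(line):
    # line is a bytes object such as b"Hamburg;12.3\n"; the station name
    # can stay as bytes until the final output is formatted.
    station, value = line.rstrip(b"\n").split(b";")
    return station, float(value)  # float() accepts ASCII bytes directly

print(parse_text("Hamburg;12.3\n"))     # ('Hamburg', 12.3)
print(parse_binary(b"Hamburg;12.3\n"))  # (b'Hamburg', 12.3)
```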
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
`pypy3 ./calculate_average_r.py ./measurements/100M_Measurements.csv` | 100.081 ± 0.838 | 99.287 | 100.958 | 1.46 ± 0.01 |
`pypy3 ./calculate_average_rb.py ./measurements/100M_Measurements.csv` | 68.539 ± 0.283 | 68.349 | 68.864 | 1.00 |
For 1 billion rows:
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
`pypy3 ./calculate_average_rb.py ./measurements/1B_Measurements.csv` | 666.263 ± 5.518 | 660.911 | 671.934 | 1.00 |
Parallel Processing
Parallel processing is expected to be the biggest single improvement. I created a dispatcher function that chooses the number of worker processes based on the available cores unless a count is specified explicitly.
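The dispatcher follows the usual pattern for this challenge: split the file into byte ranges aligned to line boundaries, hand each range to a worker process, and merge the per-station results at the end. A simplified sketch (the function names here are illustrative, not the ones in `calculate_average_mt.py`):

```python
import multiprocessing as mp
import os
import sys

def chunk_boundaries(path, n_chunks):
    # Split the file into byte ranges that start and end on line boundaries.
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)
            f.readline()          # advance to the start of the next line
            cuts.append(f.tell())
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))

def process_chunk(args):
    # Per-station [min, max, total, count] for lines starting in [start, end).
    path, start, end = args
    stats = {}
    pos = start
    with open(path, "rb") as f:
        f.seek(start)
        for line in f:
            if pos >= end:
                break
            pos += len(line)
            station, value = line.rstrip(b"\n").split(b";")
            temp = float(value)
            s = stats.get(station)
            if s is None:
                stats[station] = [temp, temp, temp, 1]
            else:
                s[0] = min(s[0], temp)
                s[1] = max(s[1], temp)
                s[2] += temp
                s[3] += 1
    return stats

def dispatch(path, workers=None):
    # Default to one worker per available core unless a count is given.
    workers = workers or mp.cpu_count()
    with mp.Pool(workers) as pool:
        parts = pool.map(process_chunk,
                         [(path, s, e) for s, e in chunk_boundaries(path, workers)])
    merged = {}
    for stats in parts:
        for station, s in stats.items():
            m = merged.setdefault(station, [s[0], s[1], 0.0, 0])
            m[0] = min(m[0], s[0])
            m[1] = max(m[1], s[1])
            m[2] += s[2]
            m[3] += s[3]
    return merged

if __name__ == "__main__":
    print(len(dispatch(sys.argv[1])), "stations")
```

Chunk boundaries are found by seeking to an approximate offset and skipping forward to the next newline, so no two workers ever process the same line and no worker has to scan the whole file before the real work starts.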
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
`pypy3 ./calculate_average_mt.py ./measurements/1B_Measurements.csv` | 246.554 ± 0.181 | 246.352 | 246.700 | 1.00 |
Using Memory Mapping (mmap)
Many competitors use `mmap` for file reading. My concern was that I only have 16 GB of RAM, while the file size is 14 GB. However, my system did not crash, and performance improved.
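A sketch of the `mmap`-based reading; since the mapping is demand-paged through the OS page cache, the whole 14 GB file never needs to be resident in RAM at once, which is why this works on a 16 GB machine:

```python
import mmap
import sys

def iter_lines(path):
    # Map the file read-only and yield one line (as bytes) at a time.
    # Pages are faulted in lazily by the OS as they are touched.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            while True:
                end = mm.find(b"\n", start)
                if end == -1:
                    break
                yield mm[start:end]
                start = end + 1

if __name__ == "__main__":
    print(sum(1 for _ in iter_lines(sys.argv[1])))
```

In a parallel setup, each worker can map the file independently and scan only its own byte range.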
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
`pypy3 ./calculate_average_mmap.py ./measurements/1B_Measurements.csv` | 189.611 ± 0.293 | 189.359 | 189.932 | 1.00 |
String Parsing and Using Integers Instead of Floats
String parsing is expensive. Since our data is uniform, I replaced `split()` with direct indexing using `find()`. Additionally, since all values have exactly one decimal place, I converted them to integers (tenths of a degree) instead of floats.
The code is less readable but much faster.
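Roughly, the per-line parsing ends up looking like the sketch below (illustrative, not the exact code from `calculate_average_int.py`): temperatures become integers in tenths of a degree and are only divided by 10 when the final output is formatted.

```python
def parse_line(line):
    # line is a bytes object such as b"Hamburg;-12.3\n"
    sep = line.find(b";")
    station = line[:sep]

    # Every value has exactly one decimal digit, so build an integer in
    # tenths of a degree instead of calling float().
    if line[sep + 1] == ord("-"):
        sign, i = -1, sep + 2
    else:
        sign, i = 1, sep + 1

    if line[i + 1] == ord("."):  # single digit before the decimal point
        value = (line[i] - ord("0")) * 10 + (line[i + 2] - ord("0"))
    else:                        # two digits before the decimal point
        value = ((line[i] - ord("0")) * 100 + (line[i + 1] - ord("0")) * 10
                 + (line[i + 3] - ord("0")))
    return station, sign * value

print(parse_line(b"Hamburg;-12.3\n"))  # (b'Hamburg', -123)
```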
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
`pypy3 ./calculate_average_int.py ./measurements/1B_Measurements.csv` | 144.804 ± 0.524 | 144.327 | 145.365 | 1.00 |
Final Results
With all optimizations, I reduced execution time from 1733 seconds to 144 seconds. This meets my goal and represents a massive improvement! 🚀