Julia: Flajolet-Martin Algorithm
21 Dec 2019

I chose to pick up Julia because of its high performance. The problem with Python is that it works nicely 95 percent of the time, when all the computationally expensive operations can be replaced by calls into C libraries. However, it fails completely when the remaining 5 percent of nasty situations show up. Recently, one such annoying instance showed up for me, so I am migrating my code to Julia to see if there is a performance boost.
The count distinct problem
When counting the number of distinct elements in a stream, the usual approach is to keep a set in memory and insert each incoming element into it. This requires storing all unique elements in main memory, which can be infeasible when the stream is huge. The Flajolet–Martin algorithm estimates the number of unique elements in a stream without storing all of them. Its basic idea is as follows: we hash each record and count the number of trailing zeros in the hashed bits. If the maximum number of trailing zeros observed is \(R\), then the number of unique elements is estimated as \(2^R\).
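The idea above can be sketched in a few lines of Python. This is a minimal single-hash version; the choice of MD5 as the hash function is my own assumption (the post does not specify one), and practical variants average many hash functions to reduce variance:

```python
import hashlib

def trailing_zeros(x: int) -> int:
    """Count trailing zero bits of an integer (convention: 0 for x == 0)."""
    if x == 0:
        return 0
    n = 0
    while x & 1 == 0:
        x >>= 1
        n += 1
    return n

def flajolet_martin(stream) -> int:
    """Estimate the number of distinct elements as 2**R, where R is the
    maximum number of trailing zeros seen in any element's hash."""
    R = 0
    for item in stream:
        # MD5 is an arbitrary choice here; any well-mixing hash works
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest(), "big")
        R = max(R, trailing_zeros(h))
    return 2 ** R
```

Note that the estimate is always a power of two, so it is coarse; averaging the estimates of several independent hash functions is the standard refinement.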
Using a pointer to the array
When I first started writing the Julia version of the algorithm, I found that there was no significant performance gain. After some investigation, I realized that it was because of array slicing: whenever a new array slice is created, the memory is copied and a new object is allocated, which can be extremely time consuming. (Julia's `view`/`@views` can also avoid the copy.) Therefore, we can use a raw pointer to the array to save some time. To get a pointer to the array's elements, we can write:
array = zeros(UInt64, 100)
# pointer() returns the address of the array's data buffer;
# pointer_from_objref() would point at the array object itself, not its data
cPtr = convert(Ptr{UInt8}, pointer(array))
# read the first byte through the pointer (indices are 1-based);
# the array must stay rooted (e.g. via GC.@preserve) while the pointer is in use
b = unsafe_load(cPtr, 1)
To access the data, use `unsafe_load`.
Set the number of threads
The number of threads used by Julia is decided at startup. To change the value:

- In Juno: open the Juno settings; there is a field to set this parameter.
- In Jupyter Notebook: in a console, run `jupyter kernelspec list`, and the list of kernel configurations will be shown. Open the `kernel.json` in Julia's folder, add `"JULIA_NUM_THREADS": "6"` to the `env` key, and then restart the Jupyter server.
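For reference, the edited `kernel.json` might look roughly like this. The `display_name` and `argv` shown are placeholders, not from the post; keep whatever your existing file contains and only add the `env` entry:

```json
{
  "display_name": "Julia",
  "argv": ["..."],
  "env": {
    "JULIA_NUM_THREADS": "6"
  }
}
```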
Code
Generating the testing dataset
The following code is used to generate the dataset.
import numpy as np
numEntries = 10000000
entrySize = 32
# high is exclusive, so use 256 to cover the full byte range 0-255
array = np.random.randint(0, 256, size=numEntries * entrySize, dtype=np.uint8)
with open('random_data.bin', 'wb') as outfile:
outfile.write(array.tobytes())
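Since each record is a fixed-width 32-byte chunk, the file can be read back and split into records. Here is a small sketch; the reduced size and the filename `random_data_small.bin` are illustrative, but the layout matches the generator above:

```python
import numpy as np

entrySize = 32
numEntries = 1000  # a smaller run for illustration; the post uses 10,000,000

# Generate the file the same way as above, then read it back.
array = np.random.randint(0, 256, size=numEntries * entrySize, dtype=np.uint8)
with open('random_data_small.bin', 'wb') as outfile:
    outfile.write(array.tobytes())

# Each stream element is one fixed-width 32-byte record.
data = np.fromfile('random_data_small.bin', dtype=np.uint8)
records = data.reshape(-1, entrySize)
print(records.shape)  # (1000, 32)
```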
Julia Notebook
The test shows that the dataset of 10 million records can be processed in around 30 seconds with 6 threads. Compared to my C++ implementation, I would say this Julia version is roughly 6-8 times slower. But Julia is much faster to write than C++, even for a Julia beginner like me. I am still digging into Julia's conventions to make my code faster.