Since Julia is really fast, I was wondering what the fastest way to join data frames is. In R, for example, we can use the data.table package, which is pretty fast. When working with big datasets, the computation time can get quite high. Here is a benchmark I created using innerjoin and leftjoin:
julia> using StatsBase, DataFrames, BenchmarkTools
julia> n = 1000000
julia> df1 = DataFrame(x = 1:n,
y1 = rand(n))
julia> df2 = DataFrame(x = 1:n,
y2 = rand(n))
julia> @benchmark innerjoin(df1, df2, on = :x)
BenchmarkTools.Trial: 102 samples with 1 evaluation.
Range (min … max): 41.437 ms … 73.495 ms ┊ GC (min … max): 0.00% … 29.87%
Time (median): 45.926 ms ┊ GC (median): 0.00%
Time (mean ± σ): 49.160 ms ± 8.227 ms ┊ GC (mean ± σ): 7.26% ± 11.50%
▄▅ █ ▂ ▂
██▆██▅████▅▃▃▅█▆▁▅▁▁▁▁▃▃▁▁▁▁▆▁▃▅▁▁▁▃▁▅▅▅▅▃▁▁▃▁▃▁▃▁▁▁▃▃▃▁▁▁▃ ▃
41.4 ms Histogram: frequency by time 71.7 ms <
Memory estimate: 38.16 MiB, allocs estimate: 174.
julia> @btime innerjoin(df1, df2, on = :x)
41.592 ms (174 allocations: 38.16 MiB)
julia> @benchmark leftjoin(df1, df2, on = :x)
BenchmarkTools.Trial: 96 samples with 1 evaluation.
Range (min … max): 43.823 ms … 79.582 ms ┊ GC (min … max): 0.00% … 34.30%
Time (median): 48.566 ms ┊ GC (median): 0.00%
Time (mean ± σ): 52.387 ms ± 9.026 ms ┊ GC (mean ± σ): 6.74% ± 10.90%
█▂▁▅▄▅
██████▅▃▆▆▃▃▅▅▆▆▃▁█▃▃▁▁▃▁▃▃▁▁▁▁▃▃▁▅▃▁▁█▃▃▃▃▅▃▁▁▅▁▁▁▁▁▁▃▁▁▁▃ ▁
43.8 ms Histogram: frequency by time 76.9 ms <
Memory estimate: 39.23 MiB, allocs estimate: 230.
julia> @btime leftjoin(df1, df2, on = :x)
44.198 ms (230 allocations: 39.23 MiB)
Here we can see that innerjoin is in this case slightly faster. So, I was wondering if there are faster ways of joining data frames in Julia?
If you know that the values in the :x column have the same sequence in both data frames (which is the case here), then you can use hcat. I get a much better result with hcat (~24,500x faster 👀 on average without copying the data, and ~7x faster with copying). Note that you should interpolate df1 and df2 with $ when benchmarking:
julia> @benchmark hcat($df1, $df2[!, 2:end], copycols=false)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.550 μs … 1.705 ms ┊ GC (min … max): 0.00% … 99.45%
Time (median): 2.440 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.670 μs ± 17.047 μs ┊ GC (mean ± σ): 6.35% ± 0.99%
▂█▇▆▁ ▁▂▂▃▄▁
█████▅▄▃▄▆████████▇▆▅▅▄▃▃▃▃▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
1.55 μs Histogram: frequency by time 5.72 μs <
Memory estimate: 2.66 KiB, allocs estimate: 34.
# with copying
julia> @benchmark hcat($df1, $df2[!, 2:end])
BenchmarkTools.Trial: 633 samples with 1 evaluation.
Range (min … max): 4.387 ms … 50.850 ms ┊ GC (min … max): 0.00% … 86.24%
Time (median): 6.380 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.876 ms ± 6.799 ms ┊ GC (mean ± σ): 17.48% ± 16.80%
▄▅██▅▃▂
███████▆▄▅▅▄▄▆▅▁▁▁▁▅▄▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▄▄▄▄▅▆▄▅▅▄▅ ▇
4.39 ms Histogram: log(frequency) by time 42.1 ms <
Memory estimate: 22.89 MiB, allocs estimate: 64.
julia> hcat(df1, df2[!, 2:end]) == hcat(df1, df2[!, 2:end], copycols=false) == innerjoin(df1, df2, on = :x) == leftjoin(df1, df2, on = :x)
true
# Element-wise comparison (However, the above expression is enough)
julia> all(Matrix(hcat(df1, df2[!, 2:end]) .== hcat(df1, df2[!, 2:end], copycols=false) .== innerjoin(df1, df2, on = :x) .== leftjoin(df1, df2, on = :x)))
true
To summarize:

                     Time               Memory
hcat                 ~24,500x faster    ~19,000x fewer
hcat (with copy)     ~7x faster         ~1.7x fewer
*Note that the comparisons are against your best result, which is innerjoin(df1, df2, on = :x).
Additional Note
Also, note that @benchmark reports comprehensive results, which include what @btime reports, so you don't necessarily need to run @btime to get it!
Related
I'm trying to understand the performance differences I am seeing by using various numba implementations of an algorithm. In particular, I would expect func1d below to be the fastest implementation since it is the only algorithm that is not copying data; however, from my timings func1b appears to be the fastest.
import numpy
import numba

def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))

@numba.njit(fastmath=True)
def func1b(data, a, b, c):
    new_data = a * (1 + numpy.tanh((data / b) - c))
    return new_data

@numba.njit(fastmath=True)
def func1c(data, a, b, c):
    new_data = numpy.empty(data.shape)
    for i in range(new_data.shape[0]):
        for j in range(new_data.shape[1]):
            new_data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return new_data

@numba.njit(fastmath=True)
def func1d(data, a, b, c):
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return data
Helper functions for testing memory copying
def get_data_base(arr):
    """For a given NumPy array, find the base array
    that owns the actual data.
    https://ipython-books.github.io/45-understanding-the-internals-of-numpy-to-avoid-unnecessary-array-copying/
    """
    base = arr
    while isinstance(base.base, numpy.ndarray):
        base = base.base
    return base

def arrays_share_data(x, y):
    return get_data_base(x) is get_data_base(y)

def test_share(func):
    data = numpy.random.randn(100, 3)
    print(arrays_share_data(data, func(data, 0.5, 2.5, 2.5)))
Timings
# force compiling
data = numpy.random.randn(10_000, 300)
_ = func1a(data, 0.5, 2.5, 2.5)
_ = func1b(data, 0.5, 2.5, 2.5)
_ = func1c(data, 0.5, 2.5, 2.5)
_ = func1d(data, 0.5, 2.5, 2.5)
data = numpy.random.randn(10_000, 300)
%timeit func1a(data, 0.5, 2.5, 2.5)
%timeit func1b(data, 0.5, 2.5, 2.5)
%timeit func1c(data, 0.5, 2.5, 2.5)
%timeit func1d(data, 0.5, 2.5, 2.5)
67.2 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
13 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
69.8 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.8 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Test which implementations copy memory
test_share(func1a)
test_share(func1b)
test_share(func1c)
test_share(func1d)
False
False
False
True
Here, copying of data doesn't play a big role: the bottleneck is how fast the tanh function is evaluated. There are many algorithms for it: some are faster, some are slower, some are more precise, some less so.
Different numpy distributions use different implementations of the tanh function, e.g. it could be the one from mkl/vml or the one from the gnu-math-library.
Depending on the numba version, either the mkl/svml implementation or the gnu-math-library one is used.
The easiest way to look inside is to use a profiler, for example perf.
For the numpy-version on my machine I get:
>>> perf record python run.py
>>> perf report
Overhead Command Shared Object Symbol
46,73% python libm-2.23.so [.] __expm1
24,24% python libm-2.23.so [.] __tanh
4,89% python _multiarray_umath.cpython-37m-x86_64-linux-gnu.so [.] sse2_binary_scalar2_divide_DOUBLE
3,59% python [unknown] [k] 0xffffffff8140290c
As one can see, numpy uses the slow gnu-math-library (libm) functionality.
For the numba-function I get:
53,98% python libsvml.so [.] __svml_tanh4_e9
3,60% python [unknown] [k] 0xffffffff81831c57
2,79% python python3.7 [.] _PyEval_EvalFrameDefault
which means that fast mkl/svml functionality is used.
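As a side check that doesn't need perf, numba also lets you inspect the assembly it generated, so you can confirm from Python whether an SVML tanh call was emitted. A minimal sketch of mine (whether an __svml_tanh* symbol actually shows up depends on your numba/SVML install):
import numba
import numpy as np

@numba.njit(fastmath=True)
def nb_tanh(x):
    return np.tanh(x)

nb_tanh(np.ones(8))                                # trigger compilation
asm = next(iter(nb_tanh.inspect_asm().values()))   # assembly for the compiled signature
print("svml" in asm.lower())                       # True if an SVML tanh symbol was emitted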
That is (almost) all there is to it.
As @user2640045 has rightly pointed out, the numpy performance is also hurt by additional cache misses due to the creation of temporary arrays.
However, cache misses don't play as big a role as the calculation of tanh:
%timeit func1a(data, 0.5, 2.5, 2.5) # 91.5 ms ± 2.88 ms per loop
%timeit numpy.tanh(data) # 76.1 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
i.e. creation of temporary objects is responsible for around 20% of the running time.
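If you want to reduce those temporaries while staying in plain numpy, one option is the out= arguments of the ufuncs together with in-place operators. This is a sketch of mine, not part of the original answer; it reuses a single scratch buffer instead of allocating a new array per step:
import numpy

def func1a_inplace(data, a, b, c, out=None):
    # at most one allocation (when out is None); every later step reuses the buffer
    out = numpy.divide(data, b, out=out)
    numpy.subtract(out, c, out=out)
    numpy.tanh(out, out=out)
    out += 1.0
    out *= a
    return out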
FWIW, also for the version with the handwritten loops, my numba version (0.50.1) is able to vectorize and call the mkl/svml functionality. If that doesn't happen for some other version, numba falls back to the gnu-math-library functionality, which seems to be what is happening on your machine.
Listing of run.py:
import numpy

# TODO: define func1b for checking numba
def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))

data = numpy.random.randn(10_000, 300)
for _ in range(100):
    func1a(data, 0.5, 2.5, 2.5)
The performance difference is NOT in the evaluation of the tanh-function
I must disagree with @ead. Let's assume for the moment that
the main performance difference is in the evaluation of the tanh-function
Then one would expect that running just tanh from numpy and numba with fast math would show that speed difference.
import numpy as np
import numba as nb

def func_a(data):
    return np.tanh(data)

@nb.njit(fastmath=True)
def func_b(data):
    new_data = np.tanh(data)
    return new_data
data = np.random.randn(10_000, 300)
%timeit func_a(data)
%timeit func_b(data)
Yet on my machine the above code shows almost no difference in performance.
15.7 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.8 ms ± 82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Short detour on NumExpr
I tried a NumExpr version of your code. But before being amazed that it runs almost 7 times faster, you should keep in mind that it uses all 10 cores available on my machine. After allowing numba to run in parallel too and optimising it a little, the performance benefit is small but still there: 2.56 ms vs 3.87 ms. See the code below.
import numpy as np
import numba as nb
import numexpr as ne

# parameter values taken from the question's calls
a, b, c = 0.5, 2.5, 2.5

@nb.njit(fastmath=True)
def func_a(data):
    new_data = a * (1 + np.tanh((data / b) - c))
    return new_data

@nb.njit(fastmath=True, parallel=True)
def func_b(data):
    new_data = a * (1 + np.tanh((data / b) - c))
    return new_data

@nb.njit(fastmath=True, parallel=True)
def func_c(data):
    for i in nb.prange(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + np.tanh((data[i, j] / b) - c))
    return data

def func_d(data):
    return ne.evaluate('a * (1 + tanh((data / b) - c))')
data = np.random.randn(10_000, 300)
%timeit func_a(data)
%timeit func_b(data)
%timeit func_c(data)
%timeit func_d(data)
17.4 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.31 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.87 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.56 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The actual explanation
The ~34% of time that NumExpr saves compared to numba is nice, but even nicer is that they have a concise explanation of why they are faster than numpy. I am pretty sure that this applies to numba too.
From the NumExpr github page:
The main reason why NumExpr achieves better performance than NumPy is
that it avoids allocating memory for intermediate results. This
results in better cache utilization and reduces memory access in
general.
So
a * (1 + numpy.tanh((data / b) - c))
is slower because it does a lot of steps producing intermediate results.
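To make that concrete, here is the same expression unrolled into the intermediate arrays it materializes (the variable names t1..t4 are mine, just for illustration):
import numpy

def func1a_unrolled(data, a, b, c):
    t1 = data / b           # intermediate array 1
    t2 = t1 - c             # intermediate array 2
    t3 = numpy.tanh(t2)     # intermediate array 3
    t4 = 1 + t3             # intermediate array 4
    return a * t4           # final result, a fifth full-size array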
I'm trying to understand the complexity of numpy array indexing here.
Given a 1-d numpy array A and b = numpy.argsort(A), what's the difference in time complexity between np.sort(A) and A[b]?
For np.sort(A) it would be O(n log n), while A[b] should be O(n)?
Under the hood, argsort does a sort, which again gives complexity O(n log n).
You can actually specify the algorithm, as described here.
To conclude, while A[b] alone is linear, you cannot use this to beat the general complexity of sorting, since you still have to determine b (by sorting).
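For reference, a small sketch of the kind keyword that both np.sort and np.argsort accept for choosing the algorithm, plus a check that indexing with the argsort reproduces the sorted array:
import numpy as np

A = np.random.random(1000)
b = np.argsort(A, kind='stable')           # also 'quicksort' (default), 'mergesort', 'heapsort'
assert np.array_equal(A[b], np.sort(A))    # A[b] is the sorted array, but computing b already cost O(n log n)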
Do a simple timing:
In [233]: x = np.random.random(100000)
In [234]: timeit np.sort(x)
6.79 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [235]: timeit x[np.argsort(x)]
8.42 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [236]: %%timeit b = np.argsort(x)
...: x[b]
...:
235 µs ± 694 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [237]: timeit np.argsort(x)
8.08 ms ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Timing only one size doesn't give O complexity, but it reveals the relative significance of the different steps.
If you don't need the argsort, then sort directly. If you already have b use it rather than sorting again.
Here is a visual comparison to see it better:
import numpy as np

# sort
def m1(A, b):
    return np.sort(A)

# compute argsort and then index
def m2(A, b):
    return A[np.argsort(A)]

# index with precomputed argsort
def m3(A, b):
    return A[b]

A = [np.random.rand(n) for n in [10, 100, 1000, 10000]]
Runtime on a log-log scale:
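The figure itself isn't reproduced here; a minimal sketch of how such a log-log runtime comparison can be generated with timeit and matplotlib (my choice of tooling, not necessarily what produced the original plot; it reuses m1, m2, m3 from above):
import timeit
import numpy as np
import matplotlib.pyplot as plt

sizes = [10, 100, 1000, 10000]
methods = {'np.sort(A)': m1, 'A[np.argsort(A)]': m2, 'A[b], b precomputed': m3}

for label, m in methods.items():
    times = []
    for n in sizes:
        arr = np.random.rand(n)
        idx = np.argsort(arr)
        times.append(timeit.timeit(lambda: m(arr, idx), number=100))
    plt.loglog(sizes, times, marker='o', label=label)

plt.xlabel('array size n')
plt.ylabel('time for 100 calls [s]')
plt.legend()
plt.show()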
I am trying to simply implement a loss function (MSE) in Python using numpy and this is my code:
import numpy as np

def loss(X, y, w):
    N = X.shape[0]
    X_new = np.concatenate((np.ones((N, 1)), X), axis=1)
    E = y - np.matmul(X_new, w)
    E_t = np.transpose(E)
    loss_value = (1 / N) * np.matmul(E_t, E)
    return loss_value
The dimension of E is (15000, 1) and E_t is obviously (1, 15000). However, when debugging, I realized that np.matmul(E_t, E) takes too much time. I have a laptop with 16 GB of RAM and a Core i7, so it's weird to me that np.matmul struggles here. Is this normal for matrices of these dimensions?
On a rather basic 4GB machine:
In [477]: E=np.ones((15000, 1))
In [478]: E.T#E
Out[478]: array([[15000.]])
In [479]: timeit E.T#E
10.5 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You don't tell us anything about X, but assuming the worst case, an object dtype:
In [480]: E=np.ones((15000, 1),object)
In [481]: E.T#E
Out[481]: array([[15000]], dtype=object)
In [482]: timeit E.T#E
577 µs ± 492 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
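So the thing to check in the question's code is the dtype of the inputs. A short sketch (the object-dtype case is constructed here just for illustration):
import numpy as np

E = np.ones((15000, 1), dtype=object)   # e.g. the result of mixing Python objects into y or w
print(E.dtype)                          # object -> matmul falls back to slow per-element Python calls
E = E.astype(np.float64)                # force a numeric dtype
print(E.dtype)                          # float64 -> fast BLAS path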
I am trying to do random sampling in the most efficient way in Python; however, I am puzzled because numpy's np.random.choice() was slower than the standard library's random.choices().
import numpy as np
import random
np.random.seed(12345)
# use gamma distribution
shape, scale = 2.0, 2.0
s = np.random.gamma(shape, scale, 1000000)
meansample = []
samplesize = 500
%timeit meansample = [ np.mean( np.random.choice( s, samplesize, replace=False)) for _ in range(500)]
23.3 s ± 229 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit meansample = [np.mean(random.choices(s, k=samplesize)) for x in range(0,500)]
152 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23 seconds vs 152 ms is a lot of time.
What am I doing wrong?
Two issues here. First, for the pure-Python random library, you probably mean to use sample instead of choices to sample without replacement; that alters the benchmark somewhat. Second, np.random.choice has better-performing alternatives for sampling without replacement; this is a known issue with the legacy random generator API. You can use np.random.Generator to get better performance. My timings:
%timeit meansample = [ np.mean( np.random.choice( s, samplesize, replace=False)) for _ in range(500)]
# 1 loop, best of 3: 12.4 s per loop
%timeit meansample = [np.mean(random.choices(s, k=samplesize)) for x in range(0,500)]
# 10 loops, best of 3: 118 ms per loop
sl = s.tolist()
%timeit meansample = [np.mean(random.sample(sl, k=samplesize)) for x in range(0,500)]
# 1 loop, best of 3: 219 ms per loop
g = np.random.Generator(np.random.PCG64())
%timeit meansample = [ np.mean( g.choice( s, samplesize, replace=False)) for _ in range(500)]
# 10 loops, best of 3: 25 ms per loop
So, without replacement, random.sample outperforms np.random.choice but is slower than np.random.Generator.choice.
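For completeness, np.random.default_rng() is the documented convenience constructor for such a Generator, so the fast variant can also be written like this (s and samplesize as defined in the question):
import numpy as np

rng = np.random.default_rng(12345)
meansample = [np.mean(rng.choice(s, samplesize, replace=False)) for _ in range(500)]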
I'm profiling the timing of one of my functions and I see that I spend a lot of time on pandas DataFrame creation - I'm talking about 2.5 seconds to construct a DataFrame with 1000 columns and 10k rows:
import numpy as np
from pandas import DataFrame

def test(size):
    samples = []
    for r in range(10000):
        a, b = np.random.randint(100, size=2)
        data = np.random.beta(a, b, size=size)
        samples.append(data)
    return DataFrame(samples, dtype=np.float64)
Running %prun -l 4 test(1000) returns:
Is there any way I can avoid this check? It really doesn't seem necessary. I tried to find out about this method and ways to bypass it here, but didn't find anything online.
pandas must introspect each row because you are passing it a list of arrays. Here are some more efficient methods in this case.
In [27]: size=1000
In [28]: samples = []
...: for r in range(10000):
...: data = np.random.beta(1,1 ,size = size)
...: samples.append(data)
...:
In [29]: np.asarray(samples).shape
Out[29]: (10000, 1000)
# original
In [30]: %timeit DataFrame(samples)
2.29 s ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numpy is less flexible on the conversion, but in this case
# it is fine
In [31]: %timeit DataFrame(np.asarray(samples))
30.9 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# you should probably just do this
In [32]: samples = np.random.beta(1,1, size=(10000, 1000))
In [33]: %timeit DataFrame(samples)
74.4 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
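If the per-row random (a, b) parameters from the original test() still matter, the parameters of np.random.beta broadcast against size, so the Python loop can still be avoided. A sketch of mine (the lower bound of randint is set to 1 here because beta rejects a zero shape parameter):
import numpy as np
from pandas import DataFrame

a = np.random.randint(1, 100, size=(10000, 1))
b = np.random.randint(1, 100, size=(10000, 1))
samples = np.random.beta(a, b, size=(10000, 1000))   # one row per (a, b) pair
df = DataFrame(samples, dtype=np.float64)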