Cannot make sense of timing of numba-compiled functions - numpy

I'm running some simulations, using numba to compile my Python code to speed them up. I wrote a function that overwrites one of its input arrays, and therefore I would like to pass in a copy of that array instead. However, this makes the code much slower, and the slowdown is far larger than the time it takes to make the copy.
Here are the timing results:
> population_ = population.copy()
> %timeit _ = run_simulation(population_, Tmax, dt, Nskip = Nskip)
64.6 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> %timeit _ = run_simulation(population.copy(), Tmax, dt, Nskip = Nskip)
87.4 ms ± 778 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> %timeit _ = population.copy()
442 ns ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So calling run_simulation directly with the result of .copy() as an argument is about 23 milliseconds slower, despite the fact that making the copy only takes about 0.0004 milliseconds. I don't understand why this is the case.
For background, here is the full code:
import numpy as np
from numba import jit, int32, int64, float64

@jit('int32[:,:,:](int32[:,:,:], float64)', nopython=True)
def one_step(population, dt):
    # Hard-coding model parameters here
    beta = 0.55
    tau = 10
    # This probability doesn't depend on the other states
    pIR = 1 - np.exp(-dt/tau)
    # Double for loop over towns and groups within each town
    for i in range(population.shape[0]):
        I = np.sum(population[i,1,:])
        N = np.sum(population[i,:,:])
        # Transition probability from susceptible to infected
        pSI = 1 - np.exp(-dt*beta*I/N)
        for j in range(population.shape[1]):
            # Unpack variables for convenience
            S, I, R = population[i,j,:]
            S2I = np.random.binomial(S, pSI)
            I2R = np.random.binomial(I, pIR)
            # Calculate new values
            S = S - S2I
            I = I + S2I - I2R
            R = R + I2R
            population[i,j,:] = (S, I, R)
    return population

@jit('int32[:,:,:](int32[:,:,:], float64, float64, int64)', nopython=True)
def run_simulation(population, Tmax, dt, Nskip = 10):
    Nt = int(Tmax/dt)
    history = np.zeros((population.shape[0], 3, int((Tmax/dt)/Nskip) + 1), dtype = np.int32)
    history[:,:,0] = np.sum(population, axis = 1)
    t = 0
    for i in range(1, Nt+1):
        population = one_step(population, dt)
        t += dt
        if i % Nskip == 0:
            history[:,:,int(i/Nskip)] = np.sum(population, axis = 1)
    return history

# Initial state
population = np.random.randint(low = 0, high = 1000, size = (10,10,3), dtype = np.int32)
# Run simulation for 100 days
Tmax = 100
dt = 0.01
# Only store once per day
Nskip = int(1/dt)
# Call one timestep to compile numba-decorated functions
# prior to measuring timing
_ = run_simulation(population, 1.0, 1.0, Nskip = 1)
# Run timing
population_ = population.copy()
%timeit _ = run_simulation(population_, Tmax, dt, Nskip = Nskip)
# Run timing
%timeit _ = run_simulation(population.copy(), Tmax, dt, Nskip = Nskip)
# Run timing
%timeit _ = population.copy()

What you are referring to isn't really related to numba.
Consider the following simple example:
import numpy as np

def run_simulation_2(population):
    return population.sum(axis=0)

# Initial state
population = np.random.randint(low = 0, high = 1000, size = (10,10,3), dtype = np.int32)
# Run timing
population_ = population.copy()
%timeit _ = run_simulation_2(population_)
# Run timing
%timeit _ = run_simulation_2(population.copy())
# Run timing
%timeit _ = population.copy()
Timing results are:
3.45 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.34 µs ± 91.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
680 ns ± 23.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So there is an overhead of about 25% even without numba, which is about the same overhead you saw yourself.
Therefore I think it is not related to numba, but to different "behind the scenes" work that happens when you pass a variable as an argument versus passing the result of a function call.
Unfortunately I can't offer you a good explanation of why it happens, but I hope that the fact that it's not related to numba is explanation enough for your needs.
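To reproduce the same three-way comparison outside IPython, the timeit module can be used directly. This is a minimal sketch; it assumes run_simulation_2 and population from the example above are already defined:

import timeit

n = 100_000
# copy bound to a name once, outside the timed statement
t_named = timeit.timeit('run_simulation_2(population_)',
                        setup='population_ = population.copy()',
                        globals=globals(), number=n) / n
# copy made inline inside the timed statement
t_inline = timeit.timeit('run_simulation_2(population.copy())',
                         globals=globals(), number=n) / n
# the copy alone
t_copy = timeit.timeit('population.copy()', globals=globals(), number=n) / n
print(f'named arg: {t_named*1e6:.2f} µs, inline copy: {t_inline*1e6:.2f} µs, '
      f'copy alone: {t_copy*1e6:.2f} µs')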

Related

how to optimize these pandas apply functions?

train["gender"] = train.apply(lambda x: 1 if x["gender"] == "F" else 0, axis=1)
train["car"] = train.apply(lambda x: 1 if x["car"] == "Y" else 0, axis=1)
train["reality"] = train.apply(lambda x: 1 if x["reality"] == "Y" else 0, axis=1)
These three lines take a lot of time even though each change is simple. I guess accessing each row three times creates the inefficiency, so if I could access each row once and update all three values, it might run two to three times faster. Something like:
# This is imaginary code to illustrate the intent; it does not work
train[["gender", "car", "reality"]] = train.apply(lambda x: 1 if x["gender"] == "F" else 0, axis=1,
                                                  lambda y: 1 if y["car"] == "Y" else 0, axis=1,
                                                  lambda z: 1 if z["reality"] == "Y" else 0, axis=1)
How can I optimize this code?
You can try 3x np.where() which is generally the fastest option:
train['gender'] = np.where(train.gender == 'F', 1, 0)
train['car'] = np.where(train.car == 'Y', 1, 0)
train['reality'] = np.where(train.reality == 'Y', 1, 0)
Or 2x np.where() which is slightly slower:
train['gender'] = np.where(train.gender == 'F', 1, 0)
train[['car', 'reality']] = np.where(train[['car', 'reality']] == 'Y', 1, 0)
Timings with 10 million rows:

method           %timeit
3x np.where()    152 ms ± 8.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2x np.where()    198 ms ± 39.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3x apply()       8.91 s ± 495 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
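The answer doesn't show the benchmark data; a sketch of a setup consistent with the 10-million-row timings might look like the following (the column values and random generator are assumptions, not from the original answer):

import numpy as np
import pandas as pd

# Hypothetical 10-million-row frame with the three categorical columns
rng = np.random.default_rng(0)
n = 10_000_000
train = pd.DataFrame({
    'gender': rng.choice(['F', 'M'], size=n),
    'car': rng.choice(['Y', 'N'], size=n),
    'reality': rng.choice(['Y', 'N'], size=n),
})

%timeit train['gender'] = np.where(train.gender == 'F', 1, 0)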

Pandas - what is the most efficient way to filter DataFrame using pandas.Series.all

Consider the code below -
import pandas as pd

data = []
val = 0
for ind_1 in range(1000):
    for ind_2 in range(1000):
        data.append({'ind_1': ind_1, 'ind_2': ind_2,
                     'val': val})
        val += 1
df_mi = pd.DataFrame(data).set_index(['ind_1', 'ind_2'])
which creates the DataFrame df_mi with MultiIndex-
In [90]: df_mi
Out[90]:
                val
ind_1 ind_2
0     0           0
      1           1
      2           2
      3           3
      4           4
...             ...
999   995    999995
      996    999996
      997    999997
      998    999998
      999    999999

[1000000 rows x 1 columns]
Now I want to filter the rows by applying some condition on all values for each ind_1 -
In [116]: bool_filter_ind_1 = (df_mi['val'] < 999997).all(level='ind_1')
In [117]: bool_filter_ind_1
Out[117]:
ind_1
0 True
1 True
2 True
3 True
4 True
...
995 True
996 True
997 True
998 True
999 False
Name: val, Length: 1000, dtype: bool
In [118]: ind_1_filtered = bool_filter_ind_1.index[bool_filter_ind_1]
In [119]: ind_1_filtered
Out[119]:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
989, 990, 991, 992, 993, 994, 995, 996, 997, 998],
dtype='int64', name='ind_1', length=999)
The result is correct but df_mi.loc[ind_1_filtered] is relatively slow -
In [120]: timeit df_mi_filtered = df_mi.loc[ind_1_filtered]
4.73 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [121]: df_mi_filtered
Out[121]:
                val
ind_1 ind_2
0     0           0
      1           1
      2           2
      3           3
      4           4
...             ...
998   995    998995
      996    998996
      997    998997
      998    998998
      999    998999

[999000 rows x 1 columns]
Is there a faster way to perform the same filtering?
You can use:
First idea: invert the mask to df_mi['val'] >= 999997, collect the ind_1 values of the rows that meet it, and drop those groups by testing the first index level with Index.isin and filtering with boolean indexing:
def new(df_mi):
    lvl0 = df_mi.index.get_level_values(0)
    return df_mi[~lvl0.isin(lvl0[(df_mi['val'] >= 999997)].unique())]
In [240]: %timeit (new(df_mi))
51.5 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another idea is to use GroupBy.transform with GroupBy.all to build the mask, again filtering with boolean indexing:
In [241]: %timeit df_mi[(df_mi['val'] < 999997).groupby(level='ind_1').transform('all')]
97.3 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Original solution:
def orig(df_mi):
    bool_filter_ind_1 = (df_mi['val'] < 999997).all(level='ind_1')
    ind_1_filtered = bool_filter_ind_1.index[bool_filter_ind_1]
    return df_mi.loc[ind_1_filtered]
In [242]: %timeit orig(df_mi)
11.2 s ± 405 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
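As a quick sanity check (my addition, not part of the original answer), one can verify that the three approaches return identical frames; this assumes new, orig, and df_mi as defined above:

import pandas as pd

# All three approaches should produce the same result
transform_mask = (df_mi['val'] < 999997).groupby(level='ind_1').transform('all')
pd.testing.assert_frame_equal(new(df_mi), orig(df_mi))
pd.testing.assert_frame_equal(df_mi[transform_mask], orig(df_mi))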

get count of entries less or equal in Series

I want to get the count of all elements less than or equal to each entry in a pandas.Series, e.g.:
if __name__ == '__main__':
    import pandas as pd

    a = pd.Series(data=[4, 7, 3, 5, 2, 1, 1, 6])
    le = pd.Series(data=[a[a <= i].count() for i in a])
    print(le)
Result:
0    5
1    8
2    4
3    6
4    3
5    2
6    2
7    7
dtype: int64
Is there a function in Series or a better way to do this for large data sets?
Faster is a numpy solution: convert the Series to a numpy array, compare it against itself with broadcasting to a 2D boolean array, and count the True values in each row with sum:
b = a.values
#pandas 0.24+
#b = a.to_numpy()
le = pd.Series((b <= b[:, None]).sum(axis=1), index=a.index)
Details:
print (b <= b[:, None])
[[ True False True False True True True False]
[ True True True True True True True True]
[False False True False True True True False]
[ True False True True True True True False]
[False False False False True True True False]
[False False False False False True True False]
[False False False False False True True False]
[ True False True True True True True True]]
Alternative (slower) solutions:
le = pd.Series([a.le(i).sum() for i in a])
le = a.apply(lambda i: a.le(i).sum())
print(le)
0    5
1    8
2    4
3    6
4    3
5    2
6    2
7    7
dtype: int64
Performance:
np.random.seed(2019)
N = 10**6
s = pd.Series(np.random.randint(100, size=N))
#print (s)
In [173]: %%timeit
...: b = a.values
...: le = pd.Series((b <= b[:, None]).sum(axis=1), index=a.index)
...:
78.6 µs ± 510 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [174]: %%timeit
...: le = pd.Series([a.le(i).sum() for i in a])
...:
3.22 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [175]: %%timeit
...: le = a.apply(lambda i: a.le(i).sum())
...:
3.35 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [176]: %%timeit
...: a.apply(lambda x: a[a.le(x)].count())
...:
...:
5.41 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [177]: %%timeit
...: le = pd.Series(data=[a[a <= i].count() for i in a])
...:
4.91 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
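One caveat worth adding (not from the original answer): b <= b[:, None] materialises an n-by-n boolean matrix, so memory grows quadratically with the Series length. A sketch of a memory-friendlier alternative built on a sorted copy and np.searchsorted:

import numpy as np
import pandas as pd

a = pd.Series([4, 7, 3, 5, 2, 1, 1, 6])
sorted_vals = np.sort(a.values)
# side='right' returns, for each value, the number of elements <= it
le = pd.Series(np.searchsorted(sorted_vals, a.values, side='right'), index=a.index)
print(le.tolist())  # [5, 8, 4, 6, 3, 2, 2, 7]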
You could use apply and a lambda function:
In [4]: a.apply(lambda x: a[a.le(x)].count())
Out[4]:
0    5
1    8
2    4
3    6
4    3
5    2
6    2
7    7
dtype: int64
Since the problem targets large datasets, here are some timings:
%timeit [(a.values <= x).sum() for x in a]
10000 loops, best of 3: 28.6 µs per loop
%timeit le = pd.Series(data=[a[a <= i].count() for i in a])
100 loops, best of 3: 2.74 ms per loop
%timeit a.apply(lambda x: a[a.le(x)].count())
100 loops, best of 3: 3.09 ms per loop
which shows that apply is slow, and that the OP's approach is not the best either.

How to test for list equality in a column where cells are lists

I want to test whether cells that contain lists are equal to [0] while Id == 4, and set a new column to 1 where this holds. Input and expected output are below.
I made several attempts but only managed it with apply and a lambda, and this does not scale well for 50k+ rows. Is there a faster way I'm missing?
Input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4],
                   'Var1': [[0, 1], [0], [6, 7], [0]],
                   })

Id  Var1
1   [0, 1]
2   [0]
3   [6, 7]
4   [0]
What I've tried:
df['ERR'] = 0
df.loc[(df['Id']==4) & (df['Var1']==[0]) , 'ERR'] = 1 # doesn't work
df.loc[(df['Id']==4) & (df['Var1'].isin([0])) , 'ERR'] = 1 # doesn't work
df['ERR'] = df.apply(lambda x: 1 if x['Id']==4 and x['Var1']==[0] else 0 , axis = 1)
Expected output:
Id  Var1    ERR
1   [0, 1]  0
2   [0]     0
3   [6, 7]  0
4   [0]     1
You can compare by tuple or set:
df['ERR1'] = ((df['Id']==4) & (df['Var1'].apply(tuple)==(0, ))).astype(int)
df['ERR2'] = ((df['Id']==4) & ([tuple(x) ==(0, ) for x in df['Var1']])).astype(int)
df['ERR3'] = ((df['Id']==4) & (df['Var1'].apply(set)==set([0]))).astype(int)
df['ERR4'] = ((df['Id']==4) & ([set(x) == set([0]) for x in df['Var1']])).astype(int)
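One subtlety worth noting (my observation, not part of the original answer): the tuple-based and set-based checks are not strictly equivalent, because sets collapse duplicates:

# A cell like [0, 0] matches the set-based tests but not the tuple-based ones
print(set([0, 0]) == set([0]))   # True  -> ERR3/ERR4 would also flag [0, 0]
print(tuple([0, 0]) == (0,))     # False -> ERR1/ERR2 would not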
Performance (depends on the input data):
df = pd.DataFrame({'Id': [1, 2, 3, 4],
                   'Var1': [[0, 1], [0], [6, 7], [0]],
                   })
df = pd.concat([df] * 10000, ignore_index=True)
In [188]: %timeit df['ERR1'] = ((df['Id']==4) & (df['Var1'].apply(tuple)==(0, ))).astype(int)
13.1 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [189]: %timeit df['ERR2'] = ((df['Id']==4) & ([tuple(x) ==(0, ) for x in df['Var1']])).astype(int)
8.98 ms ± 266 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [190]: %timeit df['ERR3'] = ((df['Id']==4) & (df['Var1'].apply(set)==set([0]))).astype(int)
17 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [191]: %timeit df['ERR4'] = ((df['Id']==4) & ([set(x) == set([0]) for x in df['Var1']])).astype(int)
19.4 ms ± 93.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

weird numba behavior when assigning to an array

I have a function I'm jitting with @jit(nopython=True).
Inside it has a loop that does a bunch of stuff, calculates a correlation and then assigns that to a preallocated output array. Both the target array and the correlation have the same type (np.float32), but for some reason the assignment makes the function take 100X as long.
To make things even stranger, if I instead assign a meaningless float np.float32(i*1.01) in place of my correlation value, the function runs at an appropriate speed.
Given that everything is the same type, they should both run at the same speed no?
corrs = np.zeros(a.shape[0], dtype=np.float32)
for i in range(lb, a.shape[0]):
    # a bunch of calculations happens here
    correl = np.float32(covar/(a_std*b_std))
    testval = np.float32(i*1.01)
    # doing this makes the function take FOREVER
    #corrs[i] = correl
    # but doing this runs very quickly, even though it is also a np.float32
    #corrs[i] = testval
Here is a runnable example. I added an argument called "assign": if True, it assigns the value I actually want, and if False, it assigns my useless test value.
import numpy as np
from numba import jit

@jit(nopython=True)
def hist_corr_loop(a, b, lb=1000, assign=True):
    flb = np.float32(lb)
    a_mu, b_mu = a[0], b[0]
    for i in range(1, lb):
        a_mu += a[i]
        b_mu += b[i]
    a_mu = a_mu/flb
    b_mu = b_mu/flb
    a_var, b_var = np.float32(0.0), np.float32(0.0)
    for i in range(lb):
        a_var += np.square(a[i] - a_mu)
        b_var += np.square(b[i] - b_mu)
    a_var = a_var/flb
    b_var = b_var/flb
    corrs = np.zeros(a.shape[0], dtype=np.float32)
    for i in range(lb, a.shape[0]):
        # calculate new means and stdevs
        _a_mu = a_mu
        _b_mu = b_mu
        a_mu = _a_mu + (a[i] - a[i-lb])/flb
        b_mu = _b_mu + (b[i] - b[i-lb])/flb
        a_var += (a[i] - a[i-lb])*(a[i] - a_mu + a[i-lb] - _a_mu)/flb
        b_var += (b[i] - b[i-lb])*(b[i] - b_mu + b[i-lb] - _b_mu)/flb
        a_std = np.sqrt(a_var)  # **0.5
        b_std = np.sqrt(b_var)  # **0.5
        covar = np.float32(0.0)
        for j in range(i-lb+1, i+1):
            covar += (a[j] - a_mu)*(b[j] - b_mu)
        covar = covar/flb
        correl = np.float32(covar/(a_std*b_std))
        testval = np.float32(i*1.01)
        if assign:
            corrs[i] = correl
        else:
            corrs[i] = testval
    return corrs
To run:
n = 10000000
a = np.random.random(n)
b = np.random.random(n)
%timeit hist_corr_loop(a,b,1000, True)
%timeit hist_corr_loop(a,b, 1000, False)
I get
%timeit hist_corr_loop(a,b,1000, True)
10.5 s ± 52.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit hist_corr_loop(a,b, 1000, False)
220 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10 seconds vs 220 ms.
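A plausible explanation, offered here as a hedged guess rather than a confirmed answer from this thread: when assign=False, correl is computed but never used, so LLVM's dead-code elimination can remove the entire inner covar loop, and the fast timing may not be measuring the covariance work at all. A minimal sketch of the same effect:

import numpy as np
from numba import jit

@jit(nopython=True)
def loop_sum(n, use_result=True):
    total = 0.0
    for i in range(n):
        total += np.sqrt(np.float64(i))
    if use_result:
        return total  # result is live: the loop must actually run
    return 0.0        # result is dead: LLVM may delete the loop entirely

# If dead-code elimination kicks in, the second call is dramatically faster
# even though the Python source does "the same work":
# %timeit loop_sum(100_000_000, True)
# %timeit loop_sum(100_000_000, False)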