get count of entries less or equal in Series - pandas

I want to get the count of all elements less than or equal to each entry in a pandas.Series, e.g.:
if __name__ == '__main__':
    import pandas as pd
    a = pd.Series(data=[4,7,3,5,2,1,1,6])
    le = pd.Series(data=[a[a <= i].count() for i in a])
    print(le)
Result:
0    5
1    8
2    4
3    6
4    3
5    2
6    2
7    7
dtype: int64
Is there a function in Series or a better way to do this for large data sets?

A faster option is a NumPy solution: convert the Series to a NumPy array, compare by broadcasting to a 2D array, and finally count the True values with sum:
b = a.values
#pandas 0.24+
#b = a.to_numpy()
le = pd.Series((b <= b[:, None]).sum(axis=1), index=a.index)
Details:
print (b <= b[:, None])
[[ True False True False True True True False]
[ True True True True True True True True]
[False False True False True True True False]
[ True False True True True True True False]
[False False False False True True True False]
[False False False False False True True False]
[False False False False False True True False]
[ True False True True True True True True]]
Slower pandas alternatives (timed below):
le = pd.Series([a.le(i).sum() for i in a])
le = a.apply(lambda i: a.le(i).sum())
print(le)
0    5
1    8
2    4
3    6
4    3
5    2
6    2
7    7
dtype: int64
Performance:
np.random.seed(2019)
N = 10**6
s = pd.Series(np.random.randint(100, size=N))
#print (s)
(Note: the %timeit runs below still operate on the small Series a from the question, not on s.)
In [173]: %%timeit
...: b = a.values
...: le = pd.Series((b <= b[:, None]).sum(axis=1), index=a.index)
...:
78.6 µs ± 510 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [174]: %%timeit
...: le = pd.Series([a.le(i).sum() for i in a])
...:
3.22 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [175]: %%timeit
...: le = a.apply(lambda i: a.le(i).sum())
...:
3.35 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [176]: %%timeit
...: a.apply(lambda x: a[a.le(x)].count())
...:
...:
5.41 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [177]: %%timeit
...: le = pd.Series(data=[a[a <= i].count() for i in a])
...:
4.91 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
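Beyond the approaches timed above, pandas also has a built-in that produces exactly this count: Series.rank with method='max' assigns each value the highest rank within its tie group, which equals the number of elements less than or equal to it. A sketch (not part of the benchmarks above):
import pandas as pd

a = pd.Series(data=[4, 7, 3, 5, 2, 1, 1, 6])

# rank(method='max') gives, for each value, the count of elements <= that value
le = a.rank(method='max').astype('int64')
print(le)
# 0    5
# 1    8
# 2    4
# 3    6
# 4    3
# 5    2
# 6    2
# 7    7
# dtype: int64
This is O(n log n) and uses O(n) memory, so it should also scale to large Series, although it is not included in the timings above.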

You could use apply and a lambda function:
In [4]: a.apply(lambda x: a[a.le(x)].count())
Out[4]:
0    5
1    8
2    4
3    6
4    3
5    2
6    2
7    7
dtype: int64

Since the problem is meant to be applied to large datasets, here are some timings:
%timeit [(a.values <= x).sum() for x in a]
10000 loops, best of 3: 28.6 µs per loop
%timeit le = pd.Series(data=[a[a <= i].count() for i in a])
100 loops, best of 3: 2.74 ms per loop
%timeit a.apply(lambda x: a[a.le(x)].count())
100 loops, best of 3: 3.09 ms per loop
which shows that apply is slow, and that the OP's approach is not the fastest either.
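If the datasets get large enough that the O(n²) broadcast from the accepted approach becomes a memory problem, here is a sketch of an O(n log n) alternative using np.searchsorted (my own suggestion, not part of the answers above):
import numpy as np
import pandas as pd

a = pd.Series(data=[4, 7, 3, 5, 2, 1, 1, 6])

# sort once, then for each value ask how many sorted elements are <= it;
# side='right' makes the insertion point land after equal values
b = np.sort(a.to_numpy())
le = pd.Series(np.searchsorted(b, a.to_numpy(), side='right'), index=a.index)
print(le)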

Related

Python: Finding individuals with conditions from two lists using Python (pandas)

I have the following dataframe:
df =
id medication
1 A
1 B
1 A
2 Z
2 A
2 A
3 B
3 D
3 A
I create two lists of medications:
ListA = ['A', 'Z']
ListB = ['B', 'C']
I want to obtain those individuals that have any of the medications from listA and any of the medications from listB so that the answer is:
dfOutput =
id medication
1 A
1 B
3 B
3 A
So far I am trying the following:
dfOutput = df.groupby("id").filter(lambda x : pd.Series([*ListA,*ListB]).isin(x['medication']).all())
One option uses groupby.agg and a merge:
(df.groupby('id')['medication']
.agg(result = lambda x: x.isin(ListA).any() and x.isin(ListB).any())
.reset_index().merge(df)
.loc[lambda x: x['medication'].isin(ListA+ListB)*x.result].drop_duplicates())
id result medication
0 1 True A
1 1 True B
6 3 True B
8 3 True A
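If you want to stay close to the groupby.filter attempt from the question, here is a sketch of a corrected version: each list is checked with any() instead of requiring every medication to be present. This variant is an illustration only and, as the next answer's title suggests, groupby.filter tends to be slower on large data:
# keep ids that have at least one medication from each list,
# then restrict to the medications of interest and drop duplicates
dfOutput = (df.groupby('id')
              .filter(lambda x: x['medication'].isin(ListA).any()
                                and x['medication'].isin(ListB).any()))
dfOutput = dfOutput[dfOutput['medication'].isin(ListA + ListB)].drop_duplicates()
print(dfOutput)
#    id medication
# 0   1          A
# 1   1          B
# 6   3          B
# 8   3          A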
A solution without groupby.filter, to improve performance:
ListA = ['A', 'Z']
ListB = ['B', 'C']
# test membership in each list
m1 = df['medication'].isin(ListA)
m2 = df['medication'].isin(ListB)
# get the intersection of ids matched by both conditions
both = np.intersect1d(df.loc[m1, 'id'], df.loc[m2, 'id'])
# keep those ids, restrict to rows matching m1 or m2, then remove duplicates
df = df[df['id'].isin(both) & (m1 | m2)].drop_duplicates()
print (df)
id medication
0 1 A
1 1 B
6 3 B
8 3 A
Performance:
N = 10000
df = pd.DataFrame({'id':np.random.randint(N // 30, size=N),
                   'medication':np.random.choice(list('ABCDEFGHIJKL'), size=N)}).sort_values('id', ignore_index=True)
print (df)
ListA = ['A', 'Z']
ListB = ['B', 'C']
In [131]: %%timeit
...: m1 = df['medication'].isin(ListA)
...: m2 = df['medication'].isin(ListB)
...: # get the intersection of ids matched by both conditions
...: both = np.intersect1d(df.loc[m1, 'id'],df.loc[m2, 'id'])
...:
...: # keep those ids, restrict to rows matching m1 or m2, then remove duplicates
...: df1 = df[df['id'].isin(both) & (m1 | m2)].drop_duplicates()
...:
3.53 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [132]: %%timeit
...: (df.groupby('id')['medication']
...: .agg(result = lambda x: x.isin(ListA).any() and x.isin(ListB).any())
...: .reset_index().merge(df)
...: .loc[lambda x: x['medication'].isin(ListA+ListB)*x.result].drop_duplicates())
...:
121 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
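For reuse, the mask-and-intersect approach above can be wrapped in a small helper (the function name filter_ids_with_both is just for illustration, not from the answer):
import numpy as np
import pandas as pd

def filter_ids_with_both(df, list_a, list_b):
    # masks for rows whose medication is in each list
    m1 = df['medication'].isin(list_a)
    m2 = df['medication'].isin(list_b)
    # ids that appear under both masks
    both = np.intersect1d(df.loc[m1, 'id'], df.loc[m2, 'id'])
    # keep only those ids, restricted to the medications of interest
    return df[df['id'].isin(both) & (m1 | m2)].drop_duplicates()

dfOutput = filter_ids_with_both(df, ['A', 'Z'], ['B', 'C'])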

Cannot make sense of timing of numba-compiled functions

I'm running some simulations, where I use numba to compile my python code to speed up the simulations. I wrote a function that will overwrite one of the input arrays, and therefore I would like to pass in a copy of that array instead. However, this makes the code much slower, and far slower than the time it takes to make the copy.
Here are the timing results:
> population_ = population.copy()
> %timeit _ = run_simulation(population_, Tmax, dt, Nskip = Nskip)
64.6 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> %timeit _ = run_simulation(population.copy(), Tmax, dt, Nskip = Nskip)
87.4 ms ± 778 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> %timeit _ = population.copy()
442 ns ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So calling run_simulation directly with the result of .copy() as an argument is about 23 milliseconds slower, despite the fact that making the copy only takes about 0.0004 milliseconds. I don't understand why this is the case.
For background, here is the full code:
import numpy as np
from numba import jit, int32, int64, float64
@jit('int32[:,:,:](int32[:,:,:], float64)', nopython=True)
def one_step(population, dt):
    # Hard-coding model parameters here
    beta = 0.55
    tau = 10
    # This probability doesn't depend on the other states
    pIR = 1 - np.exp(-dt/tau)
    # Double for loop over the first two axes of population
    for i in range(population.shape[0]):
        I = np.sum(population[i,1,:])
        N = np.sum(population[i,:,:])
        # Transition probability from susceptible to infected
        pSI = 1 - np.exp(-dt*beta*I/N)
        for j in range(population.shape[1]):
            # Unpack variables for convenience
            S, I, R = population[i,j,:]
            S2I = np.random.binomial(S, pSI)
            I2R = np.random.binomial(I, pIR)
            # Calculate new values
            S = S - S2I
            I = I + S2I - I2R
            R = R + I2R
            population[i,j,:] = (S, I, R)
    return population

@jit('int32[:,:,:](int32[:,:,:], float64, float64, int64)', nopython=True)
def run_simulation(population, Tmax, dt, Nskip = 10):
    Nt = int(Tmax/dt)
    history = np.zeros((population.shape[0], 3, int((Tmax/dt)/Nskip) + 1), dtype = np.int32)
    history[:,:,0] = np.sum(population, axis = 1)
    t = 0
    for i in range(1, Nt+1):
        population = one_step(population, dt)
        t += dt
        if i % Nskip == 0:
            history[:,:,int(i/Nskip)] = np.sum(population, axis = 1)
    return history
# Initial state
population = np.random.randint(low = 0, high = 1000, size = (10,10,3), dtype = np.int32)
# Run simulation for 100 days
Tmax = 100
dt = 0.01
# Only store once per day
Nskip = int(1/dt)
# Call one timestep to compile numba-decorated functions
# prior to measuring timing
_ = run_simulation(population, 1.0, 1.0, Nskip = 1)
# Run timing
population_ = population.copy()
%timeit _ = run_simulation(population_, Tmax, dt, Nskip = Nskip)
# Run timing
%timeit _ = run_simulation(population.copy(), Tmax, dt, Nskip = Nskip)
# Run timing
%timeit _ = population.copy()
What you are referring to isn't really related to numba.
Consider the following simple example:
import numpy as np
def run_simulation_2(population):
    return population.sum(axis=0)
# Initial state
population = np.random.randint(low = 0, high = 1000, size = (10,10,3), dtype = np.int32)
# Run timing
population_ = population.copy()
%timeit _ = run_simulation_2(population_)
# Run timing
%timeit _ = run_simulation_2(population.copy())
# Run timing
%timeit _ = population.copy()
Timing results are:
3.45 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.34 µs ± 91.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
680 ns ± 23.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So there is an overhead of about 25% even without numba, which is roughly the same relative overhead you saw yourself.
Therefore I think it is not related to numba, but to different "behind the scenes" work that happens when you pass a variable as an argument versus passing the result of a function call directly.
Unfortunately I can't offer a good explanation of why it happens, but I hope the fact that it's not related to numba is a good enough explanation for your needs.
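For reference, working out the relative overheads from the numbers quoted above (my own arithmetic, just to make the comparison explicit):
# overhead of passing .copy() directly as an argument, relative to a pre-made copy
numba_case  = (87.4 - 64.6) / 64.6   # ~0.35, i.e. about 35% slower
plain_numpy = (4.34 - 3.45) / 3.45   # ~0.26, i.e. about 26% slower
So the relative overhead is of the same order in both cases, which supports the point that numba itself is not the culprit, even though the absolute cost is much larger in the simulation.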

how to optimize these pandas apply functions?

train["gender"] = train.apply(lambda x: 1 if x["gender"] == "F" else 0, axis=1)
train["car"] = train.apply(lambda x: 1 if x["car"] == "Y" else 0, axis=1)
train["reality"] = train.apply(lambda x: 1 if x["reality"] == "Y" else 0, axis=1)
These 3 lines take a lot of time even though it is a simple change.
I guess accessing each row 3 times is what makes it inefficient.
So, if I could access each row once and have the apply function change all 3 values, it could be 2~3 times faster than now, like:
# this is my imaginary code; it does not work
train[["gender","car", "reality"]] = train.apply(lambda x: 1 if x["gender"] == "F" else 0, axis=1,
lambda y: 1 if y["car"] == "Y" else 0, axis=1,
lambda z: 1 if z["reality"] == "Y" else 0, axis=1)
How can optimize these codes?
You can try 3x np.where() which is generally the fastest option:
train['gender'] = np.where(train.gender == 'F', 1, 0)
train['car'] = np.where(train.car == 'Y', 1, 0)
train['reality'] = np.where(train.reality == 'Y', 1, 0)
Or 2x np.where() which is slightly slower:
train['gender'] = np.where(train.gender == 'F', 1, 0)
train[['car', 'reality']] = np.where(train[['car', 'reality']] == 'Y', 1, 0)
Timings with 10 million rows:
method           %timeit
3x np.where()    152 ms ± 8.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2x np.where()    198 ms ± 39.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3x apply()       8.91 s ± 495 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
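Not benchmarked in the answer above, but two more options for comparison: a boolean comparison cast to int (vectorized, in the same spirit as np.where), and the "one apply touching three columns" idea from the question written so it actually runs (it works, but row-wise apply stays slow on large frames). Both sketches operate on copies so they don't interfere with each other:
import pandas as pd

# Option A: boolean comparison + astype(int)
out = train.copy()
out['gender'] = (out['gender'] == 'F').astype(int)
out['car'] = (out['car'] == 'Y').astype(int)
out['reality'] = (out['reality'] == 'Y').astype(int)

# Option B: a single apply that returns all three recoded columns at once
def recode(row):
    return pd.Series({'gender': int(row['gender'] == 'F'),
                      'car': int(row['car'] == 'Y'),
                      'reality': int(row['reality'] == 'Y')})

out2 = train.copy()
out2[['gender', 'car', 'reality']] = out2.apply(recode, axis=1)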

Pandas - what is the most efficient way to filter DataFrame using pandas.Series.all

Consider the code below -
import pandas as pd

data = []
val = 0
for ind_1 in range(1000):
    for ind_2 in range(1000):
        data.append({'ind_1': ind_1, 'ind_2': ind_2,
                     'val': val})
        val += 1
df_mi = pd.DataFrame(data).set_index(['ind_1', 'ind_2'])
which creates the DataFrame df_mi with MultiIndex-
In [90]: df_mi
Out[90]:
val
ind_1 ind_2
0 0 0
1 1
2 2
3 3
4 4
... ...
999 995 999995
996 999996
997 999997
998 999998
999 999999
[1000000 rows x 1 columns]
Now I want to filter the rows by applying some condition on all values for each ind_1 -
In [116]: bool_filter_ind_1 = (df_mi['val'] < 999997).all(level='ind_1')
In [117]: bool_filter_ind_1
Out[117]:
ind_1
0 True
1 True
2 True
3 True
4 True
...
995 True
996 True
997 True
998 True
999 False
Name: val, Length: 1000, dtype: bool
In [118]: ind_1_filtered = bool_filter_ind_1.index[bool_filter_ind_1]
In [119]: ind_1_filtered
Out[119]:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
989, 990, 991, 992, 993, 994, 995, 996, 997, 998],
dtype='int64', name='ind_1', length=999)
The result is correct but df_mi.loc[ind_1_filtered] is relatively slow -
In [120]: timeit df_mi_filtered = df_mi.loc[ind_1_filtered]
4.73 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [121]: df_mi_filtered
Out[121]:
val
ind_1 ind_2
0 0 0
1 1
2 2
3 3
4 4
... ...
998 995 998995
996 998996
997 998997
998 998998
999 998999
[999000 rows x 1 columns]
Is there a faster way to perform the same filtering?
You can use one of two ideas.
The first is to invert the mask to df_mi['val'] >= 999997, collect the ind_1 values that fail the condition, and then exclude them from the first index level with Index.isin and boolean indexing:
def new(df_mi):
    lvl0 = df_mi.index.get_level_values(0)
    return df_mi[~lvl0.isin(lvl0[(df_mi['val'] >= 999997)].unique())]
In [240]: %timeit (new(df_mi))
51.5 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another idea is to use GroupBy.transform with 'all' to build the mask and again filter by boolean indexing:
In [241]: %timeit df_mi[(df_mi['val'] < 999997).groupby(level='ind_1').transform('all')]
97.3 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Original solution:
def orig(df_mi):
    bool_filter_ind_1 = (df_mi['val'] < 999997).all(level='ind_1')
    ind_1_filtered = bool_filter_ind_1.index[bool_filter_ind_1]
    return df_mi.loc[ind_1_filtered]
In [242]: %timeit orig(df_mi)
11.2 s ± 405 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
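A quick sanity check (my own addition, assuming a pandas version where the Series.all(level=...) call from the question still works) that all three versions select the same rows:
res_orig = orig(df_mi)
res_new = new(df_mi)
res_transform = df_mi[(df_mi['val'] < 999997).groupby(level='ind_1').transform('all')]

# all three filters should return exactly the same 999000 rows
assert res_new.equals(res_orig)
assert res_transform.equals(res_orig)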

How to test for list equality in a column where cells are lists

I want to test whether cells that contain lists are equal to [0] while Id==4, and set a new column to 1 when this happens. Input and expected output are below.
I have had several tries but only managed it with apply and lambda, and this does not scale well for 50k+ rows. Is there a faster way I'm missing?
Input:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Id': [1,2,3,4],
                   'Var1': [[0,1],[0],[6,7],[0]],
                   })
Id Var1
1 [0, 1]
2 [0]
3 [6, 7]
4 [0]
What I've tried:
df['ERR'] = 0
df.loc[(df['Id']==4) & (df['Var1']==[0]) , 'ERR'] = 1 # doesn't work
df.loc[(df['Id']==4) & (df['Var1'].isin([0])) , 'ERR'] = 1 # doesn't work
df['ERR'] = df.apply(lambda x: 1 if x['Id']==4 and x['Var1']==[0] else 0 , axis = 1)
Expected output:
Id Var1 ERR
1 [0, 1] 0
2 [0] 0
3 [6, 7] 0
4 [0] 1
You can compare by tuple or set:
df['ERR1'] = ((df['Id']==4) & (df['Var1'].apply(tuple)==(0, ))).astype(int)
df['ERR2'] = ((df['Id']==4) & ([tuple(x) ==(0, ) for x in df['Var1']])).astype(int)
df['ERR3'] = ((df['Id']==4) & (df['Var1'].apply(set)==set([0]))).astype(int)
df['ERR4'] = ((df['Id']==4) & ([set(x) == set([0]) for x in df['Var1']])).astype(int)
Performance (depends of input data):
df = pd.DataFrame({'Id': [1,2,3,4],
                   'Var1': [[0,1],[0],[6,7],[0]],
                   })
df = pd.concat([df] * 10000, ignore_index=True)
In [188]: %timeit df['ERR1'] = ((df['Id']==4) & (df['Var1'].apply(tuple)==(0, ))).astype(int)
13.1 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [189]: %timeit df['ERR2'] = ((df['Id']==4) & ([tuple(x) ==(0, ) for x in df['Var1']])).astype(int)
8.98 ms ± 266 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [190]: %timeit df['ERR3'] = ((df['Id']==4) & (df['Var1'].apply(set)==set([0]))).astype(int)
17 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [191]: %timeit df['ERR4'] = ((df['Id']==4) & ([set(x) == set([0]) for x in df['Var1']])).astype(int)
19.4 ms ± 93.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
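One more variant (my own, not benchmarked in the answer): a plain list comprehension can compare each cell against [0] directly, avoiding the tuple/set conversion. The direct df['Var1'] == [0] from the question fails because pandas treats the right-hand list as an array to align element-wise rather than as a single value to compare against each cell:
# list objects compare element-wise with ==, so x == [0] is True only for the cell [0]
df['ERR5'] = ((df['Id'] == 4) & ([x == [0] for x in df['Var1']])).astype(int)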