How do you speed up a score calculation based on two rows in a Pandas DataFrame?

TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row  # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3)  # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row: add(row, target_row), axis=1)
print(time.time()-start)
So I have a dataframe of size 30,000 and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often I would like to optimize it for performance.
Any help is very much appreciated.
I already read "Optimization when using Pandas"; are there further resources you can recommend? Thanks.

If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
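As a side note, here is a minimal sketch (my addition, not part of the answer above) that computes the same score with boolean masks and broadcasting, without mutating the array in place:
# compare values directly instead of adding the rows first
target = df.loc[target_row_index].to_numpy()
arr = df.to_numpy()
both_two = (arr == 2) & (target == 2)                                # +1 cases: both rows are 2
mixed = ((arr == 1) & (target == 2)) | ((arr == 2) & (target == 1))  # -1 cases: one row is 1, the other 2
result = (both_two.sum(axis=1) - mixed.sum(axis=1)).tolist()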

Related

Fastest way to compute only last row in pandas dataframe

I am trying to find the fastest way to compute the results only for the last row in a dataframe. For some reason, when I do so it is slower than computing the entire dataframe. What am I doing wrong here? What would be the correct way to access only the last two rows and compute their values?
Currently these are my results:
Processing time of add_complete(): 1.333 seconds
Processing time of add_last_row_only(): 1.502 seconds
import numpy as np
import pandas as pd
def add_complete(df):
    df['change_a'] = df['a'].diff()
    df['change_b'] = df['b'].diff()
    df['factor'] = df['change_a'] * df['change_b']

def add_last_row_only(df):
    df.at[df.index[-1], 'change_a_last_row'] = df['a'].iloc[-1] - df['a'].iloc[-2]
    df.at[df.index[-1], 'change_b_last_row'] = df['b'].iloc[-1] - df['b'].iloc[-2]
    df.at[df.index[-1], 'factor_last_row'] = df['change_a_last_row'].iloc[-1] * df['change_b_last_row'].iloc[-1]

def main():
    a = np.arange(200_000_000).reshape(100_000_000, 2)
    df = pd.DataFrame(a, columns=['a', 'b'])
    add_complete(df)
    add_last_row_only(df)
    print(df.tail())
Unless I am missing something, for this kind of operation I would use NumPy on the last two rows:
%%timeit
changes = np.diff(df.values[-2:, :], axis=0)
factor = np.prod(changes)
Just this operation takes 21 µs, yes, microseconds.
If I add the insertion it increases to 511 ms, even when filling all rows with the same value.
I suspect the problem comes from handling a roughly 1.5 GB dataframe, which roughly doubles in size when inserting the extra columns.
%%timeit
changes = np.diff(df.values[-2:, :], axis=0)
factor = np.prod(changes)
df['factor'] = factor
df['changes_a'] = changes[0][0]
df['changes_b'] = changes[0][1]
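If only the scalar results are needed, a minimal sketch (my assumption, not part of the answer above) skips the column insertion entirely and works on the last two rows alone:
# take just the last two rows as a plain NumPy array and keep the results as scalars
last_two = df.iloc[-2:][['a', 'b']].to_numpy()   # shape (2, 2): only the last two rows are copied
change_a, change_b = last_two[1] - last_two[0]   # per-column differences
factor = change_a * change_b
print(change_a, change_b, factor)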

A more efficient way of aggregating a random sample from pandas dataframe and iteratively append the mean of sampled df in an empty dataframe

I am trying to take a random sample from my df, take the mean of all columns as a single-row Series using df_sample.mean(axis=0), and then append this Series to an empty dataframe; I want 1 million such rows. I am getting the result, but it takes too much time to run. Can someone please suggest an efficient way to do this?
train = pd.DataFrame()
for i in range(1000000):
    df_sample = df_2.sample(n=100)
    row = df_sample.mean(axis=0)
    train = train.append(row, ignore_index=True)
Here's a faster way to do this; it would result in 1 million (10 lakh) rows:
Method 1: Sampling using pandas inbuilt
n_times = 1000000
values = [df_2.sample(n=1).mean(axis=0, numeric_only=True) for _ in range(n_times)]
train = pd.DataFrame(values, columns=['mean_col'])
Method 2: Sampling using numpy
def f1():
    return np.mean(df_2.values[np.random.randint(0, df_2.shape[0])])

def f2():
    return df_2.iloc[np.random.randint(0, df_2.shape[0])].mean(axis=0, numeric_only=True)

values = [f1() for _ in range(n_times)]
train = pd.DataFrame(values, columns=['mean_col'])
values = [f2() for _ in range(n_times)]
train = pd.DataFrame(values, columns=['mean_col'])
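For completeness, a hedged sketch of a fully vectorized variant (my assumptions, not part of the answer: df_2 is purely numeric and the question's sample size of 100 is kept). It draws all sample indices up front and averages them in chunks to bound memory, keeping one mean per original column per sample, as in the question's loop:
import numpy as np
import pandas as pd

n_times, sample_size, chunk = 1_000_000, 100, 10_000
arr = df_2.to_numpy()
means = []
for start in range(0, n_times, chunk):
    idx = np.random.randint(0, arr.shape[0], size=(chunk, sample_size))
    means.append(arr[idx].mean(axis=1))   # per-column means of each sampled block
train = pd.DataFrame(np.vstack(means), columns=df_2.columns)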

How to use Pandas vector methods based on rolling custom function that involves entire row and prior data

While it's easy to use the pandas rolling method to apply standard formulas, I find it hard when the calculation involves multiple columns and only a limited number of past rows. The following code illustrates:
import numpy as np
import pandas as pd
#create dummy pandas
df=pd.DataFrame({'col1':np.arange(0,25),'col2':np.arange(100,125),'col3':np.nan})
def func1(shortdf):
    # dummy formula:
    # add the last value of col1 to the sum of col2, then multiply by 3.14
    return (shortdf.col1.tail(1).values[0] + shortdf.col2.sum()) * 3.14

for idx, i in df.iterrows():
    if idx > 3:
        # only interested in the last 3 rows before the current position in the dataframe
        df.loc[idx, 'col3'] = func1(df.iloc[idx-3:idx])
I currently use this iterrows method, which needless to say is extremely slow. Does anyone have a better suggestion?
Option 1
So shift is the solution here. You do have to use rolling for the summation, and then shift that series after the addition and multiplication.
df = pd.DataFrame({'col1':np.arange(0,25),'col2':np.arange(100,125),'col3':np.nan})
ans = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)
You can check to see that ans is the same as df['col3'] by using ans.eq(df['col3']). Once you see that all but the first few are the same, just change ans to df['col3'] and you should be all set.
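A quick check sketch (assuming the loop from the question has already filled df['col3']):
import numpy as np
print(np.allclose(ans.iloc[4:], df['col3'].iloc[4:]))  # True: the values agree from row 4 onwards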
Option 2
Without additional information about the customized weight function, it is hard to help. However, this option may be a solution as it separates the rolling calculation at the cost of using more memory.
# df['col3'] = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)
s = df['col2']
stride = pd.DataFrame([s.shift(x).values[::-1][:3] for x in range(len(s))[::-1]])
res = pd.concat([df, stride], axis=1)
# here you can perform your custom weight function
res['final'] = ((res[0] + res[1] + res[2] + res['col1']) * 3.14).shift(1)
stride is adapted from this question, and the strided values are concatenated column-wise to the original dataframe. In this way each row has the values needed to compute whatever custom function you may need.
res['final'] is identical to option 1's ans.
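A related sketch (my addition, not part of either option): NumPy's sliding_window_view builds the same per-row windows without assembling the stride DataFrame by hand.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

windows = sliding_window_view(df['col2'].to_numpy(), 3)         # row i holds col2[i:i+3]
vals = (df['col1'].to_numpy()[2:] + windows.sum(axis=1)) * 3.14
df['col3_alt'] = np.nan
df.loc[df.index[3:], 'col3_alt'] = vals[:-1]                    # shifted by one row, as in Option 1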

get less correlated variable names

I have a dataset (50 columns, 100 rows).
I also have 50 variable names, 0, 1, 2, ..., 49, for the 50 columns.
I have to find the weakly correlated variables, say correlation < 0.7.
I tried as follows:
import os, glob, time, numpy as np, pandas as pd
data = np.random.randint(1,99,size=(100, 50))
dataframe = pd.DataFrame(data)
print (dataframe.shape)
codes = np.arange(50).astype(str)
dataframe.columns = codes
corr = dataframe.corr()
corr = corr.unstack().sort_values()
print (corr)
corr = corr.values
indices = np.where(corr < 0.7)
print (indices)
res = codes[indices[0]].tolist() + codes[indices[1]].tolist()
print (len(res))
res = list(set(res))
print (len(res))
The result is 50 (all variables!), which is unexpected.
How can I solve this problem?
As mentioned in the comments, your question is somewhat ambiguous. First, there is the possibility that no column pair is correlated. Second, the unstacking doesn't make sense, because you create an index array that you can't use directly on your 2D array. Third (which should be first, but I was blind to this): as @AmiTavory mentioned, there is no point in "correlating names".
The correlation procedure per se works, as you can see in the following example:
import numpy as np
import pandas as pd
A = np.arange(100).reshape(25, 4)
#random order in column 2, i.e. a low correlation to the first columns
np.random.shuffle(A[:,2])
#flip column 3 to create a negative correlation with the first columns
A[:,3] = np.flipud(A[:,3])
#column 1 is unchanged, therefore positively correlated to column 0
df = pd.DataFrame(A)
print(df)
#establish a correlation matrix
corr = df.corr()
#retrieve index of pairs below a certain value
#use only the upper triangle with np.triu to filter for symmetric solutions
#use np.abs to take also negative correlation into account
res = np.argwhere(np.triu(np.abs(corr.values) < 0.7))
print(res)
Output:
[[0 2]
[1 2]
[2 3]]
As expected, column 2 is the only one that is not correlated to any other, meaning that all other columns are correlated with each other.
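If you need the actual variable names rather than the index pairs, a small follow-up sketch (my assumption about the desired output):
pairs = [(df.columns[i], df.columns[j]) for i, j in res]
weakly_correlated = sorted({name for pair in pairs for name in pair})
print(pairs)
print(weakly_correlated)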

Filtering out outliers in Pandas dataframe with rolling median

I am trying to filter out some outliers from a scatter plot of GPS elevation displacements against dates.
I'm trying to use df.rolling to compute a median and standard deviation for each window and then remove a point if it is more than 3 standard deviations from the median.
However, I can't figure out a way to loop through the column and compare each value to the rolling median.
Here is the code I have so far:
import pandas as pd
import numpy as np
def median_filter(df, window):
    cnt = 0
    median = df['b'].rolling(window).median()
    std = df['b'].rolling(window).std()
    for row in df.b:
        # compare each value to its rolling median
        pass

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=['a', 'b'])
median_filter(df, 10)
How can I loop through and compare each point and remove it?
Just filter the dataframe
df['median']= df['b'].rolling(window).median()
df['std'] = df['b'].rolling(window).std()
#filter setup
df = df[(df.b <= df['median']+3*df['std']) & (df.b >= df['median']-3*df['std'])]
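Wrapped into the question's median_filter signature, a minimal sketch (assuming the column is named 'b' and a 3-sigma threshold):
def median_filter(df, window, num_std=3):
    med = df['b'].rolling(window).median()
    std = df['b'].rolling(window).std()
    mask = (df['b'] >= med - num_std * std) & (df['b'] <= med + num_std * std)
    return df[mask]

filtered = median_filter(df, 10)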
There might well be a more pandas-idiomatic way to do this; this is a bit of a hack, relying on a somewhat manual way of mapping the original df's index to each rolling window (I picked size 6). The records up to and including row 6 are associated with the first window; row 7 is the second window, and so on.
n = 100
df = pd.DataFrame(np.random.randint(0,n,size=(n,2)), columns = ['a','b'])
## set window size
window=6
std = 1 # I set it at just 1; with real data and larger windows, can be larger
## create df with rolling stats, upper and lower bounds
bounds = pd.DataFrame({'median': df['b'].rolling(window).median(),
                       'std': df['b'].rolling(window).std()})
bounds['upper']=bounds['median']+bounds['std']*std
bounds['lower']=bounds['median']-bounds['std']*std
## here, we set an identifier for each window which maps to the original df
## the first six rows are the first window; then each additional row is a new window
bounds['window_id']=np.append(np.zeros(window),np.arange(1,n-window+1))
## then we can assign the original 'b' value back to the bounds df
bounds['b']=df['b']
## and finally, keep only rows where b falls within the desired bounds
bounds.loc[bounds.eval("lower<b<upper")]
This is my take on creating a median filter:
def median_filter(num_std=3):
    def _median_filter(x):
        _median = np.median(x)
        _std = np.std(x)
        s = x[-1]
        return s if s >= _median - num_std * _std and s <= _median + num_std * _std else np.nan
    return _median_filter
df.y.rolling(window).apply(median_filter(num_std=3), raw=True)
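A possible usage sketch (my assumptions: the column is named y and flagged values should be dropped rather than kept as NaN):
filtered = df.y.rolling(window).apply(median_filter(num_std=3), raw=True)
# note: the first window-1 values are NaN by construction and are dropped as well
df_clean = df[filtered.notna()]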