How to use Pandas vector methods based on rolling custom function that involves entire row and prior data - pandas

While its easy to use pandas rolling method to apply standard formulas, but i find it hard if it involves multiple column with limited past rows. Using the following code to better elaborate: -
import numpy as np
import pandas as pd
#create dummy pandas
df=pd.DataFrame({'col1':np.arange(0,25),'col2':np.arange(100,125),'col3':np.nan})
def func1(shortdf):
#dummy formula
#use last row of col1 multiply by sum of col2
return (shortdf.col1.tail(1).values[0]+shortdf.col2.sum())*3.14
for idx, i in df.iterrows():
if idx>3:
#only interested in the last 3 rows from position of dataframe
df.loc[idx,'col3']=func1(df.iloc[idx-3:idx])
I currently use this iterrow method which needless to say is extremely slow. can anyone has better suggestion?

Option 1
So shift is the solution here. You do have to use rolling for the summation, and then shift that series after the addition and multiplication.
df = pd.DataFrame({'col1':np.arange(0,25),'col2':np.arange(100,125),'col3':np.nan})
ans = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)
You can check to see that ans is the same as df['col3'] by using ans.eq(df['col3']). Once you see that all but the first few are the same, just change ans to df['col3'] and you should be all set.
Option 2
Without additional information about the customized weight function, it is hard to help. However, this option may be a solution as it separates the rolling calculation at the cost of using more memory.
# df['col3'] = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)
s = df['col2']
stride = pd.DataFrame([s.shift(x).values[::-1][:3] for x in range(len(s))[::-1]])
res = pd.concat([df, stride], axis=1)
# here you can perform your custom weight function
res['final'] = ((res[0] + res[1] + res[2] + res['col1']) * 3.14).shift(1)
stride is adapted from this question and the calculation is concatenated row-wise to the original dataframe. In this way each column has the value needed to compute whatever it is you may need.
res['final'] is identical to option 1's ans

Related

How do you speed up a score calculation based on two rows in a Pandas Dataframe?

TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
"""
Logic of calculating the variables check and score:
if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
"""
check = row[1]+target_row # row[1] takes 30 microseconds per call
score = np.sum(check == 4) - np.sum(check == 3) # np.sum takes 47 microseconds per call
result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
check = a + b
return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row : add(row, target_row), axis = 1)
print(time.time()-start)
So I have a dataframe of size 30'000 and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often I would like to optimize it for performance.
Any help is very much appreciated.
I already read Optimization when using Pandas are there further resources you can recommend? Thanks
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.

Using IQR method to remove outliers does not change shape of data frame

I'm trying to remove outliers using IQR method. However, the shape of my df remains the same.
Here is the code:
def IQR_outliers(df):
Q1=df.quantile(0.25)
Q3=df.quantile(0.75)
IQR=Q3-Q1
df=df[~((df<(Q1-1.5*IQR)) | (df>(Q3+1.5*IQR)))]
return df
IQR_outliers(df['Distance'])
IQR_outliers(df['Price'])
Your function considers the whole object that is passed, but you're only passing a single series each time you use it. You're also not capturing the output. All of these things stack on top of each to make your problem pretty complex.
So here's what I would do:
add a column argument to your function
modifying the function to only consider that column when selecting rows from the entire dataframe
pipe the dataframe to that function a couple of times
So that's:
def IQR_outliers(df, column):
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
df = df.loc[lambda df: ~((df[column] < (Q1 - 1.5 * IQR)) | (df[column] > (Q3 + 1.5 * IQR)))]
return df
revised_df = df.pipe(IQR_outliers, 'Distance').pipe(IQR_outliers, 'Price')
Note that the way you've demonstrated this, you'll very likely drop rows where Distance is an outlier even if Price is not. If you don't want to do that, you'll need to stack your dataframe, apply this function to a groupby operation, and then optionally unstack the dataframe

What is the most efficient method for calculating per-row historical values in a large pandas dataframe?

Say I have two pandas dataframes (df_a & df_b), where each row represents a toy and features about that toy. Some pretend features:
Was_Sold (Y/N)
Color
Size_Group
Shape
Date_Made
Say df_a is relatively small (10s of thousands of rows) and df_b is relatively large (>1 million rows).
Then for every row in df_a, I want to:
Find all the toys from df_b with the same type as the one from df_a (e.g. the same color group)
The df_b toys must also be made before the given df_a toy
Then find the ratio of those sold (So count sold / count all matched)
What is the most efficient means to make those per-row calculations above?
The best I've came up with so far is something like the below.
(Note code might have an error or two as I'm rough typing from a different use case)
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
print(col + ' - Started')
# Empty list to build up the calculation in
ratio_list = []
# Start the iteration
for row in df_a.itertuples(index=False):
# Relevant values from df_a
relevant_val = getattr(row, col)
created_date = row.Date_Made
# df to keep the overall prior toy matches
prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
prior_count = len(prior_toys)
# Now find the ones that were sold
prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])
# Now make the calculation and append to the list
if prior_count == 0:
ratio = 0
else:
ratio = prior_sold_count / prior_count
ratio_list.append(ratio)
# Store the calculation in the original df_a
df_a[col + '_Prior_Sold_Ratio'] = ratio_list
print(col + ' - Finished')
Using .itertuples() is useful, but this is still pretty slow. Is there a more efficient method or something I'm missing?
EDIT
Added the below script which will emulated data for the above scenario:
import numpy as np
import pandas as pd
colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']
size_df_a = 200
size_df_b = 2000
date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')
def random_dates(start, end, n=10):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
df_a = pd.DataFrame(
{
'Color': np.random.choice(colors, size_df_a),
'Size_Group': np.random.choice(sizes, size_df_a),
'Shape': np.random.choice(shapes, size_df_a),
'Was_Sold': np.random.choice(sold, size_df_a),
'Date_Made': random_dates(date_start, date_end, n=size_df_a)
}
)
df_b = pd.DataFrame(
{
'Color': np.random.choice(colors, size_df_b),
'Size_Group': np.random.choice(sizes, size_df_b),
'Shape': np.random.choice(shapes, size_df_b),
'Was_Sold': np.random.choice(sold, size_df_b),
'Date_Made': random_dates(date_start, date_end, n=size_df_b)
}
)
First of all, I think your computation would be much more efficient using relational database and SQL query. Indeed, the filters can be done by indexing columns, performing a database join, some advance filtering and count the result. An optimized relational database can generate an efficient algorithm based on a simple SQL query (hash-based row grouping, binary search, fast intersection of sets, etc.). Pandas is sadly not very good to perform efficiently advanced requests like this. It is also very slow to iterate over pandas dataframe although I am not sure this can be alleviated in this case using only pandas. Hopefully you can use some Numpy and Python tricks and (partially) implement what fast relational database engines would do.
Additionally, pure-Python object types are slow, especially (unicode) strings. Thus, **converting column types to efficient ones in a first place can save a lot of time (and memory). For example, there is no need for the Was_Sold column to contains "Y"/"N" string objects: a boolean can just be used in that case. Thus let us convert that:
df_b.Was_Sold = df_b.Was_Sold == "Y"
Finally, the current algorithm has a bad complexity: O(Na * Nb) where Na is the number of rows in df_a and Nb is the number of rows in df_b. This is not easy to improve though due to the non-trivial conditions. A first solution is to group df_b by col column ahead-of-time so to avoid an expensive complete iteration of df_b (previously done with df_b[col] == relevant_val). Then, the date of the precomputed groups can be sorted so to perform a fast binary search later. Then you can use Numpy to count boolean values efficiently (using np.sum).
Note that doing prior_toys['Was_Sold'] is a bit faster than prior_toys.Was_Sold.
Here is the resulting code:
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
print(col + ' - Started')
# Empty list to build up the calculation in
ratio_list = []
# Split df_b by col and sort each (indexed) group by date
colGroups = {grId: grDf.sort_values('Date_Made') for grId, grDf in df_b.groupby(col)}
# Start the iteration
for row in df_a.itertuples(index=False):
# Relevant values from df_a
relevant_val = getattr(row, col)
created_date = row.Date_Made
# df to keep the overall prior toy matches
curColGroup = colGroups[relevant_val]
prior_count = np.searchsorted(curColGroup['Date_Made'], created_date)
prior_toys = curColGroup[:prior_count]
# Now find the ones that were sold
prior_sold_count = prior_toys['Was_Sold'].values.sum()
# Now make the calculation and append to the list
if prior_count == 0:
ratio = 0
else:
ratio = prior_sold_count / prior_count
ratio_list.append(ratio)
# Store the calculation in the original df_a
df_a[col + '_Prior_Sold_Ratio'] = ratio_list
print(col + ' - Finished')
This is 5.5 times faster on my machine.
The iteration of the pandas dataframe is a major source of slowdown. Indeed, prior_toys['Was_Sold'] takes half the computation time because of the huge overhead of pandas internal function calls repeated Na times... Using Numba may help to reduce the cost of the slow iteration. Note that the complexity can be increased by splitting colGroups in subgroups ahead of time (O(Na log Nb)). This should help to completely remove the overhead of prior_sold_count. The resulting program should be about 10 time faster than the original one.

Find dates and difference between extreme observations

he function passed to apply must take a dataframe as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back together into a single dataframe or series. apply is therefore a highly flexible grouping method.
While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods like agg or transform. Pandas offers a wide range of method that will be much faster than using apply for their specific purposes, so try to use them before reaching for apply.
easiest is an aggregation with groupby and then do a select
# make index a column
df = df.reset_index()
# get min of holdings for each ticker
lowest = df[['ticker','holdings']].groupby('ticker').min()
print(lowest)
# select lowest my performing a left join (solutions with original)
# this gives only the matching rows of df in return
lowest_dates = lowest.merge(df, on=['ticker','holdings'], how='left')
print(lowest_dates)
If you just want a series of Date you can use this function.
def getLowest(df):
df = df.reset_index()
lowest = df[['ticker','holdings']].groupby('ticker').min()
lowest_dates = lowest.merge(df, on=['ticker','holdings'], how='left')
return lowest_dates['Date']
From my point of view it would be better to return the entire dataframe, to know which ticker was lowest when. In this case you can :
return lowest_dates

The fastest way to get filtered data checking substring value within ndarray

I have a big array of data:
>>> len(b)
6636849
>>> print(b)
[['60D19E9E-4E2C-11E2-AA9A-52540027E502' '100015361']
['60D19EB6-4E2C-11E2-AA9A-52540027E502' '100015385']
['60D19ECE-4E2C-11E2-AA9A-52540027E502' '100015409']
...,
['8CC90633-447E-11E6-B010-005056A76B49' '106636785']
['F8C74244-447E-11E6-B010-005056A76B49' '106636809']
['F8C7425C-447E-11E6-B010-005056A76B49' '106636833']]
I need to get the filtered dataset, i.e, everything containing (or starting with) '106' in the string). Something like the following code with substring operation instead of math operation:
>>> len(b[b[:,1] > '10660600'])
30850
I don't think numpy is well suited for this type of operation. You can do it simply using basic python operations. Here it is with some sample data a:
import random # for the test data
a = []
for i in range(10000):
a.append(["".join(random.sample('abcdefg',3)), "".join(random.sample('01234567890',8))])
answer = [i for i in a if i[1].find('106') != -1]
Keep in mind that startswith is going to be a lot faster than find, because find has to look for matching substrings in all positions.
It's not too clear why you need do this with such a large list/array in the first place, and there might be a better solution when it comes to not including these values in the list in the first place.
Here's a simple pandas solution
import pandas as pd
df = pd.DataFrame(b, columns=['1st String', '2nd String'])
df_filtered = df[df['2nd String'].str.contains('106')]
This gives you
In [29]: df_filtered
Out[29]:
1st String 2nd String
3 8CC90633-447E-11E6-B010-005056A76B49 106636785
4 F8C74244-447E-11E6-B010-005056A76B49 106636809
5 F8C7425C-447E-11E6-B010-005056A76B49 106636833
Update: Timing Results
Using Benjamin's list a as the test sample:
In [20]: %timeit [i for i in a if i[1].find('106') != -1]
100 loops, best of 3: 2.2 ms per loop
In [21]: %timeit df[df['2nd String'].str.contains('106')]
100 loops, best of 3: 5.94 ms per loop
So it looks like Benjamin's answer is actually about 3x faster. This surprises me since I was under the impression that the operation in pandas is vectorized. Moreover, the speed ratio does not change when a is 100 times longer.
Look at the functions in the np.char submodule:
data = [['60D19E9E-4E2C-11E2-AA9A-52540027E502', '100015361'],
['60D19EB6-4E2C-11E2-AA9A-52540027E502', '100015385'],
['60D19ECE-4E2C-11E2-AA9A-52540027E502', '100015409'],
['8CC90633-447E-11E6-B010-005056A76B49', '106636785'],
['F8C74244-447E-11E6-B010-005056A76B49', '106636809'],
['F8C7425C-447E-11E6-B010-005056A76B49', '106636833']]
data = np.array([r[1] for r in data], np.str)
idx = np.char.startswith(data, '106')
print(idx)