Vectorizing apply(list) and explode in pandas dataframe

Vectorizing apply(list) and explode in pandas dataframe - pandas

I have a dataframe with dates in an integer format (timedelta in days from some arbitrary date) and using another column weeks I'd like to add 7 days to the start_date column for every week > 1 and explode that into another row.
So records with 1 week would remain the same, 2 weeks would get get one additional row, 3 weeks would get 2 additional rows, etc - each additinal row would have start_date incremented by 7.
It's fairly trivial using pd.apply with axis=1, but I can't seem to wrap my head around a vectorized method to solve this.
import pandas as pd
df = pd.DataFrame({'product':['a','b','c'], 'start_date':[1000,1000,1000], 'weeks':[1,2,3]})
Starting df
product start_date weeks
0 a 1000 1
1 b 1000 2
2 c 1000 3
Current approach
df['dates'] = df.apply(lambda x: [x['start_date']+i*7 for i in range(x['weeks'])], axis=1)
df = df.explode('dates').drop(columns=['start_date']).rename({'dates':'start_date'})
Output
product weeks dates
0 a 1 1000
1 b 2 1000
1 b 2 1007
2 c 3 1000
2 c 3 1007
2 c 3 1014

Use loc + index.repeat to scale up the DataFrame, then groupby cumcount to add the multiple, then drop the column:
# Scale up DataFrame
df = df.loc[df.index.repeat(df['weeks'])]
# Create Dates Column grouping by the index (level=0)
df['dates'] = df['start_date'].add(df.groupby(level=0).cumcount().mul(7))
# Drop Column
df = df.drop('start_date', axis=1)
df:
product weeks dates
0 a 1 1000
1 b 2 1000
1 b 2 1007
2 c 3 1000
2 c 3 1007
2 c 3 1014
Timing Information:
import pandas as pd
sample_df = pd.DataFrame({'product': ['a', 'b', 'c'],
'start_date': [1000, 1000, 1000],
'weeks': [1, 2, 3]})
OP's Original Code
def orig(df):
df['dates'] = df.apply(
lambda x: [x['start_date'] + i * 7 for i in range(x['weeks'])], axis=1)
df = df.explode('dates').drop(columns=['start_date']).rename(
{'dates': 'start_date'})
%timeit orig(sample_df)
3.53 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This answer:
def fn(df):
df = df.loc[df.index.repeat(df['weeks'])]
df['dates'] = df['start_date'].add(df.groupby(level=0).cumcount().mul(7))
df = df.drop('start_date', axis=1)
%timeit fn(sample_df)
1.63 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
OP's Answer
def fn2(df):
df['x'] = df['weeks'].apply(lambda x: range(x))
df = df.explode('x')
df['start_date'] = df['start_date'] + (df['x'] * 7)
df.drop(columns='x', inplace=True)
%timeit fn2(sample_df)
2.71 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Figured it out, same concept just a slightly different order.
df = pd.DataFrame({'product':['a','b','c'], 'start_date':[1000,1000,1000], 'weeks':[1,2,3]})
df['x'] = df['weeks'].apply(lambda x:range(x))
df = df.explode('x')
df['start_date'] = df['start_date']+(df['x']*7)
df.drop(columns='x', inplace=True)

Related

Number of months between two dates while one date is given

Input df
Date1
2019-01-23
2020-02-01
note: The type of Date1 is datetime64[ns].
Goal
I want to calculate month diff between Date1 column and '2019-01-01'.
Try and Ref
I try the answers from this post , but it failed as below:
df['Date1'].dt.to_period('M') - pd.to_datetime('2019-01-01').to_period('M')
note:pandas version: 1.1.5

Your solution should be changed by convert periods to integers and for second value is used one element list ['2019-01-01']:
df['new'] = (df['Date1'].dt.to_period('M').astype(int) -
pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
print (df)
Date1 new
0 2019-01-23 0
1 2020-02-01 13
If compare solutions:
rng = pd.date_range('1900-04-03', periods=3000, freq='MS')
df = pd.DataFrame({'Date1': rng})
In [106]: %%timeit
...: date_ref = pd.to_datetime('2019-01-01')
...: df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
...:
1.57 ms ± 8.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [107]: %%timeit
...: df['new'] = (df['Date1'].dt.to_period('M').astype(int) - pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
...:
1.32 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Apply are loops under the hood, so slowier:
In [109]: %%timeit
...: start = pd.to_datetime("2019-01-01")
...: df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
...:
25.7 s ± 729 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [110]: %%timeit
...: rd = df['Date1'].apply(lambda x:relativedelta(x,date(2019,1,1)))
...: mon = rd.apply(lambda x: ((x.years * 12) + x.months))
...: df['Diff'] = mon
...:
94.2 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I think this should work:
date_ref = pd.to_datetime('2019-01-01')
df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
month_delta = (date2.year - date1.year)*12 + (date2.month - date1.month)
output:
Date1 mo_since_2019_01
0 2019-01-23 0
1 2020-02-01 13

With this solution, you won't need further imports as it simply calculates the length of the pd.date_range() between your fixed start date and varying end date:
def relative_months(start, end, freq="M"):
if start < end:
x = len(pd.date_range(start=start,end=end,freq=freq))
else:
x = - len(pd.date_range(start=end,end=start,freq=freq))
return x
start = pd.to_datetime("2019-01-01")
df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
In your specific case, I think anon01's solution should be the quickest/ favorable; my variant however allows the use of generic frequency strings for date offsets like 'M', 'D', … and allows you to specifically handle the edge case of "negative" relative offsets (i.e. what happens if your comparison date is not earlier than all dates in Date1).

Try:
rd = df['Date1'].apply(lambda x:relativedelta(x,date(2019,1,1)))
mon = rd.apply(lambda x: ((x.years * 12) + x.months))
df['Diff'] = mon
Input:
Date1
0 2019-01-23
1 2020-02-01
2 2020-05-01
3 2020-06-01
Output:
Date1 Diff
0 2019-01-23 0
1 2020-02-01 13
2 2020-05-01 16
3 2020-06-01 17

pandas: finding date of specified dayofweek close to other date

I am looking to replace apply method with something faster for the following problem:
having day_of_week and closest_date columns I need to find the found_dates, that are of the specified day_of_week and are closest prior (backwards) to the closest_date, allowing equal to closest_date results.
initial df:
closest_date day_of_week
0 2009-06-01 6
1 2014-09-02 0
2 2014-10-11 4
3 2015-01-02 3
4 2015-07-11 4
I need to speed-up the following working code:
from pandas.tseries.offsets import Week
def find_nearset_day_to_dayofweek(row):
return row['closest_date'] - Week(weekday=row['day_of_week'])
df['date'] = df.apply(find_nearset_day_to_dayofweek, axis=1)
below is just to fix where found_date should be equal to closest_date, but returns a week before.
import numpy as np
df['closest_date_dayofweek'] = df['closest_date'].dt.dayofweek
df['found_date'] = np.where(df['closest_date_dayofweek']==df['day_of_week'],
df['closest_date'],
df['found_date'])
df = df.drop(['closest_date_dayofweek'], axis=1)
that returns the following df
closest_date day_of_week found_date
0 2009-06-01 6 2009-05-31
1 2014-09-02 0 2014-09-01
2 2014-10-11 4 2014-10-10
3 2015-01-02 3 2015-01-01
4 2015-07-11 4 2015-07-10
5 2015-08-08 4 2015-08-07
The problem with the code above is the apply method, which is slow. Any ideas of how to speed up?
Thank you!

Because possible only 7 values you can use loop with filter only matched rows by another column:
for i in range(7):
m = df['day_of_week'].eq(i)
df.loc[m, 'date'] = df.loc[m, 'closest_date'] - Week(weekday=i)
And then new column is not necessary, use:
df['date'] = np.where(df['closest_date'].dt.dayofweek==df['day_of_week'],
df['closest_date'], df['date'])
Performance for 5000 rows:
from pandas.tseries.offsets import Week
def find_nearset_day_to_dayofweek(row):
return row.closest_date - Week(weekday=row['day_of_week'])
df = pd.concat([df] * 1000, ignore_index=True)
In [137]: %timeit df['date'] = df.apply(find_nearset_day_to_dayofweek, axis=1)
550 ms ± 77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [138]: %%timeit
...: for i in range(7):
...: m = df['day_of_week'].eq(i)
...: df.loc[m, 'date1'] = df.loc[m, 'closest_date'] - Week(weekday=i)
...:
38.1 ms ± 883 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas dataframe apply to multiple column

I am trying to use apply function to my DataFrame.
The apply use a custom function that returns 2 values and that needs to populate the row of 2 columns on my DataFrame.
I put a simple example below:
df = DataFrame ({'a' : 10})
I wish to create two columns: b and c.
b equals 1 if a is above 0.
c equals 1 if a is above 0.
def compute_b_c(a):
if a > 0:
return 1, 1
else:
return 0,0
I tried this but it returns key error:
df[['b', 'c']] = df.a.apply(compute_b_c)

It is possible with DataFrame constructor,also 1,1 and 0,0 are like tuples (1,1) and (0,0):
df = pd.DataFrame ({'a' : [10, -1, 9]})
def compute_b_c(a):
if a > 0:
return (1,1)
else:
return (0,0)
df[['b', 'c']] = pd.DataFrame(df.a.apply(compute_b_c).tolist())
print (df)
a b c
0 10 1 1
1 -1 0 0
2 9 1 1
Performance:
#10k rows
df = pd.DataFrame ({'a' : [10, -1, 9] * 10000})
In [79]: %timeit df[['b', 'c']] = pd.DataFrame(df.a.apply(compute_b_c).tolist())
22.6 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [80]: %timeit df[['b', 'c']] = df.apply(lambda row: compute_b_c(row['a']), result_type='expand', axis=1)
5.25 s ± 84.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Use result_type parameter of pandas.DataFrame.apply. Applicable only if you use apply function on df(DataFrame) and not df.a(Series)
df[['b', 'c']] = df.apply(lambda row: compute_b_c(row['a']), result_type='expand', axis=1)

Faster way to change each value in dataframe ACCORDING to original value

I have a dataframe with 30000 columns and 4000 rows. Each cell entry contains an integer. For EVERY entry, I want to multiply the original contents with log(k/m),
where k is the total number of rows ie.4000
and m is the total number of non zero rows for THAT PARTICULAR COLUMN.
My current code uses apply:
for column in df.columns:
m = len(df[column].to_numpy().nonzero())
df[column] = df[column].apply(lambda x: x * np.log10(4000/m))
This takes me hours (????). I hope there is some faster way to do it, anyone have any ideas?
Thanks

First generate sample data:
np.random.seed(123)
df = pd.DataFrame(np.random.rand(4, 5)*500, columns=['A', 'B', 'C', 'D', 'E']).astype(int).replace(range(100, 200), 0)
Result:
A B C D E
0 348 0 0 275 359
1 211 490 342 240 0
2 0 364 219 29 0
3 368 91 87 265 265
Next I define a vector with containing the non-zero column counts:
non_zeros = df.ne(0).sum().values
# Giving me: array([3, 3, 3, 4, 2], dtype=int64)
From there I find the log-factor to each column:
faktor = np.mat(np.log10(len(df)/ non_zeros))
# giving me: matrix([[0.12493874, 0.12493874, 0.12493874, 0. , 0.30103 ]])
Then multiplying each column with it's factor and converting back to DataFrame:
res = np.multiply(np.mat(df), faktor)
df = pd.DataFrame(res)
With this solution you come around the non-tight loops in Python.
Hope it will bring some help.

#Dennis Hansen's answer is good, but if you're still need to iterate over column I would recommend not to use apply in your solution.
a = pd.DataFrame(np.random.rand(10000)) # define an arib. dataframe
a.iloc[5:500] = 0 # set some values to zero
Solution with apply performance:
>> %%timeit
>> b = a.apply(lambda x: x * np.log10(10000/len(a.to_numpy().nonzero())))
1.53 ms ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Solution without apply performance:
>> %%timeit
>> b = a*np.log10(10000/len(a.to_numpy().nonzero()))
849 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

how to iterate items of list in columns of dataframe

Here is my dataframe :
import pandas as pd
df = pd.DataFrame({'animal':['dog','cat','rabbit','pig'],'color':['red','green','blue','purple'],\
'season':['spring,','summer','fall','winter']})
and I have a list
l = ['dog','green','purple']
with these dataframe and list, I wanna add another column to df, which is actually a result if column 'animal' or column 'color' matched some item of l(list).
so, the result(dataframe) I want is below(I wanna express a table):
pd.DataFrame({'animal':['dog','cat','rabbit','pig'],
'color':['red','green','blue','purple'],
'season':['spring,','summer','fall','winter'],
'tar_rm':[1,1,0,1] })
Do I have to iterate list in each rows of column?
I believe one of pandas' advantage is broadcasting but i'm not sure it's possible here...

Use:
cols = ['animal','color']
df['tar_rm'] = df[cols].isin(l).any(axis=1).astype(int)
print (df)
animal color season tar_rm
0 dog red spring 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1
Details:
First compare filtered columns of DataFrame by DataFrame.isin:
print (df[cols].isin(l))
animal color
0 True False
1 False True
2 False False
3 False True
Then test if at least one True per rows by DataFrame.any:
print (df[cols].isin(l).any(axis=1))
0 True
1 True
2 False
3 True
dtype: bool
An last cast boolean to integers:
print (df[cols].isin(l).any(axis=1).astype(int))
0 1
1 1
2 0
3 1
dtype: int32
If performance is important compare by isin each column separately, convert to numpy array, chain by bitwise OR and last cast to integers:
df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
Performance: Depends of number of tows, number of matched rows and number of values of list, so best test in real data:
l = ['dog','green','purple']
df = pd.concat([df] * 100000, ignore_index=True).sample(1)
In [173]: %timeit df['tar_rm'] = df[['animal','color']].isin(l).any(axis=1).astype(int)
2.11 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [174]: %timeit df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
487 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [175]: %timeit df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
805 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

using numpy
df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
Output
animal color season tar_rm
0 dog red spring, 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Vectorizing apply(list) and explode in pandas dataframe - pandas

Related

Number of months between two dates while one date is given

pandas: finding date of specified dayofweek close to other date

Pandas dataframe apply to multiple column

Faster way to change each value in dataframe ACCORDING to original value

how to iterate items of list in columns of dataframe

Categories

Resources