Vectorizing apply(list) and explode in pandas dataframe - pandas

I have a dataframe with dates in an integer format (timedelta in days from some arbitrary date) and using another column weeks I'd like to add 7 days to the start_date column for every week > 1 and explode that into another row.
So records with 1 week would remain the same, 2 weeks would get get one additional row, 3 weeks would get 2 additional rows, etc - each additinal row would have start_date incremented by 7.
It's fairly trivial using pd.apply with axis=1, but I can't seem to wrap my head around a vectorized method to solve this.
import pandas as pd
df = pd.DataFrame({'product':['a','b','c'], 'start_date':[1000,1000,1000], 'weeks':[1,2,3]})
Starting df
product start_date weeks
0 a 1000 1
1 b 1000 2
2 c 1000 3
Current approach
df['dates'] = df.apply(lambda x: [x['start_date']+i*7 for i in range(x['weeks'])], axis=1)
df = df.explode('dates').drop(columns=['start_date']).rename({'dates':'start_date'})
Output
product weeks dates
0 a 1 1000
1 b 2 1000
1 b 2 1007
2 c 3 1000
2 c 3 1007
2 c 3 1014

Use loc + index.repeat to scale up the DataFrame, then groupby cumcount to add the multiple, then drop the column:
# Scale up DataFrame
df = df.loc[df.index.repeat(df['weeks'])]
# Create Dates Column grouping by the index (level=0)
df['dates'] = df['start_date'].add(df.groupby(level=0).cumcount().mul(7))
# Drop Column
df = df.drop('start_date', axis=1)
df:
product weeks dates
0 a 1 1000
1 b 2 1000
1 b 2 1007
2 c 3 1000
2 c 3 1007
2 c 3 1014
Timing Information:
import pandas as pd
sample_df = pd.DataFrame({'product': ['a', 'b', 'c'],
'start_date': [1000, 1000, 1000],
'weeks': [1, 2, 3]})
OP's Original Code
def orig(df):
df['dates'] = df.apply(
lambda x: [x['start_date'] + i * 7 for i in range(x['weeks'])], axis=1)
df = df.explode('dates').drop(columns=['start_date']).rename(
{'dates': 'start_date'})
%timeit orig(sample_df)
3.53 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This answer:
def fn(df):
df = df.loc[df.index.repeat(df['weeks'])]
df['dates'] = df['start_date'].add(df.groupby(level=0).cumcount().mul(7))
df = df.drop('start_date', axis=1)
%timeit fn(sample_df)
1.63 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
OP's Answer
def fn2(df):
df['x'] = df['weeks'].apply(lambda x: range(x))
df = df.explode('x')
df['start_date'] = df['start_date'] + (df['x'] * 7)
df.drop(columns='x', inplace=True)
%timeit fn2(sample_df)
2.71 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Figured it out, same concept just a slightly different order.
df = pd.DataFrame({'product':['a','b','c'], 'start_date':[1000,1000,1000], 'weeks':[1,2,3]})
df['x'] = df['weeks'].apply(lambda x:range(x))
df = df.explode('x')
df['start_date'] = df['start_date']+(df['x']*7)
df.drop(columns='x', inplace=True)

Related

Number of months between two dates while one date is given

Input df
Date1
2019-01-23
2020-02-01
note: The type of Date1 is datetime64[ns].
Goal
I want to calculate month diff between Date1 column and '2019-01-01'.
Try and Ref
I try the answers from this post , but it failed as below:
df['Date1'].dt.to_period('M') - pd.to_datetime('2019-01-01').to_period('M')
note:pandas version: 1.1.5
Your solution should be changed by convert periods to integers and for second value is used one element list ['2019-01-01']:
df['new'] = (df['Date1'].dt.to_period('M').astype(int) -
pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
print (df)
Date1 new
0 2019-01-23 0
1 2020-02-01 13
If compare solutions:
rng = pd.date_range('1900-04-03', periods=3000, freq='MS')
df = pd.DataFrame({'Date1': rng})
In [106]: %%timeit
...: date_ref = pd.to_datetime('2019-01-01')
...: df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
...:
1.57 ms ± 8.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [107]: %%timeit
...: df['new'] = (df['Date1'].dt.to_period('M').astype(int) - pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
...:
1.32 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Apply are loops under the hood, so slowier:
In [109]: %%timeit
...: start = pd.to_datetime("2019-01-01")
...: df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
...:
25.7 s ± 729 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [110]: %%timeit
...: rd = df['Date1'].apply(lambda x:relativedelta(x,date(2019,1,1)))
...: mon = rd.apply(lambda x: ((x.years * 12) + x.months))
...: df['Diff'] = mon
...:
94.2 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I think this should work:
date_ref = pd.to_datetime('2019-01-01')
df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
month_delta = (date2.year - date1.year)*12 + (date2.month - date1.month)
output:
Date1 mo_since_2019_01
0 2019-01-23 0
1 2020-02-01 13
With this solution, you won't need further imports as it simply calculates the length of the pd.date_range() between your fixed start date and varying end date:
def relative_months(start, end, freq="M"):
if start < end:
x = len(pd.date_range(start=start,end=end,freq=freq))
else:
x = - len(pd.date_range(start=end,end=start,freq=freq))
return x
start = pd.to_datetime("2019-01-01")
df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
In your specific case, I think anon01's solution should be the quickest/ favorable; my variant however allows the use of generic frequency strings for date offsets like 'M', 'D', … and allows you to specifically handle the edge case of "negative" relative offsets (i.e. what happens if your comparison date is not earlier than all dates in Date1).
Try:
rd = df['Date1'].apply(lambda x:relativedelta(x,date(2019,1,1)))
mon = rd.apply(lambda x: ((x.years * 12) + x.months))
df['Diff'] = mon
Input:
Date1
0 2019-01-23
1 2020-02-01
2 2020-05-01
3 2020-06-01
Output:
Date1 Diff
0 2019-01-23 0
1 2020-02-01 13
2 2020-05-01 16
3 2020-06-01 17

pandas: finding date of specified dayofweek close to other date

I am looking to replace apply method with something faster for the following problem:
having day_of_week and closest_date columns I need to find the found_dates, that are of the specified day_of_week and are closest prior (backwards) to the closest_date, allowing equal to closest_date results.
initial df:
closest_date day_of_week
0 2009-06-01 6
1 2014-09-02 0
2 2014-10-11 4
3 2015-01-02 3
4 2015-07-11 4
I need to speed-up the following working code:
from pandas.tseries.offsets import Week
def find_nearset_day_to_dayofweek(row):
return row['closest_date'] - Week(weekday=row['day_of_week'])
df['date'] = df.apply(find_nearset_day_to_dayofweek, axis=1)
below is just to fix where found_date should be equal to closest_date, but returns a week before.
import numpy as np
df['closest_date_dayofweek'] = df['closest_date'].dt.dayofweek
df['found_date'] = np.where(df['closest_date_dayofweek']==df['day_of_week'],
df['closest_date'],
df['found_date'])
df = df.drop(['closest_date_dayofweek'], axis=1)
that returns the following df
closest_date day_of_week found_date
0 2009-06-01 6 2009-05-31
1 2014-09-02 0 2014-09-01
2 2014-10-11 4 2014-10-10
3 2015-01-02 3 2015-01-01
4 2015-07-11 4 2015-07-10
5 2015-08-08 4 2015-08-07
The problem with the code above is the apply method, which is slow. Any ideas of how to speed up?
Thank you!
Because possible only 7 values you can use loop with filter only matched rows by another column:
for i in range(7):
m = df['day_of_week'].eq(i)
df.loc[m, 'date'] = df.loc[m, 'closest_date'] - Week(weekday=i)
And then new column is not necessary, use:
df['date'] = np.where(df['closest_date'].dt.dayofweek==df['day_of_week'],
df['closest_date'], df['date'])
Performance for 5000 rows:
from pandas.tseries.offsets import Week
def find_nearset_day_to_dayofweek(row):
return row.closest_date - Week(weekday=row['day_of_week'])
df = pd.concat([df] * 1000, ignore_index=True)
In [137]: %timeit df['date'] = df.apply(find_nearset_day_to_dayofweek, axis=1)
550 ms ± 77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [138]: %%timeit
...: for i in range(7):
...: m = df['day_of_week'].eq(i)
...: df.loc[m, 'date1'] = df.loc[m, 'closest_date'] - Week(weekday=i)
...:
38.1 ms ± 883 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas dataframe apply to multiple column

I am trying to use apply function to my DataFrame.
The apply use a custom function that returns 2 values and that needs to populate the row of 2 columns on my DataFrame.
I put a simple example below:
df = DataFrame ({'a' : 10})
I wish to create two columns: b and c.
b equals 1 if a is above 0.
c equals 1 if a is above 0.
def compute_b_c(a):
if a > 0:
return 1, 1
else:
return 0,0
I tried this but it returns key error:
df[['b', 'c']] = df.a.apply(compute_b_c)
It is possible with DataFrame constructor,also 1,1 and 0,0 are like tuples (1,1) and (0,0):
df = pd.DataFrame ({'a' : [10, -1, 9]})
def compute_b_c(a):
if a > 0:
return (1,1)
else:
return (0,0)
df[['b', 'c']] = pd.DataFrame(df.a.apply(compute_b_c).tolist())
print (df)
a b c
0 10 1 1
1 -1 0 0
2 9 1 1
Performance:
#10k rows
df = pd.DataFrame ({'a' : [10, -1, 9] * 10000})
In [79]: %timeit df[['b', 'c']] = pd.DataFrame(df.a.apply(compute_b_c).tolist())
22.6 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [80]: %timeit df[['b', 'c']] = df.apply(lambda row: compute_b_c(row['a']), result_type='expand', axis=1)
5.25 s ± 84.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use result_type parameter of pandas.DataFrame.apply. Applicable only if you use apply function on df(DataFrame) and not df.a(Series)
df[['b', 'c']] = df.apply(lambda row: compute_b_c(row['a']), result_type='expand', axis=1)

Faster way to change each value in dataframe ACCORDING to original value

I have a dataframe with 30000 columns and 4000 rows. Each cell entry contains an integer. For EVERY entry, I want to multiply the original contents with log(k/m),
where k is the total number of rows ie.4000
and m is the total number of non zero rows for THAT PARTICULAR COLUMN.
My current code uses apply:
for column in df.columns:
m = len(df[column].to_numpy().nonzero())
df[column] = df[column].apply(lambda x: x * np.log10(4000/m))
This takes me hours (????). I hope there is some faster way to do it, anyone have any ideas?
Thanks
First generate sample data:
np.random.seed(123)
df = pd.DataFrame(np.random.rand(4, 5)*500, columns=['A', 'B', 'C', 'D', 'E']).astype(int).replace(range(100, 200), 0)
Result:
A B C D E
0 348 0 0 275 359
1 211 490 342 240 0
2 0 364 219 29 0
3 368 91 87 265 265
Next I define a vector with containing the non-zero column counts:
non_zeros = df.ne(0).sum().values
# Giving me: array([3, 3, 3, 4, 2], dtype=int64)
From there I find the log-factor to each column:
faktor = np.mat(np.log10(len(df)/ non_zeros))
# giving me: matrix([[0.12493874, 0.12493874, 0.12493874, 0. , 0.30103 ]])
Then multiplying each column with it's factor and converting back to DataFrame:
res = np.multiply(np.mat(df), faktor)
df = pd.DataFrame(res)
With this solution you come around the non-tight loops in Python.
Hope it will bring some help.
#Dennis Hansen's answer is good, but if you're still need to iterate over column I would recommend not to use apply in your solution.
a = pd.DataFrame(np.random.rand(10000)) # define an arib. dataframe
a.iloc[5:500] = 0 # set some values to zero
Solution with apply performance:
>> %%timeit
>> b = a.apply(lambda x: x * np.log10(10000/len(a.to_numpy().nonzero())))
1.53 ms ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Solution without apply performance:
>> %%timeit
>> b = a*np.log10(10000/len(a.to_numpy().nonzero()))
849 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

how to iterate items of list in columns of dataframe

Here is my dataframe :
import pandas as pd
df = pd.DataFrame({'animal':['dog','cat','rabbit','pig'],'color':['red','green','blue','purple'],\
'season':['spring,','summer','fall','winter']})
and I have a list
l = ['dog','green','purple']
with these dataframe and list, I wanna add another column to df, which is actually a result if column 'animal' or column 'color' matched some item of l(list).
so, the result(dataframe) I want is below(I wanna express a table):
pd.DataFrame({'animal':['dog','cat','rabbit','pig'],
'color':['red','green','blue','purple'],
'season':['spring,','summer','fall','winter'],
'tar_rm':[1,1,0,1] })
Do I have to iterate list in each rows of column?
I believe one of pandas' advantage is broadcasting but i'm not sure it's possible here...
Use:
cols = ['animal','color']
df['tar_rm'] = df[cols].isin(l).any(axis=1).astype(int)
print (df)
animal color season tar_rm
0 dog red spring 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1
Details:
First compare filtered columns of DataFrame by DataFrame.isin:
print (df[cols].isin(l))
animal color
0 True False
1 False True
2 False False
3 False True
Then test if at least one True per rows by DataFrame.any:
print (df[cols].isin(l).any(axis=1))
0 True
1 True
2 False
3 True
dtype: bool
An last cast boolean to integers:
print (df[cols].isin(l).any(axis=1).astype(int))
0 1
1 1
2 0
3 1
dtype: int32
If performance is important compare by isin each column separately, convert to numpy array, chain by bitwise OR and last cast to integers:
df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
Performance: Depends of number of tows, number of matched rows and number of values of list, so best test in real data:
l = ['dog','green','purple']
df = pd.concat([df] * 100000, ignore_index=True).sample(1)
In [173]: %timeit df['tar_rm'] = df[['animal','color']].isin(l).any(axis=1).astype(int)
2.11 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [174]: %timeit df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
487 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [175]: %timeit df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
805 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
using numpy
df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
Output
animal color season tar_rm
0 dog red spring, 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1