pandas: finding date of specified dayofweek close to other date

I am looking to replace the apply method with something faster for the following problem:
given the day_of_week and closest_date columns, I need to find found_date values that fall on the specified day_of_week and are the closest prior dates (at or before closest_date); results equal to closest_date are allowed.
initial df:
closest_date day_of_week
0 2009-06-01 6
1 2014-09-02 0
2 2014-10-11 4
3 2015-01-02 3
4 2015-07-11 4
I need to speed up the following working code:
from pandas.tseries.offsets import Week

def find_nearset_day_to_dayofweek(row):
    return row['closest_date'] - Week(weekday=row['day_of_week'])

df['found_date'] = df.apply(find_nearset_day_to_dayofweek, axis=1)
The code below is only a patch for the rows where found_date should equal closest_date, since the Week offset steps back a full week in that case:
import numpy as np

df['closest_date_dayofweek'] = df['closest_date'].dt.dayofweek
df['found_date'] = np.where(df['closest_date_dayofweek'] == df['day_of_week'],
                            df['closest_date'],
                            df['found_date'])
df = df.drop(['closest_date_dayofweek'], axis=1)
This returns the following df:
closest_date day_of_week found_date
0 2009-06-01 6 2009-05-31
1 2014-09-02 0 2014-09-01
2 2014-10-11 4 2014-10-10
3 2015-01-02 3 2015-01-01
4 2015-07-11 4 2015-07-10
5 2015-08-08 4 2015-08-07
The problem with the code above is the apply method, which is slow. Any ideas on how to speed it up?
Thank you!

Because there are only 7 possible values, you can loop over them and process just the matching rows for each weekday:
for i in range(7):
    m = df['day_of_week'].eq(i)
    df.loc[m, 'date'] = df.loc[m, 'closest_date'] - Week(weekday=i)
Then the helper column is not necessary; fix the equal-date cases directly:
df['date'] = np.where(df['closest_date'].dt.dayofweek == df['day_of_week'],
                      df['closest_date'], df['date'])
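As a side note (my addition, not part of the original answer), the same result can be computed fully vectorized, with no loop at all, via modular weekday arithmetic:

import pandas as pd

# (current weekday - target weekday) mod 7 is the number of days to step
# back to the most recent occurrence of day_of_week (0 when they match).
days_back = (df['closest_date'].dt.dayofweek - df['day_of_week']) % 7
df['date'] = df['closest_date'] - pd.to_timedelta(days_back, unit='D')

This also covers the equal-date case directly, so the np.where fix-up is not needed.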
Performance for 5000 rows:
from pandas.tseries.offsets import Week

def find_nearset_day_to_dayofweek(row):
    return row.closest_date - Week(weekday=row['day_of_week'])

df = pd.concat([df] * 1000, ignore_index=True)
In [137]: %timeit df['date'] = df.apply(find_nearset_day_to_dayofweek, axis=1)
550 ms ± 77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [138]: %%timeit
     ...: for i in range(7):
     ...:     m = df['day_of_week'].eq(i)
     ...:     df.loc[m, 'date1'] = df.loc[m, 'closest_date'] - Week(weekday=i)
     ...:
38.1 ms ± 883 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Frequency Table from All DataFrame Data

I want to generate a frequency table from all values in a DataFrame. I do not want values from the index, and the index can be destroyed.
Sample data:
import numpy as np
import pandas as pd

col_list = ['ob1', 'ob2', 'ob3', 'ob4', 'ob5']
df = pd.DataFrame(np.random.uniform(73.965, 74.03, size=(25, 5)).astype(float), columns=col_list)
My attempt, based on this answer:
my_bins = [i for i in np.arange(73.965, 74.030, 0.005)]
df2 = df.apply(pd.Series.value_counts, bins=my_bins)
The code crashes, and I can't find another example that does what I'm trying to do.
The desired output is a frequency table with counts for all values in the bins, something like this:
data_range      Frequency
73.965<=73.97           1
73.97<=73.975           0
73.98<=73.985           3
73.99<=73.995           2
And so on.
Your approach/code works fine for me:
my_bins = [i for i in np.arange(73.965, 74.030, 0.005)]
out1 = (
    df.apply(pd.Series.value_counts, bins=my_bins)
      .sum(axis=1).reset_index()
      .set_axis(['data_range', 'Frequency'], axis=1)
)
# 32.6 ms ± 803 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here is a different approach (using cut) that seems to be ~12x faster than apply:
my_bins = np.arange(73.965, 74.030, 0.005)

labels = [f"{np.around(l, 3)}<={np.around(r, 3)}"
          for l, r in zip(my_bins[:-1], my_bins[1:])]

out2 = (
    pd.Series(pd.cut(df.to_numpy().flatten(),
                     my_bins, labels=labels))
      .value_counts(sort=False).reset_index()
      .set_axis(['data_range', 'Frequency'], axis=1)
)
# 2.42 ms ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Output:
print(out2)
data_range Frequency
0 73.965<=73.97 16
1 73.97<=73.975 0
2 73.975<=73.98 15
3 73.98<=73.985 12
4 73.985<=73.99 7
.. ... ...
7 74.0<=74.005 8
8 74.005<=74.01 9
9 74.01<=74.015 7
10 74.015<=74.02 7
11 74.02<=74.025 11
[12 rows x 2 columns]
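For completeness, a numpy-only variant (my addition, not from the original answers) counts against the same bin edges in a single np.histogram call. Note that np.histogram uses half-open bins [l, r) with only the last bin closed on the right, which differs slightly from pd.cut's right-closed intervals:

import numpy as np
import pandas as pd

# Count all DataFrame values against the bin edges in one vectorized call,
# then rebuild the labels the same way as above.
counts, edges = np.histogram(df.to_numpy().ravel(), bins=my_bins)
out3 = pd.DataFrame({
    'data_range': [f"{np.around(l, 3)}<={np.around(r, 3)}"
                   for l, r in zip(edges[:-1], edges[1:])],
    'Frequency': counts,
})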

Create new dataframe column with isoweekday from datetime

I have this dataframe and I want to make a new column for which day of the week the collisions were on.
collision_date
0 2020-03-14
1 2020-07-26
2 2009-02-03
3 2009-02-28
4 2009-02-09
I have tried variations of this but nothing works.
df["day of the week"] = df["collision_date"].isoweekday()
df["day of the week"] = df["collision_date"].apply(isoweekday)
Assuming collision_date is a datetime column, we can use dt.weekday (+1 to match isoweekday, which returns 1-7 instead of 0-6):
# Convert if needed
df['collision_date'] = pd.to_datetime(df['collision_date'])
# Turn into a number
df['day of week'] = df['collision_date'].dt.weekday + 1
The slower option with apply is to call isoweekday per date:
from datetime import date
# Convert if needed
df['collision_date'] = pd.to_datetime(df['collision_date'])
# Call isoweekday per date
df['day of week'] = df['collision_date'].apply(date.isoweekday)
df:
collision_date day of week
0 2020-03-14 6
1 2020-07-26 7
2 2009-02-03 2
3 2009-02-28 6
4 2009-02-09 1
Timing Information via timeit:
Sample Data with 1000 rows:
df = pd.DataFrame({
    'collision_date': pd.date_range(start='now', periods=1000, freq='D')
})
dt.weekday:
%timeit df['collision_date'].dt.weekday + 1
261 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
apply:
%timeit df['collision_date'].apply(date.isoweekday)
2.53 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
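As an aside (my addition, not part of the original answer), pandas 1.1+ also exposes the ISO day of the week directly, so the +1 adjustment is not needed:

# dt.isocalendar() returns a DataFrame with 'year', 'week' and 'day' columns;
# 'day' is already the ISO weekday (1 = Monday ... 7 = Sunday).
df['day of the week'] = df['collision_date'].dt.isocalendar().day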

Vectorizing apply(list) and explode in pandas dataframe

I have a dataframe with dates in an integer format (a timedelta in days from some arbitrary date). Using another column, weeks, I'd like to add 7 days to the start_date column for every week > 1 and explode that into another row.
So records with 1 week would remain the same, records with 2 weeks would get one additional row, records with 3 weeks would get 2 additional rows, and so on; each additional row would have start_date incremented by 7.
It's fairly trivial using pd.apply with axis=1, but I can't seem to wrap my head around a vectorized method to solve this.
import pandas as pd
df = pd.DataFrame({'product':['a','b','c'], 'start_date':[1000,1000,1000], 'weeks':[1,2,3]})
Starting df
product start_date weeks
0 a 1000 1
1 b 1000 2
2 c 1000 3
Current approach
df['dates'] = df.apply(lambda x: [x['start_date']+i*7 for i in range(x['weeks'])], axis=1)
df = df.explode('dates').drop(columns=['start_date']).rename({'dates':'start_date'})
Output
product weeks dates
0 a 1 1000
1 b 2 1000
1 b 2 1007
2 c 3 1000
2 c 3 1007
2 c 3 1014
Use loc + index.repeat to scale up the DataFrame, then add the 7-day multiples with groupby cumcount, then drop the column:
# Scale up DataFrame
df = df.loc[df.index.repeat(df['weeks'])]
# Create Dates Column grouping by the index (level=0)
df['dates'] = df['start_date'].add(df.groupby(level=0).cumcount().mul(7))
# Drop Column
df = df.drop('start_date', axis=1)
df:
product weeks dates
0 a 1 1000
1 b 2 1000
1 b 2 1007
2 c 3 1000
2 c 3 1007
2 c 3 1014
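The groupby(level=0).cumcount() step can also be built directly in numpy; the following is a sketch of that variant (my addition, not part of the original answer):

import numpy as np
import pandas as pd

# Starting again from the original frame:
df = pd.DataFrame({'product': ['a', 'b', 'c'],
                   'start_date': [1000, 1000, 1000],
                   'weeks': [1, 2, 3]})

# Repeat rows by 'weeks', then build the per-row 0..w-1 counters in numpy
# instead of with groupby(level=0).cumcount().
out = df.loc[df.index.repeat(df['weeks'])].copy()
counters = np.concatenate([np.arange(w) for w in df['weeks']])
out['dates'] = out['start_date'].to_numpy() + counters * 7
out = out.drop(columns='start_date')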
Timing Information:
import pandas as pd

sample_df = pd.DataFrame({'product': ['a', 'b', 'c'],
                          'start_date': [1000, 1000, 1000],
                          'weeks': [1, 2, 3]})
OP's original code:
def orig(df):
    df['dates'] = df.apply(
        lambda x: [x['start_date'] + i * 7 for i in range(x['weeks'])], axis=1)
    df = df.explode('dates').drop(columns=['start_date']).rename(
        {'dates': 'start_date'})
%timeit orig(sample_df)
3.53 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This answer:
def fn(df):
    df = df.loc[df.index.repeat(df['weeks'])]
    df['dates'] = df['start_date'].add(df.groupby(level=0).cumcount().mul(7))
    df = df.drop('start_date', axis=1)
%timeit fn(sample_df)
1.63 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
OP's answer:
def fn2(df):
    df['x'] = df['weeks'].apply(lambda x: range(x))
    df = df.explode('x')
    df['start_date'] = df['start_date'] + (df['x'] * 7)
    df.drop(columns='x', inplace=True)
%timeit fn2(sample_df)
2.71 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Figured it out; it's the same concept, just in a slightly different order.
df = pd.DataFrame({'product':['a','b','c'], 'start_date':[1000,1000,1000], 'weeks':[1,2,3]})
df['x'] = df['weeks'].apply(lambda x:range(x))
df = df.explode('x')
df['start_date'] = df['start_date']+(df['x']*7)
df.drop(columns='x', inplace=True)
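As a small footnote (my addition, not from the original post), the lambda can be avoided by mapping range directly over the column:

# Series.map calls range(w) on each element, yielding the same per-row
# range objects that explode expands.
df['x'] = df['weeks'].map(range)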

Number of months between two dates while one date is given

Input df
Date1
2019-01-23
2020-02-01
note: The type of Date1 is datetime64[ns].
Goal
I want to calculate the month difference between the Date1 column and '2019-01-01'.
Try and Ref
I tried the answers from this post, but they failed as shown below:
df['Date1'].dt.to_period('M') - pd.to_datetime('2019-01-01').to_period('M')
note: pandas version 1.1.5
Your solution needs two changes: convert the periods to integers, and pass the second value as a one-element list ['2019-01-01']:
df['new'] = (df['Date1'].dt.to_period('M').astype(int) -
             pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
print (df)
Date1 new
0 2019-01-23 0
1 2020-02-01 13
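Why the integer conversion works (my explanation, not part of the original answer): casting a monthly Period to int yields its ordinal, i.e. the number of months since 1970-01, so subtracting two ordinals gives a whole-month difference. On the question's pandas 1.1.5:

import pandas as pd

# (2019 - 1970) * 12 = 588 months since the 1970-01 epoch
print(pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
# Int64Index([588], dtype='int64')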
Comparing the solutions:
rng = pd.date_range('1900-04-03', periods=3000, freq='MS')
df = pd.DataFrame({'Date1': rng})
In [106]: %%timeit
...: date_ref = pd.to_datetime('2019-01-01')
...: df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
...:
1.57 ms ± 8.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [107]: %%timeit
...: df['new'] = (df['Date1'].dt.to_period('M').astype(int) - pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
...:
1.32 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
apply is a loop under the hood, so it is slower:
In [109]: %%timeit
...: start = pd.to_datetime("2019-01-01")
...: df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
...:
25.7 s ± 729 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [110]: %%timeit
...: rd = df['Date1'].apply(lambda x:relativedelta(x,date(2019,1,1)))
...: mon = rd.apply(lambda x: ((x.years * 12) + x.months))
...: df['Diff'] = mon
...:
94.2 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I think this should work:
date_ref = pd.to_datetime('2019-01-01')
df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
In general, for any two dates: month_delta = (date2.year - date1.year)*12 + (date2.month - date1.month)
Output:
Date1 mo_since_2019_01
0 2019-01-23 0
1 2020-02-01 13
With this solution you won't need further imports, as it simply calculates the length of the pd.date_range() between your fixed start date and the varying end date:
def relative_months(start, end, freq="M"):
    if start < end:
        x = len(pd.date_range(start=start, end=end, freq=freq))
    else:
        x = -len(pd.date_range(start=end, end=start, freq=freq))
    return x
start = pd.to_datetime("2019-01-01")
df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
In your specific case, I think anon01's solution should be the quickest/most favorable; my variant, however, allows generic frequency strings for date offsets like 'M', 'D', etc., and lets you explicitly handle the edge case of "negative" relative offsets (i.e. what happens if your comparison date is not earlier than all dates in Date1).
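A quick sanity check of relative_months against the sample data (my own usage example):

print(relative_months(pd.to_datetime('2019-01-01'),
                      pd.to_datetime('2020-02-01')))
# 13 -- thirteen month-ends fall in that range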
Try:
from datetime import date
from dateutil.relativedelta import relativedelta

rd = df['Date1'].apply(lambda x: relativedelta(x, date(2019, 1, 1)))
mon = rd.apply(lambda x: (x.years * 12) + x.months)
df['Diff'] = mon
Input:
Date1
0 2019-01-23
1 2020-02-01
2 2020-05-01
3 2020-06-01
Output:
Date1 Diff
0 2019-01-23 0
1 2020-02-01 13
2 2020-05-01 16
3 2020-06-01 17

How to iterate items of a list in columns of a dataframe

Here is my dataframe:
import pandas as pd

df = pd.DataFrame({'animal': ['dog', 'cat', 'rabbit', 'pig'],
                   'color': ['red', 'green', 'blue', 'purple'],
                   'season': ['spring,', 'summer', 'fall', 'winter']})
and I have a list
l = ['dog','green','purple']
With this dataframe and list, I want to add another column to df that indicates whether column 'animal' or column 'color' matches some item of the list l.
So the result (DataFrame) I want is the one below:
pd.DataFrame({'animal': ['dog', 'cat', 'rabbit', 'pig'],
              'color': ['red', 'green', 'blue', 'purple'],
              'season': ['spring,', 'summer', 'fall', 'winter'],
              'tar_rm': [1, 1, 0, 1]})
Do I have to iterate over the list for each row of those columns?
I believe one of pandas' advantages is broadcasting, but I'm not sure it applies here...
Use:
cols = ['animal','color']
df['tar_rm'] = df[cols].isin(l).any(axis=1).astype(int)
print (df)
animal color season tar_rm
0 dog red spring, 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1
Details:
First, compare the filtered columns of the DataFrame with DataFrame.isin:
print (df[cols].isin(l))
animal color
0 True False
1 False True
2 False False
3 False True
Then test whether there is at least one True per row with DataFrame.any:
print (df[cols].isin(l).any(axis=1))
0 True
1 True
2 False
3 True
dtype: bool
And last, cast the booleans to integers:
print (df[cols].isin(l).any(axis=1).astype(int))
0 1
1 1
2 0
3 1
dtype: int32
If performance is important, apply isin to each column separately, convert to numpy arrays, chain them with bitwise OR, and finally cast to integers:
df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
Performance depends on the number of rows, the number of matched rows, and the number of values in the list, so it is best to test on real data:
l = ['dog','green','purple']
df = pd.concat([df] * 100000, ignore_index=True).sample(1)
In [173]: %timeit df['tar_rm'] = df[['animal','color']].isin(l).any(axis=1).astype(int)
2.11 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [174]: %timeit df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
487 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [175]: %timeit df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
805 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using numpy:
df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
Output
animal color season tar_rm
0 dog red spring, 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1
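If the check ever has to cover an arbitrary set of columns, the bitwise-OR chaining generalizes with numpy's reduce; this is a sketch of that idea (my addition, not from the original answers):

import numpy as np

cols = ['animal', 'color']
# One isin mask per column, OR-ed together in numpy, then cast to int.
mask = np.logical_or.reduce([df[c].isin(l).to_numpy() for c in cols])
df['tar_rm'] = mask.astype(int)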