pandas: finding date of specified dayofweek close to other date

I am looking to replace the apply method with something faster for the following problem:
given the day_of_week and closest_date columns, I need to find found_date values that fall on the specified day_of_week and are the closest prior dates (at or before closest_date); results equal to closest_date are allowed.
initial df:
closest_date day_of_week
0 2009-06-01 6
1 2014-09-02 0
2 2014-10-11 4
3 2015-01-02 3
4 2015-07-11 4
I need to speed up the following working code:
from pandas.tseries.offsets import Week

def find_nearset_day_to_dayofweek(row):
    return row['closest_date'] - Week(weekday=row['day_of_week'])

df['found_date'] = df.apply(find_nearset_day_to_dayofweek, axis=1)
The code below is only a patch for the rows where found_date should equal closest_date, since the Week offset steps back a full week in that case:
import numpy as np

df['closest_date_dayofweek'] = df['closest_date'].dt.dayofweek
df['found_date'] = np.where(df['closest_date_dayofweek'] == df['day_of_week'],
                            df['closest_date'],
                            df['found_date'])
df = df.drop(['closest_date_dayofweek'], axis=1)
This returns the following df:
closest_date day_of_week found_date
0 2009-06-01 6 2009-05-31
1 2014-09-02 0 2014-09-01
2 2014-10-11 4 2014-10-10
3 2015-01-02 3 2015-01-01
4 2015-07-11 4 2015-07-10
5 2015-08-08 4 2015-08-07
The problem with the code above is the apply method, which is slow. Any ideas on how to speed it up?
Thank you!

Because there are only 7 possible values, you can loop over them and process just the matching rows for each weekday:
for i in range(7):
    m = df['day_of_week'].eq(i)
    df.loc[m, 'date'] = df.loc[m, 'closest_date'] - Week(weekday=i)
Then the helper column is not necessary; fix the equal-date cases directly:
df['date'] = np.where(df['closest_date'].dt.dayofweek == df['day_of_week'],
                      df['closest_date'], df['date'])
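As a side note (my addition, not part of the original answer), the same result can be computed fully vectorized, with no loop at all, via modular weekday arithmetic:

import pandas as pd

# (current weekday - target weekday) mod 7 is the number of days to step
# back to the most recent occurrence of day_of_week (0 when they match).
days_back = (df['closest_date'].dt.dayofweek - df['day_of_week']) % 7
df['date'] = df['closest_date'] - pd.to_timedelta(days_back, unit='D')

This also covers the equal-date case directly, so the np.where fix-up is not needed.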
Performance for 5000 rows:
from pandas.tseries.offsets import Week

def find_nearset_day_to_dayofweek(row):
    return row.closest_date - Week(weekday=row['day_of_week'])

df = pd.concat([df] * 1000, ignore_index=True)
In [137]: %timeit df['date'] = df.apply(find_nearset_day_to_dayofweek, axis=1)
550 ms ± 77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [138]: %%timeit
     ...: for i in range(7):
     ...:     m = df['day_of_week'].eq(i)
     ...:     df.loc[m, 'date1'] = df.loc[m, 'closest_date'] - Week(weekday=i)
     ...:
38.1 ms ± 883 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Frequency Table from All DataFrame Data

I want to generate a frequency table from all values in a DataFrame. I do not want values from the index, and the index can be destroyed.
Sample data:
import numpy as np
import pandas as pd

col_list = ['ob1', 'ob2', 'ob3', 'ob4', 'ob5']
df = pd.DataFrame(np.random.uniform(73.965, 74.03, size=(25, 5)).astype(float), columns=col_list)
My attempt, based on this answer:
my_bins = [i for i in np.arange(73.965, 74.030, 0.005)]
df2 = df.apply(pd.Series.value_counts, bins=my_bins)
The code crashes, and I can't find another example that does what I'm trying to do.
The desired output is a frequency table with counts for all values in the bins, something like this:
data_range      Frequency
73.965<=73.97           1
73.97<=73.975           0
73.98<=73.985           3
73.99<=73.995           2
And so on.
Your approach/code works fine for me:
my_bins = [i for i in np.arange(73.965, 74.030, 0.005)]
out1 = (
    df.apply(pd.Series.value_counts, bins=my_bins)
      .sum(axis=1).reset_index()
      .set_axis(['data_range', 'Frequency'], axis=1)
)
# 32.6 ms ± 803 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here is a different approach (using cut) that seems to be ~12x faster than apply:
my_bins = np.arange(73.965, 74.030, 0.005)

labels = [f"{np.around(l, 3)}<={np.around(r, 3)}"
          for l, r in zip(my_bins[:-1], my_bins[1:])]

out2 = (
    pd.Series(pd.cut(df.to_numpy().flatten(),
                     my_bins, labels=labels))
      .value_counts(sort=False).reset_index()
      .set_axis(['data_range', 'Frequency'], axis=1)
)
# 2.42 ms ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Output:
print(out2)
data_range Frequency
0 73.965<=73.97 16
1 73.97<=73.975 0
2 73.975<=73.98 15
3 73.98<=73.985 12
4 73.985<=73.99 7
.. ... ...
7 74.0<=74.005 8
8 74.005<=74.01 9
9 74.01<=74.015 7
10 74.015<=74.02 7
11 74.02<=74.025 11
[12 rows x 2 columns]
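For completeness, a numpy-only variant (my addition, not from the original answers) counts against the same bin edges in a single np.histogram call. Note that np.histogram uses half-open bins [l, r) with only the last bin closed on the right, which differs slightly from pd.cut's right-closed intervals:

import numpy as np
import pandas as pd

# Count all DataFrame values against the bin edges in one vectorized call,
# then rebuild the labels the same way as above.
counts, edges = np.histogram(df.to_numpy().ravel(), bins=my_bins)
out3 = pd.DataFrame({
    'data_range': [f"{np.around(l, 3)}<={np.around(r, 3)}"
                   for l, r in zip(edges[:-1], edges[1:])],
    'Frequency': counts,
})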

Create new dataframe column with isoweekday from datetime

I have this dataframe and I want to make a new column for which day of the week the collisions were on.
collision_date
0 2020-03-14
1 2020-07-26
2 2009-02-03
3 2009-02-28
4 2009-02-09
I have tried variations of this but nothing works.
df["day of the week"] = df["collision_date"].isoweekday()
df["day of the week"] = df["collision_date"].apply(isoweekday)
Assuming collision_date is a datetime column, we can use dt.weekday (+1 to match isoweekday, which returns 1-7 instead of 0-6):
# Convert if needed
df['collision_date'] = pd.to_datetime(df['collision_date'])
# Turn into a number
df['day of week'] = df['collision_date'].dt.weekday + 1
The slower option with apply is to call isoweekday per date:
from datetime import date
# Convert if needed
df['collision_date'] = pd.to_datetime(df['collision_date'])
# Call isoweekday per date
df['day of week'] = df['collision_date'].apply(date.isoweekday)
df:
collision_date day of week
0 2020-03-14 6
1 2020-07-26 7
2 2009-02-03 2
3 2009-02-28 6
4 2009-02-09 1
Timing Information via timeit:
Sample Data with 1000 rows:
df = pd.DataFrame({
    'collision_date': pd.date_range(start='now', periods=1000, freq='D')
})
dt.weekday:
%timeit df['collision_date'].dt.weekday + 1
261 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
apply:
%timeit df['collision_date'].apply(date.isoweekday)
2.53 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
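As an aside (my addition, not part of the original answer), pandas 1.1+ also exposes the ISO day of the week directly, so the +1 adjustment is not needed:

# dt.isocalendar() returns a DataFrame with 'year', 'week' and 'day' columns;
# 'day' is already the ISO weekday (1 = Monday ... 7 = Sunday).
df['day of the week'] = df['collision_date'].dt.isocalendar().day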

Vectorizing apply(list) and explode in pandas dataframe

I have a dataframe with dates in an integer format (a timedelta in days from some arbitrary date). Using another column, weeks, I'd like to add 7 days to the start_date column for every week > 1 and explode that into another row.
So records with 1 week would remain the same, records with 2 weeks would get one additional row, records with 3 weeks would get 2 additional rows, and so on; each additional row would have start_date incremented by 7.
It's fairly trivial using pd.apply with axis=1, but I can't seem to wrap my head around a vectorized method to solve this.
import pandas as pd
df = pd.DataFrame({'product':['a','b','c'], 'start_date':[1000,1000,1000], 'weeks':[1,2,3]})
Starting df
product start_date weeks
0 a 1000 1
1 b 1000 2
2 c 1000 3
Current approach
df['dates'] = df.apply(lambda x: [x['start_date']+i*7 for i in range(x['weeks'])], axis=1)
df = df.explode('dates').drop(columns=['start_date']).rename({'dates':'start_date'})
Output
product weeks dates
0 a 1 1000
1 b 2 1000
1 b 2 1007
2 c 3 1000
2 c 3 1007
2 c 3 1014
Use loc + index.repeat to scale up the DataFrame, then add the 7-day multiples with groupby cumcount, then drop the column:
# Scale up DataFrame
df = df.loc[df.index.repeat(df['weeks'])]
# Create Dates Column grouping by the index (level=0)
df['dates'] = df['start_date'].add(df.groupby(level=0).cumcount().mul(7))
# Drop Column
df = df.drop('start_date', axis=1)
df:
product weeks dates
0 a 1 1000
1 b 2 1000
1 b 2 1007
2 c 3 1000
2 c 3 1007
2 c 3 1014
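The groupby(level=0).cumcount() step can also be built directly in numpy; the following is a sketch of that variant (my addition, not part of the original answer):

import numpy as np
import pandas as pd

# Starting again from the original frame:
df = pd.DataFrame({'product': ['a', 'b', 'c'],
                   'start_date': [1000, 1000, 1000],
                   'weeks': [1, 2, 3]})

# Repeat rows by 'weeks', then build the per-row 0..w-1 counters in numpy
# instead of with groupby(level=0).cumcount().
out = df.loc[df.index.repeat(df['weeks'])].copy()
counters = np.concatenate([np.arange(w) for w in df['weeks']])
out['dates'] = out['start_date'].to_numpy() + counters * 7
out = out.drop(columns='start_date')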
Timing Information:
import pandas as pd

sample_df = pd.DataFrame({'product': ['a', 'b', 'c'],
                          'start_date': [1000, 1000, 1000],
                          'weeks': [1, 2, 3]})
OP's original code:
def orig(df):
    df['dates'] = df.apply(
        lambda x: [x['start_date'] + i * 7 for i in range(x['weeks'])], axis=1)
    df = df.explode('dates').drop(columns=['start_date']).rename(
        {'dates': 'start_date'})
%timeit orig(sample_df)
3.53 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This answer:
def fn(df):
    df = df.loc[df.index.repeat(df['weeks'])]
    df['dates'] = df['start_date'].add(df.groupby(level=0).cumcount().mul(7))
    df = df.drop('start_date', axis=1)
%timeit fn(sample_df)
1.63 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
OP's answer:
def fn2(df):
    df['x'] = df['weeks'].apply(lambda x: range(x))
    df = df.explode('x')
    df['start_date'] = df['start_date'] + (df['x'] * 7)
    df.drop(columns='x', inplace=True)
%timeit fn2(sample_df)
2.71 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Figured it out; it's the same concept, just in a slightly different order.
df = pd.DataFrame({'product':['a','b','c'], 'start_date':[1000,1000,1000], 'weeks':[1,2,3]})
df['x'] = df['weeks'].apply(lambda x:range(x))
df = df.explode('x')
df['start_date'] = df['start_date']+(df['x']*7)
df.drop(columns='x', inplace=True)
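As a small footnote (my addition, not from the original post), the lambda can be avoided by mapping range directly over the column:

# Series.map calls range(w) on each element, yielding the same per-row
# range objects that explode expands.
df['x'] = df['weeks'].map(range)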

Number of months between two dates while one date is given

Input df
Date1
2019-01-23
2020-02-01
note: The type of Date1 is datetime64[ns].
Goal
I want to calculate the month difference between the Date1 column and '2019-01-01'.
Try and Ref
I tried the answers from this post, but they failed as shown below:
df['Date1'].dt.to_period('M') - pd.to_datetime('2019-01-01').to_period('M')
note: pandas version 1.1.5
Your solution needs two changes: convert the periods to integers, and pass the second value as a one-element list ['2019-01-01']:
df['new'] = (df['Date1'].dt.to_period('M').astype(int) -
             pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
print (df)
Date1 new
0 2019-01-23 0
1 2020-02-01 13
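Why the integer conversion works (my explanation, not part of the original answer): casting a monthly Period to int yields its ordinal, i.e. the number of months since 1970-01, so subtracting two ordinals gives a whole-month difference. On the question's pandas 1.1.5:

import pandas as pd

# (2019 - 1970) * 12 = 588 months since the 1970-01 epoch
print(pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
# Int64Index([588], dtype='int64')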
Comparing the solutions:
rng = pd.date_range('1900-04-03', periods=3000, freq='MS')
df = pd.DataFrame({'Date1': rng})
In [106]: %%timeit
...: date_ref = pd.to_datetime('2019-01-01')
...: df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
...:
1.57 ms ± 8.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [107]: %%timeit
...: df['new'] = (df['Date1'].dt.to_period('M').astype(int) - pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
...:
1.32 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
apply is a loop under the hood, so it is slower:
In [109]: %%timeit
...: start = pd.to_datetime("2019-01-01")
...: df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
...:
25.7 s ± 729 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [110]: %%timeit
...: rd = df['Date1'].apply(lambda x:relativedelta(x,date(2019,1,1)))
...: mon = rd.apply(lambda x: ((x.years * 12) + x.months))
...: df['Diff'] = mon
...:
94.2 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I think this should work:
date_ref = pd.to_datetime('2019-01-01')
df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
In general, for any two dates: month_delta = (date2.year - date1.year)*12 + (date2.month - date1.month)
Output:
Date1 mo_since_2019_01
0 2019-01-23 0
1 2020-02-01 13
With this solution you won't need further imports, as it simply calculates the length of the pd.date_range() between your fixed start date and the varying end date:
def relative_months(start, end, freq="M"):
    if start < end:
        x = len(pd.date_range(start=start, end=end, freq=freq))
    else:
        x = -len(pd.date_range(start=end, end=start, freq=freq))
    return x
start = pd.to_datetime("2019-01-01")
df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
In your specific case, I think anon01's solution should be the quickest/most favorable; my variant, however, allows generic frequency strings for date offsets like 'M', 'D', etc., and lets you explicitly handle the edge case of "negative" relative offsets (i.e. what happens if your comparison date is not earlier than all dates in Date1).
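A quick sanity check of relative_months against the sample data (my own usage example):

print(relative_months(pd.to_datetime('2019-01-01'),
                      pd.to_datetime('2020-02-01')))
# 13 -- thirteen month-ends fall in that range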
Try:
from datetime import date
from dateutil.relativedelta import relativedelta

rd = df['Date1'].apply(lambda x: relativedelta(x, date(2019, 1, 1)))
mon = rd.apply(lambda x: (x.years * 12) + x.months)
df['Diff'] = mon
Input:
Date1
0 2019-01-23
1 2020-02-01
2 2020-05-01
3 2020-06-01
Output:
Date1 Diff
0 2019-01-23 0
1 2020-02-01 13
2 2020-05-01 16
3 2020-06-01 17

How to iterate items of a list in columns of a dataframe

Here is my dataframe:
import pandas as pd

df = pd.DataFrame({'animal': ['dog', 'cat', 'rabbit', 'pig'],
                   'color': ['red', 'green', 'blue', 'purple'],
                   'season': ['spring,', 'summer', 'fall', 'winter']})
and I have a list
l = ['dog','green','purple']
With this dataframe and list, I want to add another column to df that indicates whether column 'animal' or column 'color' matches some item of the list l.
So the result (DataFrame) I want is the one below:
pd.DataFrame({'animal': ['dog', 'cat', 'rabbit', 'pig'],
              'color': ['red', 'green', 'blue', 'purple'],
              'season': ['spring,', 'summer', 'fall', 'winter'],
              'tar_rm': [1, 1, 0, 1]})
Do I have to iterate over the list for each row of those columns?
I believe one of pandas' advantages is broadcasting, but I'm not sure it applies here...
Use:
cols = ['animal','color']
df['tar_rm'] = df[cols].isin(l).any(axis=1).astype(int)
print (df)
animal color season tar_rm
0 dog red spring, 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1
Details:
First, compare the filtered columns of the DataFrame with DataFrame.isin:
print (df[cols].isin(l))
animal color
0 True False
1 False True
2 False False
3 False True
Then test whether there is at least one True per row with DataFrame.any:
print (df[cols].isin(l).any(axis=1))
0 True
1 True
2 False
3 True
dtype: bool
And last, cast the booleans to integers:
print (df[cols].isin(l).any(axis=1).astype(int))
0 1
1 1
2 0
3 1
dtype: int32
If performance is important, apply isin to each column separately, convert to numpy arrays, chain them with bitwise OR, and finally cast to integers:
df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
Performance depends on the number of rows, the number of matched rows, and the number of values in the list, so it is best to test on real data:
l = ['dog','green','purple']
df = pd.concat([df] * 100000, ignore_index=True).sample(1)
In [173]: %timeit df['tar_rm'] = df[['animal','color']].isin(l).any(axis=1).astype(int)
2.11 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [174]: %timeit df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
487 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [175]: %timeit df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
805 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using numpy:
df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
Output
animal color season tar_rm
0 dog red spring, 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1
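If the check ever has to cover an arbitrary set of columns, the bitwise-OR chaining generalizes with numpy's reduce; this is a sketch of that idea (my addition, not from the original answers):

import numpy as np

cols = ['animal', 'color']
# One isin mask per column, OR-ed together in numpy, then cast to int.
mask = np.logical_or.reduce([df[c].isin(l).to_numpy() for c in cols])
df['tar_rm'] = mask.astype(int)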