Pandas dataframe diff except some rows?

df
end_date dt_eps
0 20200930 0.9625
1 20200630 0.5200
2 20200331 0.2130
3 20191231 1.2700
4 20190930 -0.1017
5 20190630 -0.1058
6 20190331 0.0021
7 20181231 0.0100
Note: end_date is always the last day of a calendar quarter, it is stored as a string, and the rows are sorted from most recent to oldest.
Goal
Create a q_dt_eps column: the difference in dt_eps between each row and the next (older) row, except that q_dt_eps equals dt_eps itself when the quarter is Q1. For example, q_dt_eps for 20200930 is 0.4425 (0.9625 - 0.5200), while for 20200331 it is 0.2130.
Try
df['q_dt_eps']=df['dt_eps'].diff(periods=-1)
But this does not return dt_eps itself when the quarter is Q1.

You can convert the date to datetime, extract its quarter, and then create the new column with np.where, keeping the original value when the quarter equals 1 and otherwise using the difference from the next row.
import numpy as np
import pandas as pd
df = pd.DataFrame({'end_date':['20200930', '20200630', '20200331',
'20191231', '20190930', '20190630', '20190331', '20181231'],
'dt_eps':[0.9625, 0.52, 0.213, 1.27, -.1017, -.1058, .0021, .01]})
df['end_date'] = pd.to_datetime(df['end_date'], format='%Y%m%d')
df['qtr'] = df['end_date'].dt.quarter
df['q_dt_eps'] = np.where(df['qtr']==1, df['dt_eps'], df['dt_eps'].diff(-1))
df
end_date dt_eps qtr q_dt_eps
0 2020-09-30 0.9625 3 0.4425
1 2020-06-30 0.5200 2 0.3070
2 2020-03-31 0.2130 1 0.2130
3 2019-12-31 1.2700 4 1.3717
4 2019-09-30 -0.1017 3 0.0041
5 2019-06-30 -0.1058 2 -0.1079
6 2019-03-31 0.0021 1 0.0021
7 2018-12-31 0.0100 4 NaN
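If you would rather not convert end_date to datetime, a minimal sketch of the same idea follows. It assumes end_date stays a 'YYYYMMDD' string, so that month '03' marks a Q1 row; the name is_q1 is made up for illustration. Series.mask replaces the diff with dt_eps wherever the condition is true.

```python
import pandas as pd

df = pd.DataFrame({
    'end_date': ['20200930', '20200630', '20200331', '20191231',
                 '20190930', '20190630', '20190331', '20181231'],
    'dt_eps': [0.9625, 0.52, 0.213, 1.27, -0.1017, -0.1058, 0.0021, 0.01],
})

# Assumption: end_date is a 'YYYYMMDD' string, so month '03' means a Q1 quarter end.
is_q1 = df['end_date'].str[4:6].eq('03')

# diff(-1) subtracts the next (older) row; mask() swaps in dt_eps on Q1 rows.
df['q_dt_eps'] = df['dt_eps'].diff(-1).mask(is_q1, df['dt_eps'])
print(df)
```

Like the np.where version, this relies on the rows being sorted newest first with no missing quarters.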

Related

Pandas: Drop duplicates that appear within a time interval

We have a dataframe containing 'ID' and 'DAY' columns, which show when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates occurred at most 30 days apart. Please see the example below:
Current Dataset:
ID DAY
0 1 22.03.2020
1 1 18.04.2020
2 2 10.05.2020
3 2 13.01.2020
4 3 30.03.2020
5 3 31.03.2020
6 3 24.02.2021
Goal:
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...
You can compute the difference between successive dates per group and use it to form a mask that removes rows falling less than 30 days after the previous one:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
mask = (df
.sort_values(by=['ID', 'DAY'])
.groupby('ID')['DAY']
.diff().lt('30d')
.sort_index()
)
df[~mask]
NB. The potential drawback of this approach is that if the customer makes a new complaint within the 30 days, this restarts the threshold for the next complaint.
output:
ID DAY
0 1 2020-03-22
2 2 2020-05-10
3 2 2020-01-13
4 3 2020-03-30
6 3 2021-02-24
Thus another approach might be to resample the data per group into 30-day bins:
(df
.groupby('ID')
.resample('30d', on='DAY').first()
.dropna()
.convert_dtypes()
.reset_index(drop=True)
)
output:
ID DAY
0 1 2020-03-22
1 2 2020-01-13
2 2 2020-05-10
3 3 2020-03-30
4 3 2021-02-24
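If the 30-day window should be measured from the last complaint that was kept, rather than from the immediately preceding one, one option is a single pass per customer. The sketch below rests on an assumption about the intended rule (keep a row only when it falls more than 30 days after the last kept row for that ID); keep_spaced and the sample data are hypothetical and not part of the answers here.

```python
import numpy as np
import pandas as pd

def keep_spaced(days, min_gap):
    """Mask over dates sorted ascending: True where a date is more than
    min_gap after the most recently kept date."""
    keep = np.zeros(len(days), dtype=bool)
    last_kept = None
    for i, day in enumerate(days):
        if last_kept is None or day - last_kept > min_gap:
            keep[i] = True
            last_kept = day
    return keep

# Hypothetical data: customer 1 complains three times, 19 and then 21 days apart.
df = pd.DataFrame({'ID': [1, 1, 1, 2],
                   'DAY': ['01.01.2020', '20.01.2020', '10.02.2020', '05.03.2020']})
df['DAY'] = pd.to_datetime(df['DAY'], format='%d.%m.%Y')

keep = pd.Series(False, index=df.index)
for _, group in df.sort_values('DAY').groupby('ID'):
    keep.loc[group.index] = keep_spaced(group['DAY'].to_numpy(),
                                        np.timedelta64(30, 'D'))

print(df[keep])  # keeps 2020-01-01 and 2020-02-10 for ID 1, plus the ID 2 row
```

The inner loop runs in plain Python, so on millions of rows it may be slow, although it is still a single linear pass over the data.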
You can group by the ID column and diff the DAY column within each group:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
from datetime import timedelta
m = timedelta(days=30)
out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)
print(out)
ID DAY
0 1 2020-03-22
1 2 2020-05-10
2 2 2020-01-13
3 3 2020-03-30
4 3 2021-02-24
To convert to original date format, you can use dt.strftime
out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
print(out)
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021

Pandas groupby and rolling window

I'm trying to calculate the sum of one field over a specific period of time, after a groupby is applied.
My dataset looks like this:
Date Company Country Sold
01.01.2020 A BE 1
02.01.2020 A BE 0
03.01.2020 A BE 1
03.01.2020 A BE 1
04.01.2020 A BE 1
05.01.2020 B DE 1
06.01.2020 B DE 0
I would like to add a new column to each row that holds the sum of Sold (per "Company, Country" group) for the last 7 days, not including the current day:
Date Company Country Sold LastWeek_Count
01.01.2020 A BE 1 0
02.01.2020 A BE 0 1
03.01.2020 A BE 1 1
03.01.2020 A BE 1 1
04.01.2020 A BE 1 3
05.01.2020 B DE 1 0
06.01.2020 B DE 0 1
I tried the following, but it also includes the current date, and it gives different values for the same date, e.g. 03.01.2020:
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(7, on ='Date')['Sold'].sum().reset_index()
Is there a built-in function in pandas that I can use to perform these calculations?
You can use a .rolling window of 8 and then subtract the per-date sum (within each group) to effectively get the previous 7 days. Note that an integer window counts rows, not calendar days. For this sample data, we should also pass min_periods=1 (otherwise you will get NaN values; for your actual dataset, you will need to decide how to handle windows with fewer than 8 rows).
Then, from the .rolling window of 8, do another .groupby on the relevant columns, this time including Date as well, and take the max of the newly created LastWeek_Count column. You need the max because there are multiple records per day, and the max is the total aggregated amount per Date.
Then, create a series holding the grouped sum per Date. In the final step, subtract the sum by date from the rolling 8-row max. This is a workaround for getting the sum of the previous 7 days, as .rolling has no offset parameter:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(8, min_periods=1, on='Date')['Sold'].sum().reset_index()['Sold']
df['LastWeek_Count'] = df.groupby(['Company', 'Country', 'Date'])['LastWeek_Count'].transform('max')
s = df.groupby(['Company', 'Country', 'Date'])['Sold'].transform('sum')
df['LastWeek_Count'] = (df['LastWeek_Count']-s).astype(int)
Out[17]:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0
1 2020-01-02 A BE 0 1
2 2020-01-03 A BE 1 1
3 2020-01-03 A BE 1 1
4 2020-01-04 A BE 1 3
5 2020-01-05 B DE 1 0
6 2020-01-06 B DE 0 1
One way would be to first consolidate the Sold value of each group (['Date', 'Company', 'Country']) on a single line using a temporary DF.
After that, apply your .groupby with .rolling with an interval of 8 rows.
After calculating the sum, subtract the Sold value of each row from it and add the resulting column to the original DF with .merge
#convert Date column to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
#create a temporary DataFrame
df2 = df.groupby(['Date', 'Company', 'Country'])['Sold'].sum().reset_index()
#calc the lastweek
df2['LastWeek_Count'] = (df2.groupby(['Company', 'Country'])
.rolling(8, min_periods=1, on = 'Date')['Sold']
.sum().reset_index(drop=True)
)
#subtract the current 'Sold' from the rolling sum to exclude the current day
df2['LastWeek_Count'] = df2['LastWeek_Count'] - df2['Sold']
#add the new column to the original DF
df.merge(df2.drop(columns=['Sold']), on = ['Date', 'Company', 'Country'])
#output:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0.0
1 2020-01-02 A BE 0 1.0
2 2020-01-03 A BE 1 1.0
3 2020-01-03 A BE 1 1.0
4 2020-01-04 A BE 1 3.0
5 2020-01-05 B DE 1 0.0
6 2020-01-06 B DE 0 1.0

Pandas Lambda Function Format Month and Day

I have a DF "ltyc" that looks like this:
month day wind_speed
0 1 1 11.263604
1 1 2 11.971495
2 1 3 11.989080
3 1 4 12.558736
4 1 5 11.850899
And I apply a lambda function:
ltyc['date'] = pd.to_datetime(ltyc["month"], format='%m').apply(lambda dt: dt.replace(year=2020))
To get it to look like this:
month day wind_speed date
0 1 1 11.263604 2020-01-01
1 1 2 11.971495 2020-01-01
2 1 3 11.989080 2020-01-01
3 1 4 12.558736 2020-01-01
4 1 5 11.850899 2020-01-01
Except I need the days to change as well, and I cannot figure out how to write the lambda to do this. This is what I need:
month day wind_speed date
0 1 1 11.263604 2020-01-01
1 1 2 11.971495 2020-01-02
2 1 3 11.989080 2020-01-03
3 1 4 12.558736 2020-01-04
4 1 5 11.850899 2020-01-05
I have tried this:
ltyc['date'] = pd.to_datetime(ltyc["month"], format='%m%d').apply(lambda dt: dt.replace(year=2020))
and i get this error:
ValueError: time data '1' does not match format '%m%d' (match)
Thank you for your help, since I'm still trying to figure out lambda functions.
Create a Series with the value 2020 and the name year. Concatenate it with ['month', 'day'] and pass the result to pd.to_datetime. As long as you pass a DataFrame whose columns are named year, month and day (the column order does not matter), pd.to_datetime will assemble them into the appropriate datetime Series.
#Allolz suggestion:
ltyc['date'] = pd.to_datetime(ltyc[['day', 'month']].assign(year=2020))
Out[367]:
month day wind_speed date
0 1 1 11.263604 2020-01-01
1 1 2 11.971495 2020-01-02
2 1 3 11.989080 2020-01-03
3 1 4 12.558736 2020-01-04
4 1 5 11.850899 2020-01-05
Or you may use reindex to create the sub-dataframe to pass to pd.to_datetime
ltyc['date'] = pd.to_datetime(ltyc.reindex(['year','month','day'],
axis=1, fill_value=2020))
Original:
s = pd.Series([2020]*len(ltyc), name='year')
ltyc['date'] = pd.to_datetime(pd.concat([s, ltyc[['month','day']]], axis=1))
This is similar to a previous answer, but does not persist the 'helper' column with the year. In brief, we pass a data frame with three columns (year, month, day) to the to_datetime() function.
ltyc['date'] = pd.to_datetime(ltyc
.assign(year=2020)
.filter(['year', 'month', 'day'])
)
You could also use your method and add month and day together with .astype(str) and then add %d to the format. The problem with your lambda is that you only considered month, so this is how you would consider month and day.
ltyc['date'] = (pd.to_datetime(ltyc["month"].astype(str) + '-' + ltyc["day"].astype(str),
format='%m-%d')
.apply(lambda dt: dt.replace(year=2020)))
output:
month day wind_speed date
0 1 1 11.263604 2020-01-01
1 1 2 11.971495 2020-01-02
2 1 3 11.989080 2020-01-03
3 1 4 12.558736 2020-01-04
4 1 5 11.850899 2020-01-05
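One difference between the approaches is worth checking if the data includes 29 February. Parsing with format='%m-%d' fills in a default year of 1900 before the lambda replaces it, and 1900 is not a leap year, so that date should fail to parse. Assembling from year, month and day columns uses 2020 from the start. A small sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows that include a leap day.
ltyc = pd.DataFrame({'month': [2, 2, 3], 'day': [28, 29, 1]})

# Assembling from named columns uses year 2020 directly, so 29 February is valid.
ltyc['date'] = pd.to_datetime(ltyc[['month', 'day']].assign(year=2020))
print(ltyc)

# The string route parses against the default year 1900 first.
text = ltyc['month'].astype(str) + '-' + ltyc['day'].astype(str)
try:
    pd.to_datetime(text, format='%m-%d')
    parsed_ok = True
except ValueError:
    parsed_ok = False
print('string route parsed the leap day:', parsed_ok)
```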

Calculate the number of weekends (Saturdays and Sundays), between two dates

I have a data frame with two date columns, a start and end date. How can I find the number of weekend days between the start and end dates using pandas or Python datetimes?
I know that pandas has DatetimeIndex.weekday, which returns values 0 to 6 for each day of the week, starting with Monday.
# create a data-frame
import pandas as pd
df = pd.DataFrame({'start_date':['4/5/19','4/5/19','1/5/19','28/4/19'],
'end_date': ['4/5/19','5/5/19','4/5/19','5/5/19']})
# convert objects to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
# Trying to get the date index between dates as a prelim step but fails
pd.DatetimeIndex(df['end_date'] - df['start_date']).weekday
I'm expecting the result to be this: (weekend_count includes both start and end dates)
start_date end_date weekend_count
4/5/2019 4/5/2019 1
4/5/2019 5/5/2019 2
1/5/2019 4/5/2019 1
28/4/2019 5/5/2019 3
IIUC
df['New']=[pd.date_range(x,y).weekday.isin([5,6]).sum() for x , y in zip(df.start_date,df.end_date)]
df
start_date end_date New
0 2019-05-04 2019-05-04 1
1 2019-05-04 2019-05-05 2
2 2019-05-01 2019-05-04 1
3 2019-04-28 2019-05-05 3
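Try with np.busday_count. It counts business days in the half-open range [start, end), so shift the end date by one day to include it; otherwise an end date that falls on a weekday is counted as a weekend day:
import numpy as np
start = df.start_date.values.astype('datetime64[D]')
end = (df.end_date + pd.Timedelta(days=1)).values.astype('datetime64[D]')
df['weekend_count'] = (df.end_date - df.start_date).dt.days + 1 - np.busday_count(start, end)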
print(df)
start_date end_date weekend_count
0 2019-05-04 2019-05-04 1
1 2019-05-04 2019-05-05 2
2 2019-05-01 2019-05-04 1
3 2019-04-28 2019-05-05 3
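A variation, offered as a sketch: np.busday_count accepts a weekmask, so it can count Saturdays and Sundays directly instead of subtracting weekdays from the total. Because it counts over [start, end), the end date is shifted by one day to make the range inclusive. The sample rows are made up, and the last one (a single Monday) checks the case where the end date is a weekday.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'start_date': pd.to_datetime(['2019-05-04', '2019-05-01', '2019-05-06']),
    'end_date': pd.to_datetime(['2019-05-05', '2019-05-04', '2019-05-06']),
})

# busday_count expects day-resolution dates and counts over [start, end).
start = df['start_date'].to_numpy().astype('datetime64[D]')
end_exclusive = (df['end_date'] + pd.Timedelta(days=1)).to_numpy().astype('datetime64[D]')

# weekmask='Sat Sun' makes only Saturdays and Sundays count as valid days.
df['weekend_count'] = np.busday_count(start, end_exclusive, weekmask='Sat Sun')

# Cross-check against the explicit date_range approach.
check = [pd.date_range(s, e).weekday.isin([5, 6]).sum()
         for s, e in zip(df['start_date'], df['end_date'])]
print(df)
print(check)
```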

week number from given date in pandas

I have a data frame with two columns Date and value.
I want to add new column named week_number that basically is how many weeks back from the given date
import pandas as pd
df = pd.DataFrame(columns=['Date','value'])
df['Date'] = [ '04-02-2019','03-02-2019','28-01-2019','20-01-2019']
df['value'] = [10,20,30,40]
df
Date value
0 04-02-2019 10
1 03-02-2019 20
2 28-01-2019 30
3 20-01-2019 40
suppose given date is 05-02-2019.
Then I need to add a column week_number in a way such that how many weeks back the Date column date is from given date.
The output should be
Date value week_number
0 04-02-2019 10 1
1 03-02-2019 20 1
2 28-01-2019 30 2
3 20-01-2019 40 3
how can I do this in pandas
First convert the column to datetimes with to_datetime and dayfirst=True, then subtract it from the given date with rsub, convert the timedeltas to days, floor-divide by 7 and add 1:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['week_number'] = df['Date'].rsub(pd.Timestamp('2019-02-05')).dt.days // 7 + 1
#alternative
#df['week_number'] = (pd.Timestamp('2019-02-05') - df['Date']).dt.days // 7 + 1
print (df)
Date value week_number
0 2019-02-04 10 1
1 2019-02-03 20 1
2 2019-01-28 30 2
3 2019-01-20 40 3
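The same calculation can be wrapped in a small helper so that the reference date becomes a parameter. This is only a sketch, and weeks_back is a hypothetical name. Floor division by 7 turns an offset of 0 to 6 days into week 1, 7 to 13 days into week 2, and so on.

```python
import pandas as pd

def weeks_back(dates, reference):
    """1-based count of 7-day blocks between each date and the reference date."""
    days = (pd.Timestamp(reference) - dates).dt.days
    return days // 7 + 1

df = pd.DataFrame({'Date': ['04-02-2019', '03-02-2019', '28-01-2019', '20-01-2019'],
                   'value': [10, 20, 30, 40]})
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df['week_number'] = weeks_back(df['Date'], '2019-02-05')
print(df)
```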