Subtract dates row by row if the ids are the same - pandas

I'd like to subtract dates only if the next row's id is the same. I'm able to subtract the dates, but I'm stuck on creating the condition that checks whether the next row has the same id.
import pandas as pd
d = {'date':['2021-01', '2020-01', '2020-05', '2021-01'], 'id':['a', 'a', 'b', 'b']}
df = pd.DataFrame(data=d)
date id
2021-01 a
2020-01 a
2020-05 b
2021-01 b
My code
df = df.sort_values(by=['id', 'date'])
df['date_diff'] = pd.to_datetime(df['date']) - pd.to_datetime(df['date'].shift())
result
date id date_diff
2020-01 a NaT
2021-01 a 366 days
2020-05 b -245 days
2021-01 b 245 days
In the expected result, the dates should only be subtracted when the ids are the same, so the first row of each id should be NaT.

Chain with groupby
df['date'] = pd.to_datetime(df['date'])
df['date_diff'] = df.groupby('id')['date'].diff()
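
As a minimal, self-contained sketch of the groupby approach on the question's sample data (the explicit format="%Y-%m" and the reset_index call are assumptions added for illustration, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2021-01", "2020-01", "2020-05", "2021-01"],
    "id": ["a", "a", "b", "b"],
})

# Parse the year-month strings; each becomes the first day of that month.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m")

# Sort so that consecutive rows within an id are in chronological order.
df = df.sort_values(["id", "date"]).reset_index(drop=True)

# diff() is computed inside each id group, so the first row of every id is NaT.
df["date_diff"] = df.groupby("id")["date"].diff()
print(df)
```

Because the difference never crosses a group boundary, the -245 days value from the shift-based attempt does not appear.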


Related

Selecting the most recent and the 6th most recent months from a dataframe

I have a dataframe with 24 months of dates. How do I create a new dataframe that only includes the most recent month in the dataframe and the 6th (or nth) most recent month?
You can test for equality of year and month for current date or current date minus 6 months.
import datetime as dt
import pandas as pd
df = pd.DataFrame({"Date":pd.date_range(dt.date(2019,9,1), dt.date(2021,9,1), freq="M")})
t = pd.to_datetime("today")
td = t - pd.DateOffset(months=6)  # calendar-aware; subtracting 365//2 days can land in the wrong month
mask = (df.Date.dt.year.eq(t.year) & df.Date.dt.month.eq(t.month)) | (df.Date.dt.year.eq(td.year) & df.Date.dt.month.eq(td.month))
df2 = df[mask]
print(df2)
output
Date
11 2020-08-31
17 2021-02-28
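
Since the question asks about the most recent month in the dataframe rather than today's month, a variant that anchors on the data itself may be closer to the intent. The sketch below assumes month-start dates and a hypothetical parameter n for how many months to step back; it compares monthly periods instead of separate year and month fields:

```python
import pandas as pd

# Assumption: 24 month-start dates, September 2019 through August 2021.
df = pd.DataFrame({"Date": pd.date_range("2019-09-01", periods=24, freq="MS")})

n = 6  # how many months back from the latest month (hypothetical parameter)

# Convert each date to a monthly period so that year and month compare together.
months = df["Date"].dt.to_period("M")
latest = months.max()

# Keep rows in the latest month or in the month n months before it.
mask = (months == latest) | (months == latest - n)
df2 = df[mask]
print(df2)
```

This does not depend on the day the code is run, so the result stays the same for the same data.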

Pandas groupby and rolling window

I'm trying to calculate the sum of one field over a specific period of time, after a grouping function is applied.
My dataset looks like this:
Date Company Country Sold
01.01.2020 A BE 1
02.01.2020 A BE 0
03.01.2020 A BE 1
03.01.2020 A BE 1
04.01.2020 A BE 1
05.01.2020 B DE 1
06.01.2020 B DE 0
I would like to add a new column that, for each row, holds the sum of Sold (per group of "Company, Country") over the last 7 days, not including the current day:
Date Company Country Sold LastWeek_Count
01.01.2020 A BE 1 0
02.01.2020 A BE 0 1
03.01.2020 A BE 1 1
03.01.2020 A BE 1 1
04.01.2020 A BE 1 3
05.01.2020 B DE 1 0
06.01.2020 B DE 0 1
I tried the following, but it also includes the current date, and it gives different values for the same date, e.g. 03.01.2020:
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(7, on ='Date')['Sold'].sum().reset_index()
Is there a built-in function in pandas that I can use to perform these calculations?
You can use a .rolling window of 8 rows and then subtract the current date's total (per group) to effectively get the sum for the previous 7 days. For this sample data, we should also pass min_periods=1 (otherwise you will get NaN values; for your actual dataset, you will need to decide what to do with windows shorter than 8). Note that an integer window counts rows, not calendar days, so the result only approximates "the last 7 days" if dates are missing or duplicated within a group.
Then, from the .rolling window of 8, do another .groupby on the relevant columns, this time also including Date, and take the max value of the newly created LastWeek_Count column. You need the max because there can be multiple records per day; the max is the rolling total once all of that date's rows have been counted.
Then, create a series that holds the grouped sum of Sold per Date. In the final step, subtract that per-date sum from the rolling 8-row max. This is a workaround for getting the sum of the previous 7 days without the current day:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(8, min_periods=1, on='Date')['Sold'].sum().reset_index()['Sold']
df['LastWeek_Count'] = df.groupby(['Company', 'Country', 'Date'])['LastWeek_Count'].transform('max')
s = df.groupby(['Company', 'Country', 'Date'])['Sold'].transform('sum')
df['LastWeek_Count'] = (df['LastWeek_Count']-s).astype(int)
Out[17]:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0
1 2020-01-02 A BE 0 1
2 2020-01-03 A BE 1 1
3 2020-01-03 A BE 1 1
4 2020-01-04 A BE 1 3
5 2020-01-05 B DE 1 0
6 2020-01-06 B DE 0 1
One way would be to first consolidate the Sold values of each group (['Date', 'Company', 'Country']) into a single row using a temporary DataFrame.
After that, apply your .groupby with .rolling over a window of 8 rows.
After calculating the sum, subtract each row's Sold value from it and add the resulting column to the original DataFrame with .merge:
#convert Date column to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
#create a temporary DataFrame
df2 = df.groupby(['Date', 'Company', 'Country'])['Sold'].sum().reset_index()
#compute the rolling 8-row sum per group (current row included)
df2['LastWeek_Count'] = (df2.groupby(['Company', 'Country'])
                         .rolling(8, min_periods=1, on='Date')['Sold']
                         .sum().reset_index(drop=True)
                        )
#subtract the current 'Sold' so that only the previous rows remain
df2['LastWeek_Count'] = df2['LastWeek_Count'] - df2['Sold']
#add the new column to the original DF
df.merge(df2.drop(columns=['Sold']), on=['Date', 'Company', 'Country'])
#output:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0.0
1 2020-01-02 A BE 0 1.0
2 2020-01-03 A BE 1 1.0
3 2020-01-03 A BE 1 1.0
4 2020-01-04 A BE 1 3.0
5 2020-01-05 B DE 1 0.0
6 2020-01-06 B DE 0 1.0
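
Both answers above use a window of 8 rows, which matches 7 calendar days only when each group has exactly one row per day. As a hedged alternative sketch, a time-based window with closed='left' excludes the current day directly. The sample frame is rebuilt from the question; the names daily, prev_week and result are introduced here for illustration only:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01.01.2020", "02.01.2020", "03.01.2020", "03.01.2020",
             "04.01.2020", "05.01.2020", "06.01.2020"],
    "Company": ["A", "A", "A", "A", "A", "B", "B"],
    "Country": ["BE", "BE", "BE", "BE", "BE", "DE", "DE"],
    "Sold": [1, 0, 1, 1, 1, 1, 0],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")

keys = ["Company", "Country"]

# One row per group and day, so duplicated dates share a single total.
daily = df.groupby(keys + ["Date"])["Sold"].sum().reset_index()

# "7D" is a calendar window; closed="left" covers [t - 7 days, t),
# i.e. the seven previous days without the current one.
prev_week = (
    daily.set_index("Date")
    .groupby(keys)["Sold"]
    .rolling("7D", closed="left")
    .sum()
    .fillna(0)          # an empty window (first day of a group) yields NaN
    .astype(int)
    .rename("LastWeek_Count")
    .reset_index()
)

# Attach the per-day result to every original row.
result = df.merge(prev_week, on=keys + ["Date"], how="left")
print(result)
```

On the sample data this reproduces the expected LastWeek_Count column, and it should keep counting calendar days even if some dates are missing from a group.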

How to add month column in dataframe based on dates in data?

I want to categorize the data by a month column, e.g.
date Month
2009-05-01 ==> May
I want to check outcomes by month.
In this table I am excluding years and only want to keep the months.
This is simple when using pd.Series.dt.month_name (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.month_name.html):
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2000-01-01', '2010-01-01', freq='1M')
})
df['month'] = df.date.dt.month_name()
df.head()
Output
date month
0 2000-01-31 January
1 2000-02-29 February
2 2000-03-31 March
3 2000-04-30 April
4 2000-05-31 May
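
To then check outcomes per month regardless of year, one option is to group on the month. The sketch below assumes a hypothetical value column standing in for the outcome; grouping on the month number alongside the name keeps the months in calendar order rather than alphabetical order:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2009-05-01", "2009-06-10", "2010-05-03", "2009-05-15"]),
    "value": [1, 4, 3, 2],  # hypothetical outcome column
})

df["month"] = df["date"].dt.month_name()

# The month number is used only for ordering; the name is kept for readability.
month_no = df["date"].dt.month.rename("month_no")
monthly = df.groupby([month_no, "month"])["value"].sum()
print(monthly)
```

Rows from May 2009 and May 2010 end up in the same group because the year is ignored.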

week number from given date in pandas

I have a data frame with two columns Date and value.
I want to add a new column named week_number that holds how many weeks back each date is from a given date.
import pandas as pd
df = pd.DataFrame(columns=['Date','value'])
df['Date'] = [ '04-02-2019','03-02-2019','28-01-2019','20-01-2019']
df['value'] = [10,20,30,40]
df
Date value
0 04-02-2019 10
1 03-02-2019 20
2 28-01-2019 30
3 20-01-2019 40
Suppose the given date is 05-02-2019.
Then I need to add a week_number column that shows how many weeks back the date in the Date column is from the given date.
The output should be
Date value week_number
0 04-02-2019 10 1
1 03-02-2019 20 1
2 28-01-2019 30 2
3 20-01-2019 40 3
How can I do this in pandas?
First convert the column to datetimes with to_datetime and dayfirst=True, then subtract it from the given date with rsub, convert the timedeltas to days, floor-divide by 7 and add 1:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['week_number'] = df['Date'].rsub(pd.Timestamp('2019-02-05')).dt.days // 7 + 1
#alternative
#df['week_number'] = (pd.Timestamp('2019-02-05') - df['Date']).dt.days // 7 + 1
print (df)
Date value week_number
0 2019-02-04 10 1
1 2019-02-03 20 1
2 2019-01-28 30 2
3 2019-01-20 40 3

How to add a yearly amount to daily data in Pandas

I have two DataFrames in pandas. One of them has data every month, the other one has data every year. I need to do some computation where the yearly value is added to the monthly value.
Something like this:
df1, monthly:
2013-01-01 1
2013-02-01 1
...
2014-01-01 1
2014-02-01 1
...
2015-01-01 1
df2, yearly:
2013-01-01 1
2014-01-01 2
2015-01-01 3
And I want to produce something like this:
2013-01-01 (1+1) = 2
2013-02-01 (1+1) = 2
...
2014-01-01 (1+2) = 3
2014-02-01 (1+2) = 3
...
2015-01-01 (1+3) = 4
Where the value of the monthly data is added to the value of the yearly data depending on the year (first value in the parenthesis is the monthly data, second value is the yearly data).
Assuming your "month" column is called date in the DataFrame df, you can obtain the year by using the .dt accessor:
pd.to_datetime(df.date).dt.year
Add a column like that to your month DataFrame, and call it year.
Now do the same to the year DataFrame.
Do a merge on the month and year DataFrames, specifying how='left'.
In the resulting DataFrame, you will have both amount columns. Now just add them.
Example
month_df = pd.DataFrame({
'date': ['2013-01-01', '2013-02-01', '2014-02-01'],
'amount': [1, 2, 3]})
year_df = pd.DataFrame({
'date': ['2013-01-01', '2014-02-01', '2015-01-01'],
'amount': [7, 8, 9]})
month_df['year'] = pd.to_datetime(month_df.date).dt.year
year_df['year'] = pd.to_datetime(year_df.date).dt.year
pd.merge(
    month_df,
    year_df,
    left_on='year',
    right_on='year',
    how='left')
amount_x date_x year amount_y date_y
0 1 2013-01-01 2013 7 2013-01-01
1 2 2013-02-01 2013 7 2013-01-01
2 3 2014-02-01 2014 8 2014-02-01
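
The final "just add them" step is not shown above. A minimal sketch of it, reusing the example frames; the suffixes and the total column name are illustrative choices, not part of the original answer:

```python
import pandas as pd

month_df = pd.DataFrame({
    "date": ["2013-01-01", "2013-02-01", "2014-02-01"],
    "amount": [1, 2, 3],
})
year_df = pd.DataFrame({
    "date": ["2013-01-01", "2014-02-01", "2015-01-01"],
    "amount": [7, 8, 9],
})

# Derive the join key from the dates in both frames.
month_df["year"] = pd.to_datetime(month_df["date"]).dt.year
year_df["year"] = pd.to_datetime(year_df["date"]).dt.year

# Bring in only the yearly amount; the suffixes make the two amounts easy to tell apart.
merged = month_df.merge(
    year_df[["year", "amount"]],
    on="year",
    how="left",
    suffixes=("_month", "_year"),
)

# Monthly value plus the value of the matching year.
merged["total"] = merged["amount_month"] + merged["amount_year"]
print(merged)
```

With how='left', a month whose year is missing from the yearly frame would get NaN in amount_year and therefore in total, so that case may need a fillna depending on the intended result.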