I have a dataframe with two date columns (format: YYYY-MM-DD). I want to create one row for each year between those two dates. The rows would be identical with a new column which specifies the year. For example, if the dates are 2018-01-01 and 2020-01-01 then there would be three rows with same data and a new column with values 2018, 2019, and 2020.
You can use a custom function to compute the range of years, then explode the resulting column:
import pandas as pd

# ensure both columns are datetime
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
# build a per-row range of years (inclusive of both endpoints)
year_range = lambda row: range(row['date1'].year, row['date2'].year + 1)
df = df.assign(year=df.apply(year_range, axis=1)).explode('year', ignore_index=True)
Output:
>>> df
date1 date2 year
0 2018-01-01 2020-01-01 2018
1 2018-01-01 2020-01-01 2019
2 2018-01-01 2020-01-01 2020
This should work for you:
import pandas

# some sample data
df = pandas.DataFrame(data={
    'foo': ['bar', 'baz'],
    'date1': ['2018-01-01', '2022-01-01'],
    'date2': ['2020-01-01', '2017-01-01']
})
# cast date columns to datetime
for col in ['date1', 'date2']:
    df[col] = pandas.to_datetime(df[col])
# reset index to ensure that selection by length of index works
df = df.reset_index(drop=True)
# compute the range of years between the two dates, then iterate through the
# resulting series and append a new row with the original data plus the year
for i, years in df.apply(
    lambda x: range(
        min(x.date1, x.date2).year,
        max(x.date1, x.date2).year + 1
    ),
    axis='columns'
).items():  # .iteritems() was removed in pandas 2.0; .items() is the replacement
    for year in years:
        new_index = len(df.index)
        df.loc[new_index] = df.loc[i].values
        df.loc[new_index, 'year'] = int(year)
Output:
>>> df
foo date1 date2 year
0 bar 2018-01-01 2020-01-01 NaN
1 baz 2022-01-01 2017-01-01 NaN
2 bar 2018-01-01 2020-01-01 2018.0
3 bar 2018-01-01 2020-01-01 2019.0
4 bar 2018-01-01 2020-01-01 2020.0
5 baz 2022-01-01 2017-01-01 2017.0
6 baz 2022-01-01 2017-01-01 2018.0
7 baz 2022-01-01 2017-01-01 2019.0
8 baz 2022-01-01 2017-01-01 2020.0
9 baz 2022-01-01 2017-01-01 2021.0
10 baz 2022-01-01 2017-01-01 2022.0
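For reference, the loop above can also be written more compactly in the explode style of the first answer, handling reversed dates with min/max and producing only the expanded rows (no leftover NaN rows). This is a sketch assuming the sample columns used above:

```python
import pandas as pd

df = pd.DataFrame({
    'foo': ['bar', 'baz'],
    'date1': pd.to_datetime(['2018-01-01', '2022-01-01']),
    'date2': pd.to_datetime(['2020-01-01', '2017-01-01'])})

# earliest/latest year per row, regardless of which column is larger
lo = df[['date1', 'date2']].min(axis=1).dt.year
hi = df[['date1', 'date2']].max(axis=1).dt.year

# one output row per year in the inclusive range
out = (df.assign(year=[range(a, b + 1) for a, b in zip(lo, hi)])
         .explode('year', ignore_index=True))
```

This yields 3 rows for 'bar' (2018-2020) and 6 rows for 'baz' (2017-2022).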
Related
df
end_date dt_eps
0 20200930 0.9625
1 20200630 0.5200
2 20200331 0.2130
3 20191231 1.2700
4 20190930 -0.1017
5 20190630 -0.1058
6 20190331 0.0021
7 20181231 0.0100
Note: the value of end_date is always the last day of a quarter, the rows are sorted from most recent to oldest, and the column type is string.
Goal
create a q_dt_eps column: the difference in dt_eps between each row and the next (older) quarter, except that a Q1 row keeps its own dt_eps. For example, q_dt_eps for 20200930 is 0.4425 (0.9625 - 0.5200), while for 20200331 (a Q1 row) it is simply 0.2130.
Try
df['q_dt_eps']=df['dt_eps'].diff(periods=-1)
But it could not return the same value of dt_eps when the quarter is Q1.
You can convert the date to datetime, extract the quarter, and then create the new column using np.where: keep the original value when the quarter equals 1, otherwise use the shifted difference.
import numpy as np
import pandas as pd
df = pd.DataFrame({'end_date': ['20200930', '20200630', '20200331',
                                '20191231', '20190930', '20190630',
                                '20190331', '20181231'],
                   'dt_eps': [0.9625, 0.52, 0.213, 1.27,
                              -.1017, -.1058, .0021, .01]})
df['end_date'] = pd.to_datetime(df['end_date'], format='%Y%m%d')
df['qtr'] = df['end_date'].dt.quarter
df['q_dt_eps'] = np.where(df['qtr']==1, df['dt_eps'], df['dt_eps'].diff(-1))
df
end_date dt_eps qtr q_dt_eps
0 2020-09-30 0.9625 3 0.4425
1 2020-06-30 0.5200 2 0.3070
2 2020-03-31 0.2130 1 0.2130
3 2019-12-31 1.2700 4 1.3717
4 2019-09-30 -0.1017 3 0.0041
5 2019-06-30 -0.1058 2 -0.1079
6 2019-03-31 0.0021 1 0.0021
7 2018-12-31 0.0100 4 NaN
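Equivalently, the same conditional logic can be expressed with Series.where, dropping the NumPy dependency entirely. A sketch reusing the sample data from the answer above:

```python
import pandas as pd

df = pd.DataFrame({'end_date': ['20200930', '20200630', '20200331',
                                '20191231', '20190930', '20190630',
                                '20190331', '20181231'],
                   'dt_eps': [0.9625, 0.52, 0.213, 1.27,
                              -.1017, -.1058, .0021, .01]})
qtr = pd.to_datetime(df['end_date'], format='%Y%m%d').dt.quarter

# keep diff(-1) everywhere except Q1 rows, which keep their own dt_eps
df['q_dt_eps'] = df['dt_eps'].diff(-1).where(qtr != 1, df['dt_eps'])
```

The last row is NaN in both variants, since there is no older quarter to diff against.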
I have a data frame with two date columns, a start date and an end date. How can I find the number of weekend days between the start and end dates using pandas or Python datetimes?
I know that pandas has DatetimeIndex, which returns values 0 to 6 for each day of the week, starting from Monday.
# create a data-frame
import pandas as pd
df = pd.DataFrame({'start_date': ['4/5/19', '4/5/19', '1/5/19', '28/4/19'],
                   'end_date': ['4/5/19', '5/5/19', '4/5/19', '5/5/19']})
# convert objects to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
# Trying to get the date index between dates as a prelim step but fails
pd.DatetimeIndex(df['end_date'] - df['start_date']).weekday
I'm expecting the result to be this (weekend_count includes both start and end dates):
start_date end_date weekend_count
4/5/2019 4/5/2019 1
4/5/2019 5/5/2019 2
1/5/2019 4/5/2019 1
28/4/2019 5/5/2019 3
IIUC
df['New'] = [pd.date_range(x, y).weekday.isin([5, 6]).sum()
             for x, y in zip(df.start_date, df.end_date)]
df
start_date end_date New
0 2019-05-04 2019-05-04 1
1 2019-05-04 2019-05-05 2
2 2019-05-01 2019-05-04 1
3 2019-04-28 2019-05-05 3
Try with:
import numpy as np

df['weekend_count'] = ((df.end_date - df.start_date).dt.days + 1) - np.busday_count(
    df.start_date.dt.date, df.end_date.dt.date)
print(df)
start_date end_date weekend_count
0 2019-05-04 2019-05-04 1
1 2019-05-04 2019-05-05 2
2 2019-05-01 2019-05-04 1
3 2019-04-28 2019-05-05 3
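Both answers count weekend days (Saturdays and Sundays), not weekend pairs. As a sanity check, here is a sketch comparing the two approaches on the last date pair from the question:

```python
import numpy as np
import pandas as pd

start, end = pd.Timestamp('2019-04-28'), pd.Timestamp('2019-05-05')

# approach 1: enumerate the days and count Saturdays (5) and Sundays (6)
by_enumeration = pd.date_range(start, end).weekday.isin([5, 6]).sum()

# approach 2: total days (inclusive) minus business days;
# np.busday_count uses a half-open [start, end) interval
by_busday = (end - start).days + 1 - np.busday_count(start.date(), end.date())
```

Both give 3 here (Sunday the 28th, then Saturday the 4th and Sunday the 5th).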
I want to select all the previous 6 months records for a customer whenever a particular transaction is done by the customer.
Data looks like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
1 01/01/2017 8 Y
2 10/01/2018 6 Moved
2 02/01/2018 12 Z
Here, I want to find the rows where Description is "Moved" and then select all transactions from the preceding 6 months for every Cust_ID.
Output should look like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
2 10/01/2018 6 Moved
I want to do this in Python. Please help.
The idea is to create a Series of the Moved datetimes shifted back by a 6-month offset, and then keep only the rows whose Transaction_Date is later than the mapped offset:
EDIT: Get all datetimes for each Moved values:
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
df = df.sort_values(['Cust_ID','Transaction_Date'])
df['g'] = df['Description'].iloc[::-1].eq('Moved').cumsum()
s = (df[df['Description'].eq('Moved')]
       .set_index(['Cust_ID','g'])['Transaction_Date']
     - pd.DateOffset(months=6))  # pd.offsets.MonthOffset was removed in newer pandas
mask = df.join(s.rename('a'), on=['Cust_ID','g'])['a'] < df['Transaction_Date']
df1 = df[mask].drop('g', axis=1)
EDIT1: Use only the earliest Moved datetime per group; any later Moved rows in a group are filtered out:
print (df)
Cust_ID Transaction_Date Amount Description
0 1 10/01/2017 12 X
1 1 01/23/2017 15 Moved
2 1 03/01/2017 8 Y
3 1 08/08/2017 12 Moved
4 2 10/01/2018 6 Moved
5 2 02/01/2018 12 Z
#convert to datetimes
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
#mask for filter Moved rows
mask = df['Description'].eq('Moved')
#filter and sorting this rows
df1 = df[mask].sort_values(['Cust_ID','Transaction_Date'])
print (df1)
Cust_ID Transaction_Date Amount Description
1 1 2017-01-23 15 Moved
3 1 2017-08-08 12 Moved
4 2 2018-10-01 6 Moved
#get duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date'] - pd.DateOffset(months=6)
print (s)
Cust_ID
1 2016-07-23
2 2018-04-01
Name: Transaction_Date, dtype: datetime64[ns]
#create mask for filter out another Moved (get only first for each group)
m2 = ~mask.reindex(df.index, fill_value=False)
df1 = df[(df['Cust_ID'].map(s) < df['Transaction_Date']) & m2]
print (df1)
Cust_ID Transaction_Date Amount Description
0 1 2017-10-01 12 X
1 1 2017-01-23 15 Moved
2 1 2017-03-01 8 Y
4 2 2018-10-01 6 Moved
EDIT2:
#get last duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID', keep='last')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date']
print (s)
Cust_ID
1 2017-08-08
2 2018-10-01
Name: Transaction_Date, dtype: datetime64[ns]
m2 = ~mask.reindex(df.index, fill_value=False)
#filter by between Moved and next 6 months
df3 = df[df['Transaction_Date'].between(df['Cust_ID'].map(s),
                                        df['Cust_ID'].map(s + pd.DateOffset(months=6))) & m2]
print (df3)
Cust_ID Transaction_Date Amount Description
3 1 2017-08-08 12 Moved
0 1 2017-10-01 12 X
4 2 2018-10-01 6 Moved
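For the simple case in the original question (exactly one Moved row per customer), the same mapping idea can be sketched more briefly, using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Cust_ID': [1, 1, 1, 2, 2],
    'Transaction_Date': pd.to_datetime(['2017-08-01', '2017-03-01',
                                        '2017-01-01', '2018-10-01',
                                        '2018-02-01']),
    'Amount': [12, 15, 8, 6, 12],
    'Description': ['Moved', 'X', 'Y', 'Moved', 'Z']})

# date of the single 'Moved' transaction per customer
moved = df[df['Description'].eq('Moved')].set_index('Cust_ID')['Transaction_Date']

# keep rows within the 6 months up to (and including) the Moved date
cutoff = df['Cust_ID'].map(moved)
out = df[df['Transaction_Date'].between(cutoff - pd.DateOffset(months=6), cutoff)]
```

This reproduces the expected output from the question: the Moved rows plus the X transaction for customer 1; the Y and Z rows fall outside the 6-month window.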
I have a pandas df of irrigation demand data that has daily values from 1900 to 2099. I resampled the df to get the monthly average, then resampled and backfilled the monthly averages at a daily frequency, so that the average daily value for each month is used as the daily value for every day of that month.
My problem is that the first month was not backfilled: there is only a value for the last day of that month (1900-01-31).
Here is my code, any suggestions on what I am doing wrong?
I2 = pd.DataFrame(IrrigDemand, columns = ['Year', 'Month', 'Day', 'IrrigArea_1', 'IrrigArea_2','IrrigArea_3','IrrigArea_4','IrrigArea_5'],dtype=float)
# set dates as index
I2.set_index('Year')
# make a column of dates in datetime format
dates = pd.to_datetime(I2[['Year', 'Month', 'Day']])
# add the column of dates to df
I2['dates'] = pd.Series(dates, index=I2.index)
# set dates as index of df
I2.set_index('dates')
# delete the three string columns replaced with datetime values
I2.drop(['Year', 'Month', 'Day'],inplace=True,axis=1)
# calculate the average daily value for each month
I2_monthly_average = I2.reset_index().set_index('dates').resample('m').mean()
I2_daily_average = I2_monthly_average.resample('d').bfill()
The problem is that the first day of the first month is not created by resample('m'), so it is necessary to add it manually:
# make a column of dates in datetime format and assign to index
I2.index = pd.to_datetime(I2[['Year', 'Month', 'Day']])
# delete the three string columns replaced with datetime values
I2.drop(['Year', 'Month', 'Day'],inplace=True,axis=1)
# calculate the average daily value for each month
I2_monthly_average = I2.resample('m').mean()
first_day = I2_monthly_average.index[0].replace(day = 1)
I2_monthly_average.loc[first_day] = I2_monthly_average.iloc[0]
I2_daily_average = I2_monthly_average.resample('d').bfill()
Sample:
rng = pd.date_range('2017-04-03', periods=10, freq='20D')
I2 = pd.DataFrame({'a': range(10)}, index=rng)
print (I2)
a
2017-04-03 0
2017-04-23 1
2017-05-13 2
2017-06-02 3
2017-06-22 4
2017-07-12 5
2017-08-01 6
2017-08-21 7
2017-09-10 8
2017-09-30 9
I2_monthly_average = I2.resample('m').mean()
print (I2_monthly_average)
a
2017-04-30 0.5
2017-05-31 2.0
2017-06-30 3.5
2017-07-31 5.0
2017-08-31 6.5
2017-09-30 8.5
first_day = I2_monthly_average.index[0].replace(day = 1)
I2_monthly_average.loc[first_day] = I2_monthly_average.iloc[0]
print (I2_monthly_average)
a
2017-04-30 0.5
2017-05-31 2.0
2017-06-30 3.5
2017-07-31 5.0
2017-08-31 6.5
2017-09-30 8.5
2017-04-01 0.5 <- added first day
I2_daily_average = I2_monthly_average.resample('d').bfill()
print (I2_daily_average.head())
a
2017-04-01 0.5
2017-04-02 0.5
2017-04-03 0.5
2017-04-04 0.5
2017-04-05 0.5
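An alternative that avoids inserting the first day manually is to reindex the monthly means onto a full daily index and backfill from there. A sketch using the same sample data:

```python
import pandas as pd

rng = pd.date_range('2017-04-03', periods=10, freq='20D')
I2 = pd.DataFrame({'a': range(10)}, index=rng)

monthly = I2.resample('M').mean()

# daily index starting on the first day of the first month
days = pd.date_range(I2.index.min().replace(day=1), monthly.index.max(), freq='D')
daily = monthly.reindex(days, method='bfill')
```

Because each month-end label is backfilled over the days before it, 2017-04-01 through 2017-04-30 all receive the April mean of 0.5 without any manual insertion.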
I have two DataFrames in pandas. One of them has data every month, the other one has data every year. I need to do some computation where the yearly value is added to the monthly value.
Something like this:
df1, monthly:
2013-01-01 1
2013-02-01 1
...
2014-01-01 1
2014-02-01 1
...
2015-01-01 1
df2, yearly:
2013-01-01 1
2014-01-01 2
2015-01-01 3
And I want to produce something like this:
2013-01-01 (1+1) = 2
2013-02-01 (1+1) = 2
...
2014-01-01 (1+2) = 3
2014-02-01 (1+2) = 3
...
2015-01-01 (1+3) = 4
Where the value of the monthly data is added to the value of the yearly data depending on the year (first value in the parenthesis is the monthly data, second value is the yearly data).
Assuming your "month" column is called date in the DataFrame df, you can obtain the year by using the dt accessor:
pd.to_datetime(df.date).dt.year
Add a column like that to your month DataFrame, and call it year.
Now do the same to the year DataFrame.
Do a merge on the month and year DataFrames, specifying how=left.
In the resulting DataFrame, you will have both columns. Now just add them.
Example
month_df = pd.DataFrame({
    'date': ['2013-01-01', '2013-02-01', '2014-02-01'],
    'amount': [1, 2, 3]})
year_df = pd.DataFrame({
    'date': ['2013-01-01', '2014-02-01', '2015-01-01'],
    'amount': [7, 8, 9]})
month_df['year'] = pd.to_datetime(month_df.date).dt.year
year_df['year'] = pd.to_datetime(year_df.date).dt.year
>>> pd.merge(month_df, year_df, on='year', how='left')
amount_x date_x year amount_y date_y
0 1 2013-01-01 2013 7 2013-01-01
1 2 2013-02-01 2013 7 2013-01-01
2 3 2014-02-01 2014 8 2014-02-01
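The final "just add them" step might be sketched like this; the suffixes argument renames the pandas-default _x/_y columns for clarity, and total is a hypothetical name for the result column:

```python
import pandas as pd

month_df = pd.DataFrame({
    'date': ['2013-01-01', '2013-02-01', '2014-02-01'],
    'amount': [1, 2, 3]})
year_df = pd.DataFrame({
    'date': ['2013-01-01', '2014-02-01', '2015-01-01'],
    'amount': [7, 8, 9]})
month_df['year'] = pd.to_datetime(month_df.date).dt.year
year_df['year'] = pd.to_datetime(year_df.date).dt.year

# left merge on year, then add the monthly and yearly amounts
merged = month_df.merge(year_df, on='year', how='left',
                        suffixes=('_month', '_year'))
merged['total'] = merged['amount_month'] + merged['amount_year']
```

With this sample data the totals are 1+7=8, 2+7=9, and 3+8=11.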