Backfilling a pandas dataframe missed the first month

I have a pandas df of irrigation demand data with daily values from 1900 to 2099. I resampled the df to get the monthly average, then resampled the monthly averages to a daily frequency and backfilled them, so that the average daily value for each month is used as the value for every day of that month.
My problem is that the first month was not backfilled: it only has a value for its last day (1900-01-31).
Here is my code. Any suggestions on what I am doing wrong?
I2 = pd.DataFrame(IrrigDemand, columns = ['Year', 'Month', 'Day', 'IrrigArea_1', 'IrrigArea_2','IrrigArea_3','IrrigArea_4','IrrigArea_5'],dtype=float)
# set dates as index
I2.set_index('Year')
# make a column of dates in datetime format
dates = pd.to_datetime(I2[['Year', 'Month', 'Day']])
# add the column of dates to df
I2['dates'] = pd.Series(dates, index=I2.index)
# set dates as index of df
I2.set_index('dates')
# delete the three string columns replaced with datetime values
I2.drop(['Year', 'Month', 'Day'],inplace=True,axis=1)
# calculate the average daily value for each month
I2_monthly_average = I2.reset_index().set_index('dates').resample('m').mean()
I2_daily_average = I2_monthly_average.resample('d').bfill()

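The problem is that resample('m') labels each month by its last day, so the first day of the first month is not in the index and bfill has nothing earlier to fill from. It is necessary to add that first day manually: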
# make a column of dates in datetime format and assign to index
I2.index = pd.to_datetime(I2[['Year', 'Month', 'Day']])
# delete the three string columns replaced with datetime values
I2.drop(['Year', 'Month', 'Day'],inplace=True,axis=1)
# calculate the average daily value for each month
I2_monthly_average = I2.resample('m').mean()
first_day = I2_monthly_average.index[0].replace(day = 1)
I2_monthly_average.loc[first_day] = I2_monthly_average.iloc[0]
I2_daily_average = I2_monthly_average.resample('d').bfill()
Sample:
rng = pd.date_range('2017-04-03', periods=10, freq='20D')
I2 = pd.DataFrame({'a': range(10)}, index=rng)
print (I2)
a
2017-04-03 0
2017-04-23 1
2017-05-13 2
2017-06-02 3
2017-06-22 4
2017-07-12 5
2017-08-01 6
2017-08-21 7
2017-09-10 8
2017-09-30 9
I2_monthly_average = I2.resample('m').mean()
print (I2_monthly_average)
a
2017-04-30 0.5
2017-05-31 2.0
2017-06-30 3.5
2017-07-31 5.0
2017-08-31 6.5
2017-09-30 8.5
first_day = I2_monthly_average.index[0].replace(day = 1)
I2_monthly_average.loc[first_day] = I2_monthly_average.iloc[0]
print (I2_monthly_average)
a
2017-04-30 0.5
2017-05-31 2.0
2017-06-30 3.5
2017-07-31 5.0
2017-08-31 6.5
2017-09-30 8.5
2017-04-01 0.5 <- added first day
I2_daily_average = I2_monthly_average.resample('d').bfill()
print (I2_daily_average.head())
a
2017-04-01 0.5
2017-04-02 0.5
2017-04-03 0.5
2017-04-04 0.5
2017-04-05 0.5
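If the goal is only to put each month's mean on every day of that month, a groupby with transform may avoid the resample/bfill round trip altogether. This is a minimal sketch, not the method from the answer above; the frame `daily` and its values are made up for illustration, and it assumes the original daily index is still available.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data standing in for the irrigation demand frame
rng = pd.date_range('1900-01-01', '1900-03-31', freq='D')
daily = pd.DataFrame({'IrrigArea_1': np.arange(len(rng), dtype=float)}, index=rng)

# Group by (year, month) and broadcast each group's mean back to its rows,
# so every day, including those in the first month, gets the monthly mean
daily_average = daily.groupby(
    [daily.index.year, daily.index.month]
).transform('mean')

print(daily_average.head())
```

Because transform returns a result with the same index as the input, no month-end labels are involved and nothing has to be added by hand.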

Related

Pandas -- get dates closest to nth day of month

This Python code identifies the rows where the day of the month equals 5. For a month that does not have day 5, because it is a weekend or holiday, I want the mask to be True for the earlier date that is closest to day 5. I could write a loop to identify such dates, but is there an array formula to do this?
import pandas as pd
infile = "dates.csv"
df = pd.read_csv(infile)
dtimes = pd.to_datetime(df.iloc[:,0])
mask = (dtimes.dt.day == 5)
For testing purposes I created the following DataFrame (with a single column):
xxx
0 2022-11-03
1 2022-11-04
2 2022-11-07
3 2022-12-02
4 2022-12-05
5 2022-12-06
6 2023-01-04
7 2023-01-05
8 2023-01-06
9 2023-02-02
10 2023-02-03
11 2023-02-06
12 2023-02-07
13 2023-04-02
14 2023-04-05
15 2023-04-06
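Because I based my solution on the groupby method, I created dtimes
as a Series with the index equal to its values:
wrk = pd.to_datetime(df.iloc[:,0])
dtimes = pd.Series(wrk.values, index=wrk)
Then, to find the valid date within the current group of dates
(a single month), I defined the following function:
def findDate(grp):
    # months with no rows at all yield no date
    if grp.size == 0:
        return None
    dd = grp.dt.day
    if dd.eq(5).any():
        # day 5 is present: use it
        dd = dd[dd.eq(5)]
    else:
        # otherwise keep only the days before the 5th
        dd = dd[dd.lt(5)]
    # a month whose dates all fall after the 5th has no valid date
    if dd.empty:
        return None
    # the index holds the dates; the last one is closest to day 5
    return dd.index[-1]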
To find valid dates, for "existing" months, run:
validDates = dtimes.groupby(pd.Grouper(freq='M')).apply(findDate).dropna()
The result is:
xxx
2022-11-30 2022-11-04
2022-12-31 2022-12-05
2023-01-31 2023-01-05
2023-02-28 2023-02-03
2023-04-30 2023-04-05
dtype: datetime64[ns]
And to create your mask, run:
mask = dtimes.isin(validDates).values
To see the filtered rows, run:
df[mask]
getting:
xxx
1 2022-11-04
4 2022-12-05
7 2023-01-05
10 2023-02-03
14 2023-04-05
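The same mask can probably be built without a custom function: keep only the dates on or before the 5th, take the latest one per month, and test membership. This is a sketch under the assumption that the dates are unique; the Series below just re-creates the test data shown above.

```python
import pandas as pd

# Re-create the test dates from the answer above
dtimes = pd.to_datetime(pd.Series([
    '2022-11-03', '2022-11-04', '2022-11-07', '2022-12-02',
    '2022-12-05', '2022-12-06', '2023-01-04', '2023-01-05',
    '2023-01-06', '2023-02-02', '2023-02-03', '2023-02-06',
    '2023-02-07', '2023-04-02', '2023-04-05', '2023-04-06',
]))

# Only dates on or before the 5th can be "day 5 or the closest earlier date"
eligible = dtimes[dtimes.dt.day <= 5]

# Latest eligible date in each calendar month
latest = eligible.groupby(eligible.dt.to_period('M')).max()

# True for the rows holding one of those dates
mask = dtimes.isin(latest)
print(dtimes[mask])
```

Months that have no date on or before the 5th simply produce no group, so they contribute nothing to the mask.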

Create a row for each year between two dates

I have a dataframe with two date columns (format: YYYY-MM-DD). I want to create one row for each year between those two dates. The rows would be identical with a new column which specifies the year. For example, if the dates are 2018-01-01 and 2020-01-01 then there would be three rows with same data and a new column with values 2018, 2019, and 2020.
You can use a custom function to compute the range then explode the column:
# Ensure to have datetime
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
# Create the new column
date_range = lambda x: range(x['date1'].year, x['date2'].year+1)
df = df.assign(year=df.apply(date_range, axis=1)).explode('year', ignore_index=True)
Output:
>>> df
date1 date2 year
0 2018-01-01 2020-01-01 2018
1 2018-01-01 2020-01-01 2019
2 2018-01-01 2020-01-01 2020
This should work for you:
import pandas
# some sample data
df = pandas.DataFrame(data={
    'foo': ['bar', 'baz'],
    'date1': ['2018-01-01', '2022-01-01'],
    'date2': ['2020-01-01', '2017-01-01']
})
# cast date columns to datetime
for col in ['date1', 'date2']:
    df[col] = pandas.to_datetime(df[col])
# reset index to ensure that selection by length of index works
df = df.reset_index(drop=True)
# compute the range of years between the two dates, then iterate through the resulting
# series to unpack each range and add a new row with the original data and the year
for i, years in df.apply(
    lambda x: range(
        min(x.date1, x.date2).year,
        max(x.date1, x.date2).year + 1
    ),
    axis='columns'
).items():
    for year in years:
        new_index = len(df.index)
        df.loc[new_index] = df.loc[i].values
        df.loc[new_index, 'year'] = int(year)
output:
>>> df
foo date1 date2 year
0 bar 2018-01-01 2020-01-01 NaN
1 baz 2022-01-01 2017-01-01 NaN
2 bar 2018-01-01 2020-01-01 2018.0
3 bar 2018-01-01 2020-01-01 2019.0
4 bar 2018-01-01 2020-01-01 2020.0
5 baz 2022-01-01 2017-01-01 2017.0
6 baz 2022-01-01 2017-01-01 2018.0
7 baz 2022-01-01 2017-01-01 2019.0
8 baz 2022-01-01 2017-01-01 2020.0
9 baz 2022-01-01 2017-01-01 2021.0
10 baz 2022-01-01 2017-01-01 2022.0
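The two answers can be combined: handle dates given in either order, as the loop does, but build the rows with explode so the original rows are not left behind with NaN years. A rough sketch using the same sample data; the variable names `start`, `end` and `out` are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'foo': ['bar', 'baz'],
    'date1': pd.to_datetime(['2018-01-01', '2022-01-01']),
    'date2': pd.to_datetime(['2020-01-01', '2017-01-01']),
})

# Earlier and later year of each row, whichever column they come from
start = np.minimum(df['date1'].dt.year, df['date2'].dt.year)
end = np.maximum(df['date1'].dt.year, df['date2'].dt.year)

# One list of years per row, then one output row per year
df['year'] = [list(range(s, e + 1)) for s, e in zip(start, end)]
out = df.explode('year', ignore_index=True)
print(out)
```

After explode the year column has object dtype; `out['year'].astype(int)` converts it if integers are needed.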

Pandas add row to datetime indexed dataframe

I cannot find a solution for this problem. I would like to add future dates to a datetime indexed Pandas dataframe for model prediction purposes.
Here is where I am right now:
new_datetime = df2.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
And this is where I am stuck. The append examples online always seem to use ignore_index=True, and in my case I want to keep the proper datetime indexing.
Suppose you have this df:
date value
0 2020-01-31 00:00:00 1
1 2020-02-01 00:00:00 2
2 2020-02-02 00:00:00 3
then an alternative for adding future days is
df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=6, freq='D', closed='right')}))
which returns
date value
0 2020-01-31 00:00:00 1.0
1 2020-02-01 00:00:00 2.0
2 2020-02-02 00:00:00 3.0
0 2020-02-03 00:00:00 NaN
1 2020-02-04 00:00:00 NaN
2 2020-02-05 00:00:00 NaN
3 2020-02-06 00:00:00 NaN
4 2020-02-07 00:00:00 NaN
where the frequency is D (daily) and periods is 6. Because closed='right' excludes the start date, which is already in the df, five new days are added.
I think I was making this more difficult than necessary because I was using a datetime index instead of the typical integer index. By leaving the 'date' field as a regular column instead of an index adding the rows is straightforward.
One thing I did do was add a reset_index call so I did not end up with duplicate index values:
df = df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=21, freq='D', closed='right')}))
df = df.reset_index() # resets index
I also needed this, and solved it by merging the code you shared with the code from another answer ("add to a dataframe as I go with datetime index"). I ended up with the following code, which works for me:
data=raw.copy()
new_datetime = data.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
today_df = pd.DataFrame({'value': 301.124},index=new_datetime)
data = data.append(today_df)
data.tail()
Here 'value' is the column name in your own dataframe.
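Note that DataFrame.append was removed in pandas 2.0, and the closed argument of pd.date_range was replaced by inclusive. On recent versions, one way to add future dates while keeping the datetime index is to reindex over the union of the old and new dates. A minimal sketch with made-up data; `df2` here is a stand-in for the asker's frame.

```python
import pandas as pd

# Stand-in for the datetime-indexed frame from the question
df2 = pd.DataFrame(
    {'value': [1.0, 2.0, 3.0]},
    index=pd.date_range('2020-01-31', periods=3, freq='D'),
)

# Five future days, starting the day after the current last index entry
future = pd.date_range(
    start=df2.index[-1] + pd.Timedelta(days=1), periods=5, freq='D'
)

# Reindexing over old + new dates adds the new rows, filled with NaN
extended = df2.reindex(df2.index.union(future))
print(extended)
```

The new rows hold NaN until the model predictions are written into them, for example with `extended.loc[future, 'value'] = ...`.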

How to plot time series and group years together?

I have a dataframe that looks like below, the date is the index. How would I plot a time series showing a line for each of the years? I have tried df.plot(figsize=(15,4)) but this gives me one line.
Date Value
2008-01-31 22
2008-02-28 17
2008-03-31 34
2008-04-30 29
2009-01-31 33
2009-02-28 42
2009-03-31 45
2009-04-30 39
2019-01-31 17
2019-02-28 12
2019-03-31 11
2019-04-30 12
2020-01-31 24
2020-02-28 34
2020-03-31 43
2020-04-30 45
You can just do a groupby using year.
df = pd.read_clipboard()
df = df.set_index(pd.DatetimeIndex(df['Date']))
df.groupby(df.index.year)['Value'].plot()
If you want each year as its own series, so the years can be compared day by day:
import matplotlib.pyplot as plt
# Create a date column from index (easier to manipulate)
df["date_column"] = pd.to_datetime(df.index)
# Create a year column
df["year"] = df["date_column"].dt.year
# Create a month-day column
df["month_day"] = (df["date_column"].dt.month).astype(str).str.zfill(2) + \
"-" + df["date_column"].dt.day.astype(str).str.zfill(2)
# Plot. Pivot will create for each year a column and these columns will be used as series.
df.pivot(index='month_day', columns='year', values='Value').plot(kind='line', figsize=(12, 8), marker='o')
plt.title("Values per Month-Day - Year comparison", y=1.1, fontsize=14)
plt.xlabel("Month-Day", labelpad=12, fontsize=12)
plt.ylabel("Value", labelpad=12, fontsize=12);
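To see what the pivot step hands to the plot call, it may help to look at the reshaped table on its own. The sketch below uses the first two years of the sample data and builds the month-day label with strftime, which is my shortcut rather than the string concatenation used above.

```python
import pandas as pd

# First two years of the sample data, with the date as the index
df = pd.DataFrame(
    {'Value': [22, 17, 34, 29, 33, 42, 45, 39]},
    index=pd.to_datetime([
        '2008-01-31', '2008-02-28', '2008-03-31', '2008-04-30',
        '2009-01-31', '2009-02-28', '2009-03-31', '2009-04-30',
    ]),
)

df['year'] = df.index.year
df['month_day'] = df.index.strftime('%m-%d')

# One row per month-day, one column per year: each column becomes a line
wide = df.pivot(index='month_day', columns='year', values='Value')
print(wide)
```

Calling `wide.plot()` on this table draws one line per year column against the shared month-day axis.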

How to convert object to hour and add to date?

I have the following data frame:
correction = ['2.0','-2.5','4.5','-3.0']
date = ['2015-05-19 20:45:00','2017-04-29 17:15:00','2011-05-09 10:40:00','2016-12-18 16:10:00']
I want to convert correction to hours and add it to the date. I tried the following code, but it raises an error.
df['correction'] = pd.to_timedelta(df['correction'],unit='h')
df['date'] =pd.DatetimeIndex(df['date'])
df['date'] = df['date'] + df['correction']
I get the error in converting correction to timedelta as:
ValueError: no units specified
Casting the correction column to float works for me:
df['correction'] = pd.to_timedelta(df['correction'].astype(float),unit='h')
df['date'] = pd.DatetimeIndex(df['date'])
df['date'] = df['date'] + df['correction']
print (df)
correction date
0 02:00:00 2015-05-19 22:45:00
1 -1 days +21:30:00 2017-04-29 14:45:00
2 04:30:00 2011-05-09 15:10:00
3 -1 days +21:00:00 2016-12-18 13:10:00
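For a self-contained version of the same idea, the strings can also be converted with pd.to_numeric before building the timedeltas. This is only a sketch using the data from the question; the intermediate name `hours` is mine.

```python
import pandas as pd

df = pd.DataFrame({
    'correction': ['2.0', '-2.5', '4.5', '-3.0'],
    'date': ['2015-05-19 20:45:00', '2017-04-29 17:15:00',
             '2011-05-09 10:40:00', '2016-12-18 16:10:00'],
})

# Strings such as '2.0' carry no unit, so convert them to numbers first
hours = pd.to_numeric(df['correction'])

# Interpret the numbers as hours and shift the parsed dates by them
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(hours, unit='h')
print(df)
```

Negative corrections shift the date backwards, which is why the second row moves from 17:15 to 14:45.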