I have a dataframe with multiple columns that contain aggregate values per month.
I would like to break up the monthly aggregates into daily ones.
The original dataframe looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'month': ['January', 'February', 'March', 'April', 'May', 'June', 'July',
                             'August', 'September', 'October', 'November', 'December'],
                   'col_A': np.random.randint(0, 1000, size=12),
                   'col_B': np.random.randint(0, 1000, size=12)})
print(df)
month col_A col_B
0 January 102 330
1 February 435 458
2 March 860 87
3 April 270 372
4 May 106 99
5 June 71 871
6 July 700 663
7 August 20 130
8 September 614 661
9 October 121 308
10 November 466 769
11 December 214 343
My aim is to convert the monthly aggregates to a daily view that looks like this for the whole year:
date col_A col_B
0 2022-01-01 3.290323 10.645161
1 2022-01-02 3.290323 10.645161
2 2022-01-03 3.290323 10.645161
3 2022-01-04 3.290323 10.645161
4 2022-01-05 3.290323 10.645161
I converted the month names to datetime objects with
df['month'] = df.month.apply(lambda x: datetime.datetime.strptime(x, "%B") + relativedelta(years=122))  # %B parses to year 1900; 1900 + 122 = 2022
Then I tried to interpolate with resampling, as suggested in the question linked below; however, the results are not what I want: it interpolates between the points instead of dividing each value by the number of days in its month.
Converting monthly values into daily using pandas interpolation
First, generate the dates for each month, assuming the year is 2022.
df['date'] = pd.to_datetime(df['month'] + ' 2022')\
    .apply(pd.date_range, freq='MS', periods=2)\
    .apply(lambda ds: pd.date_range(*ds, inclusive='left'))  # use closed='left' on pandas < 1.4
Second, divide the values by the number of days in each month.
df['col_A'] /= df['date'].apply(len)
df['col_B'] /= df['date'].apply(len)
Finally, explode the dates column; the divided values are copied to each day.
df = df.explode('date')
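Putting the steps above together, here is a minimal end-to-end sketch (assuming pandas >= 1.4 for the `inclusive` keyword; a seeded generator replaces the random values so the result is reproducible):

```python
import numpy as np
import pandas as pd

# Monthly aggregates for 2022, seeded for reproducibility
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June', 'July',
              'August', 'September', 'October', 'November', 'December'],
    'col_A': rng.integers(0, 1000, size=12),
    'col_B': rng.integers(0, 1000, size=12),
})
totals = df[['col_A', 'col_B']].sum()

# One date range per month: first day of the month up to,
# but not including, the first day of the next month
df['date'] = (pd.to_datetime(df['month'] + ' 2022')
              .apply(pd.date_range, freq='MS', periods=2)
              .apply(lambda ds: pd.date_range(ds[0], ds[1], inclusive='left')))

# Spread each monthly total evenly over its days, then explode to daily rows
days = df['date'].apply(len)
df[['col_A', 'col_B']] = df[['col_A', 'col_B']].div(days, axis=0)
daily = df.explode('date').drop(columns='month').reset_index(drop=True)
daily['date'] = pd.to_datetime(daily['date'])
```

A quick sanity check: the daily values in each month still sum back to the original monthly aggregates.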
Related
How can I group data into months by date when a dataframe has both categorical and numerical data in pandas? I tried the groupby function, but I thought it wouldn't work with categorical data. There are multiple values in the categorical column. Sample data:
Date        Campaign_Name    No_of_Male_Viewers  No_of_Female_Viewers
2021-06-12  Dove_birds       1268                7656
2021-02-05  Pantene_winner   657                 8964
2021-09-15  Budweiser_wazap  7642                76
2021-05-13  Pantene_winner   425                 6578
2021-12-12  Budweiser_wazap  9867                111
2021-09-09  Dove_birds       1578                11456
2021-05-24  Pantene_winner   678                 7475
2021-09-27  Budweiser_wazap  8742                96
2021-09-09  Dove_soft        1175                15486
Now I need to group the data months wise and show for example that Budweiser_wazap in September gained a total audience of xxxx and in December gained xxxx audience and so on for the other campaigns as well.
Expected output sample:
Month      Campaign_Name    No_of_Male_Viewers  No_of_Female_Viewers
February   Pantene_winner   657                 8964
September  Budweiser_wazap  16384               172
Since Budweiser_wazap campaign ran twice in September, the resulting output for No_of_Male_Viewers is: 7642 + 8742 = 16384, and for No_of_Female_Viewers is: 76 + 96 = 172.
Use:
# Get the month name for each date
df['Month'] = df['Date'].dt.month_name()
# Group by `Month` & `Campaign_Name` and sum the viewer columns
df = df.groupby(['Month', 'Campaign_Name'])[['No_of_Male_Viewers', 'No_of_Female_Viewers']].sum().reset_index()
df
Sample reproducible code:
import pandas as pd
import numpy as np
from pandas import DataFrame
df = pd.DataFrame({
'Date' : ['2015-06-08', '2015-08-05', '2015-05-06', '2015-05-05', '2015-07-08', '2015-05-07', '2015-06-05', '2015-07-05'],
'Sym' : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month_name()
df
df output-
Date Sym Data2 Data3 Month
0 2015-06-08 aapl 11 5 June
1 2015-08-05 aapl 8 8 August
2 2015-05-06 aapl 10 6 May
3 2015-05-05 aapl 15 1 May
4 2015-07-08 aaww 110 50 July
5 2015-05-07 aaww 60 100 May
6 2015-06-05 aaww 100 60 June
7 2015-07-05 aaww 40 120 July
Groupby Condition-
df.groupby(['Month', 'Sym'])[['Data2', 'Data3']].sum().reset_index()
Output-
Month Sym Data2 Data3
0 August aapl 8 8
1 July aaww 150 170
2 June aapl 11 5
3 June aaww 100 60
4 May aapl 25 7
5 May aaww 60 100
Ref link: Pandas - dataframe groupby - how to get sum of multiple columns
If you use strftime('%B') - which extracts full month names - you can reach the same result with one line of code :)
# Download dataframe from Stack Overflow and convert the column to datetime
df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
# '%B' returns the full month name, '%b' the 3-letter abbreviation, like Dec, Sep
df.groupby([df['Date'].dt.strftime('%B'), "Campaign_Name"]).sum()
Is it possible to convert an entire column of decimal day-of-year values into the datetime format YYYY-mm-dd HH:MM? I tried counting the number of seconds and minutes in a day, but decimal DOY is different from decimal hours.
Example:
DOY = 181.82015046296297
Converted to:
Timestamp('2021-06-05 14:00:00')
Here the date would be a datetime object appearing only as 2021-06-05 14:00:00 in my dataframe. And the year I am interested in is 2021.
Use Timedelta to create an offset from the first day of the year.
Input data:
>>> df
DayOfYear
0 254
1 156
2 303
3 32
4 100
5 8
6 329
7 82
8 218
9 293
df['Date'] = pd.to_datetime('2021') \
+ df['DayOfYear'].sub(1).apply(pd.Timedelta, unit='D')
Output result:
>>> df
DayOfYear Date
0 254 2021-09-11
1 156 2021-06-05
2 303 2021-10-30
3 32 2021-02-01
4 100 2021-04-10
5 8 2021-01-08
6 329 2021-11-25
7 82 2021-03-23
8 218 2021-08-06
9 293 2021-10-20
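The same offset also handles the decimal day-of-year values from the question, since pd.Timedelta accepts fractional days; rounding to the minute then yields the YYYY-mm-dd HH:MM form. A sketch under the convention that day 1.0 is midnight on January 1 (the DOY column name is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'DOY': [156.5, 8.25]})

# Fractional days offset from Jan 1 of 2021, rounded to the nearest minute
df['Date'] = (pd.to_datetime('2021')
              + df['DOY'].sub(1).apply(pd.Timedelta, unit='D')).dt.round('min')
```

Here 156.5 lands at noon on the 156th day of 2021, and 8.25 at 06:00 on January 8.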
I tried many methods recommended by other threads, but failed to make my code work.
So... I want to load a csv file arranged like the one below into a dataframe.
year, 2021
month, march
date, 28
here, are, values
42.1, 28.7, 27.0, 9.54, 12.23, 22.25
I had a hard time dealing with this csv file (actually this is just a concise example of mine) because of its irregularity: rows of different lengths, mixed letters and numbers, and comma-and-space-mixed delimiters.
I want this dataset to be placed left-aligned in the dataframe like,
year 2021 NaN NaN NaN NaN
month march NaN NaN NaN NaN
date 28 NaN NaN NaN NaN
here are values NaN NaN NaN
42.1 28.7 27.0 9.54 12.23 22.25
Sorry that I cannot show you what I have done so far, because I have a bunch of versions of code from the methods I searched.
If all values refer to the same year, month and date, you need a DataFrame where each line is one observation of value, i.e.
year = 2021
month = 'march'
date = 28
values = [42.1, 28.7, 27.0, 9.54, 12.23, 22.25]
df = pd.DataFrame({
'year': np.repeat(year, len(values)),
'month': np.repeat(month, len(values)),
'date': np.repeat(date, len(values)),
'value': values
})
yielding
year month date value
0 2021 march 28 42.10
1 2021 march 28 28.70
2 2021 march 28 27.00
3 2021 march 28 9.54
4 2021 march 28 12.23
5 2021 march 28 22.25
If you want it transposed, you can do
df = df.T
that gives
0 1 2 3 4 5
year 2021 2021 2021 2021 2021 2021
month march march march march march march
date 28 28 28 28 28 28
value 42.1 28.7 27.0 9.54 12.23 22.25
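If the file itself needs to be loaded into the left-aligned, NaN-padded frame shown in the question, read_csv can do it when column names are supplied up front: rows shorter than the names list are padded with NaN on the right, and skipinitialspace handles the comma-plus-space delimiters. A sketch using an in-memory copy of the sample file (the six-column width is an assumption taken from the sample):

```python
import io
import pandas as pd

raw = """year, 2021
month, march
date, 28
here, are, values
42.1, 28.7, 27.0, 9.54, 12.23, 22.25
"""

# header=None + explicit names pads short rows with NaN on the right;
# skipinitialspace strips the blank after each comma
df = pd.read_csv(io.StringIO(raw), header=None, names=range(6),
                 skipinitialspace=True)
```

For a file on disk, the io.StringIO(raw) argument would simply be the path.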
So I have weekly sales data:
# Create the dataframe
test_df = pd.DataFrame({'year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018], 'week': [9, 10, 11, 12, 13, 14, 15, 16, 17], 'sales': [100, 200, 100, 300, 200, 100, 200, 100, 300]})
# Convert the year and week number to a date (the Sunday of each week)
test_df['date'] = test_df["year"].astype(str) + '-' + test_df["week"].astype(str)
test_df['date'] = pd.to_datetime(test_df['date'] + '0', format='%Y-%W%w')
test_df
This is the resulting dataframe:
year week sales date
0 2018 9 100 2018-03-04
1 2018 10 200 2018-03-11
2 2018 11 100 2018-03-18
3 2018 12 300 2018-03-25
4 2018 13 200 2018-04-01
5 2018 14 100 2018-04-08
6 2018 15 200 2018-04-15
7 2018 16 100 2018-04-22
8 2018 17 300 2018-04-29
Now I would like to smooth this data out and resample it to months or quarters, in order to make more stable long-term predictions. However, when I resample to quarterly or monthly data, each period has an uneven number of weeks, some 4 and some 5 (or 13, 12, 11 in the quarterly case):
test_df = test_df.set_index('date')
test_df = test_df.resample('M').sum()
test_df.drop(columns=['year', 'week'])
This gives:
sales
date
2018-03-31 700
2018-04-30 900
For months, I understand: months have different numbers of weeks. However, quarters should always have the same number of weeks if the first week starts on January 1st, right?
My question is: am I missing something in the year-week -> date conversion? This would be an issue if I create subsequences from this data to train a prediction model.
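One subtlety worth checking: %Y-%W counts weeks from the first Monday of the calendar year, so its week boundaries can drift from ISO-8601 weeks in years that do not start on a Monday. If the weekly data follows the ISO calendar, the standard-library %G/%V/%u directives (Python 3.6+) parse ISO year, ISO week, and ISO weekday directly. A sketch (for 2018 the two schemes happen to agree, because 2018-01-01 was a Monday):

```python
from datetime import datetime

import pandas as pd

test_df = pd.DataFrame({'year': [2018, 2018, 2018],
                        'week': [9, 10, 11],
                        'sales': [100, 200, 100]})

# %G = ISO year, %V = ISO week, %u = ISO weekday (1 = Monday ... 7 = Sunday)
test_df['date'] = (test_df['year'].astype(str) + '-'
                   + test_df['week'].astype(str) + '-7'
                   ).apply(lambda s: datetime.strptime(s, '%G-%V-%u'))
```

For years like 2021, where January 1 falls mid-week, the two parsings give different dates, which would shift weeks across month and quarter boundaries.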
I have sales by year:
pd.DataFrame({'year': [2015, 2016, 2017], 'value': [12, 24, 36]})
   year  value
0  2015     12
1  2016     24
2  2017     36
I want to extrapolate to months:
yyyymm value
201501 1 (ie 12/12, etc)
201502 1
...
201512 1
201601 2
...
201712 3
any suggestions?
One idea is to use a cross join with a helper DataFrame, convert the columns to strings, and zero-pad the month with Series.str.zfill:
df1 = pd.DataFrame({'m': range(1, 13), 'a': 1})
df = df.assign(a=1).merge(df1, on='a').drop(columns='a')
df['year'] = df['year'].astype(str) + df.pop('m').astype(str).str.zfill(2)
df = df.rename(columns={'year': 'yyyymm'})
Another solution is to create a MultiIndex and use DataFrame.reindex:
mux = pd.MultiIndex.from_product([df['year'], range(1, 13)], names=['yyyymm', 'm'])
df = df.set_index('year').reindex(mux, level=0).reset_index()
df['yyyymm'] = df['yyyymm'].astype(str) + df.pop('m').astype(str).str.zfill(2)
print(df.head(15))
yyyymm value
0 201501 12
1 201502 12
2 201503 12
3 201504 12
4 201505 12
5 201506 12
6 201507 12
7 201508 12
8 201509 12
9 201510 12
10 201511 12
11 201512 12
12 201601 24
13 201602 24
14 201603 24
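Both solutions above repeat the yearly total in every month. If the goal is the divided values from the expected output (12/12 = 1 per month for 2015), the cross join only needs one extra division. A sketch assuming value is numeric and pandas >= 1.2 for merge(how='cross'):

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016, 2017], 'value': [12, 24, 36]})

# Cross join with the 12 month numbers, then split each yearly value evenly
months = pd.DataFrame({'m': range(1, 13)})
out = df.merge(months, how='cross')
out['value'] = out['value'] / 12
out['yyyymm'] = out['year'].astype(str) + out.pop('m').astype(str).str.zfill(2)
out = out[['yyyymm', 'value']]
```

This yields 36 rows, running from 201501 with value 1.0 through 201712 with value 3.0.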