How can I group data by month from dates when a data frame has both categorical and numerical data in pandas? I tried the groupby function, but I think it won't work with categorical data. There are multiple values in the categorical column. Sample data:
Date        Campaign_Name    No_of_Male_Viewers  No_of_Female_Viewers
2021-06-12  Dove_birds       1268                7656
2021-02-05  Pantene_winner   657                 8964
2021-09-15  Budweiser_wazap  7642                76
2021-05-13  Pantene_winner   425                 6578
2021-12-12  Budweiser_wazap  9867                111
2021-09-09  Dove_birds       1578                11456
2021-05-24  Pantene_winner   678                 7475
2021-09-27  Budweiser_wazap  8742                96
2021-09-09  Dove_soft        1175                15486
Now I need to group the data month-wise and show, for example, that Budweiser_wazap gained a total audience of xxxx in September and xxxx in December, and so on for the other campaigns as well.
Expected output sample:
Month      Campaign_Name    No_of_Male_Viewers  No_of_Female_Viewers
February   Pantene_winner   657                 8964
September  Budweiser_wazap  16384               172
Since Budweiser_wazap campaign ran twice in September, the resulting output for No_of_Male_Viewers is: 7642 + 8742 = 16384, and for No_of_Female_Viewers is: 76 + 96 = 172.
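Use:
# Make sure `Date` is a datetime column, then get the month name for each date
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month_name()
# Group by `Month` and `Campaign_Name`, then sum the viewer columns
df.groupby(['Month', 'Campaign_Name'])[['No_of_Male_Viewers', 'No_of_Female_Viewers']].sum().reset_index()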
Sample reproducible code:
import pandas as pd
df = pd.DataFrame({
'Date' : ['2015-06-08', '2015-08-05', '2015-05-06', '2015-05-05', '2015-07-08', '2015-05-07', '2015-06-05', '2015-07-05'],
'Sym' : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month_name()
df
Output of df:
Date Sym Data2 Data3 Month
0 2015-06-08 aapl 11 5 June
1 2015-08-05 aapl 8 8 August
2 2015-05-06 aapl 10 6 May
3 2015-05-05 aapl 15 1 May
4 2015-07-08 aaww 110 50 July
5 2015-05-07 aaww 60 100 May
6 2015-06-05 aaww 100 60 June
7 2015-07-05 aaww 40 120 July
Groupby step:
df.groupby(['Month', 'Sym'])[['Data2', 'Data3']].sum().reset_index()
Output:
Month Sym Data2 Data3
0 August aapl 8 8
1 July aaww 150 170
2 June aapl 11 5
3 June aaww 100 60
4 May aapl 25 7
5 May aaww 60 100
Reference: Pandas - dataframe groupby - how to get sum of multiple columns
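Applied to the sample data from the question, the same approach can be sketched as below. Note that groupby sorts its keys, so month names alone come out alphabetically (August before July in the output above). The helper column `Month_Num` is my own addition, not part of the answer, and is only there to keep the months in calendar order.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": ["2021-06-12", "2021-02-05", "2021-09-15", "2021-05-13", "2021-12-12",
             "2021-09-09", "2021-05-24", "2021-09-27", "2021-09-09"],
    "Campaign_Name": ["Dove_birds", "Pantene_winner", "Budweiser_wazap",
                      "Pantene_winner", "Budweiser_wazap", "Dove_birds",
                      "Pantene_winner", "Budweiser_wazap", "Dove_soft"],
    "No_of_Male_Viewers": [1268, 657, 7642, 425, 9867, 1578, 678, 8742, 1175],
    "No_of_Female_Viewers": [7656, 8964, 76, 6578, 111, 11456, 7475, 96, 15486],
})

df["Date"] = pd.to_datetime(df["Date"])
df["Month_Num"] = df["Date"].dt.month      # hypothetical helper, used only for ordering
df["Month"] = df["Date"].dt.month_name()

# Grouping on the month number first keeps the result in calendar order
result = (
    df.groupby(["Month_Num", "Month", "Campaign_Name"], as_index=False)[
        ["No_of_Male_Viewers", "No_of_Female_Viewers"]
    ]
    .sum()
    .drop(columns="Month_Num")
)
print(result)
```

The September row for Budweiser_wazap should then show 16384 and 172, matching the expected output in the question.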
If you use strftime('%B'), which returns the full month name, you can get the same result in one line of code.
# Copy the data frame from Stack Overflow to the clipboard and convert the column to datetime
df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
# '%B' returns the full month name; '%b' returns the three-letter abbreviation, like Dec or Sep
# Select the numeric columns explicitly so the datetime column is not summed
df.groupby([df['Date'].dt.strftime('%B'), 'Campaign_Name'])[['No_of_Male_Viewers', 'No_of_Female_Viewers']].sum()
I have a dataset like this:
import pandas as pd
df = pd.DataFrame(
{
"id": {0: 1, 1: 1, 2: 1, 3: 2, 4: 2},
"price": {0: 20, 1: 41, 2: 61, 3: 68, 4: 10},
"date_month_start": {
0: "2021-06-12",
1: "2021-11-13",
2: "2022-02-27",
3: "2021-04-14",
4: "2021-07-11",
},
"date_month_end": {
0: "2021-09-14",
1: "2022-01-13",
2: "2022-04-12",
3: "2021-06-18",
4: "2021-10-16",
},
}
)
print(df)
id price date_month_start date_month_end
0 1 20 2021-06-12 2021-09-14
1 1 41 2021-11-13 2022-01-13
2 1 61 2022-02-27 2022-04-12
3 2 68 2021-04-14 2021-06-18
4 2 10 2021-07-11 2021-10-16
But I would like to create a column holding each first-of-month date that falls between the start and end date, and repeat the row (keeping all other column values) when more than one first-of-month date falls between the start and end date.
For instance, if the start date is March 12, 2021 and the end date is June 04, 2021, then the new column should take the values April 1st 2021, May 1st 2021 and June 1st 2021. Since there are three values for the new column, the row should be repeated three times, copying every column value except the new one.
The output data should look like:
id price date_month_start date_month_end date_month
0 1 20 2021-06-12 2021-09-14 2021-07-01
1 1 20 2021-06-12 2021-09-14 2021-08-01
2 1 20 2021-06-12 2021-09-14 2021-09-01
3 1 41 2021-11-13 2022-01-13 2021-12-01
4 1 41 2021-11-13 2022-01-13 2022-01-01
5 1 61 2022-02-27 2022-04-12 2022-03-01
6 1 61 2022-02-27 2022-04-12 2022-04-01
7 2 68 2021-04-14 2021-06-18 2021-05-01
8 2 68 2021-04-14 2021-06-18 2021-06-01
9 2 10 2021-07-11 2021-10-16 2021-08-01
10 2 10 2021-07-11 2021-10-16 2021-09-01
11 2 10 2021-07-11 2021-10-16 2021-10-01
I am new to Python. Does anyone have any direction on how to do this? I can get the first day of the month from a date column, but this is a different problem.
Here is one way to do it:
from pandas.tseries.offsets import MonthEnd
# Convert into pandas datetimes
df['date_month_start'] = pd.to_datetime(df['date_month_start'])
df['date_month_end'] = pd.to_datetime(df['date_month_end'])
# For each row of 'df', find the month starts between start and end date,
# duplicate the row once per month start and add the new column,
# then store the intermediate dataframe in a list (dfs)
dfs = []
for i in range(df.shape[0]):
    row = df.loc[i, :]
    new_month = pd.Series(
        [
            row["date_month_start"] + MonthEnd(n) + pd.Timedelta(1, "d")
            for n in range(1, 13)
            if row["date_month_start"] + MonthEnd(n) + pd.Timedelta(1, "d")
            < row["date_month_end"]
        ]
    )
    temp_df = pd.DataFrame([row.to_list() for _ in range(len(new_month))])
    temp_df[4] = new_month
    dfs.append(temp_df)
# Concatenate the intermediate dataframes into one
new_df = pd.concat(dfs)
# Cleanup
new_df.columns = ["id", "price", "date_month_start", "date_month_end", "date_month"]
new_df = new_df.reset_index(drop=True)
print(new_df)
# Output
id price date_month_start date_month_end date_month
0 1 20 2021-06-12 2021-09-14 2021-07-01
1 1 20 2021-06-12 2021-09-14 2021-08-01
2 1 20 2021-06-12 2021-09-14 2021-09-01
3 1 41 2021-11-13 2022-01-13 2021-12-01
4 1 41 2021-11-13 2022-01-13 2022-01-01
5 1 61 2022-02-27 2022-04-12 2022-03-01
6 1 61 2022-02-27 2022-04-12 2022-04-01
7 2 68 2021-04-14 2021-06-18 2021-05-01
8 2 68 2021-04-14 2021-06-18 2021-06-01
9 2 10 2021-07-11 2021-10-16 2021-08-01
10 2 10 2021-07-11 2021-10-16 2021-09-01
11 2 10 2021-07-11 2021-10-16 2021-10-01
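A shorter alternative, not the method used in the answer above, is to let pd.date_range with freq="MS" list the month starts for each row and then explode the result. This is only a sketch on the question's data. It differs from the loop at the edges: date_range includes the start or end date itself when that date is the first of a month, whereas the loop excludes both.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "price": [20, 41, 61, 68, 10],
    "date_month_start": pd.to_datetime(
        ["2021-06-12", "2021-11-13", "2022-02-27", "2021-04-14", "2021-07-11"]),
    "date_month_end": pd.to_datetime(
        ["2021-09-14", "2022-01-13", "2022-04-12", "2021-06-18", "2021-10-16"]),
})

# One list of month-start timestamps per row ("MS" = month start frequency)
df["date_month"] = pd.Series(
    [
        list(pd.date_range(start, end, freq="MS"))
        for start, end in zip(df["date_month_start"], df["date_month_end"])
    ],
    index=df.index,
)

# explode repeats the other columns once per element of each list
new_df = df.explode("date_month", ignore_index=True)
new_df["date_month"] = pd.to_datetime(new_df["date_month"])
print(new_df)
```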
I have a dataframe with multiple columns that contain aggregate values per month.
I would like to break up the monthly aggregates into daily ones.
The original dataframe looks like this:
df = pd.DataFrame({'month': ['January', 'February', 'March', 'April', 'May', 'June', 'July',
                             'August', 'September', 'October', 'November', 'December'],
                   'col_A': np.random.randint(0, 1000, size=12),
                   'col_B': np.random.randint(0, 1000, size=12)})
print(df)
month col_A col_B
0 January 102 330
1 February 435 458
2 March 860 87
3 April 270 372
4 May 106 99
5 June 71 871
6 July 700 663
7 August 20 130
8 September 614 661
9 October 121 308
10 November 466 769
11 December 214 343
My aim is to convert the monthly aggregates to a daily view that looks like this for the whole year:
date col_A col_B
0 2022-01-01 3.290323 10.645161
1 2022-01-02 3.290323 10.645161
2 2022-01-03 3.290323 10.645161
3 2022-01-04 3.290323 10.645161
4 2022-01-05 3.290323 10.645161
I converted the month to a datetime object and added 122 years, to move it from the default year 1900 to 2022:
df['month'] = df.month.apply(lambda x: datetime.datetime.strptime(x, "%B") + relativedelta(years = 122))
Then I tried to interpolate with resampling, as suggested in the question linked below. However, the results are not what I need: it interpolates between the points instead of dividing each value by the number of days in the month.
Converting monthly values into daily using pandas interpolation
First, generate the dates for each month, assuming the year 2022.
# `inclusive='left'` replaces the older `closed='left'` argument (pandas 1.4 and later)
df['date'] = pd.to_datetime(df['month'] + ' 2022')\
    .apply(pd.date_range, freq='MS', periods=2)\
    .apply(lambda ds: pd.date_range(*ds, inclusive='left'))
Second, divide the values by the number of days in each month.
df['col_A'] /= df['date'].apply(len)
df['col_B'] /= df['date'].apply(len)
Finally, explode the date column; the divided values are copied to every day.
df.explode('date')
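The same three steps can also be sketched with the `dt.days_in_month` accessor, which avoids building a second date range just to count days. The names `monthly`, `start`, `days` and `daily` are illustrative, and the two rows are the January and February values from the frame printed in the question.

```python
import pandas as pd

# First two rows of the frame printed in the question
monthly = pd.DataFrame({
    "month": ["January", "February"],
    "col_A": [102, 435],
    "col_B": [330, 458],
})

# First day of each month, assuming the year 2022
start = pd.to_datetime(monthly["month"] + " 2022", format="%B %Y")
days = start.dt.days_in_month

# Divide each monthly total by the number of days in that month
daily = monthly[["col_A", "col_B"]].div(days, axis=0)

# Attach one list of daily dates per month, then explode to one row per day
daily.insert(0, "date", pd.Series(
    [list(pd.date_range(s, periods=n, freq="D")) for s, n in zip(start, days)],
    index=monthly.index,
))
daily = daily.explode("date", ignore_index=True)
daily["date"] = pd.to_datetime(daily["date"])
print(daily.head())
```

For January this gives 102 / 31 and 330 / 31 per day, which matches the 3.290323 and 10.645161 shown in the desired output.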
Is it possible to convert an entire column of decimal Day-Of-Year values into datetime format YYYY-mm-dd HH:MM? I tried counting the number of seconds and minutes in a day, but decimal DOY is different from decimal hours.
Example:
DOY = 181.82015046296297
Converted to:
Timestamp('2021-06-05 14:00:00')
Here the date would be a datetime object appearing only as 2021-06-05 14:00:00 in my dataframe. And the year I am interested in is 2021.
Use Timedelta to create an offset from the first day of the year.
Input data:
>>> df
DayOfYear
0 254
1 156
2 303
3 32
4 100
5 8
6 329
7 82
8 218
9 293
df['Date'] = pd.to_datetime('2021') \
    + df['DayOfYear'].sub(1).apply(pd.Timedelta, unit='D')
Output result:
>>> df
DayOfYear Date
0 254 2021-09-11
1 156 2021-06-05
2 303 2021-10-30
3 32 2021-02-01
4 100 2021-04-10
5 8 2021-01-08
6 329 2021-11-25
7 82 2021-03-23
8 218 2021-08-06
9 293 2021-10-20
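The answer above uses whole day numbers. For the decimal values in the question, a vectorised variant with pd.to_timedelta carries the fractional part over as hours and minutes. This is a sketch rather than the answer's own code: the column name `DOY` and the two extra sample values are assumptions, and rounding to whole seconds is my choice to hide floating-point noise.

```python
import pandas as pd

# First value is the decimal day-of-year from the question; the others are made up
df = pd.DataFrame({"DOY": [181.82015046296297, 1.0, 32.5]})

# Day 1.0 is midnight on 1 January, so subtract 1 before using it as an offset
df["Date"] = pd.Timestamp("2021-01-01") + pd.to_timedelta(df["DOY"] - 1, unit="D")

# Round to whole seconds to remove floating-point residue
df["Date"] = df["Date"].dt.round("s")
print(df)
```

With this offset, 181.82015046296297 in 2021 works out to 2021-06-30 19:41:01.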
So I have weekly sales data:
# Create the dataframe
test_df = pd.DataFrame({'year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018], 'week': [9, 10, 11, 12, 13, 14, 15, 16, 17], 'sales': [100, 200, 100, 300, 200, 100, 200, 100, 300]})
# Convert the year and week to a date
test_df['date'] = test_df["year"].astype(str) + '-' + test_df["week"].astype(str)
test_df['date'] = pd.to_datetime(test_df['date'] + '0', format='%Y-%W%w')
test_df
This is the resulting dataframe:
year week sales date
0 2018 9 100 2018-03-04
1 2018 10 200 2018-03-11
2 2018 11 100 2018-03-18
3 2018 12 300 2018-03-25
4 2018 13 200 2018-04-01
5 2018 14 100 2018-04-08
6 2018 15 200 2018-04-15
7 2018 16 100 2018-04-22
8 2018 17 300 2018-04-29
Now I would like to smooth this data out and resample it to months or quarters, in order to make more stable long term predictions. However, when I resample the data to quarterly or monthly data, each period will have an uneven number of weeks, some 4 and some 5 (or 13, 12, 11 in the case of quarterly):
test_df = test_df.set_index('date')
test_df = test_df.resample('M').sum()
test_df.drop(columns=['year', 'week'])
This gives:
sales
date
2018-03-31 700
2018-04-30 900
Now for months, I understand: months have different numbers of weeks. However, quarters should always have the same number of weeks if the first week starts on January 1st, right?
My question is, am I missing something in the conversion from year-week -> date? This would be an issue if I create different subsequences out of this to train a prediction model.
This is the tail of my DataFrame, which has around 1000 entries:
Open Close High Change mx_profitable
Date
2018-06-06 263.00 270.15 271.4 7.15 8.40
2018-06-08 268.95 273.00 273.9 4.05 4.95
2018-06-11 273.30 274.00 278.4 0.70 5.10
2018-06-12 274.00 282.85 284.4 8.85 10.40
I have to select only the entries for certain dates, for example, the 25th of every month.
I think you need DatetimeIndex.day with boolean indexing:
df[df.index.day == 25]
Sample:
rng = pd.date_range('2017-04-03', periods=1000)
df = pd.DataFrame({'a': range(1000)}, index=rng)
print (df.head())
a
2017-04-03 0
2017-04-04 1
2017-04-05 2
2017-04-06 3
2017-04-07 4
df1 = df[df.index.day == 25]
print (df1.head())
a
2017-04-25 22
2017-05-25 52
2017-06-25 83
2017-07-25 113
2017-08-25 144
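If more than one day of the month is needed, or the dates sit in an ordinary column instead of the index, the same boolean-indexing idea can be sketched as follows. The day list [1, 25] and the column name `Date` are assumptions for illustration only.

```python
import pandas as pd

rng = pd.date_range("2017-04-03", periods=1000)
df = pd.DataFrame({"a": range(1000)}, index=rng)

# Rows that fall on the 1st or the 25th of any month
selected = df[df.index.day.isin([1, 25])]
print(selected.head())

# Same filter when the dates are a regular column (hypothetical name `Date`)
df_col = df.rename_axis("Date").reset_index()
selected_col = df_col[df_col["Date"].dt.day == 25]
print(selected_col.head())
```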