Pandas Convert Year/Month Int Columns to Datetime and Quarter Average - pandas

I have data in a df that is separated into a year and month column and I'm trying to find the average of observed data columns. I cannot find online how to convert the 'year' and 'month' columns to datetime and then to find the Q1, Q2, Q3, etc. averages.
year month data
0 2021 1 7.100427005789888
1 2021 2 7.22523237179488
2 2021 3 8.301528122415217
3 2021 4 6.843885683760697
4 2021 5 6.12365177832918
5 2021 6 6.049659188034206
6 2021 7 5.271174524400343
7 2021 8 5.098493589743587
8 2021 9 6.260155982906011
I need the final data to look like -
year Quarter Q data
2021 1 7.542395833
2021 2 6.33906555
2021 3 5.543274699
I've tried variations of this to change the 'year' and 'month' columns to datetime but it gives a long date starting with year = 1970
df.iloc[:, 1:2] = df.iloc[:, 1:2].apply(pd.to_datetime)
year month wind_speed_ms
0 2021 1970-01-01 00:00:00.000000001 7.100427
1 2021 1970-01-01 00:00:00.000000002 7.225232
2 2021 1970-01-01 00:00:00.000000003 8.301528
3 2021 1970-01-01 00:00:00.000000004 6.843886
4 2021 1970-01-01 00:00:00.000000005 6.123652
5 2021 1970-01-01 00:00:00.000000006 6.049659
6 2021 1970-01-01 00:00:00.000000007 5.271175
7 2021 1970-01-01 00:00:00.000000008 5.098494
8 2021 1970-01-01 00:00:00.000000009 6.260156
Thank you,

I hope this will work for you
# I created period column combining year and month column
df["period"]=df.apply(lambda x:f"{int(x.year)}-{int(x.month)}",axis=1).apply(pd.to_datetime).dt.to_period('Q')
# I applied groupby to period
df=df.groupby("period").mean().reset_index()
df["Quarter"] = df.period.astype(str).str[-2:]
df = df[["year","Quarter","data"]]
df.rename(columns={"data":"Q data"})
year Quarter Q data
0 2021.0 Q1 7.542396
1 2021.0 Q2 6.339066
2 2021.0 Q3 5.543275

Related

Produce weekly and quarterly stats from a monthly figure

I have a sample of a table as below:
Customer Ref
Bear Rate
Distance
Month
Revenue
ABA-IFNL-001
1000
01/01/2022
-135
ABA-IFNL-001
1000
01/02/2022
-135
ABA-IFNL-001
1000
01/03/2022
-135
ABA-IFNL-001
1000
01/04/2022
-135
ABA-IFNL-001
1000
01/05/2022
-135
ABA-IFNL-001
1000
01/06/2022
-135
I also have a sample of a calendar table as below:
Date
Year
Week
Quarter
WeekDay
Qtr Start
Qtr End
Week Day
04/11/2022
2022
45
4
Fri
30/09/2022
29/12/2022
1
05/11/2022
2022
45
4
Sat
30/09/2022
29/12/2022
2
06/11/2022
2022
45
4
Sun
30/09/2022
29/12/2022
3
07/11/2022
2022
45
4
Mon
30/09/2022
29/12/2022
4
08/11/2022
2022
45
4
Tue
30/09/2022
29/12/2022
5
09/11/2022
2022
45
4
Wed
30/09/2022
29/12/2022
6
10/11/2022
2022
45
4
Thu
30/09/2022
29/12/2022
7
11/11/2022
2022
46
4
Fri
30/09/2022
29/12/2022
1
12/11/2022
2022
46
4
Sat
30/09/2022
29/12/2022
2
13/11/2022
2022
46
4
Sun
30/09/2022
29/12/2022
3
14/11/2022
2022
46
4
Mon
30/09/2022
29/12/2022
4
15/11/2022
2022
46
4
Tue
30/09/2022
29/12/2022
5
16/11/2022
2022
46
4
Wed
30/09/2022
29/12/2022
6
17/11/2022
2022
46
4
Thu
30/09/2022
29/12/2022
7
How can I join/link the tables to report on revenue over weekly and quarterly periods using the calendar table? I can put into two tables if needed as an output eg:
Quarter Starting
31/12/2021
01/04/2022
01/07/2022
30/09/2022
Quarter
1
2
3
4
Revenue
500
400
540
540
Week Date Start
31/12/2021
07/01/2022
14/01/2022
21/01/2022
Week
41
42
43
44
Revenue
33.75
33.75
33.75
33.75
I am using alteryx for this but wouldnt mind explaination of possible logic in sql to apply it into the system
Thanks
Before I get into the answer, you're going to have an issue regarding data integrity. All the revenue data is aggregated at a monthly level, where your quarters start and end on someday within the month.
For example - Q4 starts September 30th (Friday) and ends Dec. 29th (Thursday). You may have a day or two that bleeds from another month into the quarters which might throw off the data a bit (esp. if there's a large amount of revenue during the days that bleed into a quarter.
Additionally, your revenue is aggregated at a monthly level - unless you have more granular data (weekly, daily would be best), it doesn't make sense to do a weekly calculation since you'll probably just be dividing revenue by 4.
That being said - You'll want to use a cross tab feature in alteryx to get the data how you want it. But before you do that, we want to aggregate your data at a quarterly level first.
You can do this with an if statement or some other data cleansing tool (sorry, been a while since I used alteryx). Something like:
# Pseudo code - this won't actually work!
# For determining quarter
if (month) between (30/09/2022,29/12/2022) then 4
where you can derive the logic from your calendar table. Then once you have the quarter, you can join in the Quarter Start date based on your quarter calculation.
Now you have a nice clean table that might look something like this:
Month
Revenue
Quarter
Quarter Start Date
01/01/2022
-135
4
30/09/2022
01/01/2022
-135
4
30/09/2022
Aggregate on your quarter to get a cleaner table
Quarter Start Date
Quarter
revenue
30/09/2022
4
300
Then use cross tab, where you pivot on the Quarter start date.
For SQL, you'd be pivoting the data. Essentially, taking the value from a row of data, and converting it into a column. It will look a bit janky because the data is so customized, but here's a good question that goes over pivioting - Simple way to transpose columns and rows in SQL?

SQL query to Find highest value in table and sum the corresponding value

I would like to group Highest values in month column group by year and Sum the value column
value
Year
Month
4
2019
10
1
2019
11
5
2019
11
1
2019
11
1
2019
12
8
2019
12
1
2019
12
1
2020
1
10
2020
1
3
2021
1
2
2021
2
11
2021
2
1
2021
2
3
2021
2
2
2021
3
In above table I would like to extract highest value of month group by year
in year 2019 highest month is 12 so there are 3 rows and sum of value column will be 10
The output should be
value
Year
Month
10
2019
12
11
2020
1
2
2021
3
supposing that the table is called "example_table" you can use the following query:
select sum(example_table.value), example_table.year, example_table.month
from example_table
join (
select year, max(month) "month"
from example_table
group by year
) sub on example_table.year = sub.year and example_table.month = sub.month
group by example_table.year, example_table.month
order by example_table.year

Compare Cumulative Sales per Year-End

Using this sample dataframe:
np.random.seed(1111)
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B','Group C','Group D'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2020',
freq='M'), 10000)})
I am trying to compare 12 month time frames with seaborn plots for a sub-grouping of category. For example, I'd like to compare the cumulative 12 months for each year ending 4-30 vs. the same time period for each year. I cannot wrap my head around how to get a running total of data for each respective year (5/1/17-4/30/18, 5/1/18-4/30/19, 5/1/19-4/30/20). The dates are just examples - I'd like to be able to compare different year-end data points, even better would be able to compare 365 days. For instance, I'd love to compare 3/15/19-3/14/20 to 3/15/18-3/14/19, etc.
I envision a graph for each 'Category' (A,B,C,D) with lines for each respective year representing the running total starting with zero on May 1, building through April 30 of the next year. The x axis would be the month (starting with May 1) & y axis would be 'Units_Sold' as it grows.
Any help would be greatly appreciated!
One way to convert the date to fiscal quarters and extract the fiscal year:
df = pd.DataFrame({'Date':pd.date_range('2019-01-01', '2019-12-31', freq='M'),
'Values':np.arange(12)})
df['fiscal_year'] = df.Date.dt.to_period('Q-APR').dt.qyear
Output:
Date Values fiscal_year
0 2019-01-31 0 2019
1 2019-02-28 1 2019
2 2019-03-31 2 2019
3 2019-04-30 3 2019
4 2019-05-31 4 2020
5 2019-06-30 5 2020
6 2019-07-31 6 2020
7 2019-08-31 7 2020
8 2019-09-30 8 2020
9 2019-10-31 9 2020
10 2019-11-30 10 2020
11 2019-12-31 11 2020
And now you can group by fiscal_year to your heart's content.

define a not std year period for pandas cumsum

I have a long time series with datetime index. I want to cumsum yearly but I want define my years as 1 oct to 30 sept next year
ex: cum sum on 1 oct 2018 to 30 sept 2019
Thank for your help!
One way is to manually mask 10,11,12 as next year:
# toy data
s = pd.DatetimeIndex(['2017-09-01', '2017-10-01', '2017-11-01'])
df = pd.DataFrame([0,1,2], index=s)
# mask Oct, Nov, Dec
groups = np.where(df.index.month > 9, df.index.year + 1, df.index.year)
# array([2017, 2018, 2018], dtype=int64)
df.groupby(groups).cumsum()
Second option is to convert the index to fiscal year:
groups = df.index.to_period('Q-SEP').qyear
# Int64Index([2017, 2018, 2018], dtype='int64')
Output:
0
2017-09-01 0
2017-10-01 1
2017-11-01 3

How to Find Week number, Period and year from Date in Redshift? (Week Starting with Wednesday and ends up with Tuesday)

Need to find weekNumber like 1,2,3,4 but the week starts with Wednesday and ends with Tuesday from date column and after the 4th week, again the week restart by again as the 1st week and so on (no need to consider month).
Need to find the Period based on weekNumber only, 4 weeks as 1 Period and Periods end with 13 (period 1-13) will restart again 1st period.
(4 weeks = 1 period) (no need to consider month).
Now need to calculate the businessyear based on Period. 13 Periods as One businessyear. (13 periods = 1 year)
Calculation logic:
7 days * 4 weeks = 28 days = 1 period
13 periods = 1 businessyear
Example:
A year has 365 days normally
In my scenario, 4 weeks * 7 days = 28 days
28 days *13 periods = 364 days
The remaining days will come as the 5th week and period 14.
Datekey date Year semistor Quarter Month DayName DayNum Wnumber
20090101 01-01-2009 2009 1 1 January 1 Thursday 1 0
20090102 02-01-2009 2009 1 1 January 1 Friday 2 0
20090103 03-01-2009 2009 1 1 January 1 Saturday 3 0
20090104 04-01-2009 2009 1 1 January 1 Sunday 0
20090105 05-01-2009 2009 1 1 January 1 Monday 0
20090106 06-01-2009 2009 1 1 January 1 Tuesday 6 0
20090107 07-01-2009 2009 1 1 January 1 Wednesday 0 0
20090108 08-01-2009 2009 1 1 January 1 Thursday 1 1
20090109 09-01-2009 2009 1 1 January 1 Friday 2 1
20090110 10-01-2009 2009 1 1 January 1 Saturday 3 1
20090111 11-01-2009 2009 1 1 January 1 Sunday 4 1
20090112 12-01-2009 2009 1 1 January 1 Monday 5 1
20090113 13-01-2009 2009 1 1 January 1 Tuesday 6 1
20090114 14-01-2009 2009 1 1 January 1 Wednesday 0 1
No need to consider the month in my scenario, need to consider leap year also (2016, 2020).
The traditional way to do this type of thing is to create a calendar table in the database. Then, your queries can simply JOIN to the calendar table to extract the relevant value.
I find that the easiest way to create the calendar table is to use Excel. Simply write some formulas that provide the desired values and Copy Down for the next decade or so. Then, save the sheet as CSV and load it into the database.
This way, you can totally avoid complex calculations involving database functions and you can use whatever rules you wish.