define a non-standard year period for pandas cumsum - pandas

I have a long time series with a datetime index. I want to take a yearly cumulative sum, but I want to define my years as running from 1 Oct to 30 Sept of the next year.
For example: cumsum over 1 Oct 2018 to 30 Sept 2019.
Thanks for your help!

One way is to manually mask months 10, 11 and 12 as the next year:
import numpy as np
import pandas as pd

# toy data
s = pd.DatetimeIndex(['2017-09-01', '2017-10-01', '2017-11-01'])
df = pd.DataFrame([0, 1, 2], index=s)
# mask Oct, Nov, Dec as belonging to the following year
groups = np.where(df.index.month > 9, df.index.year + 1, df.index.year)
# array([2017, 2018, 2018], dtype=int64)
df.groupby(groups).cumsum()
A second option is to convert the index to a fiscal year:
groups = df.index.to_period('Q-SEP').qyear
# Int64Index([2017, 2018, 2018], dtype='int64')
Output:
0
2017-09-01 0
2017-10-01 1
2017-11-01 3
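Either grouping gives the same cumulative sum; as a quick sketch, the fiscal-year variant end to end with the same toy data:
import pandas as pd

s = pd.DatetimeIndex(['2017-09-01', '2017-10-01', '2017-11-01'])
df = pd.DataFrame([0, 1, 2], index=s)

# quarters anchored to September: Oct-Dec roll into the following fiscal year
fiscal_year = df.index.to_period('Q-SEP').qyear
df.groupby(fiscal_year).cumsum()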

Related

BigQuery SQL - repeatedly/recursively change a row's column in the SELECT statement based on the values in the previous row

I have a table like the one below:
customer | date        | end date
1        | jan 1 2021  | jan 30 2021
1        | jan 2 2021  | jan 31 2021
1        | jan 3 2021  | feb 1 2021
1        | jan 27 2021 | feb 26 2021
1        | feb 3 2021  | mar 5 2021
2        | jan 2 2021  | jan 31 2021
2        | jan 10 2021 | feb 9 2021
2        | feb 10 2021 | mar 12 2021
Now, I want to update the value in the 'end date' column of a row based on the previous row's 'end date' and the current row's 'date'.
If the current row's date < the previous row's end date, I want to set the current row's end date = the previous row's end date.
I want to do this repeatedly for all the rows (grouped by customer).
I want the output as below, but computed in the SELECT statement instead of updating/inserting into a table.
Note - since the second row's end date is updated with the value from the first row (jan 30 2021), the third row's date (jan 3 2021) is evaluated against the updated value in the second row (jan 30 2021), not against the second row's value before the update (jan 31 2021).
customer | date        | end date
1        | jan 1 2021  | jan 30 2021
1        | jan 2 2021  | jan 30 2021 [updated because current date < previous end date]
1        | jan 3 2021  | jan 30 2021 [updated because current date < previous end date]
1        | jan 27 2021 | jan 30 2021 [updated because current date < previous end date]
1        | feb 3 2021  | mar 5 2021
2        | jan 2 2021  | jan 31 2021
2        | jan 10 2021 | jan 31 2021 [updated because current date < previous end date]
2        | feb 10 2021 | mar 12 2021
I think I would go this way. I read the data source twice just to perform the operation in the SELECT, without updating or inserting into the table.
input table:
1|2021-01-01|2021-01-30
1|2021-01-02|2021-01-31
1|2021-01-03|2021-02-01
1|2021-01-27|2021-02-26
1|2021-02-03|2021-03-05
2|2021-01-02|2021-01-31
2|2021-01-10|2021-02-09
2|2021-02-10|2021-03-12
code:
with num_raw_data as (
  select row_number() over(partition by customer order by date_init) as num,
         customer, date_init, date_end
  from `project-id.data-set.table`
), analyzed_data as (
  select r.num,
         r.customer,
         r.date_init,
         r.date_end,
         case when date_init < (select date_end
                                from num_raw_data
                                where num = r.num - 1
                                  and customer = r.customer
                                  and extract(month from r.date_init) = extract(month from date_init))
              then 1 else 0 end as validation
  from num_raw_data r
)
select customer,
       date_init,
       case when validation != 0
            then (select min(date_end)
                  from analyzed_data
                  where validation = 0
                    and customer = ad.customer
                    and date_init < ad.date_end)
            else date_end end as date_end
from analyzed_data ad
order by customer, num
output:
1|2021-01-01|2021-01-30
1|2021-01-02|2021-01-30
1|2021-01-03|2021-01-30
1|2021-01-27|2021-01-30
1|2021-02-03|2021-03-05
2|2021-01-02|2021-01-31
2|2021-01-10|2021-01-31
2|2021-02-10|2021-03-12
The validation column from analyzed_data tells me where I should be looking for changes. I'm not sure if it's fast (probably not), but it works for the scenario in your question.

Pandas Convert Year/Month Int Columns to Datetime and Quarter Average

I have data in a df that is separated into 'year' and 'month' columns, and I'm trying to find the average of the observed data columns. I can't find how to convert the 'year' and 'month' columns to datetime and then compute the Q1, Q2, Q3, etc. averages.
year month data
0 2021 1 7.100427005789888
1 2021 2 7.22523237179488
2 2021 3 8.301528122415217
3 2021 4 6.843885683760697
4 2021 5 6.12365177832918
5 2021 6 6.049659188034206
6 2021 7 5.271174524400343
7 2021 8 5.098493589743587
8 2021 9 6.260155982906011
I need the final data to look like -
year Quarter Q data
2021 1 7.542395833
2021 2 6.33906555
2021 3 5.543274699
I've tried variations of this to change the 'year' and 'month' columns to datetime, but it gives a long date starting with year = 1970:
df.iloc[:, 1:2] = df.iloc[:, 1:2].apply(pd.to_datetime)
year month wind_speed_ms
0 2021 1970-01-01 00:00:00.000000001 7.100427
1 2021 1970-01-01 00:00:00.000000002 7.225232
2 2021 1970-01-01 00:00:00.000000003 8.301528
3 2021 1970-01-01 00:00:00.000000004 6.843886
4 2021 1970-01-01 00:00:00.000000005 6.123652
5 2021 1970-01-01 00:00:00.000000006 6.049659
6 2021 1970-01-01 00:00:00.000000007 5.271175
7 2021 1970-01-01 00:00:00.000000008 5.098494
8 2021 1970-01-01 00:00:00.000000009 6.260156
Thank you,
I hope this will work for you:
# create a period column by combining the year and month columns
df["period"] = (df.apply(lambda x: f"{int(x.year)}-{int(x.month)}", axis=1)
                  .apply(pd.to_datetime)
                  .dt.to_period('Q'))
# group by period and take the mean of each column
df = df.groupby("period").mean().reset_index()
df["Quarter"] = df.period.astype(str).str[-2:]
df = df[["year", "Quarter", "data"]]
df.rename(columns={"data": "Q data"})
year Quarter Q data
0 2021.0 Q1 7.542396
1 2021.0 Q2 6.339066
2 2021.0 Q3 5.543275
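The 1970 dates in the question's attempt come from pd.to_datetime interpreting the bare integers as nanoseconds since the epoch. An alternative sketch that avoids the string formatting is to assemble a real date from the year and month columns and convert it to a quarterly period (assuming the question's column names year, month and data):
import pandas as pd

# build a proper datetime from the integer year/month columns (day fixed to 1)
dates = pd.to_datetime(df[["year", "month"]].assign(day=1))

quarterly = (df.assign(Quarter=dates.dt.to_period("Q"))
               .groupby("Quarter", as_index=False)["data"].mean()
               .rename(columns={"data": "Q data"}))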

Resampling time-series data with pyspark

I have timeseries data that looks a bit like this (timestamp, value):
14 Dec 2020 1000
15 Jan 2021 1000
20 Jan 2021 1000
18 Feb 2021 1000
03 Mar 2021 1000
I'm essentially trying to get monthly values, smoothing out the value for every month. Each row represents the "value" between the two dates, so if we wanted to calculate the value for January, we'd need the value to represent:
15 days of January from the value in December + 5 days between Jan 15 - Jan 20 + 11 days between Jan 20 - Feb 18.
Value would be calculated as number of days relevant to the current month / length of whole interval * value:
Value for Jan: (15/32) * 1000 + (5/5) * 1000 + (11/28) * 1000
I've tried resampling with the window function, but resampling on 1 month throws an exception, and it also simply returns the intervals instead of resampling everything.
Any advice is appreciated. Thanks.
You can interpolate the values between the dates using sequence, then group by the month and average over the values in each month.
EDIT: used a UDF from this answer because sequence is not supported in Spark 2.2
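The snippet below assumes df already holds the raw strings from the question; a minimal way to build such a frame for testing might look like this (the column names timestamp and value are an assumption carried through the rest of the code):
# hypothetical toy frame matching the question's sample data
data = [
    ('14 Dec 2020', 1000),
    ('15 Jan 2021', 1000),
    ('20 Jan 2021', 1000),
    ('18 Feb 2021', 1000),
    ('03 Mar 2021', 1000),
]
df = spark.createDataFrame(data, ['timestamp', 'value'])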
import datetime

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DateType

# UDF returning every date from start to stop, inclusive
def generate_date_series(start, stop):
    return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

spark.udf.register("generate_date_series", generate_date_series, ArrayType(DateType()))

result = df.withColumn(
    'timestamp',
    F.to_date(F.col('timestamp'), 'dd MMM yyyy')
).withColumn(
    'next_timestamp',
    F.expr("""
        generate_date_series(
            lag(timestamp, 1, timestamp + interval 1 day)  -- default for the first row (no previous date): it yields an empty series, so explode drops that row
                over(order by timestamp) + interval 1 day,  -- don't want to include the previous date
            timestamp
        )
    """)
).select(
    F.explode('next_timestamp').alias('timestamp'),
    (F.col('value') / F.size('next_timestamp')).alias('value')
).groupBy(
    F.year('timestamp').alias('year'),
    F.month('timestamp').alias('month')
).agg(
    F.sum('value').alias('value')
).orderBy('year', 'month')
result.show(truncate=False)
+----+-----+------------------+
|year|month|value |
+----+-----+------------------+
|2020|12 |531.25 |
|2021|1 |1848.0603448275874|
|2021|2 |1389.920424403183 |
|2021|3 |230.76923076923077|
+----+-----+------------------+

How to prepare month by month time series data in R from aggregated data containing part wise sales vs month?

I have a dataframe consisting of month-wise sales data for many parts.
For example:
Partno Month Qty
Part 1 June 2019 20
Part 1 July 2019 25
Part 1 Sep 2019 30
Part 2 Mar 2019 45
Part 3 Aug 2019 40
Part 3 Nov 2019 21
I want to convert this data into a month-by-month time series, which will make time series forecasting easier once I turn it into a ts object:
Month Part1 Part 2 Part 3
Jan 0 0 0
Feb 0 0 0
Mar 0 45 0
Apr 0 0 0
May 0 0 0
June 20 0 0
July 25 0 0
Aug 0 0 0
Sept 0 30 0
Oct 0 0 20
Nov 0 0 21
Dec 0 0 0
I am quite baffled as to how this can be carried out in R. Any solutions would be highly useful, as I plan to build some forecasting models in R.
Looking forward to hearing from you all!
Assume the data DF shown reproducibly in the Note at the end.
First convert DF to a zoo object, splitting it by the first column and converting the Month column to yearmon class. Then convert that to ts class, extend it to span January through December, and set any NAs to 0. (If you don't need the zero months at the beginning and end, omit the yrs and window lines.)
library(zoo)
z <- read.zoo(DF, split = 1, index = 2, FUN = as.yearmon, format = "%b %Y")
tt <- as.ts(z)
yrs <- as.integer(range(time(tt))) # start and end years
tt <- window(tt, start = yrs[1], end = yrs[2] + 11/12, extend = TRUE)
tt[is.na(tt)] <- 0
tt
giving:
Part 1 Part 2 Part 3
Jan 2019 0 0 0
Feb 2019 0 0 0
Mar 2019 0 45 0
Apr 2019 0 20 0
May 2019 0 0 0
Jun 2019 20 0 0
Jul 2019 25 0 0
Aug 2019 0 0 0
Sep 2019 30 0 0
Oct 2019 0 0 20
Nov 2019 0 0 21
Dec 2019 0 0 0
Note
Lines <- "Partno, Month, Qty
Part 1, Jun 2019, 20
Part 1, Jul 2019, 25
Part 1, Sep 2019, 30
Part 2, Mar 2019, 45
Part 2, Apr 2019, 20
Part 3, Oct 2019, 20
Part 3, Nov 2019, 21"
DF <- read.csv(text = Lines, strip.white = TRUE)

Compare Cumulative Sales per Year-End

Using this sample dataframe:
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B', 'Group C', 'Group D'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=(10000)),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2020', freq='M'), 10000)})
I am trying to compare 12 month time frames with seaborn plots for a sub-grouping of category. For example, I'd like to compare the cumulative 12 months for each year ending 4-30 vs. the same time period for each year. I cannot wrap my head around how to get a running total of data for each respective year (5/1/17-4/30/18, 5/1/18-4/30/19, 5/1/19-4/30/20). The dates are just examples - I'd like to be able to compare different year-end data points, even better would be able to compare 365 days. For instance, I'd love to compare 3/15/19-3/14/20 to 3/15/18-3/14/19, etc.
I envision a graph for each 'Category' (A,B,C,D) with lines for each respective year representing the running total starting with zero on May 1, building through April 30 of the next year. The x axis would be the month (starting with May 1) & y axis would be 'Units_Sold' as it grows.
Any help would be greatly appreciated!
One way is to convert the date to fiscal quarters and extract the fiscal year:
df = pd.DataFrame({'Date': pd.date_range('2019-01-01', '2019-12-31', freq='M'),
                   'Values': np.arange(12)})
df['fiscal_year'] = df.Date.dt.to_period('Q-APR').dt.qyear
Output:
Date Values fiscal_year
0 2019-01-31 0 2019
1 2019-02-28 1 2019
2 2019-03-31 2 2019
3 2019-04-30 3 2019
4 2019-05-31 4 2020
5 2019-06-30 5 2020
6 2019-07-31 6 2020
7 2019-08-31 7 2020
8 2019-09-30 8 2020
9 2019-10-31 9 2020
10 2019-11-30 10 2020
11 2019-12-31 11 2020
And now you can group by fiscal_year to your heart's content.
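From there, a running total per category and fiscal year (what the question calls the cumulative 12 months) could look roughly like this; a sketch reusing the question's sample df and its Category, Units_Sold and Date columns:
# fiscal year ending 30 April, as in the answer above
df['fiscal_year'] = df['Date'].dt.to_period('Q-APR').dt.qyear

# running total of units within each category and fiscal year
df = df.sort_values('Date')
df['cum_units'] = df.groupby(['Category', 'fiscal_year'])['Units_Sold'].cumsum()
Each (Category, fiscal_year) group then accumulates from its first sale after 1 May, which should give the per-year lines the question describes for the seaborn plots.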