Resampling time-series data with pyspark - dataframe

I have timeseries data that looks a bit like this (timestamp, value):
14 Dec 2020 1000
15 Jan 2021 1000
20 Jan 2021 1000
18 Feb 2021 1000
03 Mar 2021 1000
I'm essentially trying to get monthly values, smoothing out the value for every month. Each row represents the "value" between the two dates, so if we wanted to calculate the value for January, we'd need the value to represent:
15 days of January from the value in December + 5 days between Jan 15 - Jan 20 + 11 days between Jan 20 - Feb 18.
Each contribution is calculated as (number of days falling in the current month / length of the whole interval) * value:
Value for Jan: (15/32) * 1000 + (5/5) * 1000 + (11/29) * 1000
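As a quick sanity check, that arithmetic in plain Python (note the Jan 20 - Feb 18 interval is 29 days long):

dec14_jan15 = 15 / 32 * 1000  # Dec 14 - Jan 15: 15 of its 32 days fall in January
jan15_jan20 = 5 / 5 * 1000    # Jan 15 - Jan 20: entirely within January
jan20_feb18 = 11 / 29 * 1000  # Jan 20 - Feb 18: 11 of its 29 days fall in January
print(dec14_jan15 + jan15_jan20 + jan20_feb18)  # ~1848.06, matching January in the answer below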
I've tried resampling with the window function, but resampling on 1 month raises an exception, and it also simply returns the intervals instead of resampling everything.
Any advice is appreciated. Thanks.

You can interpolate the values between the dates using sequence, then group by month and sum the per-day values in each month.
EDIT: used a UDF from this answer because sequence is not supported in Spark 2.2.
import pyspark.sql.functions as F
from pyspark.sql.types import *
import datetime

# Spark 2.2 lacks the built-in `sequence`, so register a UDF that
# generates every date from `start` to `stop` inclusive
def generate_date_series(start, stop):
    return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

spark.udf.register("generate_date_series", generate_date_series, ArrayType(DateType()))

result = df.withColumn(
    'timestamp',
    F.to_date(F.col('timestamp'), 'dd MMM yyyy')
).withColumn(
    'next_timestamp',
    F.expr("""
        generate_date_series(
            lag(timestamp, 1, timestamp + interval 1 day) -- default makes the first row's series empty (it has no preceding interval)
                over (order by timestamp) + interval 1 day, -- don't want to include the previous date
            timestamp
        )
    """)
).select(
    F.explode('next_timestamp').alias('timestamp'),
    # spread the row's value evenly over the days of its interval
    (F.col('value') / F.size('next_timestamp')).alias('value')
).groupBy(
    F.year('timestamp').alias('year'),
    F.month('timestamp').alias('month')
).agg(
    F.sum('value').alias('value')
).orderBy('year', 'month')

result.show(truncate=False)
+----+-----+------------------+
|year|month|value |
+----+-----+------------------+
|2020|12 |531.25 |
|2021|1 |1848.0603448275874|
|2021|2 |1389.920424403183 |
|2021|3 |230.76923076923077|
+----+-----+------------------+
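On Spark 2.4+, where sequence is available, the UDF can be dropped. A sketch of the same idea, reusing the imports above (the first row is filtered out explicitly because, unlike the UDF, the built-in sequence does not return an empty array when its start is after its stop):

from pyspark.sql import Window

w = Window.orderBy('timestamp')
result = df.withColumn(
    'timestamp', F.to_date('timestamp', 'dd MMM yyyy')
).withColumn(
    'prev_timestamp', F.lag('timestamp').over(w)
).where(
    F.col('prev_timestamp').isNotNull()  # the first row has no preceding interval
).withColumn(
    'days', F.expr("sequence(date_add(prev_timestamp, 1), timestamp)")
).select(
    F.explode('days').alias('timestamp'),
    (F.col('value') / F.size('days')).alias('value')
).groupBy(
    F.year('timestamp').alias('year'),
    F.month('timestamp').alias('month')
).agg(
    F.sum('value').alias('value')
).orderBy('year', 'month')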

Related

Subtraction of inventory from demand in BigQuery every day, and adding new inventory

Here's what my data looks like:
date           sku   inventory_added   demand
22nd Nov 2021  XYZ   70                18
23rd Nov 2021  XYZ   0                 18
24th Nov 2021  XYZ   0                 50
25th Nov 2021  XYZ   0                 15
26th Nov 2021  XYZ   80                30
27th Nov 2021  XYZ   0                 20
28th Nov 2021  XYZ   0                 15
29th Nov 2021  XYZ   0                 20
30th Nov 2021  XYZ   0                 10
1st Dec 2021   XYZ   100               40
2nd Dec 2021   XYZ   0                 10
I want to create a new column named solution using BigQuery SQL. In the 1st row, i.e. 22nd Nov 2021, the formula is simply inventory_added - demand, so the 1st row's solution value will be 52.
What I am not able to do starts from the 2nd row: the value will be 52 (remaining inventory from the previous day) + 0 (inventory_added on 23rd Nov 2021) - 18 (demand on 23rd Nov 2021), which equals 34.
Similarly, for the next row, i.e. 24th November, the solution value will be 34 + 0 - 50 = -16. Since it is negative, it should be clamped to 0; I tried MAX(solution, 0) for this.
The result will look like this:
date           sku   inventory_added   demand   solution
22nd Nov 2021  XYZ   70                18       52
23rd Nov 2021  XYZ   0                 18       34
24th Nov 2021  XYZ   0                 50       0
25th Nov 2021  XYZ   0                 15       0
26th Nov 2021  XYZ   80                30       50
27th Nov 2021  XYZ   0                 20       30
28th Nov 2021  XYZ   0                 15       15
29th Nov 2021  XYZ   0                 20       0
30th Nov 2021  XYZ   0                 10       0
1st Dec 2021   XYZ   100               40       60
2nd Dec 2021   XYZ   0                 10       50
I am not sure if this can be accomplished by BigQuery, but all suggestions are welcome.
Thanks!
Without the condition "since it is negative, it should be put as 0", you could use the window (in BigQuery terms, analytic) variant of the SUM() function:
SELECT *,
       SUM(inventory_added - demand) OVER (PARTITION BY sku ORDER BY date) AS solution
FROM source_table
With this condition, the computation becomes iterative, and you must use a recursive CTE (if available in BigQuery) or an iterative stored procedure.
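To see why the clamp makes this iterative, here is the recurrence in plain Python (illustration only, using the sample data above):

inventory_added = [70, 0, 0, 0, 80, 0, 0, 0, 0, 100, 0]
demand          = [18, 18, 50, 15, 30, 20, 15, 20, 10, 40, 10]

solution, prev = [], 0
for add, dem in zip(inventory_added, demand):
    # each row depends on the clamped result of the previous row,
    # which a plain SUM() OVER (...) cannot express directly
    prev = max(prev + add - dem, 0)
    solution.append(prev)

print(solution)  # [52, 34, 0, 0, 50, 30, 15, 0, 0, 60, 50]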
I see that recursive CTE is not available in BigQuery ... Can you provide pseudo code, maybe as a starting point for a stored procedure? – Shantanu Jain
CREATE PROCEDURE procname()
BEGIN
  CREATE temptable;
  OPEN CURSOR FOR SELECT * FROM datatable ORDER BY date;
  SET #solution = 0;
  FETCH CURSOR INTO #date, #sku, #inventory_added, #demand;
  LOOP
    SET #solution = GREATEST(#solution + #inventory_added - #demand, 0);
    INSERT INTO temptable VALUES (#date, #sku, #inventory_added, #demand, #solution);
    FETCH CURSOR INTO #date, #sku, #inventory_added, #demand;
  UNTIL NO_ROWS_IN_CURSOR END LOOP;
  SELECT * FROM temptable;
  DROP temptable;
END
As an option, consider using the recently introduced FOR...IN loop:
declare result int64;
declare prev_sku string;

create temp table results as (
  select *, 0 as solution from your_table where false
);

set (result, prev_sku) = (0, '');

for record in (
  select *,
         parse_date('%d %B %Y', regexp_replace(date, r'(\d*)(\w*)( \w{3} \d{4})', r'\1 \3')) dt
  from your_table
  order by sku, dt
) do
  if record.sku != prev_sku then set result = 0; end if;
  set result = result + record.inventory_added - record.demand;
  if result < 0 then set result = 0; end if;
  insert into results values(record.date, record.sku, record.inventory_added, record.demand, result);
  set prev_sku = record.sku;
end for;

select * from results
order by sku, parse_date('%d %B %Y', regexp_replace(date, r'(\d*)(\w*)( \w{3} \d{4})', r'\1 \3'));
If applied to the sample data in your question, the output matches the expected result above.
Note: while this delivers the expected result, it is obviously going to be extremely slow (as any row-by-row solution is), so while it is useful for learning, I don't think it is appropriate for real production use.

Pandas Convert Year/Month Int Columns to Datetime and Quarter Average

I have data in a df that is separated into year and month columns, and I'm trying to find quarterly averages of the observed data column. I cannot find online how to convert the 'year' and 'month' columns to datetime and then find the Q1, Q2, Q3, etc. averages.
year month data
0 2021 1 7.100427005789888
1 2021 2 7.22523237179488
2 2021 3 8.301528122415217
3 2021 4 6.843885683760697
4 2021 5 6.12365177832918
5 2021 6 6.049659188034206
6 2021 7 5.271174524400343
7 2021 8 5.098493589743587
8 2021 9 6.260155982906011
I need the final data to look like -
year Quarter Q data
2021 1 7.542395833
2021 2 6.33906555
2021 3 5.543274699
I've tried variations of this to change the 'year' and 'month' columns to datetime, but it gives a long date starting with year 1970:
df.iloc[:, 1:2] = df.iloc[:, 1:2].apply(pd.to_datetime)
year month wind_speed_ms
0 2021 1970-01-01 00:00:00.000000001 7.100427
1 2021 1970-01-01 00:00:00.000000002 7.225232
2 2021 1970-01-01 00:00:00.000000003 8.301528
3 2021 1970-01-01 00:00:00.000000004 6.843886
4 2021 1970-01-01 00:00:00.000000005 6.123652
5 2021 1970-01-01 00:00:00.000000006 6.049659
6 2021 1970-01-01 00:00:00.000000007 5.271175
7 2021 1970-01-01 00:00:00.000000008 5.098494
8 2021 1970-01-01 00:00:00.000000009 6.260156
Thank you,
I hope this will work for you. (The 1970 dates in your attempt appear because pd.to_datetime interprets bare integers as nanosecond offsets from the Unix epoch.)
# create a period column combining the year and month columns
df["period"] = df.apply(lambda x: f"{int(x.year)}-{int(x.month)}", axis=1).apply(pd.to_datetime).dt.to_period('Q')
# group by period and average
df = df.groupby("period").mean().reset_index()
# the last two characters of the period string are the quarter, e.g. "Q1"
df["Quarter"] = df.period.astype(str).str[-2:]
df = df[["year", "Quarter", "data"]]
df.rename(columns={"data": "Q data"})
year Quarter Q data
0 2021.0 Q1 7.542396
1 2021.0 Q2 6.339066
2 2021.0 Q3 5.543275
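If you'd rather avoid the string round-trip (and year becoming a float after the mean), a sketch of an alternative that computes the quarter directly from the integer month column, starting from the original df in the question:

# months 1-3 -> quarter 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
q = (df["month"] - 1) // 3 + 1
out = (df.groupby(["year", q.rename("Quarter")])["data"]
         .mean()
         .reset_index(name="Q data"))
print(out)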

Oracle results from last X days while ignoring time

Right now I have the following query which returns results from the last 60 days
select * from my_table where date_col > sysdate - 60
But it is also taking the time of day into consideration. For example, today is
Sept 30 2021 10:30:00 AM
and the query would return results from Aug 01 2021 10:31:00 AM, but not from Aug 01 2021 10:29:00 AM.
How can I modify the query so that it does not care about the time when getting the last 60 days? I would like the query to return results even if the row had a date of Aug 01 2021 00:00:01 AM.
It sounds like you just want to trunc the sysdate. I'd guess that you want to do a >= as well.
WHERE date_col >= trunc(sysdate) - 60
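For illustration, with sysdate = 30-SEP-2021 10:30:00:
-- sysdate - 60        => 01-AUG-2021 10:30:00  (time of day kept)
-- trunc(sysdate) - 60 => 01-AUG-2021 00:00:00  (midnight)
SELECT * FROM my_table WHERE date_col >= TRUNC(SYSDATE) - 60;
A row stamped 01-AUG-2021 00:00:01 passes the TRUNC version but fails the original comparison.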

Teradata SQL Week Number - Week 1 starting 1st Jan with weeks aligned to specific day of the week

my first post on here so please be gentle...
I'm trying to create a week number variable in Teradata (SQL) that does the following:
Week 1 always starts on 1st January of the given year
Week numbers increment on the specified day of the week
For example: If Saturday was the specified day of the week:
2019-01-01 would be the start of week 1, 2019, changing to week 2 on 2019-01-05
2020-01-01 would be the start of week 1, 2020, changing to week 2 on 2020-01-04
I have come up with the following, based on an Excel function, however it doesn't quite work as expected:
ROUND(((DATE_SPECIFIED - CAST(EXTRACT(YEAR FROM DATE_SPECIFIED) || '-01-01' AS DATE) + 1) - ((DATE_SPECIFIED - DATE '0001-01-06') MOD 7 + 1) + 10) / 7) AS REQUIRED_WEEK
The last digit in the term DATE '0001-01-06' selects the specified day of the week, where '0001-01-01' would correspond to Monday.
This works in some cases however for some years, the first week number is showing as 0 where it should be 1, e.g. 1st Jan 2018 / 2019 are fine whereas 1st Jan 2020 is not.
Any ideas to correct this would be gratefully received.
Many thanks,
Mike
You can apply NEXT_DAY for both the specified date and Jan 1st of that year, e.g. for Saturday as week start:
(Next_Day(DATE_SPECIFIED,'SAT') - Next_Day(Trunc(DATE_SPECIFIED,'yyyy'),'SAT')) / 7 +1
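A quick worked check (assuming NEXT_DAY returns the first such weekday strictly after the given date):
-- DATE_SPECIFIED = DATE '2019-01-04' (a Friday):
--   Next_Day(DATE '2019-01-04', 'SAT')                = 2019-01-05
--   Next_Day(Trunc(DATE '2019-01-04', 'yyyy'), 'SAT') = 2019-01-05
--   (2019-01-05 - 2019-01-05) / 7 + 1                 = 1  -> week 1
-- DATE_SPECIFIED = DATE '2019-01-05' (a Saturday):
--   Next_Day(DATE '2019-01-05', 'SAT')                = 2019-01-12
--   (2019-01-12 - 2019-01-05) / 7 + 1                 = 2  -> week 2
-- matching the expected rollover to week 2 on 2019-01-05.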
Hmmm... I'm a bit weak on Teradata functions. But the idea is to get the start of the second week, which follows this rule:
Jan 1 weekday   (TD)   2nd week starts
Sunday           1      01-02
Monday           2      01-08
Tuesday          3      01-07
Wednesday        4      01-06
Thursday         5      01-05
Friday           6      01-04
Saturday         7      01-03
I think the following logic calculates this:
select t.*,
       (case when td_day_of_week(cast(extract(year from DATE_SPECIFIED) || '-01-01' as date)) = 1
             then cast(extract(year from DATE_SPECIFIED) || '-01-02' as date)
             else cast(extract(year from DATE_SPECIFIED) || '-01-01' as date)
                  + (9 - td_day_of_week(cast(extract(year from DATE_SPECIFIED) || '-01-01' as date)))
        end) as second_week_start
from t;
Then do your week calculation either from the second week, or subtract one more week to get when the first week really starts.

sql running total math current quarter

I'm trying to figure out the total for the quarter when the only data shown is a running total for the year:
Id Amount Periods Year Type Date
-------------------------------------------------------------
1 65 2 2014 G 4-1-12
2 75 3 2014 G 7-1-12
3 25 1 2014 G 1-1-12
4 60 1 2014 H 1-1-12
5 75 1 2014 Y 1-1-12
6 120 3 2014 I 7-1-12
7 30 1 2014 I 1-1-12
8 90 2 2014 I 4-1-12
In the data shown above, the Amount values for types G and I are running totals over the periods (quarters). If my query returns period 3, is there a SQL way to get the data for just that quarter? The math would involve subtracting the 2nd period's total from the 3rd period's.
Right now my SQL is something like:
SELECT * FROM data WHERE Date='4-1-12';
This query will return row #1, which is the total for 2 periods. I would like it to return just the total for the 2nd period. I'm looking to make this happen with SQLite.
Any help would be appreciated.
Thanks a lot.
You want to subtract the running total of the previous quarter:
SELECT Id,
       Year,
       Type,
       Date,
       Amount - IFNULL((SELECT Amount
                        FROM data AS previousQuarter
                        WHERE previousQuarter.Year = data.Year
                          AND previousQuarter.Type = data.Type
                          AND previousQuarter.Periods = data.Periods - 1
                       ), 0) AS Amount
FROM data
The IFNULL is needed to handle a quarter that has no previous quarter.
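On SQLite 3.25+, the same subtraction can be written with a window function; a sketch of the equivalent logic (not part of the original answer):

SELECT Id, Year, Type, Date,
       Amount - LAG(Amount, 1, 0) OVER (
           PARTITION BY Year, Type
           ORDER BY Periods
       ) AS Amount
FROM data;

With the sample data, row #1 (type G, period 2) becomes 65 - 25 = 40, and period-1 rows are returned unchanged because LAG's default of 0 applies.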