Finding cumulative sum using SQL Server with ORDER BY

Trying to calculate a cumulative (running) sum. I need to order by two columns: Delivery, then Contract_Date.
Query:
SELECT Contract_Date, Delivery, Balance,
       SUM(Balance) OVER (ORDER BY Delivery, Contract_Date) AS cumsum
FROM t
Results:
Contract_Date Delivery Balance cumsum
2020-02-25 2020-03-01 308.100000 308.100000
2020-03-05 2020-03-01 -2.740000 305.360000
2020-03-06 2020-04-01 176.820000 682.180000
2020-03-06 2020-04-01 200.000000 682.180000
2020-03-09 2020-04-01 300.000000 1082.180000
2020-03-09 2020-04-01 100.000000 1082.180000
2020-03-13 2020-04-01 129.290000 1211.470000
2020-03-16 2020-04-01 200.000000 1711.470000
2020-03-16 2020-04-01 300.000000 1711.470000
2020-03-17 2020-04-01 300.000000 2011.470000
2020-04-01 2020-04-01 86.600000 2098.070000
2020-04-03 2020-04-01 200.000000 2298.070000
Expected results:
Contract_Date Delivery Balance cumsum
25/2/2020 1/3/2020 308.1 308.1
5/3/2020 1/3/2020 -2.74 305.36
6/3/2020 1/4/2020 176.82 482.18
6/3/2020 1/4/2020 200 682.18
9/3/2020 1/4/2020 300 982.18
9/3/2020 1/4/2020 100 1082.18
13/3/2020 1/4/2020 129.29 1211.47
16/3/2020 1/4/2020 200 1411.47
16/3/2020 1/4/2020 300 1711.47
17/3/2020 1/4/2020 300 2011.47
1/4/2020 1/4/2020 86.6 2098.07
3/4/2020 1/4/2020 200 2298.07
Version:
Microsoft SQL Server 2017

You need a third column in the ORDER BY clause to break the ties on Delivery and Contract_Date: with the default window frame, tied rows are peers and all receive the same running total, which is what you see on 2020-03-06, 2020-03-09 and 2020-03-16. It is not obvious which column you would use as the tie-breaker. Here is one option using column Balance:
SELECT
    Contract_Date,
    Delivery,
    Balance,
    SUM(Balance) OVER (ORDER BY Delivery, Contract_Date, Balance) AS cumsum
FROM t
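Another option is to keep the two-column ORDER BY and switch the window frame from the default RANGE to ROWS, so tied rows are no longer treated as peers and each row gets its own running total (the order in which tied rows are accumulated is then arbitrary, so values within a tie may swap between runs):
SELECT
    Contract_Date,
    Delivery,
    Balance,
    SUM(Balance) OVER (
        ORDER BY Delivery, Contract_Date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumsum
FROM t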

Related

Merging two series with alternating dates into one grouped Pandas dataframe

Given are two series, like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are linked: each date marks either the beginning or the end of a period. The first series marks the end of a period1 period, the second series marks the end of a period2 period. The end of a period2 period is at the same time also the start of a period1 period, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much preferable.
Thank you!
import pandas as pd

p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'],
                        'val': [310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'],
                        'val': [312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
# DataFrame.append is deprecated (removed in pandas 2.0), so concatenate instead
df = pd.concat([p1, p2]).sort_values('Date').reset_index(drop=True)
df['CHG'] = df['val'].diff(periods=1).abs()
df.drop('val', axis=1)   # display without the raw values
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2
# A second approach, assuming df1 and df2 are the original period1/period2 series indexed by DATE.
# If needed, make sure the indices are datetime:
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df = pd.concat([df1, df2], axis=1)   # align both series on the date index
df.columns = ['start', 'stop']
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()   # change since the previous event
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2   # rows that came from the second series
df = df[['CNG', 'PERIOD']]
print(df)
Output:
CNG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1

Calculate duration between two rows in T-SQL

Good afternoon! Could anyone help me solve this task? I have a table:
Id  Date              Reason
1   2020-01-01 10:00  Departure
1   2020-01-01 12:20  Arrival
1   2020-01-02 14:30  Departure
1   2020-01-02 19:20  Arrival
1   2020-01-03 15:40  Departure
1   2020-01-04 19:20  Arrival
2   2020-02-03 15:40  Departure
2   2020-02-04 19:20  Arrival
3   2020-03-05 15:40  Departure
3   2020-03-05 19:20  Arrival
3   2020-03-06 16:28  Departure
3   2020-03-06 21:00  Arrival
I need to estimate the average duration for each Id. As a first step I want to get a table like this (for example, for Id = 1):
Id  Duration (minutes)
1   140
1   290
1   1660
How can I achieve that with a T-SQL query?
Assuming the rows are perfectly interleaved, you can use lead():
select t.*,
datediff(minute, date, next_date) as diff_minutes
from (select t.*,
lead(date) over (partition by id order by date) as next_date
from t
) t
where reason = 'Departure';
If you want the results for only one id, you can filter in either the subquery or the outer query.
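From there, the average per Id is one more aggregation. A sketch building on the query above (a trailing Departure with no matching Arrival yields a NULL next_date, which avg() ignores; the * 1.0 avoids integer averaging):
select id,
       avg(datediff(minute, date, next_date) * 1.0) as avg_minutes
from (select t.*,
             lead(date) over (partition by id order by date) as next_date
      from t
     ) t
where reason = 'Departure'
group by id;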

How to merge records with aggregate historical data?

I have a table with individual records and another which holds historical information about the individuals in the former.
I want to extract information about the individuals from the second table. Both tables have a timestamp. It is very important that the historical information happened before the record in the first table.
Date_Time name
0 2021-09-06 10:46:00 Leg It Liam
1 2021-09-06 10:46:00 Hollyhill Island
2 2021-09-06 10:46:00 Shani El Bolsa
3 2021-09-06 10:46:00 Kilbride Fifi
4 2021-09-06 10:46:00 Go
2100 2021-10-06 11:05:00 Slaneyside Babs
2101 2021-10-06 11:05:00 Hillview Joe
2102 2021-10-06 11:05:00 Fairway Flyer
2103 2021-10-06 11:05:00 Whiteys Surprise
2104 2021-10-06 11:05:00 Astons Lucy
The name is the variable by which you connect the two tables:
Date_Time name cc
13 2021-09-15 12:16:00 Hollyhill Island 6.00
14 2021-09-06 10:46:00 Hollyhill Island 4.50
15 2021-05-30 18:28:00 Hollyhill Island 3.50
16 2021-05-25 10:46:00 Hollyhill Island 2.50
17 2021-05-18 12:46:00 Hollyhill Island 2.38
18 2021-04-05 12:31:00 Hollyhill Island 3.50
19 2021-04-28 12:16:00 Hollyhill Island 3.75
I want to add aggregated data from this table to the first, such as the cc mean and count.
Date_Time name
1 2021-09-06 10:46:00 Hollyhill Island
To this line I would add 5 for the cc count and 3.126 for the cc mean. Remember, the historical records need to be from before the Date_Time of the individual record.
I am a bit confused about how to do this efficiently. I know I need to group the historical data.
Also the individual records are usually in groups of Date_Time, if that makes it any easier.
IIUC:
try:
out = df1.merge(df2, on='name', suffixes=('', '_y'))
# merging both dfs on name
out = out.mask(out['Date_Time'] <= out['Date_Time_y']).dropna()
# keeping only history strictly before the record's timestamp
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()
# aggregating values
output of out:
Date_Time name count mean
0 2021-09-06 10:46:00 Hollyhill Island 5 3.126
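One thing to note about this approach: the inner merge (and the subsequent dropna) silently drops individuals that have no strictly earlier history at all; if those records should survive with a count of 0, a left merge plus a fillna on the aggregates would be needed. The merge also materializes every record/history pair before filtering, which is worth keeping in mind for large tables.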

pandas groupby several criteria

I have a dataframe which contains every minute of a year, with Reserved and Used columns.
I need to simplify it to an hourly basis: keep only the hours of the year, with the maximum of the Reserved and Used columns for each hour.
I made this, which works, but not entirely for my purposes:
df = df.assign(date=df.date.dt.round('H'))
df1 = df.groupby('date').agg({'Reserved': ['max'], 'Used': ['max'] }).droplevel(1, axis=1).reset_index()
which just groups the minutes into hours.
date Reserved Used
0 2020-01-01 00:00:00 2176 0.0
1 2020-01-01 01:00:00 2176 0.0
2 2020-01-01 02:00:00 2176 0.0
3 2020-01-01 03:00:00 2176 0.0
4 2020-01-01 04:00:00 2176 0.0
... ... ... ...
8780 2020-12-31 20:00:00 3450 50.0
8781 2020-12-31 21:00:00 3450 0.0
8782 2020-12-31 22:00:00 3450 0.0
8783 2020-12-31 23:00:00 3450 0.0
8784 2021-01-01 00:00:00 3450 0.0
Now I need to group it further, to plot several curves containing only 24 points (one for every hour of the day), based on several criteria:
average Used and Reserved for the whole year (so grouping together every 00 hour, every 01 hour, etc.)
average Used and Reserved for every month (so grouping every 00 hour, 01 hour, etc. for each month individually)
average Used and Reserved for weekdays and for weekends
I know this is just a groupby similar to the one before, but I am somehow missing the logic of doing it.
Could anybody help?
Thanks.
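A minimal sketch of the three groupings, assuming df1 is the hourly frame built above with date as a datetime64 column (dt.hour, dt.month and dt.dayofweek extract the grouping keys from the timestamps):
# 1) whole year: mean per hour of day -> 24 rows
by_hour = df1.groupby(df1.date.dt.hour)[['Reserved', 'Used']].mean()

# 2) per month: mean per (month, hour of day) pair -> up to 12 x 24 rows
by_month_hour = df1.groupby([df1.date.dt.month, df1.date.dt.hour])[['Reserved', 'Used']].mean()

# 3) weekday vs weekend: dayofweek >= 5 marks Saturday and Sunday
is_weekend = df1.date.dt.dayofweek >= 5
by_weekend_hour = df1.groupby([is_weekend, df1.date.dt.hour])[['Reserved', 'Used']].mean()
Unstacking the multi-index results, e.g. by_month_hour.unstack(0), yields one column per month, ready to plot as one 24-point curve each.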

Calculate number of days from date time column to a specific date - pandas

I have a df as shown below.
df:
ID open_date limit
1 2020-06-03 100
1 2020-06-23 500
1 2019-06-29 300
1 2018-06-29 400
From the above I would like to calculate a column named age_in_days.
age_in_days is the number of days from open_date to 2020-06-30.
Expected output
ID open_date limit age_in_days
1 2020-06-03 100 27
1 2020-06-23 500 7
1 2019-06-29 300 367
1 2018-06-29 400 732
Make sure open_date is of datetime dtype, then subtract it from 2020-06-30:
df['open_date'] = pd.to_datetime(df.open_date)
df['age_in_days'] = (pd.Timestamp('2020-06-30') - df.open_date).dt.days
Output:
ID open_date limit age_in_days
0 1 2020-06-03 100 27
1 1 2020-06-23 500 7
2 1 2019-06-29 300 367
3 1 2018-06-29 400 732
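If open_date can also carry a time-of-day component, note that .dt.days floors the difference, so partial days are dropped. A variant that counts whole calendar days first normalizes the timestamps to midnight:
df['age_in_days'] = (pd.Timestamp('2020-06-30') - df.open_date.dt.normalize()).dt.days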