How to aggregate over date including all prior dates - sql

I am working with a table in Databricks Delta Lake. It gets new records appended every month. The field insert_dt indicates when the records were inserted.
| ID | Mrc | insert_dt |
|----|-----|------------|
| 1 | 40 | 2022-01-01 |
| 2 | 30 | 2022-01-01 |
| 3 | 50 | 2022-01-01 |
| 4 | 20 | 2022-02-01 |
| 5 | 45 | 2022-02-01 |
| 6 | 55 | 2022-03-01 |
Now I want to aggregate by insert_dt and calculate the average of Mrc, where each date's average covers not just that date's records but all records with an earlier date as well. In this example, the average covers 3 rows for 2022-01-01, 5 rows for 2022-02-01, and 6 rows for 2022-03-01. The expected results would look like this:
| Mrc | insert_dt |
|-----|------------|
| 40 | 2022-01-01 |
| 37 | 2022-02-01 |
| 40 | 2022-03-01 |
How do I write a query to do that?

I checked the documentation for Databricks Delta Lake (https://docs.databricks.com/sql/language-manual/sql-ref-window-functions.html) and its window functions look like T-SQL's, so I think the following will work for you, though you may need to tweak it slightly.
The approach is to condense each day to a single point and then use window functions to get the running totals. Note that any given day may have a different count, so you can't just average the averages.
--Enter the sample data you gave as a CTE for testing
;with cteSample as (
    SELECT * FROM ( VALUES
          (1, 40, CONVERT(date,'2022-01-01'))
        , (2, 30, CONVERT(date,'2022-01-01'))
        , (3, 50, CONVERT(date,'2022-01-01'))
        , (4, 20, CONVERT(date,'2022-02-01'))
        , (5, 45, CONVERT(date,'2022-02-01'))
        , (6, 55, CONVERT(date,'2022-03-01'))
    ) as TabA(ID, Mrc, insert_dt)
)--Solution begins here, find the total and count for each date
--because window can only handle a single "last row"
, cteGrouped as (
    SELECT insert_dt, SUM(Mrc) as MRCSum, COUNT(*) as MRCCount
    FROM cteSample
    GROUP BY insert_dt
)--Now use the window function to get the totals "up to today"
, cteTotals as (
    SELECT insert_dt
        , SUM(MRCSum) OVER (ORDER BY insert_dt RANGE
              BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcSum
        , SUM(MRCCount) OVER (ORDER BY insert_dt RANGE
              BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcCount
    FROM cteGrouped as G
) --Now divide out to get the average to date
SELECT insert_dt, MrcSum/MrcCount as MRCAverage --integer division; CAST to decimal if you need fractional averages
FROM cteTotals as T
This gives the following output:
| insert_dt  | MRCAverage |
|------------|------------|
| 2022-01-01 | 40         |
| 2022-02-01 | 37         |
| 2022-03-01 | 40         |
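Since the question is actually about Databricks, here is a minimal Spark SQL sketch of the same pre-aggregate-then-running-sum idea (the table name my_table is an assumption; note that Spark's / on integer sums returns a double, so the averages come back as 40.0, 37.0, 40.0):
-- Pre-aggregate per day, then divide the running sums
SELECT insert_dt,
       SUM(mrc_sum)   OVER (ORDER BY insert_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
     / SUM(mrc_count) OVER (ORDER BY insert_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS mrc_average
FROM (
    SELECT insert_dt, SUM(Mrc) AS mrc_sum, COUNT(*) AS mrc_count
    FROM my_table
    GROUP BY insert_dt
) daily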

Calculate a running average using a window function (the inner subquery) and then pick only one row per insert_dt - the one with the highest id. I only tested this on PostgreSQL 13, so I am not sure how much of the SQL standard Delta Lake supports and whether it will work there.
select mrc, insert_dt from
(
select avg(mrc) over (order by insert_dt, id) mrc, insert_dt,
row_number() over (partition by insert_dt order by id desc) rn
from the_table
) t
where rn = 1
order by insert_dt;
Update: If the_table has no id column, use a CTE to add one.
with t_id as (select *, row_number() over (order by insert_dt) id from the_table)
select mrc, insert_dt from
(
select avg(mrc) over (order by insert_dt, id) mrc, insert_dt,
row_number() over (partition by insert_dt order by id desc) rn
from t_id
) t
where rn = 1
order by insert_dt;
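For what it's worth, Databricks SQL does support avg() and row_number() as window functions, so both queries above should run there with little or no change.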

Related

How do I calculate unique users in each month but excluding the count of users if they existed in the 12 months prior?

Assume I have the following table: http://sqlfiddle.com/#!17/3de963/1
For each paidmonth in the table, I want to calculate the total users and total purchase_amt for that month and the 11 months prior (so 12 months in total, including the month of the current row).
I can calculate the total amount easily and I have done this by doing:
sum(purchase_amt) over (order by paidmonth asc rows 11 preceding)
However, when I try to do:
count(distinct user_id) over (order by paidmonth asc rows 11 preceding)
I get this error:
Window ORDER BY is not allowed if DISTINCT is specified
So this is the result I am hoping to get:
| paidmonth | total_unique_users | total_amount |
| ---------- | ------------------ | ------------ |
| 2020-10-01 | 1 | 20 |
| 2020-11-01 | 1 | 50 |
| 2020-12-01 | 1 | 100 |
| 2021-06-01 | 2 | 180 |
| 2022-03-01 | 2 | 85 |
| 2022-06-01 | 1 | 105 |
| 2022-10-01 | 2 | 175 |
If there are any additional columns you require, please let me know and I will help. The table I have shown in the link is a summary CTE.
The trick is to turn your list of users into an array and then do the calculation by unnesting it. Hope this helps.
with temp_table as (
select '2020-10-01' paidmonth, 23 user_id, 392 order_id, 20 purchase_amt union all
select '2020-11-01', 23, 406, 30 union all
select '2020-12-01', 23, 412, 50 union all
select '2021-06-01', 32, 467, 80 union all
select '2022-03-01', 87, 449, 5 union all
select '2022-06-01', 87, 512, 100 union all
select '2022-10-01', 87, 553, 50 union all
select '2022-10-01', 155, 583, 20
),
calcs AS (
  SELECT
    paidmonth,
    purchase_amt,
    ARRAY_AGG(user_id) OVER (ORDER BY paidmonth ASC ROWS 11 PRECEDING) AS last_11_unique_users
  FROM temp_table
)
SELECT
  paidmonth,
  (SELECT COUNT(DISTINCT users) FROM UNNEST(last_11_unique_users) AS users) AS total_unique_users,
  SUM(purchase_amt) OVER (ORDER BY paidmonth ASC ROWS 11 PRECEDING) AS total_amount
FROM calcs
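The per-row subquery over UNNEST is the workaround here: BigQuery rejects COUNT(DISTINCT ...) with a window ORDER BY, but counting the distinct elements of an aggregated array is allowed. Note that ROWS 11 PRECEDING counts physical rows, not months, so when a month has several rows (as 2022-10-01 does here) you would want to collapse to one row per month first, as the follow-up answer below does.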
Thank you to AlienDeg for the insights into how I should approach this. I think I finally got my answer through some trial and error, nearly following the solution here: Count unique ids in a rolling time frame
What I had to change was that I could not use unix_date(paidmonth) range between 11 preceding and current row as in the linked solution: unix_date counts days, so a range frame of 11 preceding spans only 11 days, which just gives the distinct count of users per month.
So I ended up doing:
with cte1 as (
  select
    paidmonth,
    string_agg(distinct user_id) users
  from tempt1 -- this is just a table of distinct months and user_ids, where user_ids have been cast as string
  group by paidmonth
),
cte2 as (
  select
    paidmonth,
    string_agg(users) over (order by paidmonth rows between 11 preceding and current row) users_12m
  from cte1
)
select
  paidmonth,
  (select count(distinct id) from unnest(split(users_12m)) as id) unique_12m
from cte2
order by 1 asc;
So instead of using unix_date and range, I just used order by month and rows.
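To illustrate the rows-versus-range distinction, here is a self-contained BigQuery sketch (the month_index values are made up):
with t as (
  select 1 month_index, 10 amt union all
  select 2, 20 union all
  select 14, 30
)
select month_index,
  -- range frames by value distance: 14 is more than 11 away from 1 and 2, so it sums only itself (30)
  sum(amt) over (order by month_index range between 11 preceding and current row) amt_range,
  -- rows frames by physical row count: the row for 14 also sees the two earlier rows (60)
  sum(amt) over (order by month_index rows between 11 preceding and current row) amt_rows
from t
order by month_index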

Calculating elapsed time in non-contiguous rows using SQL

I need to deduce uptime for servers using SQL with a table that looks as follows:
| Row | ID | Status | Timestamp  |
|-----|----|--------|------------|
| 1   | A1 | UP     | 1598451078 |
| 2   | A2 | UP     | 1598457488 |
| 3   | A3 | UP     | 1598457489 |
| 4   | A1 | DOWN   | 1598458076 |
| 5   | A3 | DOWN   | 1598461096 |
| 6   | A1 | UP     | 1598466510 |
In this example, A1 went down on Wed, 26 Aug 2020 16:07:56 and came back up on Wed, 26 Aug 2020 18:28:30. This means I need to find the difference between rows 6 and 4 using the ID field and display it as an additional column named "Uptime".
I have found several answers that explain how to use aliases and inner joins to calculate the difference between contiguous rows (e.g. How to get difference between two rows for a column field?), but none that explains how to do so for non-contiguous rows.
For example, this piece of code from https://www.mysqltutorial.org/mysql-tips/mysql-compare-calculate-difference-successive-rows/ gives a possible solution, but I don't know how to adapt it to compare the rows based on the ID field:
SELECT
g1.item_no,
g1.counted_date from_date,
g2.counted_date to_date,
(g2.qty - g1.qty) AS receipt_qty
FROM
inventory g1
INNER JOIN
inventory g2 ON g2.id = g1.id + 1
WHERE
g1.item_no = 'A';
Any help would be much appreciated.
Basically, you need the total time minus the downtime.
If you want the different periods, you can use:
select id, status, max(timestamp), min(timestamp),
       max(timestamp) - min(timestamp)
from (select t.*,
             -- partition by id so different servers' events don't interleave
             row_number() over (partition by id order by timestamp) as seqnum,
             row_number() over (partition by id, status order by timestamp) as seqnum2
      from t
     ) t
group by id, status, (seqnum - seqnum2);
However, for your purposes, for the total uptime:
select id,
       sum(coalesce(next_timestamp, max_uptimestamp) - timestamp) as uptime
from (select t.*,
             -- previous status and next event time for the same server
             lag(status) over (partition by id order by timestamp) as prev_status,
             lead(timestamp) over (partition by id order by timestamp) as next_timestamp,
             max(case when status = 'UP' then timestamp end) over (partition by id) as max_uptimestamp
      from t
     ) t
where status = 'UP' and
      (prev_status = 'DOWN' or prev_status is null)
group by id;
Basically, this takes each UP event that starts an up-period (the previous event was DOWN, or there was none) and counts the time from it to the next event, or to the last UP if no later event exists. It then sums that up per server.
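With the sample data, for example, A1's first UP at 1598451078 runs until the DOWN at 1598458076, contributing 6998 seconds; the final UP at 1598466510 has no later event yet, so it contributes nothing until one arrives.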

Get the previous amount of accounts?

I got a table like this:
| ID | Salary | Date       |
|----|--------|------------|
| A1 | $1000  | 2020-01-03 |
| A1 | $1300  | 2020-02-03 |
| A1 | $1500  | 2020-03-01 |
| A2 | $1300  | 2020-01-13 |
| A2 | $1500  | 2020-02-11 |
Expected output:
| ID | Salary | Previous Salary | Date       |
|----|--------|-----------------|------------|
| A1 | $1500  | $1300           | 2020-03-01 |
| A2 | $1500  | $1300           | 2020-02-11 |
How can I write a query that always returns each ID's latest salary together with the previous salary in another column?
You can combine the row_number and lag window functions to locate the latest salary for every id and return both it and the previous salary.
with cte as (
select id, salary,
row_number() over (partition by id order by date desc) as position,
lag(salary) over (partition by id order by date) as previous,
date
from payroll
)
select id, salary, previous, date
from cte
where position = 1 -- It's the first one because we ordered by date descendingly
Result:
ID Salary Previous Date
----- --------------------- --------------------- ----------
A1 1500,00 1300,00 2020-03-01
A2 1500,00 1300,00 2020-02-11
Online sample: http://sqlfiddle.com/#!18/770472/15/0
In SQL Server you can use the LAG Window Function to reference a field from the previous record in a partitioned set (within a specified data window)
with [Data] as
(
SELECT ID, Salary, Cast([Date] as Date) [Date]
FROM (VALUES
('A1', 1000, '2020-01-03'),
('A1',1300,'2020-02-03'),
('A1',1500,'2020-03-01'),
('A2',1300,'2020-01-13'),
('A2',1500,'2020-02-11')
) as t(ID,Salary,Date)
)
-- above is a simple demo dataset definition, your actual query is below
SELECT ID, Salary, LAG(Salary, 1)
OVER (
PARTITION BY [ID]
ORDER BY [Date]
) as [Previous_Salary], [Date]
FROM [Data]
ORDER BY [Data].[Date] DESC
Produces the following output:
ID Salary Previous_Salary Date
---- ----------- --------------- ----------
A1 1500 1300 2020-03-01
A2 1500 1300 2020-02-11
A1 1300 1000 2020-02-03
A2 1300 NULL 2020-01-13
A1 1000 NULL 2020-01-03
(5 rows affected)
Experiment with the ordering: note that the window uses ascending order while the display uses descending order.
Window functions create a virtual dataset outside of your current query; think of window functions as a way to execute correlated queries in parallel and merge in the results.
In many simple implementations, window functions like this should perform as well as or better than writing your own self-join or subquery logic. For example, this is an equivalent query using OUTER APPLY (CROSS APPLY would silently drop rows that have no previous salary):
SELECT ID, [Data].Salary, previous.Salary as [Previous_Salary], [Data].[Date]
FROM [Data]
OUTER APPLY (SELECT TOP 1 x.Salary
             FROM [Data] x
             WHERE x.[ID] = [Data].ID AND x.[Date] < [Data].[Date]
             ORDER BY x.[Date] DESC) as previous
ORDER BY [Data].[Date] DESC
The LAG syntax requires less code, clearly defines your intent and allows set-based execution and optimizations.
Other JOIN style queries will still be blocking queries as they will require the reversal of the original data-set (by forcing the entire set to be loaded in reverse order) and so will not offer a truly set-based or optimal approach.
SQL Server's developers realized that there is a genuine need for these types of queries and that, left to our own devices, we tend to create inefficient lookup queries; window functions were designed to offer a best-practice solution to these types of analytical queries.
This query could work for you (the previous salary has to be found by date order, not id order):
select *, LAG(salary) OVER (partition by id ORDER BY [date]) as previous from A
Try this (assumption: table name 'PAYROLL' in the 'PLS' schema):
SELECT R1.ID, R1.SALARY AS SALARY, R2.SALARY AS PREVIOUS_SALARY
FROM PLS.PAYROLL R1
LEFT OUTER JOIN PLS.PAYROLL R2
  ON R1.ID = R2.ID
 AND R2.SALARY_DATE = (SELECT MAX(SALARY_DATE) FROM PLS.PAYROLL
                       WHERE ID = R1.ID AND SALARY_DATE < R1.SALARY_DATE)
WHERE R1.SALARY_DATE = (SELECT MAX(SALARY_DATE) FROM PLS.PAYROLL WHERE ID = R1.ID)
ORDER BY R1.ID
You can use window functions and PIVOT to do this.
DECLARE @SampleData TABLE (ID VARCHAR(5), Salary MONEY, [Date] DATE)
INSERT INTO @SampleData VALUES
('A1', $1000, '2020-01-03'),
('A1', $1300, '2020-02-03'),
('A1', $1500, '2020-03-01'),
('A2', $1300, '2020-01-13'),
('A2', $1500, '2020-02-11')

SELECT ID, [1] Salary, [2] [Previous Salary], [Date]
FROM (SELECT ID, Salary, MAX([Date]) OVER(PARTITION BY ID) AS [Date],
             ROW_NUMBER() OVER(PARTITION BY ID ORDER BY [Date] DESC) RN
      FROM @SampleData) SRC
PIVOT (MAX(Salary) FOR RN IN ([1],[2])) PVT
ORDER BY ID
Result:
ID Salary Previous Salary Date
----- --------------------- --------------------- ----------
A1 1500,00 1300,00 2020-03-01
A2 1500,00 1300,00 2020-02-11

Query to find records that were created one after another in BigQuery

I am playing around with bigquery. Following input is given:
+---------------+---------+---------+--------+----------------------+
| customer | agent | value | city | timestamp |
+---------------+---------+---------+--------+----------------------+
| 1 | 1 | 106 | LA | 2019-02-12 03:05pm |
| 1 | 1 | 251 | LA | 2019-02-12 03:06pm |
| 3 | 2 | 309 | NY | 2019-02-12 06:41pm |
| 1 | 1 | 654 | LA | 2019-02-12 05:12pm |
+---------------+---------+---------+--------+----------------------+
I want to find transactions that were issued one after another (say, within 5 minutes) by one and the same agent. So the output for the above table should look like:
+---------------+---------+---------+--------+----------------------+
| customer | agent | value | city | timestamp |
+---------------+---------+---------+--------+----------------------+
| 1 | 1 | 106 | LA | 2019-02-12 03:05pm |
| 1 | 1 | 251 | LA | 2019-02-12 03:06pm |
+---------------+---------+---------+--------+----------------------+
The query needs to somehow group by agent and find such transactions, although, as you can see from the output, the result is not really grouped. My first thought was to use the LEAD function, but I am not sure. Do you have any ideas?
Ideas for a query:
- sort by agent and timestamp DESC
- start with the first row, look at the following row (using LEAD?)
- check if the timestamp difference is less than 5 minutes
- if so, these two rows should be in the output
- continue with the next (2nd) row
When the 2nd and 3rd rows match the criteria too, the 2nd row would get into the output twice, which would cause duplicate rows. I am not sure how to avoid that yet.
There must be an easier way, but does this achieve what you are after?
WITH CTE AS (
  SELECT customer, agent, value, city, timestamp
  FROM `project.dataset.yourtable` -- source table with the columns shown above
),
CTE2 AS (
  SELECT customer, agent, value, city, timestamp,
    lead(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lead,
    lead(customer,1) OVER (PARTITION BY agent ORDER BY timestamp) customer_lead,
    lead(value,1) OVER (PARTITION BY agent ORDER BY timestamp) value_lead,
    lead(city,1) OVER (PARTITION BY agent ORDER BY timestamp) city_lead,
    lag(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lag
  FROM CTE
)
SELECT agent,
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(customer as string),', ',cast(customer_lead as string)), cast(customer as string)) customer,
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(value as string),', ',cast(value_lead as string)), cast(value as string)) value,
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(city as string),', ',cast(city_lead as string)), cast(city as string)) cities,
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(timestamp as string),', ',cast(timestamp_lead as string)), cast(timestamp as string)) timestamps
FROM CTE2
WHERE (timestamp_diff(timestamp_lead,timestamp,MINUTE)<5 OR NOT timestamp_diff(timestamp,timestamp_lag,MINUTE)<5)
Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM (
SELECT *,
IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5,
LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts),
NULL).*
FROM `project.dataset.yourtable`
)
WHERE NOT next_customer IS NULL
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 customer, 1 agent, 106 value,'LA' city, '2019-02-12 03:05pm' ts UNION ALL
SELECT 1, 1, 251,'LA', '2019-02-12 03:06pm' UNION ALL
SELECT 3, 2, 309,'NY', '2019-02-12 06:41pm' UNION ALL
SELECT 1, 1, 654,'LA', '2019-02-12 05:12pm'
), temp AS (
SELECT customer, agent, value, city, PARSE_TIMESTAMP('%Y-%m-%d %I:%M%p', ts) ts
FROM `project.dataset.table`
)
SELECT * FROM (
SELECT *,
IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5,
LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts),
NULL).*
FROM temp
)
WHERE NOT next_customer IS NULL
-- ORDER BY ts
with result
Row customer agent value city ts next_customer next_value
1 1 1 106 LA 2019-02-12 15:05:00 UTC 1 251

get the id based on condition in group by

I'm trying to create a SQL query to merge rows with equal dates, based on the highest amount of hours, so that in the end I get the corresponding id for each date with the highest amount of hours. I've been trying to do it with a simple GROUP BY, but that does not seem to work, since I can't just put an aggregate function on the id column; it should be chosen based on the hours condition.
| id | date       | hours |
|----|------------|-------|
| 1  | 2012-01-01 | 37    |
| 2  | 2012-01-01 | 10    |
| 3  | 2012-01-01 | 5     |
| 4  | 2012-01-02 | 37    |
Desired result:
| id | date       | hours |
|----|------------|-------|
| 1  | 2012-01-01 | 37    |
| 4  | 2012-01-02 | 37    |
If you want exactly one row -- even if there are ties -- then use row_number():
select t.*
from (select t.*, row_number() over (partition by date order by hours desc) as seqnum
from t
) t
where seqnum = 1;
Ironically, both Postgres and Oracle (the original tags) have what I would consider to be better ways of doing this, but they are quite different.
Postgres:
select distinct on (date) t.*
from t
order by date, hours desc;
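In the Postgres version, distinct on (date) keeps only the first row for each date under the given sort order, so ordering by hours desc within each date keeps the row with the most hours.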
Oracle:
select date, max(hours) as hours,
       max(id) keep (dense_rank first order by hours desc) as id
from t
group by date;
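The keep (dense_rank first order by hours desc) clause restricts the aggregate to the rows ranked first by hours, so max(id) simply breaks ties by taking the highest id among them.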
Here's one approach using row_number:
select id, dt, hours
from (
select id, dt, hours, row_number() over (partition by dt order by hours desc) rn
from yourtable
) t
where rn = 1
You can use a correlated subquery approach:
select t.*
from the_table t
where t.id = (select t1.id
              from the_table t1
              where t1.date = t.date
              order by t1.hours desc
              limit 1);
In Oracle (12c and later), you can use fetch first 1 row only in the subquery instead of the LIMIT clause.
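A minimal sketch of that Oracle variant, with the_table and dt as stand-ins (date itself is a reserved word in Oracle, so the column would need a different name or quoting):
select t.*
from the_table t
where t.id = (select t1.id
              from the_table t1
              where t1.dt = t.dt
              order by t1.hours desc
              fetch first 1 row only);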