Sum column values over a window based on variable date range (impala) - sql

Given a table as follows :
client_id date connections
---------------------------------------
121438297 2018-01-03 0
121438297 2018-01-08 1
121438297 2018-01-10 3
121438297 2018-01-12 1
121438297 2018-01-19 7
363863811 2018-01-18 0
363863811 2018-01-30 5
363863811 2018-02-01 4
363863811 2018-02-10 0
I am looking for an efficient way to sum the number of connections that occur within x number of days following the current row (the current row being included in the sum), partitioned by client_id.
If x=6 then the output table would result in :
client_id date connections connections_within_6_days
---------------------------------------------------------------------
121438297 2018-01-03 0 1
121438297 2018-01-08 1 5
121438297 2018-01-10 3 4
121438297 2018-01-12 1 1
121438297 2018-01-19 7 7
363863811 2018-01-18 0 0
363863811 2018-01-30 5 9
363863811 2018-02-01 4 4
363863811 2018-02-10 0 0
Concerns :
I do not want to add all missing dates and then perform a sliding window counting the x following rows because my table is already extremely large.
I am using Impala and the range between interval 'x' days following and current row is not supported.

The generic solution is a bit troublesome for multiple periods, but you can use multiple CTEs to support that. The idea is to "unpivot" the counts based on when they go in and out and then use a cumulative sum.
So:
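with conn as (
    -- a row starts counting 6 days before its own date ...
    select client_id, date - interval 6 day as date, connections
    from t
    union all
    -- ... and stops counting the day after its own date
    select client_id, date + interval 1 day, -connections
    from t
    union all
    -- zero-valued marker so that every original date appears in conn
    select client_id, date, 0
    from t
),
conn1 as (
    select client_id, date,
           sum(sum(connections)) over (partition by client_id order by date) as connections_within_6_days
    from conn
    group by client_id, date
)
select t.*, conn1.connections_within_6_days
from t join
     conn1
     on conn1.client_id = t.client_id and
        conn1.date = t.date;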
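The in/out bookkeeping can be checked against the sample data. This is only a minimal sketch: it assumes SQLite 3.25+ (through Python's sqlite3 module) as a stand-in for Impala, and SQLite's date() modifiers in place of interval arithmetic. Each row's count enters the running total 6 days before its own date and leaves the day after, so the total read at a row's date covers that date plus the 6 following days. The intermediate CTE name daily is an illustrative choice, not part of the original answer.

```python
import sqlite3

rows = [
    (121438297, "2018-01-03", 0),
    (121438297, "2018-01-08", 1),
    (121438297, "2018-01-10", 3),
    (121438297, "2018-01-12", 1),
    (121438297, "2018-01-19", 7),
    (363863811, "2018-01-18", 0),
    (363863811, "2018-01-30", 5),
    (363863811, "2018-02-01", 4),
    (363863811, "2018-02-10", 0),
]

con = sqlite3.connect(":memory:")
con.execute("create table t (client_id integer, date text, connections integer)")
con.executemany("insert into t values (?, ?, ?)", rows)

query = """
with conn as (
    -- the count enters the running total 6 days before the row's date
    select client_id, date(date, '-6 days') as date, connections from t
    union all
    -- and leaves it the day after the row's date
    select client_id, date(date, '+1 day'), -connections from t
    union all
    -- zero-valued marker so every original date exists as an event
    select client_id, date, 0 from t
),
daily as (
    -- net change per client and day
    select client_id, date, sum(connections) as delta
    from conn
    group by client_id, date
),
conn1 as (
    -- cumulative sum of the net changes
    select client_id, date,
           sum(delta) over (partition by client_id order by date)
               as connections_within_6_days
    from daily
)
select t.client_id, t.date, t.connections, conn1.connections_within_6_days
from t
join conn1 on conn1.client_id = t.client_id and conn1.date = t.date
order by t.client_id, t.date
"""

result = con.execute(query).fetchall()
for row in result:
    print(row)
```

The event table is three times the size of t, which is still far smaller than a table densified with every missing date.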

Related

SQL lag to row which meets condition

I have a table which contains measures taken on random dates, partitioned by the site at which they were taken.
site date measurement
-------------------------------
AB1234 2022-12-09 1
AB1234 2022-06-11 2
AB1234 2019-05-22 3
AB1234 2017-01-30 4
CD5678 2022-11-01 5
CD5678 2020-04-10 6
CD5678 2017-04-10 7
CD5678 2017-01-22 8
In order to calculate a year on year growth, I want to have an additional field for each record which contains the previous measurement at that site. The challenging part is that I only want the previous which occurred more than a year in the past.
Like so:
site date measurement previous_measurement
----------------------------------------------------
AB1234 2022-12-09 1 3
AB1234 2022-06-11 2 3
AB1234 2019-05-22 3 4
AB1234 2017-01-30 4 NULL
CD5678 2022-11-01 5 6
CD5678 2020-04-10 6 7
CD5678 2017-04-10 7 NULL
CD5678 2017-01-22 8 NULL
It feels like it should be possible with a window function, but I can't work it out.
Please help :(
Amazon Athena engine version 3 is based on Trino. If it fully supports the RANGE frame type for window functions, you can use that:
-- sample data
with dataset(site, date, measurement) as (
values ('AB1234', date '2022-12-09', 1),
('AB1234', date '2022-06-11', 2),
('AB1234', date '2019-05-22', 3),
('AB1234', date '2017-01-30', 4),
('CD5678', date '2022-11-01', 5),
('CD5678', date '2020-04-10', 6),
('CD5678', date '2017-04-10', 7),
('CD5678', date '2017-01-22', 8)
)
-- query
select *,
last_value(measurement) over (
partition by site
order by date
RANGE BETWEEN UNBOUNDED PRECEDING AND interval '1' year PRECEDING)
from dataset;
Output:
site date measurement _col3
-------------------------------------
CD5678 2017-01-22 8 NULL
CD5678 2017-04-10 7 NULL
CD5678 2020-04-10 6 7
CD5678 2022-11-01 5 6
AB1234 2017-01-30 4 NULL
AB1234 2019-05-22 3 4
AB1234 2022-06-11 2 3
AB1234 2022-12-09 1 3
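On an engine without RANGE frames that take interval offsets, the same lookup can be expressed as a correlated subquery instead of a window function. The sketch below is a different technique from the answer above, shown only as a fallback. It assumes SQLite through Python's sqlite3 module, ISO-formatted date strings, and the same inclusive one-year bound as the RANGE frame.

```python
import sqlite3

rows = [
    ("AB1234", "2022-12-09", 1),
    ("AB1234", "2022-06-11", 2),
    ("AB1234", "2019-05-22", 3),
    ("AB1234", "2017-01-30", 4),
    ("CD5678", "2022-11-01", 5),
    ("CD5678", "2020-04-10", 6),
    ("CD5678", "2017-04-10", 7),
    ("CD5678", "2017-01-22", 8),
]

con = sqlite3.connect(":memory:")
con.execute("create table dataset (site text, date text, measurement integer)")
con.executemany("insert into dataset values (?, ?, ?)", rows)

query = """
select d.site, d.date, d.measurement,
       -- latest measurement at the same site taken at least a year earlier
       (select p.measurement
        from dataset p
        where p.site = d.site
          and p.date <= date(d.date, '-1 year')
        order by p.date desc
        limit 1) as previous_measurement
from dataset d
order by d.site, d.date desc
"""

result = con.execute(query).fetchall()
for row in result:
    print(row)
```

A correlated subquery is usually slower than a window frame on large tables, so it is worth trying the RANGE version first where the engine accepts it.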

SQL Select up to a certain sum

I have been trying to figure out a way to write a SQL script to select a given sum, and would appreciate any ideas given to do so.
I am trying to do a stock valuation based on the dates of goods received. At month-end closing, the value of my stocks remaining in the warehouse would be a specified sum of the last received goods.
The below query is done by a couple of unions but reduces to:
SELECT DATE, W1 FROM Table
ORDER BY DATE DESC
Query result:
Row DATE W1
1 2019-02-28 00:00:00 13250
2 2019-02-28 00:00:00 42610
3 2019-02-28 00:00:00 41170
4 2019-02-28 00:00:00 13180
5 2019-02-28 00:00:00 20860
6 2019-02-28 00:00:00 19870
7 2019-02-28 00:00:00 37780
8 2019-02-28 00:00:00 47210
9 2019-02-28 00:00:00 32000
10 2019-02-28 00:00:00 41930
I have thought about solving this issue by calculating a cumulative sum as follows:
Row DATE W1 Cumulative Sum
1 2019-02-28 00:00:00 13250 13250
2 2019-02-28 00:00:00 42610 55860
3 2019-02-28 00:00:00 41170 97030
4 2019-02-28 00:00:00 13180 110210
5 2019-02-28 00:00:00 20860 131070
6 2019-02-28 00:00:00 19870 150940
7 2019-02-28 00:00:00 37780 188720
8 2019-02-28 00:00:00 47210 235930
9 2019-02-28 00:00:00 32000 267930
10 2019-02-28 00:00:00 41930 309860
However, I am stuck when figuring out a way to use a parameter to return only the rows of interest.
For example, if a parameter was specified as '120000', it would return the rows where the cumulative sum is exactly 120000.
Row DATE W1 Cumulative Sum W1_Select
1 2019-02-28 00:00:00 13250 13250 13250
2 2019-02-28 00:00:00 42610 55860 42610
3 2019-02-28 00:00:00 41170 97030 41170
4 2019-02-28 00:00:00 13180 110210 13180
5 2019-02-28 00:00:00 20860 131070 9790
----------
Total 120000
This just requires some arithmetic:
select t.*,
       (case when running_sum <= #threshold then w1
             -- last row: take only what is still missing to reach the threshold
             else #threshold - (running_sum - w1)
        end) as w1_select
from (select date, w1, sum(w1) over (order by date) as running_sum
      from t
     ) t
where running_sum - w1 < #threshold;
Actually, in your case, the dates are all the same. That is a bit counter-intuitive, but you need to use the row for this to work:
select t.*,
       (case when running_sum <= #threshold then w1
             -- last row: take only what is still missing to reach the threshold
             else #threshold - (running_sum - w1)
        end) as w1_select
from (select date, w1, sum(w1) over (order by row) as running_sum
      from t
     ) t
where running_sum - w1 < #threshold;
Here is a db<>fiddle.
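The arithmetic can be checked on the ten sample rows. This is a sketch under a few assumptions: SQLite 3.25+ through Python's sqlite3 module, a named parameter :threshold in place of the placeholder used above, and a column called row_no (a hypothetical name) standing in for the Row column. A row is kept while the running sum before it is still below the threshold, and the last kept row contributes only the remainder.

```python
import sqlite3

w1_values = [13250, 42610, 41170, 13180, 20860,
             19870, 37780, 47210, 32000, 41930]

con = sqlite3.connect(":memory:")
con.execute("create table t (row_no integer, date text, w1 integer)")
con.executemany(
    "insert into t values (?, ?, ?)",
    [(i, "2019-02-28 00:00:00", w) for i, w in enumerate(w1_values, start=1)],
)

query = """
select row_no, w1, running_sum,
       case when running_sum <= :threshold then w1
            -- partial take: threshold minus everything summed before this row
            else :threshold - (running_sum - w1)
       end as w1_select
from (select row_no, w1,
             sum(w1) over (order by row_no) as running_sum
      from t) s
where running_sum - w1 < :threshold
order by row_no
"""

result = con.execute(query, {"threshold": 120000}).fetchall()
for row in result:
    print(row)

total_selected = sum(r[3] for r in result)
print("Total", total_selected)
```

Ordering by a unique column matters here: with ORDER BY date and identical dates, the default window frame treats all ten rows as peers and gives each of them the grand total.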

SQL Collapse Data

I am trying to collapse data that is in a sequence sorted by date, while grouping on the person and the type.
The data is stored in SQL Server and looks like the following -
seq person date type
--- ------ ------------------- ----
1 1 2018-02-10 08:00:00 1
2 1 2018-02-11 08:00:00 1
3 1 2018-02-12 08:00:00 1
4 1 2018-02-14 16:00:00 1
5 1 2018-02-15 16:00:00 1
6 1 2018-02-16 16:00:00 1
7 1 2018-02-20 08:00:00 2
8 1 2018-02-21 08:00:00 2
9 1 2018-02-22 08:00:00 2
10 1 2018-02-23 08:00:00 1
11 1 2018-02-24 08:00:00 1
12 1 2018-02-25 08:00:00 2
13 2 2018-02-10 08:00:00 1
14 2 2018-02-11 08:00:00 1
15 2 2018-02-12 08:00:00 1
16 2 2018-02-14 16:00:00 3
17 2 2018-02-15 16:00:00 3
18 2 2018-02-16 16:00:00 3
This data set contains about 1.2 million records that resemble the above.
The result that I would like to get from this would be -
person start type
------ ------------------- ----
1 2018-02-10 08:00:00 1
1 2018-02-20 08:00:00 2
1 2018-02-23 08:00:00 1
1 2018-02-25 08:00:00 2
2 2018-02-10 08:00:00 1
2 2018-02-14 16:00:00 3
I have the data in the first format by running the following query -
select
ROW_NUMBER() OVER (ORDER BY date) AS seq,
person,
date,
type
from table
group by person, date, type
I am just not sure how to keep the minimum date with the other distinct values from person and type.
This is a gaps-and-islands problem, so you can take the difference of two row_number() values and group by it:
select person, min(date) as start, type
from (select *,
row_number() over (partition by person order by seq) seq1,
row_number() over (partition by person, type order by seq) seq2
from table
) t
group by person, type, (seq1 - seq2)
order by person, start;
The correct solution using the difference of row numbers is:
select person, type, min(date) as start
from (select t.*,
row_number() over (partition by person order by seq) as seqnum_p,
row_number() over (partition by person, type order by seq) as seqnum_pt
from t
) t
group by person, type, (seqnum_p - seqnum_pt)
order by person, start;
type needs to be included in the GROUP BY.
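To see why the difference of the two row numbers identifies an island, the query can be run on the sample rows. This is a minimal sketch assuming SQLite 3.25+ through Python's sqlite3 module and a table named t. Within a run of equal type, both counters advance together, so their difference stays constant; when the type changes, the per-type counter falls behind and the difference shifts.

```python
import sqlite3

rows = [
    (1, 1, "2018-02-10 08:00:00", 1),
    (2, 1, "2018-02-11 08:00:00", 1),
    (3, 1, "2018-02-12 08:00:00", 1),
    (4, 1, "2018-02-14 16:00:00", 1),
    (5, 1, "2018-02-15 16:00:00", 1),
    (6, 1, "2018-02-16 16:00:00", 1),
    (7, 1, "2018-02-20 08:00:00", 2),
    (8, 1, "2018-02-21 08:00:00", 2),
    (9, 1, "2018-02-22 08:00:00", 2),
    (10, 1, "2018-02-23 08:00:00", 1),
    (11, 1, "2018-02-24 08:00:00", 1),
    (12, 1, "2018-02-25 08:00:00", 2),
    (13, 2, "2018-02-10 08:00:00", 1),
    (14, 2, "2018-02-11 08:00:00", 1),
    (15, 2, "2018-02-12 08:00:00", 1),
    (16, 2, "2018-02-14 16:00:00", 3),
    (17, 2, "2018-02-15 16:00:00", 3),
    (18, 2, "2018-02-16 16:00:00", 3),
]

con = sqlite3.connect(":memory:")
con.execute("create table t (seq integer, person integer, date text, type integer)")
con.executemany("insert into t values (?, ?, ?, ?)", rows)

query = """
select person, min(date) as start, type
from (select t.*,
             -- position within the person
             row_number() over (partition by person order by seq) as seqnum_p,
             -- position within the person and type
             row_number() over (partition by person, type order by seq) as seqnum_pt
      from t) x
group by person, type, (seqnum_p - seqnum_pt)
order by person, start
"""

result = con.execute(query).fetchall()
for row in result:
    print(row)
```

Grouping by type as well as the difference is what keeps two islands of different types apart if they happen to share the same difference.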

PostgreSQL - rank over rows listed in blocks of 0 and 1

I have a table that looks like:
id code date1 date2 block
--------------------------------------------------
20 1234 2017-07-01 2017-07-31 1
15 1234 2017-06-01 2017-06-30 1
13 1234 2017-05-01 2017-05-31 0
11 1234 2017-03-01 2017-03-31 0
9 1234 2017-02-01 2017-02-28 1
8 1234 2017-01-01 2017-01-31 0
7 1234 2016-11-01 2016-11-30 0
6 1234 2016-10-01 2016-10-31 1
2 1234 2016-09-01 2016-09-30 1
I need to rank the rows according to the blocks of 0's and 1's, like:
id code date1 date2 block desired_rank
-------------------------------------------------------------------
20 1234 2017-07-01 2017-07-31 1 1
15 1234 2017-06-01 2017-06-30 1 1
13 1234 2017-05-01 2017-05-31 0 2
11 1234 2017-03-01 2017-03-31 0 2
9 1234 2017-02-01 2017-02-28 1 3
8 1234 2017-01-01 2017-01-31 0 4
7 1234 2016-11-01 2016-11-30 0 4
6 1234 2016-10-01 2016-10-31 1 5
2 1234 2016-09-01 2016-09-30 1 5
I've tried to use rank() and dense_rank(), but the result I end up with is:
id code date1 date2 block dense_rank()
-------------------------------------------------------------------
20 1234 2017-07-01 2017-07-31 1 1
15 1234 2017-06-01 2017-06-30 1 2
13 1234 2017-05-01 2017-05-31 0 1
11 1234 2017-03-01 2017-03-31 0 2
9 1234 2017-02-01 2017-02-28 1 3
8 1234 2017-01-01 2017-01-31 0 3
7 1234 2016-11-01 2016-11-30 0 4
6 1234 2016-10-01 2016-10-31 1 4
2 1234 2016-09-01 2016-09-30 1 5
In the last table, the rank doesn't care about the rows, it just takes all the 1's and 0's as a unit and sets an ascending count starting at the first 1 and 0.
My query goes like this:
CREATE TEMP TABLE data (id integer,code text, date1 date, date2 date, block integer);
INSERT INTO data VALUES
(20,'1234', '2017-07-01','2017-07-31',1),
(15,'1234', '2017-06-01','2017-06-30',1),
(13,'1234', '2017-05-01','2017-05-31',0),
(11,'1234', '2017-03-01','2017-03-31',0),
(9, '1234', '2017-02-01','2017-02-28',1),
(8, '1234', '2017-01-01','2017-01-31',0),
(7, '1234', '2016-11-01','2016-11-30',0),
(6, '1234', '2016-10-01','2016-10-31',1),
(2, '1234', '2016-09-01','2016-09-30',1);
SELECT *,dense_rank() OVER (PARTITION BY code,block ORDER BY date2 DESC)
FROM data
ORDER BY date2 DESC;
By the way, the database is PostgreSQL.
I hope there's a workaround... Thanks :)
Edit: Note that the blocks of 0's and 1's aren't equal.
A single window function cannot produce this result, but two nested ones can: flag the first row of each block with LAG, then take a running sum of the flags:
SELECT *,
Sum(flag) -- now sum the 0/1 to create the "rank"
Over (PARTITION BY code
ORDER BY date2 DESC)
FROM
(
SELECT *,
CASE
WHEN Lag(block) -- check if this is the 1st row of a new block
Over (PARTITION BY code
ORDER BY date2 DESC) = block
THEN 0
ELSE 1
END AS flag
FROM DATA
) AS dt
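The flag-then-sum idea can be traced on the sample rows. This is a sketch assuming SQLite 3.25+ through Python's sqlite3 module rather than PostgreSQL; the column alias block_rank is an illustrative name. The first row has no predecessor, so LAG returns NULL, the comparison is not true, and the row is flagged as the start of a block.

```python
import sqlite3

rows = [
    (20, "1234", "2017-07-01", "2017-07-31", 1),
    (15, "1234", "2017-06-01", "2017-06-30", 1),
    (13, "1234", "2017-05-01", "2017-05-31", 0),
    (11, "1234", "2017-03-01", "2017-03-31", 0),
    (9, "1234", "2017-02-01", "2017-02-28", 1),
    (8, "1234", "2017-01-01", "2017-01-31", 0),
    (7, "1234", "2016-11-01", "2016-11-30", 0),
    (6, "1234", "2016-10-01", "2016-10-31", 1),
    (2, "1234", "2016-09-01", "2016-09-30", 1),
]

con = sqlite3.connect(":memory:")
con.execute(
    "create table data (id integer, code text, date1 text, date2 text, block integer)"
)
con.executemany("insert into data values (?, ?, ?, ?, ?)", rows)

query = """
select id, block,
       -- running sum of the block-start flags gives the block number
       sum(flag) over (partition by code order by date2 desc) as block_rank
from (select *,
             -- 1 on the first row of a new block, 0 otherwise
             case when lag(block) over (partition by code order by date2 desc) = block
                  then 0 else 1
             end as flag
      from data) dt
order by date2 desc
"""

result = con.execute(query).fetchall()
for row in result:
    print(row)
```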

can I join to all rows in a reference table as a kind of template?

I am working on an app that has some scheduling functionality. As part of this, I need to select a list of people and whether or not they have something scheduled in a certain period of the week.
The number of periods in the week are variable so they are stored in a reference table, for example:
Period Reference Table
id name start end day
------------------------------------------------------------------
1 Morning 1900-01-01 4:00:00 1900-01-01 11:00:00 MON
2 Afternoon 1900-01-01 14:00:00 1900-01-01 20:00:00 MON
3 Night 1900-01-01 20:00:00 1900-01-01 24:00:00 MON
4 Morning 1900-01-01 4:00:00 1900-01-01 11:00:00 TUE
5 Afternoon 1900-01-01 14:00:00 1900-01-01 20:00:00 TUE
6 Night 1900-01-01 20:00:00 1900-01-01 24:00:00 TUE
I also have a "person reference" table and a "person schedule" table.
The "person schedule" table only stores a record if a person has something scheduled in a period, here is a simplified example:
Person Schedule Table
id person_id period_id
------------------------------
1 1 2
2 1 3
3 2 2
4 2 3
5 2 4
Now I need to select from these three tables a full list of periods for each person and whether they have a schedule record in the period or not (1 or 0). In other words, I need to get this resultset for the example data above:
person_id period_id is_scheduled
----------------------------------------
1 1 0
1 2 1
1 3 1
1 4 0
1 5 0
1 6 0
2 1 0
2 2 1
2 3 1
2 4 1
2 5 0
2 6 0
Is it possible to do this with a select statement without getting into dynamic SQL?
This is all done using SQL Server 2000
SELECT
people.person_id, period_id = periods.id, is_scheduled = CASE
WHEN schedule.person_id IS NOT NULL THEN 1 ELSE 0 END
FROM dbo.[period reference table] AS periods
CROSS JOIN
(
SELECT person_id
FROM dbo.[person schedule table]
GROUP BY person_id
) AS people
LEFT OUTER JOIN
dbo.[person schedule table] AS schedule
ON people.person_id = schedule.person_id
AND periods.id = schedule.period_id
ORDER BY
people.person_id, periods.id;
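The template pattern (cross join every person with every period, then left join the actual schedule) can be checked on the sample data. This is a sketch assuming SQLite through Python's sqlite3 module instead of SQL Server 2000, with the hypothetical table names periods and schedule and only the columns the query needs.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table periods (id integer, name text, day text)")
con.execute("create table schedule (id integer, person_id integer, period_id integer)")
con.executemany(
    "insert into periods values (?, ?, ?)",
    [
        (1, "Morning", "MON"), (2, "Afternoon", "MON"), (3, "Night", "MON"),
        (4, "Morning", "TUE"), (5, "Afternoon", "TUE"), (6, "Night", "TUE"),
    ],
)
con.executemany(
    "insert into schedule values (?, ?, ?)",
    [(1, 1, 2), (2, 1, 3), (3, 2, 2), (4, 2, 3), (5, 2, 4)],
)

query = """
select people.person_id,
       periods.id as period_id,
       case when schedule.person_id is not null then 1 else 0 end as is_scheduled
from periods
-- one template row per (person, period) combination
cross join (select person_id from schedule group by person_id) as people
-- attach the schedule record where one exists
left join schedule
       on people.person_id = schedule.person_id
      and periods.id = schedule.period_id
order by people.person_id, periods.id
"""

result = con.execute(query).fetchall()
for row in result:
    print(row)
```

Because the list of people is derived from the schedule table, a person with nothing scheduled at all would not appear; reading person_id from the person reference table instead would presumably cover that case.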