Redshift aggregate grouped data by date range - sql

I have the following table that contains quantities of items per day.
ID  Date        Item   Count
-----------------------------
1   2022-01-01  Milk    10
2   2022-01-11  Milk    20
3   2022-01-12  Milk    10
4   2022-01-15  Milk    12
5   2022-01-16  Milk    10
6   2022-01-02  Bread   20
7   2022-01-03  Bread   22
8   2022-01-05  Bread   24
9   2022-01-08  Bread   20
10  2022-01-12  Bread   10
I want to aggregate (sum, avg, ...) the quantity per item for the last 7 days (or 14, 28 days). The expected outcome would look like this table:
ID  Date        Item   Count  Sum_7d
-------------------------------------
1   2022-01-01  Milk    10    10
2   2022-01-11  Milk    20    20
3   2022-01-12  Milk    10    30
4   2022-01-15  Milk    12    42
5   2022-01-16  Milk    10    52
6   2022-01-02  Bread   20    20
7   2022-01-03  Bread   22    42
8   2022-01-05  Bread   24    66
9   2022-01-08  Bread   20    86
10  2022-01-12  Bread   10    30
My first approach was using Redshift window functions, like this:
SELECT *, SUM(Count) OVER (PARTITION BY Item
                           ORDER BY Date
                           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Sum_7d
FROM my_table
but it does not give the expected results because there are missing dates and I could not figure out how to put a condition on the time range.
My fallback solution is a cross product, but that's not desirable because it is inefficient for large data.
SELECT l.Date, l.Item, l.Count, sum(r.Count) as Sum_7d
FROM my_table l,
     my_table r
WHERE l.Date - r.Date < 7
  AND l.Date - r.Date >= 0
  AND l.Item = r.Item
GROUP BY 1, 2, 3
Is there any efficient and concise way to do such an aggregation on date ranges in Redshift?
Related:
Can I put a condition on a window function in Redshift?
Redshift SQL Window Function frame_clause with days

This is a missing data problem, and a common way to "fill in the blanks" is with a cross join. You correctly point out that this can get very expensive, because the cross joining (usually) massively expands the data being worked upon AND because Redshift isn't great at creating data. But you do have to fill in the missing data. The best way I have found is to create the (near) minimum data set that will complete the data, then UNION this data to the original table. The code below follows this path.
There is a way to do this without adding rows, but the SQL is large, inflexible, error prone, and just plain ugly. You could create new columns (date and count) based on LAG(6), LAG(5), LAG(4) ... and compare the date of each, using the count only if the date is truly in range. If you want to sum a different date look-back you need to add columns, and things get uglier. Also, this will only be faster than the code below in certain circumstances (very few repeats per item). It just trades making new data in rows for making new data in columns. So don't go this way unless absolutely necessary.
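Just to make the idea concrete, here's a truncated sketch of that column-based variant (illustration only; it uses the test table defined below and shows just two look-back steps, where a full 7-day window would repeat the pattern through LAG(6)):
SELECT id, dt, item, cnt,
       cnt
       + COALESCE(CASE WHEN dt - LAG(dt, 1) OVER (PARTITION BY item ORDER BY dt) <= 6
                       THEN LAG(cnt, 1) OVER (PARTITION BY item ORDER BY dt) END, 0)
       + COALESCE(CASE WHEN dt - LAG(dt, 2) OVER (PARTITION BY item ORDER BY dt) <= 6
                       THEN LAG(cnt, 2) OVER (PARTITION BY item ORDER BY dt) END, 0)
       -- ...repeat through LAG(6) for a full 7-day look-back...
       AS sum_7d
FROM test;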
Now to what I think will work for you. You need a dummy row for every date and item combination that doesn't already exist. This is the minimal set of new data that will make your window function work. In reality I make all the combinations of date and item and merge these with the existing rows - a slight compromise from the ideal.
First let's set up your data. I changed some names as using reserved words for column names is not ideal.
create table test (ID int, dt date, Item varchar(16), Cnt int);
insert into test values
(1, '2022-01-01', 'Milk', 10),
(2, '2022-01-11', 'Milk', 20),
(3, '2022-01-12', 'Milk', 10),
(4, '2022-01-15', 'Milk', 12),
(5, '2022-01-16', 'Milk', 10),
(6, '2022-01-02', 'Bread', 20),
(7, '2022-01-03', 'Bread', 22),
(8, '2022-01-05', 'Bread', 24),
(9, '2022-01-08', 'Bread', 20),
(10, '2022-01-12', 'Bread', 10);
The SQL for generating what you want is:
with recursive dates(dt) as
( select min(dt) as dt
  from test
  union all
  select dt + 1
  from dates d
  where d.dt <= current_date
)
select *
from (
    SELECT *, SUM(Cnt) OVER (PARTITION BY Item
                             ORDER BY Dt
                             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Sum_7d
    FROM (
        select min(id) as id, dt, item, sum(cnt) as cnt
        from (
            select *
            from test
            union all
            select NULL as id, dt, item, NULL as cnt
            from (select distinct item from test) as items
            cross join dates
        ) as all_item_dates
        group by dt, item
    ) as grouped
) as windowed
where id is not null
order by id, dt;
Quickly, here's what this does:
A recursive CTE creates the date range in question (from the min date in the table until today).
These dates are cross joined with the distinct list of items, resulting in every date for every unique item.
This is UNIONed to the table so all data exists.
GROUP BY is used to merge real data rows with dummy rows for the same item and date.
Your window function is run.
A surrounding SELECT has a WHERE clause to remove any dummy rows.
As you will note, this does use a cross join, but on a much reduced set of data (just the unique item list). As long as this distinct list of items is much shorter than the table (very likely), this will perform much faster than other techniques. Also, if this is the kind of data you have, you might find interest in this post I wrote - http://wad-design.s3-website-us-east-1.amazonaws.com/sql_limits_wp_2.html

For each unique item in a redshift sql column, get the last rows based on a looking/scanning window

patient_id  alert_id  alert_timestamp
3           xyz       2022-10-10
1           anp       2022-10-12
1           gfe       2022-10-10
2           fgy       2022-10-02
2           gpl       2022-10-03
1           gdf       2022-10-13
2           mkd       2022-10-23
1           liu       2022-10-01
I have a SQL table (see the simplified version above) where, for each patient_id, I want to keep only the latest alert (i.e. the last one) sent out in a given window period, e.g. window_size = 7.
Note that the window needs to cover consecutive days, i.e. day 1 -> day 1 + window_size. The range of alert_timestamp for each patient_id varies and is usually well beyond the window_size.
Note that the table above is a very simple example; the real table has many more patient_ids, in mixed order in terms of alert_timestamp and alert_id.
The approach is to start from the last alert_timestamp for a given patient_id and work back using the window_size, selecting the alert that was the last one in that window time frame.
Please note the idea is to have a scanning window (e.g. window_size = 7 days) that moves across the timestamps of each patient.
The end result I want, is a table with the filtered out alerts
Expected output for (this example) window_size = 7:
patient_id  alert_id  alert_timestamp
1           liu       2022-10-01
1           gdf       2022-10-13
2           gpl       2022-10-03
2           mkd       2022-10-23
3           xyz       2022-10-10
What's the most efficient way to solve for this?
This can be done with the last_value window function but you need to prep your data a bit. Here's an example of what this could look like:
create table test (
    patient_id int,
    alert_id varchar(8),
    alert_timestamp date);
insert into test values
(3, 'xyz', '2022-10-10'),
(1, 'anp', '2022-10-12'),
(1, 'gfe', '2022-10-10'),
(2, 'fgy', '2022-10-02'),
(2, 'gpl', '2022-10-03'),
(1, 'gdf', '2022-10-13'),
(2, 'mkd', '2022-10-23'),
(1, 'liu', '2022-10-01');
WITH RECURSIVE dates (dt) AS
(
    SELECT '2022-09-30'::DATE AS dt
    UNION ALL
    SELECT dt + 1
    FROM dates d
    WHERE dt < '2022-10-31'::DATE
),
p_dates AS
(
    SELECT pid, dt
    FROM dates d
    CROSS JOIN (SELECT DISTINCT patient_id AS pid FROM test) p
),
combined AS
(
    SELECT *
    FROM p_dates d
    LEFT JOIN test t
      ON d.dt = t.alert_timestamp
     AND d.pid = t.patient_id
),
latest AS
(
    SELECT patient_id, pid, alert_id, dt, alert_timestamp,
           LAST_VALUE(alert_id IGNORE NULLS)
             OVER (PARTITION BY pid ORDER BY dt
                   ROWS BETWEEN CURRENT ROW AND 7 FOLLOWING) AS at
    FROM combined
)
SELECT patient_id,
       alert_id,
       alert_timestamp
FROM latest
WHERE patient_id IS NOT NULL
  AND alert_id = at
ORDER BY patient_id,
         alert_timestamp;
This produces the results you are looking for with the test data, but there are a few assumptions. The big one is that there is at most 1 alert per patient per day. If this isn't true then some more data massaging will be needed. Either way, this should give you an outline of how to do this.
The first need is to ensure that there is 1 row per patient per day, so that the window function can operate on rows, as these will be equivalent to days (for each patient). The date range is generated by a recursive CTE and joined to the test data to achieve the 1 row per day per patient.
The "ignore nulls" option is used in the last_value window function to ignore any of these "extra" rows created by the above process. The last step is to prune out all the unneeded rows and ensure that only the latest alert of the window is produced.

Sort events by highest occurrence by year

I am trying to find how many events occur by year. Currently I have this query, which basically counts when an event has visitors:
SELECT count(visitors_y_2016) as y_16,
       count(visitors_y_2017) as y_17,
       count(visitors_y_2018) as y_18,
       count(visitors_y_2019) as y_19,
       count(visitors_y_2020) as y_20
FROM event;
y16 y17 y18 y19 y20
23 25 26 27 19
But what I am looking for is to order by the years with the most events:
y_19  27
y_18  26
y_17  25
y_16  23
y_20  19
Any idea how to accomplish that?
Your table design looks quite strange, as such information should be stored in rows and not columns.
But you can UNION ALL the results and then sort them:
CREATE TABLE event (visitors_y_2016 int, visitors_y_2017 int, visitors_y_2018 int, visitors_y_2019 int, visitors_y_2020 int);
(SELECT 'y_16', count(visitors_y_2016) as cnt
 FROM event
 UNION ALL
 SELECT 'y_17', count(visitors_y_2017)
 FROM event
 UNION ALL
 SELECT 'y_18', count(visitors_y_2018)
 FROM event
 UNION ALL
 SELECT 'y_19', count(visitors_y_2019)
 FROM event
 UNION ALL
 SELECT 'y_20', count(visitors_y_2020)
 FROM event)
ORDER BY cnt DESC;
?column? | cnt
:------- | --:
y_16     |   0
y_17     |   0
y_18     |   0
y_19     |   0
y_20     |   0
You can "unpivot" with a VALUES expression in a LATERAL subquery:
SELECT t.*
FROM (
SELECT count(visitors_y_2016) AS y16
, count(visitors_y_2017) AS y17
, count(visitors_y_2018) AS y18
, count(visitors_y_2019) AS y19
, count(visitors_y_2020) AS y20
FROM event
) e, LATERAL (
VALUES
(16, e.y16)
, (17, e.y17)
, (18, e.y18)
, (19, e.y19)
, (20, e.y20)
) t(year, count)
ORDER BY count DESC; -- your desired sort order
Since this only needs a single scan over the table, it's many times faster than aggregating every output value separately.
Each line in the VALUES expression forms a row with two columns: year (number defaults to integer) and count (type of referenced column).
See:
Query for crosstab view
SELECT DISTINCT on multiple columns
About LATERAL subqueries:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
But your table design raises questions. Typically you'd have a single date or timestamp column per visit instead of visitors_y_2016, visitors_y_2017, etc. - and a simpler query based on that ...
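For example, a normalized layout might look like this (a sketch only; table and column names are hypothetical):
CREATE TABLE event_visit (
  event_id   int,
  visited_at date  -- one row per visit
);

SELECT EXTRACT(year FROM visited_at) AS year,
       count(*) AS cnt
FROM   event_visit
GROUP  BY 1
ORDER  BY cnt DESC;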
I don't think you need a SELECT for each year. I don't know your exact table, but there should be a better way to organize your data. Also, ORDER BY should be your friend if you want to sort data. You just need a single SELECT to use it, for example:
ORDER BY
VISITOR_COUNT

retrieve several values previous to several given dates

I got a values table such as:
id | user_id | value | date
---------------------------------
1 | 12 | 38 | 2014-04-05
2 | 15 | 19 | 2014-04-05
3 | 12 | 47 | 2014-04-08
I want to retrieve all values for given dates. However, if I don't have a value for one specific date, I want to get the previous available value. For instance, with the above dataset, if I query values for user 12 for dates 2014-04-07 and 2014-04-08, I need to retrieve 38 and 47.
I succeeded using one query per date, like:
SELECT *
FROM values
WHERE date <= $date
ORDER BY date DESC
LIMIT 1
However, that requires dates.length requests each time. So I'm wondering if there is a more performant solution to retrieve all my values in a single request?
In general, you would use a VALUES clause to specify multiple values in a single query.
If you have only occasional dates missing (and thus no big gaps in dates between rows for any particular user_id) then this would be an elegant solution:
SELECT dt, coalesce(value, lag(value) OVER (ORDER BY dt)) AS value
FROM (VALUES ('2014-04-07'::date), ('2014-04-08')) AS dates(dt)
LEFT JOIN "values" ON "date" = dt AND user_id = 12;
The lag() window function picks the previous value if the current row does not have a value.
If, on the other hand, there may be big gaps, you need to do some more work:
SELECT DISTINCT dt, first_value(value) OVER (PARTITION BY dt ORDER BY diff) AS value
FROM (
    SELECT dt, value, dt - "date" AS diff
    FROM (VALUES ('2014-04-07'::date), ('2014-04-08')) AS dates(dt)
    CROSS JOIN "values"
    WHERE user_id = 12
      AND "date" <= dt) sub;
In this case a CROSS JOIN is made against the rows for user_id = 12, and the difference between each date in the VALUES clause and each table row's date is computed in a sub-query, keeping only table rows on or before the requested date. So every row has a value for the field value. In the main query, the value with the smallest difference is selected per date using the first_value() window function, partitioned by dt. Note that simply ordering on diff and picking the first row would not work here because you want values for multiple dates returned.
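As a further option (not from the answer above, just a sketch): a LATERAL subquery, available in PostgreSQL 9.3+, mirrors the original per-date query but runs it once per date within a single statement:
SELECT d.dt, v.value
FROM (VALUES ('2014-04-07'::date), ('2014-04-08')) AS d(dt)
LEFT JOIN LATERAL (
    SELECT value
    FROM "values"
    WHERE user_id = 12
      AND "date" <= d.dt
    ORDER BY "date" DESC
    LIMIT 1) v ON true;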

show recent records only

I have a requirement to show the most recent records when the user selects the option to view them. I have 3 different tables from which I take data and display it on the screen.
Below are the sample tables created.
create table one(sealID integer, product_ser_num varchar2(20), create_time timestamp, status varchar2(10));
create table two(transID integer, formatID integer, formatStatus varchar2(10), ctimeStamp timestamp, sealID integer);
create table three(transID integer, fieldStatus varchar2(10), fieldValue varchar2(100), exctype varchar2(10));
I'm joining the above 3 tables and showing the results on a single screen. I want to display the most recent records based on the timestamp.
Please find below the sample data shown on the screen, taken from the 3 tables.
ProductSerialNumber  formatID  formatStatus  fieldStatus  TimeStamp
ASD100               100       P             P            2015-09-03 10:30:22
ASD100               200       P             P            2015-09-03 10:30:22
ASD100               100       P             P            2015-09-03 10:22:11
ASD100               200       P             P            2015-09-03 10:22:11
I want to display the most recent records from the table shown above, which should return the first 2 rows, as they are the most recent when checked against the timestamp column.
Please suggest what changes should be made to the query below to show the most recent records.
SELECT transId,product_ser_num,status, to_char(timestamp, 'yyyy-mm-dd hh24:mi:ss') timestamp,
cnt
FROM (SELECT one.*,
row_number() over(ORDER BY
CASE
WHEN :orderDirection like '%asc%' THEN
CASE
WHEN :orderBy='product_ser_num' THEN product_ser_num,
WHEN :orderBy='status' THEN status
WHEN :orderBy='timestamp' THEN to_char(timestamp, 'yyyy-mm-dd hh24:mi:ss')
ELSE to_char(timestamp, 'yyyy-mm-dd hh24:mi:ss')
END
END ASC,
CASE
WHEN :orderDirection like '%desc%' THEN
CASE
WHEN :orderBy='product_ser_num' THEN product_ser_num,
WHEN :orderBy='status' THEN status
WHEN :orderBy='timestamp' THEN to_char(timestamp, 'yyyy-mm-dd hh24:mi:ss')
ELSE to_char(timestamp, 'yyyy-mm-dd hh24:mi:ss')
END
END DESC , transId ASC) line_number
FROM (select one_inner.*, COUNT(1) OVER() cnt
from (select two_tran.transaction_id,
one_res.product_serial_number productSerialNumber,
one_res.status status,from one one_res
left outer join two two_trans on two_trans.sealID = one_res.sealID
left outer join three three_flds on two_tran.transID = three_flds.transID and (three_flds.fieldStatus = 'P')
I don't think you are looking for a Top-n query as your topic title suggests.
It seems like you want to display the data in a customized order, as shown in your sample output. You want the rows sharing a timestamp to be grouped together.
I have prepared a small test case to demonstrate the custom order of the rows:
WITH DATA(ID, num, datetime) AS(
  SELECT 10, 1001, SYSDATE FROM dual UNION ALL
  SELECT 10, 6009, SYSDATE FROM dual UNION ALL
  SELECT 10, 3951, SYSDATE FROM dual UNION ALL
  SELECT 10, 1001, SYSDATE -1 FROM dual UNION ALL
  SELECT 10, 6009, SYSDATE -1 FROM dual UNION ALL
  SELECT 10, 3951, SYSDATE -1 FROM dual
)
SELECT ID,
       num,
       TO_CHAR(DATETIME, 'yyyy-mm-dd hh24:mi:ss') TIMESTAMP
FROM
  (SELECT t.*,
          row_number() OVER(ORDER BY DATETIME DESC,
                            CASE num
                              WHEN 1001 THEN 1
                              WHEN 6009 THEN 2
                              WHEN 3951 THEN 3
                            END, num) rn
   FROM DATA t
  );
ID NUM TIMESTAMP
---------- ---------- -------------------
10 1001 2015-09-04 11:04:48
10 6009 2015-09-04 11:04:48
10 3951 2015-09-04 11:04:48
10 1001 2015-09-03 11:04:48
10 6009 2015-09-03 11:04:48
10 3951 2015-09-03 11:04:48
6 rows selected.
Now, you can see that for the same ID 10, the NUM values are grouped and also in a custom order.
This query seems very large and complex, so this may be oversimplifying things:
Add a limit 3 clause to the end?
What I think you need to do is:
select
    max(timestamp), engine_serial_number, formatID
from
    < joins here >
group by engine_serial_number, formatID
This will basically give you the rows you want, but not all the metadata.
Hence, you will just have to re-join all this with the main join to get the rest of the info (join on all three columns: engine serial number, formatID AND timestamp).
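A sketch of that re-join idea (hypothetical: joined_data stands in for the OP's three-table join, and ts for its timestamp column):
WITH latest AS (
    SELECT product_ser_num, formatID, MAX(ts) AS max_ts
    FROM joined_data
    GROUP BY product_ser_num, formatID
)
SELECT j.*
FROM joined_data j
JOIN latest l
  ON j.product_ser_num = l.product_ser_num
 AND j.formatID = l.formatID
 AND j.ts = l.max_ts;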
That should work.
Hope this helps!
It's hard to give you a precise answer, because your query is incomplete. But I'll give you the general idea, and you can tweak it into your query.
One way to accomplish what you want is by using the dense_rank() analytical function to number your rows by timestamp in descending order (You could use rank() too in this case, it doesn't actually matter). All rows with the same timestamp will be assigned the same "rank", so you can then filter by rank to only get the most recent records.
Try to adjust your query to something like this:
select ...
from (select ...,
dense_rank() over (order by timestamp desc) as timestamp_rank
from ...)
where timestamp_rank = 1
...
I suspect that with a better understanding of your data model and query, there would probably be a better solution. But based on the information provided, I think that the above should yield the results you are looking for.

Inserting data in another table having different data structure in oracle

I have a table table1
ITEM_CODE         DESC        MONTH   DAY01  DAY02  DAY03
FG0050BCYL0000CD  CYL HEAD    FEB-15      0    204    408
FG00186CYL0000CD  POWER UNIT  FEB-15    425    123    202
I want to insert data from table1 into another table, table2, in the following way.
ITEM_CODE         MONTH   DATE         QUANTITY
FG0050BCYL0000CD  FEB-15  01-FEB-2015         0
FG0050BCYL0000CD  FEB-15  02-FEB-2015       204
FG0050BCYL0000CD  FEB-15  03-FEB-2015       408
FG00186CYL0000CD  FEB-15  01-FEB-2015       425
FG00186CYL0000CD  FEB-15  02-FEB-2015       123
FG00186CYL0000CD  FEB-15  03-FEB-2015       202
Please tell me how to achieve this via PL/SQL.
This SQL query worked for me.
with items as (
    select table1.*,
           to_date(month||'-01', 'MON-YY-DD', 'NLS_DATE_LANGUAGE=American') day
    from table1)
select item_code, month, day + lvl - 1 day,
       case extract(day from day + lvl - 1)
         when 1 then day01
         when 2 then day02
         when 3 then day03
         -- <- insert rest (day04...day30) here
         when 31 then day31
       end value
from items
join (select level lvl from dual connect by level < 32) n
  on day + lvl - 1 <= last_day(day)
Subquery items attaches the first day of the month to the data. Next I join this subquery with another, hierarchical subquery,
which gives a simple list of 31 numbers (from 1 to 31). The join is constructed so that the date cannot exceed the last day of the month.
So for each row in table1 we have 28, 29, 30 or 31 rows with proper dates.
Now comes a simple but tedious task - for each day we have to get the value from the proper column; we need a case expression here.
The solution above shows only four branches, but you will need to complete the rest.
At the end, just insert the results into table2.
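For instance, the final insert could look like this (a sketch; it assumes table2 has the columns shown in the desired output, and the case branches still need to be completed as noted above):
insert into table2 (item_code, month, "DATE", quantity)
with items as (
    select table1.*,
           to_date(month||'-01', 'MON-YY-DD', 'NLS_DATE_LANGUAGE=American') day
    from table1)
select item_code, month, day + lvl - 1,
       case extract(day from day + lvl - 1)
         when 1 then day01
         when 2 then day02
         when 3 then day03
         -- complete day04...day31 here
       end
from items
join (select level lvl from dual connect by level < 32) n
  on day + lvl - 1 <= last_day(day);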
The following should get you close:
BEGIN
  FOR aRow IN (SELECT * FROM TABLE1)
  LOOP
    INSERT INTO TABLE2(ITEM_CODE, MONTH, "DATE", QUANTITY)
      VALUES (aRow.ITEM_CODE, aRow.MONTH,
              TO_DATE(aRow.MONTH, 'MON-RR') + 0, aRow.DAY01);
    INSERT INTO TABLE2(ITEM_CODE, MONTH, "DATE", QUANTITY)
      VALUES (aRow.ITEM_CODE, aRow.MONTH,
              TO_DATE(aRow.MONTH, 'MON-RR') + 1, aRow.DAY02);
    INSERT INTO TABLE2(ITEM_CODE, MONTH, "DATE", QUANTITY)
      VALUES (aRow.ITEM_CODE, aRow.MONTH,
              TO_DATE(aRow.MONTH, 'MON-RR') + 2, aRow.DAY03);
  END LOOP;
END;
Note that the column names DESC and DATE are both reserved words in Oracle, which requires that they be quoted as shown above. It would be simpler to use different names, such as DESCRIPTION and ACTIVITY_DATE, to eliminate the need to quote these names every time they're used.
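For instance, a sketch of table2 with friendlier names (column types are assumptions based on the sample data):
CREATE TABLE table2 (
  item_code     VARCHAR2(20),
  month         VARCHAR2(6),
  activity_date DATE,    -- instead of the reserved word DATE
  quantity      NUMBER
);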
Best of luck.