Amazon Redshift: avoid recursive CTE for filling null values with previous value - sql

I have found a pretty good solution to a common problem in SQL, right here: https://stackoverflow.com/a/3474775
My only problem is that Amazon Redshift does not support recursive CTEs. Is there any way to rewrite this portion of code differently and avoid the recursion in CleanCust?
/* Test Data & Table */
DECLARE @Customers TABLE
    (Dates datetime,
     Customer integer,
     Value integer)

INSERT INTO @Customers
VALUES ('20100101', 1, 12),
       ('20100101', 2, NULL),
       ('20100101', 3, 32),
       ('20100101', 4, 42),
       ('20100101', 5, 15),
       ('20100102', 1, NULL),
       ('20100102', 2, NULL),
       ('20100102', 3, 39),
       ('20100102', 4, NULL),
       ('20100102', 5, 16),
       ('20100103', 1, 13),
       ('20100103', 2, 24),
       ('20100103', 3, NULL),
       ('20100103', 4, NULL),
       ('20100103', 5, 21),
       ('20100104', 1, 14),
       ('20100104', 2, NULL),
       ('20100104', 3, NULL),
       ('20100104', 4, 65),
       ('20100104', 5, 23);

/* CustCTE - This gives us a RowNum to allow us to build the recursive CTE CleanCust */
WITH CustCTE
AS (SELECT Customer,
           Value,
           Dates,
           ROW_NUMBER() OVER (PARTITION BY Customer ORDER BY Dates) RowNum
    FROM @Customers),
/* CleanCust - A recursive CTE. This runs down the list of values for each customer,
   checking the Value column; if it is NULL it gets the previous non-NULL value. */
CleanCust
AS (SELECT Customer,
           ISNULL(Value, 0) Value, /* Ensure we start with no NULL values for each customer */
           Dates,
           RowNum
    FROM CustCTE cur
    WHERE RowNum = 1
    UNION ALL
    SELECT curr.Customer,
           ISNULL(curr.Value, prev.Value) Value,
           curr.Dates,
           curr.RowNum
    FROM CustCTE curr
    INNER JOIN CleanCust prev ON curr.Customer = prev.Customer
                             AND curr.RowNum = prev.RowNum + 1)
The desired output is below, in the Required column:
Date      Customer  Value  Required  Rule
20100101  1         12     12
20100101  2         NULL   0         If no value assign 0
20100101  3         32     32
20100101  4         42     42
20100101  5         15     15
20100102  1         NULL   12        Take last known value
20100102  2         NULL   0         Take last known value
20100102  3         39     39
20100102  4         NULL   42        Take last known value
20100102  5         16     16
20100103  1         13     13
20100103  2         24     24
20100103  3         NULL   39        Take last known value
20100103  4         NULL   42        Take last known value
20100103  5         21     21
20100104  1         14     14
20100104  2         NULL   24        Take last known value
20100104  3         NULL   39        Take last known value
20100104  4         65     65
20100104  5         23     23

Use a running count of non-null values to set groups; each group starts at a non-null value, so the max value of the group is the last known value.
select dates,
       customer,
       value,
       coalesce(max(value) over (partition by customer, grp), 0) as required
from (select dates,
             customer,
             value,
             sum(case when value is null then 0 else 1 end)
               over (partition by customer order by dates rows unbounded preceding) as grp
      from customers
     ) t
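The same grouping trick can be checked end to end on any engine with window functions. Below is a minimal sketch using SQLite through Python's sqlite3 module; the table and column names follow the question's sample data, trimmed to two customers:

```python
import sqlite3

# Build a small in-memory copy of the customers table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (dates TEXT, customer INTEGER, value INTEGER);
INSERT INTO customers VALUES
  ('20100101', 1, 12),   ('20100101', 2, NULL),
  ('20100102', 1, NULL), ('20100102', 2, NULL),
  ('20100103', 1, 13),   ('20100103', 2, 24),
  ('20100104', 1, 14),   ('20100104', 2, NULL);
""")

# grp is a running count of non-null values per customer, so every row
# in one grp shares at most one non-null value: the one to carry forward.
rows = conn.execute("""
SELECT dates, customer,
       COALESCE(MAX(value) OVER (PARTITION BY customer, grp), 0) AS required
FROM (SELECT dates, customer, value,
             SUM(CASE WHEN value IS NULL THEN 0 ELSE 1 END)
                 OVER (PARTITION BY customer ORDER BY dates
                       ROWS UNBOUNDED PRECEDING) AS grp
      FROM customers) t
ORDER BY customer, dates
""").fetchall()

for r in rows:
    print(r)
```

Customer 1's null on 20100102 is filled with 12, and customer 2 stays at 0 until its first real reading on 20100103, matching the Required column above.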

Identify rows subsequent to other rows based on criteria?

I am fairly new to DB2 and SQL. There is a table of customers and their visits. I need to write a query to find visits by the same customer that occur after, and within 24 hours of, a visit where Sale = 'Y'.
Based on this example data:
CustomerId  VisitID  Sale  DateTime
1           1        Y     2021-04-23 20:16:00.000000
2           2        N     2021-04-24 20:16:00.000000
1           3        N     2021-04-23 21:16:00.000000
2           4        Y     2021-04-25 20:16:00.000000
3           5        Y     2021-04-23 20:16:00.000000
2           6        N     2021-04-25 24:16:00.000000
3           7        N     2021-5-23 20:16:00.000000
The query results should return:
VisitID
3
6
How do I do this?
Try this. You may uncomment the commented-out block to run the statement as is.
/*
WITH MYTAB (CustomerId, VisitID, Sale, DateTime) AS
(
    VALUES
      (1, 1, 'Y', '2021-04-23 20:16:00'::TIMESTAMP)
    , (1, 3, 'N', '2021-04-23 21:16:00'::TIMESTAMP)
    , (2, 2, 'N', '2021-04-24 20:16:00'::TIMESTAMP)
    , (2, 4, 'Y', '2021-04-25 20:16:00'::TIMESTAMP)
    , (2, 6, 'N', '2021-04-25 23:16:00'::TIMESTAMP)
    , (3, 5, 'Y', '2021-04-23 20:16:00'::TIMESTAMP)
    , (3, 7, 'N', '2021-05-23 20:16:00'::TIMESTAMP)
)
*/
SELECT VisitID
FROM MYTAB A
WHERE EXISTS
(
    SELECT 1
    FROM MYTAB B
    WHERE B.CustomerID = A.CustomerID
      AND B.Sale = 'Y'
      AND B.VisitID <> A.VisitID
      AND A.DateTime BETWEEN B.DateTime AND B.DateTime + 24 HOUR
)
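The EXISTS pattern ports to most engines. Here is a minimal sketch on SQLite via Python's sqlite3, where datetime(..., '+24 hours') stands in for DB2's DateTime + 24 HOUR interval arithmetic:

```python
import sqlite3

# Mirror the answer's MYTAB data (with the 23:16 correction) in SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytab (CustomerId INTEGER, VisitID INTEGER, Sale TEXT, dt TEXT);
INSERT INTO mytab VALUES
  (1, 1, 'Y', '2021-04-23 20:16:00'),
  (1, 3, 'N', '2021-04-23 21:16:00'),
  (2, 2, 'N', '2021-04-24 20:16:00'),
  (2, 4, 'Y', '2021-04-25 20:16:00'),
  (2, 6, 'N', '2021-04-25 23:16:00'),
  (3, 5, 'Y', '2021-04-23 20:16:00'),
  (3, 7, 'N', '2021-05-23 20:16:00');
""")

# Keep a visit if some *other* visit by the same customer was a sale
# at most 24 hours earlier. String BETWEEN works because the timestamps
# share one fixed 'YYYY-MM-DD HH:MM:SS' format.
visits = [v for (v,) in conn.execute("""
SELECT VisitID
FROM mytab a
WHERE EXISTS (SELECT 1
              FROM mytab b
              WHERE b.CustomerId = a.CustomerId
                AND b.Sale = 'Y'
                AND b.VisitID <> a.VisitID
                AND a.dt BETWEEN b.dt AND datetime(b.dt, '+24 hours'))
ORDER BY VisitID
""")]
print(visits)  # [3, 6]
```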

Select hours as columns from Oracle table

I am working with an Oracle database table that is structured like this:
TRANS_DATE TRANS_HOUR_ENDING TRANS_HOUR_SUFFIX READING
1/1/2021 1 1 100
1/1/2021 2 1 105
... ... ... ...
1/1/2021 24 1 115
The TRANS_HOUR_SUFFIX is only used to track hourly readings on days when daylight saving time ends (when there can be two hours with the same TRANS_HOUR_ENDING value). This column is the bane of this database's design; however, I'm trying to select this data in a certain way. We need a report that columnizes this data based on the hour. It would be structured like this (the last row shows a day on which DST ends):
TRANS_DATE HOUR_1 HOUR_2_1 HOUR_2_2 ... HOUR_24
1/1/2021 100 105 0 ... 115
1/2/2021 112 108 0 ... 135
... ... ... ... ... ...
11/7/2021 117 108 107 ... 121
I have done something like this before with a PIVOT; however, in this case I'm having trouble determining how to account for the suffix. When DST ends, we have to account for the extra hour. I know we could select each hourly value individually with DECODE or CASE statements, but that is messy code. Is there a cleaner way to do this?
You can include multiple source columns in the pivot for() and in() clauses, so you could do:
select *
from (
select trans_date,
trans_hour_ending,
trans_hour_suffix,
reading
from your_table
)
pivot (max(reading) for (trans_hour_ending, trans_hour_suffix)
in ((1, 1) as hour_1, (2, 1) as hour_2_1, (2, 2) as hour_2_2, (3, 1) as hour_3,
-- snip
(23, 1) as hour_23, (24, 1) as hour_24))
order by trans_date;
where every hour has an (n, 1) tuple, and the DST-affected hour has an extra (2, 2) tuple.
If you don't have rows for every hour - which you don't appear to have, from the very brief sample data, at least for suffix 2 on non-DST days - then you will get null results for those, but you can replace them with zeros:
select trans_date,
coalesce(hour_1, 0) as hour_1,
coalesce(hour_2_1, 0) as hour_2_1,
coalesce(hour_2_2, 0) as hour_2_2,
coalesce(hour_3, 0) as hour_3,
-- snip
coalesce(hour_23, 0) as hour_23,
coalesce(hour_24, 0) as hour_24
from (
select trans_date,
trans_hour_ending,
trans_hour_suffix,
reading
from your_table
)
pivot (max(reading) for (trans_hour_ending, trans_hour_suffix)
in ((1, 1) as hour_1, (2, 1) as hour_2_1, (2, 2) as hour_2_2, (3, 1) as hour_3,
-- snip
(23, 1) as hour_23, (24, 1) as hour_24))
order by trans_date;
which with slightly expanded sample data gets:
TRANS_DATE HOUR_1 HOUR_2_1 HOUR_2_2 HOUR_3 HOUR_23 HOUR_24
---------- ---------- ---------- ---------- ---------- ---------- ----------
2021-01-01 100 105 0 0 0 115
2021-01-02 112 108 0 0 0 135
2021-11-07 117 108 107 0 0 121
This is a bit long-winded when you have to include all 25 columns everywhere; to avoid that you'd have to do a dynamic pivot.
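The multi-column pivot above is Oracle-specific syntax. As a portable sketch of the same columnization, conditional aggregation (one MAX(CASE ...) per target column) produces the same report on any engine; the example below runs on SQLite via Python's sqlite3 and is trimmed to the hours the sample data covers:

```python
import sqlite3

# Sample data including the DST day with a (2, 2) suffix reading.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE your_table (trans_date TEXT, trans_hour_ending INTEGER,
                         trans_hour_suffix INTEGER, reading INTEGER);
INSERT INTO your_table VALUES
  ('2021-01-01', 1, 1, 100), ('2021-01-01', 2, 1, 105),
  ('2021-01-01', 24, 1, 115),
  ('2021-11-07', 1, 1, 117), ('2021-11-07', 2, 1, 108),
  ('2021-11-07', 2, 2, 107), ('2021-11-07', 24, 1, 121);
""")

# One MAX(CASE ...) per (hour, suffix) pair; COALESCE turns the
# missing combinations into 0, matching the pivot's output.
rows = conn.execute("""
SELECT trans_date,
  COALESCE(MAX(CASE WHEN trans_hour_ending = 1  AND trans_hour_suffix = 1 THEN reading END), 0) AS hour_1,
  COALESCE(MAX(CASE WHEN trans_hour_ending = 2  AND trans_hour_suffix = 1 THEN reading END), 0) AS hour_2_1,
  COALESCE(MAX(CASE WHEN trans_hour_ending = 2  AND trans_hour_suffix = 2 THEN reading END), 0) AS hour_2_2,
  COALESCE(MAX(CASE WHEN trans_hour_ending = 24 AND trans_hour_suffix = 1 THEN reading END), 0) AS hour_24
FROM your_table
GROUP BY trans_date
ORDER BY trans_date
""").fetchall()
```

A real report would repeat the CASE line for all 25 target columns, which is exactly the long-windedness the pivot shares.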
Like I said in my comment, if you can format it with an additional row, I would recommend just having a row for the extra hour. Every other day would look normal. The query to do it would look like this:
CREATE TABLE READINGS
(
TRANS_DATE DATE,
TRANS_HOUR INTEGER,
TRANS_SUFFIX INTEGER,
READING INTEGER
);
INSERT INTO readings
SELECT TO_DATE('01/01/2021', 'MM/DD/YYYY'), 1, 1, 100 FROM DUAL UNION ALL
SELECT TO_DATE('01/01/2021', 'MM/DD/YYYY'), 2, 1, 100 FROM DUAL UNION ALL
SELECT TO_DATE('11/07/2021', 'MM/DD/YYYY'), 1, 1, 200 FROM DUAL UNION ALL
SELECT TO_DATE('11/07/2021', 'MM/DD/YYYY'), 1, 2, 300 FROM DUAL UNION ALL
SELECT TO_DATE('11/07/2021', 'MM/DD/YYYY'), 2, 1, 500 FROM DUAL UNION ALL
SELECT TO_DATE('11/07/2021', 'MM/DD/YYYY'), 2, 2, 350 FROM DUAL;
SELECT TRANS_DATE||DECODE(MAX(TRANS_SUFFIX) OVER (PARTITION BY TRANS_DATE), 1, NULL, 2, ' - '||TRANS_SUFFIX) AS TRANS_DATE,
HOUR_1, HOUR_2, /*...*/ HOUR_24
FROM readings
PIVOT (MAX(READING) FOR TRANS_HOUR IN (1 AS HOUR_1, 2 AS HOUR_2, /*...*/ 24 AS HOUR_24));
This would result in the following results (Sorry, I can't get dbfiddle to work):
TRANS_DATE     HOUR_1  HOUR_2  HOUR_24
01-JAN-21      100     100     -
07-NOV-21 - 1  200     500     -
07-NOV-21 - 2  300     350     -

SQL query to compare time difference

I've used the code below to query and got the output shown. Now I would like to query as described below. How should I do it?
Find code 2, then check if code 1 comes after code 2 within the same ItemID. If yes, compare the time difference. If the time difference is less than 10 seconds, display the two compared rows.
SELECT [Date]
,[Code]
,[ItemId]
,[ItemName]
FROM [dbo].[Log] as t
join Item as d
on t.ItemId = d.Id
where ([Code] = 2 or [Code] = 1) and ([ItemId] > 97 and [ItemId] < 100)
order by [ItemId], [Date]
Output from the above query
Date Code ItemName ItemID
2017-01-06 11:00:49.000 2 B 98
2017-01-06 11:00:49.000 1 A 98
2017-01-06 11:00:55.000 2 B 98
2017-01-06 12:01:56.000 1 A 98
2017-01-06 12:02:37.000 2 B 98
2017-01-06 12:03:49.000 1 A 98
2017-01-06 12:05:44.000 2 B 98
2017-01-06 20:24:32.000 1 A 98
2017-01-06 20:24:55.000 2 B 98
2017-03-14 16:37:42.000 2 B 99
2017-03-14 17:40:24.000 1 A 99
2017-03-14 17:40:25.000 2 B 99
2017-03-14 21:28:46.000 1 A 99
2017-03-15 08:03:07.000 2 B 99
2017-03-15 10:43:00.000 1 A 99
2017-03-15 12:01:17.000 2 B 99
2017-03-15 14:18:19.000 2 B 99
Expected Result
Date Code ItemName ItemID
2017-01-06 11:00:49.000 2 B 98
2017-01-06 11:00:49.000 1 A 98
create table results ([Date] datetime, Code int, ItemName char(1), ItemID int);
insert into results values
('2017-01-06 11:00:49', 2, 'B', 98),
('2017-01-06 11:00:49', 1, 'A', 98),
('2017-01-06 11:00:55', 2, 'B', 98),
('2017-01-06 12:01:56', 1, 'A', 98),
('2017-01-06 12:01:58', 1, 'A', 98),
('2017-01-06 12:02:37', 2, 'B', 98),
('2017-01-06 12:03:49', 1, 'A', 98),
('2017-01-06 12:05:44', 2, 'B', 98),
('2017-01-06 20:24:32', 1, 'A', 98),
('2017-01-06 20:24:55', 2, 'B', 98),
('2017-03-07 00:02:27', 1, 'A', 91),
('2017-03-07 00:02:27', 1, 'A', 58),
('2017-03-14 16:37:42', 2, 'B', 99),
('2017-03-14 17:40:24', 1, 'A', 99),
('2017-03-14 17:40:38', 2, 'B', 99),
('2017-03-14 21:28:46', 1, 'A', 99),
('2017-03-15 08:03:07', 2, 'B', 99),
('2017-03-15 10:43:00', 1, 'A', 99),
('2017-03-15 12:01:17', 2, 'B', 99),
('2017-03-15 14:18:19', 1, 'A', 99);
--= set a reset point when ItemId changes or there is no consecutive (2,1) pair
--= keep in mind this solution assumes that the first Code must be 2
--
WITH SetReset AS
(
SELECT [Date], Code, ItemName, ItemId,
CASE WHEN LAG([ItemId]) OVER (PARTITION BY ItemId ORDER BY [Date]) IS NULL
OR ([Code] = 2)
OR ([Code] = COALESCE(LAG([Code]) OVER (PARTITION BY ItemId ORDER BY [Date]), [Code]))
THEN 1 END is_reset
FROM results
)
--
--= set groups according to reset points
--
, SetGroup AS
(
SELECT [Date], Code, ItemName, ItemId,
COUNT(is_reset) OVER (ORDER BY [ItemId], [Date]) grp
FROM SetReset
)
--
--= calcs diff date for each group
, CalcSeconds AS
(
SELECT [Date], Code, ItemName, ItemId,
DATEDIFF(SECOND, MIN([Date]) OVER (PARTITION BY grp), MAX([Date]) OVER (PARTITION BY grp)) dif_sec,
COUNT(*) OVER (PARTITION BY grp) num_items
FROM SetGroup
)
--
--= selects those rows with 2 items by group and date diff less than 10 sec
SELECT [Date], Code, ItemName, ItemId
FROM CalcSeconds
WHERE dif_sec < 10
AND num_items = 2
;
GO
Date | Code | ItemName | ItemId
:------------------ | ---: | :------- | -----:
06/01/2017 11:00:49 | 2 | B | 98
06/01/2017 11:00:49 | 1 | A | 98
Warning: Null value is eliminated by an aggregate or other SET operation.
dbfiddle here
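As a shorter alternative sketch (not the CTE chain above), LAG can pair each Code 1 row directly with the preceding row for the same item and keep pairs under 10 seconds apart. The ORDER BY dt, Code DESC tie-break is an assumption so that a Code 2 row sorts before a Code 1 row sharing the identical timestamp; the demo runs on SQLite via Python's sqlite3 with a trimmed version of the sample data:

```python
import sqlite3

# Trimmed sample: the matching pair shares the 11:00:49 timestamp.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE results (dt TEXT, Code INTEGER, ItemName TEXT, ItemID INTEGER);
INSERT INTO results VALUES
  ('2017-01-06 11:00:49', 2, 'B', 98),
  ('2017-01-06 11:00:49', 1, 'A', 98),
  ('2017-01-06 11:00:55', 2, 'B', 98),
  ('2017-01-06 12:01:56', 1, 'A', 98),
  ('2017-03-14 16:37:42', 2, 'B', 99),
  ('2017-03-14 17:40:24', 1, 'A', 99);
""")

# For each row, look one row back per item; keep (2 -> 1) pairs whose
# gap in seconds (via julianday day fractions) is below 10.
pairs = conn.execute("""
WITH w AS (
  SELECT *,
         LAG(Code) OVER (PARTITION BY ItemID ORDER BY dt, Code DESC) AS prev_code,
         LAG(dt)   OVER (PARTITION BY ItemID ORDER BY dt, Code DESC) AS prev_dt
  FROM results
)
SELECT ItemID, prev_dt, dt
FROM w
WHERE Code = 1
  AND prev_code = 2
  AND (julianday(dt) - julianday(prev_dt)) * 86400.0 < 10
""").fetchall()
```

Only the 11:00:49 pair for item 98 survives the 10-second filter, matching the expected result.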

count trips in sql server

my table structure
id  zoneid  status
 1  35      IN     starting zone
 2  35      OUT    1st trip has been started
 3  36      IN
 4  36      IN
 5  36      OUT
 6  38      IN     last station zone 1 trip completed
 7  38      OUT    returning back 2nd trip has start
 8  38      OUT
 9  36      IN
10  36      OUT
11  35      IN     when return back in start zone means 2nd trip complete
12  35      IN
13  35      IN
14  35      OUT    3rd trip has been started
15  36      IN
16  36      IN
17  36      OUT
18  38      IN     3rd trip has been completed
19  38      OUT    4th trip has been started
20  38      OUT
21  36      IN
22  36      OUT
23  35      IN     4th trip completed
24  35      IN
Now I want an SQL query so I can count the number of trips. I do not want to use the status field for the count.
edit
I want the result as total trips, where 35 is the starting point and 38 is the ending point (this is 1 trip); when 35 occurs again after 38, that means 2 trips, and so on.
So you don't want to look at the status, but only look at the zoneid changes ordered by id. zoneid 36 is irrelevant, so we select 35 and 38 only, order them by id and count changes. We detect changes by comparing a record with the previous one. We can look into a previous record with LAG.
select sum(ischange) as trips_completed
from
(
select
case when zoneid <> lag(zoneid) over (order by id) then 1 else 0 end as ischange
from trips
where zoneid in (35,38)
) changes_detected;
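The change-counting query can be verified quickly. The sketch below loads the question's 24 rows into SQLite via Python's sqlite3 (only id and zoneid matter for this approach) and runs the query unchanged:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, zoneid INTEGER)")

# zoneid per id, 1..24, copied from the question's table.
zones = [35, 35, 36, 36, 36, 38, 38, 38, 36, 36, 35, 35,
         35, 35, 36, 36, 36, 38, 38, 38, 36, 36, 35, 35]
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 list(enumerate(zones, start=1)))

# Filter to the endpoint zones, then count every 35<->38 transition;
# the first row compares against NULL, which the CASE treats as 0.
(trips_completed,) = conn.execute("""
SELECT SUM(ischange) AS trips_completed
FROM (SELECT CASE WHEN zoneid <> LAG(zoneid) OVER (ORDER BY id)
                  THEN 1 ELSE 0 END AS ischange
      FROM trips
      WHERE zoneid IN (35, 38)) changes_detected
""").fetchone()
print(trips_completed)  # 4
```

Each direction change (35 to 38 or 38 to 35) is one completed leg, so the annotated "4th trip completed" data yields 4.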
I am suggesting this without any testing. Does the following query produce the correct number of rows? Note if there is a date_created (datetime) column then I would suggest using that column to order by instead of id.
select
ca.in_id, t.id as out_id, ca.in_status, t.status as out_status
from table1 t
cross apply (
select top (1) id as in_id, status as in_status
from table1
where table1.id < t.id
and zoneid = 35
order by id DESC
) ca
where t.zoneid = 38
/* and conditions for selecting one day only */
If that logic is correct then just use COUNT(*) instead of the column list.
CREATE TABLE Table1
("id" int, "zoneid" int, "status" varchar(3), "other" varchar(54))
;
INSERT INTO Table1
("id", "zoneid", "status", "other")
VALUES
(1, 35, 'IN', 'starting zone'),
(2, 35, 'OUT', '1st trip has been started'),
(3, 36, 'IN', NULL),
(4, 36, 'IN', NULL),
(5, 36, 'OUT', NULL),
(6, 38, 'IN', 'last station zone 1 trip completed'),
(7, 38, 'OUT', 'returning back 2nd trip has start'),
(8, 38, 'OUT', NULL),
(9, 36, 'IN', NULL),
(10, 36, 'OUT', NULL),
(11, 35, 'IN', 'when return back in start zone means 2nd trip complete'),
(12, 35, 'IN', NULL),
(13, 35, 'IN', NULL),
(14, 35, 'OUT', '3rd trip has been started'),
(15, 36, 'IN', NULL),
(16, 36, 'IN', NULL),
(17, 36, 'OUT', 'other'),
(18, 38, 'IN', '3rd trip has been completed'),
(19, 38, 'OUT', '4th trip has been started'),
(20, 38, 'OUT', NULL),
(21, 36, 'IN', NULL),
(22, 36, 'OUT', NULL),
(23, 35, 'IN', '4th trip completed'),
(24, 35, 'IN', NULL)
;
For learning purposes, here is a self-explanatory and detailed version: http://sqlfiddle.com/#!15/d8bf4/1/0
The solution is based on calculating the running count of trips from '35 to 38' and '38 to 35'. The solution is very specific to the OP's query but could be optimized into a much shorter version...
with trip_38_to_35 as (
select * from zonecount
where (zoneid=38 and status='OUT') OR (zoneid=35 and status='IN')
order by id asc
)
, count_start_on_38 as (
select count(*) as start_on_38
from trip_38_to_35
where (zoneid=38 and status='OUT') AND
id <
( select max(id)
from trip_38_to_35
where (zoneid=35 and status='IN')
) /*do not count unfinished trips*/
)
, count_end_on_35 as (
select count(*) as end_on_35
from trip_38_to_35
where (zoneid=35 and status='IN')
) /*the other way of trip*/
, trip_35_to_38 as (
select * from zonecount
where (zoneid=35 and status='OUT') OR (zoneid=38 and status='IN')
order by id asc
)
,count_start_on_35 as (
select count(*) as start_on_35
from trip_35_to_38
where (zoneid=35 and status='OUT') AND
id <
( select max(id)
from trip_35_to_38
where (zoneid=38 and status='IN')
) /*do not count unfinished trips*/
)
,count_end_on_38 as (
select count(*) as end_on_38
from trip_35_to_38
where (zoneid=38 and status='IN')
)
/*sum the MIN of the two trips count*/
select
(case when end_on_35 > start_on_38 then start_on_38 else end_on_35 end) +
(case when end_on_38 > start_on_35 then start_on_35 else end_on_38 end)
from
count_start_on_38,
count_end_on_35,
count_start_on_35,
count_end_on_38
btw, 6 trips are calculated as per definition

SQL Server - Selecting periods without changes in data

What I am trying to do is select periods of time where the rest of the data in the table was stable based on one column, and check whether there was a change in a second column's value in that period.
Table:
create table #stable_periods
(
[Date] date,
[Car_Reg] nvarchar(10),
[Internal_Damages] int,
[External_Damages] int
)
insert into #stable_periods
values ('2015-08-19', 'ABC123', 10, 10),
('2015-08-18', 'ABC123', 9, 10),
('2015-08-17', 'ABC123', 8, 9),
('2015-08-16', 'ABC123', 9, 9),
('2015-08-15', 'ABC123', 10, 10),
('2015-08-14', 'ABC123', 10, 10),
('2015-08-19', 'ABC456', 5, 3),
('2015-08-18', 'ABC456', 5, 4),
('2015-08-17', 'ABC456', 8, 4),
('2015-08-16', 'ABC456', 9, 4),
('2015-08-15', 'ABC456', 10, 10),
('2015-01-01', 'ABC123', 1, 1),
('2015-01-01', 'ABC456', NULL, NULL);
--select * from #stable_periods
-- Unfortunately I can't post pictures yet, but you get the point of what the table looks like
What I would like to receive is
Car_Reg FromDate ToDate External_Damages Have internal damages changed in this period?
ABC123 2015-08-18 2015-08-19 10 Yes
ABC123 2015-08-16 2015-08-17 9 Yes
ABC123 2015-08-14 2015-08-15 10 No
ABC123 2015-01-01 2015-01-01 1 No
ABC456 2015-08-19 2015-08-19 3 No
ABC456 2015-08-16 2015-08-18 4 Yes
ABC456 2015-08-15 2015-08-15 10 No
ABC456 2015-01-01 2015-01-01 NULL NULL
Basically to build period frames where [External_Damages] were constant and check did the [Internal_Damages] change in the same period (doesn't matter how many times).
I spent a lot of time trying, but I am afraid my level of abstract thinking is much too low...
It would be great to see any suggestions.
Thanks,
Bartosz
I believe this is a form of Islands Problem.
Here is a solution using ROW_NUMBER and GROUP BY:
SQL Fiddle
WITH CTE AS(
SELECT *,
RN = DATEADD(DAY, - ROW_NUMBER() OVER(PARTITION BY Car_reg, External_Damages ORDER BY [Date]), [Date])
FROM #stable_periods
)
SELECT
Car_Reg,
FromDate = MIN([Date]),
ToDate = MAX([Date]) ,
External_Damages,
Change =
CASE
WHEN MAX(External_Damages) IS NULL THEN NULL
WHEN COUNT(DISTINCT Internal_Damages) > 1 THEN 'Yes'
ELSE 'No'
END
FROM CTE c
GROUP BY Car_Reg, External_Damages, RN
ORDER BY Car_Reg, ToDate DESC
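The ROW_NUMBER islands trick can be sketched on SQLite via Python's sqlite3: subtracting the per-(car, damages) row number, as days, from each date yields a constant per island, which the outer query groups on. SQLite's date(..., '-N days') modifier stands in for DATEADD, and the NULL branch of the CASE is omitted since this trimmed sample has no NULL damages:

```python
import sqlite3

# One car's history, copied from the question's sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stable_periods (d TEXT, car_reg TEXT, internal INTEGER, external INTEGER);
INSERT INTO stable_periods VALUES
  ('2015-08-19', 'ABC123', 10, 10),
  ('2015-08-18', 'ABC123', 9, 10),
  ('2015-08-17', 'ABC123', 8, 9),
  ('2015-08-16', 'ABC123', 9, 9),
  ('2015-08-15', 'ABC123', 10, 10),
  ('2015-08-14', 'ABC123', 10, 10);
""")

# rn = date minus row number (in days) is constant within each run of
# consecutive dates sharing the same external damage count.
rows = conn.execute("""
WITH cte AS (
  SELECT *,
         date(d, '-' || ROW_NUMBER() OVER (PARTITION BY car_reg, external
                                           ORDER BY d) || ' days') AS rn
  FROM stable_periods
)
SELECT car_reg,
       MIN(d) AS from_date,
       MAX(d) AS to_date,
       external,
       CASE WHEN COUNT(DISTINCT internal) > 1 THEN 'Yes' ELSE 'No' END AS changed
FROM cte
GROUP BY car_reg, external, rn
ORDER BY to_date DESC
""").fetchall()
```

The three islands for ABC123 come back exactly as in the expected output: 08-18 to 08-19 (Yes), 08-16 to 08-17 (Yes), and 08-14 to 08-15 (No).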