Retrieve last known value for each column of a row - sql

Not sure about the correct words to ask this question, so I will break it down.
I have a table as follows:
date_time | a | b | c
Last 4 rows:
15/10/2013 11:45:00 | null | 'timtim' | 'fred'
15/10/2013 13:00:00 | 'tune' | 'reco' | null
16/10/2013 12:00:00 | 'abc' | null | null
16/10/2013 13:00:00 | null | 'died' | null
How would I get the last record, but ignoring nulls - so that where a column is null, the value comes from the most recent previous record that has one?
In my provided example the row returned would be
16/10/2013 13:00:00 | 'abc' | 'died' | 'fred'
As you can see, if the value for a column is null, it falls back to the last record that has a value for that column and uses that value.
This should be possible, I just can't figure it out. So far I have only come up with:
select
last_value(a) over w a
from test
WINDOW w AS (
partition by a
ORDER BY ts asc
range between current row and unbounded following
);
But this only caters for a single column ...

Here I create an aggregate function that collects column values into arrays, in date_time order. Then it is just a matter of removing the NULLs and selecting the last element from each array.
Sample Data
CREATE TABLE T (
date_time timestamp,
a text,
b text,
c text
);
INSERT INTO T VALUES ('2013-10-15 11:45:00', NULL, 'timtim', 'fred'),
('2013-10-15 13:00:00', 'tune', 'reco', NULL ),
('2013-10-16 12:00:00', 'abc', NULL, NULL ),
('2013-10-16 13:00:00', NULL, 'died', NULL );
Solution
CREATE AGGREGATE array_accum (anyelement)
(
sfunc = array_append,
stype = anyarray,
initcond = '{}'
);
WITH latest_nonull AS (
SELECT MAX(date_time) AS MaxDateTime,
       -- aggregate in date_time order so the last array element is the latest value
       array_remove(array_accum(a ORDER BY date_time), NULL) AS A,
       array_remove(array_accum(b ORDER BY date_time), NULL) AS B,
       array_remove(array_accum(c ORDER BY date_time), NULL) AS C
FROM T
)
SELECT MaxDateTime, A[array_upper(A, 1)], B[array_upper(B,1)], C[array_upper(C,1)]
FROM latest_nonull;
Result
maxdatetime | a | b | c
---------------------+-----+------+------
2013-10-16 13:00:00 | abc | died | fred
(1 row)
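As an aside, if you'd rather not create a custom aggregate, the built-in array_agg should do the same job here - a sketch of the same idea:
WITH latest_nonull AS (
   SELECT MAX(date_time) AS MaxDateTime
        , array_remove(array_agg(a ORDER BY date_time), NULL) AS A
        , array_remove(array_agg(b ORDER BY date_time), NULL) AS B
        , array_remove(array_agg(c ORDER BY date_time), NULL) AS C
   FROM T
)
SELECT MaxDateTime, A[array_upper(A, 1)], B[array_upper(B, 1)], C[array_upper(C, 1)]
FROM latest_nonull;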

Order of rows
The "last row" and the sort order would need to be defined unambiguously. There is no natural order in a set (or a table). I am assuming ORDER BY ts, where ts is the timestamp column.
As @Jorge pointed out in his comment: if ts is not UNIQUE, one needs to define tiebreakers for the sort order to make it unambiguous (add more items to ORDER BY). A primary key would be the ultimate solution.
General solution with window functions
To get a result for every row:
SELECT ts
, max(a) OVER (PARTITION BY grp_a) AS a
, max(b) OVER (PARTITION BY grp_b) AS b
, max(c) OVER (PARTITION BY grp_c) AS c
FROM (
SELECT *
, count(a) OVER (ORDER BY ts) AS grp_a
, count(b) OVER (ORDER BY ts) AS grp_b
, count(c) OVER (ORDER BY ts) AS grp_c
FROM t
) sub;
How?
The aggregate function count() ignores NULL values when counting. Used as aggregate-window function, it computes the running count of a column according to the default window definition, which is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. NULL values don't increase the count, so these rows fall into the same peer group as the last non-null value.
In a second window function, the only non-null value per group is easily extracted with max() or min().
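For the sample data above (ordered by ts), the running counts computed in the subquery work out like this (traced by hand):
ts                  | a    | b      | c    | grp_a | grp_b | grp_c
--------------------+------+--------+------+-------+-------+------
2013-10-15 11:45:00 | null | timtim | fred |   0   |   1   |   1
2013-10-15 13:00:00 | tune | reco   | null |   1   |   2   |   1
2013-10-16 12:00:00 | abc  | null   | null |   2   |   2   |   1
2013-10-16 13:00:00 | null | died   | null |   2   |   3   |   1
In the outer query, max(a) over partition grp_a = 2 picks 'abc', max(b) over grp_b = 3 picks 'died', and max(c) over grp_c = 1 picks 'fred' - exactly the values wanted for the last row.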
Just the last row
WITH cte AS (
SELECT *
, count(a) OVER w AS grp_a
, count(b) OVER w AS grp_b
, count(c) OVER w AS grp_c
FROM t
WINDOW w AS (ORDER BY ts)
)
SELECT ts
, max(a) OVER (PARTITION BY grp_a) AS a
, max(b) OVER (PARTITION BY grp_b) AS b
, max(c) OVER (PARTITION BY grp_c) AS c
FROM cte
ORDER BY ts DESC
LIMIT 1;
Simple alternatives for just the last row
SELECT ts
,COALESCE(a, (SELECT a FROM t WHERE a IS NOT NULL ORDER BY ts DESC LIMIT 1)) AS a
,COALESCE(b, (SELECT b FROM t WHERE b IS NOT NULL ORDER BY ts DESC LIMIT 1)) AS b
,COALESCE(c, (SELECT c FROM t WHERE c IS NOT NULL ORDER BY ts DESC LIMIT 1)) AS c
FROM t
ORDER BY ts DESC
LIMIT 1;
Or:
SELECT (SELECT ts FROM t ORDER BY ts DESC LIMIT 1) AS ts
,(SELECT a FROM t WHERE a IS NOT NULL ORDER BY ts DESC LIMIT 1) AS a
,(SELECT b FROM t WHERE b IS NOT NULL ORDER BY ts DESC LIMIT 1) AS b
,(SELECT c FROM t WHERE c IS NOT NULL ORDER BY ts DESC LIMIT 1) AS c
Performance
While this should be decently fast, if performance is your paramount requirement, consider a plpgsql function. Start with the last row and loop descending until you have a non-null value for every column required. Along these lines:
GROUP BY and aggregate sequential numeric values
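For illustration, a minimal plpgsql sketch of that loop, assuming the table t with columns ts, a, b, c used above (the function name is made up):
CREATE OR REPLACE FUNCTION last_known_row()  -- hypothetical name
  RETURNS TABLE (ts timestamp, a text, b text, c text) AS
$func$
DECLARE
   _row t%ROWTYPE;
BEGIN
   FOR _row IN
      SELECT * FROM t ORDER BY t.ts DESC   -- start with the last row
   LOOP
      -- keep the first (i.e. latest) non-null value seen per column
      ts := COALESCE(ts, _row.ts);
      a  := COALESCE(a, _row.a);
      b  := COALESCE(b, _row.b);
      c  := COALESCE(c, _row.c);
      EXIT WHEN a IS NOT NULL AND b IS NOT NULL AND c IS NOT NULL;  -- stop once every column is filled
   END LOOP;
   RETURN NEXT;
END
$func$ LANGUAGE plpgsql;
Call: SELECT * FROM last_known_row();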

This should work, but keep in mind it is an ugly solution:
select *
from
  (select dt
   from (select rank() over (order by ctid desc) idx, dt
         from sometable) cx
   where idx = 1) dtz,
  (select a
   from (select rank() over (order by ctid desc) idx, a
         from sometable
         where a is not null) ax
   where idx = 1) az,
  (select b
   from (select rank() over (order by ctid desc) idx, b
         from sometable
         where b is not null) bx
   where idx = 1) bz,
  (select c
   from (select rank() over (order by ctid desc) idx, c
         from sometable
         where c is not null) cx
   where idx = 1) cz
See it here at fiddle: http://sqlfiddle.com/#!15/d5940/40
The result will be
DT A B C
October, 16 2013 00:00:00+0000 abc died fred

Related

How to get the last not null value for a column while sorting based upon a timestamp column

Table T has columns A, B, C, TS(timestamp) with values as defined below
A B C TS
d g null 3
h y gh 2
q r null 7
If I write a query like below:
SELECT * from T order by TS desc Limit 1;
It gives me result as:
A B C TS
q r null 7
What I want is to never get a null value. Instead it should display the last not null value from that column, if any.
Desired result:
A B C TS
q r gh 7
Try the OLAP function LAST_VALUE( .... IGNORE NULLS) OVER ....
I found that you get data that makes sense if you order by ts ascending in the window definition clause - I hope that is what you need ...
WITH
-- your input ...
indata(A,B,C,TS) AS (
SELECT 'd','g',null,3
UNION ALL SELECT 'h','y','gh',2
UNION ALL SELECT 'q','r',null,7
)
-- real query starts here ...
SELECT
LAST_VALUE(a IGNORE NULLS) OVER w AS a
, LAST_VALUE(b IGNORE NULLS) OVER w AS b
, LAST_VALUE(c IGNORE NULLS) OVER w AS c
, ts
FROM indata
WINDOW w AS (ORDER BY ts)
ORDER BY ts DESC
LIMIT 1;
-- out a | b | c | ts
-- out ---+---+----+----
-- out q | r | gh | 7
One method is to fetch each column independently:
select (select a from T where a is not null order by TS desc Limit 1),
(select b from T where b is not null order by TS desc Limit 1),
(select c from T where c is not null order by TS desc Limit 1),
(select ts from T where ts is not null order by TS desc Limit 1)

How to number consecutive records per island?

I have a table which looks like:
group date color
A 1-1-2019 R
A 1-2-2019 Y
B 1-1-2019 R
B 1-2-2019 Y
B 1-3-2019 Y
B 1-4-2019 R
B 1-5-2019 R
B 1-6-2019 R
And it's ordered by group and date. I want an extra column showing sequential number of consecutive color 'R' for each group.
Required output:
group date color rank
A 1-1-2019 R 1
A 1-2-2019 Y null
B 1-1-2019 R 1
B 1-2-2019 Y null
B 1-3-2019 Y null
B 1-4-2019 R 1
B 1-5-2019 R 2
B 1-6-2019 R 3
I've tried to use a window function with partition by the group and color columns, but it returns the output below, which is not correct.
Wrong Query and Output:
SELECT
*,
RANK() OVER (PARTITION BY group, color order by group, date) as rank
FROM table
group date color rank
A 1-1-2019 R 1
A 1-2-2019 Y null
B 1-1-2019 R 1
B 1-2-2019 Y null
B 1-3-2019 Y null
B 1-4-2019 R 2
B 1-5-2019 R 3
B 1-6-2019 R 4
I'm wondering if it's doable in SQL, or should I switch to another language (like Python)?
This is how it can be done using window functions. First we create a CTE with a flag that indicates whether a new sequence has started, then from that we generate sequence numbers by counting those starts. Finally we number the rows within each sequence to get the rank:
WITH cte AS (SELECT `group`, date, color,
COALESCE(color = LAG(color) OVER(ORDER BY `group`, date), 0) AS samecolor
FROM `table`),
sequences AS (SELECT `group`, date, color,
SUM(samecolor = 0) OVER (ORDER BY `group`, date) AS seq_num
FROM cte)
SELECT `group`, date, color,
ROW_NUMBER() OVER (PARTITION BY seq_num ORDER BY date) AS `rank`
FROM sequences
ORDER BY `group`, date
Output:
group date color rank
A 1-1-2019 R 1
A 1-2-2019 Y 1
B 1-1-2019 R 1
B 1-2-2019 Y 1
B 1-3-2019 Y 2
B 1-4-2019 R 1
B 1-5-2019 R 2
B 1-6-2019 R 3
Note that this query also gives ranking for Y values, if you want those to be NULL replace the definition of rank with this:
CASE WHEN color = 'Y' THEN NULL
ELSE ROW_NUMBER() OVER (PARTITION BY seq_num ORDER BY date)
END AS `rank`
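For reference, the intermediate columns produced by the two CTEs, traced by hand for the sample data:
group  date      color  samecolor  seq_num
A      1-1-2019  R      0          1
A      1-2-2019  Y      0          2
B      1-1-2019  R      0          3
B      1-2-2019  Y      0          4
B      1-3-2019  Y      1          4
B      1-4-2019  R      0          5
B      1-5-2019  R      1          5
B      1-6-2019  R      1          5
Each distinct seq_num marks one run of a single color, so numbering the rows within each seq_num yields the rank.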
Using user variables, you can keep the rank and the previous value to produce the results:
CREATE TABLE tbl (
`group` VARCHAR(1),
`date` VARCHAR(8),
`color` VARCHAR(1)
);
INSERT INTO tbl
(`group`, `date`, `color`)
VALUES
('A', '1-1-2019', 'R'),
('A', '1-2-2019', 'Y'),
('B', '1-1-2019', 'R'),
('B', '1-2-2019', 'Y'),
('B', '1-3-2019', 'Y'),
('B', '1-4-2019', 'R'),
('B', '1-5-2019', 'R'),
('B', '1-6-2019', 'R');
set @seq := 0, @prev := 'B';
SELECT
*,
IF(color='R', @seq := IF(@prev = color, @seq + 1, 1), NULL) AS rank,
@prev := color as prev
FROM tbl
ORDER BY `group`, `date`
group | date | color | rank | prev
:---- | :------- | :---- | ---: | :---
A | 1-1-2019 | R | 1 | R
A | 1-2-2019 | Y | | Y
B | 1-1-2019 | R | 1 | R
B | 1-2-2019 | Y | | Y
B | 1-3-2019 | Y | | Y
B | 1-4-2019 | R | 1 | R
B | 1-5-2019 | R | 2 | R
B | 1-6-2019 | R | 3 | R
Use the window function row_number() for a pure standard SQL solution in Postgres - or any modern RDBMS, even MySQL since version 8:
SELECT grp, the_date, color
, row_number() OVER (PARTITION BY grp, color, part
ORDER BY the_date) AS rnk
FROM (
SELECT *
, row_number() OVER (PARTITION BY grp ORDER BY the_date, color)
- row_number() OVER (PARTITION BY grp, color ORDER BY the_date) AS part
FROM tbl
) sub
ORDER BY grp, the_date, color;
This assumes that the combination (grp, color, the_date) is defined UNIQUE, duplicates would create non-deterministic results.
Subtracting the two different row numbers computes a distinct number per island (part). Then you can run row_number() once more, now additionally partitioning by that subgroup. Voilà.
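To see the mechanics, these are the part values the subquery computes for the sample data, with the resulting rnk (traced by hand):
grp  the_date  color  part  rnk
A    1-1-2019  R      0     1
A    1-2-2019  Y      1     1
B    1-1-2019  R      0     1
B    1-2-2019  Y      1     1
B    1-3-2019  Y      1     2
B    1-4-2019  R      2     1
B    1-5-2019  R      2     2
B    1-6-2019  R      2     3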
To only see numbers for a particular color, 'R' in the example:
SELECT grp, the_date, color, CASE WHEN color = 'R' THEN rnk END AS rnk
FROM (
<<query from above, without ORDER BY>>
) sub
ORDER BY grp, the_date, color;
While set-based solutions are the forte of a relational database and typically faster, a procedural solution needs only a single scan for this type of problem, so this plpgsql function should be substantially faster:
CREATE OR REPLACE FUNCTION rank_color(_color text = 'R') -- default 'R'
RETURNS TABLE (grp text, the_date date, color text, rnk int) AS
$func$
DECLARE
_last_grp text;
BEGIN
FOR grp, the_date, color IN
SELECT t.grp, t.the_date, t.color FROM tbl t ORDER BY 1,2
LOOP
IF color = $1 THEN
IF _last_grp = grp THEN
rnk := COALESCE(rnk + 1, 1);
ELSE
rnk := 1;
END IF;
ELSIF rnk > 0 THEN -- minimize assignments
rnk := NULL;
END IF;
RETURN NEXT;
_last_grp := grp;
END LOOP;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM rank_color('R');
Looping is not always the wrong solution in a relational database.
Further reading:
Select longest continuous sequence
GROUP BY and aggregate sequential numeric values
Aside: "rank" is a rather misleading name for those row numbers, unless you have duplicates that are supposed to rank equally ...

First value in DATE minus 30 days SQL

I have a bunch of data out of which I'm showing ID, max date and its corresponding values (user id, type, ...). Then I need to take the MAX date for each ID, subtract 30 days, and show the first date and its corresponding values within this date period.
Example:
ID Date Name
1 01.05.2018 AAA
1 21.04.2018 CCC
1 05.04.2018 BBB
1 28.03.2018 AAA
expected:
ID max_date max_name previous_date previous_name
1 01.05.2018 AAA 05.04.2018 BBB
I have a working solution using subselects, but as the WHERE part is quite huge, the refresh takes ages.
The subselect looks like this:
(SELECT MIN(N.name)
FROM t1 N
WHERE N.ID = T.ID
AND (N.date < MAX(T.date) AND N.date >= (MAX(T.date)-30))
AND (...)) AS PreviousName
How would you write the select?
I'm using T-SQL.
Thanks
I can do this with 2 CTEs to build up the dates and names.
SQL Fiddle
MS SQL Server 2017 Schema Setup:
CREATE TABLE t1 (ID int, theDate date, theName varchar(10)) ;
INSERT INTO t1 (ID, theDate, theName)
VALUES
( 1,'2018-05-01','AAA' )
, ( 1,'2018-04-21','CCC' )
, ( 1,'2018-04-05','BBB' )
, ( 1,'2018-03-27','AAA' )
, ( 2,'2018-05-02','AAA' )
, ( 2,'2018-05-21','CCC' )
, ( 2,'2018-03-03','BBB' )
, ( 2,'2018-01-20','AAA' )
;
Main Query:
;WITH cte1 AS (
SELECT t1.ID, t1.theDate, t1.theName
, DATEADD(day,-30,t1.theDate) AS dMinus30
, ROW_NUMBER() OVER (PARTITION BY t1.ID ORDER BY t1.theDate DESC) AS rn
FROM t1
)
, cte2 AS (
SELECT c2.ID, c2.theDate, c2.theName
, ROW_NUMBER() OVER (PARTITION BY c2.ID ORDER BY c2.theDate) AS rn
, COUNT(*) OVER (PARTITION BY c2.ID) AS theCount
FROM cte1
INNER JOIN cte1 c2 ON cte1.ID = c2.ID
AND c2.theDate >= cte1.dMinus30
WHERE cte1.rn = 1
GROUP BY c2.ID, c2.theDate, c2.theName
)
SELECT cte1.ID, cte1.theDate AS max_date, cte1.theName AS max_name
, cte2.theDate AS previous_date, cte2.theName AS previous_name
, cte2.theCount
FROM cte1
INNER JOIN cte2 ON cte1.ID = cte2.ID
AND cte2.rn=1
WHERE cte1.rn = 1
Results:
| ID | max_date | max_name | previous_date | previous_name |
|----|------------|----------|---------------|---------------|
| 1 | 2018-05-01 | AAA | 2018-04-05 | BBB |
| 2 | 2018-05-21 | CCC | 2018-05-02 | AAA |
cte1 builds the list of max_date and max_name grouped by ID, using a ROW_NUMBER() window function ordered by date descending to find the most recent date per ID. cte2 joins back to that list to get all dates within 30 days of cte1's max date, then does essentially the same thing to find the earliest of those dates. The outer query joins the two results together to get the columns needed, selecting only the most recent and the earliest rows respectively.
I'm not sure how well it will scale with your data, but the CTEs should optimize pretty well.
EDIT: For the additional requirement, I just added in another COUNT() window function to cte2.
I would do:
select id,
max(case when seqnum = 1 then date end) as max_date,
max(case when seqnum = 1 then name end) as max_name,
max(case when seqnum = 2 then date end) as prev_date,
max(case when seqnum = 2 then name end) as prev_name
from (select e.*, row_number() over (partition by id order by date desc) as seqnum
from example e
) e
group by id;

How to count most consecutive occurrences of a value in a Column in SQL Server

I have a table Attendance in my database.
Date | Present
------------------------
20/11/2013 | Y
21/11/2013 | Y
22/11/2013 | N
23/11/2013 | Y
24/11/2013 | Y
25/11/2013 | Y
26/11/2013 | Y
27/11/2013 | N
28/11/2013 | Y
I want to count the most consecutive occurrence of a value Y or N.
For example in the above table Y occurs 2, 4 & 1 times. So I want 4 as my result.
How to achieve this in SQL Server?
Any help will be appreciated.
Try this - for consecutive dates, the difference between the date and a running row number stays constant, so it can be used to group each streak:
Select max(Sequence)
from
(
select present ,count(*) as Sequence,
min(date) as MinDt, max(date) as MaxDt
from (
select t.Present,t.Date,
dateadd(day,
-(row_number() over (partition by present order by date))
,date
) as grp
from Table1 t
) t
group by present, grp
)a
where Present ='Y'
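To illustrate, for the 'Y' rows of the sample data the computed grp values are (traced by hand):
Date       | row_number | grp (date - row_number days)
20/11/2013 | 1          | 19/11/2013
21/11/2013 | 2          | 19/11/2013
23/11/2013 | 3          | 20/11/2013
24/11/2013 | 4          | 20/11/2013
25/11/2013 | 5          | 20/11/2013
26/11/2013 | 6          | 20/11/2013
28/11/2013 | 7          | 21/11/2013
Grouping by present and grp gives run lengths of 2, 4 and 1, and max(Sequence) returns 4.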
You can do this with a recursive CTE:
;WITH cte AS (SELECT Date,Present,ROW_NUMBER() OVER(ORDER BY Date) RN
FROM Table1)
,cte2 AS (SELECT Date,Present,RN,ct = 1
FROM cte
WHERE RN = 1
UNION ALL
SELECT a.Date,a.Present,a.RN,ct = CASE WHEN a.Present = b.Present THEN ct + 1 ELSE 1 END
FROM cte a
JOIN cte2 b
ON a.RN = b.RN+1)
SELECT TOP 1 *
FROM cte2
ORDER BY CT DESC
Note: the dates in the demo were altered due to the format in which you posted them in your question.

SELECT records until new value SQL

I have a table
Val | Number
08 | 1
09 | 1
10 | 1
11 | 3
12 | 0
13 | 1
14 | 1
15 | 1
I need to return the last values where Number = 1 (however many that may be) until Number changes, but do not need the first instances where Number = 1. Essentially I need to select back until Number changes to 0 (15, 14, 13)
Is there a proper way to do this in MSSQL?
Based on the following:
I need to return the last values where Number = 1
Essentially I need to select back until Number changes to 0 (15, 14, 13)
Try (Fiddle demo):
select val, number
from T
where val > (select max(val)
from T
where number<>1)
EDIT: to address all possible combinations (Fiddle demo 2)
;with cte1 as
(
select 1 id, max(val) maxOne
from T
where number=1
),
cte2 as
(
select 1 id, isnull(max(val),0) maxOther
from T
where val < (select maxOne from cte1) and number<>1
)
select val, number
from T cross join
(select maxOne, maxOther
from cte1 join cte2 on cte1.id = cte2.id
) X
where val>maxOther and val<=maxOne
I think you can use window functions, something like this:
with cte as (
-- generate two row_number to enumerate distinct groups
select
Val, Number,
row_number() over(partition by Number order by Val) as rn1,
row_number() over(order by Val) as rn2
from Table1
), cte2 as (
-- get groups with Number = 1 and last group
select
Val, Number,
rn2 - rn1 as rn1, max(rn2 - rn1) over() as rn2
from cte
where Number = 1
)
select Val, Number
from cte2
where rn1 = rn2
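Tracing the sample data by hand: in cte, the Number = 1 rows get rn2 - rn1 differences of 0 (vals 08-10) and 2 (vals 13-15); the maximum difference over the whole set is 2, so cte2's final filter rn1 = rn2 keeps only the last group - vals 13, 14 and 15.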
DEMO: http://sqlfiddle.com/#!3/e7d54/23
DDL
create table T(val int identity(8,1), number int)
insert into T values
(1),(1),(1),(3),(0),(1),(1),(1),(0),(2)
DML
; WITH last_1 AS (
SELECT Max(val) As val
FROM t
WHERE number = 1
)
, last_non_1 AS (
SELECT Coalesce(Max(val), -937) As val
FROM t
WHERE EXISTS (
SELECT val
FROM last_1
WHERE last_1.val > t.val
)
AND number <> 1
)
SELECT t.val
, t.number
FROM t
CROSS
JOIN last_1
CROSS
JOIN last_non_1
WHERE t.val <= last_1.val
AND t.val > last_non_1.val
I know it's a little verbose, but I've deliberately kept it that way to illustrate the methodology.
1. Find the highest val where number = 1.
2. For all values where val is less than the value found in step 1, find the largest val where number <> 1.
3. Finally, find the rows that fall within the values we uncovered in steps 1 & 2.
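Worked through the sample data: step 1 gives val = 15, step 2 gives val = 12, and step 3 then returns the rows with val 13, 14 and 15.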
select val, count(number)
from yourtable
group by val
having count(number) > 1
The having clause is the key here, giving you all the vals that have more than one value of 1.
This is a common approach for getting rows until some value changes. For your specific case, use desc in the proper spots.
Create sample table
select * into #tmp from
(select 1 as id, 'Alpha' as value union all
select 2 as id, 'Alpha' as value union all
select 3 as id, 'Alpha' as value union all
select 4 as id, 'Beta' as value union all
select 5 as id, 'Alpha' as value union all
select 6 as id, 'Gamma' as value union all
select 7 as id, 'Alpha' as value) t
Pull top rows until value changes:
with cte as (select * from #tmp t)
select * from
(select cte.*, ROW_NUMBER() over (order by id) rn from cte) OriginTable
inner join
(
select cte.*, ROW_NUMBER() over (order by id) rn from cte
where cte.value = (select top 1 cte.value from cte order by cte.id)
) OnlyFirstValueRecords
on OriginTable.rn = OnlyFirstValueRecords.rn and OriginTable.id = OnlyFirstValueRecords.id
On the left side we put an original table. On the right side we put only rows whose value is equal to the value in first line.
Records in both sets will be the same until the target value changes. After row #3, the row numbers become associated with different IDs because of the offset, so they will never join with the original table:
LEFT RIGHT
ID Value RN ID Value RN
1 Alpha 1 | 1 Alpha 1
2 Alpha 2 | 2 Alpha 2
3 Alpha 3 | 3 Alpha 3
----------------------- result set ends here
4 Beta 4 | 5 Alpha 4
5 Alpha 5 | 7 Alpha 5
6 Gamma 6 |
7 Alpha 7 |
The ID must be unique. Ordering by this ID must be the same in both ROW_NUMBER() functions.