Aggregating rows in SQL with missing Booleans - sql

I have the below SQL script which returns the following data from a PostgreSQL DB view table.
SELECT
"V_data".macaddr,
"V_data".sensorid,
"V_data".ts,
"V_data".velocity,
"V_data".temp,
"V_data".highspeed,
"V_data".hightemp,
"V_data".distance
FROM
sensordb."V_data"
WHERE
"V_data".macaddr like '%abcdef'
AND
(
("V_data".sensorid = 'abc1') or ("V_data".sensorid = 'a2bc') or ("V_data".sensorid = 'ab3c')
)
AND
"V_data".ts >= 1616370867000
ORDER BY
"V_data".ts DESC;
Output
macaddr sensorid ts velocity temp highspeed hightemp distance
abcdef abc1 1616370867010 25 32 52
abcdef a2bc 1616370867008 27 35 T 51
abcdef ab3c 1616370867006 26 30 50
abcdef abc1 1616370867005 24 36 T 50
abcdef a2bc 1616370867004 27 31 50
abcdef abc1 1616370867002 21 30 T 48
abcdef ab3c 1616370867000 22 33 F 46
I want to aggregate the rows such that I have the latest readings per sensorid for ts, velocity, temp, distance.
For the Booleans highspeed and hightemp, I want the latest available Boolean value or an empty cell if no Boolean value was available.
Expected output
macaddr sensorid ts velocity temp highspeed hightemp distance
abcdef abc1 1616370867010 25 32 T T 52
abcdef a2bc 1616370867008 27 35 T 51
abcdef ab3c 1616370867006 26 30 F 50
How could I simplify this task?
Thanks.

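You can use DISTINCT ON (a PostgreSQL extension, as far as I know) to simplify this query: take the latest row per sensor, then left join the latest row that has a non-null value for each Boolean. You can do: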
with
q as (
-- your query here
)
select
l.macaddr, l.sensorid, l.ts, l.velocity, l.temp,
s.highspeed, t.hightemp,
l.distance
from (
select distinct on (sensorid) *
from q
order by sensorid, ts desc
) l
left join (
select distinct on (sensorid) *
from q
where highspeed is not null
order by sensorid, ts desc
) s on s.sensorid = l.sensorid
left join (
select distinct on (sensorid) *
from q
where hightemp is not null
order by sensorid, ts desc
) t on t.sensorid = l.sensorid

Hmmm . . . For all but the boolean columns DISTINCT ON would work. But those booleans are tricky. You could use some tricks on booleans.
Instead, let's go for ROW_NUMBER() to get the most recent row. And fiddle with arrays to get the most recent boolean values:
SELECT d.macaddr, d.sensorid,
MAX(d.ts) as ts,
MAX(d.velocity) FILTER (WHERE seqnum = 1) as velocity,
MAX(d.temp) FILTER (WHERE seqnum = 1) as temp,
-- most recent non-NULL value of each Boolean, or NULL if there is none
(ARRAY_REMOVE(ARRAY_AGG(d.highspeed ORDER BY d.ts DESC), NULL))[1] as highspeed,
(ARRAY_REMOVE(ARRAY_AGG(d.hightemp ORDER BY d.ts DESC), NULL))[1] as hightemp,
MAX(d.distance) FILTER (WHERE seqnum = 1) as distance
FROM (SELECT d.*,
ROW_NUMBER() OVER (PARTITION BY d.macaddr, d.sensorid ORDER BY ts DESC) as seqnum
FROM sensordb."V_data" d
WHERE d.macaddr like '%abcdef' AND
d.sensorid IN ('abc1', 'a2bc', 'ab3c') AND
d.ts >= 1616370867000
) d
GROUP BY d.macaddr, d.sensorid
ORDER BY MAX(d.ts) DESC;
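The result that both answers aim for can also be sketched outside the database. This is a plain-Python illustration of the intended semantics, not a replacement for either query. The helper name `latest_per_sensor` is hypothetical, and which of `highspeed` or `hightemp` each sample Boolean belongs to is an assumption made for the example.

```python
COLUMNS = ("sensorid", "ts", "velocity", "temp", "highspeed", "hightemp", "distance")


def latest_per_sensor(rows, flags=("highspeed", "hightemp")):
    """Return {sensorid: row}: the newest row per sensor, with each flag
    back-filled from the newest older row where that flag is not None."""
    result = {}
    # Walk from newest to oldest, so the first row seen per sensor is the latest.
    for row in sorted(rows, key=lambda r: r["ts"], reverse=True):
        latest = result.setdefault(row["sensorid"], dict(row))
        for flag in flags:
            if latest[flag] is None and row[flag] is not None:
                latest[flag] = row[flag]
    return result


# Sample rows; the column placement of the Booleans is assumed.
sample = [dict(zip(COLUMNS, values)) for values in [
    ("abc1", 1616370867010, 25, 32, None, None, 52),
    ("ab3c", 1616370867006, 26, 30, None, None, 50),
    ("abc1", 1616370867005, 24, 36, None, True, 50),
    ("abc1", 1616370867002, 21, 30, True, None, 48),
    ("ab3c", 1616370867000, 22, 33, False, None, 46),
]]

latest = latest_per_sensor(sample)
for sensorid, row in latest.items():
    print(sensorid, row)
```

A flag stays `None` (an empty cell) when no row for that sensor ever carried a value, which matches the requirement in the question.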

Related

PostgreSQL: Set the max value in the column +1 if the following is NULL

I'm new to SQL. I know this question is a common one, but none of the proposed solutions have worked for me.
So, I have a table
ratingId | userId
1 | 1
2 | 2
NULL | 3
NULL | 4
Now I want to replace each NULL ratingId with 3, 4, and so on: the maximum non-NULL ratingId plus 1, incremented for each further row.
I have tried several approaches; the most commonly suggested one uses max().
My current code is
SELECT
COALESCE(rating.id, max(rating.id)) AS id,
but it does not work :( I still have NULL values.
Any suggestions, please?
Does this do what you want?
select coalesce(ratingId,
coalesce(max(ratingId) over (), 0) +
count(*) filter (where ratingId is null) over (order by userid)
) as imputed_ratingId
An equivalent phrasing is:
select coalesce(ratingId,
coalesce(max(ratingId) over (), 0) +
row_number() over (partition by ratingId order by userid)
) as imputed_ratingId
These provide a unique ratingId for the rows where it is NULL, with incremental values over the previous maximum.
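A minimal sketch of the second phrasing, run against the four sample rows from the question. It uses SQLite through Python's `sqlite3` module only so that it is self-contained (window functions need SQLite 3.25 or later); the table name `rating` is an assumption, and the original question targets PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table rating (ratingId integer, userId integer)")
conn.executemany(
    "insert into rating values (?, ?)",
    [(1, 1), (2, 2), (None, 3), (None, 4)],
)

# NULL ratingIds form their own partition, so row_number() numbers them
# 1, 2, ... in userId order; adding the overall maximum continues the series.
imputed = conn.execute("""
    select coalesce(ratingId,
                    coalesce(max(ratingId) over (), 0) +
                    row_number() over (partition by ratingId order by userId)
                   ) as imputed_ratingId,
           userId
    from rating
    order by userId
""").fetchall()

print(imputed)
```

The ordering inside the window has to use a column that is not NULL for the rows being filled (here `userId`); ordering by the NULL column itself would make all of those rows peers.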
I was in a hurry to answer. NULL is replaced with 64, but should start from 61.
ratingId | userId
1 | 1
2 | 2
.........|.......
60 | 60
64 | 61 // this row should have ratingId 61 in place of the NULL
64 | 62 // this row should have ratingId 62 in place of the NULL
Here is the raw SQL:
SELECT
coalesce(r.id,
coalesce(max(r.id) over (), 0) +
count(*) filter (where r.id is null) over (order by r.Id)
) as id,
r2.seqnum AS position,
coalesce(r3.avg, 0) AS avg,
r3."avgPosition",
u.id AS "userId"
FROM ("user" u
CROSS JOIN "userRole" ur
LEFT JOIN rating r
JOIN (SELECT
r2_1.id,
r2_1."userId",
r2_1."userRoleId",
r2_1."performerRatingGroupId",
r2_1.value,
row_number() OVER (PARTITION BY r2_1."userRoleId", r2_1."performerRatingGroupId" ORDER BY r2_1.value DESC) AS seqnum
FROM rating r2_1
) r2 ON ((r2.id = r.id))
JOIN (SELECT
r3_1.id,
r3_2.avg,
dense_rank() OVER (ORDER BY r3_2.avg) AS "avgPosition"
FROM
(rating r3_1
JOIN (SELECT
rating.id,
round(avg(rating.value) OVER (PARTITION BY rating."userId" ORDER BY rating."userId")) AS avg
FROM rating
) r3_2 ON ((r3_1.id = r3_2.id))
)
) r3 ON ((r3.id = r.id))
ON u.id = r."userId" AND ur.id = r."userRoleId"
)
GROUP BY
r.id,
r2.seqnum,
r3.avg,
r3."avgPosition",
u.id

stratified sample on ranges

I have table_1, that has data such as:
Range Start Range End Frequency
10 20 90
20 30 68
30 40 314
40 40 191 (here, it means we have just 40 as data point repeating 191 times)
table_2:
group value
10 56.1
10 88.3
20 53
20 20
30 55
I need a stratified sample from table_2 based on the ranges in table_1. table_2 can have millions of rows, but the result should be restricted to just 10k points.
I tried the query below:
SELECT
d.*
FROM
(
SELECT
ROW_NUMBER() OVER(
PARTITION BY group
ORDER BY group
) AS seqnum,
COUNT(*) OVER() AS ct,
COUNT(*) OVER(PARTITION BY group) AS cpt,
group, value
FROM
table_2 d
) d
WHERE
seqnum < 10000 * ( cpt * 1.0 / ct )
but I am a bit confused about how the analytic functions are used here.
Expecting 10k records as a stratified sample from table_2:
Result table:
group value
10 56.1
20 53
20 20
30 55
If you need at least one record from each group plus further records chosen at random, try this:
SELECT GROUP, VALUE FROM
(SELECT T2.GROUP, T2.VALUE,
ROW_NUMBER()
OVER (PARTITION BY T2.GROUP ORDER BY NULL) AS RN
FROM TABLE_1 T1
JOIN TABLE_2 T2
ON(T1.RANGE = T2.GROUP))
WHERE RN = 1 OR
CASE WHEN RN > 1
AND RN = CEIL(DBMS_RANDOM.VALUE(1,RN))
THEN 1 END = 1
FETCH FIRST 10000 ROWS ONLY;
Here, row numbers are assigned within each group in arbitrary order (ORDER BY NULL). The result keeps row number 1 of each group, plus any other row whose number passes the random condition.
Cheers!!
If I understand what you want - which is by no means certain - then I think you want to get a maximum of 10000 rows, with the number of group values proportional to the frequencies. So you can get the number of rows you want from each range with:
select range_start, range_end, frequency,
frequency/sum(frequency) over () as proportion,
floor(10000 * frequency/sum(frequency) over ()) as limit
from table_1;
RANGE_START RANGE_END FREQUENCY PROPORTION LIMIT
----------- ---------- ---------- ---------- ----------
10 20 90 .135746606 1357
20 30 68 .102564103 1025
30 40 314 .473604827 4736
40 40 191 .288084465 2880
Those limits don't quite add up to 10000; you could go slightly above with ceil instead of floor.
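As a quick check of that arithmetic, here is a small sketch that recomputes the per-range limits from the four sample frequencies. It runs on SQLite through Python's `sqlite3` module (3.25 or later for the window function) rather than on the original database, and the alias `row_limit` is an assumption, chosen because `limit` is a keyword in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table table_1 (range_start integer, range_end integer, frequency integer)"
)
conn.executemany(
    "insert into table_1 values (?, ?, ?)",
    [(10, 20, 90), (20, 30, 68), (30, 40, 314), (40, 40, 191)],
)

# cast(... as integer) truncates, which equals floor() for positive values.
limits = conn.execute("""
    select range_start, range_end,
           cast(10000.0 * frequency / sum(frequency) over () as integer) as row_limit
    from table_1
    order by range_start, range_end
""").fetchall()

total = sum(row_limit for _, _, row_limit in limits)
print(limits, total)
```

The total falls two short of 10000 because each of the four limits is rounded down.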
You can then assign a nominal row number to each entry in table_2 based on which range it is in, and then restrict the number of rows from that range via that limit:
with cte1 (range_start, range_end, limit) as (
select range_start, range_end, floor(10000 * frequency/sum(frequency) over ())
from table_1
),
cte2 (grp, value, limit, rn) as (
select t2.grp, t2.value, cte1.limit,
row_number() over (partition by cte1.range_start order by t2.value) as rn
from cte1
join table_2 t2
on (cte1.range_end > cte1.range_start and t2.grp >= cte1.range_start and t2.grp < cte1.range_end)
or (cte1.range_end = cte1.range_start and t2.grp = cte1.range_start)
)
select grp, value
from cte2
where rn <= limit;
...
9998 rows selected.
I've used order by t2.value in the row_number() call because it isn't clear how you want to pick which rows in the range you actually want; you might want to order by dbms_random.value or something else.
db<>fiddle with some artificial data.

How can I select top 3 for each group based on another column in sqlite?

I'm trying to get top 3 most profitable UserIDs in each country in one table using sqlite. I'm not sure where to use LIMIT 3.
Here is the table I have:
Country | UserID | Profit
US 1 100
US 12 98
US 13 10
US 5 8
US 2 5
IR 9 95
IR 3 90
IR 8 70
IR 4 56
IR 15 40
the result should look like this:
Country | UserID | Profit
US 1 100
US 12 98
US 13 10
IR 9 95
IR 3 90
IR 8 70
One pretty simple method is:
select t.*
from t
where t.profit >= (select t2.profit
from t t2
where t2.country = t.country
order by t2.profit desc
limit 1 offset 2
);
This assumes at least three records for each country. You can get around that with coalesce():
select t.*
from t
where t.profit >= coalesce((select t2.profit
from t t2
where t2.country = t.country
order by t2.profit desc
limit 1 offset 2
), t.profit
);
If your SQLite version does not support window functions (they were added in SQLite 3.25.0), you can compute a per-country sequence number in a correlated subquery and then keep the top 3.
You can try this query:
select t.Country,t.UserID,t.Profit
from(
select t.*,
(select count(*)
from T t2
where t2.Country = t.Country and t2.Profit >= t.Profit
) as seqnum
from T t
)t
where t.seqnum <=3
sqlfiddle:https://www.db-fiddle.com/f/tmNhRLGG2oKqCKXJEDsjfe/0
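For a quick local check, the same correlated-count idea can be run with Python's `sqlite3` module against the sample rows from the question. This is only a sketch: the table name `profits` and the aliases `p`, `p2` and `ranked` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table profits (Country text, UserID integer, Profit integer)")
conn.executemany(
    "insert into profits values (?, ?, ?)",
    [
        ("US", 1, 100), ("US", 12, 98), ("US", 13, 10), ("US", 5, 8), ("US", 2, 5),
        ("IR", 9, 95), ("IR", 3, 90), ("IR", 8, 70), ("IR", 4, 56), ("IR", 15, 40),
    ],
)

# seqnum = number of rows in the same country with an equal or higher profit.
top3 = conn.execute("""
    select ranked.Country, ranked.UserID, ranked.Profit
    from (
        select p.*,
               (select count(*)
                from profits p2
                where p2.Country = p.Country and p2.Profit >= p.Profit
               ) as seqnum
        from profits p
    ) ranked
    where ranked.seqnum <= 3
    order by ranked.Country, ranked.Profit desc
""").fetchall()

for row in top3:
    print(row)
```

Rows tied on Profit share a sequence number, so a tie at the cutoff can change how many rows come back for a country.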
LIMIT won't be useful on its own, as it applies to the whole result set.
I would create an auxiliary column "CountryRank" like this:
SELECT *, (SELECT COUNT() FROM Data AS d WHERE d.Country=Data.Country AND d.Profit>Data.Profit)+1 AS CountryRank
FROM Data;
And query on that result:
SELECT Country, UserID, Profit
FROM (
SELECT *, (SELECT COUNT() FROM Data AS d WHERE d.Country=Data.Country AND d.Profit>Data.Profit)+1 AS CountryRank FROM Data)
WHERE CountryRank<=3
ORDER BY Country, CountryRank;

Select distinct for one column

I have a compound primary key where the single parts are potentially random. They aren't in any particular order and one can be unique or they can be all the same.
I do not care which row I get. This is like "Just pick one from each group".
My table:
KeyPart1 KeyPart2 KeyPart3 colA colB colD
11 21 39 d1
11 22 39 d2
12 21 39 d2
12 22 39 d3
13 21 38 d3
13 22 38 d5
Now what I want is to get for each entry in colD one row, I do not care which one.
KeyPart1 KeyPart2 KeyPart3 colA colB colD
11 21 39 d1
12 21 39 d2
12 22 39 d3
13 22 38 d5
Where several rows share a colD value, you will have to decide which of them to keep and which to discard. Here, within the over clause I have used partition by colD, which provides the wanted uniqueness by that column, but the order by is arbitrary and you may want to change it to suit your needs.
select
d.*
from (
select
t.*
, row_number() over (partition by t.colD
order by t.KeyPart1,t.KeyPart2,t.KeyPart3) as rn
from yourtable t
) d
where d.rn = 1;
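To see which rows this picks, here is a sketch that runs the same row_number() query on the sample data. It uses SQLite via Python's `sqlite3` module (3.25 or later) instead of the original database, purely so it is self-contained; the empty colA and colB cells are stored as NULL, which is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table yourtable (
        KeyPart1 integer, KeyPart2 integer, KeyPart3 integer,
        colA text, colB text, colD text
    )
""")
conn.executemany(
    "insert into yourtable values (?, ?, ?, ?, ?, ?)",
    [
        (11, 21, 39, None, None, "d1"),
        (11, 22, 39, None, None, "d2"),
        (12, 21, 39, None, None, "d2"),
        (12, 22, 39, None, None, "d3"),
        (13, 21, 38, None, None, "d3"),
        (13, 22, 38, None, None, "d5"),
    ],
)

# rn = 1 marks the first row of each colD group in key order.
picked = conn.execute("""
    select d.KeyPart1, d.KeyPart2, d.KeyPart3, d.colD
    from (
        select t.*,
               row_number() over (partition by t.colD
                                  order by t.KeyPart1, t.KeyPart2, t.KeyPart3) as rn
        from yourtable t
    ) d
    where d.rn = 1
    order by d.colD
""").fetchall()

for row in picked:
    print(row)
```

With this ordering the d2 group yields (11, 22, 39) rather than the (12, 21, 39) shown in the question, which is fine since any row per group is acceptable.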
The following should work in almost any version of DB2:
select t.*
from (select t.*,
row_number() over (partition by KeyPart1, KeyPart2
order by KeyPart1
) as seqnum
from t
) t
where seqnum = 1;
If you only care about column d, and the first two key parts, then you can use group by:
select KeyPart1, KeyPart2, min(colD)
from t
group by KeyPart1, KeyPart2;
Change 'order by' if necessary
with D as (
select distinct colD from yourtable
)
select Y.* from D
inner join lateral
(
select * from yourtable X
where X.colD = D.colD
order by X.KeyPart1, X.KeyPart2, X.KeyPart3
fetch first 1 row only
) Y on 1=1

Retrieve last known value for each column of a row

Not sure about the correct words to ask this question, so I will break it down.
I have a table as follows:
date_time | a | b | c
Last 4 rows:
15/10/2013 11:45:00 | null | 'timtim' | 'fred'
15/10/2013 13:00:00 | 'tune' | 'reco' | null
16/10/2013 12:00:00 | 'abc' | null | null
16/10/2013 13:00:00 | null | 'died' | null
How would I get the last record, but with each null replaced by the most recent non-null value from the earlier records?
In my provided example the row returned would be
16/10/2013 13:00:00 | 'abc' | 'died' | 'fred'
As you can see if the value for a column is null then it goes to the last record which has a value for that column and uses that value.
This should be possible, I just can't figure it out. So far I have only come up with:
select
last_value(a) over w a
from test
WINDOW w AS (
partition by a
ORDER BY ts asc
range between current row and unbounded following
);
But this only caters for a single column ...
Here I create an aggregation function that collects columns into arrays. Then it is just a matter of removing the NULLs and selecting the last element from each array.
Sample Data
CREATE TABLE T (
date_time timestamp,
a text,
b text,
c text
);
INSERT INTO T VALUES ('2013-10-15 11:45:00', NULL, 'timtim', 'fred'),
('2013-10-15 13:00:00', 'tune', 'reco', NULL ),
('2013-10-16 12:00:00', 'abc', NULL, NULL ),
('2013-10-16 13:00:00', NULL, 'died', NULL );
Solution
CREATE AGGREGATE array_accum (anyelement)
(
sfunc = array_append,
stype = anyarray,
initcond = '{}'
);
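WITH latest_nonull AS (
SELECT MAX(date_time) As MaxDateTime,
-- ORDER BY inside each aggregate call fixes the order of the collected elements
array_remove(array_accum(a ORDER BY date_time), NULL) AS A,
array_remove(array_accum(b ORDER BY date_time), NULL) AS B,
array_remove(array_accum(c ORDER BY date_time), NULL) AS C
FROM T
)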
SELECT MaxDateTime, A[array_upper(A, 1)], B[array_upper(B,1)], C[array_upper(C,1)]
FROM latest_nonull;
Result
maxdatetime | a | b | c
---------------------+-----+------+------
2013-10-16 13:00:00 | abc | died | fred
(1 row)
Order of rows
The "last row" and the sort order would need to be defined unambiguously. There is no natural order in a set (or a table). I am assuming ORDER BY ts, where ts is the timestamp column.
Like @Jorge pointed out in his comment: If ts is not UNIQUE, one needs to define tiebreakers for the sort order to make it unambiguous (add more items to ORDER BY). A primary key would be the ultimate solution.
General solution with window functions
To get a result for every row:
SELECT ts
, max(a) OVER (PARTITION BY grp_a) AS a
, max(b) OVER (PARTITION BY grp_b) AS b
, max(c) OVER (PARTITION BY grp_c) AS c
FROM (
SELECT *
, count(a) OVER (ORDER BY ts) AS grp_a
, count(b) OVER (ORDER BY ts) AS grp_b
, count(c) OVER (ORDER BY ts) AS grp_c
FROM t
) sub;
How?
The aggregate function count() ignores NULL values when counting. Used as aggregate-window function, it computes the running count of a column according to the default window definition, which is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. NULL values don't increase the count, so these rows fall into the same peer group as the last non-null value.
In a second window function, the only non-null value per group is easily extracted with max() or min().
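The two steps can be tried on the four sample rows from the question. The sketch below runs the query on SQLite through Python's `sqlite3` module (3.25 or later) instead of PostgreSQL, only to keep it self-contained; storing `ts` as ISO-formatted text and the final `ORDER BY ts` are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (ts text, a text, b text, c text)")
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [
        ("2013-10-15 11:45:00", None, "timtim", "fred"),
        ("2013-10-15 13:00:00", "tune", "reco", None),
        ("2013-10-16 12:00:00", "abc", None, None),
        ("2013-10-16 13:00:00", None, "died", None),
    ],
)

# Step 1: the running count(col) stays flat across NULLs, so each NULL row
#         lands in the group of the last non-null value before it.
# Step 2: max() over that group returns its single non-null value.
filled = conn.execute("""
    select ts,
           max(a) over (partition by grp_a) as a,
           max(b) over (partition by grp_b) as b,
           max(c) over (partition by grp_c) as c
    from (
        select *,
               count(a) over (order by ts) as grp_a,
               count(b) over (order by ts) as grp_b,
               count(c) over (order by ts) as grp_c
        from t
    ) sub
    order by ts
""").fetchall()

for row in filled:
    print(row)
```

The first row keeps a NULL in column a, because no earlier non-null value exists to carry forward.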
Just the last row
WITH cte AS (
SELECT *
, count(a) OVER w AS grp_a
, count(b) OVER w AS grp_b
, count(c) OVER w AS grp_c
FROM t
WINDOW w AS (ORDER BY ts)
)
SELECT ts
, max(a) OVER (PARTITION BY grp_a) AS a
, max(b) OVER (PARTITION BY grp_b) AS b
, max(c) OVER (PARTITION BY grp_c) AS c
FROM cte
ORDER BY ts DESC
LIMIT 1;
Simple alternatives for just the last row
SELECT ts
,COALESCE(a, (SELECT a FROM t WHERE a IS NOT NULL ORDER BY ts DESC LIMIT 1)) AS a
,COALESCE(b, (SELECT b FROM t WHERE b IS NOT NULL ORDER BY ts DESC LIMIT 1)) AS b
,COALESCE(c, (SELECT c FROM t WHERE c IS NOT NULL ORDER BY ts DESC LIMIT 1)) AS c
FROM t
ORDER BY ts DESC
LIMIT 1;
Or:
SELECT (SELECT ts FROM t ORDER BY ts DESC LIMIT 1) AS ts
,(SELECT a FROM t WHERE a IS NOT NULL ORDER BY ts DESC LIMIT 1) AS a
,(SELECT b FROM t WHERE b IS NOT NULL ORDER BY ts DESC LIMIT 1) AS b
,(SELECT c FROM t WHERE c IS NOT NULL ORDER BY ts DESC LIMIT 1) AS c
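The scalar-subquery variant needs no window functions, so it is easy to try almost anywhere. A minimal sketch on the question's sample rows, using SQLite via Python's `sqlite3` module as a stand-in for PostgreSQL; storing `ts` as ISO-formatted text is an assumption of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (ts text, a text, b text, c text)")
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [
        ("2013-10-15 11:45:00", None, "timtim", "fred"),
        ("2013-10-15 13:00:00", "tune", "reco", None),
        ("2013-10-16 12:00:00", "abc", None, None),
        ("2013-10-16 13:00:00", None, "died", None),
    ],
)

# One independent lookup per column: the newest row where that column is set.
last_known = conn.execute("""
    select (select ts from t order by ts desc limit 1) as ts,
           (select a from t where a is not null order by ts desc limit 1) as a,
           (select b from t where b is not null order by ts desc limit 1) as b,
           (select c from t where c is not null order by ts desc limit 1) as c
""").fetchone()

print(last_known)
```

Each subquery can use its own index on (ts) or on the column in question, which is part of why this form tends to be cheap for a single result row.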
Performance
While this should be decently fast, if performance is your paramount requirement, consider a plpgsql function. Start with the last row and loop descending until you have a non-null value for every column required. A related answer along these lines: "GROUP BY and aggregate sequential numeric values".
This should work, but keep in mind it is an ugly solution:
select * from
(select dt from
(select rank() over (order by ctid desc) idx, dt
from sometable ) cx
where idx = 1) dtz,
(
select a from
(select rank() over (order by ctid desc) idx, a
from sometable where a is not null ) ax
where idx = 1) az,
(
select b from
(select rank() over (order by ctid desc) idx, b
from sometable where b is not null ) bx
where idx = 1) bz,
(
select c from
(select rank() over (order by ctid desc) idx, c
from sometable where c is not null ) cx
where idx = 1) cz
See it here at fiddle: http://sqlfiddle.com/#!15/d5940/40
The result will be
DT A B C
October, 16 2013 00:00:00+0000 abc died fred