How to get the count of the modal value in PostgreSQL?

I have calculated the modal value of a column in a JOIN.
mode() WITHIN GROUP (ORDER BY col) AS modal_col
I would also like the frequency of the modal value, i.e. how often does this value appear?
I have tried simply nesting this in the count function, but Postgres does not allow it:
count(mode() WITHIN GROUP (ORDER BY col))
ERROR: aggregate function calls cannot be nested
I have also tried:
row_number() OVER (PARTITION BY id ORDER BY count(*))
I have also tried the RANK() function, but these simply give me row numbers.
I would like a simple count of the occurrences of the modal value.
Input

 id  | col
-----+-----
 id1 | a
 id1 | a
 id1 | b
 id2 | a
 id2 | a
 id3 | a
 id3 | b
 id3 | c
 id3 | c
 id3 | c
 id3 | c
Output

 id  | col_mode | mode_count
-----+----------+------------
 id1 | a        | 2
 id2 | a        | 3
 id3 | c        | 4
EDIT
SELECT DISTINCT ON (t1.id)
t1.id,
mode() WITHIN GROUP (ORDER BY t2.col) AS modal_col,
count(*) OVER(PARTITION BY t1.id, t2.col) AS mode_count
FROM schema.foo t2
JOIN schema.bar t1
ON t2.id2 = t1.id2
ORDER BY t1.id, count(*) OVER(PARTITION BY t1.id, t2.col) DESC
;
Thanks to Danny for the pointer.
I tried the above, and Postgres errors, requiring that I group by both t1.id AND t2.col. Do I need to create an intermediary scratch table, as I do not want to group by both columns, just t1.id?

select distinct on (id)
id
,col as col_mode
,count(*) over(partition by id, col) as mode_count
from t
order by id, count(*) over(partition by id, col) desc
 id  | col_mode | mode_count
-----+----------+------------
 id1 | a        | 2
 id2 | a        | 2
 id3 | c        | 4
Here's another answer using mode() WITHIN GROUP:
select id
,mode() WITHIN GROUP (order by col) as col_mode
,max(mode_count) as mode_count
from
(
select *
,count(*) over(partition by id, col) as mode_count
from t
) t
group by id
Fiddle
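If you want to sanity-check this logic outside Postgres, the same per-group mode and its count can be reproduced with a grouped subquery plus ROW_NUMBER(), which the following sketch runs through SQLite via Python's sqlite3 (SQLite 3.25+ for window functions; table and column names match the example data, and ties are broken by col, similar to how mode() takes the first value):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (id TEXT, col TEXT);
    INSERT INTO t VALUES
        ('id1','a'),('id1','a'),('id1','b'),
        ('id2','a'),('id2','a'),
        ('id3','a'),('id3','b'),('id3','c'),('id3','c'),('id3','c'),('id3','c');
""")

rows = con.execute("""
    SELECT id, col AS col_mode, cnt AS mode_count
    FROM (
        -- count each (id, col) pair, then rank the counts per id
        SELECT id, col, cnt,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY cnt DESC, col) AS rn
        FROM (SELECT id, col, COUNT(*) AS cnt FROM t GROUP BY id, col)
    )
    WHERE rn = 1
    ORDER BY id
""").fetchall()
print(rows)  # [('id1', 'a', 2), ('id2', 'a', 2), ('id3', 'c', 4)]
```

Note the result agrees with the answer's output above (id2 has only two 'a' rows in the sample input).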

Related

create a new column that contains a list of values from another column's subsequent rows

I have a table like below,
and want to create a new column that contains a list of values from another column's subsequent rows, like below.
For copy-paste:
timestamp ID Value
2021-12-03 04:03:45 ID1 O
2021-12-03 04:03:46 ID1 P
2021-12-03 04:03:47 ID1 Q
2021-12-03 04:03:48 ID1 R
2021-12-03 04:03:49 ID1 NULL
2021-12-03 04:03:50 ID1 S
2021-12-03 04:03:51 ID1 T
2021-12-04 11:09:03 ID2 A
2021-12-04 11:09:04 ID2 B
2021-12-04 11:09:05 ID2 C
Using window functions and a range JOIN:
WITH cte AS (
SELECT tab.*,
COALESCE(FIRST_VALUE(CASE WHEN VALUE IS NULL THEN tmp END) IGNORE NULLS
OVER(PARTITION BY ID ORDER BY TMP
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
,MAX(tmp) OVER(PARTITION BY ID)) AS next_tmp
FROM tab
)
SELECT c1.tmp, c1.id, c1.value,
LISTAGG(c2.value, ',') WITHIN GROUP(ORDER BY c2.tmp) AS list
FROM cte c1
LEFT JOIN cte c2
ON c1.ID = c2.ID
AND (c1.tmp < c2.tmp AND c2.tmp <= c1.next_tmp)
GROUP BY c1.tmp, c1.id, c1.value
ORDER BY c1.ID, c1.tmp;
db<>fiddle demo
How does it work:
The idea is to find the first timestamp corresponding to a NULL value for each ID:
SELECT tab.*,
COALESCE(FIRST_VALUE(CASE WHEN VALUE IS NULL THEN tmp END) IGNORE NULLS
OVER(PARTITION BY ID ORDER BY TMP
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
, MAX(tmp) OVER(PARTITION BY ID)) AS next_tmp
FROM tab;
The rules are complex and this wasn't easy to code, but a triple join, rules for nulls, and removing the first element can produce the desired results:
with data as (
select (x[0]||' '||x[1])::timestamp ts, x[2]::string id, iff(x[3]='NULL', null, x[3])::string value
from (
select split(value, ' ') x
from table(split_to_table($$2021-12-03 04:03:45 ID1 O
2021-12-03 04:03:46 ID1 P
2021-12-03 04:03:47 ID1 Q
2021-12-03 04:03:48 ID1 R
2021-12-03 04:03:49 ID1 NULL
2021-12-03 04:03:50 ID1 S
2021-12-03 04:03:51 ID1 T
2021-12-04 11:09:03 ID2 A
2021-12-04 11:09:04 ID2 B
2021-12-04 11:09:05 ID2 C$$, '\n'))
))
select ts, id, value, iff( -- return null for null values
value is null
, null
, array_to_string(
array_slice( -- remove first element
array_agg(bvalue) within group (order by bts)
, 1, 99999)
, ',')
) list
from (
select a.*, b.ts bts, b.value bvalue
, coalesce( -- find max null after current value, or max value if none
(
select max(ts)
from data
where a.id=id
and value is null
and a.ts<ts
),
(
select max(ts)
from data
where a.id=id
)) maxts
from data a
join data b
on a.id=b.id
and a.ts<=b.ts
where maxts >= b.ts
)
group by id, ts, value
order by id, ts
Data that is ordered by TMP is also logically ordered by ID.
So, you can
group rows first by ID;
in each group, create a new group when the previous VALUE is null;
in each subgroup, use the comma to join up VALUEs from the second to the last non-null VALUE to form a sequence and make it the value of the new column LIST.
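Since the expected output above was posted as an image, here is one plain-Python reading of these steps: for each row, collect the following values within the same ID up to the next NULL, and give NULL-valued rows a NULL list, matching the "return null for null values" rule in the query above. This is an illustrative sketch, not the SPL program:

```python
from itertools import groupby

rows = [  # (timestamp, id, value), already ordered by id then timestamp
    ("04:03:45", "ID1", "O"), ("04:03:46", "ID1", "P"),
    ("04:03:47", "ID1", "Q"), ("04:03:48", "ID1", "R"),
    ("04:03:49", "ID1", None), ("04:03:50", "ID1", "S"),
    ("04:03:51", "ID1", "T"), ("11:09:03", "ID2", "A"),
    ("11:09:04", "ID2", "B"), ("11:09:05", "ID2", "C"),
]

out = []
for _, grp in groupby(rows, key=lambda r: r[1]):  # step 1: group by ID
    grp = list(grp)
    for i, (ts, rid, val) in enumerate(grp):
        # step 2: walk forward, collecting values until the next NULL boundary
        vals = []
        for _, _, v in grp[i + 1:]:
            if v is None:
                break
            vals.append(v)
        # step 3: comma-join; NULL rows (and empty collections) yield None
        lst = None if val is None else (",".join(vals) or None)
        out.append((ts, rid, val, lst))

print(out[0])  # ('04:03:45', 'ID1', 'O', 'P,Q,R')
```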
A SQL set is unordered, which makes the computing process very complicated.
You need to first create a marker column using a window function, perform a self-join on the marker column, then group rows and join up the VALUE values to get the desired result.
A common alternative is to fetch the original data out of the database and process it in Python or SPL. SPL, an open-source Java package, is easy to integrate into a Java program and generates much simpler code. It can get this done with only two lines:
   A
1  =ORACLE.query("SELECT * FROM TAB ORDER BY 1")
2  =A1.group#o(#2).conj(~.group#i(#3[-1]==null).run(tmp=~.(#3).select(~),~=~.derive(tmp.m(#+1:).concat#c():LIST))).conj()

Second largest value in a group

I want to select the second largest value in each group. How can I do that in SQL?
For example with the below table,
IDs value
ID1 2
ID1 3
ID1 4
ID2 1
ID2 2
ID2 5
When grouping by IDs, I want this output
IDs value
ID1 3
ID2 2
Thanks.
Use row_number():
select t.*
from (select t.*, row_number() over (partition by id order by value desc) as seqnum
from t
) t
where seqnum = 2;
An alternate way: you can use dense_rank().
It ensures that your SQL always returns the second largest value even when two records share the largest value.
select t.*
from (select t.*, dense_rank() over (partition by id order by value desc) as rrank
from t
) t
where rrank = 2;
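With the question's sample data, both variants give the same answer; here is a quick check of the dense_rank() version through SQLite via Python's sqlite3 (SQLite 3.25+ for window functions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (id TEXT, value INTEGER);
    INSERT INTO t VALUES
        ('ID1',2),('ID1',3),('ID1',4),
        ('ID2',1),('ID2',2),('ID2',5);
""")

rows = con.execute("""
    SELECT id, value
    FROM (
        -- rank values per id, largest first; ties share a rank
        SELECT id, value,
               DENSE_RANK() OVER (PARTITION BY id ORDER BY value DESC) AS rnk
        FROM t
    )
    WHERE rnk = 2
    ORDER BY id
""").fetchall()
print(rows)  # [('ID1', 3), ('ID2', 2)]
```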

Oracle: getting the latest value from the other table

I have two tables: table1 contains old values and table2 contains the latest values. I want to show the latest value in table1, but I do not have anything that tells me which is the latest value in table2.
for example
Table1
CID-----PID-----RID
CT1-----C-------R1
CT2-----C-------R2
CT3-----C-------R3
CT4-----C-------R4
Table2
CID-----PID----RID
CT1-----A-------R1
CT1-----C-------R11
CT2-----C-------R2
CT3-----A-------R3
CT4-----A-------R4
The condition is that I have to give priority to value C when both values (A and C) exist. Its RID also changes, so I need to pick that up in the output table too. For the same CID with a single value, I will simply replace it in table1 from table2. So the output will look like this:
Table3
CID-----PID----RID
CT1-----C-------R11
CT2-----C-------R2
CT3-----A-------R3
CT4-----A-------R4
I may be missing something, but isn't this simply:
select cid, max(pid)
from table2
group by cid;
If you want whole records, use a ranking with ROW_NUMBER instead:
select cid, pid, rid
from
(
select cid, pid, rid, row_number() over (partition by cid order by pid desc) as rn
from table2
)
where rn = 1;
You can also use case expressions for ranking, e.g.:
(partition by cid order by case pid when 'C' then 1 when 'A' then 2 else 3 end) as rn
UPDATE: Now that you've finally explained what you are after ...
You want more or less the second query I gave you above, except that you want data from both tables, which you can get with UNION ALL. You can easily give each row a rank along the way:
table2 PID C => rank #1
table2 PID A => rank #2
table1 => rank #3
Then again take the row with the best rank:
select cid, pid, rid
from
(
select cid, pid, rid, row_number() over (partition by cid order by rnk) as rn
from
(
select cid, pid, rid, case when pid = 'C' then 1 else 2 end as rnk from table2
union all
select cid, pid, rid, 3 as rnk from table1
)
)
where rn = 1;
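The UNION ALL ranking is easy to verify outside Oracle too. The following sketch runs the same query shape through SQLite via Python's sqlite3 (SQLite 3.25+ for window functions), using the sample tables from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (cid TEXT, pid TEXT, rid TEXT);
    CREATE TABLE table2 (cid TEXT, pid TEXT, rid TEXT);
    INSERT INTO table1 VALUES
        ('CT1','C','R1'),('CT2','C','R2'),('CT3','C','R3'),('CT4','C','R4');
    INSERT INTO table2 VALUES
        ('CT1','A','R1'),('CT1','C','R11'),('CT2','C','R2'),
        ('CT3','A','R3'),('CT4','A','R4');
""")

rows = con.execute("""
    SELECT cid, pid, rid
    FROM (
        SELECT cid, pid, rid,
               ROW_NUMBER() OVER (PARTITION BY cid ORDER BY rnk) AS rn
        FROM (
            -- table2 'C' beats table2 non-'C', which beats table1
            SELECT cid, pid, rid,
                   CASE WHEN pid = 'C' THEN 1 ELSE 2 END AS rnk
            FROM table2
            UNION ALL
            SELECT cid, pid, rid, 3 FROM table1
        )
    )
    WHERE rn = 1
    ORDER BY cid
""").fetchall()
print(rows)
# [('CT1', 'C', 'R11'), ('CT2', 'C', 'R2'), ('CT3', 'A', 'R3'), ('CT4', 'A', 'R4')]
```

This matches the Table3 output in the question.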

How to label groups in PostgreSQL when group membership depends on the preceding line?

I want, in a query, to fill all NULL values with the last known value.
When it's done in a table and not in a query, it's easy:
If I define and fill my table as follows:
CREATE TABLE test_fill_null (
date INTEGER,
value INTEGER
);
INSERT INTO test_fill_null VALUES
(1,2),
(2, NULL),
(3, 45),
(4,NULL),
(5, null);
SELECT * FROM test_fill_null ;
date | value
------+-------
1 | 2
2 |
3 | 45
4 |
5 |
Then I just have to fill like that:
UPDATE test_fill_null t1
SET value = (
SELECT t2.value
FROM test_fill_null t2
WHERE t2.date <= t1.date AND value IS NOT NULL
ORDER BY t2.date DESC
LIMIT 1
);
SELECT * FROM test_fill_null;
date | value
------+-------
1 | 2
2 | 2
3 | 45
4 | 45
5 | 45
But now I'm in a query, like this one:
WITH
pre_table AS(
SELECT
id1,
id2,
tms,
CASE
WHEN tms - lag(tms) over w < interval '5 minutes' THEN NULL
ELSE id2
END as group_id
FROM
table0
window w as (partition by id1 order by tms)
)
Here group_id is set to id2 when the previous point is more than 5 minutes away, and NULL otherwise. By doing so, I want to end up with groups of points that follow each other by less than 5 minutes, separated by gaps of more than 5 minutes.
Then I don't know how to proceed. I tried:
SELECT distinct on (id1, id2)
t0.id1,
t0.id2,
t0.tms,
t1.group_id
FROM
pre_table t0
LEFT JOIN (
select
id1,
tms,
group_id
from pre_table t2
where t2.group_id is not null
order by tms desc
) t1
ON
t1.tms <= t0.tms AND
t1.id1 = t0.id1
WHERE
t0.id1 IS NOT NULL
ORDER BY
id1,
id2,
t1.tms DESC
But in the final result I have some groups with two consecutive points that are more than 5 minutes apart. There should be two different groups in that case.
A "select within a select" is more commonly called a "subselect" or "subquery". In your particular case, it is a correlated subquery. LATERAL joins (new in Postgres 9.3) can largely replace correlated subqueries with more flexible solutions:
What is the difference between LATERAL and a subquery in PostgreSQL?
I don't think you need either here.
For your first case this query is probably faster and simpler, though:
SELECT date, max(value) OVER (PARTITION BY grp) AS value
FROM (
SELECT *, count(value) OVER (ORDER BY date) AS grp
FROM test_fill_null
) sub;
count() only counts non-null values, so grp is incremented with every non-null value, thereby forming groups as desired. It's trivial to pick the one non-null value per grp in the outer SELECT.
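The count-of-non-nulls trick is easy to verify: the query runs unchanged in SQLite (3.25+ for window functions), as this sketch via Python's sqlite3 shows with the question's sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE test_fill_null (date INTEGER, value INTEGER);
    INSERT INTO test_fill_null VALUES
        (1,2),(2,NULL),(3,45),(4,NULL),(5,NULL);
""")

rows = con.execute("""
    SELECT date, MAX(value) OVER (PARTITION BY grp) AS value
    FROM (
        -- count(value) ignores NULLs, so grp only advances on non-NULL rows
        SELECT *, COUNT(value) OVER (ORDER BY date) AS grp
        FROM test_fill_null
    ) sub
    ORDER BY date
""").fetchall()
print(rows)  # [(1, 2), (2, 2), (3, 45), (4, 45), (5, 45)]
```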
For your second case, I'll assume the initial order of rows is determined by (id1, id2, tms) as indicated by one of your queries.
SELECT id1, id2, tms
, count(step) OVER (ORDER BY id1, id2, tms) AS group_id
FROM (
SELECT *, CASE WHEN lag(tms, 1, '-infinity') OVER (PARTITION BY id1 ORDER BY id2, tms)
< tms - interval '5 min'
THEN true END AS step
FROM table0
) sub
ORDER BY id1, id2, tms;
Adapt to your actual order. One of these might cover it:
PARTITION BY id1 ORDER BY id2 -- ignore tms
PARTITION BY id1 ORDER BY tms -- ignore id2
SQL Fiddle with an extended example.
Related:
Select longest continuous sequence
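The gap-based group_id from the second query can be checked the same way. SQLite has no interval type, so this sketch (via Python's sqlite3, 3.25+ for window functions) stores tms as epoch seconds and uses 300 seconds for the 5-minute threshold; the schema and sample rows are assumptions for illustration, since table0's real layout isn't shown:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table0 (id1 TEXT, tms INTEGER);  -- tms as epoch seconds
    INSERT INTO table0 VALUES
        ('a',0),('a',60),('a',120),   -- one cluster
        ('a',1000),('a',1060);        -- gap > 300 s starts a new group
""")

rows = con.execute("""
    SELECT id1, tms, COUNT(step) OVER (ORDER BY id1, tms) AS group_id
    FROM (
        -- step is non-NULL only on rows that start a new cluster;
        -- the large negative LAG default makes the first row a cluster start
        SELECT *, CASE WHEN tms - LAG(tms, 1, -100000) OVER
                            (PARTITION BY id1 ORDER BY tms) > 300
                  THEN 1 END AS step
        FROM table0
    ) sub
    ORDER BY id1, tms
""").fetchall()
print(rows)
# [('a', 0, 1), ('a', 60, 1), ('a', 120, 1), ('a', 1000, 2), ('a', 1060, 2)]
```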
While editing my question I found a solution. It's pretty slow though, much slower than my example within a table. Any suggestions to improve it?
SELECT
t2.id1,
t2.id2,
t2.tms,
(
SELECT t1.group_id
FROM pre_table t1
WHERE
t1.tms <= t2.tms
AND t1.group_id IS NOT NULL
AND t1.id1 = t2.id1
ORDER BY t1.tms DESC
LIMIT 1
) as group_id
FROM
pre_table t2
ORDER BY
t2.id1,
t2.id2,
t2.tms
So, as I said, a select within a select.

find row number by group in SQL server table with duplicated rows

I need to count the row number by group in a table with some duplications.
Table:
id value1 value2
1 3974 39
1 3974 39
1 972 5
1 972 10
SQL:
select id, value1, value2, COUNT(*) cnt
FROM table
group by id, value1, value2
having COUNT(*) > 1
This code only counts the duplicated rows.
I need:
id, value1, value2
1 972 5
1 972 10
I do not need to count the duplicated rows; I only need the rows where value1 has more than one distinct value in the value2 column.
Thanks
Use DISTINCT:
select id, value1, count(distinct value2) cnt
from table
group by id, value1
having count(distinct value2) > 1
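With the sample rows, that HAVING filter keeps only the (1, 972) group. Here is a runnable check through SQLite via Python's sqlite3 (the table is named vals here because table is a reserved word):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE vals (id INTEGER, value1 INTEGER, value2 INTEGER);
    INSERT INTO vals VALUES (1,3974,39),(1,3974,39),(1,972,5),(1,972,10);
""")

rows = con.execute("""
    -- keep only (id, value1) groups with more than one distinct value2
    SELECT id, value1, COUNT(DISTINCT value2) AS cnt
    FROM vals
    GROUP BY id, value1
    HAVING COUNT(DISTINCT value2) > 1
""").fetchall()
print(rows)  # [(1, 972, 2)]
```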
If you want details, then:
select * from table t1
cross apply(select cnt from(
select count(distinct value2) cnt
from table t2
where t1.id = t2.id and t1.value1 = t2.value1) t
where cnt > 1)ca
In SQL Server 2008, you can use a trick to count distinct values using window functions. You might find this a nice solution:
select t.id, t.value1, t.value2
from (select t.*, sum(case when seqnum = 1 then 1 else 0 end) over (partition by id, value1) as numvals
from (select t.*, row_number() over (partition by id, value1, value2 order by (select null)) as seqnum
from table t
) t
) t
where numvals > 1;
Try it this way without a GROUP BY:
select id, value1, value2
FROM table AS T1
where 1 < (
select COUNT(DISTINCT T2.value2)
FROM table AS T2
where T1.id = T2.id and T1.value1 = T2.value1)
Try this
;WITH CTE
AS ( SELECT id ,
value1 ,
value2 ,
COUNT(*) cnt
FROM table
GROUP BY id ,
value1 ,
value2
HAVING COUNT(*) > 1
)
SELECT *
FROM table
WHERE value1 IN ( SELECT value1
FROM CTE )
Simply use a NOT after HAVING, which precisely gets you the rows which are NOT duplicated.
select id, value1, value2
FROM [table]
group by id, value1, value2
having NOT COUNT(*) > 1
Fiddle here.
If you want the actual rows from the table, not just the qualifying id, value1 pairs, you could do this:
WITH discrepancies AS (
SELECT
id,
value1,
value2,
distinctcount = COUNT(DISTINCT value2) OVER (PARTITION BY id, value1)
FROM
dbo.atable
)
SELECT
id,
value1,
value2
FROM
discrepancies
WHERE
distinctcount > 1
;
if SQL Server 2008 supported COUNT(DISTINCT ...) with an OVER clause.
Basically, it would be more or less the same idea as Giorgi Nakeuri's, except you would not be hitting the table more than once.
Alas, there is no support for COUNT(DISTINCT ...) OVER ... in SQL Server so far. Still, you can use a different method, which will still allow you to touch the table just once and return detail rows nevertheless:
WITH discrepancies AS (
SELECT
id,
value1,
value2,
minvalue2 = MIN(value2) OVER (PARTITION BY id, value1),
maxvalue2 = MAX(value2) OVER (PARTITION BY id, value1)
FROM
dbo.atable
)
SELECT
id,
value1,
value2
FROM
discrepancies
WHERE
minvalue2 <> maxvalue2
;
The idea here is to get MIN(value2) and MAX(value2) per each id, value1 and to see if those differ. If they do, that means you have a discrepancy in this id, value1 subset and you want that row to be returned.
The method takes advantage of aggregates with an OVER clause to avoid a self-join, and that is precisely the reason why the table is accessed just once here.
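The MIN/MAX version is portable, since plain MIN/MAX with OVER is widely supported. Here is the same idea against the sample data, sketched through SQLite via Python's sqlite3 (3.25+ for window functions; the table is renamed vals because table is reserved):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE vals (id INTEGER, value1 INTEGER, value2 INTEGER);
    INSERT INTO vals VALUES (1,3974,39),(1,3974,39),(1,972,5),(1,972,10);
""")

rows = con.execute("""
    SELECT id, value1, value2
    FROM (
        -- per (id, value1) group: if min and max of value2 differ,
        -- the group has at least two distinct value2s
        SELECT *,
               MIN(value2) OVER (PARTITION BY id, value1) AS minvalue2,
               MAX(value2) OVER (PARTITION BY id, value1) AS maxvalue2
        FROM vals
    )
    WHERE minvalue2 <> maxvalue2
    ORDER BY value2
""").fetchall()
print(rows)  # [(1, 972, 5), (1, 972, 10)]
```

This returns the detail rows the question asked for, touching the table only once.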