Select arbitrary row for each group in Postgres - sql

In Presto, there's an arbitrary() aggregate function to select any arbitrary row in a given group. If there's no group by clause, then I can use distinct on. With group by, every selected column must be in the group by or be an aggregated column. E.g.:
| id | foo |
| 1 | 123 |
| 1 | 321 |
select id, arbitrary(foo), count(*)
from mytable
group by id
It doesn't matter if it returns 1, 123, 2 or 1, 321, 2. Something like min() or max() works, but it's a lot slower.
Does something like arbitrary() exist in Postgres?

select distinct on (id) m.foo, id, b.cnt from mytable m
join (select id, count(*) cnt
from mytable
group by id) b using (id);
Without an explicit ORDER BY, row order is not guaranteed, so the foo that DISTINCT ON keeps for each id in the above query is arbitrary.


Finding SQL duplicates - two methods different results

I have a table in which duplicates may appear. Rows are considered duplicates when:
sector_id, departament_id, numer_id are the same (these are foreign keys to other tables, in case that matters)
and valid_to is null
I did this with two queries:
1.
select count(*) from(
select sector_id, departament_id,numer_id, count(*) from tables.workspace
where valid_to is null
group by 1,2,3
having count(*) >1 ) as r
--results : 650
2.
with duplicate_rows as
(
select *, count(id) over (partition by sector_id, departament_id, numer_id) duplicate_count from tables.workspace where valid_to is null
)
select count(*) from
(
select * from duplicate_rows where duplicate_count >1
) as t
--results : 3655
Please explain what I'm doing wrong, why these two queries return different values, and which of them is correct.
Your second query is the wrong one.
You're using a window function and selecting everything in your CTE, which means that every record will have the total COUNT for each combination of your partition by fields.
For example, if there are 3 records with sector_id = 'A', departament_id = 'RED', numer_id = 1, your CTE will look like this:
sector_id | departament_id | numer_id | duplicate_count
------------+----------------+----------+-----------------
A | RED | 1 | 3
A | RED | 1 | 3
A | RED | 1 | 3
Which means that your second query will return 3 instead of 1.
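Try selecting only the partition columns with DISTINCT in the query that reads from the CTE, and it should give you the same result as your first query. A plain DISTINCT * would not collapse the rows, because the CTE also selects id, which presumably differs on every row.
select distinct sector_id, departament_id, numer_id from duplicate_rows where duplicate_count >1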
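To see the difference between the two counts on a small case, here is a minimal sketch that runs both queries through Python's sqlite3 module. The in-memory workspace table and its six rows are made up for illustration, and SQLite stands in for the asker's database (window functions need SQLite 3.25 or later).

```python
import sqlite3

# Hypothetical miniature of tables.workspace: one combination with 3 rows,
# one with 2 rows, and one unique row. All have valid_to NULL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workspace (id INTEGER PRIMARY KEY, sector_id TEXT, "
    "departament_id TEXT, numer_id INTEGER, valid_to TEXT)"
)
conn.executemany(
    "INSERT INTO workspace (sector_id, departament_id, numer_id, valid_to) "
    "VALUES (?, ?, ?, NULL)",
    [("A", "RED", 1)] * 3 + [("B", "BLUE", 2)] * 2 + [("C", "GREEN", 3)],
)

# Query 1: GROUP BY yields one row per duplicated combination.
duplicate_groups = conn.execute("""
    SELECT count(*) FROM (
        SELECT sector_id, departament_id, numer_id
        FROM workspace
        WHERE valid_to IS NULL
        GROUP BY sector_id, departament_id, numer_id
        HAVING count(*) > 1
    ) AS r
""").fetchone()[0]

# Query 2: the window function keeps every input row, so every member of a
# duplicated combination is counted.
duplicate_rows = conn.execute("""
    WITH counted AS (
        SELECT *, count(id) OVER (
            PARTITION BY sector_id, departament_id, numer_id
        ) AS duplicate_count
        FROM workspace
        WHERE valid_to IS NULL
    )
    SELECT count(*) FROM counted WHERE duplicate_count > 1
""").fetchone()[0]

print(duplicate_groups, duplicate_rows)  # 2 combinations, 5 rows
```

Both numbers are valid answers to different questions: the first counts duplicated combinations, the second counts the rows that belong to them.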

Is there a way to calculate average based on distinct rows without using a subquery?

If I have data like so:
+----+-------+
| id | value |
+----+-------+
| 1 | 10 |
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 2 | 20 |
+----+-------+
How do I calculate the average based on the distinct id WITHOUT using a subquery (i.e. querying the table directly)?
For the above example it would be (10+20+30)/3 = 20
I tried to do the following:
SELECT AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg
FROM table
Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be NULL and thus it would not be counted into the calculation, but unfortunately I can't put analytical functions inside aggregate functions.
As far as I know, you can't do this without a subquery. I would use:
SELECT AVG(avg_value)
FROM
(
SELECT AVG(value) AS avg_value
FROM yourTable
GROUP BY id
) t;
-- QUALIFY is not standard SQL; it is available in e.g. Teradata and Snowflake
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
FROM yourTable
QUALIFY rn = 1
)
SELECT AVG(value)
FROM ranked
The outer query will have other parameters that need to access all the data in the table
I interpret this comment as wanting an average on every row -- rather than doing an aggregation. If so, you can use window functions:
select t.*,
avg(case when seqnum = 1 then value end) over () as overall_avg
from (select t.*,
row_number() over (partition by id order by id) as seqnum
from t
) t;
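Yes, there is a way: use DISTINCT inside the avg function, as below. Note that this averages distinct values rather than distinct ids, so it matches the expected result only when no two ids share the same value.
select avg(distinct value) from tab;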
http://sqlfiddle.com/#!4/9d156/2/0
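As a quick check of how avg(distinct value) compares with the per-id subquery, here is a small sketch using Python's sqlite3 module. The table name tab follows the answer above; the extra row (4, 10) is a hypothetical addition, not part of the question's data, chosen so that two ids share a value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?)",
    [(1, 10), (1, 10), (2, 20), (3, 30), (2, 20)],  # data from the question
)

# Average over one value per id (the subquery approach).
PER_ID = "SELECT avg(v) FROM (SELECT avg(value) AS v FROM tab GROUP BY id)"
# Average over distinct values, ignoring id entirely.
DISTINCT_VALUE = "SELECT avg(DISTINCT value) FROM tab"

per_id_before = conn.execute(PER_ID).fetchone()[0]
distinct_before = conn.execute(DISTINCT_VALUE).fetchone()[0]

# Hypothetical extra row: a new id that reuses an existing value.
conn.execute("INSERT INTO tab VALUES (4, 10)")

per_id_after = conn.execute(PER_ID).fetchone()[0]
distinct_after = conn.execute(DISTINCT_VALUE).fetchone()[0]

print(per_id_before, distinct_before)  # 20.0 20.0
print(per_id_after, distinct_after)    # 17.5 20.0
```

On the question's data the two agree, but once a second id carries the value 10 the per-id average becomes (10+20+30+10)/4 = 17.5 while avg(distinct value) stays at 20.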

SQL query to find all combinations of grouped values

I am looking for a SQL query or a series of SQL queries.
Schema
I have a logging table with three columns: id, event_type, and timestamp
The IDs are arbitrary text, generated randomly at runtime and unknown to me
The event types are numbers from a finite collection of known event types
The timestamps are your typical int64 epoch timestamp
A single ID value may have 1 or more rows, each with some value for event_type, representing a flow of events associated with the same ID
For each ID, its collection of rows can be sorted by increasing timestamp
Most times, there will be only one occurrence of an ID + event type combination, but rarely, there could be two; not sure this matters
Goal
What I want to do is to query the number of distinct combinations of event types (sorted by timestamp). For example, provided this table:
id event_type timestamp
-----------------------------------------
foo event_1 101
foo event_2 102
bar event_2 102
bar event_1 101
foo event_3 103
bar event_3 103
blah event_1 101
bleh event_2 102
backwards event_1 103
backwards event_2 102
backwards event_3 101
Then I should get the following result:
combination count
-------------------------------
[event_1,event_2,event_3] 2 // foo and bar
[event_3,event_2,event_1] 1 // backwards
[event_1] 1 // blah
[event_2] 1 // bleh
You can do 2 levels of grouping to your data.
For Mysql use group_concat():
select t.combination, count(*) count
from (
select
group_concat(event_type order by timestamp) combination
from tablename
group by id
) t
group by t.combination
order by count desc
For Postgresql use array_agg() with array_to_string():
select t.combination, count(*) count
from (
select
array_to_string(array_agg(event_type order by timestamp), ',') combination
from tablename
group by id
) t
group by t.combination
order by count desc
For Oracle there is listagg():
select t.combination, count(*) count
from (
select
listagg(event_type, ',') within group (order by timestamp) combination
from tablename
group by id
) t
group by t.combination
order by count desc
For SQL Server 2017+ there is string_agg():
select t.combination, count(*) count
from (
select
string_agg(event_type, ',') within group (order by timestamp) combination
from tablename
group by id
) t
group by t.combination
order by count desc
Results:
| combination | count |
| ----------------------- | ----- |
| event_1,event_2,event_3 | 2 |
| event_3,event_2,event_1 | 1 |
| event_1 | 1 |
| event_2 | 1 |
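The two levels of grouping can also be traced outside the database. The sketch below is plain Python rather than SQL, using the sample rows from the question; the variable names are illustrative only. Level one builds the timestamp-ordered event list per id, and level two counts identical lists.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (id, event_type, timestamp).
rows = [
    ("foo", "event_1", 101), ("foo", "event_2", 102),
    ("bar", "event_2", 102), ("bar", "event_1", 101),
    ("foo", "event_3", 103), ("bar", "event_3", 103),
    ("blah", "event_1", 101), ("bleh", "event_2", 102),
    ("backwards", "event_1", 103), ("backwards", "event_2", 102),
    ("backwards", "event_3", 101),
]

# Level 1: per id, collect event types in timestamp order
# (the role of group_concat / array_agg ... order by timestamp).
events_by_id = defaultdict(list)
for id_, event_type, _ts in sorted(rows, key=lambda r: r[2]):
    events_by_id[id_].append(event_type)

# Level 2: count how many ids share each combination
# (the role of the outer GROUP BY combination).
combination_counts = Counter(",".join(e) for e in events_by_id.values())

for combination, count in combination_counts.most_common():
    print(combination, count)
```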
SELECT
`combi`.`combination`,
COUNT(*) AS `count`
FROM
(
SELECT
GROUP_CONCAT(`event_type` SEPARATOR ',') AS `combination`
FROM
?table?
GROUP BY
`id`
) AS `combi`
GROUP BY
`combi`.`combination`
Note: GROUP_CONCAT(... SEPARATOR ...) syntax is not SQL standard, it's DB specific (in this case MySQL, other dbs have other aggregate functions). You might need to adjust for your DB of choice or specify in tags which DB you are actually using.
As for "sorted by timestamp", you need to define what this actually means. What is "sorted by timestamp" for a group of groups?

order by count(catid) without group

I want to count how many rows share the same catid and order the query by that total.
id | catid | name
0 | 1 | foo
1 | 1 | bar
2 | 2 | paint
I've tried COUNT(catid) but this requires a GROUP BY, and I do not want to compress rows.
How may I do this?
Do you want window functions?
select t.*, count(*) over (partition by catid) as cat_cnt
from t
order by cat_cnt, catid;
I should note that if you don't want to see the total, you can put the window function in the order by:
select *
from t
order by count(*) over (partition by catid), catid
Maybe you could run the GROUP BY as a separate SELECT, then JOIN?
E.g.
select orig.*, summ.totals
from t orig
join (select catid, count(catid) totals
from t
group by catid) summ
on orig.catid = summ.catid
order by summ.totals, orig.catid;
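For comparison, here is a small sketch that runs the window-function approach and a join-based approach against the question's three rows, using Python's sqlite3 module (SQLite 3.25 or later is assumed for window functions). The extra id in each ORDER BY is only there to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, catid INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(0, 1, "foo"), (1, 1, "bar"), (2, 2, "paint")],
)

# Window function: every row is kept and annotated with its category count.
windowed = conn.execute("""
    SELECT t.*, count(*) OVER (PARTITION BY catid) AS cat_cnt
    FROM t
    ORDER BY cat_cnt, catid, id
""").fetchall()

# Join: aggregate once per catid, then attach the total to each row.
joined = conn.execute("""
    SELECT orig.*, summ.totals
    FROM t AS orig
    JOIN (SELECT catid, count(*) AS totals FROM t GROUP BY catid) AS summ
      ON orig.catid = summ.catid
    ORDER BY summ.totals, orig.catid, orig.id
""").fetchall()

print(windowed)  # [(2, 2, 'paint', 1), (0, 1, 'foo', 2), (1, 1, 'bar', 2)]
print(joined == windowed)  # True
```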

How to select most frequent value in a column per each id group?

I have a table in SQL that looks like this:
user_id | data1
0 | 6
0 | 6
0 | 6
0 | 1
0 | 1
0 | 2
1 | 5
1 | 5
1 | 3
1 | 3
1 | 3
1 | 7
I want to write a query that returns two columns: a column for the user id, and a column for what the most frequently occurring value per id is. In my example, for user_id 0, the most frequent value is 6, and for user_id 1, the most frequent value is 3. I would want it to look like below:
user_id | most_frequent_value
0 | 6
1 | 3
I am using the query below to get the most frequent value, but it runs against the whole table and returns the most common value for the whole table instead of for each id. What would I need to add to my query to get it to return the most frequent value for each id? I am thinking I need to use a subquery, but am unsure of how to structure it.
SELECT user_id, data1 AS most_frequent_value
FROM my_table
GROUP BY user_id, data1
ORDER BY COUNT(*) DESC LIMIT 1
You can use a window function to rank each user's data1 values by how often they occur.
WITH cte AS (
SELECT
user_id
, data1
, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY COUNT(data1) DESC) rn
FROM dbo.YourTable
GROUP BY
user_id,
data1)
SELECT
user_id,
data1
FROM cte WHERE rn = 1
With a suitable ORDER BY, DISTINCT ON (user_id) does the same job, because it keeps the first row of each user_id group. DISTINCT ON is specific to PostgreSQL.
select distinct on (user_id) user_id, most_frequent_value from (
SELECT user_id, data1 AS most_frequent_value, count(*) as _count
FROM my_table
GROUP BY user_id, data1) a
ORDER BY user_id, _count DESC
With PostgreSQL 9.4 or later, you can use the mode() ordered-set aggregate:
SELECT
user_id, MODE() WITHIN GROUP (ORDER BY value)
FROM
(VALUES (0,6), (0,6), (0,6), (0,1), (0,1), (0,2), (1,5), (1,5), (1,3), (1,3), (1,3), (1,7))
users (user_id, value)
GROUP BY user_id
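The count-then-rank idea can be tried on the question's data with Python's sqlite3 module. This is a sketch, not any of the answers verbatim: it splits the counting and the ranking into two CTEs, and the data1 tie-breaker in the window's ORDER BY is an added assumption so that ties resolve deterministically (SQLite 3.25 or later is assumed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (user_id INTEGER, data1 INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(0, 6), (0, 6), (0, 6), (0, 1), (0, 1), (0, 2),
     (1, 5), (1, 5), (1, 3), (1, 3), (1, 3), (1, 7)],  # data from the question
)

most_frequent = conn.execute("""
    WITH counts AS (
        -- Step 1: how often each value occurs per user.
        SELECT user_id, data1, count(*) AS cnt
        FROM my_table
        GROUP BY user_id, data1
    ),
    ranked AS (
        -- Step 2: rank each user's values, most frequent first.
        SELECT user_id, data1,
               row_number() OVER (
                   PARTITION BY user_id ORDER BY cnt DESC, data1
               ) AS rn
        FROM counts
    )
    SELECT user_id, data1 AS most_frequent_value
    FROM ranked
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()

print(most_frequent)  # [(0, 6), (1, 3)]
```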