SQL query to find all combinations of grouped values - sql

I am looking for a SQL query or a series of SQL queries.
Schema
I have a logging table with three columns: id, event_type, and timestamp
The IDs are arbitrary text, generated randomly at runtime and unknown to me
The event types are numbers from a finite collection of known event types
The timestamps are your typical int64 epoch timestamp
A single ID value may have one or more rows, each with some value for event_type, representing a flow of events associated with the same ID
For each ID, its collection of rows can be sorted by increasing timestamp
Most times, there will be only one occurrence of an ID + event type combination, but rarely, there could be two; not sure this matters
Goal
What I want to do is to query the number of distinct combinations of event types (sorted by timestamp). For example, provided this table:
id event_type timestamp
-----------------------------------------
foo event_1 101
foo event_2 102
bar event_2 102
bar event_1 101
foo event_3 103
bar event_3 103
blah event_1 101
bleh event_2 102
backwards event_1 103
backwards event_2 102
backwards event_3 101
Then I should get the following result:
combination count
-------------------------------
[event_1,event_2,event_3] 2 // foo and bar
[event_3,event_2,event_1] 1 // backwards
[event_1] 1 // blah
[event_2] 1 // bleh

You can do 2 levels of grouping to your data.
For MySQL use group_concat():
select t.combination, count(*) count
from (
select
group_concat(event_type order by timestamp) combination
from tablename
group by id
) t
group by t.combination
order by count desc
For PostgreSQL use array_agg() with array_to_string():
select t.combination, count(*) count
from (
select
array_to_string(array_agg(event_type order by timestamp), ',') combination
from tablename
group by id
) t
group by t.combination
order by count desc
For Oracle there is listagg():
select t.combination, count(*) count
from (
select
listagg(event_type, ',') within group (order by timestamp) combination
from tablename
group by id
) t
group by t.combination
order by count desc
For SQL Server 2017+ there is string_agg():
select t.combination, count(*) count
from (
select
string_agg(event_type, ',') within group (order by timestamp) combination
from tablename
group by id
) t
group by t.combination
order by count desc
Results:
| combination | count |
| ----------------------- | ----- |
| event_1,event_2,event_3 | 2 |
| event_3,event_2,event_1 | 1 |
| event_1 | 1 |
| event_2 | 1 |
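The two-level grouping can also be cross-checked outside any single SQL dialect. The sketch below is only an illustration, not part of the answers: it assumes Python's sqlite3 module with an in-memory database (the question does not name a database), loads the sample rows from the question, reads them ordered by id and timestamp, builds one ordered sequence of event types per id, and then counts identical sequences. The table name tablename is taken from the queries above.

```python
import sqlite3
from collections import Counter
from itertools import groupby

# Sample rows from the question: (id, event_type, timestamp).
rows = [
    ("foo", "event_1", 101), ("foo", "event_2", 102), ("bar", "event_2", 102),
    ("bar", "event_1", 101), ("foo", "event_3", 103), ("bar", "event_3", 103),
    ("blah", "event_1", 101), ("bleh", "event_2", 102),
    ("backwards", "event_1", 103), ("backwards", "event_2", 102),
    ("backwards", "event_3", 101),
]

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
conn.execute("CREATE TABLE tablename (id TEXT, event_type TEXT, timestamp INTEGER)")
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?)", rows)

# Read the rows so that each id's events arrive in timestamp order.
ordered = conn.execute(
    "SELECT id, event_type FROM tablename ORDER BY id, timestamp"
).fetchall()

# Level 1: one ordered tuple of event types per id.
sequences = {
    key: tuple(event for _, event in group)
    for key, group in groupby(ordered, key=lambda row: row[0])
}

# Level 2: count how many ids share each ordered tuple.
combination_counts = Counter(sequences.values())

for combination, count in combination_counts.most_common():
    print(",".join(combination), count)
```

Sorting in the SELECT and grouping in Python avoids relying on dialect-specific aggregate syntax such as group_concat(... order by ...).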

SELECT
"combi"."combination",
COUNT(*) AS "count"
FROM
(
SELECT
GROUP_CONCAT("event_type" SEPARATOR ',') AS "combination"
FROM
?table?
GROUP BY
"id"
) AS "combi"
GROUP BY
"combi"."combination"
Note: GROUP_CONCAT(... SEPARATOR ...) syntax is not SQL standard, it's DB specific (in this case MySQL, other dbs have other aggregate functions). You might need to adjust for your DB of choice or specify in tags which DB you are actually using.
As for "sorted by timestamp", you need to define what this actually means. What is "sorted by timestamp" for a group of groups?

Related

Select arbitrary row for each group in Postgres

In Presto, there's an arbitrary() aggregate function to select any arbitrary row in a given group. If there's no group by clause, then I can use distinct on. With group by, every selected column must be in the group by or be an aggregated column. E.g.:
| id | foo |
| 1 | 123 |
| 1 | 321 |
select id, arbitrary(foo), count(*)
from mytable
group by id
It doesn't matter if it returns 1, 123, 2 or 1, 321, 2. Something like min() or max() works, but it's a lot slower.
Does something like arbitrary() exist in Postgres?
select m.foo,b.id,b.cnt from mytable m
join (select id, count(*) cnt
from mytable
group by id) b using (id) limit 1;
Without an explicit ORDER BY (asc or desc), row order is not guaranteed. Therefore the foo value that appears in the query above is arbitrary.

How to select the last record of each ID

I need to extract the last records of each user from the table. The table schema is like below.
mytable
product | user_id |
-------------------
A | 15 |
B | 15 |
A | 16 |
C | 16 |
-------------------
The output I want to get is
product | user_id |
-------------------
B | 15 |
C | 16 |
Basically the last records of each user.
Thanks in advance!
You can use the ROW_NUMBER window function. Here is a solution:
WITH CTE AS
(SELECT product, user_id,
ROW_NUMBER() OVER(PARTITION BY user_id order by product desc)
as RN
FROM Mytable)
SELECT product, user_id FROM CTE WHERE RN=1 ;
You can try using row_number()
select product,userid
from
(
select product, userid,row_number() over(partition by userid order by product desc) as rn
from tablename
)A where rn=1
There is no such thing as a "last" record unless you have a column that specifies the ordering. SQL tables represent unordered sets (well technically, multisets).
If you have such a column, then use distinct on:
select distinct on (user_id) t.*
from t
order by user_id, <ordering col> desc;
Distinct on is a very handy Postgres extension that returns one row per "group". It is the first row based on the ordering specified in the order by clause.
You should have a column that stores the insertion order. Whether through auto increment or a value with date and time.
Ex:
autoIncrement | product | user_id |
-----------------------------------
1             | A       | 15      |
2             | B       | 15      |
3             | A       | 16      |
4             | C       | 16      |
SELECT product, user_id FROM table inner join
( SELECT MAX(autoIncrement) as id FROM table group by user_id ) as table_Aux
ON table.autoIncrement = table_Aux.id

how to get all the contiguous interval of all IDs in a relation

So my relation is simple: relation (ID, Date), where ID is not unique and the rows are not necessarily in any order. Each row has a date (the same ID can appear with the same date more than once). My problem is to find, for every ID, the longest interval between a date and its NEXT date.
So if the table is like this:
ID | Date
--------+------------
100 | 2015-06-20
100 | 2015-01-21
100 | 2016-04-23
the expected output will be
ID | interval
--------+------------
100 | (2016-04-23 - 2015-06-20)
or if all date the ID has are the same:
ID | Date
--------+------------
100 | 2016-04-23
100 | 2016-04-23
100 | 2016-04-23
the expected output should be
ID | interval
--------+------------
100 | 0
This is for a single ID; my relation contains 100 IDs together.
I think this query will be useful for you:
select t.id,
case
when t.lower != t.upper then '(' || t.lower || ' - ' || t.upper || ')'
else '0' end
from (select
r.id,
min(r.date) as lower,
max(r.date) as upper
from relation r
group by r.id) t;
We use a subquery to find the lower and upper boundaries for each ID. Then we compare them: if they differ, we build the formatted string; otherwise we output zero.
I hope this is what you are looking for
WITH x AS
(
SELECT id, _date, lead_date, EXTRACT(epoch FROM age(lead_date,_date))/(3600*24) AS age
FROM
(
SELECT *, lead(_date) over(PARTITION BY id ORDER BY _date ) lead_date
from table_log
order by id, _date
) as z
WHERE lead_date IS NOT NULL
ORDER BY 4 DESC
)
SELECT DISTINCT id ,
(SELECT age FROM x WHERE x.id = t1.id ORDER BY age DESC LIMIT 1)
FROM table_log t1
Here I have used a window function to get the next date and determine the duration between two entries. With a Postgres WITH query (common table expression) you can reuse the original query that contains the window function.
I have used DISTINCT from the log table, but you can also directly use the table where you store the IDs.
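The idea of pairing each date with its next date (what lead() does above) can be traced with a small sketch. It assumes Python's sqlite3 module with an in-memory database and ISO-formatted date strings. ID 200 is a hypothetical second ID standing in for the question's all-same-dates example, and longest_gap is an illustrative helper, not something from the answers.

```python
import sqlite3
from datetime import date
from itertools import groupby

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
conn.execute("CREATE TABLE relation (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO relation VALUES (?, ?)",
    [
        (100, "2015-06-20"), (100, "2015-01-21"), (100, "2016-04-23"),
        # Hypothetical ID for the "all dates equal" case in the question.
        (200, "2016-04-23"), (200, "2016-04-23"), (200, "2016-04-23"),
    ],
)

# ISO date strings sort chronologically, so ORDER BY on text is enough here.
ordered = conn.execute("SELECT id, date FROM relation ORDER BY id, date").fetchall()


def longest_gap(dates):
    """Return (days, start, end) for the widest gap between consecutive dates."""
    pairs = zip(dates, dates[1:])  # each date paired with its next date
    return max(((end - start).days, start, end) for start, end in pairs)


longest = {}
for key, group in groupby(ordered, key=lambda row: row[0]):
    dates = [date.fromisoformat(text) for _, text in group]
    if len(dates) > 1:  # an ID with a single date has no interval
        longest[key] = longest_gap(dates)

for key, (days, start, end) in longest.items():
    print(key, days, start, end)
```

For ID 100 the widest gap runs from 2015-06-20 to 2016-04-23, as in the expected output of the question, and the all-equal ID yields 0 days.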

How to select most frequent value in a column per each id group?

I have a table in SQL that looks like this:
user_id | data1
0 | 6
0 | 6
0 | 6
0 | 1
0 | 1
0 | 2
1 | 5
1 | 5
1 | 3
1 | 3
1 | 3
1 | 7
I want to write a query that returns two columns: a column for the user id, and a column for what the most frequently occurring value per id is. In my example, for user_id 0, the most frequent value is 6, and for user_id 1, the most frequent value is 3. I would want it to look like below:
user_id | most_frequent_value
0 | 6
1 | 3
I am using the query below to get the most frequent value, but it runs against the whole table and returns the most common value for the whole table instead of for each id. What would I need to add to my query to get it to return the most frequent value for each id? I am thinking I need to use a subquery, but am unsure of how to structure it.
SELECT user_id, data1 AS most_frequent_value
FROM my_table
GROUP BY user_id, data1
ORDER BY COUNT(*) DESC LIMIT 1
You can use a window function to rank each user's data1 values by how often they occur.
WITH cte AS (
SELECT
user_id
, data1
, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY COUNT(data1) DESC) rn
FROM dbo.YourTable
GROUP BY
user_id,
data1)
SELECT
user_id,
data1
FROM cte WHERE rn = 1
With a suitable ORDER BY, DISTINCT ON (user_id) does the same job, because it keeps the first row of each user_id group. DISTINCT ON is a PostgreSQL-specific feature.
select distinct on (user_id) user_id, most_frequent_value from (
SELECT user_id, data1 AS most_frequent_value, count(*) as _count
FROM my_table
GROUP BY user_id, data1) a
ORDER BY user_id, _count DESC
With PostgreSQL 9.4 or later you can use the mode() ordered-set aggregate:
SELECT
user_id, MODE() WITHIN GROUP (ORDER BY value)
FROM
(VALUES (0,6), (0,6), (0, 6), (0,1),(0,1), (1,5), (1,5), (1,3), (1,3), (1,7))
users (user_id, value)
GROUP BY user_id
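The "count per (user_id, data1), then keep the top row per user" pattern shared by these answers can be sketched as follows. This is an illustration under assumptions: it uses Python's sqlite3 module with an in-memory database, and because SQLite has no DISTINCT ON, the "first row per user" step is done in Python. The extra data1 sort key only makes ties deterministic and is my own addition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
conn.execute("CREATE TABLE my_table (user_id INTEGER, data1 INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [
        (0, 6), (0, 6), (0, 6), (0, 1), (0, 1), (0, 2),
        (1, 5), (1, 5), (1, 3), (1, 3), (1, 3), (1, 7),
    ],
)

# Count each (user_id, data1) pair and sort so the most frequent value
# comes first within every user.
counted = conn.execute(
    """
    SELECT user_id, data1, COUNT(*) AS n
    FROM my_table
    GROUP BY user_id, data1
    ORDER BY user_id, n DESC, data1
    """
).fetchall()

# Keep only the first row seen for each user, which is what
# DISTINCT ON (user_id) would do in PostgreSQL.
most_frequent = {}
for user_id, data1, _ in counted:
    most_frequent.setdefault(user_id, data1)

print(most_frequent)
```

On the question's sample data this maps user 0 to 6 and user 1 to 3.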

Compare different orders of the same table

I have this following scenario, a table with these columns:
table_id|user_id|os_number|inclusion_date
In the system, the os_number is sequential for the users, but due to a system bug some users inserted OSs in wrong order. Something like this:
table_id | user_id | os_number | inclusion_date
-----------------------------------------------
1 | 1 | 1 | 2015-11-01
2 | 1 | 2 | 2015-11-02
3 | 1 | 3 | 2015-11-01
Note the os number 3 inserted before the os number 2
What I need:
Recover the table_id of rows 2 and 3, which are out of order.
I have these two select that show me the table_id in two different orders:
select table_id from table order by user_id, os_number
select table_id from table order by user_id, inclusion_date
I can't figure out how to compare these two selects and see which users are affected by this system bug.
Your question is a bit difficult because there is no correct ordering (as presented) -- because dates can have ties. So, use the rank() or dense_rank() function to compare the two values and return the ones that are not in the correct order:
select t.*
from (select t.*,
rank() over (partition by user_id order by inclusion_date) as seqnum_d,
rank() over (partition by user_id order by os_number) as seqnum_o
from t
) t
where seqnum_d <> seqnum_o;
Use row_number() over both orders:
select *
from (
select *,
row_number() over (order by os_number) rnn,
row_number() over (order by inclusion_date) rnd
from a_table
) s
where rnn <> rnd;
table_id | user_id | os_number | inclusion_date | rnn | rnd
----------+---------+-----------+----------------+-----+-----
3 | 1 | 3 | 2015-11-01 | 3 | 2
2 | 1 | 2 | 2015-11-02 | 2 | 3
(2 rows)
Not entirely sure about the performance on this but you could use a cross apply on the same table to get the results in one query. This will bring up the pairs of table_ids which are incorrect.
select
a.table_id as InsertedAfterTableId,
c.table_id as InsertedBeforeTableId
from table a
cross apply
(
select b.table_id
from table b
where b.inclusion_date < a.inclusion_date and b.os_number > a.os_number
) c
Both query examples given below simply check a mismatch between inclusion date and os_number:
This first query should return the offending row (the one whose os_number is off from its inclusion date)--in the case of the example row 3.
select table.table_id, table.user_id, table.os_number from table
where EXISTS(select * from table t
where t.user_id = table.user_id and
t.inclusion_date > table.inclusion_date and
t.os_number < table.os_number);
This second query will return the table numbers and users for two rows that are mismatched:
select first_table.table_id, second_table.table_id, first_table.user_id from
table first_table
JOIN table second_table
ON (first_table.user_id = second_table.user_id and
first_table.inclusion_date > second_table.inclusion_date and
first_table.os_number < second_table.os_number);
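To see what the self-join returns on the question's three rows, here is a minimal sketch. It assumes Python's sqlite3 module with an in-memory database. The table name os_table is hypothetical (the original uses the placeholder name table, which is a reserved word), and the aliases f and s stand for first_table and second_table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
conn.execute(
    "CREATE TABLE os_table ("
    "table_id INTEGER, user_id INTEGER, os_number INTEGER, inclusion_date TEXT)"
)
conn.executemany(
    "INSERT INTO os_table VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "2015-11-01"),
        (2, 1, 2, "2015-11-02"),
        (3, 1, 3, "2015-11-01"),
    ],
)

# Pair up rows of the same user where the later-dated row carries the
# smaller os_number, i.e. date order and number order disagree.
mismatched_pairs = conn.execute(
    """
    SELECT f.table_id, s.table_id, f.user_id
    FROM os_table AS f
    JOIN os_table AS s
      ON f.user_id = s.user_id
     AND f.inclusion_date > s.inclusion_date
     AND f.os_number < s.os_number
    """
).fetchall()

print(mismatched_pairs)
```

Only one pair qualifies here: table_id 2 (dated later, lower os_number) with table_id 3 (dated earlier, higher os_number), both for user 1.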
I would use WINDOW FUNCTIONS to get row numbers in orders in question and then compare them:
SELECT
sub.table_id,
sub.user_id,
sub.os_number,
sub.inclusion_date,
number_order_1, number_order_2
FROM (
SELECT
table_id,
user_id,
os_number,
inclusion_date,
row_number() OVER (PARTITION BY user_id
ORDER BY os_number
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
) AS number_order_1,
row_number() OVER (PARTITION BY user_id
ORDER BY inclusion_date
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
) AS number_order_2
FROM
table
) sub
WHERE
number_order_1 <> number_order_2
;
EDIT:
Because a_horse_with_no_name made a good point about my final answer, I've gone back to my first answer (see the edit history), which also works if os_number isn't gapless.
select *
from (
select a_table.*,
lag(inclusion_date) over (partition by user_id order by os_number) as last_date
from a_table
) result
where last_date is not null AND last_date>inclusion_date;
This should cover gaps as well as ties. Basically, I simply check the inclusion_date of the previous os_number and make sure it's not strictly greater than the current date (so two versions on the same date are fine).
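The lag() check can be traced step by step with a small sketch. It assumes Python's sqlite3 module with an in-memory database and a hypothetical table name os_table. The previous row's date is tracked in Python instead of calling lag(), so the example does not depend on window-function support in the SQLite build.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
conn.execute(
    "CREATE TABLE os_table ("
    "table_id INTEGER, user_id INTEGER, os_number INTEGER, inclusion_date TEXT)"
)
conn.executemany(
    "INSERT INTO os_table VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "2015-11-01"),
        (2, 1, 2, "2015-11-02"),
        (3, 1, 3, "2015-11-01"),
    ],
)

# Walk each user's rows in os_number order, as the window definition does.
ordered = conn.execute(
    "SELECT user_id, table_id, inclusion_date FROM os_table ORDER BY user_id, os_number"
).fetchall()

out_of_order = []
for _, group in groupby(ordered, key=lambda row: row[0]):
    last_date = None  # plays the role of lag(inclusion_date)
    for _, table_id, inclusion_date in group:
        # ISO date strings compare chronologically; equal dates are allowed.
        if last_date is not None and last_date > inclusion_date:
            out_of_order.append(table_id)
        last_date = inclusion_date

print(out_of_order)
```

On the sample rows only table_id 3 is flagged, since its date precedes the date of os_number 2.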