Group similar rows and count groups in PostgreSQL

I've got a table like this:
number | info | side
--------------------
1 | foo | a
2 | bar | a
3 | bar | a
4 | baz | a
5 | foo | a
6 | bar | b
7 | bar | b
8 | foo | a
9 | bar | a
10 | baz | a
I'd like to get how many times a bar group/package (e.g. rows 2 and 3 are a group, rows 6 and 7 are a group, row 9 is also a group) appears in the info column, depending on side. I'm stuck because I don't really know what to google. Whenever I search for something like "group rows" or "merge rows" I always end up finding information about the GROUP BY feature.
However I think I need some kind of window function.
Here is what I'd like to achieve:
bar_a | bar_b
-------------
2 | 1

Use lag() to determine first rows of groups:
select
number, info, side,
lag(info || side, 1, '') over (order by number) <> info || side as start_of_group
from my_table
order by 1;
number | info | side | start_of_group
--------+------+------+----------------
1 | foo | a | t
2 | bar | a | t
3 | bar | a | f
4 | baz | a | t
5 | foo | a | t
6 | bar | b | t
7 | bar | b | f
8 | foo | a | t
9 | bar | a | t
10 | baz | a | t
(10 rows)
Aggregate and filter the above result to get the desired output:
select concat(info, '_', side) as info_side, count(*)
from (
select
info, side,
lag(info || side, 1, '') over (order by number) <> info || side as start_of_group
from my_table
) s
where info = 'bar' and start_of_group
group by 1
order by 1;
info_side | count
-----------+-------
bar_a | 2
bar_b | 1
(2 rows)

This is a "gaps-and-islands" problem, at its heart, if I understand correct. For this version, the difference of row numbers should work well.
select sum((side = 'a')::int) as num_a,
       sum((side = 'b')::int) as num_b
from (select info, side, count(*) as cnt
      from (select t.*,
                   row_number() over (order by number) as seqnum,
                   row_number() over (partition by info, side order by number) as seqnum_bs
            from t
           ) t
      where info = 'bar'
      group by info, side, (seqnum - seqnum_bs)
     ) si;

You can make do with a single window function, which should be the fastest option:
SELECT side, count(*) AS count
FROM (
   SELECT side, grp
   FROM (
      SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
      FROM   tbl
      WHERE  info = 'bar'
      ) sub1
   GROUP BY 1, 2
   ) sub2
GROUP BY 1
ORDER BY 1; -- optional
Or shorter, maybe not faster:
SELECT side, count(DISTINCT grp) AS count
FROM (
SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
FROM tbl
WHERE info = 'bar'
) sub
GROUP BY 1
ORDER BY 1; -- optional
The "trick" is that adjacent rows forming a group (grp) have consecutive numbers. When subtracting the running count over the partition on side from the running count over all rows (number), members of a "group" get the same grp number.
If there are gaps in your serial column number, which is not the case in your demo but typically there are gaps (and you actually want to ignore such gaps?!), then use row_number() OVER (ORDER BY number) in a subquery instead of just number to close the gaps first:
SELECT side, count(DISTINCT grp) AS count
FROM (
SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
FROM (SELECT info, side, row_number() OVER (ORDER BY number) AS number FROM tbl) tbl1
WHERE info = 'bar'
) sub2
GROUP BY 1
ORDER BY 1; -- optional
SQL Fiddle (with extended test case)
Related:
Select longest continuous sequence

Related

Find the count of IDs that have the same value

I'd like to get a count of all of the IDs that have the same value (Drops) as other IDs. For instance, the illustration below shows that IDs 1 and 3 have A drops, so the query would count them. Similarly, IDs 7 and 18 have B drops, so that's another two IDs the query would count, totalling 4 IDs that share the same values, which is what my query would return.
+------+-------+
| ID | Drops |
+------+-------+
| 1 | A |
| 2 | C |
| 3 | A |
| 7 | B |
| 18 | B |
+------+-------+
I've tried several approaches, but the following query was my last attempt.
With cte1 (Id1, D1) as
(
select Id, Drops
from Posts
),
cte2 (Id2, D2) as
(
select Id, Drops
from Posts
)
Select count(distinct c1.Id1) newcnt, c1.D1
from cte1 c1
left outer join cte2 c2 on c1.D1 = c2.D2
group by c1.D1
Written out in full, the result would be a single value, but the records that the query should be choosing look as follows:
+------+-------+
| ID | Drops |
+------+-------+
| 1 | A |
| 3 | A |
| 7 | B |
| 18 | B |
+------+-------+
Any advice would be great. Thanks
You can use a CTE to generate a list of Drops values that have more than one corresponding ID value, and then JOIN that to Posts to find all rows which have a Drops value that has more than one Post:
WITH CTE AS (
SELECT Drops
FROM Posts
GROUP BY Drops
HAVING COUNT(*) > 1
)
SELECT P.*
FROM Posts P
JOIN CTE ON P.Drops = CTE.Drops
Output:
ID Drops
1 A
3 A
7 B
18 B
If desired you can then count those posts in total (or grouped by Drops value):
WITH CTE AS (
SELECT Drops
FROM Posts
GROUP BY Drops
HAVING COUNT(*) > 1
)
SELECT COUNT(*) AS newcnt
FROM Posts P
JOIN CTE ON P.Drops = CTE.Drops
Output
newcnt
4
Demo on SQLFiddle
You may use dense_rank() to solve your problem. dense_rank() partitioned by drops assigns a rank to each ID that shares the same drops value, so counting the distinct ranks per drops value gives the number of IDs sharing it.
Here is the demo.
with cte as
(
select
drops,
count(distinct rnk) as newCnt
from
( select
*,
dense_rank() over (partition by drops order by id) as rnk
from myTable
) t
group by
drops
having count(distinct rnk) > 1
)
select
sum(newCnt) as newCnt
from cte
Output:
|newcnt |
|------ |
| 4 |
First get the count of IDs for each drops value, then sum the counts that are greater than 1.
select sum(countdrops) as total from
(select drops , count(id) as countdrops from yourtable group by drops) as temp
where countdrops > 1;

Count rows in partition with Order By

I was trying to understand PARTITION BY in postgres by writing a few sample queries. I have a test table on which I run my query.
id integer | num integer
___________|_____________
1 | 4
2 | 4
3 | 5
4 | 6
When I run the following query, I get the output as I expected.
SELECT id, COUNT(id) OVER(PARTITION BY num) from test;
id | count
___________|_____________
1 | 2
2 | 2
3 | 1
4 | 1
But, when I add ORDER BY to the partition,
SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;
id | count
___________|_____________
1 | 1
2 | 2
3 | 1
4 | 1
My understanding is that COUNT is computed across all rows that fall into a partition. Here, I have partitioned the rows by num. The number of rows in the partition is the same, with or without an ORDER BY clause. Why is there a difference in the outputs?
When you add an ORDER BY to an aggregate used as a window function, that aggregate turns into a "running count" (or a running version of whatever aggregate you use).
The count(*) will return the number of rows up until the "current one" based on the order specified.
The following query shows the different results for aggregates used with an order by. With sum() instead of count() it's a bit easier to see (in my opinion).
with test (id, num, x) as (
values
(1, 4, 1),
(2, 4, 1),
(3, 5, 2),
(4, 6, 2)
)
select id,
num,
x,
count(*) over () as total_rows,
count(*) over (order by id) as rows_upto,
count(*) over (partition by x order by id) as rows_per_x,
sum(num) over (partition by x) as total_for_x,
sum(num) over (order by id) as sum_upto,
sum(num) over (partition by x order by id) as sum_for_x_upto
from test;
will result in:
id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
1 | 4 | 1 | 4 | 1 | 1 | 8 | 4 | 4
2 | 4 | 1 | 4 | 2 | 2 | 8 | 8 | 8
3 | 5 | 2 | 4 | 3 | 1 | 11 | 13 | 5
4 | 6 | 2 | 4 | 4 | 2 | 11 | 19 | 11
There are more examples in the Postgres manual
Your two expressions are:
COUNT(id) OVER (PARTITION BY num)
COUNT(id) OVER (PARTITION BY num ORDER BY id)
Why would you expect these to return the same values? The syntax is different for a reason.
The first returns the overall count for each num -- essentially joining back the aggregated value.
The second does a cumulative count. It computes the COUNT() for each row, over all rows with id values up to that row's id.
Note that such cumulative counts would normally be implemented using RANK() (or related functions).
The cumulative count is subtly different from RANK(). The cumulative count implements:
COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.
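A small sketch with made-up values (not from the question) shows where the two diverge once the ORDER BY key has ties:
SELECT x,
       COUNT(*) OVER (ORDER BY x) AS cumulative_count,  -- default frame: RANGE ... CURRENT ROW, includes all peers
       RANK()   OVER (ORDER BY x) AS rnk
FROM  (VALUES (1), (2), (2), (3)) AS v(x);
-- x | cumulative_count | rnk
-- 1 |        1         |  1
-- 2 |        3         |  2
-- 2 |        3         |  2
-- 3 |        4         |  4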
The "why" has already been explained by others. Sometimes you have an ordered window, and you have to do a count over the whole partition despite having an ORDER BY.
To do so, use an unbounded range with RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
create table search_log
(
    id            bigint       not null primary key,
    query         varchar(255) not null,
    stemmed_query varchar(255) not null,
    created       timestamp    not null
);
SELECT query,
created as seen_on,
first_value(created) OVER query_window as last_seen,
row_number() OVER query_window AS rn,
count(*) OVER query_window AS occurence
FROM search_log l
WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Get row which matched in each group

I am trying to write a SQL query. I got the results below from 2 tables, and those results are good for me. Now I want only the values that are present in each group. For example, A and B are present in each group (each ID), so I want only A and B in the result. I also want to make my query dynamic. Could anyone help?
| ID | Value |
|----|-------|
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
| 2 | A |
| 2 | B |
| 2 | C |
| 3 | A |
| 3 | B |
In the following query, I have placed your current query into a CTE for further use. We can then select those values that appear for every ID in your current result, which means such values are associated with every ID.
WITH cte AS (
-- your current query
)
SELECT Value
FROM cte
GROUP BY Value
HAVING COUNT(DISTINCT ID) = (SELECT COUNT(DISTINCT ID) FROM cte);
Demo
The solution is simple - you can do this in at least two ways. Group by letters (Value) and aggregate the IDs with SUM or COUNT (of the distinct values in ID). Having that, choose those letters whose SUM(ID) or COUNT(ID) matches the value computed over the whole table.
select Value from MyTable group by Value
having SUM(ID) = (SELECT SUM(DISTINCT ID) from MyTable)
select Value from MyTable group by Value
having COUNT(ID) = (SELECT COUNT(DISTINCT ID) from MyTable)
Use This
WITH CTE
AS
(
SELECT
Value,
Cnt = COUNT(DISTINCT ID)
FROM T1
GROUP BY Value
)
SELECT
Value
FROM CTE
WHERE Cnt = (SELECT COUNT(DISTINCT ID) FROM T1)

How to generate a sequence like this

+---+------------+
| V | output |
+---+------------+
| y | 1 |
| y | 2 |
| y | 3 |
| N | 0 |
| y | 1 |
| y | 2 |
| N | 0 |
| N | 1 |
+---+------------+
Let me assume that you have a column (say, id) that has the ordering information. Then, you want to identify groups of "Y"s and "N"s that appear together and then enumerate them.
You can do this using a difference of row numbers trick:
select t.v,
       row_number() over (partition by v, seqnum_id - seqnum_vid order by id) as output
from (select t.*,
             row_number() over (order by id) as seqnum_id,
             row_number() over (partition by v order by id) as seqnum_vid
      from t
     ) t;
Explaining how this works is usually tricky. I recommend that you run the subquery to see what the sequence numbers look like and why the difference is constant for the groups you want to identify.
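Following that advice, here is what the subquery alone would produce, assuming an id column that orders the sample rows 1 through 8 (the expected values are shown as comments):
select t.*,
       row_number() over (order by id) as seqnum_id,
       row_number() over (partition by v order by id) as seqnum_vid
from t;
-- id | v | seqnum_id | seqnum_vid | seqnum_id - seqnum_vid
--  1 | y |     1     |     1      |  0
--  2 | y |     2     |     2      |  0
--  3 | y |     3     |     3      |  0
--  4 | N |     4     |     1      |  3
--  5 | y |     5     |     4      |  1
--  6 | y |     6     |     5      |  1
--  7 | N |     7     |     2      |  5
--  8 | N |     8     |     3      |  5
-- Rows sharing the same (v, difference) pair form one consecutive run.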
Your sample output is somewhat complex, so I preferred to use a recursive SQL query for the solution of your problem.
Of course, I assume that the id column starts from 1 and increases continuously without any gaps. In a more complex case, the row_number() function should be added alongside the id field and the join should be set up on the row numbers (see the sketch after the query below).
I hope it helps.
--create table bool(id int identity(1,1), bool char(1))
--insert into bool values ('Y'),('N'),('Y'),('Y'),('Y'),('N'),('Y'),('N'),('N'),('Y'),('Y'),('Y'),('Y'),('Y'),('N'),('Y'),('Y')
;with cte as (
select id, bool curr, bool pre, 1 output from bool where id = 1
union all
select
bool.id, bool.bool curr, cte.curr,
case when bool.bool = cte.curr then cte.output + 1 else case when bool.bool = 'Y' then 1 else 0 end end
from cte
inner join bool on bool.id = cte.id + 1
)
select * from cte
The output follows the pattern in the question: consecutive Y rows count up 1, 2, 3, ..., an N right after a Y resets to 0, and consecutive N rows keep counting up from there.
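As a rough sketch of the gap-tolerant variant described above (not part of the original answer): number the rows with row_number() first, then drive the recursion off that gapless position instead of id.
-- Sketch only: assumes the same bool table as above, but tolerates gaps in id.
;with numbered as (
    select row_number() over (order by id) as rn, bool
    from bool
), cte as (
    select rn, bool as curr, 1 as output
    from numbered
    where rn = 1
    union all
    select n.rn, n.bool,
           case when n.bool = c.curr then c.output + 1
                when n.bool = 'Y' then 1
                else 0
           end
    from cte c
    inner join numbered n on n.rn = c.rn + 1
)
select * from cte;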

Grouping SQL Results based on order

I have table with data something like this:
ID | RowNumber | Data
------------------------------
1 | 1 | Data
2 | 2 | Data
3 | 3 | Data
4 | 1 | Data
5 | 2 | Data
6 | 1 | Data
7 | 2 | Data
8 | 3 | Data
9 | 4 | Data
I want to group each set of RowNumbers So that my result is something like this:
ID | RowNumber | Group | Data
--------------------------------------
1 | 1 | a | Data
2 | 2 | a | Data
3 | 3 | a | Data
4 | 1 | b | Data
5 | 2 | b | Data
6 | 1 | c | Data
7 | 2 | c | Data
8 | 3 | c | Data
9 | 4 | c | Data
The only way I know where each group starts and stops is when the RowNumber starts over. How can I accomplish this? It also needs to be fairly efficient since the table I need to do this on has 52 Million Rows.
Additional Info
ID is truly sequential, but RowNumber may not be. I think RowNumber will always begin with 1 but for example the RowNumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6", etc.
For the clarified requirements in the comments:
The rownumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6" ... a higher number followed by a lower would be a new group.
A SQL Server 2012 solution could be as follows.
Use LAG to access the previous row, and set a flag to 1 if the current row is the start of a new group, or 0 otherwise.
Calculate a running sum of these flags to use as the grouping value.
Code
WITH T1 AS
(
SELECT *,
LAG(RowNumber) OVER (ORDER BY ID) AS PrevRowNumber
FROM YourTable
), T2 AS
(
SELECT *,
IIF(PrevRowNumber IS NULL OR PrevRowNumber > RowNumber, 1, 0) AS NewGroup
FROM T1
)
SELECT ID,
RowNumber,
Data,
SUM(NewGroup) OVER (ORDER BY ID
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM T2
SQL Fiddle
Assuming ID is the clustered index, the plan for this has one scan against YourTable and avoids any sort operations.
If the ids are truly sequential, you can do:
select t.*,
(id - rowNumber) as grp
from t
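To see why this works, here is an illustration against the question's original sample; it assumes that ID is gapless (as the answer requires) and also that RowNumber increments by 1 within each group:
select t.*,
       (id - rowNumber) as grp
from t;
-- ID | RowNumber | grp
--  1 |     1     |  0
--  2 |     2     |  0
--  3 |     3     |  0
--  4 |     1     |  3
--  5 |     2     |  3
--  6 |     1     |  5
--  7 |     2     |  5
--  8 |     3     |  5
--  9 |     4     |  5
-- Each group shares one grp value, which can then be mapped to a group label.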
You can also use a recursive CTE:
;WITH cte AS
(
SELECT ID, RowNumber, Data, 1 AS [Group]
FROM dbo.test1
WHERE ID = 1
UNION ALL
SELECT t.ID, t.RowNumber, t.Data,
CASE WHEN t.RowNumber != 1 THEN c.[Group] ELSE c.[Group] + 1 END
FROM dbo.test1 t JOIN cte c ON t.ID = c.ID + 1
)
SELECT *
FROM cte
Demo on SQLFiddle
How about:
select ID, RowNumber, Data, dense_rank() over (order by grp) as Grp
from (
select *, (select min(ID) from [Your Table] where ID > t.ID and RowNumber = 1) as grp
from [Your Table] t
) t
order by ID
This should work on SQL 2005. You could also use rank() instead if you don't care about consecutive numbers.
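For reference, here is a sketch of the rank() variant just mentioned; the group numbers it produces are not consecutive, but they still distinguish the groups:
select ID, RowNumber, Data, rank() over (order by grp) as Grp
from (
    select *, (select min(ID) from [Your Table] where ID > t.ID and RowNumber = 1) as grp
    from [Your Table] t
) t
order by ID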