Count rows in partition with Order By - sql

I was trying to understand PARTITION BY in postgres by writing a few sample queries. I have a test table on which I run my query.
id integer | num integer
___________|_____________
1 | 4
2 | 4
3 | 5
4 | 6
When I run the following query, I get the output as I expected.
SELECT id, COUNT(id) OVER(PARTITION BY num) from test;
id | count
___________|_____________
1 | 2
2 | 2
3 | 1
4 | 1
But, when I add ORDER BY to the partition,
SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;
id | count
___________|_____________
1 | 1
2 | 2
3 | 1
4 | 1
My understanding is that COUNT is computed across all rows that fall into a partition. Here, I have partitioned the rows by num. The number of rows in the partition is the same, with or without an ORDER BY clause. Why is there a difference in the outputs?

When you add an ORDER BY to an aggregate used as a window function, that aggregate turns into a "running" aggregate (a running count in the case of count()).
The count(*) will then return the number of rows up to and including the "current row", based on the order specified.
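In Postgres, the ORDER BY implicitly adds a frame that ends at the current row. A minimal sketch against the question's test table, spelling that frame out (both count columns should be identical):
SELECT id, num,
       COUNT(id) OVER (PARTITION BY num ORDER BY id) AS running_ct,
       COUNT(id) OVER (PARTITION BY num ORDER BY id
                       RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS explicit_frame
FROM test;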
The following query shows the different results for aggregates used with an order by. With sum() instead of count() it's a bit easier to see (in my opinion).
with test (id, num, x) as (
  values
    (1, 4, 1),
    (2, 4, 1),
    (3, 5, 2),
    (4, 6, 2)
)
select id,
       num,
       x,
       count(*) over () as total_rows,
       count(*) over (order by id) as rows_upto,
       count(*) over (partition by x order by id) as rows_per_x,
       sum(num) over (partition by x) as total_for_x,
       sum(num) over (order by id) as sum_upto,
       sum(num) over (partition by x order by id) as sum_for_x_upto
from test;
will result in:
 id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
----+-----+---+------------+-----------+------------+-------------+----------+----------------
  1 |   4 | 1 |          4 |         1 |          1 |           8 |        4 |              4
  2 |   4 | 1 |          4 |         2 |          2 |           8 |        8 |              8
  3 |   5 | 2 |          4 |         3 |          1 |          11 |       13 |              5
  4 |   6 | 2 |          4 |         4 |          2 |          11 |       19 |             11
There are more examples in the Postgres manual

Your two expressions are:
COUNT(id) OVER (PARTITION BY num)
COUNT(id) OVER (PARTITION BY num ORDER BY id)
Why would you expect these to return the same values? The syntax is different for a reason.
The first returns the overall count for each num -- essentially joining back the aggregated value.
The second does a cumulative count. It computes the COUNT() for each row, counting all rows in the partition whose id is up to and including that row's id.
Note that such cumulative counts would normally be implemented using RANK() (or related functions).
The cumulative count is subtly different from RANK(). The cumulative count implements:
COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.
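A rough sketch of that difference, using hypothetical rows with a tie on the ORDER BY key; the two only diverge on the tied rows:
WITH t (num, id) AS (
    VALUES (4, 1), (4, 1), (4, 2)
)
SELECT num, id,
       COUNT(id) OVER (PARTITION BY num ORDER BY id) AS cume_count,
       RANK()    OVER (PARTITION BY num ORDER BY id) AS rnk
FROM t;
The two tied rows should both get cume_count = 2 (the default RANGE frame includes peers) but rnk = 1, while the last row gets 3 for both.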

The "why" has already been explained by others. Sometimes you have an ordered window, and you have to do a count over the whole partition despite having an ORDER BY.
To do so, use an unbounded frame: RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
create table search_log
(
    id bigint not null primary key,
    query varchar(255) not null,
    stemmed_query varchar(255) not null,
    created timestamp not null
);
SELECT query,
       created AS seen_on,
       first_value(created) OVER query_window AS last_seen,
       row_number() OVER query_window AS rn,
       count(*) OVER query_window AS occurrence
FROM search_log l
WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC
                        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
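Applied back to the test table from the question, the same idea restores the full per-partition count even though the window is ordered (a sketch, assuming that table):
SELECT id,
       COUNT(id) OVER (PARTITION BY num ORDER BY id
                       RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS count
FROM test;
This should return 2, 2, 1, 1 again, just like the query without ORDER BY.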

Related

Oracle query that fetches one row per each id

I have a table like :
ID | Val | Kind
----------------------
1 | a | 2
2 | b | 1
3 | c | 4
3 | c | 33
and I need to fetch one row for each id in Oracle SQL.
Any ideas?
You can use row_number() to enumerate the rows. For an arbitrary row:
select t.*
from (select t.*,
row_number() over (partition by id order by id) as seqnum
from t
) t
where seqnum = 1;
As I point out in a comment, though, this is unnecessary based on the data in your question. The ids are already unique.
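If the ids can in fact repeat (as in the sample, where id 3 appears twice), you can make the pick deterministic by ordering on another column. A sketch, assuming you want the row with the smallest kind per id:
select t.*
from (select t.*,
             row_number() over (partition by id order by kind) as seqnum
      from t
     ) t
where seqnum = 1;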

How to use LAST_VALUE in PostgreSQL?

I have a little table to try to understand how the LAST_VALUE function works in PostgreSQL. It looks like this:
id | value
----+--------
0 | A
1 | B
2 | C
3 | D
4 | E
5 | [null]
6 | F
What I want to do is to use LAST_VALUE to fill the NULL value with the precedent non-NULL value, so the result should be this:
id | value
----+--------
0 | A
1 | B
2 | C
3 | D
4 | E
5 | E
6 | F
The query I tried to accomplish that is:
SELECT LAST_VALUE(value)
OVER (PARTITION BY id ORDER BY case WHEN value IS NULL THEN 0 ELSE 1 END ASC)
FROM test;
From what I understand of the LAST_VALUE function, it takes all the rows before the current one as a window, sorts them according to the ORDER BY clause, and then returns the last row of the window. With my ORDER BY, all the rows containing a NULL should be put at the top of the window, so LAST_VALUE should return the last non-NULL value. But it doesn't.
I am clearly missing something. Please help.
I'm not sure last_value will do what you want. It would be better to use lag:
select id,
coalesce(value, lag(value) OVER (order by id))
FROM test;
id | coalesce
----+----------
0 | A
1 | B
2 | C
3 | D
4 | E
5 | E
6 | F
(7 rows)
last_value will return the last value of the current frame. Since you partitioned by id, there's only ever one row in each frame. lag will return the value from the previous row (one row back, by default) in the partition, which seems to be exactly what you want.
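One caveat worth noting (not an issue here, where there is only a single NULL): lag() only looks one row back by default, so two consecutive NULLs would leave the second one unfilled. A small sketch of that edge case with made-up data:
with test2 (id, value) as (
    values (0, 'A'), (1, null), (2, null), (3, 'B')
)
select id, coalesce(value, lag(value) over (order by id)) as filled
from test2;
Here id 1 becomes 'A', but id 2 stays NULL because its previous row is also NULL.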
To expand on this answer a bit, you can use row_number() to give you a good idea of the frame you are looking at. For your proposed solution, look at the row numbers for each row, when you partition by id:
SELECT id, row_number() OVER (PARTITION BY id ORDER BY case WHEN value IS NULL THEN 0 ELSE 1 END ASC)
FROM test;
id | row_number
----+------------
0 | 1
1 | 1
2 | 1
3 | 1
4 | 1
5 | 1
6 | 1
(7 rows)
Each row is its own frame, so you won't be able to get any values from other rows.
If we don't partition by id, but still use your ordering, you can see why this still won't work for last_value:
SELECT id, row_number() OVER (ORDER BY case WHEN value IS NULL THEN 0 ELSE 1 END ASC, id)
FROM test;
id | row_number
----+------------
5 | 1
0 | 2
1 | 3
2 | 4
3 | 5
4 | 6
6 | 7
(7 rows)
In this case, the row that was NULL is first. By default, last_value will include rows up to the current row, which in this case is just the current row for id 5. You could include all rows in your frame:
SELECT id,
       row_number() OVER (ORDER BY CASE WHEN value IS NULL THEN 0 ELSE 1 END ASC, id
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
       last_value(value) OVER (ORDER BY CASE WHEN value IS NULL THEN 0 ELSE 1 END ASC, id
                               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;
id | row_number | last_value
----+------------+------------
5 | 1 | F
0 | 2 | F
1 | 3 | F
2 | 4 | F
3 | 5 | F
4 | 6 | F
6 | 7 | F
(7 rows)
But now the last row is the end of the frame and it's clearly not what you want. If you're looking for the previous row, choose lag().
So, thanks to Jeremy's explanations and another post (PostgreSQL last_value ignore nulls) I finally figured it out:
SELECT id, value, first_value(value) OVER (PARTITION BY t.isnull ORDER BY id) AS new_val
FROM (
    SELECT id, value, SUM(CASE WHEN value IS NOT NULL THEN 1 END) OVER (ORDER BY id) AS isnull
    FROM test) t;
This query returns the result I expected.
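To see why it works, look at the grouping column the inner query builds; a sketch of what the running SUM should produce for the sample data:
SELECT id, value, SUM(CASE WHEN value IS NOT NULL THEN 1 END) OVER (ORDER BY id) AS isnull
FROM test;
 id | value  | isnull
----+--------+--------
  0 | A      |      1
  1 | B      |      2
  2 | C      |      3
  3 | D      |      4
  4 | E      |      5
  5 | [null] |      5
  6 | F      |      6
Each NULL row shares its isnull value with the last non-NULL row before it, so first_value() per isnull group (ordered by id) returns that non-NULL value.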
The trick here is to provide the frame (the BETWEEN bounds) explicitly, like this:
SELECT
    id,
    COALESCE(value, LAST_VALUE(value) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING))
FROM test;
The issue with your first attempt was (aside from the partitioning) that, since no frame bounds were provided, this default was assumed:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Even more confusing is that some window functions, like RANK, ROW_NUMBER, NTILE, etc., ignore the frame and effectively behave as if this were specified:
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
But your final solution is still more robust, since it handles contiguous NULL values. I just wanted to point out this default behavior, since I've seen people tripped up by it many times.
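For illustration, a minimal comparison of the two frames on the same test table (a sketch):
SELECT id, value,
       LAST_VALUE(value) OVER (ORDER BY id) AS default_frame,
       LAST_VALUE(value) OVER (ORDER BY id
                               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS full_frame
FROM test;
With the default frame each row only sees up to itself (so id 5 stays NULL), while the full frame returns F for every row.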

Group similar rows and count groups in PostgreSQL

I've got a table like this:
number | info | side
--------------------
1 | foo | a
2 | bar | a
3 | bar | a
4 | baz | a
5 | foo | a
6 | bar | b
7 | bar | b
8 | foo | a
9 | bar | a
10 | baz | a
I'd like to get how many times a bar group/package (e.g. rows 2,3 are a group, rows 6,7 are a group, row 9 is also a group) appears in the info column, depending on side. I'm stuck because I don't really know what to google. Whenever I search for something like "group rows" or "merge rows" I always end up finding information about the GROUP BY feature.
However I think I need some kind of window function.
Here is what I'd like to achieve:
bar_a | bar_b
-------------
2 | 1
Use lag() to determine first rows of groups:
select
number, info, side,
lag(info || side, 1, '') over (order by number) <> info || side as start_of_group
from my_table
order by 1;
number | info | side | start_of_group
--------+------+------+----------------
1 | foo | a | t
2 | bar | a | t
3 | bar | a | f
4 | baz | a | t
5 | foo | a | t
6 | bar | b | t
7 | bar | b | f
8 | foo | a | t
9 | bar | a | t
10 | baz | a | t
(10 rows)
Aggregate and filter the above result to get the desired output:
select concat(info, '_', side) as info_side, count(*)
from (
select
info, side,
lag(info || side, 1, '') over (order by number) <> info || side as start_of_group
from my_table
) s
where info = 'bar' and start_of_group
group by 1
order by 1;
info_side | count
-----------+-------
bar_a | 2
bar_b | 1
(2 rows)
This is a "gaps-and-islands" problem, at its heart, if I understand correct. For this version, the difference of row numbers should work well.
select sum((side = 'a')::int) as num_a,
       sum((side = 'b')::int) as num_b
from (select info, side, count(*) as cnt
      from (select t.*,
                   row_number() over (order by number) as seqnum,
                   row_number() over (partition by info, side order by number) as seqnum_bs
            from t
           ) t
      where info = 'bar'
      group by info, side, (seqnum - seqnum_bs)
     ) si;
You can make do with a single window function, which should be the fastest option:
SELECT side, count(*) AS count
FROM (
SELECT side, grp
FROM (
SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
FROM tbl
WHERE info = 'bar'
) sub1
GROUP BY 1, 2
) sub2
GROUP BY 1
ORDER BY 1; -- optional
Or shorter, maybe not faster:
SELECT side, count(DISTINCT grp) AS count
FROM (
SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
FROM tbl
WHERE info = 'bar'
) sub
GROUP BY 1
ORDER BY 1; -- optional
The "trick" is that adjacent rows forming a group (grp) have consecutive numbers. When subtracting the running count over the partition on side from the running count over all rows (number), members of a "group" get the same grp number.
If there are gaps in your serial column number, which is not the case in your demo but typically there are gaps (and you actually want to ignore such gaps?!), then use row_number() OVER (ORDER BY number) in a subquery instead of just number to close the gaps first:
SELECT side, count(DISTINCT grp) AS count
FROM (
SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
FROM (SELECT info, side, row_number() OVER (ORDER BY number) AS number FROM tbl) tbl1
WHERE info = 'bar'
) sub2
GROUP BY 1
ORDER BY 1; -- optional
SQL Fiddle (with extended test case)
Related:
Select longest continuous sequence

sql aggregate data

This is not a specific DBMS question, but a generic SQL problem.
I have this dataset:
userid | objecteid| count
--------------------------
1 | 1 | 12
1 | 2 | 15
1 | 3 | 6
2 | 4 | 30
2 | 1 | 1
2 | 5 | 9
With one query I need to find, for each user, the object with the maximum count.
I'm looking for a result like this:
userid | objecteid| count
--------------------------
1 | 2 | 15
2 | 4 | 30
because object 2 has the max count for user 1 and object 4 has the max count for user 2.
This can easily be solved using window functions.
The following is standard ANSI SQL:
select userid, objecteid, "count"
from (
select userid, objecteid, "count",
max("count") over (partition by userid) as max_cnt
from the_table
) t
where "count" = max_cnt;
If there are two objects with the same count, both will be returned.
Alternatively this can also be done using row_number() instead:
select userid, objecteid, "count"
from (
select userid, objecteid, "count",
row_number() over (partition by userid order by "count" desc) as rn
from the_table
) t
where rn = 1;
Unlike the first query, this will only pick one row if a user has more than one object with the same count. If you want those duplicates returned, use dense_rank() instead of row_number().
SQLFiddle: http://sqlfiddle.com/#!15/f02a9/1
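A sketch of that dense_rank() variant (same shape as the row_number() query, just swapping the function):
select userid, objecteid, "count"
from (
  select userid, objecteid, "count",
         dense_rank() over (partition by userid order by "count" desc) as rn
  from the_table
) t
where rn = 1;
Like the max() version, this returns every object that is tied for the highest count of its user.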
try this (the max count has to be matched per user, so correlate the subquery on userid):
Select * from tableName t
where count = (
    Select Max(count)
    from tableName t2
    where t2.userid = t.userid
)

SQL RANK() versus ROW_NUMBER()

I'm confused about the differences between these. Running the following SQL gets me two identical result sets. Can someone please explain the differences?
SELECT ID, [Description], RANK() OVER(PARTITION BY StyleID ORDER BY ID) as 'Rank' FROM SubStyle
SELECT ID, [Description], ROW_NUMBER() OVER(PARTITION BY StyleID ORDER BY ID) as 'RowNumber' FROM SubStyle
You will only see the difference if you have ties within a partition for a particular ordering value.
RANK and DENSE_RANK are deterministic in this case: all rows with the same value for both the ordering and partitioning columns will end up with an equal result, whereas ROW_NUMBER will arbitrarily (non-deterministically) assign an incrementing result to the tied rows.
Example: (All rows have the same StyleID so are in the same partition and within that partition the first 3 rows are tied when ordered by ID)
WITH T(StyleID, ID)
AS (SELECT 1,1 UNION ALL
SELECT 1,1 UNION ALL
SELECT 1,1 UNION ALL
SELECT 1,2)
SELECT *,
RANK() OVER(PARTITION BY StyleID ORDER BY ID) AS [RANK],
ROW_NUMBER() OVER(PARTITION BY StyleID ORDER BY ID) AS [ROW_NUMBER],
DENSE_RANK() OVER(PARTITION BY StyleID ORDER BY ID) AS [DENSE_RANK]
FROM T
Returns
StyleID ID RANK ROW_NUMBER DENSE_RANK
----------- -------- --------- --------------- ----------
1 1 1 1 1
1 1 1 2 1
1 1 1 3 1
1 2 4 4 2
You can see that for the three identical rows the ROW_NUMBER increments, the RANK value remains the same then it leaps to 4. DENSE_RANK also assigns the same rank to all three rows but then the next distinct value is assigned a value of 2.
ROW_NUMBER: Returns a unique number for each row, starting with 1. For rows that have duplicate values, numbers are arbitrarily assigned.
RANK: Assigns a unique number for each row, starting with 1, except for rows that have duplicate values, in which case the same ranking is assigned and a gap appears in the sequence for each duplicate ranking.
This article covers an interesting relationship between ROW_NUMBER() and DENSE_RANK() (the RANK() function is not treated specifically). When you need a generated ROW_NUMBER() on a SELECT DISTINCT statement, the ROW_NUMBER() is computed before DISTINCT is applied, and since it produces distinct values, DISTINCT ends up removing nothing. E.g. this query
SELECT DISTINCT
v,
ROW_NUMBER() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... might produce this result (DISTINCT has no effect):
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| a | 2 |
| a | 3 |
| b | 4 |
| c | 5 |
| c | 6 |
| d | 7 |
| e | 8 |
+---+------------+
Whereas this query:
SELECT DISTINCT
v,
DENSE_RANK() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... produces what you probably want in this case:
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
+---+------------+
Note that the ORDER BY clause of the DENSE_RANK() function will need all other columns from the SELECT DISTINCT clause to work properly.
The reason for this is that logically, window functions are calculated before DISTINCT is applied.
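If you actually need a gapless number per distinct value, one workaround (a sketch) is to deduplicate first and number afterwards:
SELECT v, ROW_NUMBER() OVER (ORDER BY v) AS row_number
FROM (SELECT DISTINCT v FROM t) s
ORDER BY v;
Because the subquery removes the duplicates before the window function runs, the result matches the DENSE_RANK() output above.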
All three functions in comparison
Using PostgreSQL / Sybase / SQL standard syntax (WINDOW clause):
SELECT
v,
ROW_NUMBER() OVER (window) row_number,
RANK() OVER (window) rank,
DENSE_RANK() OVER (window) dense_rank
FROM t
WINDOW window AS (ORDER BY v)
ORDER BY v
... you'll get:
+---+------------+------+------------+
| V | ROW_NUMBER | RANK | DENSE_RANK |
+---+------------+------+------------+
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 4 | 4 | 2 |
| c | 5 | 5 | 3 |
| c | 6 | 5 | 3 |
| d | 7 | 7 | 4 |
| e | 8 | 8 | 5 |
+---+------------+------+------------+
Simple query without partition clause:
select
sal,
RANK() over(order by sal desc) as Rank,
DENSE_RANK() over(order by sal desc) as DenseRank,
ROW_NUMBER() over(order by sal desc) as RowNumber
from employee
Output:
--------|-------|-----------|----------
sal |Rank |DenseRank |RowNumber
--------|-------|-----------|----------
5000 |1 |1 |1
3000 |2 |2 |2
3000 |2 |2 |3
2975 |4 |3 |4
2850 |5 |4 |5
--------|-------|-----------|----------
Quite a bit:
The rank of a row is one plus the number of ranks that come before the row in question.
Row_number is a unique, sequential number for each row, without any gaps or repeats in the numbering, even for tied rows.
http://www.bidn.com/blogs/marcoadf/bidn-blog/379/ranking-functions-row_number-vs-rank-vs-dense_rank-vs-ntile
Note, all these windowing functions return an integer-like value.
Often the database will choose a BIGINT datatype, which takes much more space than we need; we will rarely need a range from -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
Cast the results to a BYTEINT, SMALLINT, or INTEGER.
Modern systems and hardware are strong enough that you may never see a meaningful extra use of resources, but I think it's best practice.
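For example, a cast like this (a sketch; pick the smallest type that safely fits your row counts):
SELECT v,
       CAST(ROW_NUMBER() OVER (ORDER BY v) AS SMALLINT) AS row_number
FROM t;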
Look at this example.
CREATE TABLE [dbo].#TestTable(
    [id] [int] NOT NULL,
    [create_date] [date] NOT NULL,
    [info1] [varchar](50) NOT NULL,
    [info2] [varchar](50) NOT NULL
)
Insert some data
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (1, '1/1/09', 'Blue', 'Green')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (1, '1/2/09', 'Red', 'Yellow')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (1, '1/3/09', 'Orange', 'Purple')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (2, '1/1/09', 'Yellow', 'Blue')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (2, '1/5/09', 'Blue', 'Orange')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (3, '1/2/09', 'Green', 'Purple')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (3, '1/8/09', 'Red', 'Blue')
Repeat the same values for id 1
INSERT INTO dbo.#TestTable (id, create_date, info1, info2) VALUES (1,
'1/1/09', 'Blue', 'Green')
Look at all the rows
SELECT * FROM #TestTable
Look at your results
SELECT Id,
create_date,
info1,
info2,
ROW_NUMBER() OVER (PARTITION BY Id ORDER BY create_date DESC) AS RowId,
RANK() OVER(PARTITION BY Id ORDER BY create_date DESC) AS [RANK]
FROM #TestTable
This should help you understand the difference.
I haven't done anything with rank, but I discovered this today with row_number().
select item, name, sold, row_number() over(partition by item order by sold) as row from table_name
This will result in some repeating row numbers, since the numbering restarts for each item (in my case every store carries every item). Within each item, the rows are numbered in order of how many were sold.
+--------+------+-----+----+
|  item  | name |sold |row |
+--------+------+-----+----+
|glasses |store1| 30  | 1  |
|glasses |store2| 35  | 2  |
|glasses |store3| 40  | 3  |
|shoes   |store2| 10  | 1  |
|shoes   |store1| 20  | 2  |
|shoes   |store3| 22  | 3  |
+--------+------+-----+----+
Also, pay attention to the ORDER BY inside the PARTITION when using RANK (the standard AdventureWorks db is used for the example).
SELECT as1.SalesOrderID, as1.SalesOrderDetailID,
       RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.SalesOrderID) AS rank_same_as_partition,
       RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.SalesOrderDetailId) AS rank_salesorderdetailid
FROM Sales.SalesOrderDetail as1
WHERE SalesOrderId = 43659
ORDER BY SalesOrderDetailId;
Gives result:
SalesOrderID SalesOrderDetailID rank_same_as_partition rank_salesorderdetailid
43659 1 1 1
43659 2 1 2
43659 3 1 3
43659 4 1 4
43659 5 1 5
43659 6 1 6
43659 7 1 7
43659 8 1 8
43659 9 1 9
43659 10 1 10
43659 11 1 11
43659 12 1 12
But if we change the ORDER BY to use OrderQty:
SELECT as1.SalesOrderID, as1.OrderQty,
       RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.SalesOrderID) AS rank_salesorderid,
       RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.OrderQty) AS rank_orderqty
FROM Sales.SalesOrderDetail as1
WHERE SalesOrderId = 43659
ORDER BY OrderQty;
Gives:
SalesOrderID OrderQty rank_salesorderid rank_orderqty
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 2 1 7
43659 2 1 7
43659 3 1 9
43659 3 1 9
43659 4 1 11
43659 6 1 12
Notice how the Rank changes when we use OrderQty (rightmost column second table) in ORDER BY and how it changes when we use SalesOrderDetailID (rightmost column first table) in ORDER BY.