SQL: difference between PARTITION BY and GROUP BY - sql

I've been using GROUP BY for all types of aggregate queries over the years. Recently, I've been reverse-engineering some code that uses PARTITION BY to perform aggregations.
In reading through all the documentation I can find about PARTITION BY, it sounds a lot like GROUP BY, maybe with a little extra functionality added in.
Are they two versions of the same general functionality or are they something different entirely?

They're used in different places. GROUP BY modifies the entire query, like:
select customerId, count(*) as orderCount
from Orders
group by customerId
But PARTITION BY just works on a window function, like ROW_NUMBER():
select row_number() over (partition by customerId order by orderId)
as OrderNumberForThisCustomer
from Orders
GROUP BY normally reduces the number of rows returned by rolling
them up and calculating averages or sums for each row.
PARTITION BY does not affect the number of rows returned, but it
changes how a window function's result is calculated.

We can take a simple example.
Consider a table named TableA with the following values:
id firstname lastname Mark
-------------------------------------------------------------------
1 arun prasanth 40
2 ann antony 45
3 sruthy abc 41
6 new abc 47
1 arun prasanth 45
1 arun prasanth 49
2 ann antony 49
GROUP BY
The SQL GROUP BY clause can be used in a SELECT statement to collect
data across multiple records and group the results by one or more
columns.
In more simple words GROUP BY statement is used in conjunction with
the aggregate functions to group the result-set by one or more
columns.
Syntax:
SELECT expression1, expression2, ... expression_n,
aggregate_function (aggregate_expression)
FROM tables
WHERE conditions
GROUP BY expression1, expression2, ... expression_n;
We can apply GROUP BY in our table:
select SUM(Mark)marksum,firstname from TableA
group by id,firstName
Results:
marksum firstname
----------------
94 ann
134 arun
47 new
41 sruthy
In our real table we have 7 rows and when we apply GROUP BY id, the server group the results based on id:
In simple words:
here GROUP BY normally reduces the number of rows returned by rolling
them up and calculating Sum() for each row.
PARTITION BY
Before going to PARTITION BY, let us look at the OVER clause:
According to the MSDN definition:
OVER clause defines a window or user-specified set of rows within a
query result set. A window function then computes a value for each row
in the window. You can use the OVER clause with functions to compute
aggregated values such as moving averages, cumulative aggregates,
running totals, or a top N per group results.
PARTITION BY will not reduce the number of rows returned.
We can apply PARTITION BY in our example table:
SELECT SUM(Mark) OVER (PARTITION BY id) AS marksum, firstname FROM TableA
Result:
marksum firstname
-------------------
134 arun
134 arun
134 arun
94 ann
94 ann
41 sruthy
47 new
Look at the results - it will partition the rows and returns all rows, unlike GROUP BY.

partition by doesn't actually roll up the data. It allows you to reset something on a per group basis. For example, you can get an ordinal column within a group by partitioning on the grouping field and using rownum() over the rows within that group. This gives you something that behaves a bit like an identity column that resets at the beginning of each group.

PARTITION BY
Divides the result set into partitions. The window function is applied to each partition separately and computation restarts for each partition.
Found at this link: OVER Clause

PARTITION BY is analytic, while GROUP BY is aggregate. In order to use PARTITION BY, you have to contain it with an OVER clause.

It provides rolled-up data without rolling up
i.e. Suppose I want to return the relative position of sales region
Using PARTITION BY, I can return the sales amount for a given region and the MAX amount across all sales regions in the same row.
This does mean you will have repeating data, but it may suit the end consumer in the sense that data has been aggregated but no data has been lost - as would be the case with GROUP BY.

As of my understanding Partition By is almost identical to Group By, but with the following differences:
That group by actually groups the result set returning one row per group, which results therefore in SQL Server only allowing in the SELECT list aggregate functions or columns that are part of the group by clause (in which case SQL Server can guarantee that there are unique results for each group).
Consider for example MySQL which allows to have in the SELECT list columns that are not defined in the Group By clause, in which case one row is still being returned per group, however if the column doesn't have unique results then there is no guarantee what will be the output!
But with Partition By, although the results of the function are identical to the results of an aggregate function with Group By, still you are getting the normal result set, which means that one is getting one row per underlying row, and not one row per group, and because of this one can have columns that are not unique per group in the SELECT list.
So as a summary, Group By would be best when needs an output of one row per group, and Partition By would be best when one needs all the rows but still wants the aggregate function based on a group.
Of course there might also be performance issues, see http://social.msdn.microsoft.com/Forums/ms-MY/transactsql/thread/0b20c2b5-1607-40bc-b7a7-0c60a2a55fba.

PARTITION BY semantics
Your question was specifically about SQL Server, which currently only supports a PARTITION BY clause only in window functions, but as I've explained in this blog post about the various meanings of PARTITION BY in SQL, there are also others, including:
Window partitions (window functions are a SQL standard)
Table partitions (vendor specific extensions to organise storage, e.g. in Oracle or PostgreSQL)
MATCH_REGOGNIZE partitions (which is also a SQL standard)
MODEL or SPREADSHEET partitions (an Oracle extension to SQL)
OUTER JOIN partitions (a SQL standard)
Apart from the last one, which re-uses the PARTITION BY syntax to implement some sort of CROSS JOIN logic, all of these PARTITION BY clauses have the same meaning:
A partition separates a data set into subsets, which don’t overlap.
Based on this partitioning, further calculations or storage operations per partition can be implemented. E.g. with window functions, such as COUNT(*) OVER (PARTITION BY criteria), the COUNT(*) value is calculated per partition.
GROUP BY semantics
GROUP BY allows for similar partitioning behaviour, although it also transforms the semantics of your entire query in various weird ways. Most queries using GROUP BY can be rewritten using window functions, instead, although often, the GROUP BY syntax is more concise and possibly also better optimised.
For example, these are the logically the same, but I would expect the GROUP BY clause to perform better:
-- Classic
SELECT a, COUNT(*)
FROM t
GROUP BY a
-- Using window functions
SELECT DISTINCT a, COUNT(*) OVER (PARTITION BY a)
FROM t
The key difference is:
Window functions can also be non-aggregate functions, e.g. ROW_NUMBER()
Each window function can have its own PARTITION BY clause, whereas GROUP BY can only group by one set of expressions per query.

When you use GROUP BY, the resulting rows will be usually less then incoming rows.
But, when you use PARTITION BY, the resulting row count should be the same as incoming.

Small observation. Automation mechanism to dynamically generate SQL using the 'partition by' it is much simpler to implement in relation to the 'group by'. In the case of 'group by', We must take care of the content of 'select' column.
Sorry for My English.

Suppose we have 14 records of name column in table
in group by
select name,count(*) as totalcount from person where name='Please fill out' group BY name;
it will give count in single row i.e 14
but in partition by
select row_number() over (partition by name) as total from person where name = 'Please fill out';
it will 14 rows of increase in count

It has really different usage scenarios.
When you use GROUP BY you merge some of the records for the columns that are same and you have an aggregation of the result set.
However when you use PARTITION BY your result set is same but you just have an aggregation over the window functions and you don't merge the records, you will still have the same count of records.
Here is a rally helpful article explaining the difference:
http://alevryustemov.com/sql/sql-partition-by/

-- BELOW IS A SAMPLE WHICH OUTLINES THE SIMPLE DIFFERENCES
-- READ IT AND THEN EXECUTE IT
-- THERE ARE THREE ROWS OF EACH COLOR INSERTED INTO THE TABLE
-- CREATE A database called testDB
-- use testDB
USE [TestDB]
GO
-- create Paints table
CREATE TABLE [dbo].[Paints](
[Color] [varchar](50) NULL,
[glossLevel] [varchar](50) NULL
) ON [PRIMARY]
GO
-- Populate Table
insert into paints (color, glossLevel)
select 'red', 'eggshell'
union
select 'red', 'glossy'
union
select 'red', 'flat'
union
select 'blue', 'eggshell'
union
select 'blue', 'glossy'
union
select 'blue', 'flat'
union
select 'orange', 'glossy'
union
select 'orange', 'flat'
union
select 'orange', 'eggshell'
union
select 'green', 'eggshell'
union
select 'green', 'glossy'
union
select 'green', 'flat'
union
select 'black', 'eggshell'
union
select 'black', 'glossy'
union
select 'black', 'flat'
union
select 'purple', 'eggshell'
union
select 'purple', 'glossy'
union
select 'purple', 'flat'
union
select 'salmon', 'eggshell'
union
select 'salmon', 'glossy'
union
select 'salmon', 'flat'
/* COMPARE 'GROUP BY' color to 'OVER (PARTITION BY Color)' */
-- GROUP BY Color
-- row quantity defined by group by
-- aggregate (count(*)) defined by group by
select count(*) from paints
group by color
-- OVER (PARTITION BY... Color
-- row quantity defined by main query
-- aggregate defined by OVER-PARTITION BY
select color
, glossLevel
, count(*) OVER (Partition by color)
from paints
/* COMPARE 'GROUP BY' color, glossLevel to 'OVER (PARTITION BY Color, GlossLevel)' */
-- GROUP BY Color, GlossLevel
-- row quantity defined by GROUP BY
-- aggregate (count(*)) defined by GROUP BY
select count(*) from paints
group by color, glossLevel
-- Partition by Color, GlossLevel
-- row quantity defined by main query
-- aggregate (count(*)) defined by OVER-PARTITION BY
select color
, glossLevel
, count(*) OVER (Partition by color, glossLevel)
from paints

Related

Scalar subquery producing more than 1 record with partition - SQL

I have data with two rows as follows:
group_id item_no
weoifne 1
weoifne 2
I want to retrieve the max item_no for each group_id. I'm using this query:
SELECT MAX(item_no)
OVER (PARTITION BY group_id)
FROM my_table;
I need only one record because I'm embedding this query in a CASE WHEN statement to apply logic based on whether or not item_no is the highest value per group.
Desired Output:
2
Actual Output:
2
2
How do I modify my query to only output one record with the maximum item_no per group_id?
Use an aggregate function along with GROUP BY instead of an window function.
A window function, also known as an analytic function, computes values over a group of rows and returns a single result for each row. This is different from an aggregate function, which returns a single result for a group of rows.
https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls
SELECT group_id, MAX(item_no)
FROM my_table
GROUP BY group_id;
If you still want to use the window function, you can use DISTINCT in your script to get rid of the duplicates as shown below. DISTINCT works across all the columns
SELECT DISTINCT group_id
, MAX(item_no) OVER (PARTITION BY group_id)
FROM my_table

how to find maximum of sum of number using if else in procedure in sap hana sql

I want to list out the product which has highest sales amount on date wise.
note: highest sales amount in the sense max(sum(sales_amnt)...
by using if or case In the procedure in sap hana SQL....
I did this by using with the clause :
/--------------------------CORRECT ONE ----------------------------------------------/
WITH ranked AS
(
SELECT Dense_RAnk() OVER (ORDER BY SUM("SALES_AMNT"), "SALES_DATE", "PROD_NAME") as rank,
SUM("SALES_AMNT") AS Amount, "PROD_NAME",count(*), "SALES_DATE" FROM "KABIL"."DATE"
GROUP BY "SALES_DATE", "PROD_NAME"
)
SELECT "SALES_DATE", "PROD_NAME",Amount
FROM ranked
WHERE rank IN ( select MAX(rank) from ranked group by "SALES_DATE")
ORDER BY "SALES_DATE" DESC;
this is my table
You can not use IF along with SELECT statement. Note that, you can achieve most of boolean logics with CASE statement syntax
In select, you are applying it over a column and your logic will be executed as many as times the count of result set rows. Hence , righting an imperative logic is not well appreciated. Still, if you want to do the same, create a calculation view and use intermediate calculated columns to achieve what you are expecting .
try this... i got an answer ...
select "SALES_DATE","PROD_NAME",sum("SALES_AMNT")
from "KABIL"."DATE"
group by "SALES_DATE","PROD_NAME"
having (SUM("SALES_AMNT"),"SALES_DATE") IN (select
MAX(SUM_SALES),"SALES_DATE"
from (select SUM("SALES_AMNT")
as
SUM_SALES,"SALES_DATE","PROD_NAME"
from "KABIL"."DATE"
group by "SALES_DATE","PROD_NAME"
)
group by "SALES_DATE");

Group by or Distinct - But several fields

How can I use a Distinct or Group by statement on 1 field with a SELECT of All or at least several ones?
Example: Using SQL SERVER!
SELECT id_product,
description_fr,
DiffMAtrice,
id_mark,
id_type,
NbDiffMatrice,
nom_fr,
nouveaute
From C_Product_Tempo
And I want Distinct or Group By nom_fr
JUST GOT THE ANSWER:
select id_product, description_fr, DiffMAtrice, id_mark, id_type, NbDiffMatrice, nom_fr, nouveaute
from (
SELECT rn = row_number() over (partition by [nom_fr] order by id_mark)
, id_product, description_fr, DiffMAtrice, id_mark, id_type, NbDiffMatrice, nom_fr, nouveaute
From C_Product_Tempo
) d
where rn = 1
And this works prfectly!
If I'm understanding you correctly, you just want the first row per nom_fr. If so, you can simply use a subquery to get the lowest id_product per nom_fr, and just get the corresponding rows;
SELECT * FROM C_Product_Tempo WHERE id_product IN (
SELECT MIN(id_product) FROM C_Product_Tempo GROUP BY nom_fr
);
An SQLfiddle to test with.
You need to decide what to do with the other fields. For example, for numeric fields, do you want a sum? Average? Max? Min? For non-numeric fields to you want the values from a particular record if there are more than one with the same nom_fr?
Some SQL Systems allow you to get a "random" record when you do a GROUP BY, but SQL Server will not - you must define the proper aggregation for columns that are not in the GROUP BY.
GROUP BY is used to group in conjunction with an aggregate function (see http://www.w3schools.com/sql/sql_groupby.asp), so it's no use grouping without counting, summing up etc. DISTINCT eleminates duplicates but how that matches with the other columns you want to extract, I can't imagine, because some rows will be removed from the result.

Sybase: HAVING operates on rows?

I've came across the following SYBASE SQL:
-- Setup first
create table #t (id int, ts int)
go
insert into #t values (1, 2)
insert into #t values (1, 10)
insert into #t values (1, 20)
insert into #t values (1, 30)
insert into #t values (2, 5)
insert into #t values (2, 13)
insert into #t values (2, 25)
go
declare #time int select #time=11
-- This is the SQL I am asking about
select * from (select * from #t where ts <= #time) t group by id having ts = max(ts)
go
The results of this SQL are
id ts
----------- -----------
1 10
2 5
This looks like HAVING condition applied to rows rather than groups. Can someone please point me at a place is Sybase 15.5 documentation where this case is described? All I see is "HAVING operates on groups". The closest I see in the docs is:
The having clause can include columns or expressions that are not in
the select list and not in the group by clause.
(Quote from here).
However, they don't exactly explain what happens when you do that.
My understanding: Yes, fundamentally, HAVING operates on rows. By omitting a GROUP BY, it operates on all result rows within a single "supergroup" rather than on rows-within-groups. Read the section "How group by and having queries with aggregates work" in your originally-linked Sybase docco:-
How group by and having queries with aggregates work
The where clause excludes rows that do not meet its search conditions; its function remains the same for grouped or nongrouped queries.
The group by clause collects the remaining rows into one group for each unique value in the group by expression. Omitting group by creates a single group for the whole table.
Aggregate functions specified in the select list calculate summary values for each group. For scalar aggregates, there is only one value for the table. Vector aggregates calculate values for the distinct groups.
The having clause excludes groups from the results that do not meet its search conditions. Even though the having clause tests only rows, the presence or absence of a group by clause may make it appear to be operating on groups:
When the query includes group by, having excludes result group rows. This is why having seems to operate on groups.
When the query has no group by, having excludes result rows from the (single-group) table. This is why having seems to operate on rows (the results are similar to where clause results).
Secondly, a brief summary appears in the section "How the having, group by, and where clauses interact":-
How the having, group by, and where clauses interact
When you include the having, group by, and where clauses in a query, the sequence in which each clause affects the rows determines the final results:
The where clause excludes rows that do not meet its search conditions.
The group by clause collects the remaining rows into one group for each unique value in the group by expression.
Aggregate functions specified in the select list calculate summary values for each group.
The having clause excludes rows from the final results that do not meet its search conditions.
#SQLGuru's explanation is an illustration of this.
Edit...
On a related point, I was surprised by the behaviour of non-ANSI-conforming queries that utilise TSQL "extended columns". Sybase handles the extended columns (i) after the WHERE clause (ii) by creating extra joins to the original tables and (iii) the WHERE clause is not used in the join. Such queries might return more rows than expected and the HAVING clause then requires additional conditions to filter these out.
See examples b, c and d under "Transact-SQL extensions to group by and having" on the page of your originally-linked docco. I found it useful to install the pubs2 sample database from Sybase to play along with the examples.
I haven't done Sybase since it shared code with MS SQL Server....90's, but my interpretation of what you are doing is this:
First, the list is filtered to <= 11
id ts
1 2
1 10
2 5
Everything else is filtered out.
Next, you are filtering the list to the rows where TS = the Max(TS) for that group.
id ts
1 10
2 5
10 is the Max(TS) for group 1 and 5 is the Max(TS) for group 2. Those two rows are the ones that remain. What result would you expect otherwise?
If you read the documentation here, it seems that Sybase use of columns in the having clause that don't appear in the group by clause is different from MySQL.
The example they give has this explanation:
The Transact-SQL extended column, price (in the select list, but not
an aggregate and not in the group by clause), causes all qualified
rows to display in each qualified group, even though a standard group
by clause produces a single row per group. The group by still affects
the vector aggregate, which computes the average price per group
displayed on each row of each group (they are the same values that
were computed for example a):
So, ts = max(ts) essentially does this:
select *
from (select t.*,
max(ts) over (partition by id) as maxts
from #t
where ts <= #time
) t
where ts = maxts
The subquery is important, because the where clause gets used for the max() calculation and all rows would be returned.
I find this behavior rather confusing and non-standard. I would replace it with more typical constructs. These are about the same level of complexity and seem clearer to a larger audience.

GROUP BY combined with ORDER BY

The GROUP BY clause groups the rows, but it does not necessarily sort the results in any particular order. To change the order, use the ORDER BY clause, which follows the GROUP BY clause. The columns used in the ORDER BY clause must appear in the SELECT list, which is unlike the normal use of ORDER BY. [Oracle by Example, fourth Edition, page 274]
Why is that? Why does using GROUP BY influence the required columns in the SELECT clause?
Also, in the case where I do not use GROUP BY: Why would I want to ORDER BY some columns but then select only a subset of the columns?
Actually the statement is not entirely true as Dave Costa's example shows.
The Oracle documentation says that an expression can be used but the expression must be based on the columns in the selection list.
expr - expr orders rows based on their value for expr. The expression is based on
columns in the select list or columns in the tables, views, or materialized views in the
FROM clause. Source: Oracle® Database
SQL Language Reference
11g Release 2 (11.2)
E26088-01
September 2011. Page 19-33
From the the same work page 19-13 and 19-33 (Page 1355 and 1365 in the PDF)
http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#SQLRF01702
http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#i2171079
The bold text from your quote is incorrect (it's probably an oversimplification that is true in many common use cases, but it is not strictly true as a requirement). For instance, this statement executes just fine, although AVG(val) is not in the select list:
WITH DATA AS (SELECT mod(LEVEL,3) grp, LEVEL val FROM dual CONNECT BY LEVEL < 100)
SELECT grp,MIN(val),MAX(val)
FROM DATA
GROUP BY grp
ORDER BY AVG(val)
The expressions in the ORDER BY clause simply have to be possible to evaluate in the context of the GROUP BY. For instance, ORDER BY val would not work in the above example, because the expression val does not have a distinct value for each row produced by the grouping.
As to your second question, you may care about the ordering but not about the value of the ordering expression. Excluding unneeded expressions from the select lists reduces the amount of data that must actually be sent from the server to the client.
First:
The implementation of group by is one which creates a new resultset that differs in structure to the original from clause (table view or some joined tables). That resultset is defined by what is selected.
Not every SQL RDBMS has this restriction, though it is a always requirement that what is ordered by be either an aggregate function of the non-grouped columns (AVG, SUM, etc) or one of the columns grouped by, or functions upon more than one of those results (like adding two columns), because this is a logical requirement of the result of the grouping operation.
Second:
Because you only care about that column for the ordering. For example, you might have a list of the top selling singles without giving their sales (the NYT Bestsellers keeps some details of their data a secret, but do have a ranked list). Of course, you can get around this by just selecting that column and then not using it.
The data is aggregated before it is sorted for the ORDER BY.
If you try to order by any other column (that is not in the group by list or an aggregation function), what value would be used? There is no single value to use for ordering.
I believe that you can use combinations of the values for sorting. So you can say:
order by a+b
If a and b are in the group by. You just cannot introduce columns not mentioned in the SELECT. I believe you can use aggregation functions not mentioned in the SELECT, however.
Sample table
sample.grades
Name Grade Score
Adam A 95
Bob A 97
Charlie C 75
First Query using GROUP BY
Select grade, count(Grade) from sample.grades GROUP BY Grade
Output
Grade Count
A 2
C 1
Second Query using order by
select Name, score from sample grades order by score
Output
Bob A 97
Adam A 95
Charlie C 75
Third Query using GROUP BY and ordering
Select grade, count(Grade) from sample.grades GROUP BY Grade desc
Output
Grade Count
A 2
C 1
Once you start using things like Count, you must have group by. You can use them together, but they have very different uses, as I hope the examples clearly show.
To try and answer the question, why does group by effect the items in the select section, because that is what group by is meant to do. You can't do the count of a column if you do not group by that column.
Second question, why would you want to order by but not select all the columns?
If I want to order by the score, but do not care about the actual grade or even the score I might do
select name from sample.grades order by score
Output
Name
Bob
Adam
Charlie
Which results do you expect to see ordering by columns not listed in the select list and not participated in group by clause? at any case all kind of sort by non-mentioned in SELECT list columns will be omitted so Oracle guys added the restriction correctly.
with c as (
select 1 id, 2 value from dual
union all
select 1 id, 3 value from dual
union all
select 2 id, 3 value from dual
)
select id
from c
group by id
order by count(*) desc
Here my inderstanding
"The GROUP BY clause groups the rows, but it does not necessarily sort the results in any particular order."
-> you can use Group by without order by
"To change the order, use the ORDER BY clause, which follows the GROUP BY clause."
-> the rows are selected by defaut with primary key, and if you add order by you must add after group by
"The columns used in the ORDER BY clause must appear in the SELECT list, which is unlike the normal use of ORDER BY."