Why shouldn’t you use DISTINCT when you could use GROUP BY? - sql

According to tips from MySQL performance wiki:
Don't use DISTINCT when you have or could use GROUP BY.
Can somebody post example of queries where GROUP BY can be used instead of DISTINCT?

If you know that two columns from your result are always directly related then it's slower to do this:
SELECT DISTINCT CustomerId, CustomerName FROM (...)
than this:
SELECT CustomerId, CustomerName FROM (...) GROUP BY CustomerId
because in the second case it only has to compare the id, but in the first case it has to compare both fields. This is a MySQL specific trick. It won't work with other databases.

SELECT Code
FROM YourTable
GROUP BY Code
vs
SELECT DISTINCT Code
FROM YourTable

The basic rule : Put all the columns from the SELECT clause into the GROUP BY clause
so
SELECT DISTINCT a,b,c FROM D
becomes
SELECT a,b,c FROM D GROUP BY a,b,c

Example.
Relation customer(ssnum,name, zipcode, address) PK(ssnum). ssnum is social security number.
SQL:
Select DISTINCT ssnum from customer where zipcode=1234 group by name
This SQL statement returns unique records for those customer's that have zipcode 1234. At the end results are grouped by name.
Here DISTINCT is no not necessary. because you are selecting ssnum which is already unique because ssnun is primary key. two person can not have same ssnum.
In this case Select ssnum from customer where zipcode=1234 group by name will give better performance than "... DISTINCT.......".
DISTINCT is an expensive operation in a DBMS.

Related

Select query group by alias

I have this query:
SELECT * FROM TABLE1 WHERE AREA_CODE IN ('929', '718', '347', '646') GROUP BY AREA_CODE
Is it possible to get only one record row with name 'NEW_YORK_AREA' that includes all these four area codes? To be more clear, let's say you have 4 records in the table for each area code listed above but you want to get only one result(row) with alias 'NEW_YOUR_AREA'. I hope it is clear, let me know if you have any questions, I will edit the question. Thank you all and have a great day.
UPDATE: requirements have changed and it is no longer needed. Thank you all for your help! :)
DB2 supports listagg(). So:
SELECT 'NEW_YORK_AREA' as cityname,
LISTAGG(AREA_CODE, ',') WITHIN GROUP (ORDER BY AREA_CODE) as areacodes
FROM TABLE1
WHERE AREA_CODE IN ('212', '929', '718', '347', '646') ;
I helpfully added 212, the most famous NYC area code ;)
If you have duplicates, then you need to use a subquery to remove them before aggregating.
Logically, what you want to do is group everything into the same category. You could do this by explicitly grouping all rows by a single value:
select 'NEW_YORK_AREA',
--whatever functions you need to aggregate the data here.
count(var1),
max(var2)
from table1
where area_code in ('929', '718', '347', '646')
group by 1
However, if the only functions that refer to the data in the table are aggregate functions, DB2 lets you omit the group by, and it will automatically group everything into a single row. The following is equivalent to the above query:
select 'NEW_YORK_AREA',
count(var1),
max(var2)
from table1
where area_code in ('929', '718', '347', '646')
What about creating a AREA_CODE_GROUP table
AREA_GROUP,AREA_CODE
'NEW_YORK_AREA','929'
'NEW_YORK_AREA','718'
'NEW_YORK_AREA','347'
'NEW_YORK_AREA','646'
that you can join:
SELECT t.* FROM TABLE1 "t"
INNER JOIN AREA_CODE_GROUP "g"
ON t.AREA_CODE = g.AREA_CODE
WHERE AREA_GROUP = 'NEW_YORK_AREA'

SQL Query to select one field order by other field

I want to select a field in select statement and order by with another field but sql server doesn't allows this as it says order by item must appear in the select statement if select distinct is specified.
This is what I tried :
select DISTINCT format_type
from Labels_Add_Label
where external_group_id= 2826
order by group_sequence
What changes are required to do in this query?
Please provide the changed query
You can rewrite your query this way (equivalent to distinct):
SELECT format_type
FROM Labels_Add_Label
WHERE external_group_id= 2826
GROUP BY format_type;
and you can't use ORDER BY group_sequence here. There may be more than one row with same format_type but different group_sequence. SQL server doesn't know which one should be used for the ordering.
You can however use aggregate functions with a GROUP BY query:
SELECT format_type
FROM Labels_Add_Label
WHERE external_group_id= 2826
GROUP BY format_type;
ORDER BY MIN(group_sequence) ; -- or MAX(group_sequence)
I just took this question to test my knowledge and have come with the following solution (using CTE in MS SQL Server), please correct me if I'm wrong - Using the NORTHWIND Database (Employees Table) on MS SQL Server, I have written this query - This could be one other option that could be used, if there be a need to!
WITH CTE_Employees(FirstName, LastName, BirthDate)
AS
(
SELECT FirstName, LastName, BirthDate
FROM Employees
WHERE Region IS NOT NULL
)
SELECT FirstName FROM CTE_Employees ORDER BY BirthDate DESC
As mentioned above, there can be a same Employee FirstName but with a different LastName, hence SQL Server imposes a condition where we can't use DISTINCT in conjunction with ORDER BY...
Hope this helps!

Group by or Distinct - But several fields

How can I use a Distinct or Group by statement on 1 field with a SELECT of All or at least several ones?
Example: Using SQL SERVER!
SELECT id_product,
description_fr,
DiffMAtrice,
id_mark,
id_type,
NbDiffMatrice,
nom_fr,
nouveaute
From C_Product_Tempo
And I want Distinct or Group By nom_fr
JUST GOT THE ANSWER:
select id_product, description_fr, DiffMAtrice, id_mark, id_type, NbDiffMatrice, nom_fr, nouveaute
from (
SELECT rn = row_number() over (partition by [nom_fr] order by id_mark)
, id_product, description_fr, DiffMAtrice, id_mark, id_type, NbDiffMatrice, nom_fr, nouveaute
From C_Product_Tempo
) d
where rn = 1
And this works prfectly!
If I'm understanding you correctly, you just want the first row per nom_fr. If so, you can simply use a subquery to get the lowest id_product per nom_fr, and just get the corresponding rows;
SELECT * FROM C_Product_Tempo WHERE id_product IN (
SELECT MIN(id_product) FROM C_Product_Tempo GROUP BY nom_fr
);
An SQLfiddle to test with.
You need to decide what to do with the other fields. For example, for numeric fields, do you want a sum? Average? Max? Min? For non-numeric fields to you want the values from a particular record if there are more than one with the same nom_fr?
Some SQL Systems allow you to get a "random" record when you do a GROUP BY, but SQL Server will not - you must define the proper aggregation for columns that are not in the GROUP BY.
GROUP BY is used to group in conjunction with an aggregate function (see http://www.w3schools.com/sql/sql_groupby.asp), so it's no use grouping without counting, summing up etc. DISTINCT eleminates duplicates but how that matches with the other columns you want to extract, I can't imagine, because some rows will be removed from the result.

GROUP BY / aggregate function confusion in SQL

I need a bit of help straightening out something, I know it's a very easy easy question but it's something that is slightly confusing me in SQL.
This SQL query throws a 'not a GROUP BY expression' error in Oracle. I understand why, as I know that once I group by an attribute of a tuple, I can no longer access any other attribute.
SELECT *
FROM order_details
GROUP BY order_no
However this one does work
SELECT SUM(order_price)
FROM order_details
GROUP BY order_no
Just to concrete my understanding on this.... Assuming that there are multiple tuples in order_details for each order that is made, once I group the tuples according to order_no, I can still access the order_price attribute for each individual tuple in the group, but only using an aggregate function?
In other words, aggregate functions when used in the SELECT clause are able to drill down into the group to see the 'hidden' attributes, where simply using 'SELECT order_no' will throw an error?
In standard SQL (but not MySQL), when you use GROUP BY, you must list all the result columns that are not aggregates in the GROUP BY clause. So, if order_details has 6 columns, then you must list all 6 columns (by name - you can't use * in the GROUP BY or ORDER BY clauses) in the GROUP BY clause.
You can also do:
SELECT order_no, SUM(order_price)
FROM order_details
GROUP BY order_no;
That will work because all the non-aggregate columns are listed in the GROUP BY clause.
You could do something like:
SELECT order_no, order_price, MAX(order_item)
FROM order_details
GROUP BY order_no, order_price;
This query isn't really meaningful (or most probably isn't meaningful), but it will 'work'. It will list each separate order number and order price combination, and will give the maximum order item (number) associated with that price. If all the items in an order have distinct prices, you'll end up with groups of one row each. OTOH, if there are several items in the order at the same price (say £0.99 each), then it will group those together and return the maximum order item number at that price. (I'm assuming the table has a primary key on (order_no, order_item) where the first item in the order has order_item = 1, the second item is 2, etc.)
The order in which SQL is written is not the same order it is executed.
Normally, you would write SQL like this:
SELECT
FROM
JOIN
WHERE
GROUP BY
HAVING
ORDER BY
Under the hood, SQL is executed like this:
FROM
JOIN
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
Reason why you need to put all the non-aggregate columns in SELECT to the GROUP BY is the top-down behaviour in programming. You cannot call something you have not declared yet.
Read more: https://sqlbolt.com/lesson/select_queries_order_of_execution
SELECT *
FROM order_details
GROUP BY order_no
In the above query you are selecting all the columns because of that its throwing an error not group by something like..
to avoid that you have to mention all the columns whichever in select statement all columns must be in group by clause..
SELECT *
FROM order_details
GROUP BY order_no,order_details,etc
etc it means all the columns from order_details table.
To use group by clause you have to mention all the columns from select statement in to group by clause but not the column from aggregate function.
TO do this instead of group by you can use partition by clause you can use only one port to group as a partition by.
you can also make it as partition by 1
use Common table expression(CTE) to avoid this issue.
multiple CTes also come handy, pasting a case where I have used...maybe helpful
with ranked_cte1 as
( select r.mov_id,DENSE_RANK() over ( order by r.rev_stars desc )as rankked from ratings r ),
ranked_cte2 as ( select * from movie where mov_id=(select mov_id from ranked_cte1 where rankked=7 ) ) select * from ranked_cte2
select * from movie where mov_id=902

Can I group by something that isn't in the SELECT line?

Given a command in SQL;
SELECT ...
FROM ...
GROUP BY ...
Can I group by something that isn't in the SELECT line?
Yes.
This is often used in the superaggregate queries like this:
SELECT AVG(cnt)
FROM (
SELECT COUNT(*) AS cnt
FROM sales
GROUP BY
product
HAVING COUNT(*) > 10
) q
, which aggregate the aggregates.
Yes of course e.g.
select
count(*)
from
some_table_with_updated_column
group by
trunc(updated, 'MM.YYYY')
Yes you can do it, but if you do that you won't be able to tell which result is for which group.
As a result, you almost always want to return the columns you've grouped by in the select clause. But you don't have to.
Yes, you can. Example:
select count(1)
from sales
group by salesman_id
What you can't do, of course, if having something on your select clause (other than aggregate functions) that are not part of the group by clause.
Hmm, I think the question should have been in the other way round like,
Can I SELECT something that is not there in the GROUP BY?
It's alright to write a code like:
SELECT customerId, count(orderId) FROM orders
GROUP BY customerId, orderedOn
If you want to find out the number of orders done by a customer datewise.
But you cannot do it the other way round:
SELECT customerId, orderedOn count(orderId) FROM orders
GROUP BY customerId
You can issue an aggregate function on the column that is not there in the group by. But you cannot give it in the select line without the aggregate function. As it will not make much sense. Like for the above query. You group by just customerId for order counts and you want the date also to be printed in the output??!! You don't involve the date factor in the group for counting then will it mean something to have a date in it?
I don't know about other DBMS' but DB2/z, for one, does this just fine. It's not required to have the column in the select portion but, of course, it does have to extract the data from the table in order to aggregate so you're probably not saving any time by leaving it off. You should only select the columns that you need, aggregation of the data is a separate task from that.
I'm pretty certain the SQL standard allows this (although that's only based on the knowledge that the mainframe DB2 product follows it pretty closely).