Aggregate functions on a single column only? - sql

My learning goal is to find out, for any given ingredient, which recipe uses it the most.
E.g.
+------------+--------------+--------+
| Pizza      | Ingredient   | Amount |
+------------+--------------+--------+
| Anchovy    | Anchovy      | 200    |
| Meatlovers | Pepperoni    | 150    |
| X pizza    | X ingredient | 50     |
+------------+--------------+--------+
Through:
(a) SELECT INGREDIENT,MAX(AMOUNT) FROM RECIPE GROUP BY INGREDIENT;
Works wonderfully; however, I wish to know the pizza name of the recipe as well.
(b) SELECT NAME,INGREDIENT,MAX(AMOUNT) FROM RECIPE GROUP BY INGREDIENT,NAME;
Doesn't work as expected - I want the name to be appended to the result set of (a). Instead, what I get is every pizza/ingredient combination with its max amount. I'm assuming the MAX function is effectively being applied per pizza as well, which I do not want. Is there a way to apply an aggregate function to only the desired columns and leave one column untouched (for viewing purposes only)?

PostgreSQL supports window functions, so the easy way is this:
SELECT Pizza,
       Ingredient,
       MAX(Amount) OVER (PARTITION BY Ingredient) AS MaxAmount
FROM Recipe
Reading the question again, following Damien's comment, I think that what you are asking will not get you the results you want.
In the beginning of the question, you wrote:
My learning goal is to find out, for any given ingredient, which recipe uses it the most.
Later you wrote:
I want the name to be appended to the result set of (a)
These statements conflict.
To know which pizza is using the most of a specific ingredient, as you stated in your first statement, use the (b) query from your question. You can order its results by ingredient, then by the MAX(AMOUNT) column in descending order - this lets you easily see which pizza uses the most of each ingredient.
SELECT Name, Ingredient, MAX(Amount) AS MaxAmount
FROM Recipe
GROUP BY Ingredient,Name
ORDER BY Ingredient, MaxAmount DESC;
The query in my answer, however, will get you what you are asking in your second statement - the maximum value for each ingredient, grouped only by ingredient, but with the pizza name added to the result set. (In other words - the pizza name appended to the result set of (a).)

A standard modern approach to this would be to use a window function to assign row numbers:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Ingredient ORDER BY Amount DESC) AS rn
    FROM Recipe
) r
WHERE r.rn = 1
This will arbitrarily select one row as the top row if there are multiple rows with the same highest Amount for a particular ingredient. To take more control over which row wins, add another ORDER BY expression within the OVER clause to break ties. Alternatively, if you wish to see all tying rows, use RANK() instead of ROW_NUMBER().
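For example, a minimal sketch of the RANK() variant (same Recipe columns as above); rows that tie for the highest Amount within an ingredient all get rank 1, so every tying row is returned:
SELECT *
FROM (
    SELECT *,
           RANK() OVER (PARTITION BY Ingredient ORDER BY Amount DESC) AS rnk
    FROM Recipe
) r
WHERE r.rnk = 1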

Use a correlated sub-query:
SELECT r.*
FROM RECIPE AS r
WHERE r.AMOUNT = (
    SELECT MAX(AMOUNT)
    FROM RECIPE r1
    WHERE r1.INGREDIENT = r.INGREDIENT
    GROUP BY r1.INGREDIENT
)

Select top using SQL Server returns different output than select *

I am trying to select the top n rows from a database based on an alphabetic and numeric format. The output must be ordered with alphabetic values first and numeric values after that.
When I try to get all data (select *), I get the correct output:
select nocust, share
from TB_STOCK
where share = 'BBCA'
and concat(share, nocust) < 'ZZZZZZZZ'
order by
case when nocust like '[a-z]%' then 0 else 1 end
nocust | share
-------+-------
a522   | BBCA
b454   | BBCA
k007   | BBCA
p430   | BBCA
q797   | BBCA
s441   | BBCA
s892   | BBCA
u648   | BBCA
v107   | BBCA
4211   | BBCA
6469   | BBCA
6751   | BBCA
But when I try to select top n (e.g. top 5), I get different output than expected (not like select * from the table):
select top 5 nocust, share
from TB_STOCK
where share = 'BBCA'
and concat(share, nocust) < 'ZZZZZZZZ'
order by
case when nocust like '[a-z]%' then 0 else 1 end
nocust | share
-------+-------
k007   | BBCA
b454   | BBCA
a522   | BBCA
p430   | BBCA
q797   | BBCA
I suspect the mistake is somewhere between the concat and the order by. Can someone tell me how to get the right top 5 output, like:
nocust | share
-------+-------
a522   | BBCA
b454   | BBCA
k007   | BBCA
p430   | BBCA
q797   | BBCA
You have a very strange ORDER BY - it only makes sure entries with a letter at the beginning are ordered before those with a number at the beginning - but you're NOT actually ordering by the values themselves. No specific ORDER BY means there's no guarantee as to how the rows will be ordered - as you're seeing here.
You need to adapt your ORDER BY to:
ORDER BY
    CASE WHEN nocust LIKE '[a-z]%' THEN 0 ELSE 1 END,
    nocust
NOW you're actually ordering by nocust - and now, I'm pretty sure, the outputs will be identical.
Your ORDER BY is not a deterministic sort; it sorts data broadly into one of two categories but doesn't specify in enough detail how items are then to be sorted within each category. This means that in the TOP 5 form, SQL Server is free to choose a data access strategy that lets it stop as soon as it has found 5 rows for which the CASE WHEN returns 0.
Suppose you have this output from SELECT * ... ORDER BY Category
Category, Thing
Animal, Cat
Animal, Dog
Animal, Goat
Vegetable, Potato
Vegetable, Turnip
Vegetable, Swede
There is absolutely no guarantee that if you do a SELECT TOP 2 * ... ORDER BY Category you will get "Cat, Dog" in that order. You could reasonably get "Goat, Dog" today and "Cat, Goat" tomorrow, after SQL Server has shuffled its indexes around when new data was added. The only thing you can guarantee with a TOP 2 ... ORDER BY Category is that, so long as there are at least two animals in the database and there is no new category alphabetically earlier than "Animal", you'll get two animals.
It is this way because an optimization of TOP N means that SQL Server can stop early once it has N rows that meet the criteria; it doesn't need to access and sort a million rows if it has already found 5 rows whose category would sort first. Suppose it knows the distinct values in the column, and their counts, as part of its statistics: it can sort those distinct values to know which ones come first, then go and find any 5 rows that have a value that sorts first, and return them. Essentially, SQL Server may think "I know I have 3 'animal' rows, animals come before everything else, and the user wants 2. I'll just start reading rows and stop after I get 2 animals" rather than "I'll read every Thing, sort all million of them on Category, then take the first 2 rows".
This could be hugely faster than sorting a million rows then plucking the first X
To get repeatable results each time, you have to make the sort deterministic by specifying sort conditions that guarantee the Thing within each Category is sorted all the way down to where there is no ambiguity.
Add more columns to your ORDER BY so that every row has a guaranteed place in the overall ordering; then your sort will be deterministic and TOP N will return the same rows each time. To make a sort deterministic, the collection of columns you sort by has to have a unique combination of values for every row. You could sort by 20 columns, but if there are any rows where all 20 of those columns have identical values (and differentiation only occurs in a 21st column, which you don't order by), then the sort order isn't guaranteed.
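Applied to the query from the question, a minimal sketch of a fully deterministic ordering (assuming nocust plus share uniquely identifies a row, which the sample data suggests but is not stated):
select top 5 nocust, share
from TB_STOCK
where share = 'BBCA'
and concat(share, nocust) < 'ZZZZZZZZ'
order by
    case when nocust like '[a-z]%' then 0 else 1 end,
    nocust,
    share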
I am trying to answer this from a different perspective.
First, it should be clear that the optimizer tries to produce the best possible plan quickly.
The optimizer selects, or does not select, an index in the most cost-effective manner.
I am using the AdventureWorks 2016 database, and Production.Product has 504 rows.
select [Name],ProductNumber from Production.Product
order by [Name]
It sorts the rows as expected.
select top 5 [Name],ProductNumber from Production.Product
order by [Name]
It sorts the rows as expected.
If I use a CASE expression in the ORDER BY:
select [Name],ProductNumber from Production.Product
order by case when [name] like '[a]%' then 1 else -1 end
It sorts the records as intended. All 504 rows are processed.
If I use less than or equal to 20% of the total rows in TOP, like:
select Top 5 [Name],ProductNumber from Production.Product
order by case when [name] like '[a]%' then 1 else -1 end
Then it picks the first n records it finds and displays them quickly.
The sorting was not as expected.
If I use more than 20% of the total rows in TOP, like:
select Top (101) [Name],ProductNumber from Production.Product
order by case when [name] like '[a]%' then 1 else -1 end
It will process all 504 rows and sort accordingly.
Sorting result is as expected.
In all the above cases, a Clustered Index Scan (on ProductID) is done.
In this example, [Name] and ProductNumber each have a separate non-clustered index.
But neither index was selected.
You can do this:
;With CTE as (
    select nocust, share,
           case when nocust like '[a-z]%' then 0 else 1 end SortCol
    from TB_STOCK
    where share = 'BBCA'
    and concat(share, nocust) < 'ZZZZZZZZ'
)
select top 5 * from CTE
order by SortCol
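Note that, as the other answers point out, ordering by SortCol alone is still not deterministic; to guarantee the five expected rows, add nocust (or another unique column) to the ORDER BY, for example:
select top 5 * from CTE
order by SortCol, nocust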

Query to order data while maintaining grouping?

I have a request which I can accomplish in code, but am wondering if it is at all possible to do in SQL alone. I have a products table that has a Category column and a Price column. What I want to achieve is all of the products grouped together by Category, and then ordered from cheapest to most expensive both within each category and across the categories themselves. So for example:
Category    | Price
------------|------
Basin       | 500
Basin       | 700
Basin       | 750
Accessories | 550
Accessories | 700
Accessories | 1000
Bath        | 700
As you can see, the cheapest item is a basin for 500, then an accessory for 550, then a bath for 700. So I need the categories of products to be sorted by their cheapest item, and then each category itself in turn to be sorted from cheapest to most expensive.
I have tried partitioning and grouping sets (which I know nothing about) but still no luck, so I eventually resorted to my strength (C#), but I would prefer to do it straight in SQL if possible. One last side note: this query is hit quite often, so performance is key; if possible I would like to avoid temp tables / cursors etc.
I think using MIN() with a window (OVER) makes it clearest what the intent is:
declare @t table (Category varchar(19) not null, Price int not null)
insert into @t (Category, Price) values
('Basin', 500),
('Basin', 700),
('Basin', 750),
('Accessories', 550),
('Accessories', 700),
('Accessories', 1000),
('Bath', 700)
;With FindLowest as (
    select *,
           MIN(Price) OVER (PARTITION BY Category) as Lowest
    from @t
)
select * from FindLowest
order by Lowest, Category, Price
If two categories share the same lowest price, this will still keep the two categories separate and sort them alphabetically.
Select...
Order by category, price desc
SELECT p.category,p.price
FROM products p,(select category,min(price) mn from products group by category order by mn) tab1
WHERE p.category=tab1.category
GROUP BY p.category,p.price,tab1.mn
order by tab1.mn,p.category;
Is this what you want?
I think you do not need a GROUP BY clause in your query. If I understood your goal correctly, you can try substituting the actual categories in your ORDER BY clause with the minimum price per category, computed in a subquery. That will give you all the categories sequentially, i.e. not Basin - 500; Accessories - 550, but everything for Basin first. After that, you can order by the ordinary Price inside each category.
SELECT *
FROM products p
ORDER BY
    (SELECT MIN(Price) FROM products p2 WHERE p2.Category = p.Category),
    Price;

SQL count returning wrong values

I was asked the following question in regard to an Olympics database.
"For each competitor who was in more than one event, list their given-name, family-name, their number of events, and their best and worst places (and no other data).
List this information in descending order of their number of events (i.e. most events first),
and (when competitors have the same number of events) then in ascending order of best place,
and (when the same best place) then in ascending order of worst place,
and (when the same worst place) then in ascending order of their family-name."
I have been trying to write the query to do this, however it does not return the correct count. It stops at a count of 2 and does not return the correct min or max values.
The Tables are:
Competitors Table
Competitornum | Givenname | Familyname | Gender | Dateofbirth | CountryCode |
Results Table
Eventid | Competitornum | Place | Lane | Elapsedtime | Note |
Query
SELECT c.Givenname, Familyname, COUNT(r.Competitornum), MIN(r.Place), MAX(r.Place)
FROM Competitors c
JOIN Results r
ON c.Competitornum=r.Competitornum
Group by c.Givenname, familyname, place
Having Count(r.Competitornum) > 1
This is the query I have come up with so far and am not sure what I have wrong.
You don't want to include place in your GROUP BY clause.
And you will need an ORDER BY clause to meet all the requirements.
SELECT c.GivenName, c.FamilyName, COUNT(r.CompetitorNum) events,
MIN(r.Place) best_place, MAX(r.Place) worst_place
FROM Competitors c
JOIN Results r ON c.CompetitorNum = r.CompetitorNum
GROUP BY c.GivenName, c.FamilyName
HAVING COUNT(r.Competitornum) > 1
ORDER BY events DESC, best_place ASC, worst_place ASC, c.FamilyName ASC;
Try this, remove place from the GROUP BY
Group by c.Givenname, familyname
Remove the place field from the GROUP BY clause and then try to run your query.

JavaDB: get ordered records in the subquery

I have the following table, "COMPANIES_BY_NEWS_REPUTATION", in my JavaDB database (this is some random data just to represent the structure):
COMPANY   | NEWS_HASH | REPUTATION | DATE
----------+-----------+------------+----------------------
Company A | 14676757  | 0.12345    | 2011-05-19 15:43:28.0
Company B | 454564556 | 0.78956    | 2011-05-24 18:44:28.0
Company C | 454564556 | 0.78956    | 2011-05-24 18:44:28.0
Company A | -7874564  | 0.12345    | 2011-05-19 15:43:28.0
One news_hash may relate to several companies while a company can relate to several news_hashes as well. Reputation and date are bound to the news_hash.
What I need to do is calculate the average reputation of the last 5 news items for every company. In order to do that, I somehow feel that I need to use 'order by' and 'offset' in a subquery, as shown in the code below.
select COMPANY, avg(REPUTATION) from
(select * from COMPANY_BY_NEWS_REPUTATION order by "DATE" desc
offset 0 rows fetch next 5 row only) as TR group by COMPANY;
However, JavaDB allows neither ORDER BY, nor OFFSET in a subquery. Could anyone suggest a working solution for my problem please?
Which version of JavaDB are you using? According to the chapter TableSubquery in the JavaDB documentation, table subqueries do support order by and fetch next, at least in version 10.6.2.1.
Given that subqueries can be ordered and the size of the result set can be limited, the following (untested) query might do what you want:
select COMPANY, (select avg(REPUTATION)
                 from (select REPUTATION
                       from COMPANY_BY_NEWS_REPUTATION
                       where COMPANY = TR.COMPANY
                       order by DATE desc
                       fetch first 5 rows only))
from (select distinct COMPANY
      from COMPANY_BY_NEWS_REPUTATION) as TR
This query retrieves all distinct company names from COMPANY_BY_NEWS_REPUTATION, then retrieves the average of the last five reputation rows for each company. I have no idea whether it will perform sufficiently, that will likely depend on the size of your data set and what indexes you have in place.
If you have a list of unique company names in another table, you can use that instead of the select distinct ... subquery to retrieve the companies for which to calculate averages.
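For instance, a sketch (untested, mirroring the query above) assuming a hypothetical COMPANIES table with a NAME column holding the unique company names:
select NAME, (select avg(REPUTATION)
              from (select REPUTATION
                    from COMPANY_BY_NEWS_REPUTATION
                    where COMPANY = C.NAME
                    order by DATE desc
                    fetch first 5 rows only))
from COMPANIES as C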

Is there any difference between GROUP BY and DISTINCT

I learned something simple about SQL the other day:
SELECT c FROM myTbl GROUP BY C
Has the same result as:
SELECT DISTINCT C FROM myTbl
What I am curious of, is there anything different in the way an SQL engine processes the command, or are they truly the same thing?
I personally prefer the distinct syntax, but I am sure it's more out of habit than anything else.
EDIT: This is not a question about aggregates. The use of GROUP BY with aggregate functions is understood.
MusiGenesis' response is functionally the correct one with regard to your question as stated; SQL Server is smart enough to realize that if you are using "Group By" and not using any aggregate functions, then what you actually mean is "Distinct" - and therefore it generates an execution plan as if you'd simply used "Distinct."
However, I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotchas down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregates and one of which is not.
A hammer can work to drive in a screw sometimes, but if you've got a screwdriver handy, why bother?
(for the purposes of this analogy, Hammer : Screwdriver :: GroupBy : Distinct and screw => get list of unique values in a table column)
GROUP BY lets you use aggregate functions, like AVG, MAX, MIN, SUM, and COUNT.
On the other hand DISTINCT just removes duplicates.
For example, if you have a bunch of purchase records, and you want to know how much was spent by each department, you might do something like:
SELECT department, SUM(amount) FROM purchases GROUP BY department
This will give you one row per department, containing the department name and the sum of all of the amount values in all rows for that department.
What's the difference from a mere duplicate-removal point of view?
Apart from the fact that unlike DISTINCT, GROUP BY allows for aggregating data per group (which has been mentioned by many other answers), the most important difference in my opinion is the fact that the two operations "happen" at two very different steps in the logical order of operations that are executed in a SELECT statement.
Here are the most important operations:
FROM (including JOIN, APPLY, etc.)
WHERE
GROUP BY (can remove duplicates)
Aggregations
HAVING
Window functions
SELECT
DISTINCT (can remove duplicates)
UNION, INTERSECT, EXCEPT (can remove duplicates)
ORDER BY
OFFSET
LIMIT
As you can see, the logical order of each operation influences what can be done with it and how it influences subsequent operations. In particular, the fact that the GROUP BY operation "happens before" the SELECT operation (the projection) means that:
It doesn't depend on the projection (which can be an advantage)
It cannot use any values from the projection (which can be a disadvantage)
1. It doesn't depend on the projection
An example where not depending on the projection is useful is if you want to calculate window functions on distinct values:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
GROUP BY rating
When run against the Sakila database, this yields:
rating rn
-----------
G 1
NC-17 2
PG 3
PG-13 4
R 5
The same couldn't be achieved with DISTINCT easily:
SELECT DISTINCT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
That query is "wrong" and yields something like:
rating rn
------------
G 1
G 2
G 3
...
G 178
NC-17 179
NC-17 180
...
This is not what we wanted. The DISTINCT operation "happens after" the projection, so we can no longer remove DISTINCT ratings because the window function was already calculated and projected. In order to use DISTINCT, we'd have to nest that part of the query:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM (
SELECT DISTINCT rating FROM film
) f
Side-note: In this particular case, we could also use DENSE_RANK()
SELECT DISTINCT rating, dense_rank() OVER (ORDER BY rating) AS rn
FROM film
2. It cannot use any values from the projection
One of SQL's drawbacks is its verbosity at times. For the same reason as what we've seen before (namely the logical order of operations), we cannot "easily" group by something we're projecting.
This is invalid SQL:
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY name
This is valid (repeating the expression)
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY first_name || ' ' || last_name
This is valid, too (nesting the expression)
SELECT name
FROM (
SELECT first_name || ' ' || last_name AS name
FROM customer
) c
GROUP BY name
I've written about this topic more in depth in a blog post
There is no difference (in SQL Server, at least). Both queries use the same execution plan.
http://sqlmag.com/database-performance-tuning/distinct-vs-group
Maybe there is a difference, if there are sub-queries involved:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
There is no difference (Oracle-style):
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:32961403234212
Use DISTINCT if you just want to remove duplicates. Use GROUP BY if you want to apply aggregate operators (MAX, SUM, GROUP_CONCAT, ..., or a HAVING clause).
I expect there is the possibility for subtle differences in their execution.
I checked the execution plans for two functionally equivalent queries along these lines in Oracle 10g:
core> select sta from zip group by sta;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH GROUP BY | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
core> select distinct sta from zip;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH UNIQUE | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
The middle operation is slightly different: "HASH GROUP BY" vs. "HASH UNIQUE", but the estimated costs etc. are identical. I then executed these with tracing on and the actual operation counts were the same for both (except that the second one didn't have to do any physical reads due to caching).
But I think that because the operation names are different, the execution would follow somewhat different code paths and that opens the possibility of more significant differences.
I think you should prefer the DISTINCT syntax for this purpose. It's not just habit, it more clearly indicates the purpose of the query.
For the query you posted, they are identical. But for other queries that may not be true.
For example, it's not the same as:
SELECT C FROM myTbl GROUP BY C, D
I read all the above comments but didn't see anyone point out the main difference between GROUP BY and DISTINCT apart from the aggregation bit.
DISTINCT returns all the rows and then de-duplicates them, whereas GROUP BY de-duplicates the rows as they are read by the algorithm one by one.
This means they can produce different results!
For example, the below codes generate different results:
SELECT distinct ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
SELECT ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
GROUP BY Name
If there are 10 names in the table where 1 of which is a duplicate of another then the first query returns 10 rows whereas the second query returns 9 rows.
The reason is what I said above so they can behave differently!
If you use DISTINCT with multiple columns, the result set won't be grouped as it will with GROUP BY, and you can't apply aggregate functions per group with DISTINCT.
GROUP BY has a very specific meaning that is distinct (heh) from the DISTINCT function.
GROUP BY causes the query results to be grouped using the chosen expression, aggregate functions can then be applied, and these will act on each group, rather than the entire resultset.
Here's an example that might help:
Given a table that looks like this:
name
------
barry
dave
bill
dave
dave
barry
john
This query:
SELECT name, count(*) AS count FROM table GROUP BY name;
Will produce output like this:
name count
-------------
barry 2
dave 3
bill 1
john 1
Which is obviously very different from using DISTINCT. If you want to group your results, use GROUP BY, if you just want a unique list of a specific column, use DISTINCT. This will give your database a chance to optimise the query for your needs.
If you are using a GROUP BY without any aggregate function, then internally it will be treated as DISTINCT, so in this case there is no difference between GROUP BY and DISTINCT.
But when the DISTINCT clause is available, it is better to use it for finding unique records, because the objective of GROUP BY is aggregation.
They have different semantics, even if they happen to have equivalent results on your particular data.
Please don't use GROUP BY when you mean DISTINCT, even if they happen to work the same. I'm assuming you're trying to shave off milliseconds from queries, and I have to point out that developer time is orders of magnitude more expensive than computer time.
From a Teradata perspective:
From a result set point of view, it does not matter if you use DISTINCT or GROUP BY in Teradata. The answer set will be the same.
From a performance point of view, it is not the same.
To understand what impacts performance, you need to know what happens on Teradata when executing a statement with DISTINCT or GROUP BY.
In the case of DISTINCT, the rows are redistributed immediately without any preaggregation taking place, while in the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.
Don't think, though, that GROUP BY is always better from a performance point of view. When you have many different values, the preaggregation step of GROUP BY is not very efficient: Teradata has to sort the data to remove duplicates. In this case, it may be better to do the redistribution first, i.e. use DISTINCT. Only if there are many duplicate values is GROUP BY probably the better choice, since most duplicates are already removed in the preaggregation step and less data has to be redistributed.
In short, DISTINCT vs. GROUP BY in Teradata means:
GROUP BY -> for many duplicates
DISTINCT -> no duplicates or only a few duplicates.
At times, when using DISTINCT, you run out of spool space on an AMP. The reason is that redistribution takes place immediately, and skewing could cause AMPs to run out of space.
If this happens, you have probably a better chance with GROUP BY, as duplicates are already removed in a first step, and less data is moved across the AMPs.
group by is used in aggregate operations -- like when you want to get a count of Bs broken down by column C
select C, count(B) from myTbl group by C
distinct is what it sounds like -- you get unique rows.
In SQL Server 2005, it looks like the query optimizer is able to optimize away the difference in the simplistic examples I ran. Dunno if you can count on that in all situations, though.
In that particular query there is no difference. But, of course, if you add any aggregate columns then you'll have to use group by.
You're only noticing that because you are selecting a single column.
Try selecting two fields and see what happens.
Group By is intended to be used like this:
SELECT name, SUM(transaction) FROM myTbl GROUP BY name
Which would show the sum of all transactions for each person.
From a 'SQL the language' perspective the two constructs are equivalent and which one you choose is one of those 'lifestyle' choices we all have to make. I think there is a good case for DISTINCT being more explicit (and therefore is more considerate to the person who will inherit your code etc) but that doesn't mean the GROUP BY construct is an invalid choice.
I think this 'GROUP BY is for aggregates' is the wrong emphasis. Folk should be aware that the set functions (MAX, MIN, COUNT, etc.) can be omitted, so that they can understand the coder's intent when they are.
The ideal optimizer will recognize equivalent SQL constructs and will always pick the ideal plan accordingly. For your real life SQL engine of choice, you must test :)
PS note the position of the DISTINCT keyword in the select clause may produce different results e.g. contrast:
SELECT COUNT(DISTINCT C) FROM myTbl;
SELECT DISTINCT COUNT(C) FROM myTbl;
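A minimal illustration of the difference, assuming a hypothetical column C containing the values 1, 1, 2: COUNT(DISTINCT C) counts the unique values, whereas DISTINCT COUNT(C) first counts all non-null values (a single row) and only then removes duplicate rows from that one-row result.
SELECT COUNT(DISTINCT C) FROM myTbl;  -- 2: two unique values (1 and 2)
SELECT DISTINCT COUNT(C) FROM myTbl;  -- 3: counts all non-null values; DISTINCT has nothing left to deduplicate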
I know it's an old post, but it happens that I had a query that used GROUP BY just to return distinct values. When using that query in Toad and Oracle Reports, everything worked fine, I mean a good response time. When we migrated from Oracle 9i to 11g, the response time in Toad was still excellent, but in the report it took about 35 minutes to finish, whereas with the previous version it took about 5 minutes.
The solution was to change the GROUP BY and use DISTINCT, and now the report runs in about 30 seconds.
I hope this is useful for someone in the same situation.
Sometimes they may give you the same results, but they are meant to be used in different senses/cases. The main difference is in syntax.
Look closely at the example below. DISTINCT is used to filter out duplicate sets of values: (6, cs, 9.1) and (1, cs, 5.5) are two different sets, so DISTINCT is going to display both rows, while GROUP BY Branch is going to display only one row per branch.
SELECT * FROM student;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 2 | mech | 6.3 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 1 | cs | 5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT DISTINCT * FROM student;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 2 | mech | 6.3 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 1 | cs | 5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT * FROM student GROUP BY Branch;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 2 | mech | 6.3 |
+------+--------+------+
4 rows in set (0.001 sec)
Sometimes the results that can be achieved with a GROUP BY clause are not possible to achieve with DISTINCT without using some extra clause or condition, e.g. in the above case.
To get the same result as DISTINCT, you have to pass all the column names in the GROUP BY clause, like below. Note the syntactical difference: you must know all the column names to use the GROUP BY clause in that case.
SELECT * FROM student GROUP BY Id, Branch, CGPA;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 1 | cs | 5.5 |
| 2 | mech | 6.3 |
| 3 | civil | 7.2 |
| 4 | eee | 8.2 |
| 6 | cs | 9.1 |
+------+--------+------+
Also, I have noticed that GROUP BY displays the results in ascending order by default, which DISTINCT does not. But I am not sure about this; it may differ by vendor.
Source : https://dbjpanda.me/dbms/languages/sql/sql-syntax-with-examples#group-by
In terms of usage, GROUP BY is used for grouping those rows you want to calculate. DISTINCT will not do any calculation. It will show no duplicate rows.
I always use DISTINCT if I want to present data without duplicates.
If I want to do calculations, like summing up the total quantity of mangoes, I will use GROUP BY.
In Hive (HQL), GROUP BY can be way faster than DISTINCT, because the former does not require comparing all fields in the table.
See: https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct.
The way I always understood it is that using distinct is the same as grouping by every field you selected in the order you selected them.
i.e:
select distinct a, b, c from table;
is the same as:
select a, b, c from table group by a, b, c
Functional efficiency is totally different.
If you would like to select only the returned values without duplicates, DISTINCT is better than GROUP BY, because GROUP BY involves (sorting + removing) while DISTINCT involves only (removing).
Generally we can use DISTINCT to eliminate duplicates in a specific column of the table.
In the case of GROUP BY, we can apply aggregation functions like AVG, MAX, MIN, SUM, and COUNT to a specific column, and fetch both the column name and the aggregate result for that column.
Example:
select specialColumn, sum(specialColumn) from yourTableName group by specialColumn;
There is no significant difference between the GROUP BY and DISTINCT clauses except the usage of aggregate functions.
Both can be used to distinguish values, but from a performance point of view GROUP BY may be better.
When the DISTINCT keyword is used, internally a sort operation is used, which can be viewed in the execution plan.
Try a simple example:
Declare @tmpresult table
(
    Id tinyint
)
Insert into @tmpresult
Select 5
Union all
Select 2
Union all
Select 3
Union all
Select 4
Select distinct Id
From @tmpresult
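For comparison, the GROUP BY form of the same query returns the same four values here, so you can put the two statements side by side and compare their execution plans:
Select Id
From @tmpresult
Group by Id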