SQL: aggregate and condense - sql

I could do with something I can’t figure out how to do. I have a simple query that I need to aggregate and condense. Here is the base script:
select t.serial, t.make, t.model, t.upgrade_date, t.software_version
FROM table t
Order by t.serial
These are electrical devices that have been software upgraded (up to 3 times). The script above, of course, will provided multiple rows for the same device but with varying software versions while the make and model remain constant. I understand how to aggregate this to a single row using max to, for example, show the latest software version. However, what I need is to aggregate and have a condensed column showing EVERY software_version (in order) the device has been at.
It could look like this:
+------------------------------------------+
| Serial | Make | Model | Version_history |
+------------------------------------------+
| A Terra 1 1.1, 1.9, 2.1 |
+------------------------------------------+
Alternatively, and possibly better, it would be fine to show each software_version as a column header with ‘upgrade_date’ if the device was at that version. Actually, it would be useful to know both methods.
This could look like this:
+------------------------------------------------------------------------+
| Serial Make Model 1.1 1.9 2.1 2.3 |
+------------------------------------------------------------------------+
| A Terra 1 01-Jan-2019 17-Jun-2019 31-Dec-2019 |
+------------------------------------------------------------------------+
No doubt this is really simple but I can’t for the life of me figure it out.

Use LISTAGG:
select serial,
make,
model,
LISTAGG( software_version, ',' ) WITHIN GROUP ( ORDER BY upgrade_date )
AS Version_history,
LISTAGG( upgrade_date, ',' ) WITHIN GROUP ( ORDER BY upgrade_date )
AS Version_dates
FROM table
GROUP BY serial, make, model
Order by serial
Alternatively, and possibly better, it would be fine to show each software_version as a column header with ‘upgrade_date’ if the device was at that version.
If you know the versions then you can do this with a PIVOT:
SELECT serial,
make,
model,
"1.1",
"1.9",
"2.1",
"2.3"
FROM table
PIVOT ( MAX( upgrade_date ) FOR software_version IN (
'1.1' AS "1.1",
'1.9' AS "1.9",
'2.1' AS "2.1",
'2.3' AS "2.3"
) )
db<>fiddle
If you don't know the versions then this would be a dynamic pivot and is not a simple SQL query. You would need to use a two stage process to first generate a dynamic SQL query and then second use EXECUTE IMMEDIATE to run the query. In general, you shouldn't use dynamic pivots and, instead, transpose rows to columns in whatever programming interface you're using to access the database (Java, Python, C#, etc.).

Related

Incremental integer ID in Impala

I am using Impala for querying parquet-tables and cannot find a solution to increment an integer-column ranging from 1..n. The column is supposed to be used as ID-reference. Currently I am aware of the uuid() function, which
Returns a universal unique identifier, a 128-bit value encoded as a string with groups of hexadecimal digits separated by dashes.
Anyhow, this is not suitable for me since I have to pass the ID to another system which requests an ID in style of 1..n. I also already know that Impala has no auto-increment-implementation.
The desired result should look like:
-- UUID() provided as example - I want to achieve the `my_id`-column.
| my_id | example_uuid | some_content |
|-------|--------------|--------------|
| 1 | 50d53ca4-b...| "a" |
| 2 | 6ba8dd54-1...| "b" |
| 3 | 515362df-f...| "c" |
| 4 | a52db5e9-e...| "d" |
|-------|--------------|--------------|
How can I achieve the desired result (integer-ID ranging from 1..n)?
Note: This question differs from this one which specifically handles Kudu-tables. However, answers should be applicable for this question as well.
Since other Q&A's like this one only came up with uuid()-alike answers, I put some thought in it and finally came up with this solution:
SELECT
row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as my_id
, some_content
FROM some_table
row_number() generates a continuous integer-number over a provided partition. Unlike rank(), row_number() always provides an incremented number on its partition (even if duplicates occur)
PARTITION BY "dummy" partitions the entire table into one partition. This works since "dummy" is interpreted in the execution graph as temporary column yielding only the String-value "dummy". Thus, also something analog to "dummy" works.
ORDER BY is required in order to generate the increment. Since we don't care about the order in this example (otherwise just set your respective column), also use the "dummy"-workaround.
The command creates the desired incremental ID without any nested SQL-statements or other tricks.
| my_id | some_content |
|-------|--------------|
| 1 | "a" |
| 2 | "b" |
| 3 | "c" |
| 4 | "d" |
|-------|--------------|
I used Markus's answer for a large partitioned table and found that I was getting duplicate ids. I think the ids were only unique within their partition; possibly PARTITION BY "dummy" leads Impala to think that each partition can execute row_number() on its own. I was able to get it working by specifying an actual column to order by and no partition by:
SELECT
row_number() OVER (ORDER BY actual_column) as my_id
, some_content
FROM some_table
It doesn't seem to matter whether the values in the column are unique (mine weren't), but using the actual partition key might result in the same issue as the "dummy" column.
Understandably, it took a lot longer to run than the dummy version.

SQL: Using row values as column headers

I'm building a fairly large table for a client, and they want the data represented in a certain way. My table is currently like this:
DATE_RECEIVED | NAME | DOB | ANALYTE | RESULT
'YYYY/MM' |STRING |'MM/DD/YYYY' | String | String
2011/03 |Name, A| 07/31/1056 | AAAA | Positive
2011/03 |Name, A| 07/31/1056 | BBBB | Negative
What I need to do is something like a pivot - each "Analyte" is to be its own column, with the "result" being the value in the column, like this:
DATE_RECEIVED | NAME | DOB | AAAA | BBBB |
2011/03 |Name, A| 07/31/1056 | Positive | Negative |
I've tried a few things with PIVOT, but I think that I'm still too novice to understand how the logic for that function works. Either that, or the version of SQL I'm using doesn't support pivoting. Looking through similar questions on this site didn't really get me any closer to solving the problem because I don't really feel like I understand my problem well enough to know what I need to do to fix it. Anyway, I'm completely stumped. If anyone could give me a place to start, that would be extremely helpful. Thanks!
...Also, I know I'm using Oracle SQL, but I don't know what version. If it helps, I'm writing everything in TOAD for Oracle version 12.6.
If you can't use PIVOT (maybe your version of Oracle doesn't support it), then it is possible to use CASE instead together with an aggregate:
SELECT date_received, name, dob
, MAX(CASE WHEN analyte = 'AAAA' THEN result END) AS aaaa
, MAX(CASE WHEN analyte = 'BBBB' THEN result END) AS bbbb
FROM mytable
GROUP BY date_received, name, dob
UPDATE: Here is how you might accomplish the same thing with PIVOT:
SELECT * FROM (
SELECT date_received, name, dob, analyte, result
FROM mytable
) PIVOT ( MAX(result) FOR (analyte) IN ('AAAA' AS a,'BBBB' AS b) );
SQL Fiddle Demo here.
Note that with a PIVOT you still need an aggregate function - MAX() will do here, and so will MIN if there is only one value! AS in the PIVOT clause names the resulting columns.
The other thing with PIVOT is that the possible values have to be explicitly named. You can get around that by using XML (read more about how to do that), but then all your results will be XML-ized.

Is there a way to apply a moving limit in SQL>

I have a large database I use for plotting and data examination. For simplicity, say it looks something like this:
| id | day | obs |
+----------+-----------+-----------+
| 1 | 500 | 4.5 |
| 2 | 500 | 4.4 |
| 3 | 500 | 4.7 |
| 4 | 500 | 4.8 |
| 5 | 600 | 5.1 |
| 6 | 600 | 5.2 |
...
This could be stock market data, where we have many points per day that are measured.
What I want to do is look at much longer trends, where the multiple points per day are unnecessarily resolved, and clog my plotting application. (I want to look at 30000 days, each has about 100 observations).
Is there a way to do something like SELECT ... LIMIT 1 PER "day"
I suppose I could perform a few SELECT DISTINCT queries to find correct ID's, but I'd rather do something simple if it is built in.
It doesn't matter if its the first, last, or an average value per day. Just a single value. I just prefer what is fastest.
Also, this I'd like to do this for Postgres, MySQL, and SQLite. My application is built to use all three and I frequently switch between them.
Thanks!
Background: This is for a Ruby on Rails plotting application, so a trick with ActiveRecord will work too. https://github.com/ZachDischner/Rails-Plotter
You need to tag your question with the brand of RDBMS you're using. Frequently for Rails developers, they're using MySQL, but the answer to your question depends on this.
For all brands except for MySQL, the correct and standard solution is to use windowing functions:
SELECT * FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY day) AS RN, *
FROM stockmarketdata
) AS t
WHERE t.RN = 1;
For MySQL, which doesn't support windowing functions yet, you can simulate them in a kind of clumsy way with session variables:
SELECT * FROM (SELECT #day:=0, #r:=0) AS _init,
(
SELECT IF(day=#day, #r:=#r+1, #r:=0) AS RN, #day:=day AS d, *
FROM stockmarketdata
) AS t
WHERE t.RN = 1
You left a lot of room for options with your statement:
It doesn't matter if its the first, last, or an average value per day. Just a single value. I just prefer what is fastest.
So, I'm going to leave the id out of it and first propose going with average of obs for each group as the simplest and probably the most practical, though maybe not the fastest to be running stat functions vs. limit:
MyModel.group(:day).average(:obs)
If you wanted the minimum:
MyModel.group(:day).minimum(:obs)
If you wanted the maximum:
MyModel.group(:day).maximum(:obs)
(Note: The following 2 examples are less efficient than just entering the SQL, but might be more portable.)
But you might want all three:
ActiveRecord::Base.connection.execute(MyModel.select('MIN(obs), AVG(obs), MAX(obs)').group(:day).to_sql).to_a
Or just the data without hashes:
ActiveRecord::Base.connection.exec_query(MyModel.select('MIN(obs), AVG(obs), MAX(obs)').group(:day).to_sql)
If you want median, see this question which is more DB specific, and there are other related posts about it if you search.
And for more, some DB's like postgres have variance(...), stddev(...), etc. built-in.
Finally, check out the query section in the Rails guide and ARel for more info on constructing queries. You can do a limit in an ActiveRecord relation via first or limit for example, and in ARel, take lets you do a limit. Subqueries are possible too, as shown in answers to this question, and so is group by, etc. If you are sharing this project with others, try to limit the amount of non-portable SQL you are using unless you plan on adding support for other databases on your own and maintaining that.

SQL: Select distinct based on regular expression

Basically, I'm dealing with a horribly set up table that I'd love to rebuild, but am not sure I can at this point.
So, the table is of addresses, and it has a ton of similar entries for the same address. But there are sometimes slight variations in the address (i.e., a room # is tacked on IN THE SAME COLUMN, ugh).
Like this:
id | place_name | place_street
1 | Place Name One | 1001 Mercury Blvd
2 | Place Name Two | 2388 Jupiter Street
3 | Place Name One | 1001 Mercury Blvd, Suite A
4 | Place Name, One | 1001 Mercury Boulevard
5 | Place Nam Two | 2388 Jupiter Street, Rm 101
What I would like to do is in SQL (this is mssql), if possible, is do a query that is like:
SELECT DISTINCT place_name, place_street where [the first 4 letters of the place_name are the same] && [the first 4 characters of the place_street are the same].
to, I guess at this point, get:
Plac | 1001
Plac | 2388
Basically, then I can figure out what are the main addresses I have to break out into another table to normalize this, because the rest are just slight derivations.
I hope that makes sense.
I've done some research and I see people using regular expressions in SQL, but a lot of them seem to be using C scripts or something. Do I have to write regex functions and save them into the SQL Server before executing any regular expressions?
Any direction on whether I can just write them in SQL or if I have another step to go through would be great.
Or on how to approach this problem.
Thanks in advance!
Use the SQL function LEFT:
SELECT DISTINCT LEFT(place_name, 4)
I don't think you need regular expressions to get the results you describe. You just want to trim the columns and group by the results, which will effectively give you distinct values.
SELECT left(place_name, 4), left(place_street, 4), count(*)
FROM AddressTable
GROUP BY left(place_name, 4), left(place_street, 4)
The count(*) column isn't necessary, but it gives you some idea of which values might have the most (possibly) duplicate address rows in common.
I would recommend you look into Fuzzy Search Operations in SQL Server. You can match the results much better than what you are trying to do. Just google sql server fuzzy search.
Assuming at least SQL Server 2005 for the CTE:
;with cteCommonAddresses as (
select left(place_name, 4) as LeftName, left(place_street,4) as LeftStreet
from Address
group by left(place_name, 4), left(place_street,4)
having count(*) > 1
)
select a.id, a.place_name, a.place_street
from cteCommonAddresses c
inner join Address a
on c.LeftName = left(a.place_name,4)
and c.LeftStreet = left(a.place_street,4)
order by a.place_name, a.place_street, a.id

Is there any difference between GROUP BY and DISTINCT

I learned something simple about SQL the other day:
SELECT c FROM myTbl GROUP BY C
Has the same result as:
SELECT DISTINCT C FROM myTbl
What I am curious of, is there anything different in the way an SQL engine processes the command, or are they truly the same thing?
I personally prefer the distinct syntax, but I am sure it's more out of habit than anything else.
EDIT: This is not a question about aggregates. The use of GROUP BY with aggregate functions is understood.
MusiGenesis' response is functionally the correct one with regard to your question as stated; the SQL Server is smart enough to realize that if you are using "Group By" and not using any aggregate functions, then what you actually mean is "Distinct" - and therefore it generates an execution plan as if you'd simply used "Distinct."
However, I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotchas down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregates and one of which is not.
A hammer can work to drive in a screw sometimes, but if you've got a screwdriver handy, why bother?
(for the purposes of this analogy, Hammer : Screwdriver :: GroupBy : Distinct and screw => get list of unique values in a table column)
GROUP BY lets you use aggregate functions, like AVG, MAX, MIN, SUM, and COUNT.
On the other hand DISTINCT just removes duplicates.
For example, if you have a bunch of purchase records, and you want to know how much was spent by each department, you might do something like:
SELECT department, SUM(amount) FROM purchases GROUP BY department
This will give you one row per department, containing the department name and the sum of all of the amount values in all rows for that department.
What's the difference from a mere duplicate removal functionality point of view
Apart from the fact that unlike DISTINCT, GROUP BY allows for aggregating data per group (which has been mentioned by many other answers), the most important difference in my opinion is the fact that the two operations "happen" at two very different steps in the logical order of operations that are executed in a SELECT statement.
Here are the most important operations:
FROM (including JOIN, APPLY, etc.)
WHERE
GROUP BY (can remove duplicates)
Aggregations
HAVING
Window functions
SELECT
DISTINCT (can remove duplicates)
UNION, INTERSECT, EXCEPT (can remove duplicates)
ORDER BY
OFFSET
LIMIT
As you can see, the logical order of each operation influences what can be done with it and how it influences subsequent operations. In particular, the fact that the GROUP BY operation "happens before" the SELECT operation (the projection) means that:
It doesn't depend on the projection (which can be an advantage)
It cannot use any values from the projection (which can be a disadvantage)
1. It doesn't depend on the projection
An example where not depending on the projection is useful is if you want to calculate window functions on distinct values:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
GROUP BY rating
When run against the Sakila database, this yields:
rating rn
-----------
G 1
NC-17 2
PG 3
PG-13 4
R 5
The same couldn't be achieved with DISTINCT easily:
SELECT DISTINCT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
That query is "wrong" and yields something like:
rating rn
------------
G 1
G 2
G 3
...
G 178
NC-17 179
NC-17 180
...
This is not what we wanted. The DISTINCT operation "happens after" the projection, so we can no longer remove DISTINCT ratings because the window function was already calculated and projected. In order to use DISTINCT, we'd have to nest that part of the query:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM (
SELECT DISTINCT rating FROM film
) f
Side-note: In this particular case, we could also use DENSE_RANK()
SELECT DISTINCT rating, dense_rank() OVER (ORDER BY rating) AS rn
FROM film
2. It cannot use any values from the projection
One of SQL's drawbacks is its verbosity at times. For the same reason as what we've seen before (namely the logical order of operations), we cannot "easily" group by something we're projecting.
This is invalid SQL:
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY name
This is valid (repeating the expression)
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY first_name || ' ' || last_name
This is valid, too (nesting the expression)
SELECT name
FROM (
SELECT first_name || ' ' || last_name AS name
FROM customer
) c
GROUP BY name
I've written about this topic more in depth in a blog post
There is no difference (in SQL Server, at least). Both queries use the same execution plan.
http://sqlmag.com/database-performance-tuning/distinct-vs-group
Maybe there is a difference, if there are sub-queries involved:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
There is no difference (Oracle-style):
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:32961403234212
Use DISTINCT if you just want to remove duplicates. Use GROUPY BY if you want to apply aggregate operators (MAX, SUM, GROUP_CONCAT, ..., or a HAVING clause).
I expect there is the possibility for subtle differences in their execution.
I checked the execution plans for two functionally equivalent queries along these lines in Oracle 10g:
core> select sta from zip group by sta;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH GROUP BY | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
core> select distinct sta from zip;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH UNIQUE | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
The middle operation is slightly different: "HASH GROUP BY" vs. "HASH UNIQUE", but the estimated costs etc. are identical. I then executed these with tracing on and the actual operation counts were the same for both (except that the second one didn't have to do any physical reads due to caching).
But I think that because the operation names are different, the execution would follow somewhat different code paths and that opens the possibility of more significant differences.
I think you should prefer the DISTINCT syntax for this purpose. It's not just habit, it more clearly indicates the purpose of the query.
For the query you posted, they are identical. But for other queries that may not be true.
For example, it's not the same as:
SELECT C FROM myTbl GROUP BY C, D
I read all the above comments but didn't see anyone pointed to the main difference between Group By and Distinct apart from the aggregation bit.
Distinct returns all the rows then de-duplicates them whereas Group By de-deduplicate the rows as they're read by the algorithm one by one.
This means they can produce different results!
For example, the below codes generate different results:
SELECT distinct ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
SELECT ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
GROUP BY Name
If there are 10 names in the table where 1 of which is a duplicate of another then the first query returns 10 rows whereas the second query returns 9 rows.
The reason is what I said above so they can behave differently!
If you use DISTINCT with multiple columns, the result set won't be grouped as it will with GROUP BY, and you can't use aggregate functions with DISTINCT.
GROUP BY has a very specific meaning that is distinct (heh) from the DISTINCT function.
GROUP BY causes the query results to be grouped using the chosen expression, aggregate functions can then be applied, and these will act on each group, rather than the entire resultset.
Here's an example that might help:
Given a table that looks like this:
name
------
barry
dave
bill
dave
dave
barry
john
This query:
SELECT name, count(*) AS count FROM table GROUP BY name;
Will produce output like this:
name count
-------------
barry 2
dave 3
bill 1
john 1
Which is obviously very different from using DISTINCT. If you want to group your results, use GROUP BY, if you just want a unique list of a specific column, use DISTINCT. This will give your database a chance to optimise the query for your needs.
If you are using a GROUP BY without any aggregate function then internally it will treated as DISTINCT, so in this case there is no difference between GROUP BY and DISTINCT.
But when you are provided with DISTINCT clause better to use it for finding your unique records because the objective of GROUP BY is to achieve aggregation.
They have different semantics, even if they happen to have equivalent results on your particular data.
Please don't use GROUP BY when you mean DISTINCT, even if they happen to work the same. I'm assuming you're trying to shave off milliseconds from queries, and I have to point out that developer time is orders of magnitude more expensive than computer time.
In Teradata perspective :
From a result set point of view, it does not matter if you use DISTINCT or GROUP BY in Teradata. The answer set will be the same.
From a performance point of view, it is not the same.
To understand what impacts performance, you need to know what happens on Teradata when executing a statement with DISTINCT or GROUP BY.
In the case of DISTINCT, the rows are redistributed immediately without any preaggregation taking place, while in the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.
Don’t think now that GROUP BY is always better from a performance point of view. When you have many different values, the preaggregation step of GROUP BY is not very efficient. Teradata has to sort the data to remove duplicates. In this case, it may be better to the redistribution first, i.e. use the DISTINCT statement. Only if there are many duplicate values, the GROUP BY statement is probably the better choice as only once the deduplication step takes place, after redistribution.
In short, DISTINCT vs. GROUP BY in Teradata means:
GROUP BY -> for many duplicates
DISTINCT -> no or a few duplicates only .
At times, when using DISTINCT, you run out of spool space on an AMP. The reason is that redistribution takes place immediately, and skewing could cause AMPs to run out of space.
If this happens, you have probably a better chance with GROUP BY, as duplicates are already removed in a first step, and less data is moved across the AMPs.
group by is used in aggregate operations -- like when you want to get a count of Bs broken down by column C
select C, count(B) from myTbl group by C
distinct is what it sounds like -- you get unique rows.
In sql server 2005, it looks like the query optimizer is able to optimize away the difference in the simplistic examples I ran. Dunno if you can count on that in all situations, though.
In that particular query there is no difference. But, of course, if you add any aggregate columns then you'll have to use group by.
You're only noticing that because you are selecting a single column.
Try selecting two fields and see what happens.
Group By is intended to be used like this:
SELECT name, SUM(transaction) FROM myTbl GROUP BY name
Which would show the sum of all transactions for each person.
From a 'SQL the language' perspective the two constructs are equivalent and which one you choose is one of those 'lifestyle' choices we all have to make. I think there is a good case for DISTINCT being more explicit (and therefore is more considerate to the person who will inherit your code etc) but that doesn't mean the GROUP BY construct is an invalid choice.
I think this 'GROUP BY is for aggregates' is the wrong emphasis. Folk should be aware that the set function (MAX, MIN, COUNT, etc) can be omitted so that they can understand the coder's intent when it is.
The ideal optimizer will recognize equivalent SQL constructs and will always pick the ideal plan accordingly. For your real life SQL engine of choice, you must test :)
PS note the position of the DISTINCT keyword in the select clause may produce different results e.g. contrast:
SELECT COUNT(DISTINCT C) FROM myTbl;
SELECT DISTINCT COUNT(C) FROM myTbl;
I know it's an old post. But it happens that I had a query that used group by just to return distinct values when using that query in toad and oracle reports everything worked fine, I mean a good response time. When we migrated from Oracle 9i to 11g the response time in Toad was excellent but in the reporte it took about 35 minutes to finish the report when using previous version it took about 5 minutes.
The solution was to change the group by and use DISTINCT and now the report runs in about 30 secs.
I hope this is useful for someone with the same situation.
Sometimes they may give you the same results but they are meant to be used in different sense/case. The main difference is in syntax.
Minutely notice the example below. DISTINCT is used to filter out the duplicate set of values. (6, cs, 9.1) and (1, cs, 5.5) are two different sets. So DISTINCT is going to display both the rows while GROUP BY Branch is going to display only one set.
SELECT * FROM student;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 2 | mech | 6.3 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 1 | cs | 5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT DISTINCT * FROM student;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 2 | mech | 6.3 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 1 | cs | 5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT * FROM student GROUP BY Branch;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 2 | mech | 6.3 |
+------+--------+------+
4 rows in set (0.001 sec)
Sometimes the results that can be achieved by GROUP BY clause is not possible to achieved by DISTINCT without using some extra clause or conditions. E.g in above case.
To get the same result as DISTINCT you have to pass all the column names in GROUP BY clause like below. So see the syntactical difference. You must have knowledge about all the column names to use GROUP BY clause in that case.
SELECT * FROM student GROUP BY Id, Branch, CGPA;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 1 | cs | 5.5 |
| 2 | mech | 6.3 |
| 3 | civil | 7.2 |
| 4 | eee | 8.2 |
| 6 | cs | 9.1 |
+------+--------+------+
Also I have noticed GROUP BY displays the results in ascending order by default which DISTINCT does not. But I am not sure about this. It may be differ vendor wise.
Source : https://dbjpanda.me/dbms/languages/sql/sql-syntax-with-examples#group-by
In terms of usage, GROUP BY is used for grouping those rows you want to calculate. DISTINCT will not do any calculation. It will show no duplicate rows.
I always used DISTINCT if I want to present data without duplicates.
If I want to do calculations like summing up the total quantity of mangoes, I will use GROUP BY
In Hive (HQL), GROUP BY can be way faster than DISTINCT, because the former does not require comparing all fields in the table.
See: https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct.
The way I always understood it is that using distinct is the same as grouping by every field you selected in the order you selected them.
i.e:
select distinct a, b, c from table;
is the same as:
select a, b, c from table group by a, b, c
Funtional efficiency is totally different.
If you would like to select only "return value" except duplicate one, use distinct is better than group by. Because "group by" include ( sorting + removing ) , "distinct" include ( removing )
Generally we can use DISTINCT for eliminate the duplicates on Specific Column in the table.
In Case of 'GROUP BY' we can Apply the Aggregation Functions like
AVG, MAX, MIN, SUM, and COUNT on Specific column and fetch
the column name and it aggregation function result on the same column.
Example :
select specialColumn,sum(specialColumn) from yourTableName group by specialColumn;
There is no significantly difference between group by and distinct clause except the usage of aggregate functions.
Both can be used to distinguish the values but if in performance point of view group by is better.
When distinct keyword is used , internally it used sort operation which can be view in execution plan.
Try simple example
Declare #tmpresult table
(
Id tinyint
)
Insert into #tmpresult
Select 5
Union all
Select 2
Union all
Select 3
Union all
Select 4
Select distinct
Id
From #tmpresult