Need a Complex SQL Query - sql

I need to make a rather complex query, and I need help bad. Below is an example I made.
Basically, I need a query that will return one row for each case_id where the type is support, status start, and date meaning the very first one created (so that in the example below, only the 2/1/2009 John's case gets returned, not the 3/1/2009). The search needs to be dynamic to the point of being able to return all similar rows with different case_id's etc from a table with thousands of rows.
There's more after that but I don't know all the details yet, and I think I can figure it out if you guys (an gals) can help me out here. :)
ID | Case_ID | Name | Date | Status | Type
48 | 450 | John | 6/1/2009 | Fixed | Support
47 | 450 | John | 4/1/2009 | Moved | Support
46 | 451 | Sarah | 3/1/2009 | |
45 | 432 | John | 3/1/2009 | Fixed | Critical
44 | 450 | John | 3/1/2009 | Start | Support
42 | 450 | John | 2/1/2009 | Start | Support
41 | 440 | Ben | 2/1/2009 | |
40 | 432 | John | 1/1/2009 | Start | Critical
...
Thanks a bunch!
Edit:
To answer some people's questions, I'm using SQL Server 2005. And the date is just plain date, not string.
Ok so now I got further in the problem. I ended up with Bliek's solution which worked like a charm. But now I ran into the problem that sometimes the status never starts, as it's solved immediately. I need to include this in as well. But only for a certain time period.
I imagine I'm going to have to check for the case table referenced by FK Case_ID here. So I'd need a way to check for each Case_ID created in the CaseTable within the past month, and then run a search for these in the same table and same manner as posted above, returning only the first result as before. How can I use the other table like that?
As usual I'll try to find the answer myself while waiting, thanks again!
Edit 2:
Seems this is the answer. I don't have access to the full DB yet so I can't fully test it, but it seems to be working with the dummy tables I created, to continue from Bliek's code's WHERE clause:
WHERE RowNumber = 1 AND Case_ID IN (SELECT Case_ID FROM CaseTable
WHERE (Date BETWEEN '2007/11/1' AND '2007/11/30'))
The date's screwed again but you get the idea I'm sure. Thanks for the help everyone! I'll get back if there're more problems, but I think with this info I can improvise my way through most of the SQL problems I currently have to deal with. :)

Maybe something like:
select Case_ID, Name, MIN(date), Status, Type
from table
where type = 'Support'
and status = 'Start'
group by Case_ID, Name, Status, Type
EDIT: You haven't provided a lot of details about what you really want, so I'd suggest that you read all the answers and choose one that suits your problem best. So far I'd say that Tomalak's answer is closest to what you're looking for...

SELECT
c.ID,
c.Case_ID,
c.Name,
c.Date,
c.Status,
c.Type
FROM
CaseTable c
WHERE
c.Type = 'Support'
AND c.Status = 'Start'
AND c.Date = (
SELECT MIN(Date)
FROM CaseTable
WHERE Case_ID = c.Case_ID AND Type = c.Type AND Status = c.Status)
/* GROUP BY only needed when for a given Case_ID several rows
exist that fulfill the WHERE clause */
GROUP BY
c.ID,
c.Case_ID,
c.Name,
c.Date,
c.Status,
c.Type
This query benefits greatly from indexes on the Case_ID, Date, Status and Type columns.
Added value though the fact that the filter on Support and Status only needs to be set in one place.
As an alternative to the GROUP BY clause, you can do SELECT DISTINCT, which would increase readability (this may or may not affect overall performance, I suggest you measure both variants against each other). If you are sure that for no Case_ID in your table two rows exist that have the same Date, you won't need GROUP BY or SELECT DISTINCT at all.

In SQL Server 2005 and beyond I would use Common Table Expressions (CTE). This offers lots of possibilities like so:
With ResultTable (RowNumber
,ID
,Case_ID
,Name
,Date
,Status
,Type)
AS
(
SELECT Row_Number() OVER (PARTITION BY Case_ID
ORDER BY Date ASC)
,ID
,Case_ID
,Name
,Date
,Status
,Type
FROM CaseTable
WHERE Type = 'Support'
AND Status = 'Start'
)
SELECT ID
,Case_ID
,Name
,Date
,Status
,Type
FROM ResultTable
WHERE RowNumber = 1

Don't apologize for your date formatting, it makes more sense that way.
SELECT ID, Case_ID, Name, MIN(Date), Status, Type
FROM caseTable
WHERE Type = 'Support'
AND status = 'Start'
GROUP BY ID, Case_ID, Name, Status, Type

Related

I want a hierachical order in my output but dont know how?

I want in the output a hierarchical order like so:
My Data :
Name | Cost | Level
----------------+--------+------
Car1 | 2000 | 1
Component1.1 | 3000 | 2
Component1.2 | 2300 | 3
Computer2 | 5000 | 1
Component2.1 | 2000 | 2
Component2.2 | Null | 3
Output: Show all those data, which has money in it and order it by the level, something like first 1, then 2, then,3 and after that start with 1 again.
Name | Level
------------------+------
Car1 | 1
Component1.1 | 2
Component1.2 | 3
Computer | 1
Component2.1 | 2
What ORDER BY does is:
Name | Level
----------------+------
Car1 | 1
Computer1 | 1
Component1.1 | 2
Component2.1 | 2
Component1.2 | 3
Component2.2 | 3
I tried the CONNECT BY PRIOR function and it didn't work well
SELECT Name, Level
FROM Product
CONNECT BY PRIOR Level;
In MySQL you would normaly use 'order by'. So if you want to order on table row "level" your synntax would be something like this:
SELECT * FROM items ORDER BY level ASC
You can make use of ASC (Ascending) or DESC (descending).
Hope this will help you.
You can use below one
SELECT Name, Level
FROM Auction
ORDER BY Name, Level
WHERE Money!='Null' ;
Doing order by Name will print the result in hierarchical order, but if it has a column called parent id, then it would have been easier to show.
i suggest this for you :
SELECT Name, Level FROM Product ORDER BY Name, Level WHERE Money!='Null' ASC;
i wish this help you brother
It is still not clear whether this is what you are really expecting. It seems to me from your data set, you want to numerically order the components based on some kind of a version number at the end of the component. If that's truly what you want then you may ignore the non-numeric characters in the name and order by pure numbers towards the end of string ( with the required where clause ).
ORDER BY REPLACE ( name, TRANSLATE(name,' .0123456789',' '),'');
If that's the case, then the adding level too to the ORDER BY shouldn't make any difference unless your numeric order of versions and level are in sync.
A problem may appear if you have components like component2_name1.2 etc which could break this logic, for which you may require REGEXP to identify the required pattern. But, it doesn't appear so from your data and I assumed that to be the case and you may want to clarify if that's not what you may always have in your dataset.
Here's a demo of the result obtained for your sample data.
Demo
This will work of the numeric character is always a valid decimal and has only one decimal point. If you have complex versioning system like say 1.1.8, 2.1.1 etc, it needs far sophisticated ordering on top of REPLACE ( name, TRANSLATE(name,' .0123456789',' '),'').
You will find such examples in posts such as this one Here
Note: I would request you to also please read the instructions here to know how to ask a good question. This would avoid all confusion to people who try to understand and answer your question.

Retrieving values from two columns based on different conditions

I have a question for you all. I have 'inherited' a DB at work and I have to create a report from a table using different conditions. Please note I'm no sql expert, I hope what I write makes sense.
Trying to simplify, I have a HARDWARE table that contains the following:
HWTYPE - type of hardware
HWMODEL - model of hardware
PHONENUM - phone number
USERID - user the hardware is assigned to
the data looks like this:
HWTYPE | HWMODEL | PHONENUM | USERID
-------+------------+------------+----------
SIM | SIMVOICE | 123456 | CIRO
SIM | SIMVOICE | 124578 | LEO
PHONE | APPLE | | CIRO
PHONE | SAMSUNG | | LEO
now as you can see, every user has assigned one phone and one SIM with a phone number.
I need to sort the data per user, so that every line of the query result look like:
HW | PHONENUM | USERID
---------+--------------+------
APPLE | 123456 | CIRO
SAMSUNG | 124578 | LEO
so basically: group column PHONENUM and HWMODEL based on USER.
And this is where I get stuck! I tried union, join, case etc. but I still don't get the correct result.
Again apologies for the (probably) very basic question. I tried to look for something similar but could not find anything.
Thanks to whoever will want to help me.
regards
Leo
I dont know that I understood your question or not
But i think you just need to write following query for your O/P
SELECT
HWTYPE, HWMODEL, USERID
FROM
HARDWARE
GROUP BY USERID ,HWTYPE,HWMODEL
ORDER BY HWTYPE
Placing my comment as an answer;
Write this as your SQL:
SELECT
(HWTYPE, HWMODEL, USERID)
FROM
HARDWARE
GROUP BY (USERID) /*Other Clauses can be added here, but ensure you use commas to seperate them!*/
Taking this step by step:
SELECT ... -> what columns you want to see
FROM ... -> what table you want it from (use joins if you need from multiple tables)
GROUP BY... -> What you want to collect together
There is also:
WHERE... -> conditions for when to include/what not to include
You can also get your expected Result using below query. This will use self join on Hardware table as well as reduce the length of query syntex.
SELECT H2.HWMODEL AS HW,H1.PHONENUM,H1.USERID FROM HARDWARE H1 INNER JOIN HARDWARE H2 ON H1.USERID = H2.USERID
WHERE ISNULL(H1.PHONENUM,'') <> ''
AND ISNULL(H2.PHONENUM,'') = ''
ORDER BY H2.HWMODEL ASC
Hope this will help you.
If your goal is a list of Phone numbers and models sorted by user, you can use this query:
SELECT HWMODEL,PHONENUM,USERID FROM HARDWARE ORDER BY USER
Hope to have understood your question
SELECT
hw.HWMODEL as HW
,phone.PHONENUM
,user.USERID
FROM
(
SELECT DISTINCT
USERID
FROM
HARDWARE
) as user
LEFT JOIN
(
SELECT
USERID
,PHONENUM
FROM
HARDWARE
WHERE
PHONENUM IS NOT NULL
) as phone
ON phone.USERID = user.USERID
LEFT JOIN
(
SELECT
USERID
,HWMODEL
FROM
HARDWARE
WHERE
HWTYPE = 'PHONE'
) AS hw
ON hw.USERID = user.USERID
ORDER BY
user.USERID
,hw.HWMODEL

SQL: SUM of MAX values WHERE date1 <= date2 returns "wrong" results

Hi stackoverflow users
I'm having a bit of a problem trying to combine SUM, MAX and WHERE in one query and after an intense Google search (my search engine skills usually don't fail me) you are my last hope to understand and fix the following issue.
My goal is to count people in a certain period of time and because a person can visit more than once in said period, I'm using MAX. Due to the fact that I'm defining people as male (m) or female (f) using a string (for statistic purposes), CHAR_LENGTH returns the numbers I'm in need of.
SELECT SUM(max_pers) AS "People"
FROM (
SELECT "guests"."id", MAX(CHAR_LENGTH("guests"."gender")) AS "max_pers"
FROM "guests"
GROUP BY "guests"."id")
So far, so good. But now, as stated before, I'd like to only count the guests which visited in a certain time interval (for statistic purposes as well).
SELECT "statistic"."id", SUM(max_pers) AS "People"
FROM (
SELECT "guests"."id", MAX(CHAR_LENGTH("guests"."gender")) AS "max_pers"
FROM "guests"
GROUP BY "guests"."id"),
"statistic", "guests"
WHERE ( "guests"."arrival" <= "statistic"."from" AND "guests"."departure" >= "statistic"."to")
GROUP BY "statistic"."id"
This query returns the following, x = desired result:
x * (x+1)
So if the result should be 3, it's 12. If it should be 5, it's 30 etc.
I probably could solve this algebraic but I'd rather understand what I'm doing wrong and learn from it.
Thanks in advance and I'm certainly going to answer all further questions.
PS: I'm using LibreOffice Base.
EDIT: An example
guests table:
ID | arrival | departure | gender |
10 | 1.1.14 | 10.1.14 | mf |
10 | 15.1.14 | 17.1.14 | m |
11 | 5.1.14 | 6.1.14 | m |
12 | 10.2.14 | 24.2.14 | f |
13 | 27.2.14 | 28.2.14 | mmmmmf |
statistic table:
ID | from | to | name |
1 | 1.1.14 | 31.1.14 |January | expected result: 3
2 | 1.2.14 | 28.2.14 |February| expected result: 7
MAX(...) is the wrong function: You want COUNT(DISTINCT ...).
Add proper join syntax, simplify (and remove unnecessary quotes) and this should work:
SELECT s.id, COUNT(DISTINCT g.id) AS People
FROM statistic s
LEFT JOIN guests g ON g.arrival <= s."from" AND g.departure >= s."too"
GROUP BY s.id
Note: Using LEFT join means you'll get a result of zero for statistics ids that have no guests. If you would rather no row at all, remove the LEFT keyword.
You have a very strange data structure. In any case, I think you want:
SELECT s.id, sum(numpersons) AS People
FROM (select g.id, max(char_length(g.gender)) as numpersons
from guests g join
statistic s
on g.arrival <= s."from" AND g.departure >= s."too"
group by g.id
) g join
GROUP BY s.id;
Thanks for all your inputs. I wasn't familiar with JOIN but it was necessary to solve my problem.
Since my databank is designed in german, I made quite the big mistake while translating it and I'm sorry if this caused confusion.
Selecting guests.id and later on grouping by guests.id wouldn't make any sense since the id is unique. What I actually wanted to do is select and group the guests.adr_id which links a visiting guest to an adress databank.
The correct solution to my problem is the following code:
SELECT statname, SUM (numpers) FROM (
SELECT statistic.name AS statname, guests.adr_id, MAX( CHAR_LENGTH( guests.gender ) ) AS numpers
FROM guests
JOIN statistics ON (guests.arrival <= statistics.too AND guests.departure >= statistics.from )
GROUP BY guests.adr_id, statistic.name )
GROUP BY statname
I also noted that my database structure is a mess but I created it learning by doing and haven't found any time to rewrite it yet. Next time posting, I'll try better.

Select unique records and display as category headers in rails

I have a rails 3.2 app running on PostgreSQL, and have some data I want to display in my view, which is stored in the database in this structure:
+----+--------+------------------+--------------------+
| id | name | sched_start_date | task |
+----+--------+------------------+--------------------+
| 1 | "Ben" | 2013-03-01 | "Check for debris" |
+----+--------+------------------+--------------------+
| 2 | "Toby" | 2013-03-02 | "Carry out Y1.1" |
+----+--------+------------------+--------------------+
| 3 | "Toby" | 2013-03-03 | "Check oil seals" |
+----+--------+------------------+--------------------+
I would like to display a list of tasks for each name, and for the names to be ordered ASC by the first sched_start_date they have, which should look like ...
Ben
2013-03-01 – Check for debris
Toby
2013-03-02 – Carry out Y1.1
2013-03-03 – Check oil seals
The approach I starting taking was to run a query for unique names and order them by sched_start_date ASC, then run a query for each name to get their tasks.
To get a list of unique names, the SQL would look like this.
select *
from (
select distinct on (name) name, sched_start_date
from tasks
) p
order by sched_start_date;
I would like to know if this is the correct approach (querying for unique names then running another query for all their tasks), or if there is a better rails way.
To get the data sorted like you describe, you might want to use min() as window function in the ORDER BY clause:
SELECT name, sched_start_date, task
FROM tasks
ORDER BY min(sched_start_date) OVER (PARTITION BY name), 1, 2, 3
Your original query would need an additional ORDER BY item to get the earliest date per name:
SELECT DISTINCT ON (name) name, sched_start_date, task
FROM tasks
ORDER BY 1, 2, 3;
I also added task (3) as last ORDER BY item to break ties, in case there can be more than one per date.
But the output is still ordered by name, not by date.
Getting your peculiar format with all data stuffed into one column is a bit more complex:
SELECT one_col
FROM (
WITH x AS (
SELECT name, min(sched_start_date) AS min_start
FROM tasks
GROUP BY 1
)
SELECT 2 AS rnk, name
,sched_start_date::text || ' – ' || task AS one_col
,sched_start_date, min_start
FROM tasks
JOIN x USING (name)
UNION ALL
SELECT 1 AS rnk, name, name, NULL::date, min_start
FROM x
ORDER BY min_start, name, rnk, sched_start_date, task
) y
Assuming that you have associations in your model you would be able to run
#employees = Employee.order(:name, :sched_start_date, :task).includes(:tasks)
You could then iterate over them:
#employees.each do |employee|
employee.name
employee.tasks.each do |task|
task.name
end
end
This isn't gonna exactly match your needs, but should show you where to start.

Is there any difference between GROUP BY and DISTINCT

I learned something simple about SQL the other day:
SELECT c FROM myTbl GROUP BY C
Has the same result as:
SELECT DISTINCT C FROM myTbl
What I am curious of, is there anything different in the way an SQL engine processes the command, or are they truly the same thing?
I personally prefer the distinct syntax, but I am sure it's more out of habit than anything else.
EDIT: This is not a question about aggregates. The use of GROUP BY with aggregate functions is understood.
MusiGenesis' response is functionally the correct one with regard to your question as stated; the SQL Server is smart enough to realize that if you are using "Group By" and not using any aggregate functions, then what you actually mean is "Distinct" - and therefore it generates an execution plan as if you'd simply used "Distinct."
However, I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotchas down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregates and one of which is not.
A hammer can work to drive in a screw sometimes, but if you've got a screwdriver handy, why bother?
(for the purposes of this analogy, Hammer : Screwdriver :: GroupBy : Distinct and screw => get list of unique values in a table column)
GROUP BY lets you use aggregate functions, like AVG, MAX, MIN, SUM, and COUNT.
On the other hand DISTINCT just removes duplicates.
For example, if you have a bunch of purchase records, and you want to know how much was spent by each department, you might do something like:
SELECT department, SUM(amount) FROM purchases GROUP BY department
This will give you one row per department, containing the department name and the sum of all of the amount values in all rows for that department.
What's the difference from a mere duplicate removal functionality point of view
Apart from the fact that unlike DISTINCT, GROUP BY allows for aggregating data per group (which has been mentioned by many other answers), the most important difference in my opinion is the fact that the two operations "happen" at two very different steps in the logical order of operations that are executed in a SELECT statement.
Here are the most important operations:
FROM (including JOIN, APPLY, etc.)
WHERE
GROUP BY (can remove duplicates)
Aggregations
HAVING
Window functions
SELECT
DISTINCT (can remove duplicates)
UNION, INTERSECT, EXCEPT (can remove duplicates)
ORDER BY
OFFSET
LIMIT
As you can see, the logical order of each operation influences what can be done with it and how it influences subsequent operations. In particular, the fact that the GROUP BY operation "happens before" the SELECT operation (the projection) means that:
It doesn't depend on the projection (which can be an advantage)
It cannot use any values from the projection (which can be a disadvantage)
1. It doesn't depend on the projection
An example where not depending on the projection is useful is if you want to calculate window functions on distinct values:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
GROUP BY rating
When run against the Sakila database, this yields:
rating rn
-----------
G 1
NC-17 2
PG 3
PG-13 4
R 5
The same couldn't be achieved with DISTINCT easily:
SELECT DISTINCT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
That query is "wrong" and yields something like:
rating rn
------------
G 1
G 2
G 3
...
G 178
NC-17 179
NC-17 180
...
This is not what we wanted. The DISTINCT operation "happens after" the projection, so we can no longer remove DISTINCT ratings because the window function was already calculated and projected. In order to use DISTINCT, we'd have to nest that part of the query:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM (
SELECT DISTINCT rating FROM film
) f
Side-note: In this particular case, we could also use DENSE_RANK()
SELECT DISTINCT rating, dense_rank() OVER (ORDER BY rating) AS rn
FROM film
2. It cannot use any values from the projection
One of SQL's drawbacks is its verbosity at times. For the same reason as what we've seen before (namely the logical order of operations), we cannot "easily" group by something we're projecting.
This is invalid SQL:
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY name
This is valid (repeating the expression)
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY first_name || ' ' || last_name
This is valid, too (nesting the expression)
SELECT name
FROM (
SELECT first_name || ' ' || last_name AS name
FROM customer
) c
GROUP BY name
I've written about this topic more in depth in a blog post
There is no difference (in SQL Server, at least). Both queries use the same execution plan.
http://sqlmag.com/database-performance-tuning/distinct-vs-group
Maybe there is a difference, if there are sub-queries involved:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
There is no difference (Oracle-style):
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:32961403234212
Use DISTINCT if you just want to remove duplicates. Use GROUPY BY if you want to apply aggregate operators (MAX, SUM, GROUP_CONCAT, ..., or a HAVING clause).
I expect there is the possibility for subtle differences in their execution.
I checked the execution plans for two functionally equivalent queries along these lines in Oracle 10g:
core> select sta from zip group by sta;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH GROUP BY | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
core> select distinct sta from zip;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH UNIQUE | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
The middle operation is slightly different: "HASH GROUP BY" vs. "HASH UNIQUE", but the estimated costs etc. are identical. I then executed these with tracing on and the actual operation counts were the same for both (except that the second one didn't have to do any physical reads due to caching).
But I think that because the operation names are different, the execution would follow somewhat different code paths and that opens the possibility of more significant differences.
I think you should prefer the DISTINCT syntax for this purpose. It's not just habit, it more clearly indicates the purpose of the query.
For the query you posted, they are identical. But for other queries that may not be true.
For example, it's not the same as:
SELECT C FROM myTbl GROUP BY C, D
I read all the above comments but didn't see anyone pointed to the main difference between Group By and Distinct apart from the aggregation bit.
Distinct returns all the rows then de-duplicates them whereas Group By de-deduplicate the rows as they're read by the algorithm one by one.
This means they can produce different results!
For example, the below codes generate different results:
SELECT distinct ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
SELECT ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
GROUP BY Name
If there are 10 names in the table where 1 of which is a duplicate of another then the first query returns 10 rows whereas the second query returns 9 rows.
The reason is what I said above so they can behave differently!
If you use DISTINCT with multiple columns, the result set won't be grouped as it will with GROUP BY, and you can't use aggregate functions with DISTINCT.
GROUP BY has a very specific meaning that is distinct (heh) from the DISTINCT function.
GROUP BY causes the query results to be grouped using the chosen expression, aggregate functions can then be applied, and these will act on each group, rather than the entire resultset.
Here's an example that might help:
Given a table that looks like this:
name
------
barry
dave
bill
dave
dave
barry
john
This query:
SELECT name, count(*) AS count FROM table GROUP BY name;
Will produce output like this:
name count
-------------
barry 2
dave 3
bill 1
john 1
Which is obviously very different from using DISTINCT. If you want to group your results, use GROUP BY, if you just want a unique list of a specific column, use DISTINCT. This will give your database a chance to optimise the query for your needs.
If you are using a GROUP BY without any aggregate function then internally it will treated as DISTINCT, so in this case there is no difference between GROUP BY and DISTINCT.
But when you are provided with DISTINCT clause better to use it for finding your unique records because the objective of GROUP BY is to achieve aggregation.
They have different semantics, even if they happen to have equivalent results on your particular data.
Please don't use GROUP BY when you mean DISTINCT, even if they happen to work the same. I'm assuming you're trying to shave off milliseconds from queries, and I have to point out that developer time is orders of magnitude more expensive than computer time.
In Teradata perspective :
From a result set point of view, it does not matter if you use DISTINCT or GROUP BY in Teradata. The answer set will be the same.
From a performance point of view, it is not the same.
To understand what impacts performance, you need to know what happens on Teradata when executing a statement with DISTINCT or GROUP BY.
In the case of DISTINCT, the rows are redistributed immediately without any preaggregation taking place, while in the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.
Don’t think now that GROUP BY is always better from a performance point of view. When you have many different values, the preaggregation step of GROUP BY is not very efficient. Teradata has to sort the data to remove duplicates. In this case, it may be better to the redistribution first, i.e. use the DISTINCT statement. Only if there are many duplicate values, the GROUP BY statement is probably the better choice as only once the deduplication step takes place, after redistribution.
In short, DISTINCT vs. GROUP BY in Teradata means:
GROUP BY -> for many duplicates
DISTINCT -> no or a few duplicates only .
At times, when using DISTINCT, you run out of spool space on an AMP. The reason is that redistribution takes place immediately, and skewing could cause AMPs to run out of space.
If this happens, you have probably a better chance with GROUP BY, as duplicates are already removed in a first step, and less data is moved across the AMPs.
group by is used in aggregate operations -- like when you want to get a count of Bs broken down by column C
select C, count(B) from myTbl group by C
distinct is what it sounds like -- you get unique rows.
In sql server 2005, it looks like the query optimizer is able to optimize away the difference in the simplistic examples I ran. Dunno if you can count on that in all situations, though.
In that particular query there is no difference. But, of course, if you add any aggregate columns then you'll have to use group by.
You're only noticing that because you are selecting a single column.
Try selecting two fields and see what happens.
Group By is intended to be used like this:
SELECT name, SUM(transaction) FROM myTbl GROUP BY name
Which would show the sum of all transactions for each person.
From a 'SQL the language' perspective the two constructs are equivalent and which one you choose is one of those 'lifestyle' choices we all have to make. I think there is a good case for DISTINCT being more explicit (and therefore is more considerate to the person who will inherit your code etc) but that doesn't mean the GROUP BY construct is an invalid choice.
I think this 'GROUP BY is for aggregates' is the wrong emphasis. Folk should be aware that the set function (MAX, MIN, COUNT, etc) can be omitted so that they can understand the coder's intent when it is.
The ideal optimizer will recognize equivalent SQL constructs and will always pick the ideal plan accordingly. For your real life SQL engine of choice, you must test :)
PS note the position of the DISTINCT keyword in the select clause may produce different results e.g. contrast:
SELECT COUNT(DISTINCT C) FROM myTbl;
SELECT DISTINCT COUNT(C) FROM myTbl;
I know it's an old post. But it happens that I had a query that used group by just to return distinct values when using that query in toad and oracle reports everything worked fine, I mean a good response time. When we migrated from Oracle 9i to 11g the response time in Toad was excellent but in the reporte it took about 35 minutes to finish the report when using previous version it took about 5 minutes.
The solution was to change the group by and use DISTINCT and now the report runs in about 30 secs.
I hope this is useful for someone with the same situation.
Sometimes they may give you the same results but they are meant to be used in different sense/case. The main difference is in syntax.
Minutely notice the example below. DISTINCT is used to filter out the duplicate set of values. (6, cs, 9.1) and (1, cs, 5.5) are two different sets. So DISTINCT is going to display both the rows while GROUP BY Branch is going to display only one set.
SELECT * FROM student;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 2 | mech | 6.3 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 1 | cs | 5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT DISTINCT * FROM student;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 2 | mech | 6.3 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 1 | cs | 5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT * FROM student GROUP BY Branch;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 2 | mech | 6.3 |
+------+--------+------+
4 rows in set (0.001 sec)
Sometimes the results that can be achieved by GROUP BY clause is not possible to achieved by DISTINCT without using some extra clause or conditions. E.g in above case.
To get the same result as DISTINCT you have to pass all the column names in GROUP BY clause like below. So see the syntactical difference. You must have knowledge about all the column names to use GROUP BY clause in that case.
SELECT * FROM student GROUP BY Id, Branch, CGPA;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 1 | cs | 5.5 |
| 2 | mech | 6.3 |
| 3 | civil | 7.2 |
| 4 | eee | 8.2 |
| 6 | cs | 9.1 |
+------+--------+------+
Also I have noticed GROUP BY displays the results in ascending order by default which DISTINCT does not. But I am not sure about this. It may be differ vendor wise.
Source : https://dbjpanda.me/dbms/languages/sql/sql-syntax-with-examples#group-by
In terms of usage, GROUP BY is used for grouping those rows you want to calculate. DISTINCT will not do any calculation. It will show no duplicate rows.
I always used DISTINCT if I want to present data without duplicates.
If I want to do calculations like summing up the total quantity of mangoes, I will use GROUP BY
In Hive (HQL), GROUP BY can be way faster than DISTINCT, because the former does not require comparing all fields in the table.
See: https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct.
The way I always understood it is that using distinct is the same as grouping by every field you selected in the order you selected them.
i.e:
select distinct a, b, c from table;
is the same as:
select a, b, c from table group by a, b, c
Funtional efficiency is totally different.
If you would like to select only "return value" except duplicate one, use distinct is better than group by. Because "group by" include ( sorting + removing ) , "distinct" include ( removing )
Generally we can use DISTINCT for eliminate the duplicates on Specific Column in the table.
In Case of 'GROUP BY' we can Apply the Aggregation Functions like
AVG, MAX, MIN, SUM, and COUNT on Specific column and fetch
the column name and it aggregation function result on the same column.
Example :
select specialColumn,sum(specialColumn) from yourTableName group by specialColumn;
There is no significantly difference between group by and distinct clause except the usage of aggregate functions.
Both can be used to distinguish the values but if in performance point of view group by is better.
When distinct keyword is used , internally it used sort operation which can be view in execution plan.
Try simple example
Declare #tmpresult table
(
Id tinyint
)
Insert into #tmpresult
Select 5
Union all
Select 2
Union all
Select 3
Union all
Select 4
Select distinct
Id
From #tmpresult