SQL WHERE Clause not filtering to criteria MS-Access - sql

I have written a query to gather the balances of two different days, find the percent difference and then display them. I added a Percent Filter section to my form to show only values that are >= the desired percentage.
When running the query, I get the results that are >= percent given. However, after the criteria is met, the results expand past and continue until 0, as if ignoring my WHERE clause. Is there something I'm not catching within my query?
Query being used:
SELECT [x].[ID], [x].[Name], [x].[Day1Date], [x].[Day1Bal], [x].[Day2Date], [x].[Day2Bal], [x].[Difference], IIf(([Day2Bal]>[Day1Bal]),((([Day2Bal]-[Day1Bal])/[Day1Bal])*100),(((([Day2Bal]-[Day1Bal])/[Day1Bal])*-1)*100)) AS PerDiff
FROM qryUnion AS x
WHERE IIf(([Day2Bal]>[Day1Bal]),((([Day2Bal]-[Day1Bal])/[Day1Bal])*100),(((([Day2Bal]-[Day1Bal])/[Day1Bal])*-1)*100)) > [Forms]![Compare]![txtPercent]
ORDER BY IIf(([Day2Bal]>[Day1Bal]),((([Day2Bal]-[Day1Bal])/[Day1Bal])*100),(((([Day2Bal]-[Day1Bal])/[Day1Bal])*-1)*100)) DESC
I have edited and re-written my IIf statement countless times but it still doesn't filter to criteria properly.
Results (Filtered for >= 10%) :
+----------+
| PerDiff |
+----------+
| 985.256 |
| 457.25 |
| 369.54 |
| 245.21 |
| 141.14 |
| 68.23 |
| 28.54 |
| 10.21454 |
| 10.1212 | <------- Criteria met
| 9.555 |
| 8.42 |
| 2.12 |
| 0.42 | <------- Ends at 0
+----------+
Obviously I want the results to end where the criteria is met, and I believe I've written my WHERE clause to do so. I'm uncertain where else I might be going wrong.
qryUnion is a subquery I wrote just to get the dates and their balances.
Any help is greatly appreciated! I'm still a bit new to SQL (and VBA, for that matter). Thanks in advance!
EDIT1:
I have also tried
WHERE IIf(([Day2Bal]>[Day1Bal]),((([Day2Bal]-[Day1Bal])/[Day1Bal])*100),(((([Day2Bal]-[Day1Bal])/[Day1Bal])*-1)*100)) >= [Forms]![Compare]![txtPercent] _
AND NOT IIf(([Day2Bal]>[Day1Bal]),((([Day2Bal]-[Day1Bal])/[Day1Bal])*100),(((([Day2Bal]-[Day1Bal])/[Day1Bal])*-1)*100)) < [Forms]![Compare]![txtPercent]
The idea was to exclude any rows below the given percentage, but this didn't work either. Is it possible that my WHERE clause isn't the issue? I'm uncertain where else the issue may lie.

A better answer may exist, but this will also accomplish your goal:
You can create a subquery for PerDiff field before writing the final query:
SELECT [x].[ID], [x].[Name], [x].[Day1Date], [x].[Day1Bal], [x].[Day2Date], [x].[Day2Bal], [x].[Difference], IIf(([Day2Bal]>[Day1Bal]),((([Day2Bal]-[Day1Bal])/[Day1Bal])*100),(((([Day2Bal]-[Day1Bal])/[Day1Bal])*-1)*100)) AS PerDiff
FROM qryUnion AS x
This subquery exposes the result of the IIf expression as the column PerDiff, which the next query can then reference. The final query's WHERE clause could then look like this:
WHERE PerDiff > [Forms]![Compare]![txtPercent]
ORDER BY PerDiff DESC
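A minimal sketch of the subquery approach, using Python's built-in sqlite3 module as a stand-in for Access. The balances table, its sample rows and the min_percent variable (standing in for [Forms]![Compare]![txtPercent]) are assumptions made for illustration, and CASE replaces IIf because older SQLite versions lack it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for qryUnion, reduced to the columns that matter here.
conn.execute("CREATE TABLE balances (id INTEGER, day1_bal REAL, day2_bal REAL)")
conn.executemany(
    "INSERT INTO balances VALUES (?, ?, ?)",
    [(1, 100.0, 150.0), (2, 400.0, 375.0), (3, 100.0, 75.0), (4, 64.0, 65.0)],
)

min_percent = 10  # assumption: plays the role of the form's txtPercent value

rows = conn.execute(
    """
    SELECT id, per_diff
    FROM (
        -- Inner query: compute the percent difference once and name it.
        SELECT id,
               CASE WHEN day2_bal > day1_bal
                    THEN (day2_bal - day1_bal) / day1_bal * 100
                    ELSE (day2_bal - day1_bal) / day1_bal * -1 * 100
               END AS per_diff
        FROM balances
    ) AS sub
    -- Outer query: filter and sort on the named column.
    WHERE per_diff >= ?
    ORDER BY per_diff DESC
    """,
    (min_percent,),
).fetchall()
print(rows)
```

Because the percentage is computed once in the inner query, the outer WHERE and ORDER BY can refer to it by name instead of repeating the expression three times.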

After a ton of troubleshooting, it seems my issue was the * 100 within my IIf expression.
SQL that worked:
SELECT [x].[DDANbr], [x].[Name], [x].[Day1Date], [x].[Day1Bal], [x].[Day2Date], [x].[Day2Bal], [x].[Difference], IIf(([Day2Bal]>[Day1Bal]),((([Day2Bal]-[Day1Bal])/[Day1Bal])),(((([Day2Bal]-[Day1Bal])/[Day1Bal])*-1))) AS PerDiff
FROM qry250CapAllCompare_Union AS x
--Added /100 at the end of WHERE clause to ensure that I was getting 10% because math
WHERE IIf(([Day2Bal]>[Day1Bal]),((([Day2Bal]-[Day1Bal])/[Day1Bal])),(((([Day2Bal]-[Day1Bal])/[Day1Bal])*-1)))>=Forms!frmCompare!txtPercent/100
ORDER BY IIf(([Day2Bal]>[Day1Bal]),((([Day2Bal]-[Day1Bal])/[Day1Bal])),(((([Day2Bal]-[Day1Bal])/[Day1Bal])*-1))) DESC

Related

Union results either Blank or Unfiltered

We have data that separates Paid and Rejected claims. I need to see results of both and therefore have to do a union.
(Our data is a mess. I am also aliasing for confidentiality/HIPAA compliance. Please try not to get hung up on those parts because I can't change it.)
SELECT CustID, code, date, 'Paid' AS Srce
FROM Paid.Claims
INNER JOIN Paid.Medical
ON Paid.Claims.id = Paid.Medical.id
AND Paid.Claims.blind = Paid.Medical.blind
WHERE Paid.Claims.date BETWEEN '2022-01-01' AND '2022-06-07'
AND Paid.Medical.code IN ('88521','88522','88523','88524','88525')
AND Paid.Claims.custID IN ('N065468','N095843','N001086')
UNION
SELECT CustID, code, date, 'Filter' AS Srce
FROM Rejected.Claims
INNER JOIN Rejected.Medical
ON Rejected.Claims.id = Rejected.Medical.id
AND Rejected.Claims.blind = Rejected.Medical.blind
WHERE Rejected.Claims.date BETWEEN '2022-01-01' AND '2022-06-07'
AND Rejected.Medical.code IN ('88521','88522','88523','88524','88525')
AND Rejected.Claims.custID IN ('N065468','N095843','N001086')
It's based on a query that my predecessor wrote. That one works, but it's also much simpler because it pulls fewer fields from fewer places. My outcomes so far have been:
Leave the WHERE clause out of the Paid query but keep it in the Rejected query, and get EVERY RESULT. None of the filtering seems to be working.
Include the WHERE clause in both and get no results. The filtering is still not working, but in the opposite direction.
I have also tried
SELECT *
FROM (
everything above with and without filters
) AS results
WHERE <filters same as above>
And results set is empty.
I have tried with and without aliasing with no changes in what's returned.
I'm expecting about 200 results that SHOULD look something like this:
| CustID | code | date | Srce |
| ------- | ----- | ---------- | ------ |
| N065468 | 88522 | 2022-04-04 | Paid |
| N095843 | 88521 | 2022-03-09 | Paid |
| N001086 | 88524 | 2022-05-20 | Filter |
Back to troubleshooting.
It's hard to say without sample data, as someone else has pointed out. Without any data, I'd recommend running each query individually and seeing what results you get, i.e. if you run just the top half for paid claims, does that give you the rows you're expecting? Then the bottom half? The filtering arguments can be case-sensitive, so I'd also recommend checking the case of the values you filter on.
We have many duplicate fields as a result of our flattening process. My formatting may have been correct, but it turns out I was pulling from some of the wrong places. I solved the problem by creating a franken-query using several others made by my predecessor with a similar union element. Depending on how I was attempting to alias everything, the filters were likely pulling from different fields than the ones I was select-ing.

Why does adding a SUM(column) throw a group by error [SQL]

I found some similar questions, but none of the solutions would work, nor did they explain what was causing the issue.
I have a working query
SELECT pages.pageString pageName, timeSpent
FROM
(SELECT `page_id`, SUM(`time_spent`) as timeSpent
FROM `pageViews`
WHERE `time_spent` > 0
GROUP BY `page_id`) myTable
JOIN pages ON pages.id = page_id
ORDER BY timeSpent DESC
LIMIT 5
This returns results that look like
+------------------------------+-----------+
| pageName | timeSpent |
+------------------------------+-----------+
| page 1 | 394292 |
| page 2 | 66990 |
| page 3 | 53896 |
| page 4 | 37796 |
| page 5 | 14982 |
+------------------------------+-----------+
I'd like to add a column containing the percentage of timeSpent relative to the other pages. To start, I added a SUM(timeSpent) to my query, but that throws an error:
In aggregated query without GROUP BY, expression #1 of SELECT list contains nonaggregated column 'pages.pageString'
I'm not sure why this column is affected by adding the new column to the SELECT statement.
Sadly any solution involving changing sql settings won't work due to company policy.
I appreciate any advice
UPDATE
The failing sql statement is
SELECT pages.pageString pageName, timeSpent, SUM(timeSpent) FROM
(SELECT `page_id`, SUM(`time_spent`) as timeSpent FROM
`pageViews` WHERE `time_spent` > 0 GROUP BY `page_id`) myTable
JOIN pages ON pages.id = page_id ORDER BY timeSpent DESC LIMIT 5
As per the first answer, I added a GROUP BY, which solves the error
SELECT pages.pageString pageName, timeSpent, SUM(timeSpent) FROM
(SELECT `page_id`, SUM(`time_spent`) as timeSpent FROM `pageViews` WHERE `time_spent` > 0 GROUP BY `page_id`) myTable
JOIN pages ON pages.id = page_id GROUP BY pageName ORDER BY timeSpent DESC LIMIT 5
This however does not give the proper output
+------------------------------+-----------+----------------+
| pageName | timeSpent | SUM(timeSpent) |
+------------------------------+-----------+----------------+
| page 1. | 390210 | 390210 |
| page 2 | 66972 | 66972 |
| page3 | 52332 | 52332 |
| page4 | 25454 | 25454 |
| page5 | 13552 | 13552 |
+------------------------------+-----------+----------------+
Ideally this SUM(timeSpent) would be 390210+ 66972 + 52332 + 25454 + 13552 so that I may do timeSpent / SUM(timeSpent)
You did not say where you tried to put the sum(timeSpent) but I believe one can try to reconstruct with the error message:
In aggregated query without GROUP BY, expression #1 of SELECT list contains nonaggregated column 'pages.pageString'
It says what the problem is. You added sum(timeSpent) to the projection, but the SQL statement does not have a GROUP BY, in particular it mentions the first item which should be aggregated pages.pageString.
It would mention the other ones too, once you fix this one.
On the other hand, please make sure you post exactly the failing SQL statement instead of trying to describe how to get the error you have. It's better for us who try to help.
Update:
You have two tables/views, pages and pageViews. The first one is only used to get the page name. I would focus on the time calculation first to make things easier; figuring out the name afterwards is simple, because it is directly connected to the page_id.
1. The first piece of information you want is the sum of all times spent, so that you can calculate the ratio to this sum. This is simply an aggregation where you sum the times over all pages.
2. The second piece of information you want is the sum of the times per page_id. You already know how to do that: you group by the page_id while aggregating the sums of each.
3. Now try to put those two together. The result of the first statement should be applied to each row of the second statement, so that you get a table of the form page_id, time_spent_page, time_spent_all.
4. Once you have step 3, it is easy to add the page_name, since you have the page_id, which is all a simple join requires.
I tried not to give away the solution. Maybe you'd like to try again following the steps above. If you have difficulties, simply leave a comment (maybe showing how far you got).
It might look complex in the beginning, but once you have done it successfully I hope you'll see that it can be simple.
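For readers who want to check their attempt, here is one possible way to combine the pieces without window functions, sketched with Python's sqlite3 module standing in for MySQL. The sample rows and the per_page / total aliases are made up for illustration, and this is only one of several valid shapes for the query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pages (id INTEGER PRIMARY KEY, pageString TEXT);
    CREATE TABLE pageViews (page_id INTEGER, time_spent INTEGER);
    INSERT INTO pages VALUES (1, 'page 1'), (2, 'page 2'), (3, 'page 3');
    INSERT INTO pageViews VALUES (1, 30), (1, 20), (2, 30), (3, 20), (3, 0);
    """
)

rows = conn.execute(
    """
    SELECT pages.pageString AS pageName,
           per_page.timeSpent,
           total.timeSpentAll,
           100.0 * per_page.timeSpent / total.timeSpentAll AS percent
    FROM (SELECT page_id, SUM(time_spent) AS timeSpent   -- time per page
          FROM pageViews
          WHERE time_spent > 0
          GROUP BY page_id) AS per_page
    CROSS JOIN
         (SELECT SUM(time_spent) AS timeSpentAll         -- time over all pages
          FROM pageViews
          WHERE time_spent > 0) AS total                 -- single row, attached to every page row
    JOIN pages ON pages.id = per_page.page_id            -- look up the page name last
    ORDER BY per_page.timeSpent DESC
    """
).fetchall()
print(rows)
```

The total subquery returns exactly one row, so the CROSS JOIN simply repeats the grand total next to each per-page sum. The 100.0 factor is there because SQLite would otherwise do integer division.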
Adding a column containing the percentage of timeSpent relative to the sum of all pages
SELECT pages.pageString pageName, timeSpent
, timeSpent / sum(timeSpent) over() * 100 p
FROM
(SELECT `page_id`, SUM(`time_spent`) as timeSpent
FROM `pageViews`
WHERE `time_spent` > 0
GROUP BY `page_id`) myTable
JOIN pages ON pages.id = page_id
ORDER BY timeSpent DESC
LIMIT 5
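A small runnable sketch of the same idea, with Python's sqlite3 module standing in for MySQL (window functions need SQLite 3.25 or later, and MySQL 8.0 or later). The sample data is invented, and the multiplication by 100.0 comes first only because SQLite divides integers as integers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pages (id INTEGER PRIMARY KEY, pageString TEXT);
    CREATE TABLE pageViews (page_id INTEGER, time_spent INTEGER);
    INSERT INTO pages VALUES (1, 'page 1'), (2, 'page 2'), (3, 'page 3');
    INSERT INTO pageViews VALUES (1, 30), (1, 20), (2, 30), (3, 20), (3, 0);
    """
)

top_pages = conn.execute(
    """
    SELECT pages.pageString AS pageName,
           timeSpent,
           timeSpent * 100.0 / SUM(timeSpent) OVER () AS p  -- share of the grand total
    FROM (SELECT page_id, SUM(time_spent) AS timeSpent
          FROM pageViews
          WHERE time_spent > 0
          GROUP BY page_id) AS myTable
    JOIN pages ON pages.id = page_id
    ORDER BY timeSpent DESC
    LIMIT 2
    """
).fetchall()
print(top_pages)
```

Note that the window sum is evaluated before LIMIT, so the percentages are relative to all pages and not just to the rows that survive the LIMIT.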

SQL Server Select COUNT without using aggregate function or group by

I'm in a very, very tight situation here. I have an SQL query running on SQL Server 2005:
SELECT col1,col2,col3 FROM myTable
Which of course gives:
col1 | col2 | col3
------------------
1 | a | i
2 | b | ii
etc
I need to, if possible, add a COUNT query so that it will return the number of records returned. I cannot use GROUP BY or an aggregate function (It's a very edge case on some very inflexible software).
Ideally, something like this:
SELECT col1,col2,col3,COUNT(NumberOfRows) as NumRows FROM myTable
col1 | col2 | col3| NumRows
---------------------------
1 | a | i | 2
2 | b | ii | 2
I realise that this is bad. And inefficient. And against all good practices. But I'm in a corner with software whose architecture was frozen in stone in 1991!
Uh, so it turns out my colleague came back with an answer 30 seconds after I asked the question.
The correct syntax is:
SELECT col1,col2,col3,@@ROWCOUNT as NumRows FROM myTable
It looks like @@ROWCOUNT returns the number of rows processed by the previous statement, so I'm not sure that this is a valid solution. I think this is because @@ROWCOUNT is set internally after a statement has run, so it is best read after the query has already completed. Therefore, it won't return the number of rows processed by the query in which it is placed.
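One alternative that does count the rows of the current query is a windowed COUNT(*) OVER (), which SQL Server 2005 supports. Whether it gets past the "no aggregate function" restriction of the software in question is an open assumption, since it is still an aggregate, just without GROUP BY. A sketch using Python's sqlite3 module (SQLite 3.25+) as a stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE myTable (col1 INTEGER, col2 TEXT, col3 TEXT);
    INSERT INTO myTable VALUES (1, 'a', 'i'), (2, 'b', 'ii');
    """
)

# COUNT(*) OVER () counts every row of the result and repeats it on each row.
rows = conn.execute(
    "SELECT col1, col2, col3, COUNT(*) OVER () AS NumRows "
    "FROM myTable ORDER BY col1"
).fetchall()
print(rows)
```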

MYSQL - Combining Two Results in One Query

I have a query I need to perform to show search results for a project. The results need to be sorted by "horsesActiveDate", and this applies to all of them except ads with adtypesID=7. Those are also sorted by date, but they must always appear after all other ads.
So I will have all my ads in the result set be ordered by the Active Date AND adtypesID != 7. After that, I need all adtypesID=7 to be sorted by Active Date and appended at the bottom of all the results.
I'm hoping to put this in one query instead of two and appending them together in PHP. The way the code is written, I have to find a way to get it all in one query.
So here is my original query, which worked great until I had to add adtypesID=7, which has different sorting requirements.
This is the query that exists now that doesn't take into account the adtypesID for sorting.
SELECT
horses.horsesID,
horsesDescription,
horsesActiveDate,
adtypesID,
states.statesName,
horses_images.himagesPath
FROM horses
LEFT JOIN states ON horses.statesID = states.statesID
LEFT JOIN horses_images ON horses_images.himagesDefault = 1 AND horses_images.horsesID = horses.horsesID AND horses_images.himagesPath != ''
WHERE
horses.horsesStud = 0
AND horses.horsesSold = 0
AND horses.horsesID IN
(
SELECT DISTINCT horses.horsesID
FROM horses
LEFT JOIN horses_featured ON horses_featured.horsesID = horses.horsesID
WHERE horses.horsesActive = 1
)
ORDER BY adtypesID, horses.horsesActiveDate DESC
My first thought was to do two queries where one looked for all the ads that did not contain adtypesID=7 and sort those as the query does, then run a second query to find only those ads with adtypesID=7 and sort those by date. Then take those two results and append them to each other. Since I need to get this all into one query, I can't use a php function to do that.
Is there a way to merge the two query results one after the other in mysql? Is there a better way to run this query that will accomplish this sorting?
The Ideal Results would be as below (I modified the column names so they would be shorter):
ID | Description | ActiveDate | adtypesID | statesName | himagesPath
___________________________________________________________________________
3 | Ad Text | 06-01-2010 | 3 | OK | image.jpg
2 | Ad Text | 05-31-2010 | 2 | LA | image1.jpg
9 | Ad Text | 03-01-2010 | 4 | OK | image3.jpg
6 | Ad Text | 06-01-2010 | 7 | OK | image5.jpg
6 | Ad Text | 05-01-2010 | 7 | OK | image5.jpg
6 | Ad Text | 04-01-2010 | 7 | OK | image5.jpg
Any help that can be provided will be greatly appreciated!
I am not sure about the exact syntax in MySQL, but something like
ORDER BY case when adtypesID = 7 then 2 else 1 end ASC, horses.horsesActiveDate DESC
would work in many other SQL dialects.
Note that most SQL dialects allow the order by to not only be a column, but an expression.
This should work:
ORDER BY (adtypesID = 7) ASC, horses.horsesActiveDate DESC
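A quick way to convince yourself that both ORDER BY variants give the same result, sketched with Python's sqlite3 module instead of MySQL. The table is cut down to three columns and the rows are made up; SQLite, like MySQL, evaluates (adtypesID = 7) to 0 or 1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE horses (horsesID INTEGER, adtypesID INTEGER, horsesActiveDate TEXT)"
)
conn.executemany(
    "INSERT INTO horses VALUES (?, ?, ?)",
    [
        (6, 7, "2010-06-01"),
        (9, 4, "2010-03-01"),
        (3, 3, "2010-06-01"),
        (7, 7, "2010-05-01"),
        (2, 2, "2010-05-31"),
    ],
)

# Boolean expression: 0 for ordinary ads, 1 for adtypesID = 7, so 7s sort last.
by_boolean = [r[0] for r in conn.execute(
    "SELECT horsesID FROM horses "
    "ORDER BY (adtypesID = 7) ASC, horsesActiveDate DESC"
)]

# Portable CASE form of the same sort key.
by_case = [r[0] for r in conn.execute(
    "SELECT horsesID FROM horses "
    "ORDER BY CASE WHEN adtypesID = 7 THEN 2 ELSE 1 END ASC, horsesActiveDate DESC"
)]
print(by_boolean, by_case)
```

Both lists come out with the ordinary ads first (newest to oldest), followed by the adtypesID=7 ads (again newest to oldest), all in a single query.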
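Use a UNION to append two queries together, with each filter in its WHERE clause, like this:
(SELECT whatever FROM wherever WHERE adtypesID != 7 ORDER BY something)
UNION
(SELECT another FROM somewhere WHERE adtypesID = 7 ORDER BY whocares)
Be aware that, per the MySQL manual, an ORDER BY inside an individual SELECT does not determine the order of rows in the final UNION result, so an ORDER BY on the whole UNION is still needed to guarantee the order.
http://dev.mysql.com/doc/refman/5.0/en/union.html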
I re-wrote your query as:
SELECT h.horsesID,
h.horsesDescription,
h.horsesActiveDate,
adtypesID,
s.statesName,
hi.himagesPath
FROM HORSES h
LEFT JOIN STATES s ON s.statesID = h.statesID
LEFT JOIN HORSES_IMAGES hi ON hi.horsesID = h.horsesID
AND hi.himagesDefault = 1
AND hi.himagesPath != ''
LEFT JOIN HORSES_FEATURED hf ON hf.horsesID = h.horsesID
WHERE h.horsesStud = 0
AND h.horsesSold = 0
AND h.horsesActive = 1
ORDER BY (adtypesID = 7) ASC, h.horsesActiveDate DESC
The IN subquery, using a LEFT JOIN and such, will mean that any horse record whose horsesActive value is 1 will be returned - regardless if they have an associated HORSES_FEATURED record. I leave it to you for checking your data to decide if it should really be an INNER JOIN. Likewise for the STATES table relationship...

Is there any difference between GROUP BY and DISTINCT

I learned something simple about SQL the other day:
SELECT c FROM myTbl GROUP BY C
Has the same result as:
SELECT DISTINCT C FROM myTbl
What I am curious of, is there anything different in the way an SQL engine processes the command, or are they truly the same thing?
I personally prefer the distinct syntax, but I am sure it's more out of habit than anything else.
EDIT: This is not a question about aggregates. The use of GROUP BY with aggregate functions is understood.
MusiGenesis' response is functionally the correct one with regard to your question as stated; the SQL Server is smart enough to realize that if you are using "Group By" and not using any aggregate functions, then what you actually mean is "Distinct" - and therefore it generates an execution plan as if you'd simply used "Distinct."
However, I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotchas down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregates and one of which is not.
A hammer can work to drive in a screw sometimes, but if you've got a screwdriver handy, why bother?
(for the purposes of this analogy, Hammer : Screwdriver :: GroupBy : Distinct and screw => get list of unique values in a table column)
GROUP BY lets you use aggregate functions, like AVG, MAX, MIN, SUM, and COUNT.
On the other hand DISTINCT just removes duplicates.
For example, if you have a bunch of purchase records, and you want to know how much was spent by each department, you might do something like:
SELECT department, SUM(amount) FROM purchases GROUP BY department
This will give you one row per department, containing the department name and the sum of all of the amount values in all rows for that department.
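To make the contrast concrete, here is a small sketch run through Python's sqlite3 module. The purchases rows are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (department TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("books", 10), ("tools", 5), ("books", 2), ("tools", 20)],
)

# GROUP BY: one row per department, with an aggregate computed per group.
totals = conn.execute(
    "SELECT department, SUM(amount) FROM purchases "
    "GROUP BY department ORDER BY department"
).fetchall()

# DISTINCT: only removes duplicate department names, no per-group calculation.
names = conn.execute(
    "SELECT DISTINCT department FROM purchases ORDER BY department"
).fetchall()
print(totals, names)
```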
What's the difference from a mere duplicate removal functionality point of view
Apart from the fact that unlike DISTINCT, GROUP BY allows for aggregating data per group (which has been mentioned by many other answers), the most important difference in my opinion is the fact that the two operations "happen" at two very different steps in the logical order of operations that are executed in a SELECT statement.
Here are the most important operations:
FROM (including JOIN, APPLY, etc.)
WHERE
GROUP BY (can remove duplicates)
Aggregations
HAVING
Window functions
SELECT
DISTINCT (can remove duplicates)
UNION, INTERSECT, EXCEPT (can remove duplicates)
ORDER BY
OFFSET
LIMIT
As you can see, the logical order of each operation influences what can be done with it and how it influences subsequent operations. In particular, the fact that the GROUP BY operation "happens before" the SELECT operation (the projection) means that:
It doesn't depend on the projection (which can be an advantage)
It cannot use any values from the projection (which can be a disadvantage)
1. It doesn't depend on the projection
An example where not depending on the projection is useful is if you want to calculate window functions on distinct values:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
GROUP BY rating
When run against the Sakila database, this yields:
rating rn
-----------
G 1
NC-17 2
PG 3
PG-13 4
R 5
The same couldn't be achieved with DISTINCT easily:
SELECT DISTINCT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
That query is "wrong" and yields something like:
rating rn
------------
G 1
G 2
G 3
...
G 178
NC-17 179
NC-17 180
...
This is not what we wanted. The DISTINCT operation "happens after" the projection, so we can no longer remove DISTINCT ratings because the window function was already calculated and projected. In order to use DISTINCT, we'd have to nest that part of the query:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM (
SELECT DISTINCT rating FROM film
) f
Side-note: In this particular case, we could also use DENSE_RANK()
SELECT DISTINCT rating, dense_rank() OVER (ORDER BY rating) AS rn
FROM film
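The three variants above can be tried on a toy table. This is only a sketch using Python's sqlite3 module (SQLite 3.25+ for window functions) with six made-up film rows, not the Sakila database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (rating TEXT)")
conn.executemany(
    "INSERT INTO film VALUES (?)",
    [("G",), ("G",), ("PG",), ("PG",), ("PG",), ("R",)],
)

# GROUP BY runs before the window function: one numbered row per rating.
grouped = conn.execute(
    "SELECT rating, row_number() OVER (ORDER BY rating) AS rn "
    "FROM film GROUP BY rating ORDER BY rn"
).fetchall()

# DISTINCT runs after the window function: every (rating, rn) pair is
# already unique, so nothing is removed.
distinct = conn.execute(
    "SELECT DISTINCT rating, row_number() OVER (ORDER BY rating) AS rn "
    "FROM film ORDER BY rn"
).fetchall()

# dense_rank() gives equal ratings the same number, so DISTINCT can collapse them.
dense = conn.execute(
    "SELECT DISTINCT rating, dense_rank() OVER (ORDER BY rating) AS rn "
    "FROM film ORDER BY rn"
).fetchall()
print(grouped, len(distinct), dense)
```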
2. It cannot use any values from the projection
One of SQL's drawbacks is its verbosity at times. For the same reason as what we've seen before (namely the logical order of operations), we cannot "easily" group by something we're projecting.
This is invalid SQL:
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY name
This is valid (repeating the expression)
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY first_name || ' ' || last_name
This is valid, too (nesting the expression)
SELECT name
FROM (
SELECT first_name || ' ' || last_name AS name
FROM customer
) c
GROUP BY name
I've written about this topic more in depth in a blog post
There is no difference (in SQL Server, at least). Both queries use the same execution plan.
http://sqlmag.com/database-performance-tuning/distinct-vs-group
Maybe there is a difference, if there are sub-queries involved:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
There is no difference (Oracle-style):
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:32961403234212
Use DISTINCT if you just want to remove duplicates. Use GROUP BY if you want to apply aggregate operators (MAX, SUM, GROUP_CONCAT, ..., or a HAVING clause).
I expect there is the possibility for subtle differences in their execution.
I checked the execution plans for two functionally equivalent queries along these lines in Oracle 10g:
core> select sta from zip group by sta;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH GROUP BY | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
core> select distinct sta from zip;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH UNIQUE | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
The middle operation is slightly different: "HASH GROUP BY" vs. "HASH UNIQUE", but the estimated costs etc. are identical. I then executed these with tracing on and the actual operation counts were the same for both (except that the second one didn't have to do any physical reads due to caching).
But I think that because the operation names are different, the execution would follow somewhat different code paths and that opens the possibility of more significant differences.
I think you should prefer the DISTINCT syntax for this purpose. It's not just habit, it more clearly indicates the purpose of the query.
For the query you posted, they are identical. But for other queries that may not be true.
For example, it's not the same as:
SELECT C FROM myTbl GROUP BY C, D
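A tiny illustration of why that query differs, run through Python's sqlite3 module with three invented rows. Grouping by a column that is not selected can leave what look like duplicates in the output:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTbl (C TEXT, D INTEGER)")
conn.executemany(
    "INSERT INTO myTbl VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", 1)],
)

distinct_c = conn.execute("SELECT DISTINCT C FROM myTbl ORDER BY C").fetchall()

# One output row per (C, D) group, but only C is shown, so 'a' appears twice.
grouped_cd = conn.execute(
    "SELECT C FROM myTbl GROUP BY C, D ORDER BY C"
).fetchall()
print(distinct_c, grouped_cd)
```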
I read all the above comments but didn't see anyone point to the main difference between GROUP BY and DISTINCT apart from the aggregation bit.
DISTINCT returns all the rows and then de-duplicates them, whereas GROUP BY de-duplicates the rows as the algorithm reads them one by one.
This means they can produce different results!
For example, the below codes generate different results:
SELECT distinct ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
SELECT ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
GROUP BY Name
If there are 10 names in the table and one of them is a duplicate of another, then the first query returns 10 rows whereas the second query returns 9 rows.
The reason is what I said above so they can behave differently!
If you use DISTINCT with multiple columns, the result set won't be grouped as it will with GROUP BY, and you can't use aggregate functions with DISTINCT.
GROUP BY has a very specific meaning that is distinct (heh) from the DISTINCT function.
GROUP BY causes the query results to be grouped using the chosen expression, aggregate functions can then be applied, and these will act on each group, rather than the entire resultset.
Here's an example that might help:
Given a table that looks like this:
name
------
barry
dave
bill
dave
dave
barry
john
This query:
SELECT name, count(*) AS count FROM table GROUP BY name;
Will produce output like this:
name count
-------------
barry 2
dave 3
bill 1
john 1
Which is obviously very different from using DISTINCT. If you want to group your results, use GROUP BY, if you just want a unique list of a specific column, use DISTINCT. This will give your database a chance to optimise the query for your needs.
If you use GROUP BY without any aggregate function, then internally it will be treated as DISTINCT, so in this case there is no difference between GROUP BY and DISTINCT.
But when a DISTINCT clause is available, it is better to use it for finding unique records, because the purpose of GROUP BY is aggregation.
They have different semantics, even if they happen to have equivalent results on your particular data.
Please don't use GROUP BY when you mean DISTINCT, even if they happen to work the same. I'm assuming you're trying to shave off milliseconds from queries, and I have to point out that developer time is orders of magnitude more expensive than computer time.
From a Teradata perspective:
From a result set point of view, it does not matter if you use DISTINCT or GROUP BY in Teradata. The answer set will be the same.
From a performance point of view, it is not the same.
To understand what impacts performance, you need to know what happens on Teradata when executing a statement with DISTINCT or GROUP BY.
In the case of DISTINCT, the rows are redistributed immediately without any preaggregation taking place, while in the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.
Don’t assume that GROUP BY is always better from a performance point of view. When you have many different values, the preaggregation step of GROUP BY is not very efficient, because Teradata has to sort the data to remove duplicates. In this case, it may be better to do the redistribution first, i.e. use the DISTINCT statement. Only if there are many duplicate values is the GROUP BY statement probably the better choice, since the duplicates are then removed before redistribution.
In short, DISTINCT vs. GROUP BY in Teradata means:
GROUP BY -> for many duplicates
DISTINCT -> no or only a few duplicates.
At times, when using DISTINCT, you run out of spool space on an AMP. The reason is that redistribution takes place immediately, and skewing could cause AMPs to run out of space.
If this happens, you probably have a better chance with GROUP BY, as duplicates are already removed in a first step, and less data is moved across the AMPs.
group by is used in aggregate operations -- like when you want to get a count of Bs broken down by column C
select C, count(B) from myTbl group by C
distinct is what it sounds like -- you get unique rows.
In sql server 2005, it looks like the query optimizer is able to optimize away the difference in the simplistic examples I ran. Dunno if you can count on that in all situations, though.
In that particular query there is no difference. But, of course, if you add any aggregate columns then you'll have to use group by.
You're only noticing that because you are selecting a single column.
Try selecting two fields and see what happens.
Group By is intended to be used like this:
SELECT name, SUM(transaction) FROM myTbl GROUP BY name
Which would show the sum of all transactions for each person.
From a 'SQL the language' perspective the two constructs are equivalent and which one you choose is one of those 'lifestyle' choices we all have to make. I think there is a good case for DISTINCT being more explicit (and therefore is more considerate to the person who will inherit your code etc) but that doesn't mean the GROUP BY construct is an invalid choice.
I think this 'GROUP BY is for aggregates' is the wrong emphasis. Folk should be aware that the set function (MAX, MIN, COUNT, etc) can be omitted so that they can understand the coder's intent when it is.
The ideal optimizer will recognize equivalent SQL constructs and will always pick the ideal plan accordingly. For your real life SQL engine of choice, you must test :)
PS note the position of the DISTINCT keyword in the select clause may produce different results e.g. contrast:
SELECT COUNT(DISTINCT C) FROM myTbl;
SELECT DISTINCT COUNT(C) FROM myTbl;
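A short sketch of that contrast, using Python's sqlite3 module and a made-up four-row table (including one NULL, which COUNT ignores in both forms):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTbl (C TEXT)")
conn.executemany(
    "INSERT INTO myTbl VALUES (?)",
    [("a",), ("a",), ("b",), (None,)],
)

# DISTINCT inside the aggregate: counts the different non-NULL values of C.
count_distinct = conn.execute("SELECT COUNT(DISTINCT C) FROM myTbl").fetchall()

# DISTINCT outside: counts all non-NULL values, then de-duplicates the
# single-row result, which changes nothing.
distinct_count = conn.execute("SELECT DISTINCT COUNT(C) FROM myTbl").fetchall()
print(count_distinct, distinct_count)
```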
I know it's an old post, but I had a query that used GROUP BY just to return distinct values. When running that query in Toad and in Oracle Reports, everything worked fine, I mean with a good response time. When we migrated from Oracle 9i to 11g, the response time in Toad was still excellent, but the report took about 35 minutes to finish, whereas with the previous version it took about 5 minutes.
The solution was to change the group by and use DISTINCT and now the report runs in about 30 secs.
I hope this is useful for someone with the same situation.
Sometimes they may give you the same results, but they are meant to be used in different cases. The main difference is in syntax.
Look closely at the example below. DISTINCT is used to filter out duplicate sets of values. (6, cs, 9.1) and (1, cs, 5.5) are two different sets, so DISTINCT is going to display both rows, while GROUP BY Branch is going to display only one of them.
SELECT * FROM student;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 2 | mech | 6.3 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 1 | cs | 5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT DISTINCT * FROM student;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 2 | mech | 6.3 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 1 | cs | 5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT * FROM student GROUP BY Branch;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 3 | civil | 7.2 |
| 6 | cs | 9.1 |
| 4 | eee | 8.2 |
| 2 | mech | 6.3 |
+------+--------+------+
4 rows in set (0.001 sec)
Sometimes the results that can be achieved with a GROUP BY clause cannot be achieved with DISTINCT without some extra clause or condition, as in the case above.
To get the same result as DISTINCT, you have to pass all the column names in the GROUP BY clause, as below. Note the syntactical difference: you must know all the column names to use the GROUP BY clause in that case.
SELECT * FROM student GROUP BY Id, Branch, CGPA;
+------+--------+------+
| Id | Branch | CGPA |
+------+--------+------+
| 1 | cs | 5.5 |
| 2 | mech | 6.3 |
| 3 | civil | 7.2 |
| 4 | eee | 8.2 |
| 6 | cs | 9.1 |
+------+--------+------+
Also, I have noticed that GROUP BY displays the results in ascending order by default, which DISTINCT does not. But I am not sure about this; it may differ by vendor.
Source : https://dbjpanda.me/dbms/languages/sql/sql-syntax-with-examples#group-by
In terms of usage, GROUP BY is used for grouping those rows you want to calculate. DISTINCT will not do any calculation. It will show no duplicate rows.
I always used DISTINCT if I want to present data without duplicates.
If I want to do calculations like summing up the total quantity of mangoes, I will use GROUP BY
In Hive (HQL), GROUP BY can be way faster than DISTINCT, because the former does not require comparing all fields in the table.
See: https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct.
The way I always understood it is that using distinct is the same as grouping by every field you selected in the order you selected them.
i.e:
select distinct a, b, c from table;
is the same as:
select a, b, c from table group by a, b, c
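This equivalence is easy to check on a small example. The sketch below uses Python's sqlite3 module with a hypothetical table t and made-up rows; an explicit ORDER BY is added to both queries because neither form guarantees an output order by itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT, c INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "x", 1), (1, "x", 1), (2, "y", 2), (1, "y", 1)],
)

with_distinct = conn.execute(
    "SELECT DISTINCT a, b, c FROM t ORDER BY a, b, c"
).fetchall()
with_group_by = conn.execute(
    "SELECT a, b, c FROM t GROUP BY a, b, c ORDER BY a, b, c"
).fetchall()
print(with_distinct == with_group_by, with_distinct)
```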
Functional efficiency is totally different.
If you only want to select the values without duplicates, DISTINCT is better than GROUP BY, because GROUP BY involves (sorting + removing) while DISTINCT involves only (removing).
Generally, we can use DISTINCT to eliminate duplicates on a specific column in the table.
With GROUP BY, we can apply aggregate functions like
AVG, MAX, MIN, SUM, and COUNT to a specific column and fetch
the column name together with the aggregate result for that column.
Example :
select specialColumn,sum(specialColumn) from yourTableName group by specialColumn;
There is no significant difference between the GROUP BY and DISTINCT clauses except for the use of aggregate functions.
Both can be used to get distinct values, but from a performance point of view GROUP BY is better.
When the DISTINCT keyword is used, it internally uses a sort operation, which can be seen in the execution plan.
Try simple example
Declare @tmpresult table
(
Id tinyint
)
Insert into @tmpresult
Select 5
Union all
Select 2
Union all
Select 3
Union all
Select 4
Select distinct
Id
From @tmpresult