Getting warning: Null value is eliminated by an aggregate or other SET operation - sql

I have this schema
create table t(id int, d date)
insert into t (id, d) values (1, getdate()),
(2, NULL)
When doing
declare @mindate date
select @mindate = min(d) from t
I get the warning
Null value is eliminated by an aggregate or other SET operation
Why and what can I do about it?

Mostly you should do nothing about it.
It is possible to disable the warning by setting ansi_warnings off, but this has other effects (e.g. on how division by zero is handled) and can cause failures when your queries use features such as indexed views, computed columns or XML methods.
In some limited cases you can rewrite the aggregate to avoid it, e.g. COUNT(nullable_column) can be rewritten as SUM(CASE WHEN nullable_column IS NULL THEN 0 ELSE 1 END), but this isn't always possible to do straightforwardly without changing the semantics.
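For example, against the t table from the question (a minimal sketch; both statements return 1 for the sample data, but only the first raises the warning):
-- COUNT skips the NULL in d, so the warning is raised
SELECT COUNT(d) FROM t;
-- Equivalent rewrite that avoids the warning
SELECT SUM(CASE WHEN d IS NULL THEN 0 ELSE 1 END) FROM t;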
It's just an informational message required by the SQL standard. Apart from adding unwanted noise to the messages stream it has no ill effects (other than meaning that SQL Server can't simply skip reading NULL rows, which can have an overhead; disabling the warning doesn't give better execution plans in this respect).
The reason for returning this message is that throughout most operations in SQL nulls propagate.
SELECT NULL + 3 + 7 returns NULL (regarding NULL as an unknown quantity this makes sense as ? + 3 + 7 is also unknown)
but
SELECT SUM(N)
FROM (VALUES (NULL),
(3),
(7)) V(N)
Returns 10 and the warning that nulls were ignored.
However these are exactly the semantics you want for typical aggregation queries. Otherwise the presence of a single NULL would mean aggregations on that column over all rows would always end up yielding NULL, which is not very useful.
Which is the heaviest cake below?
After the third cake was weighed the scales broke and so no information is available about the fourth but it was still possible to measure the circumference.
+--------+--------+---------------+
| CakeId | Weight | Circumference |
+--------+--------+---------------+
|      1 |     50 |          12.0 |
|      2 |     80 |          14.2 |
|      3 |     70 |          13.7 |
|      4 |   NULL |          13.4 |
+--------+--------+---------------+
The query
SELECT MAX(Weight) AS MaxWeight,
AVG(Circumference) AS AvgCircumference
FROM Cakes
Returns
+-----------+------------------+
| MaxWeight | AvgCircumference |
+-----------+------------------+
|        80 |           13.325 |
+-----------+------------------+
Even though technically it is not possible to say with certainty that 80 was the weight of the heaviest cake (as the unknown weight may be larger), the results above are generally more useful than simply returning unknown.
+-----------+------------------+
| MaxWeight | AvgCircumference |
+-----------+------------------+
|         ? |           13.325 |
+-----------+------------------+
So likely you want NULLs to be ignored, and the warning just alerts you to the fact that this is happening.

@juergen provided two good answers:
Suppress the warning using SET ANSI_WARNINGS OFF
Assuming you want to include NULL values by treating them as (say) the zero date, use select @mindate = min(isnull(d, cast(0 as datetime))) from t
However if you want to ignore rows where the d column is null and not concern yourself with the ANSI_WARNINGS option, then you can do this by excluding all rows where d is null, like so:
select @mindate = min(d) from t where (d IS NOT NULL)

I think you can ignore this warning in this case, since you are using the MIN function.
"Except for COUNT, aggregate functions ignore null values"
Please refer to Aggregate Functions (Transact-SQL).
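For the table in the question, a quick sketch of that behaviour:
SELECT COUNT(*) AS all_rows,   -- 2: COUNT(*) counts every row, including the one with a NULL d
       COUNT(d) AS non_null_d, -- 1: the NULL in d is ignored
       MIN(d)   AS min_d       -- the single non-NULL date
FROM t;                        -- the statement still emits the "Null value is eliminated" warning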

What should min() return in your case as the lowest value of d?
The warning informs you that the min() function did not take into account the records where d is null.
So if it should ignore the NULL values and return the lowest existing date, then you can ignore this warning.
If you would also like to suppress the warning for this single statement, you can do it like this:
set ansi_warnings off
select @mindate = min(d) from t
set ansi_warnings on
If you want NULL values taken into account by using a default value for them, then you can set a default date value like this:
select @mindate = min(isnull(d, cast(0 as datetime)))
from t

If you want to make aggregates consider null values and treat the result as null you can use:
SELECT IIF(COUNT(N) != COUNT(*), NULL, SUM(N)) as [Sum]
FROM (VALUES (NULL),
(3),
(7)) V(N)
This returns null if not all values are given.

Related

What is the maximum value for STRING ordering in SQL (SQLite)?

I have a SQLite database and I want to order my results by ascending order of a String column (name). I want the null-valued rows to be last in ascending order.
Moreover, I am doing some filtering on the same column (WHERE name>"previously obtained value"), which filters out the NULL-valued rows, which I do not want. Plus, the version of SQLite I'm using (I don't have control over this) does not support NULLS LAST. Therefore, to keep it simple I want to use IFNULL(name,"Something") in my ORDER BY and my comparison.
I want this "Something" to be as large as possible, so that my null-valued rows are always last. I have texts in Japanese and Korean, so I can't just use "ZZZ".
Therefore, I see two possible solutions. First, use the "maximum" character used by SQLite in the default ordering of strings, do you know what this value is or how to obtain it? Second, as the cells can contain any type in SQLite, is there a value of any other type that will always be considered larger than any string?
Example:
+----+--------------+---------------+
| id | name         | othercol      |
+----+--------------+---------------+
|  1 | English name | hello         |
|  2 | NULL         | hi            |
|  3 | NULL         | hi hello      |
|  4 | 暴鬼         | hola          |
|  5 | NULL         | bonjour hello |
|  6 | 아바키       | hello bye     |
+----+--------------+---------------+
Current request:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (name,id)>("English name",1) ORDER BY (name,id)
Result (by ids): 6
Problems: NULL names are filtered out because of the comparison, and when I have no comparison they are shown first.
What I think would solve these problems:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (IFNULL(name,"Something"),id)>("English name",1) ORDER BY (IFNULL(name,"Something"),id)
But I need "Something" to be larger than any string I might encounter.
Expected result: 6, 3, 5
I think a simpler way is to use nulls last:
order by column nulls last
This works with both ascending and descending sorts. And it has the advantage that it can make use of an index on the column, which coalesce() would probably prevent.
Change your WHERE clause to:
WHERE SOMECOL > "previously obtained value" OR SOMECOL IS NULL
so the NULLs are not filtered out (since you want them).
You can sort the NULLs last, like this:
ORDER BY SOMECOL IS NULL, SOMECOL
The expression:
SOMECOL IS NULL
evaluates to 1 (True) or 0 (False), so the values that are not NULL will be sorted first.
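Putting the two parts together for the table in the question might look like this (a sketch using the question's column names; the (name, id) tuple comparison from the question is simplified to the name column):
SELECT *
FROM mytable
WHERE othercol LIKE 'hello'
  AND (name > 'English name' OR name IS NULL)
ORDER BY name IS NULL, name, id;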
Edit
If you want a string that is greater than any name in the table, then you can get it by:
select max(name) || ' ' from mytable
so in your code use:
ifnull(name, (select max(name) || ' ' from mytable))
Finally found a solution. For anyone looking for a character larger than any other (at the time I'm posting this; the Unicode table might get expanded), here's your guy:
CAST(x'f48083bf' AS TEXT).
Example in my case:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (IFNULL(name,CAST(x'f48083bf' AS TEXT)),id)>("English name",1) ORDER BY (IFNULL(name,CAST(x'f48083bf' AS TEXT)),id)

SQL Statement rows to Columns

I have a Table in MS Access like this:
The Columns are:
---------------------------------------------
| *Date* | *Article* | *Distance* | Value |
---------------------------------------------
Date, Article and Distance are Primary Keys, so the combination of them is always unique.
The column Distance has discrete values from 0 to 27.
I need to transform this table into a table like this:
-------------------------------------------------------------------------------------
| *Date* | *Article* | Value from Distance 0 | Value Dis. 1 | ... | Value Dis. 27 |
-------------------------------------------------------------------------------------
I really don't know a SQL Statement for this task. I needed a really fast solution which is why I wrote an Excel macro which worked fine but was very inefficient and needed several hours to complete. Now that the amount of data is 10 times higher, I can't use this macro anymore.
You can try the following pivot query:
SELECT
Date,
Article,
MAX(IIF(Distance = 0, Value, NULL)) AS val_0,
MAX(IIF(Distance = 1, Value, NULL)) AS val_1,
...
MAX(IIF(Distance = 27, Value, NULL)) AS val_27
FROM yourTable
GROUP BY
Date,
Article
Note that Access does not support CASE expressions, but it does offer a function called IIF() which takes the form of:
IIF(condition, value if true, value if false)
which essentially behaves the same way as CASE in other RDBMS.
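As an aside, Access also has a native crosstab syntax (TRANSFORM ... PIVOT) that can produce the same layout without writing out all 28 expressions; a sketch assuming the same table and column names as above:
TRANSFORM MAX([Value])
SELECT [Date], Article
FROM yourTable
GROUP BY [Date], Article
PIVOT Distance;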

SQL Server Primary Key for a range lookup

I have a static dataset that correlates a range of numbers to some metadata, e.g.
+--------+--------+-------+--------+----------------+
| Min | Max |Country|CardType| Issuing Bank |
+--------+--------+-------+--------+----------------+
| 400011 | 400051 | USA |VISA | Bank of America|
+--------+--------+-------+--------+----------------+
| 400052 | 400062 | UK |MAESTRO | HSBC |
+--------+--------+-------+--------+----------------+
I wish to look up the data for some arbitrary single value
SELECT *
FROM SomeTable
WHERE Min <= 400030
AND Max >= 400030
I have about 200k of these range mappings and am wondering what the best table structure for SQL Server is.
A composite key doesn't seem correct due to the fact that most of the time, the value being looked up will be in between the two range values stored on disk. Similarly, only indexing the first column doesn't seem to be selective enough.
I know that 200k rows is fairly insignificant, and I can get by with doing not much, but let's assume that the number of rows could be orders of magnitude greater.
If you usually search on both min and max then a compound key on (min, max) is appropriate. The engine will find all rows where min is less than X, then search within those results to find the rows where max is greater than Y.
The index would also be useful if you do searches on min only, but would not be applicable if you do searches only on max.
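A sketch of that compound index against the table from the question (the index name and the choice to make it clustered are assumptions):
-- Compound index covering both range bounds
CREATE CLUSTERED INDEX IX_SomeTable_Min_Max ON SomeTable ([Min], [Max]);
-- The lookup from the question can then seek on Min and filter on Max
SELECT *
FROM SomeTable
WHERE [Min] <= 400030
  AND [Max] >= 400030;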
You can index the first number and then do the lookup like this:
select t.*,
(select top 1 s.country
from static s
where t.num >= s.firstnum
order by s.firstnum desc
) country
from sometable t;
Or use outer apply:
select t.*, s.country
from sometable t outer apply
(select top 1 s.country
from static s
where t.num >= s.firstnum
order by s.firstnum desc
) s
This should take advantage of an index on static(firstnum) or static(firstnum, country). This does not check against the second number. If that is important, use outer apply and do the check outside the subquery.
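A sketch of doing the second-number check outside the subquery, assuming the upper bound column is called secondnum (the name is hypothetical, mirroring firstnum above):
select t.*,
       -- NULL when the candidate range does not actually contain the value
       case when t.num <= s.secondnum then s.country end as country
from sometable t outer apply
     (select top 1 s.firstnum, s.secondnum, s.country
      from static s
      where t.num >= s.firstnum
      order by s.firstnum desc
     ) s;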
I would specify the primary key on (Min,Max). Queries are as simple as:
SELECT *
FROM SomeTable
WHERE @Value BETWEEN Min AND Max
I'd also define a constraint to enforce that Min <= Max. Then I would create a trigger to enforce uniqueness in ranges and prevent the database from storing an overlapping range.
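A sketch of that table definition (the column types are assumptions; the overlap-preventing trigger is omitted):
CREATE TABLE SomeTable
(
    [Min]       int          NOT NULL,
    [Max]       int          NOT NULL,
    Country     varchar(50)  NOT NULL,
    CardType    varchar(20)  NOT NULL,
    IssuingBank varchar(100) NOT NULL,
    CONSTRAINT PK_SomeTable PRIMARY KEY ([Min], [Max]),
    CONSTRAINT CK_SomeTable_Range CHECK ([Min] <= [Max])
);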
I believe it is easier/faster if you create a trigger for INSERT and then fill the related calculated columns (country, issuing bank, card-number length).
That way you do the calculation only once, instead of 200k times every time you run a query. Of course there is a space cost, but the query will be much easier to maintain.
I remember once I had to calculate some sin and cos values to work out a distance, so I just created the calculated columns once.
After your update I think it is even easier:
+--------+--------+-------+--------+----------------+----------+
| Min | Max |Country|CardType| Issuing Bank | TypeID |
+--------+--------+-------+--------+----------------+----------+
| 400011 | 400051 | USA |VISA | Bank of America| 1 |
+--------+--------+-------+--------+----------------+----------+
| 400052 | 400062 | UK |MAESTRO | HSBC | 2 |
+--------+--------+-------+--------+----------------+----------+
Then your Card table will also get a TypeID column.

Query Performance with NULL

I would like to know about how NULL values affect query performance in SQL Server 2005.
I have a table similar to this (simplified):
ID | ImportantData | QuickPickOrder
-----------------------------------
 1 | 'Some Text'   | NULL
 2 | 'Other Text'  | 3
 3 | 'abcdefg'     | NULL
 4 | 'whatever'    | 4
 5 | 'it is'       | 2
 6 | 'technically' | NULL
 7 | 'a varchar'   | NULL
 8 | 'of course'   | 1
 9 | 'but that'    | NULL
10 | 'is not'      | NULL
11 | 'important'   | 5
And I'm doing a query on it like this:
SELECT *
FROM MyTable
WHERE QuickPickOrder IS NOT NULL
ORDER BY QuickPickOrder
So the QuickPickOrder is basically a column used to single out some commonly chosen items from a larger list. It also provides the order in which they will appear to the user. NULL values mean that it doesn't show up in the quick pick list.
I've always been told that NULL values in a database are somehow evil, at least from a normalization perspective, but is it an acceptable way to filter out unwanted rows in a WHERE constraint?
Would it be better to use specific number value, like -1 or 0, to indicate items that aren't wanted? Are there other alternatives?
EDIT:
The example does not accurately represent the ratio of real values to NULLs. A better example might show at least 10 NULLs for every non-NULL. The table size might be 100 to 200 rows. It is a reference table, so updates are rare.
SQL Server indexes NULL values, so this will most probably just use the Index Seek over an index on QuickPickOrder, both for filtering and for ordering.
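A sketch of the index this assumes (the index name is illustrative):
CREATE NONCLUSTERED INDEX IX_MyTable_QuickPickOrder ON MyTable (QuickPickOrder);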
Another alternative would be two tables:
MyTable:
ID | ImportantData
------------------
 1 | 'Some Text'
 2 | 'Other Text'
 3 | 'abcdefg'
 4 | 'whatever'
 5 | 'it is'
 6 | 'technically'
 7 | 'a varchar'
 8 | 'of course'
 9 | 'but that'
10 | 'is not'
11 | 'important'
QuickPicks:
MyTableID | QuickPickOrder
--------------------------
        2 | 3
        4 | 4
        5 | 2
        8 | 1
       11 | 5
SELECT MyTable.*
FROM MyTable JOIN QuickPicks ON QuickPicks.MyTableID = MyTable.ID
ORDER BY QuickPickOrder
This would allow updating QuickPickOrder without locking anything in MyTable or logging a full row transaction for that table. So depending how big MyTable is, and how often you are updating QuickPickOrder, there may be a scalability advantage.
Also, having a separate table will allow you to add a unique index on QuickPickOrder to ensure no duplication, and could be more easily scaled later to allow different kinds of QuickPicks, having them specific to certain contexts or users, etc.
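For instance, the unique index mentioned above might look like this (a sketch; the index name is assumed):
CREATE UNIQUE INDEX UX_QuickPicks_QuickPickOrder ON QuickPicks (QuickPickOrder);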
They do not have a negative performance hit on the database. Remember, NULL is more of a state than a value. Checking for NOT NULL vs setting that value to a -1 makes no difference other than the -1 is probably breaking your data integrity, imo.
SQL Server's performance can be affected by using NULLS in your database. There are several reasons for this.
First, NULLS that appear in fixed length columns (CHAR) take up the entire size of the column. So if you have a column that is 25 characters wide, and a NULL is stored in it, then SQL Server must store 25 characters to represent the NULL value. This added space increases the size of your database, which in turn means that it takes more I/O overhead to find the data you are looking for. Of course, one way around this is to use variable length fields instead. When NULLs are added to a variable length column, space is not unnecessarily wasted as it is with fixed length columns.
Second, use of the IS NULL clause in your WHERE clause means that an index cannot be used for the query, and a table scan will be performed. This can greatly reduce performance.
Third, the use of NULLS can lead to convoluted Transact-SQL code, which can mean code that doesn't run efficiently or that is buggy.
Ideally, NULLs should be avoided in your SQL Server databases.
Instead of using NULLs, use a coding scheme similar to this in your databases:
NA: Not applicable
NYN: Not yet known
TUN: Truly unknown
Such a scheme provides the benefits of using NULLs, but without the drawbacks.
NULL looks fine to me for this purpose. Performance is likely to be basically the same as with a non-null column and constant value, or maybe even better for filtering out all NULLs.
The alternative is to normalize QuickPickOrder into a table with a foreign key, and then perform an inner join to filter the nulls out (or a left join with a where clause to filter the non-nulls out).
NULL looks good to me as well. SQL Server has many kinds of indices to choose from. I forget which ones do this, but some only index values in a given range. If you had that kind of index on the column being tested, the NULL valued records would not be in the index, and the index scan would be fast.
Having a lot of NULLs in a column which has an index on it (or starting with it) is generally beneficial to this kind of query.
NULL values are not entered into the index, which means that inserting / updating rows with NULL in there doesn't take the performance hit of having to update another secondary index. If, say, only 0.001% of your rows have a non-null value in that column, the IS NOT NULL query becomes pretty efficient as it just scans a relatively small index.
Of course all of this is relative, if your table is tiny anyway, it makes no appreciable difference.

Is there any difference between GROUP BY and DISTINCT

I learned something simple about SQL the other day:
SELECT c FROM myTbl GROUP BY C
Has the same result as:
SELECT DISTINCT C FROM myTbl
What I am curious of, is there anything different in the way an SQL engine processes the command, or are they truly the same thing?
I personally prefer the distinct syntax, but I am sure it's more out of habit than anything else.
EDIT: This is not a question about aggregates. The use of GROUP BY with aggregate functions is understood.
MusiGenesis' response is functionally the correct one with regard to your question as stated; SQL Server is smart enough to realize that if you are using "Group By" and not using any aggregate functions, then what you actually mean is "Distinct" - and therefore it generates an execution plan as if you'd simply used "Distinct."
However, I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotchas down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregates and one of which is not.
A hammer can work to drive in a screw sometimes, but if you've got a screwdriver handy, why bother?
(for the purposes of this analogy, Hammer : Screwdriver :: GroupBy : Distinct and screw => get list of unique values in a table column)
GROUP BY lets you use aggregate functions, like AVG, MAX, MIN, SUM, and COUNT.
On the other hand DISTINCT just removes duplicates.
For example, if you have a bunch of purchase records, and you want to know how much was spent by each department, you might do something like:
SELECT department, SUM(amount) FROM purchases GROUP BY department
This will give you one row per department, containing the department name and the sum of all of the amount values in all rows for that department.
What's the difference from a mere duplicate removal functionality point of view?
Apart from the fact that unlike DISTINCT, GROUP BY allows for aggregating data per group (which has been mentioned by many other answers), the most important difference in my opinion is the fact that the two operations "happen" at two very different steps in the logical order of operations that are executed in a SELECT statement.
Here are the most important operations:
FROM (including JOIN, APPLY, etc.)
WHERE
GROUP BY (can remove duplicates)
Aggregations
HAVING
Window functions
SELECT
DISTINCT (can remove duplicates)
UNION, INTERSECT, EXCEPT (can remove duplicates)
ORDER BY
OFFSET
LIMIT
As you can see, the logical order of each operation influences what can be done with it and how it influences subsequent operations. In particular, the fact that the GROUP BY operation "happens before" the SELECT operation (the projection) means that:
It doesn't depend on the projection (which can be an advantage)
It cannot use any values from the projection (which can be a disadvantage)
1. It doesn't depend on the projection
An example where not depending on the projection is useful is if you want to calculate window functions on distinct values:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
GROUP BY rating
When run against the Sakila database, this yields:
rating rn
-----------
G 1
NC-17 2
PG 3
PG-13 4
R 5
The same couldn't be achieved with DISTINCT easily:
SELECT DISTINCT rating, row_number() OVER (ORDER BY rating) AS rn
FROM film
That query is "wrong" and yields something like:
rating rn
------------
G 1
G 2
G 3
...
G 178
NC-17 179
NC-17 180
...
This is not what we wanted. The DISTINCT operation "happens after" the projection, so we can no longer remove DISTINCT ratings because the window function was already calculated and projected. In order to use DISTINCT, we'd have to nest that part of the query:
SELECT rating, row_number() OVER (ORDER BY rating) AS rn
FROM (
SELECT DISTINCT rating FROM film
) f
Side-note: In this particular case, we could also use DENSE_RANK()
SELECT DISTINCT rating, dense_rank() OVER (ORDER BY rating) AS rn
FROM film
2. It cannot use any values from the projection
One of SQL's drawbacks is its verbosity at times. For the same reason as what we've seen before (namely the logical order of operations), we cannot "easily" group by something we're projecting.
This is invalid SQL:
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY name
This is valid (repeating the expression)
SELECT first_name || ' ' || last_name AS name
FROM customer
GROUP BY first_name || ' ' || last_name
This is valid, too (nesting the expression)
SELECT name
FROM (
SELECT first_name || ' ' || last_name AS name
FROM customer
) c
GROUP BY name
I've written about this topic more in depth in a blog post
There is no difference (in SQL Server, at least). Both queries use the same execution plan.
http://sqlmag.com/database-performance-tuning/distinct-vs-group
Maybe there is a difference, if there are sub-queries involved:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
There is no difference (Oracle-style):
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:32961403234212
Use DISTINCT if you just want to remove duplicates. Use GROUP BY if you want to apply aggregate operators (MAX, SUM, GROUP_CONCAT, ..., or a HAVING clause).
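For example, a HAVING clause only makes sense together with GROUP BY because it filters on the aggregate (a sketch reusing the purchases example from an earlier answer):
SELECT department, SUM(amount) AS total
FROM purchases
GROUP BY department
HAVING SUM(amount) > 1000;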
I expect there is the possibility for subtle differences in their execution.
I checked the execution plans for two functionally equivalent queries along these lines in Oracle 10g:
core> select sta from zip group by sta;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH GROUP BY | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
core> select distinct sta from zip;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 58 | 174 | 44 (19)| 00:00:01 |
| 1 | HASH UNIQUE | | 58 | 174 | 44 (19)| 00:00:01 |
| 2 | TABLE ACCESS FULL| ZIP | 42303 | 123K| 38 (6)| 00:00:01 |
---------------------------------------------------------------------------
The middle operation is slightly different: "HASH GROUP BY" vs. "HASH UNIQUE", but the estimated costs etc. are identical. I then executed these with tracing on and the actual operation counts were the same for both (except that the second one didn't have to do any physical reads due to caching).
But I think that because the operation names are different, the execution would follow somewhat different code paths and that opens the possibility of more significant differences.
I think you should prefer the DISTINCT syntax for this purpose. It's not just habit, it more clearly indicates the purpose of the query.
For the query you posted, they are identical. But for other queries that may not be true.
For example, it's not the same as:
SELECT C FROM myTbl GROUP BY C, D
I read all the above comments but didn't see anyone point out the main difference between GROUP BY and DISTINCT apart from the aggregation bit.
DISTINCT returns all the rows and then de-duplicates them, whereas GROUP BY de-duplicates the rows as they're read by the algorithm one by one.
This means they can produce different results!
For example, the below codes generate different results:
SELECT distinct ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
SELECT ROW_NUMBER() OVER (ORDER BY Name), Name FROM NamesTable
GROUP BY Name
If there are 10 names in the table, 1 of which is a duplicate of another, then the first query returns 10 rows whereas the second query returns 9 rows.
The reason is what I said above so they can behave differently!
If you use DISTINCT with multiple columns, the result set won't be grouped as it will with GROUP BY, and you can't use aggregate functions with DISTINCT.
GROUP BY has a very specific meaning that is distinct (heh) from the DISTINCT function.
GROUP BY causes the query results to be grouped using the chosen expression, aggregate functions can then be applied, and these will act on each group, rather than the entire resultset.
Here's an example that might help:
Given a table that looks like this:
name
------
barry
dave
bill
dave
dave
barry
john
This query:
SELECT name, count(*) AS count FROM table GROUP BY name;
Will produce output like this:
name count
-------------
barry 2
dave 3
bill 1
john 1
Which is obviously very different from using DISTINCT. If you want to group your results, use GROUP BY, if you just want a unique list of a specific column, use DISTINCT. This will give your database a chance to optimise the query for your needs.
If you are using a GROUP BY without any aggregate function then internally it will be treated as DISTINCT, so in this case there is no difference between GROUP BY and DISTINCT.
But when the DISTINCT clause is available it is better to use it for finding your unique records, because the objective of GROUP BY is to achieve aggregation.
They have different semantics, even if they happen to have equivalent results on your particular data.
Please don't use GROUP BY when you mean DISTINCT, even if they happen to work the same. I'm assuming you're trying to shave off milliseconds from queries, and I have to point out that developer time is orders of magnitude more expensive than computer time.
From a Teradata perspective:
From a result set point of view, it does not matter if you use DISTINCT or GROUP BY in Teradata. The answer set will be the same.
From a performance point of view, it is not the same.
To understand what impacts performance, you need to know what happens on Teradata when executing a statement with DISTINCT or GROUP BY.
In the case of DISTINCT, the rows are redistributed immediately without any preaggregation taking place, while in the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.
Don't think now that GROUP BY is always better from a performance point of view. When you have many different values, the preaggregation step of GROUP BY is not very efficient. Teradata has to sort the data to remove duplicates. In this case, it may be better to do the redistribution first, i.e. use the DISTINCT statement. Only if there are many duplicate values is the GROUP BY statement probably the better choice, as the deduplication step takes place only once, after redistribution.
In short, DISTINCT vs. GROUP BY in Teradata means:
GROUP BY -> for many duplicates
DISTINCT -> no or only a few duplicates.
At times, when using DISTINCT, you run out of spool space on an AMP. The reason is that redistribution takes place immediately, and skewing could cause AMPs to run out of space.
If this happens, you probably have a better chance with GROUP BY, as duplicates are already removed in a first step, and less data is moved across the AMPs.
group by is used in aggregate operations -- like when you want to get a count of Bs broken down by column C
select C, count(B) from myTbl group by C
distinct is what it sounds like -- you get unique rows.
In sql server 2005, it looks like the query optimizer is able to optimize away the difference in the simplistic examples I ran. Dunno if you can count on that in all situations, though.
In that particular query there is no difference. But, of course, if you add any aggregate columns then you'll have to use group by.
You're only noticing that because you are selecting a single column.
Try selecting two fields and see what happens.
Group By is intended to be used like this:
SELECT name, SUM(transaction) FROM myTbl GROUP BY name
Which would show the sum of all transactions for each person.
From a 'SQL the language' perspective the two constructs are equivalent and which one you choose is one of those 'lifestyle' choices we all have to make. I think there is a good case for DISTINCT being more explicit (and therefore is more considerate to the person who will inherit your code etc) but that doesn't mean the GROUP BY construct is an invalid choice.
I think this 'GROUP BY is for aggregates' is the wrong emphasis. Folk should be aware that the set function (MAX, MIN, COUNT, etc) can be omitted so that they can understand the coder's intent when it is.
The ideal optimizer will recognize equivalent SQL constructs and will always pick the ideal plan accordingly. For your real life SQL engine of choice, you must test :)
PS: note that the position of the DISTINCT keyword in the select clause may produce different results, e.g. contrast:
SELECT COUNT(DISTINCT C) FROM myTbl;
SELECT DISTINCT COUNT(C) FROM myTbl;
I know it's an old post, but it happens that I had a query that used GROUP BY just to return distinct values. When using that query in Toad and Oracle Reports everything worked fine, I mean a good response time. When we migrated from Oracle 9i to 11g the response time in Toad was still excellent, but in the report it took about 35 minutes to finish, whereas with the previous version it took about 5 minutes.
The solution was to change the GROUP BY to DISTINCT and now the report runs in about 30 seconds.
I hope this is useful for someone in the same situation.
Sometimes they may give you the same results but they are meant to be used in different sense/case. The main difference is in syntax.
Look closely at the example below. DISTINCT is used to filter out duplicate sets of values. (6, cs, 9.1) and (1, cs, 5.5) are two different sets, so DISTINCT is going to display both rows while GROUP BY Branch is going to display only one row per branch.
SELECT * FROM student;
+------+--------+------+
| Id   | Branch | CGPA |
+------+--------+------+
|    3 | civil  |  7.2 |
|    2 | mech   |  6.3 |
|    6 | cs     |  9.1 |
|    4 | eee    |  8.2 |
|    1 | cs     |  5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT DISTINCT * FROM student;
+------+--------+------+
| Id   | Branch | CGPA |
+------+--------+------+
|    3 | civil  |  7.2 |
|    2 | mech   |  6.3 |
|    6 | cs     |  9.1 |
|    4 | eee    |  8.2 |
|    1 | cs     |  5.5 |
+------+--------+------+
5 rows in set (0.001 sec)
SELECT * FROM student GROUP BY Branch;
+------+--------+------+
| Id   | Branch | CGPA |
+------+--------+------+
|    3 | civil  |  7.2 |
|    6 | cs     |  9.1 |
|    4 | eee    |  8.2 |
|    2 | mech   |  6.3 |
+------+--------+------+
4 rows in set (0.001 sec)
Sometimes the results that can be achieved with the GROUP BY clause are not possible to achieve with DISTINCT without using some extra clause or conditions, e.g. in the above case.
To get the same result as DISTINCT you have to pass all the column names in the GROUP BY clause, as below. Note the syntactical difference: you must know all the column names to use the GROUP BY clause in that case.
SELECT * FROM student GROUP BY Id, Branch, CGPA;
+------+--------+------+
| Id   | Branch | CGPA |
+------+--------+------+
|    1 | cs     |  5.5 |
|    2 | mech   |  6.3 |
|    3 | civil  |  7.2 |
|    4 | eee    |  8.2 |
|    6 | cs     |  9.1 |
+------+--------+------+
Also, I have noticed that GROUP BY displays the results in ascending order by default, which DISTINCT does not. But I am not sure about this; it may differ by vendor.
Source : https://dbjpanda.me/dbms/languages/sql/sql-syntax-with-examples#group-by
In terms of usage, GROUP BY is used for grouping the rows you want to calculate over. DISTINCT will not do any calculation; it will just show no duplicate rows.
I always use DISTINCT if I want to present data without duplicates.
If I want to do calculations, like summing up the total quantity of mangoes, I will use GROUP BY.
In Hive (HQL), GROUP BY can be way faster than DISTINCT, because the former does not require comparing all fields in the table.
See: https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct.
The way I always understood it is that using distinct is the same as grouping by every field you selected in the order you selected them.
i.e:
select distinct a, b, c from table;
is the same as:
select a, b, c from table group by a, b, c
Functional efficiency is totally different.
If you would like to select only the returned values without duplicates, using DISTINCT is better than GROUP BY, because GROUP BY involves (sorting + removing) while DISTINCT involves only (removing).
Generally we can use DISTINCT to eliminate the duplicates in a specific column of the table.
In the case of GROUP BY we can apply aggregation functions like AVG, MAX, MIN, SUM, and COUNT to a specific column and fetch both the column name and the aggregation result for that column.
Example :
select specialColumn,sum(specialColumn) from yourTableName group by specialColumn;
There is no significant difference between the GROUP BY and DISTINCT clauses except the usage of aggregate functions.
Both can be used to distinguish values, but from a performance point of view GROUP BY is better.
When the DISTINCT keyword is used, internally it uses a sort operation, which can be viewed in the execution plan.
Try a simple example:
Declare @tmpresult table
(
Id tinyint
)
Insert into @tmpresult
Select 5
Union all
Select 2
Union all
Select 3
Union all
Select 4
Select distinct
Id
From @tmpresult