What do comma-separated integers in a GROUP BY statement accomplish? - sql

I have a query like this:
SELECT col1, col2, col3, col4, col5, SUM(col6) AS total
FROM table_name
WHERE col1 < 99999
GROUP BY 1,2,3,4,5
What does the GROUP BY statement actually accomplish here? The query does not work properly without the comma-separated integers.

It is equivalent to writing:
SELECT col1, col2, col3, col4, col5, SUM(col6) AS total
FROM table_name
WHERE col1 < 99999
GROUP BY col1, col2, col3, col4, col5
The numbers are the values/columns in the select-list expressed by ordinal position in the list, starting with 1.
The numbers used to be mandatory; then the ability to use the expressions in the select-list was added. The expressions can get unwieldy, and not all DBMSs allow you to use 'display labels' or 'column aliases' from the select-list in the GROUP BY clause, so occasionally using the column numbers is helpful.
In your example, it would be better to use the names - they are simple. And, in general, use names rather than numbers whenever you can.
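To see the equivalence for yourself, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is an assumption on my part (it happens to accept ordinal positions in GROUP BY; your DBMS may not), and the sample rows are made up, with the table trimmed to col1, col2 and col6.

```python
import sqlite3

# Hypothetical sample data; table_name and the column names mirror the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (col1 INTEGER, col2 TEXT, col6 INTEGER)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?)",
    [(1, "a", 10), (1, "a", 5), (1, "b", 7), (2, "a", 3)],
)

# Grouping by ordinal position in the select-list ...
by_position = conn.execute(
    "SELECT col1, col2, SUM(col6) AS total FROM table_name "
    "WHERE col1 < 99999 GROUP BY 1, 2 ORDER BY 1, 2"
).fetchall()

# ... and grouping by the column names themselves.
by_name = conn.execute(
    "SELECT col1, col2, SUM(col6) AS total FROM table_name "
    "WHERE col1 < 99999 GROUP BY col1, col2 ORDER BY col1, col2"
).fetchall()

print(by_position)             # [(1, 'a', 15), (1, 'b', 7), (2, 'a', 3)]
print(by_position == by_name)  # True
```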

My guess is that your database product allows referencing columns in the GROUP BY by position as opposed to only by column name (i.e., 1 for the first column, 2 for the second column, etc.). If so, this is a proprietary feature and is not recommended because of portability and (arguably) readability issues, though it can admittedly be handy for a quick and dirty query.

I tried a similar query in MS SQL Server 2005:
select distinct host from some_table group by 1,2,3
It errors out, saying:
Each GROUP BY expression must contain at least one column that is not an outer reference.
So this indicates that SQL Server does not read those 1,2,3 as column positions; it treats them as constants, which are not valid GROUP BY expressions there.

Related

Why SQL-Standard doesn't allow COUNT(col1, col2, ..., colN)

The question is simple: why doesn't the SQL standard allow COUNT(col1, col2, ..., colN)? What's the reason behind it?
It's pretty strange because, conversely, the SQL standard allows COUNT(DISTINCT col1, col2, ..., colN).
What do you want? Simply COUNT(*) gives you the 'desired' answer.
COUNT(*) is usually the thing to use. It counts the number of rows (after filtering by WHERE and subject to GROUP BY).
COUNT(col) is common, but often not necessary -- it counts rows with col IS NOT NULL.
COUNT(DISTINCT col) determines how many different values there are for col.
COUNT(DISTINCT col1, col2) determines how many different values there are for the combination of col1 and col2.
If you want to know why MySQL/MariaDB chose to leave out that syntax, you may have to ask the people who developed the SQL standard.
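A small sketch of the COUNT variants above, run through Python's sqlite3 module. The people table and its rows are made up for illustration. SQLite (my assumption for the demo) does not accept the multi-column COUNT(DISTINCT col1, col2) form, so the last query uses a DISTINCT subquery as a stand-in; the NULL filter is there because, as far as I know, MySQL skips combinations containing a NULL.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, surname TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", "Lee"), ("Ann", "Lee"), ("Ann", "Kim"), ("Bob", None)],
)

# COUNT(*) counts rows, COUNT(col) counts non-NULL values,
# COUNT(DISTINCT col) counts different non-NULL values.
count_rows, count_surname, count_distinct_name = conn.execute(
    "SELECT COUNT(*), COUNT(surname), COUNT(DISTINCT name) FROM people"
).fetchone()

# Stand-in for COUNT(DISTINCT name, surname): count the rows of a
# DISTINCT subquery, ignoring combinations that contain a NULL.
(distinct_pairs,) = conn.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT DISTINCT name, surname FROM people"
    "  WHERE name IS NOT NULL AND surname IS NOT NULL"
    ")"
).fetchone()

print(count_rows, count_surname, count_distinct_name, distinct_pairs)  # 4 3 2 2
```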
COUNT(NAME), COUNT(SURNAME) gets you ALL names and surnames. ALL is illegal in that context.

SQL Query that gives me last column with a value in it

So let's say I have a table with 6 columns, col1-col6.
Sometimes I will have values in all 6 columns and sometimes maybe just in 3 or 5. It depends. The other columns will be null.
In my query I want to select col1 and the last column that has a value in it. That's because this column holds a total that I need.
If the total were always in, for example, col6, I could write an easy query:
SELECT col1,col6 from mytable
but the problem is that I have to find out which column the total is in. It could be in any of col2-col6.
Please look at my fiddle for a better understanding; in the fiddle's example I want to get col1 and col5 back.
http://sqlfiddle.com/#!6/e3aeb/2
Use COALESCE to pick "last" non-null column.
select col1, coalesce(col6, col5, col4, col3, col2) from mytable
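Here is a quick sketch of that query against made-up rows, using Python's sqlite3 module (SQLite is just my choice for a runnable demo; COALESCE itself is widely supported). COALESCE returns its first non-NULL argument, so listing the columns from col6 down to col2 picks the rightmost filled one.

```python
import sqlite3

# Hypothetical rows: the total sits in col6, col5 and col3 respectively.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (col1 TEXT, col2 INT, col3 INT, col4 INT, col5 INT, col6 INT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a", 1, 2, 3, 4, 5),
        ("b", 1, 2, 3, 9, None),
        ("c", 1, 6, None, None, None),
    ],
)

# COALESCE scans its arguments left to right and returns the first non-NULL one.
last_values = conn.execute(
    "SELECT col1, COALESCE(col6, col5, col4, col3, col2) AS total "
    "FROM mytable ORDER BY col1"
).fetchall()

print(last_values)  # [('a', 5), ('b', 9), ('c', 6)]
```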

How does the GROUP BY statement in SQL affect the results?

Does including an extra column in GROUP BY change the number of rows in the results?
I was doing a select query on a table A(col1,col2....col9) and I first included
select col1,col2,col3
from A where col1 = (condition)
group by col1, col2, col3
which yielded me certain number of results.
Then I changed the query to this:
select col1,col2,col3, col8,col9
from A where col1=(condition)
group by col1,col2,col3, col8,col9
and I got a different number of rows in the results. What could be the possible explanation?
If the combination of col1, col2 and col3 is not unique, you can have more than one row with the same combination of those three.
If that happens, and those duplicates have different values for col8 and/or col9, then grouping by those extra columns will result in more rows.
Note that you can use select distinct to get the same results. group by is especially used if you want to aggregate over other columns, for instance, calculate a sum or a count, like so:
select
col1, col2, col3,
sum(col8) as total8
from A
group by col1, col2, col3
The query above will give you each unique combination of col1, col2 and col3 plus the sum over all col8's for each combination.
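To make the row-count effect concrete, here is a small sketch with Python's sqlite3 module. The rows are made up, and table A is trimmed to the five columns the question uses; the first three rows share col1-col3 but differ in col8.

```python
import sqlite3

# Hypothetical data: three rows share (col1, col2, col3); one of them has a different col8.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (col1 INT, col2 INT, col3 INT, col8 INT, col9 INT)")
conn.executemany(
    "INSERT INTO A VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 10, 100), (1, 1, 1, 10, 100), (1, 1, 1, 20, 100), (2, 2, 2, 10, 100)],
)

three_cols = conn.execute(
    "SELECT col1, col2, col3 FROM A GROUP BY col1, col2, col3"
).fetchall()

five_cols = conn.execute(
    "SELECT col1, col2, col3, col8, col9 FROM A "
    "GROUP BY col1, col2, col3, col8, col9"
).fetchall()

# Aggregating col8 instead of grouping by it keeps one row per (col1, col2, col3).
totals = conn.execute(
    "SELECT col1, col2, col3, SUM(col8) AS total8 FROM A "
    "GROUP BY col1, col2, col3 ORDER BY col1"
).fetchall()

print(len(three_cols), len(five_cols))  # 2 3
print(totals)                           # [(1, 1, 1, 40), (2, 2, 2, 10)]
```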
By grouping on those columns you are, in essence, making the results distinct on the grouped columns. So if there were rows that had columns 1, 2, 3, 8, and 9 in common, they would be folded together.
Adding GROUP BY isn't really the correct way to go about this: instead of grouping by one column, it groups across every listed column together, so you may end up with fewer or more rows depending on the data you're querying.
May I ask for what reason you're grouping the columns?

Is there a difference between DISTINCT colname and DISTINCT(colname)?

I've seen both versions around. On iSeries DB2 you can use either and as far as I can tell they do the same thing. Is there a difference?
No, there is no difference because DISTINCT is a keyword and not a function call.
It's the same difference as between SOME_COLUMN and (SOME_COLUMN) (without any keyword in front)
If you have only one column in your select, then there is no difference.
However, when you use distinct in front of several columns, as in
select distinct col1, col2, col3 from table
it applies distinct to the whole tuple (col1, col2, col3).
Finally there is no difference in using distinct as select distinct or select distinct()
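A quick check of both points, sketched with Python's sqlite3 module rather than DB2 (that substitution is my assumption; the table t and its rows are made up). The parentheses only wrap the column expression, and DISTINCT still applies to the whole select-list.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (colname TEXT, other INT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("x", 1), ("x", 2), ("y", 1), ("x", 1)],
)

plain = conn.execute("SELECT DISTINCT colname FROM t ORDER BY colname").fetchall()
parens = conn.execute("SELECT DISTINCT(colname) FROM t ORDER BY colname").fetchall()

# The parentheses do not limit DISTINCT to colname: the (colname, other) pair is deduplicated.
pairs = conn.execute(
    "SELECT DISTINCT (colname), other FROM t ORDER BY colname, other"
).fetchall()

print(plain == parens)  # True
print(pairs)            # [('x', 1), ('x', 2), ('y', 1)]
```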

Where clause on a column that's a result of a UDF

I have a user defined function (e.g. myUDF(a,b)) that returns an integer.
I am trying to ensure this function will be called only once and its results can be used as a condition in the WHERE clause:
SELECT col1, col2, col3,
myUDF(col1,col2) AS X
From myTable
WHERE x>0
SQL Server tries to detect x as column, but it's really an alias for a computed value.
How can you re-write this query so that the filtering can be done on the computed value without having to execute the UDF more than once?
With Tbl AS
(SELECT col1, col2, col3, myUDF(col1,col2) AS X
From myTable )
SELECT * FROM Tbl WHERE X > 0
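Here is a runnable sketch of the CTE approach using Python's sqlite3 module instead of SQL Server (my substitution, so the demo is self-contained). The my_udf body and the sample rows are hypothetical stand-ins. Note that this only shows the alias can be filtered on without writing the call twice; whether the engine actually evaluates the function once per row is up to its optimizer, so check the plan on your own system.

```python
import sqlite3


def my_udf(a, b):
    # Hypothetical stand-in for the question's myUDF(a, b); returns an integer.
    return a - b


conn = sqlite3.connect(":memory:")
conn.create_function("myUDF", 2, my_udf)
conn.execute("CREATE TABLE myTable (col1 INTEGER, col2 INTEGER, col3 TEXT)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [(5, 2, "keep"), (1, 4, "drop"), (3, 3, "drop"), (9, 1, "keep")],
)

# The CTE names the computed value X, so the outer query can filter on it.
rows = conn.execute(
    "WITH Tbl AS ("
    "  SELECT col1, col2, col3, myUDF(col1, col2) AS X FROM myTable"
    ") "
    "SELECT col1, col2, col3, X FROM Tbl WHERE X > 0 ORDER BY col1"
).fetchall()

print(rows)  # [(5, 2, 'keep', 3), (9, 1, 'keep', 8)]
```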
If you are using SQL Server 2005 and beyond, you can use Cross Apply:
Select T.col1, T.col2, FuncResult.X
From myTable As T
Cross Apply ( Select myUdf(T.col1, T.col2) As X ) As FuncResult
Where FuncResult.X > 0
try
SELECT col1, col2, col3, dbo.myUDF(col1,col2) AS X
From myTable
WHERE dbo.myUDF(col1,col2) >0
but be aware that this will cause a scan since it is not SARGable
Here is another way
select * from(
SELECT col1, col2, col3, dbo.myUDF(col1,col2) AS X
From myTable ) as y
WHERE x>0
SQL Server does not allow you to reference a column alias in the WHERE clause of the same SELECT. You either have to write out the expression twice:
SELECT col1, col2, col3, myUDF(col1,col2) AS X
From myTable
WHERE myUDF(col1,col2) > 0
Or use a subquery:
SELECT *
FROM (
SELECT col1, col2, col3, myUDF(col1,col2) AS X
From myTable
) as subq
WHERE x > 0
Depending on the udf and how useful or frequently used it is, you may consider adding it to the table as a computed column. You could then filter on the column as normal and not have to write out the function at all in queries.
I'm not 100% sure what you are doing but since x isn't a column I would remove it from your SQL statement so you have :
SELECT col1, col2, col3, myUDF(col1,col2) AS X From myTable
And then add the condition to your code so you only call it when x > 0
Your question is best answered by the "With" clauses (CTEs, I think, in MS SQL Server).
Really the best question is: Should I store this computed value or recalculate it for every row, each and every time I query the table.
Are there 10 rows in the table and always 10 rows?
Are rows being added constantly?
Do you have a purge strategy in place or just let it grow?
Query that table only once a month?
If this is a "long running" function (even after you've optimized the hell out of it), why do you want to execute it more than once, ever?
You asked for once, but you are really asking for once per row, per query.
Storing the answer in an index or "virtual column"
Pros:
Calculate exactly once per row.
Query times don't grow linearly.
Cons:
Increases insert/update time
Calculating every time
Pros:
Insert/update time optimized
Cons:
Query time grows with row count. (not scalable)
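To illustrate the "store the answer" side of that trade-off, here is a sketch with Python's sqlite3 module. Everything in it is an assumption for the demo: the CountingUdf stand-in for myUDF, the sample rows, and the plain X column filled by an UPDATE (a real system would more likely use a computed column or a trigger to keep X in sync on insert/update).

```python
import sqlite3


class CountingUdf:
    """Hypothetical stand-in for myUDF that records how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return a - b


udf = CountingUdf()
conn = sqlite3.connect(":memory:")
conn.create_function("myUDF", 2, udf)
conn.execute("CREATE TABLE myTable (col1 INTEGER, col2 INTEGER, X INTEGER)")
conn.executemany(
    "INSERT INTO myTable (col1, col2) VALUES (?, ?)",
    [(5, 2), (1, 4), (9, 1)],
)

# Pay the cost up front: compute the value at write time and store it.
conn.execute("UPDATE myTable SET X = myUDF(col1, col2)")
conn.execute("CREATE INDEX idx_mytable_x ON myTable (X)")
calls_after_store = udf.calls

# Reads filter on the stored column and never invoke the function.
rows = conn.execute(
    "SELECT col1, col2, X FROM myTable WHERE X > 0 ORDER BY col1"
).fetchall()
calls_after_query = udf.calls

print(rows)                                    # [(5, 2, 3), (9, 1, 8)]
print(calls_after_query == calls_after_store)  # True
```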
If you're querying once a month, why do you care how bad the performance is? Go tune something that actually has a big impact on your operations (very slightly facetious).
If you're not inserting a bunch (depends on your hardware) of rows per second, is spending that time up front going to make a big difference?