SQL: legality of including ungrouped columns in a GROUP BY statement

In SQLite, if I do:
CREATE TABLE fraction (
id Int,
tag Int,
num Int,
den Int,
PRIMARY KEY (id)
);
INSERT INTO fraction VALUES (1,1,3,4);
INSERT INTO fraction VALUES (2,1,5,6);
INSERT INTO fraction VALUES (3,2,3,8);
INSERT INTO fraction VALUES (4,2,5,7);
INSERT INTO fraction VALUES (5,1,10,13);
INSERT INTO fraction VALUES (6,2,5,7);
SELECT fraction.tag, max(1.0 * fraction.num / fraction.den)
FROM fraction
GROUP BY fraction.tag;
I will get the result:
1|0.833333333333333
2|0.714285714285714
Then, if I issue:
SELECT fraction.tag, max(1.0 * fraction.num / fraction.den),
fraction.num, fraction.den
FROM fraction
GROUP BY fraction.tag;
I will get the result:
1|0.833333333333333|5|6
2|0.714285714285714|5|7
The latter is what I would expect, but it seems like a happy accident more than anything predictable or reliable. For example, were the aggregate function sum instead of max, some type of "rider" column wouldn't make sense.
In a current project that I'm doing, I'm using a table joined to itself to simulate the latter:
SELECT DISTINCT fraction_a.tag, fraction_a.high,
fraction_b.num, fraction_b.den
FROM
(SELECT fraction.tag, max(1.0 * fraction.num / fraction.den) AS high
FROM fraction
GROUP BY fraction.tag)
AS fraction_a JOIN
(SELECT fraction.tag, fraction.num, fraction.den
FROM fraction)
AS fraction_b
ON fraction_a.tag = fraction_b.tag
AND fraction_a.high = 1.0 * fraction_b.num / fraction_b.den;
yielding
1|0.833333333333333|5|6
2|0.714285714285714|5|7
But I find that syntax ugly, impractical and unmaintainable.
As I'll be porting my project between several dialects of SQL, I need a solution that is reliable in all dialects. So, if I have to bite the bullet and use the ugly syntax I will, but I'd prefer using the cleaner one.

When you're using GROUP BY, the database has to create a single output row from (possibly) multiple input rows.
Columns mentioned in the GROUP BY clause have the same value for all rows in the group, so this is the output value to be used.
Columns with some aggregate function use that to compute the output value.
However, other columns are a problem, because there might be different values in the group.
The SQL standard forbids this.
MySQL does not check for this error, and simply returns the value from some arbitrary row in the group.
SQLite allows this for compatibility with MySQL.
Since version 3.7.11, when you're using MIN or MAX, SQLite guarantees that the other columns will come from the record that has the minimum/maximum value.

Including non-aggregated columns in your SELECT clause that don't appear in your GROUP BY clause is non-portable and will likely cause errors or unexpected results. The syntax you're using is not cleaner - it is plain wrong and happens to work on SQLite. It won't work on Oracle (which raises an error), it won't work as expected on MySQL (where it will return arbitrary values from the group), and it likely won't work on other RDBMSs.
The most straightforward way to implement this would be to use a windowing function - but since you need to support SQLite, that's out of the question.
Please note that your second approach (the "ugly" query) will return multiple rows per tag if you happen to have several maxima. This might or might not be what you want.
So bite the bullet and use something like your ugly approach - it's portable and will work as expected.
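Worth noting: newer SQLite releases (3.25 and later) do support window functions, so a ROW_NUMBER rewrite may be portable after all. Here is a minimal sketch of that approach against the question's own fraction table, run through Python's sqlite3 module (the Python wrapper is just for demonstration; the SQL is what matters). The ", id" tie-breaker is an assumption added to make the result deterministic when a tag has several maxima.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fraction (id INT, tag INT, num INT, den INT, PRIMARY KEY (id));
    INSERT INTO fraction VALUES (1,1,3,4), (2,1,5,6), (3,2,3,8),
                                (4,2,5,7), (5,1,10,13), (6,2,5,7);
""")
# Rank each row within its tag by the fraction's value; rn = 1 is the maximum.
# Ordering by id as well breaks ties deterministically (tag 2 has 5/7 twice).
rows = conn.execute("""
    SELECT tag, num, den
    FROM (SELECT tag, num, den,
                 ROW_NUMBER() OVER (PARTITION BY tag
                                    ORDER BY 1.0 * num / den DESC, id) AS rn
          FROM fraction) AS ranked
    WHERE rn = 1
    ORDER BY tag
""").fetchall()
print(rows)  # [(1, 5, 6), (2, 5, 7)]
```

Unlike the DISTINCT self-join, this returns exactly one row per tag even when the maximum is attained more than once.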

Related

Why does joining on different data types produce a conversion type inconsistently?

As I try to join tables together on a value that's represented in different data types, I get really odd errors. Please consider the following:
I have two tables; let's say one is in database "CoffeeWarehouse," and the other is in database "CoffeeAnalytics":
Table 1: CoffeeWarehouse.dbo.BeanInfo
Table 2: CoffeeAnalytics.dbo.BeanOrderRecord
Now, both tables have a field called OrderNumber (although in Table 2 it's spelled as [order number]); in Table 1, it's represented as a string, and in Table 2, it's represented as a float.
I proceed to join the tables together:
SELECT ordernumber,
bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor ON bor.[order number] = bni.ordernumber;
If I specify the order numbers I'd like by adding the following:
WHERE bni.ordernumber = '48911'
then I see the complete table I'd like- all the fields from the table I've joined are populated properly.
If I add more order numbers, it works too:
WHERE bni.ordernumber IN ('48911', '83716', '98811', ...)
Now for the problem:
Suppose I want to select everything in the table where another field, i.e. CountryOfOrigin, is not null. I'm not going to enter several thousand order numbers- I just want to use a where clause to weed out the rows with incomplete data.
So I add the following to my original query:
WHERE bor.CountryOfOrigin IS NOT NULL
When I execute, I get this error:
Msg 8114, Level 16, State 5, Line 1
Error converting data type varchar to float.
I get the same error if I even simply use this as a where clause:
WHERE bni.ordernumber IS NOT NULL
Why is this the case? When I specify the ordernumber, the join works well- when I want to select many ordernumbers, I get a conversion error.
Any help/insight?
The SQL Server query optimiser can choose different paths to get your results, even with the same query from minute to minute.
In this query, say:
SELECT ordernumber,
bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor ON bor.[order number] = bni.ordernumber
WHERE bni.ordernumber = '48911';
The query optimiser may, for example, take one of two paths:
It may choose to use BeanInfo as the "driving" table, use an index to narrow down the rows in that table to, say, a single row with order number 48911, and then join to BeanOrderRecord using just that one order number.
It may choose to use BeanOrderRecord as the driving table, join the two tables together by order number to get a full set of results, and then filter that resultset by the order number.
Which path the query optimiser takes will depend on a variety of things, including defined indexes, the number of rows in the table, cardinality, and so on.
Now, if it just so happens that one of your order numbers isn't convertible to a float—say someone typed '!2345' by accident—the first optimiser choice may always work, and the second one may always fail. But you don't get to choose which path the optimiser takes.
This is why you're seeing what you think of as weird results. In one of your queries, all the order numbers are being analysed and that's triggering the error, in another only order numbers that are convertible to float are being analysed, so there's no error. But it's basically just luck that it's working out the way it is. It could just as well be the other way around, or neither query might ever work.
This is one reason it's bad to store things in inappropriate data types. Fixing that would be the obvious solution.
A dirty and terrible fix, however, might be to always cast your FLOAT to a VARCHAR when doing the order number comparison, as I believe it's always safe to cast from FLOAT to VARCHAR. Though you may need to experiment to make sure the resulting VARCHAR value is formatted the same as your order number (or cast to INTEGER first...)
You'll have to resort to some quite fiddly trickery to get any performance out of your existing setup, though. If they were both VARCHAR values you could easily make the table join very fast by indexing each order number column, but as it is the casting you'll have to do will render normal indexes unusable for a join.
If you're using a recent version of SQL Server, you can use TRY_CAST to find the problem row(s):
SELECT * FROM CoffeeWarehouse.dbo.BeanInfo WHERE TRY_CAST(ordernumber AS FLOAT) IS NULL
...which will find any VARCHAR ordernumber that can't be converted to a FLOAT.

SELECT COUNT(*);

I have a database, database1, with two tables (Table 1, Table2) in it.
There are 3 rows in Table1 and 2 rows in Table2. Now if I execute the following SQL query SELECT COUNT(*); on database1, then the output is "1".
Does anyone have an idea what this "1" signifies?
The definition of the two tables is as below.
CREATE TABLE Table1
(
ID INT PRIMARY KEY,
NAME NVARCHAR(20)
)
CREATE TABLE Table2
(
ID INT PRIMARY KEY,
NAME NVARCHAR(20)
)
Normally all selects are of the form SELECT [columns, scalar computations on columns, grouped computations on columns, or scalar computations] FROM [table or joins of tables, etc]
Because this allows plain scalar computations we can do something like SELECT 1 + 1 FROM SomeTable and it will return a recordset with the value 2 for every row in the table SomeTable.
Now, if we didn't care about any table, but just wanted to do our scalar computation, we might want to do something like SELECT 1 + 1. This isn't allowed by the standard, but it is useful and most databases allow it (Oracle doesn't, unless that has changed recently).
Hence such bare SELECTs are treated as if they had a from clause which specified a table with one row and no column (impossible of course, but it does the trick). Hence SELECT 1 + 1 becomes SELECT 1 + 1 FROM ImaginaryTableWithOneRow which returns a single row with a single column with the value 2.
Mostly we don't think about this, we just get used to the fact that bare SELECTs give results and don't even think about the fact that there must be some one-row thing selected to return one row.
In doing SELECT COUNT(*) you did the equivalent of SELECT COUNT(*) FROM ImaginaryTableWithOneRow which of course returns 1.
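This implicit one-row source can be observed directly in SQLite, which also allows a bare SELECT. A quick sketch using Python's sqlite3 module (the Python wrapper is incidental; the point is the SQL behavior):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A bare SELECT is evaluated against an implicit one-row, zero-column source,
# so a scalar expression and an aggregate both "see" exactly one row.
print(conn.execute("SELECT 1 + 1").fetchone())     # (2,)
print(conn.execute("SELECT COUNT(*)").fetchone())  # (1,)
```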
Along similar lines the following also returns a result.
SELECT 'test'
WHERE EXISTS (SELECT *)
The explanation for that behavior (from this Connect item) also applies to your question.
In ANSI SQL, a SELECT statement without FROM clause is not permitted -
you need to specify a table source. So the statement "SELECT 'test'
WHERE EXISTS(SELECT *)" should give syntax error. This is the correct
behavior.
With respect to the SQL Server implementation, the FROM
clause is optional and it has always worked this way. So you can do
"SELECT 1" or "SELECT #v" and so on without requiring a table. In
other database systems, there is a dummy table called "DUAL" with one
row that is used to do such SELECT statements like "SELECT 1 FROM
dual;" or "SELECT #v FROM dual;". Now, coming to the EXISTS clause -
the project list doesn't matter in terms of the syntax or result of
the query and SELECT * is valid in a sub-query. Couple this with the
fact that we allow SELECT without FROM, you get the behavior that you
see. We could fix it but there is not much value in doing it and it
might break existing application code.
It's because you have executed select count(*) without specifying a table.
The count function returns the number of rows in the specified dataset. If you don't specify a table to select from, a single select will only ever return a single row - therefore count(*) will return 1. (In some versions of SQL, such as Oracle, you have to specify a table or similar database object; Oracle includes a dummy table (called DUAL) which can be selected from when no specific table is required.)
You wouldn't normally execute a SELECT COUNT(*) without specifying a table to query against. Your database server is probably giving you a count of "1" based on a default system table it is querying.
Try using
select count(*) from Table1
Without a table name it makes no sense.
Without a table name, it always returns 1, whatever the database.
Since this is tagged SQL server, the MSDN states.
COUNT always returns an int data type value.
Also,
COUNT(*) returns the number of items in a group. This includes NULL
values and duplicates.
Thus, since you didn't provide a table to do a COUNT from, the default (assumption) is that it returns a 1.
The COUNT function returns the number of rows in its input. If you don't specify any table, it returns 1 by default; i.e., COUNT(*), COUNT(1), COUNT(2), ... will always return 1.
Select *
without a from clause is "Select ALL from the Universe" since you have filtered out nothing.
In your case, you are asking "How many universes?"
This is exactly how I would teach it. I would write on the board on the first day,
Select * and ask what it means. Answer: Give me the world.
And from there I would teach how to filter the universe down to something meaningful.
I must admit, I never thought of Select Count(*), which would make it more interesting but still brings back a true answer. We have only one world.
Without consulting Stephen Hawking, SQL will have to contend with only 1.
The result of the query is correct.

In SQL, are there non-aggregate min / max operators

Is there something like
select max(val,0)
from table
I'm NOT looking to find the maximum value of the entire table
There has to be an easier way than this right?
select case when val > 0 then val else 0 end
from table
EDIT: I'm using Microsoft SQL Server
The functions GREATEST and LEAST are not standard SQL but are available in many RDBMSs (e.g., PostgreSQL). So
SELECT GREATEST(val, 0) FROM mytable;
Not in SQL per se. But many database engines define a set of functions you can use in SQL statements. Unfortunately, they generally use different names and arguments list.
In MySQL, the function is GREATEST. In SQLite, it's MAX (with one argument it is the familiar aggregate; with two or more it is a scalar function returning the greatest of its arguments).
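As a quick illustration of SQLite's two-argument scalar MAX, here is a sketch run through Python's sqlite3 module (the table and values are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(-5,), (3,), (0,)])
# With two arguments, SQLite's MAX is a per-row scalar function
# (the equivalent of GREATEST elsewhere), not an aggregate.
rows = conn.execute("SELECT MAX(val, 0) FROM t ORDER BY rowid").fetchall()
print(rows)  # [(0,), (3,), (0,)]
```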
Make it a set based JOIN?
SELECT
max(val)
FROM
(
select val
from table
UNION ALL
select 0
) foo
This avoids the scalar UDF suggested in the other question, which may suit your case better.
What you have is not even valid SQL - that's not how MAX works. In T-SQL, MAX aggregates over a range, returning the maximum value. What you want is simply the greater of two values.
Read this for more info:
Is there a Max function in SQL Server that takes two values like Math.Max in .NET?

How does SQL choose which row to show when grouping multiple rows together?

Consider the following table:
CREATE TABLE t
(
a INTEGER NOT NULL,
b INTEGER NOT NULL,
c INTEGER,
PRIMARY KEY (a, b)
)
Now if I do this:
SELECT a,b,c FROM t GROUP BY a;
I expect to get each distinct value of a only once. But since I'm asking for b and c as well, it's going to give me a row for every value of a. Therefore, if, for a single value of a, there are many rows to choose from, how can I predict which row SQL will choose? My tests show that it chooses to return the row for which b is the greatest. But what is the logic in that? How would this apply to strings, or blobs, or dates, or anything else?
My question: How does SQL choose which row to show when grouping multiple rows together?
btw: My particular problem concerns SQLITE3, but I'm guessing this is an SQL issue not dependent of the DBMS...
That shouldn't actually work in a decent DBMS :-)
Any column not used in the group by clause should be subject to an aggregation function, such as:
select a, max(b), sum(c) from t group by a
If it doesn't complain in SQLite (and I have no immediate reason to doubt you), I'd just put it down to the way the DBMS is built. From memory, there's a few areas where it doesn't worry too much about the "purity" of the data (such as every column being able to hold multiple types, the type belonging to the data in that row/column intersect rather than the column specification).
All the SQL engines that I know will complain about the query that you mentioned with an error message like "b and c appear in the field list but not in the group by list". You are only allowed to use b or c in an aggregate function (like MAX / MIN / COUNT / AVG whatever) or you'll be forced to add them in the GROUP BY list.
You're not quite correct in your assumption that this is RDBMS-independent. Most RDBMSs don't allow selecting fields that are not also in the GROUP BY clause. Exceptions to this (to my knowledge) are SQLite and MySQL. In general, you shouldn't do this, because the values for b and c are chosen essentially arbitrarily (depending on the applied grouping algorithm). Even if this behavior is documented in your database, it's always better to express a query in a way that fully and unambiguously specifies the outcome.
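The permissive SQLite behavior is easy to confirm with the question's own schema. In this sketch (via Python's sqlite3 module) the query is accepted without complaint, but note that only the grouping column is predictable; which b and c come back for each group is unspecified, which is exactly why the assertion below checks only the a values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER NOT NULL, b INTEGER NOT NULL, c INTEGER,
                    PRIMARY KEY (a, b));
    INSERT INTO t VALUES (1, 1, 10), (1, 2, 20), (2, 1, 30);
""")
# SQLite accepts the bare columns b and c without complaint...
rows = conn.execute("SELECT a, b, c FROM t GROUP BY a ORDER BY a").fetchall()
# ...but b and c come from an unspecified row of each group,
# so only the grouping column a can be relied upon.
print([r[0] for r in rows])  # [1, 2]
```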
It's not a matter of what the database will choose, but the order your data are going to be returned.
Your primary key is handling your sort order by default since you didn't provide one.
You can use Order By a, c if that's what you want.

Subtracting minimum value from all values in a column

Is there another way to subtract the smallest value from all the values of a column, effectively offsetting the values?
The only way I have found becomes horribly complicated for more complex queries.
CREATE TABLE offsettest(value NUMBER);
INSERT INTO offsettest VALUES(100);
INSERT INTO offsettest VALUES(200);
INSERT INTO offsettest VALUES(300);
INSERT INTO offsettest VALUES(400);
SELECT value - (SELECT MIN(value) FROM offsettest) FROM offsettest;
DROP TABLE offsettest;
I'd like to limit it to a single query (no stored procedures, variables, etc) if possible and standard SQL is preferred (although I am using Oracle).
I believe this works as of ANSI 1999.
SELECT value - MIN(value) OVER() FROM offsettest;
It would have helped to see your actual query, though, since depending on whether you need to manipulate more than one column this way, and whether the various minimums come from different rows, there may be more efficient ways to do it. If the OVER() works for you, then fine.
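Although the question is about Oracle, the same window-function query also runs on SQLite 3.25 or later. A minimal sketch of the accepted answer, run through Python's sqlite3 module (SQLite treats the Oracle-style NUMBER type as numeric affinity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offsettest (value NUMBER)")
conn.executemany("INSERT INTO offsettest VALUES (?)",
                 [(100,), (200,), (300,), (400,)])
# MIN(value) OVER () computes the global minimum once per row,
# without collapsing the rows the way a plain MIN(...) aggregate would.
rows = conn.execute(
    "SELECT value - MIN(value) OVER () FROM offsettest ORDER BY value"
).fetchall()
print(rows)  # [(0,), (100,), (200,), (300,)]
```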