ORDER BY in a partition - SELECT keyword - sql

Please see the DDL below:
create table #names (name varchar(20), Gender char(1))
insert into #names VALUES ('Ian', 'M')
insert into #names values ('Marie', 'F')
insert into #names values ('andy', 'F')
insert into #names values ('karen', 'F')
and the SQL below:
select row_number() over (order by (select null)) from #names
This adds a unique number to each row. I could also do this (which does not give each row a unique number, because the numbering restarts for each gender):
select row_number() over (partition by gender order by (name)) from #names
Why do you not need 'SELECT name' in the second query, yet you do need 'SELECT null' in the first?

As far as I can tell, this is just a quirk of SQL Server. SQL Server does not permit constants in ORDER BY (nor in GROUP BY, which can occur in other contexts).
Probably the origin of this is the ORDER BY clause in a SELECT statement:
ORDER BY 1
where "1" is a column reference rather than a constant. To prevent confusion, (I am guessing), the designers of the language do not allow other constants there. After all, would ORDER BY 2 + 1 refer to the third column? To the sum of the values in the two columns? To the constant 3?
I think this was just carried over into the window function syntax. There is a way around it -- as you have seen -- by using a subquery. The following should also work:
ROW_NUMBER() OVER (ORDER BY (CASE WHEN NAME = NULL THEN 'Never Happens' ELSE 'Always' END))
Because a column is mentioned, this is permitted. But, = NULL never returns true, so a constant is used for the sorting. I use the SELECT NULL subquery, however.
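To make the difference between the two queries in the question concrete, here is a minimal sketch that runs both against the same four rows. It uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for SQL Server (an assumption for the demo; it needs SQLite 3.25 or later for window functions), so it shows the numbering behaviour only, not SQL Server's rule about constants.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server here (an assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (name TEXT, gender TEXT);
    INSERT INTO names VALUES ('Ian', 'M'), ('Marie', 'F'), ('andy', 'F'), ('karen', 'F');
""")

# No meaningful ordering: every row still receives a distinct number.
unpartitioned = [row[0] for row in conn.execute(
    "SELECT row_number() OVER (ORDER BY (SELECT NULL)) FROM names")]

# Numbering restarts within each gender, so values repeat across partitions.
partitioned = conn.execute(
    "SELECT gender, name, "
    "       row_number() OVER (PARTITION BY gender ORDER BY name) "
    "FROM names ORDER BY gender, name").fetchall()

print(sorted(unpartitioned))
print(partitioned)
```

The first list contains each of 1 to 4 exactly once, while the partitioned result reuses 1 for the first row of every gender.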

The Order By clause has 4 basic syntax structures.
Specifying a single column defined in the select list
Specifying a column that is not defined in the select list
Specifying an alias as the sort column
Specifying an expression as the sort column
You can review the MSDN documentation here.
https://msdn.microsoft.com/en-us/library/ms188385.aspx
I believe that your SELECT NULL, or really any constant that you want to specify in your order by clause, would require the select because the database engine is evaluating the constant as a #4 structure, an expression. As proof, in my query example, I have used a COUNT(*) in lieu of your select null.
I believe when you specify Name or Gender in your order by clause, you are actually using a different order by structure, possibly #1. Here is my proof from the execution plan and results of your corrected first and original second query's sort operation. I have removed the partition because it's not relevant to our discussion.
select row_number() over (order by (select null)),name from #names
select row_number() over (order by name),name from #names
select row_number() over (order by Gender),name from #names
I have attached the execution plans for the three queries.
As you can see, no Sort Operation is performed on the data that is passed to the Segment operator which handles the window function. This is mirrored in the results of these queries, also pictured below.
So basically, SQL Server ignored (did not sort on) your Order By sub-query, because it could not associate the value returned by the sub-query with a particular parent column using method #4. The reason that you do not specify "SELECT name" in your Order By clause is that ordering by a column uses a different Order By syntax structure.

Related

Using calculation with an aliased column in ORDER BY

As we all know, the ORDER BY clause is processed after the SELECT clause, so a column alias in the SELECT clause can be used.
However, I find that I can’t use the aliased column in a calculation in the ORDER BY clause.
WITH data AS(
SELECT *
FROM (VALUES
('apple'),
('banana'),
('cherry'),
('date')
) AS x(item)
)
SELECT item AS s
FROM data
-- ORDER BY s; -- OK
-- ORDER BY item + ''; -- OK
ORDER BY s + ''; -- Fails
I know there are alternative ways of doing this particular query, and I know that this is a trivial calculation, but I’m interested in why the column alias doesn’t work when in a calculation.
I have tested in PostgreSQL, MariaDB, SQLite and Oracle, and it works as expected. SQL Server appears to be the odd one out.
The documentation clearly states that:
The column names referenced in the ORDER BY clause must correspond to
either a column or column alias in the select list or to a column
defined in a table specified in the FROM clause without any
ambiguities. If the ORDER BY clause references a column alias from
the select list, the column alias must be used standalone, and not as
a part of some expression in ORDER BY clause:
Technically speaking, your query should work, since the order by clause is logically evaluated after the select clause and should have access to all expressions declared in the select clause. But without access to the SQL specs I cannot comment on whether this is a limitation of SQL Server or whether the other RDBMSs implement it as a bonus feature.
Anyway, you can use CROSS APPLY as a trick.... it is part of FROM clause so the expressions should be available in all subsequent clauses:
SELECT item
FROM t
CROSS APPLY (SELECT item + '') AS CA(item_for_sort)
ORDER BY item_for_sort
It is simply due to the way expressions are evaluated. A more illustrative example:
;WITH data AS
(
SELECT * FROM (VALUES('apple'),('banana')) AS sq(item)
)
SELECT item AS s
FROM data
ORDER BY CASE WHEN 1 = 1 THEN s END;
This returns the same Invalid column name error. The CASE expression (and the concatenation of s + '' in the simpler case) is evaluated before the alias in the select list is resolved.
One workaround for your simpler case is to append the empty string in the select list:
SELECT
item + '' AS s
...
ORDER BY s;
There are more complex ways, like using a derived table or CTE:
;WITH data AS
(
SELECT * FROM (VALUES('apple'),('banana')) AS sq(item)
),
step2 AS
(
SELECT item AS s FROM data
)
SELECT s FROM step2 ORDER BY s+'';
This is just the way that SQL Server works, and I think you could say "well SQL Server is bad because of this" but SQL Server could also say "what the heck is this use case?" :-)
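For anyone who wants to try the derived-table workaround end to end, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module (an assumption for the demo; SQLite accepts the alias in an expression directly, as the question notes), so it only illustrates the shape of the portable form: define the alias in an inner query, then use it in an expression in the outer ORDER BY. The || operator replaces SQL Server's + for string concatenation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The alias s is defined in the inner CTE (step2), so the outer query
# sees it as an ordinary column and may use it inside an expression.
query = """
WITH data(item) AS (
    VALUES ('cherry'), ('apple'), ('date'), ('banana')
),
step2 AS (
    SELECT item AS s FROM data
)
SELECT s
FROM step2
ORDER BY s || ''
"""

sorted_items = [row[0] for row in conn.execute(query)]
print(sorted_items)
```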

Sybase: HAVING operates on rows?

I've come across the following SYBASE SQL:
-- Setup first
create table #t (id int, ts int)
go
insert into #t values (1, 2)
insert into #t values (1, 10)
insert into #t values (1, 20)
insert into #t values (1, 30)
insert into #t values (2, 5)
insert into #t values (2, 13)
insert into #t values (2, 25)
go
declare @time int select @time=11
-- This is the SQL I am asking about
select * from (select * from #t where ts <= @time) t group by id having ts = max(ts)
go
The results of this SQL are
id ts
----------- -----------
1 10
2 5
This looks like the HAVING condition is applied to rows rather than groups. Can someone please point me at a place in the Sybase 15.5 documentation where this case is described? All I see is "HAVING operates on groups". The closest I see in the docs is:
The having clause can include columns or expressions that are not in
the select list and not in the group by clause.
(Quote from here).
However, they don't exactly explain what happens when you do that.
My understanding: Yes, fundamentally, HAVING operates on rows. By omitting a GROUP BY, it operates on all result rows within a single "supergroup" rather than on rows-within-groups. Read the section "How group by and having queries with aggregates work" in your originally-linked Sybase docco:-
How group by and having queries with aggregates work
The where clause excludes rows that do not meet its search conditions; its function remains the same for grouped or nongrouped queries.
The group by clause collects the remaining rows into one group for each unique value in the group by expression. Omitting group by creates a single group for the whole table.
Aggregate functions specified in the select list calculate summary values for each group. For scalar aggregates, there is only one value for the table. Vector aggregates calculate values for the distinct groups.
The having clause excludes groups from the results that do not meet its search conditions. Even though the having clause tests only rows, the presence or absence of a group by clause may make it appear to be operating on groups:
When the query includes group by, having excludes result group rows. This is why having seems to operate on groups.
When the query has no group by, having excludes result rows from the (single-group) table. This is why having seems to operate on rows (the results are similar to where clause results).
Secondly, a brief summary appears in the section "How the having, group by, and where clauses interact":-
How the having, group by, and where clauses interact
When you include the having, group by, and where clauses in a query, the sequence in which each clause affects the rows determines the final results:
The where clause excludes rows that do not meet its search conditions.
The group by clause collects the remaining rows into one group for each unique value in the group by expression.
Aggregate functions specified in the select list calculate summary values for each group.
The having clause excludes rows from the final results that do not meet its search conditions.
@SQLGuru's explanation is an illustration of this.
Edit...
On a related point, I was surprised by the behaviour of non-ANSI-conforming queries that utilise TSQL "extended columns". Sybase handles the extended columns (i) after the WHERE clause (ii) by creating extra joins to the original tables and (iii) the WHERE clause is not used in the join. Such queries might return more rows than expected and the HAVING clause then requires additional conditions to filter these out.
See examples b, c and d under "Transact-SQL extensions to group by and having" on the page of your originally-linked docco. I found it useful to install the pubs2 sample database from Sybase to play along with the examples.
I haven't done Sybase since it shared code with MS SQL Server....90's, but my interpretation of what you are doing is this:
First, the list is filtered to <= 11
id ts
1 2
1 10
2 5
Everything else is filtered out.
Next, you are filtering the list to the rows where TS = the Max(TS) for that group.
id ts
1 10
2 5
10 is the Max(TS) for group 1 and 5 is the Max(TS) for group 2. Those two rows are the ones that remain. What result would you expect otherwise?
If you read the documentation here, it seems that Sybase use of columns in the having clause that don't appear in the group by clause is different from MySQL.
The example they give has this explanation:
The Transact-SQL extended column, price (in the select list, but not
an aggregate and not in the group by clause), causes all qualified
rows to display in each qualified group, even though a standard group
by clause produces a single row per group. The group by still affects
the vector aggregate, which computes the average price per group
displayed on each row of each group (they are the same values that
were computed for example a):
So, ts = max(ts) essentially does this:
select *
from (select t.*,
max(ts) over (partition by id) as maxts
from #t
where ts <= @time
) t
where ts = maxts
The subquery is important, because the where clause gets used for the max() calculation and all rows would be returned.
I find this behavior rather confusing and non-standard. I would replace it with more typical constructs. These are about the same level of complexity and seem clearer to a larger audience.
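As a rough illustration of one such "more typical construct", here is the window-function form run against the question's data. This is a sketch on SQLite via Python's sqlite3 module (an assumption, since no Sybase instance is involved; it needs SQLite 3.25 or later); the table name t and the cutoff parameter stand in for #t and @time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, ts INTEGER);
    INSERT INTO t VALUES (1, 2), (1, 10), (1, 20), (1, 30),
                         (2, 5), (2, 13), (2, 25);
""")

cutoff = 11  # plays the role of @time in the original query

# Filter first, then compute the per-id maximum over the remaining rows,
# and keep only the rows that carry that maximum.
latest = conn.execute("""
    SELECT id, ts
    FROM (
        SELECT id, ts, max(ts) OVER (PARTITION BY id) AS maxts
        FROM t
        WHERE ts <= ?
    ) AS filtered
    WHERE ts = maxts
    ORDER BY id
""", (cutoff,)).fetchall()

print(latest)
```

This gives the same two rows as the Sybase query in the question: (1, 10) and (2, 5).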

Joining Two Same-Sized Resultsets by Row Number

I have two table functions that return a single column each. One function is guaranteed to return the same number of rows as the other.
I want to insert the values into a new two-column table. One column will receive the value from the first udf, the second column from the second udf. The order of the inserts will be the order in which the rows are returned by the udfs.
How can I JOIN these two udfs given that they do not share a common key? I've tried using a ROW_NUMBER() but can't quite figure it out:
INSERT INTO dbo.NewTwoColumnTable (Column1, Column2)
SELECT udf1.[value], udf2.[value]
FROM dbo.udf1() udf1
INNER JOIN dbo.udf2() udf2 ON ??? = ???
This will not help you directly, but SQL does not guarantee row order unless you ask for it explicitly. The rows may come back in the order you expect for a given set, but with set-based results that is fundamentally not guaranteed. You probably want the UDF to return a key if the data is associated with something that defines the order.
Despite this, you can do the following:
declare @val int
set @val=1;
Select Val1,Val2 from
(select Value as Val1, ROW_NUMBER() over (order by @val) r from dbo.udf1()) a
join
(select Value as Val2, ROW_NUMBER() over (order by @val) r from dbo.udf2()) b
on a.r=b.r
The variable addresses the issue of needing a column to sort by.
If you have the privileges to edit the UDF, I think the better practice is to sort the data inside the UDF, and then you can add ident int identity(1,1) to your output table in the udf, which makes the order explicit.
The reason this might matter is if your server decided to split the udf results into two packets. If the two arrive out of the order you expected, SQL could return them in the order received, which breaks the assumption that the UDF will return rows in order. This may not be an issue, but if the result is needed later for a real system, proper programming here prevents unexpected bugs later.
In SQL, the "order returned by the udfs" is not guaranteed to persist (even between calls).
Try this:
WITH q1 AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY whatever1) rn
FROM udf1()
),
q2 AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY whatever2) rn
FROM udf2()
)
INSERT
INTO dbo.NewTwoColumnTable (Column1, Column2)
SELECT q1.value, q2.value
FROM q1
JOIN q2
ON q2.rn = q1.rn
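Here is a small runnable sketch of that pattern. It is an illustration only: it uses SQLite through Python's sqlite3 module, the tables udf1_rows and udf2_rows are hypothetical stand-ins for the two udfs, and the seq column is a hypothetical sort key playing the role of whatever1 and whatever2 (window functions need SQLite 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the rows returned by udf1() and udf2().
    CREATE TABLE udf1_rows (seq INTEGER, value TEXT);
    CREATE TABLE udf2_rows (seq INTEGER, value TEXT);
    INSERT INTO udf1_rows VALUES (30, 'c'), (10, 'a'), (20, 'b');
    INSERT INTO udf2_rows VALUES (9, 'z'), (7, 'x'), (8, 'y');

    CREATE TABLE NewTwoColumnTable (Column1 TEXT, Column2 TEXT);

    -- Number each side by its own explicit ordering, then pair equal numbers.
    WITH q1 AS (
        SELECT value, row_number() OVER (ORDER BY seq) AS rn FROM udf1_rows
    ),
    q2 AS (
        SELECT value, row_number() OVER (ORDER BY seq) AS rn FROM udf2_rows
    )
    INSERT INTO NewTwoColumnTable (Column1, Column2)
    SELECT q1.value, q2.value
    FROM q1
    JOIN q2 ON q2.rn = q1.rn;
""")

pairs = conn.execute(
    "SELECT Column1, Column2 FROM NewTwoColumnTable ORDER BY Column1").fetchall()
print(pairs)
```

The pairing is only as reliable as the ORDER BY inside each ROW_NUMBER(); without a real sort key on both sides the match between rows is arbitrary.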
PostgreSQL 9.4+ can append a bigint (INT8) column to the end of a set-returning function's result using the WITH ORDINALITY suffix:
-- set returning function WITH ORDINALITY
SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
ls | n
-----------------+----
pg_serial | 1
pg_twophase | 2
postmaster.opts | 3
pg_notify | 4
official doc: http://www.postgresql.org/docs/devel/static/functions-srf.html
related blog post: http://michael.otacoo.com/postgresql-2/postgres-9-4-feature-highlight-with-ordinality/

SQL Having on columns not in SELECT

I have a table with 3 columns:
userid mac_address count
The entries for one user could look like this:
57193 001122334455 42
57193 000C6ED211E6 15
57193 FFFFFFFFFFFF 2
I want to create a view that displays only those MAC's that are considered "commonly used" for this user. For example, I want to filter out the MAC's that are used <10% compared to the most used MAC-address for that user. Furthermore I want 1 row per user. This could easily be achieved with a GROUP BY, HAVING & GROUP_CONCAT:
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
And indeed, the result is as follows:
57193 001122334455,000C6ED211E6 42
However I really don't want the count-column in my view. But if I take it out of the SELECT statement, I get the following error:
#1054 - Unknown column 'count' in 'having clause'
Is there any way I can perform this operation without being forced to have a nasty count-column in my view? I know I can probably do it using inner queries, but I would like to avoid doing that for performance reasons.
Your help is very much appreciated!
As HAVING explicitly refers to the column names in the select list, what you want is not possible directly.
However, you can use your select as a subselect to a select that returns only the rows you want to have.
SELECT a.userid, a.macs
FROM
(
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
) as a
UPDATE:
Because of a limitation of MySQL this is not possible, although it works in other DBMS like Oracle.
One solution would be to create a view for the subquery. Another solution seems cleaner:
CREATE VIEW YOUR_VIEW (userid, macs) AS
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
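This is meant to declare the view as returning only the columns userid and macs, although the underlying SELECT statement returns more columns than those two.
I am not sure whether MySQL supports this, though. Its CREATE VIEW documentation requires the column list to contain the same number of names as the SELECT returns columns, so the statement may be rejected as written.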
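For comparison, here is a sketch of the inner-query form the question hoped to avoid, which keeps the count column out of the result entirely. It is run on SQLite via Python's sqlite3 module (an assumption for the demo; SQLite's group_concat(x, ',') replaces MySQL's GROUP_CONCAT(x SEPARATOR ',')), and the alias names top and max_count are made up for the example. It says nothing about performance on a real MySQL table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mactable (userid INTEGER, mac_address TEXT, count INTEGER);
    INSERT INTO mactable VALUES
        (57193, '001122334455', 42),
        (57193, '000C6ED211E6', 15),
        (57193, 'FFFFFFFFFFFF', 2);
""")

# Compute each user's highest count once, filter the individual rows
# against it, and only then collapse to one row per user.
rows = conn.execute("""
    SELECT m.userid, group_concat(m.mac_address, ',') AS macs
    FROM mactable AS m
    JOIN (
        SELECT userid, MAX(count) AS max_count
        FROM mactable
        GROUP BY userid
    ) AS top ON top.userid = m.userid
    WHERE m.count * 10 >= top.max_count
    GROUP BY m.userid
""").fetchall()

print(rows)
```

The MAC with count 2 is dropped (2 * 10 is below 42) and the other two are concatenated; the order of the addresses inside macs is not guaranteed.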

SQLServer SQL query with a row counter

I have a SQL query, that returns a set of rows:
SELECT id, name FROM users where group = 2
I need to also include a column that has an incrementing integer value, so the first row needs to have a 1 in the counter column, the second a 2, the third a 3 etc
The query shown here is just a simplified example, in reality the query could be arbitrarily complex, with several joins and nested queries.
I know this could be achieved using a temporary table with an autonumber field, but is there a way of doing it within the query itself ?
For starters, something along the lines of:
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY my_order_column) AS Row_Counter
FROM my_table
However, it's important to note that the ROW_NUMBER() OVER (ORDER BY ...) construct only determines the values of Row_Counter, it doesn't guarantee the ordering of the results.
Unless the SELECT itself has an explicit ORDER BY clause, the results could be returned in any order, dependent on how SQL Server decides to optimise the query. (See this article for more info.)
The only way to guarantee that the results will always be returned in Row_Counter order is to apply exactly the same ordering to both the SELECT and the ROW_NUMBER():
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY my_order_column) AS Row_Counter
FROM my_table
ORDER BY my_order_column -- exact copy of the ordering used for Row_Counter
The above pattern will always return results in the correct order and works well for simple queries, but what about an "arbitrarily complex" query with perhaps dozens of expressions in the ORDER BY clause? In those situations I prefer something like this instead:
SELECT t.*
FROM
(
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY ...) AS Row_Counter -- complex ordering
FROM my_table
) AS t
ORDER BY t.Row_Counter
Using a nested query means that there's no need to duplicate the complicated ORDER BY clause, which means less clutter and easier maintenance. The outer ORDER BY t.Row_Counter also makes the intent of the query much clearer to your fellow developers.
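A quick way to see the nested form working is the sketch below. It uses SQLite through Python's sqlite3 module rather than SQL Server (an assumption for the demo; it needs SQLite 3.25 or later), with a made-up two-expression ordering standing in for the complex ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (my_first_column TEXT, my_second_column INTEGER);
    INSERT INTO my_table VALUES ('c', 3), ('a', 1), ('b', 2);
""")

# The ordering logic is written once, inside ROW_NUMBER(); the outer
# query sorts by the resulting counter instead of repeating it.
rows = conn.execute("""
    SELECT t.*
    FROM (
        SELECT my_first_column, my_second_column,
               row_number() OVER (
                   ORDER BY my_second_column DESC, my_first_column
               ) AS Row_Counter
        FROM my_table
    ) AS t
    ORDER BY t.Row_Counter
""").fetchall()

print(rows)
```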
In SQL Server 2005 and up, you can use the ROW_NUMBER() function, which has options for the sort order and the groups over which the counts are done (and reset).
The simplest way is to use a variable as a row counter (this is MySQL user-variable syntax). However, it takes two actual SQL commands: one to set the variable, and then the query, as follows:
SET @n=0;
SELECT @n:=@n+1, a.* FROM tablename a
Your query can be as complex as you like with joins etc. I usually make this a stored procedure. You can have all kinds of fun with the variable, even use it to calculate against field values. The key is the :=
Here's a different approach.
If you have several tables of data that are not joinable, or for some reason you don't want to count all the rows at the same time but you still want them to be part of the same row count, you can create a table that does the job for you.
Example:
create table #test (
rowcounter int identity,
invoicenumber varchar(30)
)
insert into #test(invoicenumber) select [column] from [Table1]
insert into #test(invoicenumber) select [column] from [Table2]
insert into #test(invoicenumber) select [column] from [Table3]
select * from #test
drop table #test