Why use Select Top 100 Percent? - sql

I understand that prior to SQL Server 2005, you could "trick" SQL Server to allow use of an order by in a view definition, by also include TOP 100 PERCENT in the SELECT clause. But I have seen other code which I have inherited which uses SELECT TOP 100 PERCENT ... within dynamic SQL statements (used in ADO in ASP.NET apps, etc). Is there any reason for this? Isn't the result the same as not including the TOP 100 PERCENT?

It was used for "intermediate materialization (Google search)"
Good article: Adam Machanic: Exploring the secrets of intermediate materialization
He even raised an MS Connect so it can be done in a cleaner fashion
My view is "not inherently bad", but don't use it unless 100% sure. The problem is, it works only at the time you do it and probably not later (patch level, schema, index, row counts etc)...
Worked example
This may fail because you don't know in which order things are evaluated
SELECT foo From MyTable WHERE ISNUMERIC (foo) = 1 AND CAST(foo AS int) > 100
And this may also fail because
SELECT foo
FROM
(SELECT foo From MyTable WHERE ISNUMERIC (foo) = 1) bar
WHERE
CAST(foo AS int) > 100
However, this did not in SQL Server 2000. The inner query is evaluated and spooled:
SELECT foo
FROM
(SELECT TOP 100 PERCENT foo From MyTable WHERE ISNUMERIC (foo) = 1 ORDER BY foo) bar
WHERE
CAST(foo AS int) > 100
Note, this still works in SQL Server 2005
SELECT TOP 2000000000 ... ORDER BY...

TOP (100) PERCENT is completely meaningless in recent versions of SQL Server, and it (along with the corresponding ORDER BY, in the case of a view definition or derived table) is ignored by the query processor.
You're correct that once upon a time, it could be used as a trick, but even then it wasn't reliable. Sadly, some of Microsoft's graphical tools put this meaningless clause in.
As for why this might appear in dynamic SQL, I have no idea. You're correct that there's no reason for it, and the result is the same without it (and again, in the case of a view definition or derived table, without both the TOP and ORDER BY clauses).

...allow use of an ORDER BY in a view definition.
That's not a good idea. A view should never have an ORDER BY defined.
An ORDER BY has an impact on performance - using it a view means that the ORDER BY will turn up in the explain plan. If you have a query where the view is joined to anything in the immediate query, or referenced in an inline view (CTE/subquery factoring) - the ORDER BY is always run prior to the final ORDER BY (assuming it was defined). There's no benefit to ordering rows that aren't the final result set when the query isn't using TOP (or LIMIT for MySQL/Postgres).
Consider:
CREATE VIEW my_view AS
SELECT i.item_id,
i.item_description,
it.item_type_description
FROM ITEMS i
JOIN ITEM_TYPES it ON it.item_type_id = i.item_type_id
ORDER BY i.item_description
...
SELECT t.item_id,
t.item_description,
t.item_type_description
FROM my_view t
ORDER BY t.item_type_description
...is the equivalent to using:
SELECT t.item_id,
t.item_description,
t.item_type_description
FROM (SELECT i.item_id,
i.item_description,
it.item_type_description
FROM ITEMS i
JOIN ITEM_TYPES it ON it.item_type_id = i.item_type_id
ORDER BY i.item_description) t
ORDER BY t.item_type_description
This is bad because:
The example is ordering the list initially by the item description, and then it's reordered based on the item type description. It's wasted resources in the first sort - running as is does not mean it's running: ORDER BY item_type_description, item_description
It's not obvious what the view is ordered by due to encapsulation. This does not mean you should create multiple views with different sort orders...

If there is no ORDER BY clause, then TOP 100 PERCENT is redundant. (As you mention, this was the 'trick' with views)
[Hopefully the optimizer will optimize this away.]

I have seen other code which I have inherited which uses SELECT TOP 100 PERCENT
The reason for this is simple: Enterprise Manager used to try to be helpful and format your code to include this for you. There was no point ever trying to remove it as it didn't really hurt anything and the next time you went to change it EM would insert it again.

No reason but indifference, I'd guess.
Such query strings are usually generated by a graphical query tool. The user joins a few tables, adds a filter, a sort order, and tests the results. Since the user may want to save the query as a view, the tool adds a TOP 100 PERCENT. In this case, though, the user copies the SQL into his code, parameterized the WHERE clause, and hides everything in a data access layer. Out of mind, out of sight.

Kindly try the below, Hope it will work for you.
SELECT TOP
( SELECT COUNT(foo)
From MyTable
WHERE ISNUMERIC (foo) = 1) *
FROM bar WITH(NOLOCK)
ORDER BY foo
WHERE CAST(foo AS int) > 100
)

The error says it all...
Msg 1033, Level 15, State 1, Procedure TestView, Line 5 The ORDER BY
clause is invalid in views, inline functions, derived tables,
subqueries, and common table expressions, unless TOP, OFFSET or FOR
XML is also specified.
Don't use TOP 100 PERCENT, use TOP n, where N is a number
The TOP 100 PERCENT (for reasons I don't know) is ignored by SQL Server VIEW (post 2012 versions), but I think MS kept it for syntax reasons. TOP n is better and will work inside a view and sort it the way you want when a view is used initially, but be careful.

I would suppose that you can use a variable in the result, but aside from getting the ORDER BY piece in a view, you will not see a benefit by implicitly stating "TOP 100 PERCENT":
declare #t int
set #t=100
select top (#t) percent * from tableOf

Just try this, it explains it pretty much itself. You can't create a view with an ORDER BY except if...
CREATE VIEW v_Test
AS
SELECT name
FROM sysobjects
ORDER BY name
GO
Msg 1033, Level 15, State 1, Procedure TestView, Line 5 The ORDER BY
clause is invalid in views, inline functions, derived tables,
subqueries, and common table expressions, unless TOP, OFFSET or FOR
XML is also specified.

Related

Using a function-generated column in the where clause

I have an SQL query, which calls a stored SQL function, I want to do this:
SELECT dbo.fn_is_current('''{round}''', r.fund_cd, r.rnd) as [current]
FROM BLAH
WHERE current = 1
The select works fine, however, it doesn't know "current". Even though (without the WHERE) the data it generates does have the "current" column, and it's correct.
So, I'm assuming that this is a notation issue.
You cannot use an alias from the select in the where clause (or even again in the same select). Just use a subquery:
SELECT t.*
FROM (SELECT dbo.fn_is_current('''{round}''', r.fund_cd, r.rnd) as [current]
FROM BLAH
) t
WHERE [current] = 1;
As a note: current is a very bad name for a column because it is a reserved word (in many databases at least, including SQL Server). The word is used when defining cursors. Use something else, such as currval.

sql function column not available for where clause

I have a query:
SELECT ROW_NUMBER() OVER(ORDER BY LogId) AS RowNum
FROM [Log] l
where RowNum = 1
and I'm getting the following error:
Invalid column name 'RowNum'.
I did some search here and found that column aliasing is not available in WHERE.
so I tried the the following and it worked:
select *
from
(
SELECT ROW_NUMBER() OVER(ORDER BY LogId) AS RowNum
FROM [Log] l
) as t
where t.RowNum = 1
Is there a better way, from performance point of view, to make this query?
Thanks in advance.
That's just the way it is.
Column aliases can not be used on the same logical level where they were defined. You will have to use the derived table (sub-query) as you have found out.
If you are concerned about performance, then don't. The derived table is mere syntactical sugar, it won't make the query slower (compared to the solution you tried first).
An alternative to this specific query, which won't perform any different but is simpler to write:
SELECT TOP 1 <col list> FROM dbo.[Log] ORDER BY LogId;
As #a_horse explained, don't be concerned that because your second query looks like more code that it is more expensive. If you want to measure the efficiency of different queries that get the same results, compare their execution plans, not code complexity.

SQL pattern to get "and" list of multiple-row matches?

I'm not a database programmer, but I have a simple database-backed app where I have items with tags. Each item may have multiple tags, so I'm using a typical junction table (like this), where each row represents the fact that the item with the appropriate ID has the tag with the appropriate ID.
This works very logically when I want to do something like select all items with a given tag.
But, what is the typical pattern for doing AND searches? That is, what if I want to find all items which have all of a certain set of tags? This is such a common operation that I'd think some of the intro tutorials would cover it, but I guess I'm not looking in the right places.
The approach I tried was to use INTERSECT, first directly and then with subqueries and IN. This works, but builds up long-seeming queries quickly as I add search terms. And, crucially, this approach appears to be about an order of magnitude slower than the approach of shoving all the tags as text into one "tags" column and using SQLite's full-text search. (And, as I would expect/hope, the FTS search gets faster as I add more terms, which doesn't seem to be the case with the INTERSECTS approach.)
What's the proper design pattern here, and what's the right way to make it snappy? I'm using SQLite in this case, but I'm most interested in a general answer, since this must be a common thing to do.
The following is the standard ANSI SQL solution which avoids synchronizing the number of ids and the ids themselves.
with tag_ids (tid) as (
values (1), (2)
)
select id
from tags
where id (select tid from tag_ids)
having count(*) = (select count(*) from tag_ids);
The values clause ("row constructor") is supported by PostgreSQL and DB2. For database that don't support that, you can replace it with a simple "select", e.g. in Oracle this would be:
with tag_ids (tid) as (
select 1 as tid from dual
union all
select 2 from dual
)
select id
from tags
where id (select tid from tag_ids)
having count(*) = (select count(*) from tag_ids);
For SQL Server you would simply leave out the "from dual", as it does not require a FROM clause for a SELECT.
This assumes that one tag can only be assigned exactly once. If that isn't the case, you would need to use a count(distinct id) in the having clause.
I would be inclined to use a group by:
select id
from tags
where id in (<tag1>, <tag2>)
group by id
having count(*) = 2
This would guarantee that both appear.
For an unlimited size list, you could store the ids in a string, such as '|tag1|tag2|tag3|' (note delimiters on ends). Then you can do:
select id
from tags
where #taglist like '%|'+tag+'|%'
group by id
having count(*) = len(#taglist) - (len(replace(#taglist, '|', '') - 1)
This is using SQL Server syntax. But, it is saying two things. The WHERE clause is saying that the tag is in the list. The HAVING clause is saying that the number of matches equals the length of the list. It does this with a trick, by counting the number of separtors and subtracting 1.

Why can't I GROUP BY 1 when it's OK to ORDER BY 1?

Why are column ordinals legal for ORDER BY but not for GROUP BY? That is, can anyone tell me why this query
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY OrgUnitID
cannot be written as
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY 1
When it's perfectly legal to write a query like
SELECT OrgUnitID FROM Employee AS e ORDER BY 1
?
I'm really wondering if there's something subtle about the relational calculus, or something, that would prevent the grouping from working right.
The thing is, my example is pretty trivial. It's common that the column that I want to group by is actually a calculation, and having to repeat the exact same calculation in the GROUP BY is (a) annoying and (b) makes errors during maintenance much more likely. Here's a simple example:
SELECT DATEPART(YEAR,LastSeenOn), COUNT(*)
FROM Employee AS e
GROUP BY DATEPART(YEAR,LastSeenOn)
I would think that SQL's rule of normalize to only represent data once in the database ought to extend to code as well. I'd want to only right that calculation expression once (in the SELECT column list), and be able to refer to it by ordinal in the GROUP BY.
Clarification: I'm specifically working on SQL Server 2008, but I wonder about an overall answer nonetheless.
One of the reasons is because ORDER BY is the last thing that runs in a SQL Query, here is the order of operations
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
so once you have the columns from the SELECT clause you can use ordinal positioning
EDIT, added this based on the comment
Take this for example
create table test (a int, b int)
insert test values(1,2)
go
The query below will parse without a problem, it won't run
select a as b, b as a
from test
order by 6
here is the error
Msg 108, Level 16, State 1, Line 3
The ORDER BY position number 6 is out of range of the number of items in the select list.
This also parses fine
select a as b, b as a
from test
group by 1
But it blows up with this error
Msg 164, Level 15, State 1, Line 3
Each GROUP BY expression must contain at least one column that is not an outer reference.
There is a lot of elementary inconsistencies in SQL, and use of scalars is one of them. For example, anyone might expect
select * from countries
order by 1
and
select * from countries
order by 1.00001
to be a similar queries (the difference between the two can be made infinitesimally small, after all), which are not.
I'm not sure if the standard specifies if it is valid, but I believe it is implementation-dependent. I just tried your first example with one SQL engine, and it worked fine.
use aliasses :
SELECT DATEPART(YEAR,LastSeenOn) as 'seen_year', COUNT(*) as 'count'
FROM Employee AS e
GROUP BY 'seen_year'
** EDIT **
if GROUP BY alias is not allowed for you, here's a solution / workaround:
SELECT seen_year
, COUNT(*) AS Total
FROM (
SELECT DATEPART(YEAR,LastSeenOn) as seen_year, *
FROM Employee AS e
) AS inline_view
GROUP
BY seen_year
databases that don't support this basically are choosing not to. understand the order of the processing of the various steps, but it is very easy (as many databases have shown) to parse the sql, understand it, and apply the translation for you. Where its really a pain is when a column is a long case statement. having to repeat that in the group by clause is super annoying. yes, you can do the nested query work around as someone demonstrated above, but at this point it is just lack of care about your users to not support group by column numbers.

How to refer to a variable create in the course of executing a query in T-SQL, in the WHERE clause?

What the title says, really.
If I SELECT [statement] AS whatever, why can't I refer to the whatever column in the body of the WHERE clause? Is there some kind of a workaround? It's driving me crazy.
As far as I'm aware, you can't directly do this in SQL Server.
If you REALLY have to use your column alias in the WHERE clause, you can do this, but it seems like overkill to use a subquery just for the alias:
SELECT *
FROM
(
SELECT [YourColumn] AS YourAlias, etc...
FROM Whatever
) YourSubquery
WHERE YourAlias > 2
You're almost certainly better off just using the contents of the original column in your WHERE clause.
It has to do with the way a SELECT statement gets translated into an abstract query tree: the 'whatever' only appears in the query result projection part of the tree, which is above the filtering part of the tree, so the WHERE clause cannot understand the 'whatever'. This is not some internal implementation detail, it is a fundamental behavior of relational queries: the projection of the result occurs after the evaluation of the joins and filters.
IS really trivial to work around the 'problem' by making the hierarchy of the query explicit:
select ...
from (
select [something] as whatever
from ...
) as subquery
WHERE whatever = ...;
A common table expression can also server the same purpose:
with cte as (
select [something] as whatever
from ...)
select ... from cte
WHERE whatever = ...;
It's to do with the order of operations in the select statement. The WHERE clause is evaluated before the SELECT clause so this information isn't available. Although it is available in the ORDER BY clause as this is processed last.
As others have mentioned, a sub-query will get around this problem.