Avoid "SELECT TOP 1" and "ORDER BY" in Queries


I have a table in SQL Server 2008 with a lot of data
|ID|Name|Column_1|Column_2|
|..|....|........|........|
more than 18,000 records. I need to get the row with the lowest value of Column_1, which is a date but could be any data type (the table is unsorted), so I use this statement:
SELECT TOP 1 ID, Name from table ORDER BY Column_1 ASC
But this is very, very slow, and I don't think I need to sort the whole table. My question is how to get the same data without using TOP 1 and ORDER BY.

I cannot see why 18,000 rows would cause much of a slowdown, but that is obviously without seeing what data you are storing.
If you are going to query the Column_1 field regularly, I would suggest placing a non-clustered index on it; that will speed up your query.
You can do it by "designing" your table in SQL Server Management Studio, or directly via T-SQL:
CREATE INDEX IX_myTable_Column_1 ON myTable (Column_1 ASC)
More information about creating indexes is available on MSDN.
Update: thanks to comments by @GarethD, who helped me with this, as I wasn't actually aware of it.
As an extra part of the above T-SQL statement, you can speed up your queries further by including in the index the other columns that the query uses:
CREATE INDEX IX_myTable_Column_1 ON myTable (Column_1 ASC) INCLUDE (ID, Name)
As GarethD points out, using this SQLFiddle as proof, the execution plan is much cheaper because it avoids a "RID" (Row Identifier) lookup.
More information about creating indexes with included columns is available on MSDN.
Thank you @GarethD
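If you want to see the effect of a covering index without a SQL Server instance at hand, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration under some assumptions: SQLite has no INCLUDE clause, so the extra columns are simply appended to the index key, LIMIT 1 stands in for TOP 1, and the table name my_table and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table ("
    "id INTEGER PRIMARY KEY, name TEXT, column_1 TEXT, column_2 TEXT)"
)
# Hypothetical sample data; column_1 holds ISO dates in no particular order.
rows = [
    (1, "alpha", "2012-05-01", "x"),
    (2, "bravo", "2011-01-15", "y"),
    (3, "charlie", "2013-09-30", "z"),
]
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", rows)

# SQLite has no INCLUDE, so id and name are appended to the index key instead.
conn.execute("CREATE INDEX ix_my_table_column_1 ON my_table (column_1, id, name)")

query = "SELECT id, name FROM my_table ORDER BY column_1 ASC LIMIT 1"
# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan_text = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))
first_row = conn.execute(query).fetchone()

print(plan_text)  # should name the index and show no separate sort step
print(first_row)
```

The plan text is SQLite's, not SQL Server's, so treat it as a rough analogue of what the execution plan in Management Studio would tell you.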

Would this work faster? When I read this question, this was the code that came to mind:
Select top 1 ID, Name
from table
where Column_1 = (Select min(Column_1) from table)

Related

db2 10.5 multi-column index explanation

This is my first time working with indexes in a database, and so far I've learned that if you have a multi-column index such as index('col1', 'col2', 'col3') and you run a query that uses where col2='col2' and col3='col3', that index will not be used.
I also learned that if a column has very low selectivity, indexing it is useless.
However, from my tests, it seems neither of the above is true. Can someone explain more about this?
I have a table with more than 16 million records. Let's say claimID is the primary key, then there's a historynumber column that has only 3 distinct values (1, 2, 3), and a last column, storeNumber, that has about 1 million distinct values.
I have an index on claimID alone, another index(historynumber, claimID), another index(historynumber, storeNumber), and finally index(storeNumber, historynumber).
My guess was that if I do:
select * from my_table where claimId='123456' and historynumber = 1
would be much faster than
select * from my_table where historynumber = 1 and claimId = '123456'
However, the two have exactly the same performance (instant). So I thought the primary key index can work with any column order. Therefore, I tried the same thing on historynumber and storeNumber instead. The result was exactly the same. Then I started trying columns that have no indexes, and of course the result was the same again.
Finally, I ran
select * from my_table where historynumber = 1
and the query took so long I had to cancel it.
So my conclusion is that the column order in the where clause is completely irrelevant, and so is the column order in the index definition, since it seems the database is smart enough to tell which column has the highest selectivity.
Could someone give me an example that could prove otherwise?

Index explanation is a huge topic.
Don't worry about the sequence of different attributes in the SQL - it has no effect whether you specify
...where claimId='123456' and historynumber = 1
or the other way round. Each SQL statement is checked and optimized by the optimizer. To prove how the data gets accessed, you could run an EXPLAIN. Check the documentation for more details.
For your other problem
select * from my_table where historynumber = 1
with an index of (storeNumber, historynumber).
Have you ever tried to look up the name of a caller (having only the telephone number) in a telephone book?
Well, it is pretty much the same for an index, so the column order when creating the index matters!
There are techniques that can help, e.g. an index jump scan, but there is no guarantee.
Check out the following sites to learn a little bit more about DB2 indexes:
http://db2commerce.com/2013/09/19/db2-luw-basics-indexes/
http://use-the-index-luke.com/sql/where-clause/the-equals-operator/concatenated-keys
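The telephone book point can be tried out quickly. Below is a rough sketch using Python's built-in sqlite3 module rather than DB2, so the plan output and optimizer details are SQLite's, not DB2's. The snake_case column names, the extra note column, and the generated rows are assumptions for the demo. It builds the (storeNumber, historynumber) index from the question and asks the engine how it would run three queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table ("
    "claim_id INTEGER PRIMARY KEY, history_number INTEGER, "
    "store_number INTEGER, note TEXT)"
)
# Hypothetical data: 3 distinct history numbers, 1000 distinct store numbers.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [(i, i % 3 + 1, i % 1000, "n") for i in range(10000)],
)
conn.execute("CREATE INDEX ix_store_history ON my_table (store_number, history_number)")


def plan(sql):
    """Return the query plan detail text for a statement."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Filter on the leading index column: the index can be searched.
leading = plan("SELECT * FROM my_table WHERE store_number = 42")
# Filter only on the second index column: the engine falls back to a scan.
non_leading = plan("SELECT * FROM my_table WHERE history_number = 1")
# Predicate order in the WHERE clause does not matter to the optimizer.
swapped = plan("SELECT * FROM my_table WHERE history_number = 1 AND store_number = 42")

print(leading)
print(non_leading)
print(swapped)
```

On DB2 you would do the equivalent check with EXPLAIN, as mentioned above. The point is only that the order of predicates in the WHERE clause is free, while the order of columns in the index is not.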

How to create database INDEX for SQL expression?

I am a beginner with indexes. I want to create an index for this SQL expression, which takes too much time to execute, so I would like to know exactly which columns I should index.
I am using DB2, but never mind; I think the question is very general.
My SQL expression is:
select * from incident where (relatedtoglobal=1)
and globalticketid in (select ticketid from INCIDENT where status='RESOLVED')
and statusdate <='2012-10-09 12:12:12'
Should I create an index with these 5 columns, or how?
Thanks

Your query:
select *
from incident
where relatedtoglobal = 1
and globalticketid in ( select ticketid
from INCIDENT
where status='RESOLVED'
)
and statusdate <='2012-10-09 12:12:12' ;
And the subquery inside:
select ticketid
from INCIDENT
where status='RESOLVED'
An index on (status, ticketid) will certainly help the subquery evaluate efficiently, and thus the whole query.
For the outer query, besides the previous index, you'll need one more index. One on (relatedtoglobal, globalticketid) may be sufficient.
I'm not sure whether the DB2 engine would or could use more complex indexing, such as
one index on (relatedtoglobal, globalticketid) INCLUDE (statusdate), or
two indexes, one on (relatedtoglobal, globalticketid) and one on (relatedtoglobal, statusdate).
The DB2 documentation is not an easy read but has many details. Start with CREATE INDEX statement and Implementing Indexes.

Is there efficient SQL to query a portion of a large table

The typical way of selecting data is:
select * from my_table
But what if the table contains 10 million records and you only want records 300,010 to 300,020?
Is there a way to write a SQL statement on Microsoft SQL Server that only gets 10 records at once?
E.g.
select * from my_table from records 300,010 to 300,020
This would be way more efficient than retrieving 10 million records across the network, storing them in the IIS server and then counting to the records you want.

SELECT * FROM my_table is just the tip of the iceberg. Assuming you're talking a table with an identity field for the primary key, you can just say:
SELECT * FROM my_table WHERE ID >= 300010 AND ID <= 300020
You should also know that selecting * is considered poor practice in many circles. They want you to specify the exact column list.

Try looking at info about pagination. Here's a short summary of it for SQL Server.

Absolutely. On MySQL and PostgreSQL (the two databases I've used), the syntax would be
SELECT [columns] FROM table LIMIT 10 OFFSET 300010;
On MS SQL, it's something like SELECT TOP 10 ...; I don't know the syntax for offsetting the record list.
Note that you never want to use SELECT *; it's a maintenance nightmare if anything ever changes. This query, though, is going to be incredibly slow since your database will have to scan through and throw away the first 300,010 records to get to the 10 you want. It'll also be unpredictable, since you haven't told the database which order you want the records in.
This is the core of SQL: tell it which 10 records you want, identified by a key in a specific range, and the database will do its best to grab and return those records with minimal work. Look up any tutorial on SQL for more information on how it works.
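To make the difference between "skip N rows" and "give me a key range" concrete, here is a small sketch using Python's built-in sqlite3 module. It is SQLite, not SQL Server, so the LIMIT/OFFSET syntax differs from T-SQL, and the table contents, PAGE_SIZE, and last_seen_id are made-up values for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT)")
# Hypothetical data: ids 1..5000 with no gaps.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(i, "row %d" % i) for i in range(1, 5001)],
)

PAGE_SIZE = 10
last_seen_id = 3000  # hypothetical: the last id shown on the previous page

# Offset paging: the engine walks past the first 3000 rows before returning any.
offset_page = conn.execute(
    "SELECT id, payload FROM my_table ORDER BY id LIMIT ? OFFSET ?",
    (PAGE_SIZE, last_seen_id),
).fetchall()

# Key-range paging: the engine seeks straight to id > 3000 on the primary key.
keyset_page = conn.execute(
    "SELECT id, payload FROM my_table WHERE id > ? ORDER BY id LIMIT ?",
    (last_seen_id, PAGE_SIZE),
).fetchall()

print(keyset_page[0], keyset_page[-1])
```

The two pages match here only because the ids have no gaps. With deleted rows they would diverge, which is one reason to remember the last key you showed rather than a row count.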

When working with large tables, it is often a good idea to make use of the partitioning techniques available in SQL Server.
The rules of your partition function typically dictate that only a range of data can reside within a given partition. You could split your partitions by date range or ID, for example.
To select from a particular partition, you would use a query similar to the following:
SELECT <Column Name1>…/*
FROM <Table Name>
WHERE $PARTITION.<Partition Function Name>(<Column Name>) = <Partition Number>
Take a look at the following white paper for more detailed information on partitioning in SQL Server 2005.
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope this helps; please feel free to pose further questions.
Cheers, John

I use wrapper queries around the core query and then isolate just the row numbers that I want from it. This lets SQL Server do all the heavy lifting inside the CORE query and pass out only the small part of the table that I have requested. All you need to do is pass the [start_row_variable] and the [end_row_variable] into the SQL query.
NOTE: The order clause is specified OUTSIDE the core query [sql_order_clause]
w1 and w2 are the temporary wrapper tables (derived tables) created by SQL Server.
SELECT
w1.*
FROM(
SELECT w2.*,
ROW_NUMBER() OVER ([sql_order_clause]) AS ROW
FROM (
<!--- CORE QUERY START --->
SELECT [columns]
FROM [table_name]
WHERE [sql_string]
<!--- CORE QUERY END --->
) AS w2
) AS w1
WHERE ROW BETWEEN [start_row_variable] AND [end_row_variable]
This method has hugely optimized my database systems. It works very well.
IMPORTANT: Be sure to always explicitly specify only the exact columns you wish to retrieve in the core query as fetching unnecessary data in these CORE queries can cost you serious overhead

Use TOP to select only a limited amount of rows, like:
SELECT TOP 10 * FROM my_table WHERE ID >= 300010
Add an ORDER BY if you want the results in a particular order.
To be efficient there has to be an index on the ID column.

Slow distinct query in SQL Server over large dataset

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed after it has been in production for a couple months that one of the tables has started to take an obscene amount of time to query.
The table has 3 columns:
id -- autonumber (clustered)
typeUUID -- GUID generated before the insert happens; used to group the types together
typeName -- The type name (duh...)
One of the queries we run is a distinct on the typeName field:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
The typeName field has a non-clustered, non-unique ascending index on it. The table contains approximately 200M records at the moment. When we ran this query, it took 5m 58s to return! Perhaps we're not understanding how the indexes work... But I didn't think we misunderstood them that much.
To test this a little further, we ran the following query:
SELECT DISTINCT [typeName] FROM (SELECT TOP 1000000 [typeName] FROM [types] WITH (nolock)) AS [subtbl]
This query returns in about 10 seconds, as I would expect, since it's scanning the table.
Is there something we're missing here? Why does the first query take so long?
Edit: Ah, my apologies, the first query returns 76 records, thank you ninesided.
Follow up: Thank you all for your answers, it makes more sense to me now (I don't know why it didn't before...). Without an index, it's doing a table scan across 200M rows, with an index, it's doing an index scan across 200M rows...
SQL Server does prefer the index, and it does give a little bit of a performance boost, but nothing to be excited about. Rebuilding the index did take the query time down to just over 3m instead of 6m, an improvement, but not enough. I'm just going to recommend to my boss that we normalize the table structure.
Once again, thank you all for your help!!

You do misunderstand the index. Even if it did use the index it would still do an index scan across 200M entries. This is going to take a long time, plus the time it takes to do the DISTINCT (causes a sort) and it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double check the query. In this case, perhaps you have a normalization issue?

There is an issue with the SQL Server optimizer when using the DISTINCT keyword. The solution was to force it to keep the same query plan by breaking out the distinct query separately.
So we took queries such as:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
and broke them up into the following:
SELECT typeName INTO #tempTable1 FROM types WITH (NOLOCK)
SELECT DISTINCT typeName FROM #tempTable1
Another way to get around it is to use a GROUP BY, which gets a different optimization plan.

I doubt SQL Server will even try to use the index; it would have to do practically the same amount of work (given the narrow table), reading all 200M rows regardless of whether it looks at the table or the index. If the index on typeName were clustered, it might reduce the time taken, as it shouldn't need to sort before grouping.
If the cardinality of your types is low, how about maintaining a summary table which holds the list of distinct type values? A trigger on insert/update of the main table would do a check on the summary table and insert a new record when a new type is found.
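A rough sketch of the summary table idea, using Python's built-in sqlite3 module so it can be run anywhere. The trigger syntax is SQLite's (INSERT OR IGNORE, NEW.column), not T-SQL, it only covers inserts and not updates, and the table name type_summary, the trigger name, and the sample type names are all assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE types (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        type_uuid TEXT,
        type_name TEXT
    );
    -- Summary table: one row per distinct type name.
    CREATE TABLE type_summary (type_name TEXT PRIMARY KEY);
    -- On each insert, add the type name only if it is not already listed.
    CREATE TRIGGER trg_types_insert AFTER INSERT ON types
    BEGIN
        INSERT OR IGNORE INTO type_summary (type_name) VALUES (NEW.type_name);
    END;
    """
)

# Hypothetical incoming rows with repeated type names.
incoming = ["login", "logout", "login", "purchase", "login", "logout"]
conn.executemany(
    "INSERT INTO types (type_uuid, type_name) VALUES (?, ?)",
    [("uuid-%d" % i, name) for i, name in enumerate(incoming)],
)

# The small summary table answers the DISTINCT question without scanning types.
summary = [r[0] for r in conn.execute("SELECT type_name FROM type_summary ORDER BY type_name")]
distinct = [r[0] for r in conn.execute("SELECT DISTINCT type_name FROM types ORDER BY type_name")]
print(summary)
```

The cost is one extra primary key lookup per insert, which is worth weighing against your write rate.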

As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.
So it's really a matter of limiting the number of rows that need to be scanned.
The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) distinct typenames??
If so - you could have a separate table DISTINCT_TYPENAMES or something and fill those initially by doing a full table scan, and then on inserting new rows to the main table, just always check whether their typename is already in DISTINCT_TYPENAMES, and if not, add it.
That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.
Marc

A looping approach should use multiple seeks (but loses some parallelism). It might be worth a try for cases with relatively few distinct values compared to the total number of rows (low cardinality).
Idea was from this question:
-- Empty temp table with the same column type as Types.typeName
select typeName into #Result from Types where 1=0;
declare @t varchar(100) = (select min(typeName) from Types);
while @t is not null
begin
-- Record the current value, then seek to the next larger one
insert into #Result values (@t);
set @t = (select top 1 typeName from Types where typeName > @t order by typeName);
end
select * from #Result;
And it looks like there are also some other methods (notably the recursive CTE by @Paul White):
different-ways-to-find-distinct-values-faster-methods
sqlservercentral Topic873124-338-5
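For anyone who wants to try the seek loop without a SQL Server instance, here is a minimal sketch of the same idea with Python's built-in sqlite3 module. It drives the loop from the client and uses MIN(...) with a "greater than" filter instead of TOP 1 ... ORDER BY. The function name distinct_by_seeks, the index name, and the sample data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE types (id INTEGER PRIMARY KEY, type_name TEXT)")
conn.execute("CREATE INDEX ix_types_type_name ON types (type_name)")
# Hypothetical data: 600 rows but only 3 distinct values.
conn.executemany(
    "INSERT INTO types (type_name) VALUES (?)",
    [(name,) for name in ["b", "a", "c", "a", "b", "a"] * 100],
)


def distinct_by_seeks(connection):
    """Collect distinct type names with one indexed MIN lookup per value."""
    found = []
    current = connection.execute("SELECT MIN(type_name) FROM types").fetchone()[0]
    while current is not None:
        found.append(current)
        # Jump to the next value strictly greater than the current one.
        current = connection.execute(
            "SELECT MIN(type_name) FROM types WHERE type_name > ?", (current,)
        ).fetchone()[0]
    return found


result = distinct_by_seeks(conn)
print(result)
```

The number of lookups equals the number of distinct values plus one, so this only pays off when that number is small compared to the row count.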

My first thought is statistics. To find last updated:
SELECT
name AS index_name,
STATS_DATE(object_id, index_id) AS statistics_update_date
FROM
sys.indexes
WHERE
object_id = OBJECT_ID('MyTable');
Edit: Statistics are updated when indexes are rebuilt, and I see the indexes are not being maintained.
My second thought: is the index still there? The TOP query should still use an index.
I've just tested on one of my tables with 57 million rows and both use the index.

An indexed view can make this faster.
create view alltypes
with schemabinding as
select typename, count_big(*) as kount
from dbo.types
group by typename
go
create unique clustered index idx
on alltypes (typename)
The work to keep the view up to date on each change to the base table should be moderate (depending on your application, of course -- my point is that it doesn't have to scan the whole table each time or do anything insanely expensive like that.)
Alternatively you could make a small table holding all values:
select distinct typename
into alltypes
from types
alter table alltypes
add primary key (typename)
alter table types add foreign key (typename) references alltypes
The foreign key will make sure that all values used appear in the parent alltypes table. The trouble is in ensuring that alltypes does not contain values not used in the child types table.

I would try something like this:
SELECT typeName FROM [types] WITH (nolock)
group by typeName;
And like the others, I would say you need to normalize that column.

An index helps you quickly find a row. But you're asking the database to list all unique types for the entire table. An index can't help with that.
You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID included in the nightly scan, and combine the results:
select type
from nightlyscan
union
select distinct type
from verybigtable
where rowid > lastscannedid
Another option is to normalize the big table into two tables:
table1: id, guid, typeid
type table: typeid, typename
This would be very beneficial if the number of types was relatively small.

I could be missing something, but would it be more efficient, at the cost of some overhead on load, to create a view with the distinct values and query that instead?
This would give almost instant responses to the select if the result set is significantly smaller. The price is the overhead of populating it on each write, though given the nature of the view that might be trivial in itself.
It does raise the question of how many writes there are compared to how often you want the distinct values, and how important speed is when you do.

What is an efficient method of paging through very large result sets in SQL Server 2005?

EDIT: I'm still waiting for more answers. Thanks!
In the SQL 2000 days, I used the temp table method, where you create a temp table with a new identity column and primary key, then select where the identity column is between A and B.
When SQL 2005 came along I found out about Row_Number() and I've been using it ever since...
But now I have found a serious performance issue with Row_Number().
It performs very well when you are working with not-so-gigantic result sets and sorting over an identity column. However, it performs very poorly when you are working with large result sets (over 10,000 records) and sorting over a non-identity column. Row_Number() performs poorly even when sorting by an identity column if the result set is over 250,000 records. For me, it came to the point where it throws a "command timeout!" error.
What do you use to paginate a large result set on SQL 2005?
Is the temp table method still better in this case? I'm not sure whether the temp table method with SET ROWCOUNT will perform better... And some say it can give wrong row numbers if you have a multi-column primary key.
In my case, I need to be able to sort the result set by a date type column... for my production web app.
Let me know what you use for high-performing pagination in SQL 2005. I'd also like to know a smart way of creating indexes. I suspect that choosing the right primary keys and/or indexes (clustered/non-clustered) will play a big role here.
Thanks in advance.
P.S. Does anyone know what stackoverflow uses?
EDIT: Mine looks something like...
SELECT postID, postTitle, postDate
FROM
(SELECT postID, postTitle, postDate,
ROW_NUMBER() OVER(ORDER BY postDate DESC, postID DESC) as RowNum
FROM MyTable
) as DerivedMyTable
WHERE RowNum BETWEEN @startRowIndex AND (@startRowIndex + @maximumRows) - 1
postID: Int, Identity (auto-increment), Primary key
postDate: DateTime
EDIT: Is everyone using Row_Number()?

The row_number() technique should be quick. I have seen good results for 100,000 rows.
Are you using row_number() similar to the following:
SELECT column_list
FROM
(SELECT column_list,
ROW_NUMBER() OVER(ORDER BY OrderByColumnName) as RowNum
FROM MyTable m
) as DerivedTableName
WHERE RowNum BETWEEN @startRowIndex AND (@startRowIndex + @maximumRows) - 1
...and do you have a covering index for the column_list and/or an index on the 'OrderByColumnName' column?

Well, for your sample query ROW_NUMBER should be pretty fast with thousands of rows, provided you have an index on your PostDate field. If you don't, the server needs to perform a complete clustered index scan on your PK: practically load every page, fetch your PostDate field, sort by it, determine the rows to extract for the result set, and then fetch those rows again. It's like creating a temp index over and over again (you might see a table/index spool in the plan).
No wonder you get timeouts.
My suggestion: create an index on PostDate DESC, since this is what ROW_NUMBER will go over (ORDER BY PostDate DESC, ...).
As for the article you are referring to: I've done quite a lot of paging with SQL Server 2000 in the past without ROW_NUMBER, and the approach used in the article is the most efficient one. It does not work in all circumstances (you need unique or almost unique values). An overview of some other methods is here.