SQL-Server-2005: Why are results being returned in a different order with(nolock) - sql

I have a clustered primary key index on col1.
Why are the results returned in a different order when I run the following statements?
select * from table
vs
select * from table with(nolock)
The results are also different with tablock.
schema:
col1 int not null
col2 varchar (8000)

Without any ORDER BY no order of results is guaranteed.
Your question is now heavily truncated but the original version mentioned that you saw different order of result when using nolock as well as tablock.
Both of these locking options allow SQL Server to use an allocation order scan rather than reading along the clustered index data pages in logical order (following pointers along the linked list).
That should not be taken to mean that, without these hints, the results are guaranteed to come back in clustered index order: the advanced scanning mechanism or parallelism, for example, could both change this.

The order of rows is never guaranteed unless you use an ORDER BY.
If you need the rows in a specific order, there is no other solution that will return them in a predictable order.
If you leave out the ORDER BY, the DBMS is free to return the rows in whatever order it thinks is most efficient.

SQL Server makes no guarantee about the ordering; it will change based on how SQL Server optimises the query.
To guarantee the order you must use an order by clause.

If you are not specifying an order, it's completely nondeterministic. Today the results may be in a different order, tomorrow maybe not.
Supplying a hint may inadvertently guide the query optimizer down a more efficient path.

Related

Does a clustered index on a column GUARANTEE that rows are returned sorted by that column? [duplicate]

I am unable to get a clear-cut answer to this contentious question.
MSDN documentation mentions
Clustered
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the
index definition. There can be only one clustered index per table,
because the data rows themselves can be sorted in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table
has no clustered index, its data rows are stored in an unordered
structure called a heap.
However, most of the answers I see, such as
Does a SELECT query always return rows in the same order? Table with clustered index
http://sqlwithmanoj.com/2013/06/02/clustered-index-do-not-guarantee-physically-ordering-or-sorting-of-rows/
answer in the negative.
Which is it?
Just to be clear. Presumably, you are talking about a simple query such as:
select *
from table t;
First, if all the data on the table fits on a single page and there are no other indexes on the table, it is hard for me to imagine a scenario where the result set is not ordered by the primary key. However, this is because I think the most reasonable query plan would require a full-table scan, not because of any requirement -- documented or otherwise -- in SQL or SQL Server. Without an explicit order by, the ordering in the result set is a consequence of the query plan.
That gets to the heart of the issue. When you are talking about the ordering of the result set, you are really talking about the query plan. And the assumption of ordering by the primary key really means that you are assuming that the query uses a full-table scan. What is ironic is that people make the assumption without actually understanding the "why". Furthermore, people have a tendency to generalize from small examples (okay, this is part of the basis of human intelligence). Unfortunately, they see that result sets from simple queries on small tables are consistently in primary key order, and they generalize to larger tables. The induction step is incorrect in this example.
What can change this? Off-hand, I think that a full table scan would return the data in primary key order if the following conditions are met:
Single threaded server.
Single file filegroup
No competing indexes
No table partitions
I'm not saying this is always true. It just seems reasonable that under these circumstances such a query would use a full table scan starting at the beginning of the table.
Even on a small table, you can get surprises. Consider:
select NonPrimaryKeyColumn
from table
The query plan would probably decide to use an index on table(NonPrimaryKeyColumn) rather than doing a full table scan. The results would not be ordered by the primary key (unless by accident). I show this example because indexes can be used for a variety of purposes, not just order by or where filtering.
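The effect described above can be tried out on a small scale. The following is only a sketch: it uses SQLite through Python's sqlite3 module rather than SQL Server, and the table demo, its columns, and the index name are made up for illustration. Whether the first query actually reads the secondary index depends on the engine and version, so only the query with ORDER BY has an order you can rely on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, other_col INTEGER NOT NULL)")
# other_col runs in the opposite direction to the primary key.
conn.executemany(
    "INSERT INTO demo (id, other_col) VALUES (?, ?)",
    [(i, 1000 - i) for i in range(1, 11)],
)
conn.execute("CREATE INDEX ix_demo_other ON demo (other_col)")

# No ORDER BY: the engine may read the table or the index, so the row
# order is whatever the chosen plan happens to produce.
unordered = [row[0] for row in conn.execute("SELECT other_col FROM demo")]

# Explicit ORDER BY: primary key order is guaranteed.
ordered = [row[0] for row in conn.execute("SELECT other_col FROM demo ORDER BY id")]

print("without ORDER BY:", unordered)
print("with ORDER BY id:", ordered)
```

Both queries return the same set of values; only the second one promises anything about their sequence.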
If you use a multi-threaded instance of the database and you have reasonably sized tables, you will quickly learn that results without an order by have no explicit ordering.
And finally, SQL Server has a pretty smart optimizer. I think there is some reluctance to use order by in a query because users think it will automatically do a sort. SQL Server works hard to find the best execution plan for the query. If it recognizes that the order by is redundant because of the rest of the plan, then the order by will not result in a sort.
And, of course, if you want to guarantee the ordering of results, you need order by in the outermost query. Even a query like this:
select *
from (select top 100 t.* from t order by col1) t
Does not guarantee that the results are ordered in the final result set. You really need to do:
select *
from (select top 100 t.* from t order by col1) t
order by col1;
to guarantee the results in a particular order. This behavior is documented here.
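As a rough illustration of the same pattern, here is a sketch using SQLite through Python's sqlite3 module (an assumption made for runnability; the answer above is about SQL Server). SQLite has no TOP, so LIMIT stands in for it, and the table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER PRIMARY KEY, col2 TEXT)")
# Insert in descending key order so insertion order differs from key order.
conn.executemany(
    "INSERT INTO t (col1, col2) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(500, 0, -1)],
)

# The inner ORDER BY only decides WHICH 100 rows are kept.
# The outer ORDER BY is what fixes the order of the final result set.
query = """
    SELECT col1
    FROM (SELECT * FROM t ORDER BY col1 LIMIT 100) AS sub
    ORDER BY col1
"""
top_100 = [row[0] for row in conn.execute(query)]
print(top_100[:5], "...", top_100[-1])
```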
Without ORDER BY, there is no default sort order, even if you have a clustered index.
There is a good example at this link:
CREATE SCHEMA Data AUTHORIZATION dbo
GO
CREATE TABLE Data.Numbers(Number INT NOT NULL PRIMARY KEY)
GO
-- Fill Data.Numbers with the integers 1..99999
DECLARE @ID INT;
SET NOCOUNT ON;
SET @ID = 1;
WHILE @ID < 100000 BEGIN
    INSERT INTO Data.Numbers(Number)
    SELECT @ID;
    SET @ID = @ID+1;
END
CREATE TABLE Data.WideTable(ID INT NOT NULL
    CONSTRAINT PK_WideTable PRIMARY KEY,
    RandomInt INT NOT NULL,
    CHARFiller CHAR(1000))
GO
-- RAND() cannot be called directly inside a function, so wrap it in a view
CREATE VIEW dbo.WrappedRand
AS
SELECT RAND() AS random_value
GO
CREATE FUNCTION dbo.RandomInt()
RETURNS INT
AS
BEGIN
    DECLARE @ret INT;
    SET @ret = (SELECT random_value*1000000 FROM dbo.WrappedRand);
    RETURN @ret;
END
GO
INSERT INTO Data.WideTable(ID,RandomInt,CHARFiller)
SELECT Number, dbo.RandomInt(), 'asdf'
FROM Data.Numbers
GO
CREATE INDEX WideTable_RandomInt ON Data.WideTable(RandomInt)
GO
-- No ORDER BY: the optimizer is free to read the narrower non-clustered index
SELECT TOP 100 ID FROM Data.WideTable
OUTPUT:
1407
253
9175
6568
4506
1623
581
As you have seen, the optimizer has chosen to use a non-clustered
index to satisfy this SELECT TOP query.
Clearly you cannot assume that your results are ordered unless you
explicitly use ORDER BY clause.
One must specify ORDER BY in the outermost query in order to guarantee rows are returned in a particular order. The SQL Server optimizer will optimize the query and data access to improve performance which may result in rows being returned in a different order. Examples of this are allocation order scans and parallelism. A relational table should always be viewed as an unordered set of rows.
I wish the MSDN documentation were clearer about this "sorting". It is more correct to say that SQL Server b-tree indexes provide ordering by 1) storing adjacent keys in the same page and 2) linking index pages in key order.

Unique sort order for postgres pagination

While trying to implement server-side pagination in Postgres, I learned that when using the LIMIT and OFFSET keywords you have to provide an ORDER BY clause on a unique column, typically the primary key.
In my case I am using generated UUIDs for primary keys, so I can't rely on the keys increasing sequentially. ORDER BY pkey DESC might not always put newer rows on top.
So I resorted to using the created-date column, a timestamp column which should be unique.
But what if the UI client wants to sort by some other column? Since that column might not be unique, I use ORDER BY user_column, created_dt DESC to keep the results predictable for pagination.
Is this the right approach? I am not sure I am going the right way. Please advise.
I talked about this exact problem in an old blog post (in the context of using an ORM):
One last note about using sorting and paging in conjunction. A query
that implements paging can have odd results if the ORDER BY clause
does not include a field that represents an empirical sequence in the
data; sort order is not guaranteed beyond what is explicitly specified
in the ORDER BY clause in most (maybe all) database engines. An
example: if you have 100 orders that all occurred on the exact same
date, and you ask for the first page of this data sorted by this date,
then ask for the second page of data sorted the same way, it is
entirely possible that you will get some of the data duplicated across
both pages. So depending on the query and the distribution of data
that is “sortable,” it can be a good practice to always include a
unique field (like a primary key) as the final field in a sort clause
if you are implementing paging.
http://psandler.wordpress.com/2009/11/20/dynamic-search-objects-part-5sorting/
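The scenario in the quote can be sketched as follows. This is a minimal illustration using SQLite through Python's sqlite3 module, not the asker's Postgres setup; the orders table, its columns, and the fetch_page helper are hypothetical names. The point is that appending a unique column to the ORDER BY makes the sort a total order, so LIMIT/OFFSET pages cannot overlap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT NOT NULL)")
# 100 orders that all share the same date, as in the quoted example.
conn.executemany(
    "INSERT INTO orders (id, order_date) VALUES (?, ?)",
    [(i, "2009-11-20") for i in range(1, 101)],
)

def fetch_page(page, page_size=25):
    """Return one page of order ids, sorted by date with id as the tiebreaker."""
    rows = conn.execute(
        "SELECT id FROM orders ORDER BY order_date DESC, id DESC LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    )
    return [row[0] for row in rows]

pages = [fetch_page(p) for p in range(4)]
all_ids = [i for page in pages for i in page]
# Every order appears exactly once across the four pages.
print(len(all_ids), len(set(all_ids)))
```

Sorting by order_date alone would leave the order within the date unspecified, which is exactly the situation in which rows can repeat or go missing between pages.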
The strategy of using a column that uniquely identifies a record, such as the primary key or an insertion date, may not be possible in some cases.
I have an application where users set up their own grid queries, so they can include any column from multiple tables, and perhaps none of them is a unique identifier.
In such a case a row number can be useful: select row_number() and use the same sort in its OVER clause. It would be something like:
select col1, col2, col3, row_number() over(order by col3) from tableX order by col3
It's important that over(order by *) matches order by *. That way your paging will have consistent results every time.

Getting RID Lookup instead of Table Scan?

SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an In clause on an Id, and then also select other columns, the In is evaluated first, and then the Details column and other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SQL Fiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (the IN clause eliminates 99% of the rows).
Otherwise the query plan shown in SQL Fiddle is identical; the only difference is that for the second query the RID Lookup in SQL Fiddle is a Table Scan in production :(
What could cause this behavior? What kinds of things would you look at to help determine why it is using a table scan here?
How can I influence it to use a RID Lookup there?
From looking at operation costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Details column, then the performance of both queries is very close in production. It is only after adding other columns like Details that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used a RID Lookup, I was surprised and slightly confused...
The table doesn't have a clustered index because, in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details, though, and I can experiment with that more, but I would like to have an understanding of what is going on now before I start shooting in the dark with random indexes.
What if you would change your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID lookup nor a table scan would be needed, since your query could be satisfied from just the index itself....
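The covering-index idea can be checked on a toy table. The sketch below uses SQLite through Python's sqlite3 module, which is an assumption made purely so the example runs anywhere; SQLite has no INCLUDE clause, so Details is appended as a trailing key column instead, and the index name ix1 and the sample data are invented. SQL Server's plan output will look different, but the principle is the same: when every referenced column is in the index, the base table does not need to be visited.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key, to mirror the heap table described in the question.
conn.execute("CREATE TABLE SupportContacts (Id INTEGER, ForeignId INTEGER, Details TEXT)")
conn.executemany(
    "INSERT INTO SupportContacts VALUES (?, ?, ?)",
    [(i, i % 50, f"details {i}") for i in range(1, 1001)],
)
# Stand-in for INCLUDE (Details): add Details as the last index column.
conn.execute("CREATE INDEX ix1 ON SupportContacts (ForeignId, Id DESC, Details)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT Id, ForeignId, Details FROM SupportContacts "
    "WHERE ForeignId IN (1, 2, 3, 5)"
).fetchall()
# The last column of each plan row is the human-readable description.
plan_text = " | ".join(row[-1] for row in plan)
print(plan_text)
```

If the plan text mentions a covering index, the query was answered from the index alone, which is the SQLite analogue of avoiding both the RID Lookup and the Table Scan.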
The differences in the query plans will be dependent on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the 'Details' column is included. This is an almost sure sign that either the 'Details' column is not part of an index, or if it is part of an index, the data in that column is mostly unique such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over the index access, as it can take advantage of things like block reads to access the table records faster than perhaps a fragmented read of an index.
To influence the path that will be chosen by the optimiser, you would need to look at possible indexes that could be added or modified to make an index access more efficient, but this should be done with care, as it can adversely affect other queries and may degrade insert performance.
The other important thing you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency appropriate to the rate at which the frequency distribution of the table data changes.
If it's true that 99% of the rows would be omitted if it performed the query using the relevant index + RID then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignID in (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics, you could use a table hint - if you know the index that your plan should be using which contains the ID and ForeignID columns then stick that in your query as a hint and force SQL optimiser to use the index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);

Using limit in sqlite SQL statement in combination with order by clause

Will the following two SQL statements always produce the same result set?
1. SELECT * FROM MyTable where Status='0' order by StartTime asc limit 10
2. SELECT * FROM (SELECT * FROM MyTable where Status='0' order by StartTime asc) limit 10
Yes, but ordering subqueries is probably a bad habit to get into. You could feasibly add a further ORDER BY outside the subquery in your second example, e.g.
SELECT *
FROM (SELECT *
FROM Test
ORDER BY ID ASC
) AS A
ORDER BY ID DESC
LIMIT 10;
SQLite still performs the ORDER BY on the inner query, before sorting them again in the outer query. A needless waste of resources.
I've done an SQL Fiddle to demonstrate so you can view the execution plans for each.
No. First because the StartTime column may not have UNIQUE constraint. So, even the first query may not always produce the same result - with itself!
Second, even if there are never two rows with same StartTime, the answer is still negative.
The first statement will always order on StartTime and produce the first 10 rows. The second query may produce the same result set but only with a primitive optimizer that doesn't understand that the ORDER BY in the subquery is redundant. And only if the execution plan includes this ordering phase.
The SQLite query optimizer may (at the moment) not be very bright and do just that (no idea really, we'll have to check the source code of SQLite*). So, it may appear that the two queries are producing identical results all the time. Still, it's not a good idea to count on it. You never know what changes will be made in a future version of SQLite.
I think it's not good practice to use LIMIT without ORDER BY, in any DBMS. It may work now, but you never know how long these queries will be used by the application. And you may not be around when SQLite is upgraded or the DBMS is changed.
(*) @Gareth's link provides the execution plan which suggests that current SQLite code is dumb enough to execute the redundant ordering.
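Putting the two objections together, a form that does not depend on optimizer behaviour keeps ORDER BY and LIMIT in the same outermost SELECT and adds a unique tiebreaker. The sketch below runs it through Python's sqlite3 module; the Id column and the sample data are assumptions, since the question does not show the table's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (Id INTEGER PRIMARY KEY, Status TEXT, StartTime INTEGER)"
)
# Several rows share a StartTime, so StartTime alone is not a total order.
conn.executemany(
    "INSERT INTO MyTable (Id, Status, StartTime) VALUES (?, '0', ?)",
    [(i, (i * 7) % 10) for i in range(1, 31)],
)

# ORDER BY and LIMIT sit in the same outermost SELECT; Id breaks ties.
query = """
    SELECT Id, StartTime
    FROM MyTable
    WHERE Status = '0'
    ORDER BY StartTime ASC, Id ASC
    LIMIT 10
"""
first_run = conn.execute(query).fetchall()
second_run = conn.execute(query).fetchall()
print(first_run)
```

Because the sort key is unique, repeated executions must return the same ten rows in the same order, whatever plan the engine picks.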

Does 'Select' always order by primary key?

A basic, simple question for all of you DBAs.
When I do a select, is it always guaranteed that my result will be ordered by the primary key, or should I specify it with an 'order by'?
I'm using Oracle as my DB.
No, if you do not use "order by" you are not guaranteed any ordering whatsoever. In fact, you are not guaranteed that the ordering from one query to the next will be the same. Remember that SQL is dealing with data in a set based fashion. Now, one database implementation or another may happen to provide orderings in a certain way but you should never rely on that.
When I do a select, is it always guaranteed that my result will be ordered by the primary key, or should I specify it with an 'order by'?
No, it's by far not guaranteed.
SELECT *
FROM table
will most probably use a full table scan, which does not use the primary key at all.
You can use a hint:
SELECT /*+ INDEX(table pk_index_name) */
*
FROM table
but even in this case the ordering is not guaranteed: if you use Enterprise Edition, the query may be parallelized.
This is a problem, since ORDER BY cannot be used in a SELECT clause subquery and you cannot write something like this:
SELECT (
SELECT column
FROM table
WHERE rownum = 1
ORDER BY
other_column
)
FROM other_table
No, ordering is never guaranteed unless you use an ORDER BY.
The order that rows are fetched is dependent on the access method (e.g. full table scan, index scan), the physical attributes of the table, the logical location of each row within the table, and other factors. These can all change even if you don't change your query, so in order to guarantee a consistent ordering in your result set, ORDER BY is necessary.
It depends on your DB, and it also depends on the indexed fields.
For example, in my table Users every user has a unique varchar(20) field, login, and a primary key, id.
And "Select * from users" returns a rowset ordered by login.
If you desire specific ordering then declare it specifically using ORDER BY.
What if the table doesn't have primary key?
If you want your results in a specific order, always specify an ORDER BY.