Different behaviour MSSQL2008 versus 2016 when inserting data - sql

I came across an old script that in essence does the following:
CREATE TABLE #T (ColA VARCHAR (20), ID INT)
INSERT INTO #T VALUES ('BBBBBBBB', 1), ('AAAAAAA', 4), ('RRRRRR', 3)
CREATE TABLE #S (ColA VARCHAR (100), ID INT)
INSERT INTO #S
SELECT * FROM #T
ORDER BY ID -- odd to do an order by in an insert statement, but that's the code as it is...
SELECT * FROM #S
DROP TABLE #T, #S
First, I want to mention that I am aware of the fact that tables such as the ones I created here do not have an actual order, we just order the resultset if we want.
However, if you run the script above on a SQL version 2008, you will get the results ordered in the order that was specified in the insert statement. On a 2016 machine, this is not the case. There it returns the rows in the order they were created in the first place. Does anyone know what changes cause this different behaviour?
Thanks a lot!

As to your example - nothing is changed. The relation in the relation theory is represented in the SQL with a table. And the relation is not ordered. So, you are not allowed to defined how rows are ordered when they are materialized - and you should not care about this.
If you want to SELECT the data in a ordered way each time, you must specified unique order by criteria.
Also, in your example - you can SELECT the data one billion times and the data can be returned as "you inserted" it each time, but on the very next time you can get different results. The engine returns the data in the "best" way according to it when there is no order specified, but this can change anytime.

As you know - unless order by is specified, the database engine returns the rows in an arbitrary order - How this order is generated has to do with the internal parts of the database engine - the algorithm may change between versions, even between service packs, without any need for documentation since it's known to be arbitrary.
Please note that arbitrary is not the same as random - meaning you should not expect to get different row order each time you run the query - in fact, you will probably get the same row order every time until something changes - that might be a restart to the server, a rebuild of an index, another row added to the table, an index created or removed - I can't say because it's not documented anywhere.
Moreover, unless you have an Identity column in your table, the optimizer will simply ignore the order by clause in the insert...select statement, exactly because what you already wrote in your question - Database tables have no intrinsic order.

Order the result set of a query by the specified column list and,
optionally, limit the rows returned to a specified range. The order
in which rows are returned in a result set are not guaranteed unless
an ORDER BY clause is specified.
MSSQL Docs

Related

Can SQL return different results for two runs of the same query using ORDER BY?

I have the following table:
CREATE TABLE dbo.TestSort
(
Id int NOT NULL IDENTITY (1, 1),
Value int NOT NULL
)
The Value column could (and is expected to) contain duplicates.
Let's also assume there are already 1000 rows in the table.
I am trying to prove a point about unstable sorting.
Given this query that returns a 'page' of 10 results from the first 1000 inserted results:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value
My intuition tells me that two runs of this query could return different rows if the Value column contains repeated values.
I'm basing this on the facts that:
the sort is not stable
if new rows are inserted in the table between the two runs of the query, it could possibly create a re-balancing of B-trees (the Value column may be indexed or not)
EDIT: For completeness: I assume rows never change once inserted, and are never deleted.
In contrast, a query with stable sort (ordering also by Id) should always return the same results, since IDs are unique:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value, Id
The question is: Is my intuition correct? If yes, can you provide an actual example of operations that would produce different results (at least "on your machine")? You could modify the query, add indexes on the Values column etc.
I don't care about the exact query, but about the principle.
I am using MS SQL Server (2014), but am equally satisfied with answers for any SQL database.
If not, then why?
Your intuition is correct. In SQL, the sort for order by is not stable. So, if you have ties, they can be returned in any order. And, the order can change from one run to another.
The documentation sort of explains this:
Using OFFSET and FETCH as a paging solution requires running the query
one time for each "page" of data returned to the client application.
For example, to return the results of a query in 10-row increments,
you must execute the query one time to return rows 1 to 10 and then
run the query again to return rows 11 to 20 and so on. Each query is
independent and not related to each other in any way. This means that,
unlike using a cursor in which the query is executed once and state is
maintained on the server, the client application is responsible for
tracking state. To achieve stable results between query requests using
OFFSET and FETCH, the following conditions must be met:
The underlying data that is used by the query must not change. That is, either the rows touched by the query are not updated or all
requests for pages from the query are executed in a single transaction
using either snapshot or serializable transaction isolation. For more
information about these transaction isolation levels, see SET
TRANSACTION ISOLATION LEVEL (Transact-SQL).
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
Although this specifically refers to offset/fetch, it clearly applies to running the query multiple times without those clauses.
If you have ties when ordering the order by is not stable.
LiveDemo
CREATE TABLE #TestSort
(
Id INT NOT NULL IDENTITY (1, 1) PRIMARY KEY,
Value INT NOT NULL
) ;
DECLARE #c INT = 0;
WHILE #c < 100000
BEGIN
INSERT INTO #TestSort(Value)
VALUES ('2');
SET #c += 1;
END
Example:
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
DBCC DROPCLEANBUFFERS; -- run to clear cache
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
The point is I force query optimizer to use parallel plan so there is no guaranteed that it will read data sequentially like Clustered index probably will do when no parallelism is involved.
You cannot be sure how Query Optimizer will read data unless you explicitly force to sort result in specific way using ORDER BY Id, Value.
For more info read No Seatbelt - Expecting Order without ORDER BY.
I think this post will answer your question:
Is SQL order by clause guaranteed to be stable ( by Standards)
The result is everytime the same when you are in a single-threaded environment. Since multi-threading is used, you can't guarantee.

Strange issue with the Order By --SQL

Few days ago I came across a strange problem with the Order By , While creating a new table I used
Select - Into - From and Order By (column name)
and when I open that table see tables are not arranged accordingly.
I re-verified it multiple times to make sure I am doing the right thing.
One more thing I would like to add is till the time I don't use INTO, I can see the desired result but as soon as I create new table, I see there is no Order for tht column. Please help me !
Thanks in advance.. Before posting the question I did research for 3 days but no solution yet
SELECT
[WorkOrderID], [ProductID], [OrderQty], [StockedQty]
INTO
[AdventureWorks2012].[Production].[WorkOrder_test]
FROM
[AdventureWorks2012].[Production].[WorkOrder]
ORDER BY
[StockedQty]
SQL 101 for beginners: SELECT statements have no defined order unless you define one.
When i open that table
That likely issues a SELECT (TOP 1000 IIFC) without order.
While creating a new table i used Select - Into - From and Order By (column name)
Which sort of is totally irrelevant - you basically waste performance ordering the input data.
You want an order in a select, MAKE ONE by adding an order by clause to the select. The table's internal order is by clustered index, but an query can return results in any order it wants. Fundamental SQL issue, as I said in the first sentence. Any good book on sql covers that in one of the first chapters. SQL uses a set approach, sets have no intrinsic order.
Firstly T-SQL is a set based language and sets don't have orders. More over it also doesn't mean serial execution of commands i.e, the above query is not executed in sequence written but the processing order for a SELECT statement is as:
1.FROM
2.ON
3.JOIN
4.WHERE
5.GROUP BY
6.WITH CUBE or WITH ROLLUP
7.HAVING
8.SELECT
9.DISTINCT
10.ORDER BY
Now when you execute your query without into selected column data gets ordered based on the condition specified in 'Order By' clause but when Into is used format of new_table is determined by evaluating the expressions in the select list.(Remember order by clause has not been evaluated yet).
The columns in new_table are created in the order specified by the select list but rows cannot be ordered. It's a limitation of Into clause you can refer here:
Specifying an ORDER BY clause does not guarantee the rows are inserted
in the specified order.

iSeries query changes selected RRN of subquery result rows

I'm trying to make an optimal SQL query for an iSeries database table that can contain millions of rows (perhaps up to 3 million per month). The only key I have for each row is its RRN (relative record number, which is the physical record number for the row).
My goal is to join the table with another small table to give me a textual description of one of the numeric columns. However, the number of rows involved can exceed 2 million, which typically causes the query to fail due to an out-of-memory condition. So I want to rewrite the query to avoid joining a large subset with any other table. So the idea is to select a single page (up to 30 rows) within a given month, and then join that subset to the second table.
However, I ran into a weird problem. I use the following query to retrieve the RRNs of the rows I want for the page:
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
This query works just fine, returning the correct RRNs for the rows I need. However, when I attempted to join the result of the subquery with another table, the RRNs changed. So I simplified the query to a subquery within a simple outer query, without any join:
select rrn(e) as RRN, e.*
from TABLE1 as e
where rrn(e) in (
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
)
order by e.UPDATED, e.ACCOUNT
The outer query simply grabs all of the columns of each row selected by the subquery, using the RRN as the row key. But this query does not work - it returns rows with completely different RRNs.
I need the actual RRN, because it will be used to retrieve more detailed information from the table in a subsequent query.
Any ideas about why the RRNs end up different?
Resolution
I decided to break the query into two calls, one to issue the simple subquery and return just the RRNs (rows-IDs), and the second to do the rest of the JOINs and so forth to retrieve the complete info for each row. (Since the table gets updated only once a day, and rows never get deleted, there are no potential timing problems to worry about.)
This approach appears to work quite well.
Addendum
As to the question of why an out-of-memory error occurs, this appears to be a limitation on only some of our test servers. Some can only handle up to around 2m rows, while others can handle much more than that. So I'm guessing that this is some sort of limit imposed by the admins on a server-by-server basis.
Trying to use RRN as a primary key is asking for trouble.
I find it hard to believe there isn't a key available.
Granted, there may be no explicit primary key defined in the table itself. But is there a unique key defined in the table?
It's possible there's no keys defined in the table itself ( a practice that is 20yrs out of date) but in that case there's usually a logical file with a unique key defined that is by the application as the de-facto primary key to the table.
Try looking for related objects via green screen (DSPDBR) or GUI (via "Show related"). Keyed logical files show in the GUI as views. So you'd need to look at the properties to determine if they are uniquely keyed DDS logicals instead of non-keyed SQL views.
A few times I've run into tables with no existing de-facto primary key. Usually, it was possible to figure out what could be defined as one from the existing columns.
When there truly is no PK, I simply add one. Usually a generated identity column. There's a technique you can use to easily add columns without having to recompile or test any heritage RPG/COBOL programs. (and note LVLCHK(*NO) is NOT it!)
The technique is laid out in Chapter 4 of the modernizing Redbook
http://www.redbooks.ibm.com/abstracts/sg246393.html
1) Move the data to a new PF (or SQL table)
2) create new LF using the name of the existing PF
3) repoint existing LF to new PF (or SQL table)
Done properly, the record format identifiers of the existing objects don't change and thus you don't have to recompile any RPG/COBOL programs.
I find it hard to believe that querying a table of mere 3 million rows, even when joined with something else, should cause an out-of-memory condition, so in my view you should address this issue first (or cause it to be addressed).
As for your question of why the RRNs end up different I'll take the liberty of quoting the manual:
If the argument identifies a view, common table expression, or nested table expression derived from more than one base table, the function returns the relative record number of the first table in the outer subselect of the view, common table expression, or nested table expression.
A construct of the type ...where something in (select somethingelse...) typically translates into a join, so there.
Unless you can specifically control it, e.g., via ALWCPYDTA(*NO) for STRSQL, SQL may make copies of result rows for any intermediate set of rows. The RRN() function always accesses physical record number, as contrasted with the ROW_NUMBER() function that returns a logical row number indicating the relative position in an ordered (or unordered) set of rows. If a copy is generated, there is no way to guarantee that RRN() will remain consistent.
Other considerations apply over time; but in this case it's as likely to be simple copying of intermediate result rows as anything.

SQL Server Implicit Order

i've got an issue due to database conception.
My data are grouped in a table which looks like :
IdGroup | IdValue
So for each group i've got the list of value.
Indeed, we should have had an order column or an id, but i can't.
Do you know anyway which can prove the order of the select value based on the insert order ?
I mean, if I inserted 1003,1001,1002 could i garantuee it to be retrieve in this order ?
IdGroup | IdValue
1 | 1003
1 | 1001
1 | 1002
Of course, using an order by doesn't seems to fit because i don't have any column usable.
Any idea ? Using a system proc or something like this.
Thanks a lot :)
Stop telling me to use an order by and altering the table, it doesn't fit and yes i know it's the good pratice to do... thanks :)
A couple of ideas:
DBCC PAGE (undocumented) can be used to look at the raw data pages of the table. It may be possible to determine insert order by looking at the low level information.
If you cannot alter the table, can you add a table to the database? If so, consider creating a table with an identity column and use a trigger on the original table to insert the records in the new table.
Also, you should include which version(s) of SQL Server are involved. Doing anything this unusual will very often be version specific.
You shouldn't rely on the data being returned in a particular order; use an ORDER BY clause to guarantee the order.
(Despite the fact that data appears to be returned in clustered index order, this might not always be the case).
Whilst some small scale tests will show that it returns it in what appears to be the right order, it just will not hold.
The golden rule remains - unless an order by clause is specified, there are no guarentees provided on the order of the returned data.
edit : If you place a non-clustered index on the idgroup column it is forced to add a hidden field, the uniqueifier since the values are the same - the problem it, you can't access it in an order by clause, but from a forensic perspective, you can determine the order it was inserted in.
As others have said, the only way to guarantee an ordering is with an ORDER BY clause. What isn't highlighted in their answers is that, the only place that this ORDER BY matters is in the SELECT statement. It doesn't* matter if you apply an ORDER BY clause during the INSERT statement; the system is free to return results from a select in whatever order it finds most efficient, unless an ORDER BY is specified at that time.
*There's a particular way to ensure what order IDENTITY values are assigned to a result set during an INSERT, using an ORDER BY, but I can't remember the exact details, and it still doesn't effect the order of SELECT.
Can you add the Created Date column? In this way you can get the records using Order by Clause Created Date. Moreover set it's default value Getdate()

Index of a record in a table

I need to locate the index position of a record in a large database table in order to preset a pager to that item's page. I need an efficient SQL query that can give me this number. Sadly SQL doesn't offer something like:
SELECT INDEX(*) FROM users WHERE userid='123'
Any bright ideas?
EDIT: Lets assume there is an ORDER BY clause appended to this. the point is I do not want to have to load all records to locate the position of a specific one. I am trying to open a pager to the page holding an existing item that had previously been chosen - because i want to provide information about that already chosen item within a context that allows a user to choose a different one.
You might use something like (pseudo-code):
counting query: $n = select count(uid) from {users} where ... (your paging condition including userid 123 as the limit)
$page = floor($n / $pager_size);
display query: select what,you,want from {users} where (your paging condition without the limit), passed to db_query_range(thequery, $page, $pager_size)
You should really look at pager_query, though, because that's what it's all about, and it basically works like this: a counting query and a display query, except it tries to build the counting query automatically.
Assuming you are really asking how to page records in SQL Server 2005 onwards, have a look at this code from David Hayden:
(you will need to change Date, Description to be your columns)
CREATE PROCEDURE dbo.ShowUsers
#PageIndex INT,
#PageSize INT
AS
BEGIN
WITH UserEntries AS (
SELECT ROW_NUMBER() OVER (ORDER BY Date DESC) AS Row, Date, Description
FROM users)
SELECT Date, Description
FROM UserEntries
WHERE Row BETWEEN (#PageIndex - 1) * #PageSize + 1 AND #PageIndex * #PageSize
END
SQL doesn't guarantee the order of objects in the table unless you use the OrderBy clause. In other words, the index of any particular row may change in subsequent queries. Can you describe what you are trying to accomplish with the pager?
You might be interested in something that simulates the rownum() of Oracle in MySQL... if you are using MySQL of course as it's not specified in the question.
Notes:
You'll have to look through all the records of your pages for that to work of course. You don't need to fetch them back to the PHP page from the database server but you'll have to include them in the query. There's no magic trick to determine the position of your row inside a result set other than querying the result set as it might change because of the where conditions, the orders and the groups. It needs to be in context.
Of course, if all your rows are sequential, with incremental ids, none are deleted, and you know the first and last ids; then you could use a count and with simple math get the position without querying everything.... but I doubt that's your case, it never is.