This is my very first post! Bear with me.
I have an UPDATE statement and I am trying to understand how SQL Server handles it.
UPDATE a
SET a.vField3 = b.vField3
FROM tableName a
INNER JOIN tableName b on a.vField1 = b.vField1
AND b.nField2 = a.nField2 - 1
This is my query in its simplest form.
vField1 is a Varchar
nField2 is an int (autonumber)
vField3 is a Varchar
I have left the WHERE clause out; understand that there is other logic that makes it a necessity.
Say vField1 is a Customer Number and that Customer has 3 records
The value in nField2 is 1, 2, and 3 consecutively.
vField3 is a Status
When the Update comes to a.nField2 = 1, there is no row at a.nField2 - 1, so it continues
When the Update comes to a.nField2 = 2, b.nField2 = 1
When the Update comes to a.nField2 = 3, b.nField2 = 2
So when the Update is on a.nField2 = 2, alias b reflects what is on the line prior (b.nField2 = 1)
And it SETs the Varchar Value of a.vField3 = b.vField3
When the Update is on a.nField2 = 3, alias b reflects what is on the line prior (b.nField2 = 2)
And it (should) SET the Varchar Value of a.vField3 = b.vField3
When the process is complete, the second of the three records looks as expected: the value in vField3 of the second record reflects the value in vField3 from the first record
However, vField3 of the Third record does not reflect the value in vField3 from the Second record.
I think this demonstrates that SQL Server may be producing a transaction of some sort and then an update.
Question: How can I get the DB to Update after each transaction so I can reference the values generated by each transaction?
Thank you.
davlyo
Firstly, and most importantly, you are mistaken in the belief that the statement will logically implement some sort of defined loop because your records are held 'consecutively'. In fact the order of the records in your database is undefined, and it makes no sense at all to think of your table storage in this ordered manner; you should try to rid yourself of that notion completely, or it will lead you into all sorts of traps and bad habits. Try instead to think of statements logically executing as set operations (in the mathematical sense), not cursor traverses.
It is certainly true that in most relational databases the order in which records will be retrieved by a SELECT without an ORDER BY clause is insertion order, but this is an implementation detail and should never be relied on in any logic (always use an ORDER BY clause if you care about the order of retrieved data). To emphasize: according to ANSI SQL, the order of retrieval of records from a database is undefined without an ORDER BY clause; technically it wouldn't even have to be consistent across sequential executions of the same SELECT statement.
It follows therefore that in order for an UPDATE operation on a relational database to yield consistent results any query must operate as a single transaction. The transaction grabs a snapshot of the records it will update, updates them in a consistent, atomic, manner, then applies the results back to the data. There simply is no logical conception of the SQL looping over records or whatever.
The whole update query is one operation, and one transaction if that's the only thing in the transaction. So the query does not see its own results. The query operates without any implied order, almost as if it all happens at once.
Also bear in mind that this is a self join, so what was originally in the second/third record will not be there after the query runs: one value is "lost" (the original third record's), and the first record's value is duplicated.
E.g. you start with:

Customer (vField1)  nField2  vField3
mdma                1        A
mdma                2        B
mdma                3        C

After running your update, the values will be:

Customer (vField1)  nField2  vField3
mdma                1        A
mdma                2        A
mdma                3        B
Is that what you are seeing/expecting?
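For the curious, here is a minimal T-SQL repro sketch of the above (table and column names taken from the question; not the poster's actual schema):

-- Repro: the self-join UPDATE sees only the original, pre-update values.
CREATE TABLE tableName (vField1 varchar(10), nField2 int, vField3 varchar(10));
INSERT INTO tableName VALUES ('mdma', 1, 'A'), ('mdma', 2, 'B'), ('mdma', 3, 'C');

UPDATE a
SET a.vField3 = b.vField3
FROM tableName a
INNER JOIN tableName b ON a.vField1 = b.vField1
                      AND b.nField2 = a.nField2 - 1;

SELECT * FROM tableName ORDER BY nField2;
-- Yields A, A, B: the third row receives the ORIGINAL value of the
-- second row, not the value the second row was just updated to.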
I came across this curious case by chance.
Environment:
Oracle 12.2.2
Two tables involved.
Number of rows: 16 million.
As far as I know, and as reported here (Oracle / PLSQL: EXISTS Condition), the use of WHERE EXISTS is in general less performant than other approaches.
In my case, however, when updating a table's columns with values from another table on a join condition, with the EXISTS the query runs in about 12-13 seconds without issues (I did only some checks, as I really do not know all the content of the table):
update fdm_auftrag ou
set (ou.e_hr, ou.e_budget) = ( select b.e_hr, b.e_budget
                               from fdm_budget_auftrag b
                               where b.fk_column1 = ou.fk_column1
                                 and b.fk_column2 = ou.fk_column2
                                 and b.fk_col3 = ou.fk_col3 )
where exists ( select b.e_hr, b.e_budget
               from fdm_budget_auftrag b
               where b.fk_column1 = ou.fk_column1
                 and b.fk_column2 = ou.fk_column2
                 and b.fk_col3 = ou.fk_col3 );
Without the EXISTS, instead, it takes so much time that I eventually interrupted it.
I am just guessing: since the condition in the EXISTS is evaluated as a boolean, if the engine finds at least one row it has to touch the database less, but I am not sure about it.
Is this guess correct? Does someone have a clearer explanation?
The where clause is limiting the number of rows being updated.
Fewer updated rows means that the update query runs faster. There is a lot of overhead to updating a row, including stashing away information for roll-back purposes.
I am assuming that you are updating relatively few rows in a much larger table. If the where clause is selecting most of the rows, then there might be no performance difference.
And, finally, the two queries are not identical. Without the WHERE, unmatched rows will have their values assigned NULL.
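To see the difference concretely, a small sketch (hypothetical tables, not the poster's schema):

-- Without the EXISTS guard, rows with no match are assigned NULL.
create table parent (id number primary key, val varchar2(10));
create table child  (id number primary key, val varchar2(10));
insert into parent values (1, 'old1');
insert into parent values (2, 'old2');
insert into child  values (1, 'new1');  -- no child row matches parent 2

-- Version A: no guard. parent 2 ends up with val = NULL,
-- because the correlated subquery returns no row for it.
update parent p
set p.val = ( select c.val from child c where c.id = p.id );

-- Version B: guarded. parent 2 is left untouched,
-- because the EXISTS predicate filters it out of the update.
update parent p
set p.val = ( select c.val from child c where c.id = p.id )
where exists ( select 1 from child c where c.id = p.id );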
What guarantees do the SQL specifications give for an UPDATE statement where some column is both updated and read?
From my experiment I can see that when a column is used on the right side of the "=" sign, SQL uses the old value, even though we update that very column in the same statement.
Consider the following T-SQL code.
create table test
(
a int primary key,
b int
)
insert into test
values (1,2)
update test
set a = b,
b = 3
select *
from test
update test
set b = 4,
a = b
select *
from test
The sample above yields:
(2, 3)
(3, 4)
Even though in the second update the "b" column seems to be assigned before "a". Is it guaranteed that if I refer to some column, I will get a result unaffected by this UPDATE, independent of the order of assignments in the SET clause?
SQL tries to be a set-based language. For an update, this means that database systems try to act "as-if" all updates are applied in parallel, both to all columns within a row, and to all rows within the set. (Indeed, failures to implement this correctly can lead to the Halloween Problem).
Since all operations are "occurring in parallel", no assignment can see the result of any other assignment operation, and so they're all based on the original values of any columns.
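The classic illustration of this "all at once" behaviour is swapping two columns in a single statement; a minimal sketch (hypothetical table):

-- Both assignments read the ORIGINAL row values, so a single UPDATE
-- swaps the columns without any temporary variable.
create table swap_demo (a int, b int);
insert into swap_demo values (1, 2);

update swap_demo
set a = b,  -- reads the original b (2)
    b = a;  -- reads the original a (1), not the value assigned above

select * from swap_demo;  -- (2, 1)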
The behavior that you observe is correct, it is standard, and it should even be intuitive.
The piece that you are missing is the ACID properties of databases and transactions. The update does not take effect until the statement completes. Hence, any values that reference the row in the table come from the "before" view of the row; the values that are set are in the "after" view of the row.
I have the following table:
CREATE TABLE dbo.TestSort
(
Id int NOT NULL IDENTITY (1, 1),
Value int NOT NULL
)
The Value column could (and is expected to) contain duplicates.
Let's also assume there are already 1000 rows in the table.
I am trying to prove a point about unstable sorting.
Given this query that returns a 'page' of 10 results from the first 1000 inserted results:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value
My intuition tells me that two runs of this query could return different rows if the Value column contains repeated values.
I'm basing this on the facts that:
the sort is not stable
if new rows are inserted in the table between the two runs of the query, it could possibly create a re-balancing of B-trees (the Value column may be indexed or not)
EDIT: For completeness: I assume rows never change once inserted, and are never deleted.
In contrast, a query with stable sort (ordering also by Id) should always return the same results, since IDs are unique:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value, Id
The question is: Is my intuition correct? If yes, can you provide an actual example of operations that would produce different results (at least "on your machine")? You could modify the query, add indexes on the Value column, etc.
I don't care about the exact query, but about the principle.
I am using MS SQL Server (2014), but am equally satisfied with answers for any SQL database.
If not, then why?
Your intuition is correct. In SQL, the sort for order by is not stable. So, if you have ties, they can be returned in any order. And, the order can change from one run to another.
The documentation sort of explains this:
Using OFFSET and FETCH as a paging solution requires running the query one time for each "page" of data returned to the client application. For example, to return the results of a query in 10-row increments, you must execute the query one time to return rows 1 to 10 and then run the query again to return rows 11 to 20 and so on. Each query is independent and not related to each other in any way. This means that, unlike using a cursor in which the query is executed once and state is maintained on the server, the client application is responsible for tracking state. To achieve stable results between query requests using OFFSET and FETCH, the following conditions must be met:

The underlying data that is used by the query must not change. That is, either the rows touched by the query are not updated or all requests for pages from the query are executed in a single transaction using either snapshot or serializable transaction isolation. For more information about these transaction isolation levels, see SET TRANSACTION ISOLATION LEVEL (Transact-SQL).

The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
Although this specifically refers to offset/fetch, it clearly applies to running the query multiple times without those clauses.
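A sketch of what those two conditions look like in practice, assuming snapshot isolation has been enabled on the database (ALLOW_SNAPSHOT_ISOLATION ON):

-- Stable paging per the quoted conditions: one transaction under
-- snapshot isolation, with the unique Id column breaking ties.
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;

SELECT * FROM TestSort WHERE Id <= 1000
ORDER BY Value, Id
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;   -- page 1

SELECT * FROM TestSort WHERE Id <= 1000
ORDER BY Value, Id
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;  -- page 2

COMMIT TRANSACTION;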
If you have ties when ordering, the ORDER BY is not stable.
CREATE TABLE #TestSort
(
  Id INT NOT NULL IDENTITY (1, 1) PRIMARY KEY,
  Value INT NOT NULL
);

DECLARE @c INT = 0;
WHILE @c < 100000
BEGIN
  INSERT INTO #TestSort(Value)
  VALUES (2);
  SET @c += 1;
END
Example:
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
DBCC DROPCLEANBUFFERS; -- run to clear cache
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
The point is that I force the query optimizer to use a parallel plan, so there is no guarantee that it will read the data sequentially, the way a clustered index scan probably would when no parallelism is involved.
You cannot be sure how the query optimizer will read the data unless you explicitly force a specific result order using ORDER BY Value, Id.
For more info read No Seatbelt - Expecting Order without ORDER BY.
I think this post will answer your question:
Is SQL order by clause guaranteed to be stable ( by Standards)
The result is the same every time when you are in a single-threaded environment. Once multi-threading is used, you can't guarantee it.
Imagine I have two tables, t1 and t2. t1 has two fields, one containing unique values called a and another field called value. Table t2 has a field that does not contain unique values called b and a field also called value.
Now, if I use the following update query (this is using MS Access btw):
UPDATE t1
INNER JOIN t2 ON t1.a=t2.b
SET t1.value=t2.value
If I have the following data
t1                     t2
a   | value            b   | value
-----------            -----------
'm' | 0.0              'm' | 1.1
                       'm' | 0.2
and run the query, what value ends up in t1.value? I ran some tests but couldn't find consistent behaviour, so I'm guessing it might just be undefined. Or is this kind of update query something that just shouldn't be done? There is a long boring story about why I've had to do it this way, but it's irrelevant to the technical nature of my enquiry.
This is known as a non-deterministic query: it means exactly what you have found, that you can run the query multiple times with no changes to the query or the underlying data and get different results.
In practice what happens is that the value will be updated with the last record encountered, so in your case it will be updated twice, but the first update will be overwritten by the last. What you have absolutely no control over is the order in which the SQL engine accesses the records; it will access them in whatever order it deems fit. This could be simply a clustered index scan from the beginning, or it could use other indexes and access the clustered index in a different order. You have no way of knowing this. It is quite likely that running the update multiple times would yield the same result, because with no changes to the data the SQL optimiser will use the same query plan. But again there is no guarantee, so you should not rely on a non-deterministic query to get deterministic results.
EDIT
To update the value in T1 to the maximum corresponding value in T2, you can use DMax:
UPDATE T1
SET Value = DMax("Value", "T2", "b=" & T1.a);
When you execute the query as you’ve indicated, the “value” that ends up in “t1” for the row ‘m’ will be, effectively, random, due to the fact that “t2” has multiple rows for the identity value ‘m’.
Unless you specifically specify that you want the maximum (max function), minimum (min function) or some-other aggregate of the collection of rows with the identity ‘m’ the database has no ability to make a defined choice and as such returns whatever value it first comes across, hence the inconsistent behaviour.
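For comparison, in databases that accept a correlated subquery as the SET source (Access generally needs the DMax workaround above), the deterministic version might look like this sketch:

-- Pick the maximum matching value explicitly, so the result no longer
-- depends on the order in which the engine happens to visit rows.
UPDATE t1
SET value = ( SELECT MAX(t2.value)
              FROM t2
              WHERE t2.b = t1.a );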
Hope this helps.
I've researched and realize I have a unique situation.
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table where we randomly select a percentage of the rows and insert the rows into table 'Results'. The "ID" column from the 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an SQL query that will renumber the column I specify. I prefer not to drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
  on ItemTicket.item_ticket_id = ItemVoid.item_ticket_id
where ItemTicket.ID in (select ID
                        from results)
Example Tables Before:
Example Tables After:
As you can see, 2 rows were deleted from both tables based on the ID column. So now I've got to figure out how to renumber the item_ticket_id and the item_void_id columns, where the higher number decreases to the missing value, the next highest one decreases, etc. Problem #2: if the item_ticket_id changes in order to be sequential in ItemTicket, then that change has to be carried over to ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(answering an old question as it's the first search result when I was looking this up)
(MS T-SQL)
Resequencing an ID column (not an identity one) that has gaps can be performed with a simple CTE and row_number() to generate a new sequence.
The UPDATE works via the CTE 'virtual table' without any extra problems, actually updating the underlying original table.
Don't worry about the ID fields clashing during the update. If you wonder what happens when IDs are set to values that already exist, it doesn't suffer that problem: the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
SELECT
ID,
ROW_NUMBER() OVER (ORDER BY ID) as ID_New
FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
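To carry the renumbering into a child table (the ItemTicket/ItemVoid situation in the question), one approach is to update the child first, while the old parent values still exist to join on; a sketch, assuming the question's table and column names:

-- Propagate the new ticket numbers to the child table BEFORE
-- resequencing the parent, while the old IDs can still be joined on.
WITH NewSequence AS
(
    SELECT item_ticket_id,
           ROW_NUMBER() OVER (ORDER BY item_ticket_id) AS ID_New
    FROM ItemTicket
)
UPDATE iv
SET iv.item_ticket_id = ns.ID_New
FROM ItemVoid iv
JOIN NewSequence ns ON iv.item_ticket_id = ns.item_ticket_id;

-- Then resequence ItemTicket itself with the same CTE pattern as above.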
Since you are looking for advice on this, mine is that you need to redesign, as I see a big flaw in your current design.
Instead of deleting the records and then going through the hassle of renumbering the remaining records, use a bit flag to mark records as inactive. Then when you are querying the records, just include a WHERE clause to include only the records that are active:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them then you can perform this task the following way:
Create a new table.
Insert your original data into the new table using the new numbers.
Drop your old table.
Rename the new table with the corrected numbers.
As you can see, there would be a lot of steps involved in re-numbering the records; a sketch follows. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
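A rough sketch of that rebuild, assuming a simplified ItemTicket with only the columns shown in the question and no constraints to re-create:

-- Rebuild-and-rename renumbering. Real tables would also need their
-- indexes, constraints, and foreign keys re-created on the new table.
SELECT ROW_NUMBER() OVER (ORDER BY item_ticket_id) AS item_ticket_id,
       ID
INTO ItemTicket_New
FROM ItemTicket;

DROP TABLE ItemTicket;
EXEC sp_rename 'ItemTicket_New', 'ItemTicket';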
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET Inactive = 1
FROM ItemVoid
JOIN ItemTicket
on ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (select ID from results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function. In standard SQL (SQL Server, MySQL), the function is row_number(). You use it as follows:
select row_number() over (order by <col>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a with statement to recalculate the row numbers, and then assign them using an update. For transactional integrity, you might wrap the delete and update into a single transaction.
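A sketch of that sequence, using the hypothetical YourTable/ID names from the CTE answer above:

-- Delete, then renumber, inside one transaction so no reader sees
-- the intermediate state with gaps.
BEGIN TRANSACTION;

DELETE FROM YourTable
WHERE ID IN (SELECT ID FROM results);

WITH NewSequence AS
(
    SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) AS ID_New
    FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;

COMMIT TRANSACTION;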
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you against using cursors, since they have lousy performance. And of course, this will not work on an identity column, since such a column cannot be modified.