How can I determine the actual database row insertion order? - sql

I have a multithreaded process which inserts several records into a single table. The inserts are performed in a stored procedure, with the sequence being generated INTO a variable, and that variable is later used inside of an INSERT.
Given that I'm not doing mysequence.nextval inside the INSERT itself, it makes me think that it is possible for two concurrent processes to grab a sequence in one order, then do the inserts in the reverse order. If this is the case, then the sequence numbers will not reflect the true order of insertion.
I also record the sysdate in a DATE column for each of my inserts, but I've noticed that often times the dates for two records match and I need to sort by the sequence number to break the tie. But given the previous issue, this doesn't seem to guarantee the actual insert order.
How can I determine the absolute order of insertion into the database?

DATE datatypes only go to seconds, whereas TIMESTAMP stores fractional seconds (up to 9 digits, 6 by default). Would that address the problem?
According to Oracle's docs:
TIMESTAMP: Year, month, and day values
of date, as well as hour, minute, and
second values of time, where
fractional_seconds_precision is the
number of digits in the fractional
part of the SECOND datetime field.
Accepted values of
fractional_seconds_precision are 0 to
9. The default is 6. The default format is determined explicitly by the
NLS_DATE_FORMAT parameter or
implicitly by the NLS_TERRITORY
parameter. The sizes varies from 7 to
11 bytes, depending on the precision.
This datatype contains the datetime
fields YEAR, MONTH, DAY, HOUR, MINUTE,
and SECOND. It contains fractional
seconds but does not have a time zone.
Whereas date does not:
DATE: Valid date range from January 1,
4712 BC to December 31, 9999 AD. The
default format is determined
explicitly by the NLS_DATE_FORMAT
parameter or implicitly by the
NLS_TERRITORY parameter. The size is
fixed at 7 bytes. This datatype
contains the datetime fields YEAR,
MONTH, DAY, HOUR, MINUTE, and SECOND.
It does not have fractional seconds or
a time zone.
Of course, having said that, I am not sure why it matters when the records were written, but that is a way that might solve your problem.
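A minimal sketch of that idea, assuming a hypothetical table ORDERTEST_TS (the table and column names are illustrative, not from the question). The column defaults to SYSTIMESTAMP so each insert records fractional seconds. The usable resolution still depends on the platform clock, so ties remain possible and a second sort key is kept:

```sql
-- Hypothetical table; all names are illustrative only.
create table ORDERTEST_TS (
  ORDERID   number not null,
  COLA      varchar2(10),
  INSERTTS  timestamp(6) default systimestamp,  -- fractional seconds, 6 digits
  constraint ORDERTEST_TS_pk primary key (ORDERID)
);

-- Show the fractional part explicitly and keep ORDERID as a tie-breaker.
select ORDERID,
       COLA,
       to_char(INSERTTS, 'YYYY-MM-DD HH24:MI:SS.FF6') as INSERTTS
from ORDERTEST_TS
order by INSERTTS, ORDERID;
```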

Sequence should be thread safe:
create table ORDERTEST (
ORDERID number not null ,
COLA varchar2(10) ,
INSERTDATE date default sysdate,
constraint ORDERTEST_pk primary key (orderid)
) ;
create sequence ORDERTEST_seq start with 1 nocycle nocache ;
insert into ORDERTEST (ORDERID, COLA, INSERTDATE)
select ORDERTEST_SEQ.NEXTVAL , substr(OBJECT_NAME,1,10), sysdate
from USER_OBJECTS
where rownum <= 5; --just to limit results
select *
from ORDERTEST
order by ORDERID desc ;
ORDERID COLA INSERTDATE
---------------------- ---------- -------------------------
5 C_COBJ# 16-JUL-10 12.15.36
4 UNDO$ 16-JUL-10 12.15.36
3 CON$ 16-JUL-10 12.15.36
2 I_USER1 16-JUL-10 12.15.36
1 ICOL$ 16-JUL-10 12.15.36
now in a different session:
insert into ORDERTEST (ORDERID, COLA, INSERTDATE)
select ORDERTEST_SEQ.NEXTVAL , substr(OBJECT_NAME,1,10), sysdate
from USER_OBJECTS
where rownum <= 5; --just to limit results
select *
from ORDERTEST
order by ORDERID desc ;
5 rows inserted
ORDERID COLA INSERTDATE
---------------------- ---------- -------------------------
10 C_COBJ# 16-JUL-10 12.17.23
9 UNDO$ 16-JUL-10 12.17.23
8 CON$ 16-JUL-10 12.17.23
7 I_USER1 16-JUL-10 12.17.23
6 ICOL$ 16-JUL-10 12.17.23
The Oracle sequence is thread safe:
http://download.oracle.com/docs/cd/B19306_01/server.102/b14231/views.htm#ADMIN020
"If two users are accessing the same sequence concurrently, then the sequence numbers each user receives might have gaps because sequence numbers are also being generated by the other user." the numbers may not be 1,2,3,4,5 (as in my example --> if you fear this you can up the cache)
This can also help, although they do not cite their source:
http://forums.oracle.com/forums/thread.jspa?threadID=910428
"the sequence is incremented immediately and permanently, whether you commit or roll back the transaction. Concurrent access of NextVal on a sequence will always return separate values to each caller."
If your fear is the inserts will be out of order and you need the sequence value use the returning clause:
declare
x number ;
begin
insert into ORDERTEST (ORDERID, COLA, INSERTDATE)
values( ORDERTEST_SEQ.NEXTVAL , 'abcd', sysdate)
returning orderid into x;
dbms_output.put_line(x);
end;
--11
then you know it got inserted right then and there.

There are several effects going on. Today's computers can execute so many operations per second that the timers can't keep up. Also, getting the current time is a somewhat expensive operation, so you have gaps that can last several milliseconds until the value changes. That's why you get the same sysdate for different rows.
Now to solve your insert problem. Calling nextval on a sequence is guaranteed to remove this value from the sequence. If two threads call nextval several times, you can get interleaved numbers (i.e. thread 1 will see 1 3 4 7 and thread 2 will see 2 5 6 8) but you can be sure that each thread will get different numbers.
So even if you don't use the result of nextval immediately, you should be safe. As for the "absolute" insert order in the database, this might be hard to tell. For example, a DB could keep the rows in a cache before writing them to disk. The rows could be reordered to optimize disk access. But as long as you assign the results from nextval to your rows in the order in which you insert them, this shouldn't matter and they should always appear to be inserted in order.

While there may be some concept of insertion order into a database, there is certainly no concept of retrieval order. Any rows that come back from the database will come back in whatever order the DB sees fit to return them in, and this may or may not have ANYTHING to do with the order they were inserted into the database. Also, the order that rows are inserted into the DB may have little to nothing to do with how they are physically stored on disk.
Relying upon any order from a DB query without the use of an ORDER BY clause is folly. If you wish to be certain of any order, you need to maintain that relationship at a formal level (sequences, timestamps, whatever) in your logic when creating the records for insertion.

If the transactions are separate, you can determine this from the ora_rowscn pseudo-column for the table.
[Edit]
Some more detail, and I'll delete my answer if this is not of use - unless you created the table with the non-default "rowdependencies" clause, you'll have other rows from the block tagged with the scn, so this may be misleading. If you really want this information without an application change you'll have to rebuild the table with this clause.
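As a rough sketch of what that looks like, assuming a hypothetical table ORDERTEST_RD created with ROWDEPENDENCIES (names are illustrative). ORA_ROWSCN is assigned when the transaction commits, so rows committed together share a value and a second sort key is still needed:

```sql
-- Hypothetical table; ROWDEPENDENCIES makes Oracle track the SCN per row
-- instead of per block.
create table ORDERTEST_RD (
  ORDERID  number not null,
  COLA     varchar2(10)
) rowdependencies;

-- Order rows by the SCN of the transaction that last changed them.
-- Rows from the same transaction share an SCN, hence the ORDERID tie-breaker.
select ORDERID, COLA, ora_rowscn
from ORDERTEST_RD
order by ora_rowscn, ORDERID;
```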

Given your description of the issue you're trying to resolve, I would think that the sequences would be fine. If you have two processes that call the same stored procedure and the second (chronological) one finishes first for some reason, is that actually relevant? I would think the order in which the procedures were called (which will be reflected by the sequence (unless you're using RAC)) would be more meaningful than the order in which they were written to the database.
If you're really worried about the sequence the rows were inserted in, then you need to look at when the commits were issued, not when the insert statements were issued. Otherwise you have the following scenario as a possibility:
Transaction 1 is started
Transaction 2 is started
Transaction 3 is started
Transaction 2 inserts
Transaction 1 inserts
Transaction 3 inserts
Transaction 3 commits
Transaction 1 commits
Transaction 2 commits
In this case Transaction 1 was started first, Transaction 2 inserted first, and Transaction 3 committed first. The sequence number gives you a good idea of the order in which the transactions were started. A timestamp field will give you an idea of when the inserts were issued. The only reliable way to get an order for the commits is to serialize writes to the table, which is generally a bad idea (it removes scalability).

You should (a) add the timestamp to each record, and (b) move the sequence NEXTVAL to the INSERT statement.
That way, when you query the table, you can ORDER BY timestamp, id, which will effectively be the order in which the rows were actually inserted.

Related

How to create a queue like structure in SQL Server

Is there a good way to create a queue like structure in SQL Server?
Requirements:
When I insert rows, I want them to default to the bottom of the queue
When I select rows, I want to easily be able to get the top of the queue
Here's the tough one: I want to be able to easily move something up the queue, and reorient the rest. Example: move item 5 up to number 1, then 1-4 becomes 2-5
A simple identity column would work for requirements 1 and 2, but how would I handle 3?
Solution
I ended up implementing the solution from @roger-wolf
One difference, I used a trigger rather than a stored procedure to renumber. Here's my trigger code:
CREATE TRIGGER [dbo].[TR_Queue]
ON [dbo].[Queue]
AFTER INSERT, DELETE, UPDATE
AS
BEGIN
SET NOCOUNT ON;
-- Get the current max value in priority
DECLARE @maxPriority INT = COALESCE((SELECT MAX([priority]) FROM [dbo].[Queue]), 0);
WITH newValues AS (
-- Renumber by priority, starting at 1
SELECT [queueID]
,ROW_NUMBER() OVER(ORDER BY [priority] ASC) AS [priority]
FROM (
-- Pretend all nulls are greater than previous max priority
SELECT [queueID]
,COALESCE([priority], @maxPriority+1) AS [priority]
FROM [dbo].[Queue]
) AS tbl
)
UPDATE q
SET q.[priority] = newValues.[priority]
FROM [dbo].[Queue] AS q
INNER JOIN newValues
ON q.[queueID] = newValues.[queueID]
END
This works well for me as the queue is always relatively small and infrequently updated, so I don't have to worry about performance of the trigger.
Use a float column for prioritisation and an approach similar to Celko trees:
If you have items with priorities 1, 2, and 3 and the last needs to become second, calculate an average between its new neighbours, 1.5 in this example;
If another one needs to become second, its priority would be 1.25. This can go on for quite a while;
When displaying queued items by their priority, use row_number() instead of float values in UI;
If items become too close together (say, 1e-10 or less), have a stored procedure ready to renumber them as integers.
The only deficiency I see here is that it becomes a bit more difficult to find N-th item in a middle of a queue, when it's neither first nor last. If you don't need that, the approach should work.
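A minimal T-SQL sketch of the midpoint move described above, assuming a hypothetical table dbo.Queue(queueID INT PRIMARY KEY, [priority] FLOAT NOT NULL). The variable names and the "minus 1 / plus 1" rule for the head and tail of the queue are my own assumptions, not part of the answer:

```sql
-- Assumed schema: dbo.Queue(queueID INT PRIMARY KEY, [priority] FLOAT NOT NULL)
DECLARE @moveId INT = 5;     -- item to move (hypothetical id)
DECLARE @targetPos INT = 1;  -- 1-based position it should end up at
DECLARE @prev FLOAT, @next FLOAT;

-- Rank every other item, then pick the two neighbours of the target slot.
WITH ranked AS (
    SELECT [priority],
           ROW_NUMBER() OVER (ORDER BY [priority]) AS pos
    FROM dbo.Queue
    WHERE queueID <> @moveId
)
SELECT @prev = MAX(CASE WHEN pos = @targetPos - 1 THEN [priority] END),
       @next = MAX(CASE WHEN pos = @targetPos     THEN [priority] END)
FROM ranked;

UPDATE dbo.Queue
SET [priority] = CASE
        WHEN @prev IS NULL AND @next IS NULL THEN [priority]  -- no neighbours found: leave as is
        WHEN @prev IS NULL THEN @next - 1                     -- becomes the new head
        WHEN @next IS NULL THEN @prev + 1                     -- becomes the new tail
        ELSE (@prev + @next) / 2                              -- midpoint between neighbours
    END
WHERE queueID = @moveId;
```

With priorities 1 to 5 and item 5 moved to position 1, the item gets priority 0 and the others keep their values, so only one row is written per move.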
You could add a Priority column of type DateTime, and when you set a row as a priority row you set the current date-time in the Priority column and then use that as part of your order by criteria?
I had a similar requirement in a past project, what I did (and it worked):
Add column update_at_utc of type datetime2
When inserting, set update_at_utc = GETUTCDATE() (or SYSUTCDATETIME() for full datetime2 precision)
When retrieving, order by update_at_utc
When moving a row in the queue, for example between rows 3 and 4, simply take average of update_at_utc of these rows and use it to set update_at_utc of the row being moved.
Note 1: Point 4 assumes that the frequency of inserts and of moving the rows up/down the queue is such that datetime2 type has sufficient resolution. For example, if you insert 2 rows 1 millisecond apart, and then try to move 1000 rows between these 2 rows, then datetime2 resolution will be insufficient (https://learn.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-2017). In such case, the moving of rows up/down the queue would need to be more complicated; When moving a row N places lower down:
Remember update_at_utc of the row N places lower down
For all rows between the current and the new position: assign row's update_at_utc to the preceding row's update_at_utc
Assign update_at_utc of the row being moved to the date remembered in point 1 above.
Note 2: I suggest UTC dates instead of local dates to avoid issues during a daylight saving switch.

PostgreSQL FOR UPDATE with ORDER BY and LIMIT interaction

According to https://www.postgresql.org/docs/9.5/static/sql-select.html
It is possible for a SELECT command running at the READ COMMITTED
transaction isolation level and using ORDER BY and a locking clause to
return rows out of order. This is because ORDER BY is applied first.
The command sorts the result, but might then block trying to obtain a
lock on one or more of the rows. Once the SELECT unblocks, some of the
ordering column values might have been modified, leading to those rows
appearing to be out of order (though they are in order in terms of the
original column values). This can be worked around at need by placing
the FOR UPDATE/SHARE clause in a sub-query, for example
SELECT * FROM (SELECT * FROM mytable FOR UPDATE) ss ORDER BY column1;
Note that this will result in locking all rows of mytable, whereas FOR
UPDATE at the top level would lock only the actually returned rows.
This can make for a significant performance difference, particularly
if the ORDER BY is combined with LIMIT or other restrictions. So this
technique is recommended only if concurrent updates of the ordering
columns are expected and a strictly sorted result is required.
At the REPEATABLE READ or SERIALIZABLE transaction isolation level
this would cause a serialization failure (with a SQLSTATE of '40001'),
so there is no possibility of receiving rows out of order under these
isolation levels.
Let's say I run this query,
SELECT * FROM banks as x WHERE x.date < {inputDate} ORDER BY x.date DESC LIMIT 1 FOR UPDATE;
Assume row date can not be modified
Possible to INSERT / DELETE row for banks
Will I be safe to get the correct result every time regardless of race condition?
Will the single row that I retrieve be locked FOR UPDATE? (I have read many sources regarding the interaction of LIMIT and FOR UPDATE and am still quite confused. Can someone confirm what is really going to happen in this case?)
The article mentions incorrect order, but I can't really tell whether that applies to my use case, i.e. whether I will get the correct result while inserts and deletes are going on.
Yes, your query can return a wrong result.
See this example:
CREATE TABLE banks (
id integer PRIMARY KEY,
date timestamp with time zone NOT NULL
);
INSERT INTO banks VALUES (1, '2017-09-01 00:00:00');
INSERT INTO banks VALUES (2, '2017-09-01 01:00:00');
Now, in one session we start a transaction that updates a row:
BEGIN;
UPDATE banks
SET date = '2017-08-01 00:00:00'
WHERE id = 2;
Then, in another session, we run your statement:
SELECT *
FROM banks AS x
WHERE x.date < current_timestamp
ORDER BY x.date DESC
LIMIT 1
FOR UPDATE;
This session will hang as it tries to lock the row with id = 2.
Now in the first session, close the transaction:
COMMIT;
Then the second session will unblock and return a wrong result:
┌────┬────────────────────────┐
│ id │ date │
├────┼────────────────────────┤
│ 2 │ 2017-08-01 00:00:00+02 │
└────┴────────────────────────┘
(1 row)
This is because the query returns the row identified before locking it, but the values displayed are the current ones from after the lock has succeeded.
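For comparison, here is a sketch of the sub-query workaround from the documentation quoted in the question, applied to the banks example. The lock is taken inside the sub-query, so the outer ORDER BY and LIMIT see the values as they are after the lock was obtained. As the documentation warns, this locks every row matched by the inner WHERE clause, not just the one returned:

```sql
-- Lock first (inner query), then sort and limit (outer query).
-- All rows with date < current_timestamp are locked, not only the returned one.
SELECT *
FROM (
    SELECT *
    FROM banks AS b
    WHERE b.date < current_timestamp
    FOR UPDATE
) AS ss
ORDER BY ss.date DESC
LIMIT 1;
```

Run against the two-session example above, the second session should still block until the first commits, but it would then sort on the updated dates and return the row with id = 1.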

Can SQL return different results for two runs of the same query using ORDER BY?

I have the following table:
CREATE TABLE dbo.TestSort
(
Id int NOT NULL IDENTITY (1, 1),
Value int NOT NULL
)
The Value column could (and is expected to) contain duplicates.
Let's also assume there are already 1000 rows in the table.
I am trying to prove a point about unstable sorting.
Given this query that returns a 'page' of 10 results from the first 1000 inserted results:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value
My intuition tells me that two runs of this query could return different rows if the Value column contains repeated values.
I'm basing this on the facts that:
the sort is not stable
if new rows are inserted in the table between the two runs of the query, it could possibly create a re-balancing of B-trees (the Value column may be indexed or not)
EDIT: For completeness: I assume rows never change once inserted, and are never deleted.
In contrast, a query with stable sort (ordering also by Id) should always return the same results, since IDs are unique:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value, Id
The question is: Is my intuition correct? If yes, can you provide an actual example of operations that would produce different results (at least "on your machine")? You could modify the query, add indexes on the Values column etc.
I don't care about the exact query, but about the principle.
I am using MS SQL Server (2014), but am equally satisfied with answers for any SQL database.
If not, then why?
Your intuition is correct. In SQL, the sort for order by is not stable. So, if you have ties, they can be returned in any order. And, the order can change from one run to another.
The documentation sort of explains this:
Using OFFSET and FETCH as a paging solution requires running the query
one time for each "page" of data returned to the client application.
For example, to return the results of a query in 10-row increments,
you must execute the query one time to return rows 1 to 10 and then
run the query again to return rows 11 to 20 and so on. Each query is
independent and not related to each other in any way. This means that,
unlike using a cursor in which the query is executed once and state is
maintained on the server, the client application is responsible for
tracking state. To achieve stable results between query requests using
OFFSET and FETCH, the following conditions must be met:
The underlying data that is used by the query must not change. That is, either the rows touched by the query are not updated or all
requests for pages from the query are executed in a single transaction
using either snapshot or serializable transaction isolation. For more
information about these transaction isolation levels, see SET
TRANSACTION ISOLATION LEVEL (Transact-SQL).
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
Although this specifically refers to offset/fetch, it clearly applies to running the query multiple times without those clauses.
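As a small sketch of the second condition, using the dbo.TestSort table from the question and assuming Id is unique as the question states, a page query with a unique tie-breaker could look like this:

```sql
-- Page 2 (rows 11 to 20). Value may contain duplicates, so Id is added
-- to ORDER BY to make the ordering of tied rows deterministic.
SELECT Id, Value
FROM dbo.TestSort
WHERE Id <= 1000
ORDER BY Value, Id
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;
```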
If you have ties when ordering the order by is not stable.
CREATE TABLE #TestSort
(
Id INT NOT NULL IDENTITY (1, 1) PRIMARY KEY,
Value INT NOT NULL
) ;
DECLARE @c INT = 0;
WHILE @c < 100000
BEGIN
INSERT INTO #TestSort(Value)
VALUES ('2');
SET @c += 1;
END
Example:
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
DBCC DROPCLEANBUFFERS; -- run to clear cache
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
The point is that I force the query optimizer to use a parallel plan, so there is no guarantee that it will read the data sequentially, as a clustered index scan probably would when no parallelism is involved.
You cannot be sure how the query optimizer will read the data unless you explicitly force the result to be sorted in a specific way using ORDER BY Id, Value.
For more info read No Seatbelt - Expecting Order without ORDER BY.
I think this post will answer your question:
Is SQL order by clause guaranteed to be stable ( by Standards)
The result is the same every time when you are in a single-threaded environment. Once multi-threading is used, you can't guarantee that.

oracle sql query requirements

I have some data in an Oracle table, about 10,000 rows. I want to generate a column that returns 1 for the first row, 2 for the second, 1 for the third, 2 for the fourth, 1 for the fifth, 2 for the sixth, and so on. Is there any way to do this with a SQL query, or with a script that updates my column this way? I have thought about it a lot but could not work out how to do it in SQL or by any other means. Please help if this is possible with my table data.
You can use the combination of the ROWNUM and MOD functions.
Your query would look something like this:
SELECT ROWNUM, 2 - MOD(ROWNUM, 2) FROM ...
The MOD function will return 0 for even rows and 1 for odd rows.
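If the values need to be stored in a column, a sketch along these lines may work. It assumes a hypothetical table mytable with a unique key column id and a target column flag (both names are assumptions). ROW_NUMBER() with an explicit ORDER BY is used here instead of ROWNUM so that "first row" and "second row" are well defined:

```sql
-- Assumed schema: mytable(id <unique key>, flag number, ...)
merge into mytable t
using (
  -- 2 - mod(n, 2) yields 1 for odd n and 2 for even n
  select id,
         2 - mod(row_number() over (order by id), 2) as flag
  from mytable
) s
on (t.id = s.id)
when matched then
  update set t.flag = s.flag;
```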
select mod(rownum,5)+1,fld1, fld2, fld3 from mytable;
Edit:
I did not misunderstand requirements, I worked around them. Adding a column and then updating a table that way is a bad design idea. Tables are seldom completely static, even rule and validation tables. The only time this might make any sense is if the table is locked against delete, insert, and update. Any change to any existing row can alter the logical order. Which was never specified. Delete means the entire sequence has to be rewritten. Update and insert can have the same effect.
And if you wanted to do this you can use a sequence to insert a bogus counter. A sequence that cycles over and over, assuming you know the order and can control inserts and updates in terms of that order.

SQL Query to Extract Timestamp Difference

I have been delegated a task on a dataset that has been pre-extracted from another data source(s) and I currently only have Access available to query this data (Excel for basic data analysis as less than row limit at the moment). Essentially, I have three relevant fields:
FK_ID = arbitrary number associated with a transaction
CD = code associated with status of transaction (assume only BEGIN and END are the values)
TIMESTAMP = timestamp of transaction
Now a simplified example of this data set:
FK_ID CD TIMESTAMP
000012 END 2012-01-02-14.27.59.133612
000012 BEGIN 2012-01-02-14.27.57.176631
000015 END 2011-12-12-14.27.59.133612
000015 BEGIN 2011-12-11-14.27.59.133612
000019 END 2011-11-10-14.27.59.133612
000019 BEGIN 2011-11-09-14.27.59.133612
000019 END 2011-11-08-14.27.59.133612
000019 BEGIN 2011-11-07-14.27.59.133612
As you can see, it's not very complicated, the problem is I need to calculate the timestamp difference between the BEGIN and END codes for each unique FK_ID and then create a column to tally that difference, also accounting for the fact that some FK_IDs have multiple timestamps BEGIN/END pairs associated with them.
Now I have been authorized to ignore cases where more than a pair exists (by ignore, I mean only count that initial pair), but it is not preferable.
I need to get these differences though to determine a total average time to determine if that time is within our goals approximately.
What's the best query to go about getting this timestamp difference for each FK_ID pair or other automated means you'd suggest?
I do understand SQL and am proficient enough in C#, but the time frame and other factors are wreaking havoc on my ability to break down this problem logically.
Assuming the table name is TABLE1, in Access I would do something like:
SELECT Table1.FK_ID, DateDiff("s",[TABLE1].[TIMESTAMP],[END_QUERY].[TIMESTAMP]) AS DifferenceInSeconds
FROM Table1
INNER JOIN
(SELECT Table1.FK_ID, Table1.CD, Table1.TIMESTAMP
FROM Table1
WHERE (((Table1.CD)="END"))
ORDER BY Table1.FK_ID, Table1.CD) AS END_QUERY
ON Table1.FK_ID = END_QUERY.FK_ID
WHERE (((Table1.CD)="BEGIN"))
ORDER BY Table1.FK_ID, Table1.CD;
Basically, get all the BEGIN and END rows in two subqueries and compute the difference between them (in seconds; you didn't mention the unit). The one issue you'll encounter is when a transaction has multiple entries. You could do a GROUP BY to get the very first BEGIN and the very last END, but there may be some discrepancies.
I hope this helps you a little.
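For FK_IDs with several BEGIN/END pairs, one possible sketch is to match each BEGIN with the earliest END that follows it for the same FK_ID. This assumes the table is called Table1 and that TIMESTAMP values compare correctly as stored. The sample format (year first, fixed width) does sort correctly as text:

```sql
SELECT b.FK_ID,
       b.[TIMESTAMP] AS BeginTs,
       (SELECT MIN(e.[TIMESTAMP])
        FROM Table1 AS e
        WHERE e.FK_ID = b.FK_ID
          AND e.CD = "END"
          AND e.[TIMESTAMP] > b.[TIMESTAMP]) AS EndTs
FROM Table1 AS b
WHERE b.CD = "BEGIN";
```

This returns one row per BEGIN with its matching END, so FK_ID 000019 in the sample would yield two pairs. If the column is stored as text in the format shown, BeginTs and EndTs would still need converting to date/time values before DateDiff can be applied to them.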