I have a table in Oracle.
I am creating multiple Batch jobs.
Each Batch job is inserting some number of records in the table.
I want to know whether their insert statements are executed sequentially.
Within a particular job, if it runs in one transaction, the answer is 'yes'. But if you want to determine the sequence across all jobs that start at the same time, it depends on the particular situation and on how your jobs are implemented. For example, if we have two jobs and one of them starts earlier than the other, but the first one needs more time to collect its data, then we cannot say that the first job's inserts will be done earlier than the second's. Many factors affect the order in which records are inserted. So if controlling the order is critical for you, you should implement that ordering yourself, using timestamp checks or object locks.
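As an illustration only (the lock table, target table, and sequence are invented names), one way to impose a strict order across jobs in Oracle is to serialize the inserts on a shared lock row, so whichever job acquires the lock first also inserts first:

-- Hypothetical sketch: insert_lock is a one-row coordination table created just to be locked.
-- Each batch job takes the row lock, performs its inserts, then commits (releasing the lock),
-- so the inserts of different jobs cannot interleave.
SELECT id FROM insert_lock WHERE id = 1 FOR UPDATE;

INSERT INTO target_table (id, payload, inserted_at)
VALUES (target_seq.NEXTVAL, :payload, SYSTIMESTAMP);

COMMIT;  -- releases the lock and makes this job's inserts visible

This serializes the jobs, so it only makes sense when insert order really matters more than throughput.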
My data model is similar to an assignment problem. Let's assume we have a firm that provides suitable workers for requested jobs.
For now, I have such relations:
Customer (id)
Job (id)
Worker (id, available)
Jobs In Progress (customer_id, job_id)
Busy Workers (customer_id, worker_id)
There are many-to-many relationships between Customer and Job and between Customer and Worker. The data is close to real-time, so it is highly dynamic.
We want to maintain such queries:
Request a worker for a job.
Return the worker when the job has finished.
These queries require reading, updating, deleting, and inserting data in several tables.
For example, if a customer requests a worker, we have to check whether this customer already exists in the table; whether he already owns a suitable worker in Busy Workers; if not, find a suitable available worker in Worker; and check whether such a job is already registered in Job. In the worst case, we have to atomically insert the customer into Customer, insert the job into Job, insert a corresponding row into Jobs In Progress, decrement Worker.available, and insert a row into Busy Workers.
For the second query, we have to do all of this in reverse order: increment Worker.available, delete the customer if he has no jobs, delete the job if no customer needs it, and so on.
So we have a lot of consistency rules: the number of busy workers has to be consistent with Worker.available, a customer has to be present in the table only if he has requested jobs that are not yet finished, and a job has to be present in the table only if at least one customer has requested it.
I have read a lot about isolation levels and locking in databases, but I still don't understand how to ensure consistency across multiple tables. It seems like isolation levels don't work, because multiple tables are involved and the data may become inconsistent between selects from two tables. And it seems like locks don't work either, because AFAIK SQL Server can't atomically acquire a lock on multiple tables, and therefore the data may become inconsistent between acquiring the locks.
Actually, I'm looking for a solution, or the idea of a solution, in general, without reference to a concrete RDBMS; it should be something applicable one way or another to the most popular RDBMSs like MySQL, PostgreSQL, SQL Server, and Oracle. So it does not have to be a complete solution with examples for all of these RDBMSs; some practices, tips, or references would be enough.
I apologize for my English and thank you in advance.
First: Think about your model a bit more. I would not keep so much redundant information. "Decrement Worker.available and insert a row in Busy Workers" is completely superfluous, because you can get that information easily by asking the other tables. You might say that is more costly to query; I would call that premature optimization. Redundancy is very costly per se.
Second: Think of locks as exclusive resources that only one party may hold. So the simplest way to ensure consistency would be to have all database users lock a single record in the database using SELECT ... FOR UPDATE. All changes would be serialized. If you use an MVCC DBMS like PostgreSQL, Oracle, or even SQL Server, readers would still always see a consistent situation.
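A minimal sketch of that idea, assuming a one-row table called app_lock that exists only to be locked (all names invented):

-- Run inside one transaction (start it explicitly where the DBMS requires, e.g. BEGIN / START TRANSACTION).
-- Every transaction that changes Customer/Job/Worker data first grabs this one row.
-- Only one transaction can hold it at a time, so all changes are serialized;
-- under MVCC, plain readers are never blocked and always see a consistent snapshot.
SELECT id FROM app_lock WHERE id = 1 FOR UPDATE;

-- ... the actual checks, inserts, updates, and deletes across the tables go here ...

COMMIT;  -- releases the lock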
Third: When making your change, perhaps you only need to detect whether another user/transaction has already changed a certain record. This can be done by maintaining so-called version attributes and checking during updates whether those attributes have changed. If a change is detected, you have to repeat the complete transaction. That is called optimistic locking.
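A sketch of the version-attribute check, assuming Worker carries an extra integer column named version (an assumption, not part of the original model):

-- Read the row and remember its version.
SELECT available, version
  FROM Worker
 WHERE id = :worker_id;          -- suppose this returns version = 7

-- Write back only if nobody changed the row in the meantime.
UPDATE Worker
   SET available = :new_available,
       version   = version + 1
 WHERE id      = :worker_id
   AND version = 7;              -- the version we read earlier

-- If the update reports "0 rows affected", another transaction won the race:
-- roll back and repeat the whole read-check-write cycle.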
Fourth, and really the most important point: I hope you have understood the concept of a DBMS transaction as a means of bringing the DBMS from one consistent state to another.
I am implementing a CQRS pattern where one or more processes are inserting records into the database and one or more processes are pulling them out at a different pace.
I'd like consumer processes to poll the database for new records that were inserted since last check, but I'm not sure how to (safely) implement this.
You can assume that rows will not change once they are inserted. It seems it isn't enough for each row to have a unique id, and a timestamp indicating when it was inserted.
If I query for records with a timestamp greater than the last row I saw then I run into problems if multiple records were inserted at the same time (having the same timestamp).
If I query for records with an id greater than the last row I saw, then I run into problems where concurrent transactions may commit IDs in non-increasing order (e.g. PostgreSQL sessions allocate and cache sequence IDs ahead of time to improve performance).
Ideally, I am looking for a DBMS-agnostic solution, and I'd like to consume data as close to real time as possible. Any ideas?
Clarification: Each row should be consumed multiple times, once per consumer. Meaning, just because one consumer processes a row should not prevent other consumers from doing so. Each consumer will do something different with the same data.
Since you have a lot of data coming in and might have multiple records for the last timestamp, you need a way to keep track of the data you have already read. Here are a few different approaches with their pros and cons:
You can wait for all the data for a given timestamp to arrive. You would do this by not reading rows with the MAX(timestamp), so you get all the data from the table except the newest timestamp, for which data might still be coming in (see the sketch after this pro/con list).
Pro: Simple design
Con: Not real time processing
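A sketch of that query, where events, ts, id, and payload are assumed names:

-- Read everything newer than the last position consumed, but stop short of the
-- newest timestamp, because rows with that timestamp may still be arriving.
SELECT id, ts, payload
  FROM events
 WHERE ts > :last_consumed_ts
   AND ts < (SELECT MAX(ts) FROM events)
 ORDER BY ts, id;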
You can store the ids you have read each time for the last timestamp. When getting the data, you can use a query like (timestamp = :lasttimestamp AND id NOT IN (set of ids)) OR timestamp > :lasttimestamp (see the sketch after this pro/con list).
Pro: Almost real time
Con: Additional storage required
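A sketch of that query, again with assumed names; the ids listed in the NOT IN are the ones remembered from the previous poll:

-- Pick up any late arrivals that share the last timestamp we saw (skipping rows
-- already processed), plus everything with a strictly newer timestamp.
SELECT id, ts, payload
  FROM events
 WHERE (ts = :last_ts AND id NOT IN (:seen_id_1, :seen_id_2))
    OR  ts > :last_ts
 ORDER BY ts, id;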
If you don't use sharding or similar:
You can use optimistic locking.
For this you can create an order column with a unique index on the records table (the log). Before each insertion, the producer queries the log for the greatest order value, increments it, and inserts the next record with that order.
If a concurrency exception occurs (e.g. Duplicate entry '12345' for key order), then you retry the entire process (query, increment, insert).
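A sketch of that retry cycle, assuming the log table has a uniquely indexed order_no column (names are illustrative):

-- 1. Find the current maximum order value.
SELECT COALESCE(MAX(order_no), 0) AS max_order
  FROM log;

-- 2. Try to claim the next value; the unique index on order_no is what
--    detects the race between concurrent producers.
INSERT INTO log (order_no, payload)
VALUES (:max_order + 1, :payload);

-- 3. If the insert fails with a duplicate-key error on order_no,
--    go back to step 1 and try again.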
If you use sharding or similar:
Then you will need an additional service/table that will generate a new, unique, always-increasing order integer every time it is asked to do so.
This has the disadvantage that there is another piece that must be managed, a single point of failure that must be highly available.
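If you go that way, the generator can be as small as a single-row counter table, sketched here with invented names:

-- One row holds the last order value handed out; the UPDATE both increments it and
-- blocks concurrent callers until the surrounding transaction commits.
UPDATE order_counter
   SET last_value = last_value + 1
 WHERE counter_id = 1;

SELECT last_value
  FROM order_counter
 WHERE counter_id = 1;   -- use this value as the order of the record being inserted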
P.S.
"sharding or similar" means that you can't have unique indexes on the entire table because you use sharding or you write to multiple tables.
you can't rely on timestamps or anything that relates to physical time, because the system time may be adjusted by an automated service (NTP) or by a human operator.
I'm trying to better understand a nuance of SQL Server transactions.
Say I have a query that updates 1,000 existing rows, updating one of the columns to have the values 1 through 1,000. It's possible to execute this query and, when completed, those rows would not be numbered sequentially. This is because it's possible for another query to modify one of those rows before my query finishes.
On the other hand, if I wrap those updates in a transaction, that guarantees that if any one update fails, I can fail all updates. But does it also mean that those rows would be guaranteed to be sequential when I'm done?
In other words, are transactions always atomic?
But does it also mean that those rows would be guaranteed to be sequential when I'm done?
No. This has nothing to do with transactions, because what you're asking for simply doesn't exist: relational tables have no order, and asking for 'sequential rows' is the wrong question to ask. You can rephrase the question as 'will the 1000 updated rows contain the entire sequence from 1 to 1000, without gaps?' Most likely yes, but the truth of the matter is that there could be gaps depending on the way you do the updates. Those gaps would not appear because updated rows are modified after the update and before the commit, but because the update is a no-op (does not update any row), which is a common problem with read-modify-write-back updates (the row 'vanishes' between the read and the write-back due to concurrent operations).
To answer more precisely whether your code is correct or not, you would have to post the exact code you use for the update, as well as the exact table structure, including all indexes.
Atomic means the operations within the transaction will either all occur, or none of them will.
If one of the 1,000 statements fails, none of the operations within the transaction will commit. If you break the work into smaller transactions, say 100 statements each, then the blocks before the one containing the error (say the error occurs at the 501st statement) can be committed, the block containing the failing statement will not, and the blocks after it still can.
But does it also mean that those rows would be guaranteed to be sequential when I'm done?
You'll have to provide more context about what you're doing in the transaction for it to be "sequential".
The 2 points are unrelated
Sequential
If you insert the values 1 to 1000, they will only come back sequential if you use a WHERE and an ORDER BY on that column to limit you to these 1000 rows (see the sketch below), and only if there are no duplicates, so you'd need a unique constraint.
If you rely on an IDENTITY, it isn't guaranteed: Do Inserted Records Always Receive Contiguous Identity Values.
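For example, with a unique column n holding the inserted values (names are illustrative):

-- The rows come back as 1, 2, ..., 1000 only because we ask for that order;
-- without the ORDER BY the server may return them in any order it likes.
SELECT *
  FROM my_table
 WHERE n BETWEEN 1 AND 1000
 ORDER BY n;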
Atomicity
All transactions are atomic:
Is it necessary to encapsulate a single merge statement (with insert, delete and update) in a transaction?
SQL Server and connection loss in the middle of a transaction
Does it delete partially if execute a delete statement without transaction?
SQL transactions, like transactions on all database platforms, put the data in isolation to cover the entire ACID acronym (atomic, consistent, isolated and durable). So the answer is yes.
A transaction guarantees atomicity. That is the point.
Your problem is that after you do the update, the rows are only "sequential" until the next thing comes along and touches one of them.
If another step in your process requires them to still be sequential, then that step, too, needs to be within your original transaction.
I have a list of 100 entries that I want to process with multiple threads. Each thread will take up to 20 entries to process.
I'm currently using global temporary tables to store the entries that meet certain criteria. I also do not want threads to overlap on the entries they process.
How do I do this (preventing the overlap)?
Thanks!
If on 11g, I'd use the SELECT ... FOR UPDATE SKIP LOCKED.
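A sketch of how that can look, with entries and processed as invented names; limiting each thread to 20 rows is typically done by fetching only a limited number of rows from a cursor over this query:

-- Each thread locks unprocessed rows that no other session has already locked;
-- rows locked by another thread are silently skipped instead of blocking.
SELECT id
  FROM entries
 WHERE processed = 'N'
   FOR UPDATE SKIP LOCKED;

-- process the rows returned, mark them processed, then COMMIT to release the locks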
If on a previous version, I'd use Advanced Queuing to populate a queue with the primary key values of the entries to be processed, and have your threads dequeue those keys to process those records. Because the dequeue can be (but doesn't have to be, if memory serves) within the processing transaction's scope, the dequeue commits or rolls back with the processing, and no two threads can get the same records to process.
There are two issues here, so let's handle them separately:
How do you split some work among several threads/sessions?
You could use Advanced Queuing or the SKIP LOCKED feature as suggested by Adam.
You could also use a column that contains processing information, for example a STATE column that is empty when not processed. Each thread would start work on a row with:
UPDATE your_table
SET state = 'P'         -- mark the row as claimed for processing
WHERE state IS NULL     -- only consider rows nobody has claimed yet
AND ROWNUM = 1          -- claim a single row
RETURNING id INTO :id;
At this point the thread would commit, to prevent other threads from being blocked. Then you would do your processing and select another row when you're done.
Alternatively, you could also split the work beforehand and assign each process a range of ids that need to be processed (for example as sketched below).
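For instance, with a fixed number of threads you could partition the keys arithmetically (table and bind names are invented):

-- Thread number :thread_no (0, 1, ..., :thread_count - 1) only ever touches rows
-- whose id falls in its own bucket, so the threads can never collide.
SELECT id
  FROM your_table
 WHERE state IS NULL
   AND MOD(id, :thread_count) = :thread_no;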
How will temporary tables behave with multiple threads?
Most likely each thread will have its own Oracle session (otherwise you couldn't run queries in parallel). This means that each thread will have its own virtual copy of the temporary table. If you stored data in this table beforehand, the threads will not be able to see it (the temp table is always empty at the beginning of a session).
You will need regular tables if you want to store data accessible to multiple sessions. Temporary tables are fine for storing data that is private to a single session, for example intermediate data in a complex process.
Easiest will be to use DBMS_SCHEDULER to schedule a job for every row that you want to process. You have to pass a key to a permanent table to identify the row that you want to process, or put the full row in the arguments for the job, since a temporary table's contents are not visible in different sessions. The number of concurrent jobs is controlled by the resource manager, mostly limited by the number of CPUs.
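A rough sketch of the idea, where work_queue is an assumed permanent table holding the keys and PROCESS_ROW is a hypothetical stored procedure taking one key argument:

BEGIN
  FOR r IN (SELECT id FROM work_queue) LOOP
    DBMS_SCHEDULER.CREATE_JOB(
      job_name            => 'PROCESS_ROW_' || r.id,
      job_type            => 'STORED_PROCEDURE',
      job_action          => 'PROCESS_ROW',
      number_of_arguments => 1,
      enabled             => FALSE);                 -- create disabled so we can attach the argument
    DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE('PROCESS_ROW_' || r.id, 1, TO_CHAR(r.id));
    DBMS_SCHEDULER.ENABLE('PROCESS_ROW_' || r.id);   -- the scheduler runs the jobs concurrently
  END LOOP;
END;
/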
Why would you want to process row by row anyway? Set operations are in most cases a lot faster.
I am working on a SQL Job which involves 5 procs, a few while loops and a lot of Inserts and Updates.
This job processes around 75000 records.
Now, the job works fine for 10,000-20,000 records at a speed of around 500/min. After around 20,000 records, execution just dies. It loads around 3,000 records every 30 minutes and stays at that speed.
I suspected the network, but I don't know for sure. These kinds of queries are difficult to analyze through SQL Performance Monitor. I'm not very sure where to start.
Also, there is a single cursor in one of the procs, which executes for very few records.
Any suggestions on how to speed this process up on the full-size data set?
I would check if your updates are within a transaction. If they are, it could explain why it dies after a certain amount of "modified" data. You might check how large your "tempdb" gets as an indicator.
Also, I have seen cases where, during long-running transactions, the database would die when there is other usage at the same time, again because of transactionality and improper isolation levels being used.
If you can split your job into independent, non-overlapping chunks, you might want to do it: for example, doing the job in chunks by dates, by ID ranges of "root" objects, etc. (a sketch follows).
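A rough T-SQL sketch of chunking by ID ranges; the table, column, and status value are invented:

-- Process the staging table in 5,000-row id ranges, one short transaction per range,
-- so no single transaction grows large enough to bloat the log or tempdb.
DECLARE @from INT = 0, @chunk INT = 5000, @max INT;
SELECT @max = MAX(id) FROM dbo.StagingTable;

WHILE @from < @max
BEGIN
    BEGIN TRANSACTION;

    UPDATE dbo.StagingTable
       SET status = 'processed'      -- stand-in for the real per-chunk work
     WHERE id >  @from
       AND id <= @from + @chunk;

    COMMIT TRANSACTION;
    SET @from = @from + @chunk;
END;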
I suspect your whole process is flawed. I import a data file that contains 20,000,000 records; the import hits many more tables and does some very complex processing, and it takes less time than you are describing for 75,000 records. Remember, looping is every bit as bad as using cursors.
I think if you set this up as an SSIS package you might be surprised to find the whole thing can run in just a few minutes.
With your current set-up, consider whether you are running out of room in the temp database, or whether it is trying to grow and can't grow fast enough. Also consider whether, at the time the slowdown starts, some other job is running that might be causing blocking. And get rid of the loops and process things in a set-based manner.
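For instance, a per-row "check if it exists, then UPDATE, else INSERT" step can usually be replaced by a single set-based MERGE over the whole batch; the table and column names here are invented:

-- One MERGE replaces the row-by-row "check, then UPDATE or INSERT" pattern:
-- the whole staging batch is matched against the target table in a single statement.
MERGE dbo.Customers AS target
USING dbo.StagingCustomers AS source
   ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET target.name   = source.name,
               target.status = source.status
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, status)
    VALUES (source.customer_id, source.name, source.status);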
Okay...so here's what I am doing in steps:
Loading a file in a TEMP table, just an intermediary.
Do some validations on all records using set-based operations.
Actual Processing Starts NOW.
TRANSACTION BEGIN HERE......
LOOP STARTS HERE
a. Pick records based on the TEMP table's PK (say customer A).
b. Retrieve data from existing tables (e.g. employer information)
c. Validate information received/retrieved.
d. Check if the record already exists: UPDATE if so, else INSERT. (THIS HAPPENS IN A SEPARATE PROCEDURE)
e. Find ALL of Customer A's family members (PROCESSED IN ANOTHER **LOOP** - SEPARATE PROC)
f. Update the status for Customer A and his family members.
LOOP ENDS HERE
TRANSACTION ENDS HERE