Can this SQL operation be done without doing it row by row (RBAR)?

I have a set of tasks, with some tasks being more important than others.
Each task does some work on one or more databases.
These tasks are assigned to workers that will perform the task (threads in an application poll a table).
When the worker has done the task, it sets the value back to null to signal it can accept work again.
When assigning the tasks to the workers, I would like to impose an upper limit on the number of database connections that can be used at any one time, so a task that uses a database that is currently at its limit will not be assigned to a worker.
I can get the number of database connections available by subtracting the databases of tasks that are currently assigned to workers from the database limits.
My problem is this, how do I select tasks that can run, in order of importance, based on the number of database connections available, without doing it row by row?
I'm hoping the example below illustrates my problem:
On the right are the available database connections, decreasing as we go down the list of tasks in order of importance.
If I'm selecting them in order of the importance of a task, then the connections available to the next task depend on whether the previous one was selected, which depends on whether there was space for all its database connections.
In the case above, task 7 can run only because task 6 couldn't.
Also task 8 can't run because task 5 took the last connection to database C as it's a more important task.
Question:
Is there a way to work this out without using while loops and doing it row by row?
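To make the dependency between rows concrete, here is a minimal Python sketch of the greedy, in-order-of-importance selection described above. It is the row-by-row logic the question wants to avoid, not a set-based answer. The function name select_runnable and the sample data are hypothetical (they are not the figures from the question's example), chosen so that task 7 runs only because task 6 could not, and task 8 loses database C to task 5.

```python
from collections import Counter


def select_runnable(tasks, limits, in_use):
    """Greedy pass over tasks, most important first.

    tasks:  list of (task_id, [database names]), already sorted by importance
    limits: dict mapping database -> maximum connections
    in_use: dict mapping database -> connections held by assigned tasks
    """
    available = {db: limits[db] - in_use.get(db, 0) for db in limits}
    selected = []
    for task_id, dbs in tasks:
        needed = Counter(dbs)
        # A task is selected only if every database it uses still has room.
        if all(available.get(db, 0) >= n for db, n in needed.items()):
            for db, n in needed.items():
                available[db] -= n
            selected.append(task_id)
    return selected, available


# Hypothetical data for illustration only.
tasks = [
    (5, ["C"]),
    (6, ["A", "B"]),
    (7, ["A"]),
    (8, ["C"]),
]
limits = {"A": 1, "B": 1, "C": 1}
in_use = {"B": 1}  # database B is already at its limit

selected, remaining = select_runnable(tasks, limits, in_use)
print(selected)
```

Each decision changes the availability seen by every later task, which is why a plain filter over the task list does not reproduce this result.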

Related

SQL code to avoid the unnecessary start of a calculation if it was started by some other process?

I have a stored procedure that calculates some facts (say usp_calculate). It fills a cache-like table. The part of the table determined by the procedure's arguments must be recalculated every 20 minutes. Basically, usp_calculate returns early if the cached data is fresh; otherwise it spends, say, a minute calculating and returns after that.
usp_calculate is shared by several outer procedures that need the data. How should I prevent starting the time-consuming part of the procedure if it was already started by some other process? How can I implement a kind of signaling and waiting for the result instead of starting the calculation again?
Context: I have an SQL stored procedure named, say, usp_products. It ultimately performs a SELECT that returns rows with a product code, a product name, and calculated information: the special price for a customer and for the storage location. There are too many combinations (customers, price lists, other conditions) to precalculate the information in a separate process. It must be calculated on demand, for the specific combination.
The third-party database that is the source of the information is not designed for detecting changes. Anyway, the time condition (not older than 20 minutes) is considered good enough to treat the data as "fresh".
The building block for this is probably going to be application locks.
You can obtain an exclusive application lock using sp_getapplock. You can then determine if the data is "fresh enough" either using freshness information inherently contained in that data or a separate table that you use for tracking this.
At this point, if necessary, you refresh the data and update the freshness information.
Finally, you release the lock using sp_releaseapplock and let all of the other callers have their chance to acquire the lock and discover that the data is fresh.
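The acquire, check freshness, refresh, release sequence can be illustrated outside the database. The following is only an analogy in Python: a threading.Lock stands in for the application lock, and the names get_facts, expensive_calculation and MAX_AGE_SECONDS are hypothetical. It does not call sp_getapplock or sp_releaseapplock.

```python
import threading
import time

MAX_AGE_SECONDS = 20 * 60  # freshness window from the question

_lock = threading.Lock()  # stands in for the exclusive application lock
_cache = {"value": None, "refreshed_at": None}
calculation_runs = 0


def expensive_calculation():
    """Placeholder for the slow part of usp_calculate."""
    global calculation_runs
    calculation_runs += 1
    time.sleep(0.05)
    return "facts"


def get_facts():
    # Acquire the lock; it is released on exit even if the refresh raises.
    with _lock:
        refreshed_at = _cache["refreshed_at"]
        fresh = (
            refreshed_at is not None
            and time.monotonic() - refreshed_at < MAX_AGE_SECONDS
        )
        if not fresh:
            # Only the first caller gets here; the others wait on the lock.
            _cache["value"] = expensive_calculation()
            _cache["refreshed_at"] = time.monotonic()
        return _cache["value"]


threads = [threading.Thread(target=get_facts) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(calculation_runs)
```

Callers that arrive while the calculation is running block on the lock, then find the data fresh and return without recalculating. With one lock per argument combination, unrelated combinations would not wait on each other.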

Rails/SQL How do I get different workers to work on different records?

I have (for argument sake) 1000 records and 10 Heroku workers running.
I want to have each worker work on a different set of records..
What I've got right now is quite good, but not quite complete.
sql = 'update products set status = 2 where id in
(select id from products where status = 1 limit (100) ) returning *'
records = connection.execute(sql)
This works rather well. I get 100 records and, at the same time, I make sure my other workers don't get the same 100.
If I throw it in a while loop then even if I have 20000 records and 2 workers, eventually they will all get processed.
My issue is that if there's a crash or exception, the 100 records look like they're being processed by another worker, but they aren't.
I can't use transaction, because the other selects will pick up the same records.
My question
What strategies do others use to have many workers working on the same dataset, but different records.
I know this is a conversational question... I'd put it as community wiki, but I don't see that ability any more.
Building a task queue in a RDBMS is annoyingly hard. I recommend using a queueing system that's designed for the job instead.
Check out PGQ, Celery, etc.
I have used queue_classic by Heroku to schedule jobs stored in a Postgres database.
If I were to do this it would be something other than a db-side queue. It sounds like standard client processing, but what you really want is parallel processing of the result set.
The simplest solution might be to do what you are doing but lock them on the client side, and divide them between workers there (spinlocks etc). You can then commit the transaction and re-run after these have finished processing.
The difficulty is that if you have records you are processing for things that are supposed to happen outside the server, and there is a crash, you never really know what records were processed. It is safer to rollback probably, but just keep that in mind.

Efficiently detecting concurrent insertions using standard SQL

The Requirements
I have the following table (pseudo DDL):
CREATE TABLE MESSAGE (
MESSAGE_GUID GUID PRIMARY KEY,
INSERT_TIME DATETIME
)
CREATE INDEX MESSAGE_IE1 ON MESSAGE (INSERT_TIME);
Several clients concurrently insert rows in that table, possibly many times per second. I need to design a "Monitor" application that will:
1. Initially, fetch all the rows currently in the table.
2. After that, periodically check if there are any new rows inserted and then fetch these rows only.
There may be multiple Monitors concurrently running. All the Monitors need to see all the rows (i.e. when a row is inserted, it must be "detected" by all the currently running Monitors).
This application will be developed for Oracle initially, but we need to keep it portable to every major RDBMS and would like to avoid as much database-specific stuff as possible.
The Problem
The naive solution would be to simply find the maximal INSERT_TIME in rows selected in step 1 and then...
SELECT * FROM MESSAGE WHERE INSERT_TIME >= :max_insert_time_from_previous_select
...in step 2.
However, I'm worried this might lead to race conditions. Consider the following scenario:
1. Transaction A inserts a new row but does not yet commit.
2. Transaction B inserts a new row and commits.
3. The Monitor selects rows and sees that the maximal INSERT_TIME is the one inserted by B.
4. Transaction A commits. At this point, A's INSERT_TIME is actually earlier than B's (A's INSERT was actually executed before B's, before we even knew who was going to commit first).
5. The Monitor selects rows newer than B's INSERT_TIME (as a consequence of step 3). Since A's INSERT_TIME is earlier than B's insert time, A's row is skipped.
So, the row inserted by transaction A is never fetched.
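The scenario can be replayed with a small Python simulation. Everything here is illustrative: times are plain integers, the helper names visible and poll are hypothetical, and a row becomes visible to the Monitor only once its commit time has passed.

```python
# Each row is (guid, insert_time, commit_time).
rows = [
    ("A", 1, 4),  # inserted first, committed last
    ("B", 2, 3),  # inserted second, committed first
]


def visible(rows, now):
    """Rows whose transaction has committed by time `now`."""
    return [r for r in rows if r[2] <= now]


def poll(rows, now, last_seen):
    """Naive monitor query: INSERT_TIME >= max INSERT_TIME of the previous select."""
    return [r for r in visible(rows, now) if r[1] >= last_seen]


seen = set()
last_seen = 0
for now in (3, 5):  # first poll after B commits, second poll after A commits
    batch = poll(rows, now, last_seen)
    seen.update(guid for guid, _, _ in batch)
    if batch:
        last_seen = max(insert_time for _, insert_time, _ in batch)

print(sorted(seen))
```

After both polls only B has been seen. A committed with an INSERT_TIME below the high-water mark, so no later poll using this predicate will ever return it.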
Any ideas how to design the client SQL or even change the database schema (as long as it is mildly portable), so these kinds of concurrency problems are avoided, while still keeping a decent performance?
Thanks.
Without using any of the platform-specific change data capture (CDC) technologies, there are a couple of approaches.
Option 1
Each Monitor registers a sort of subscription to the MESSAGE table. The code that writes messages then writes each MESSAGE once per Monitor, i.e.
CREATE TABLE message_subscription (
message_subscription_id NUMBER PRIMARY KEY,
message_id RAW(32) NOT NULL,
monitor_id NUMBER NOT NULL,
CONSTRAINT uk_message_sub UNIQUE (message_id, monitor_id)
);
INSERT INTO message_subscription
SELECT message_subscription_seq.nextval,
sys_guid(),
monitor_id
FROM monitor_subscribers;
Each Monitor then deletes the message from its subscription once that is processed.
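A runnable approximation of Option 1, using Python's sqlite3 module rather than Oracle so it can be tried locally. The table names follow the answer; the functions publish and drain, the TEXT message ids, and the two sample monitors are assumptions made for this sketch.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monitor_subscribers (monitor_id INTEGER PRIMARY KEY);
    CREATE TABLE message_subscription (
        message_subscription_id INTEGER PRIMARY KEY,
        message_id TEXT NOT NULL,
        monitor_id INTEGER NOT NULL,
        UNIQUE (message_id, monitor_id)
    );
    INSERT INTO monitor_subscribers (monitor_id) VALUES (1), (2);
""")


def publish(conn):
    """Write one subscription row per registered Monitor for a new message."""
    message_id = uuid.uuid4().hex
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO message_subscription (message_id, monitor_id) "
            "SELECT ?, monitor_id FROM monitor_subscribers",
            (message_id,),
        )
    return message_id


def drain(conn, monitor_id):
    """Return this Monitor's pending messages and delete them once read."""
    with conn:
        ids = [row[0] for row in conn.execute(
            "SELECT message_id FROM message_subscription "
            "WHERE monitor_id = ? ORDER BY message_subscription_id",
            (monitor_id,),
        )]
        conn.executemany(
            "DELETE FROM message_subscription "
            "WHERE message_id = ? AND monitor_id = ?",
            [(message_id, monitor_id) for message_id in ids],
        )
    return ids


first = publish(conn)
second = publish(conn)
monitor1_batch = drain(conn, 1)
monitor1_again = drain(conn, 1)
monitor2_batch = drain(conn, 2)
```

Because a Monitor only ever sees committed subscription rows and removes exactly the ones it read, commit order no longer matters: a late-committing message simply shows up in a later drain.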
Option 2
Each Monitor maintains a cache of the recent messages it has processed that is at least as long as the longest-running transaction could be. If the Monitor maintained a cache of the messages it has processed for the last 5 minutes, for example, it would query your MESSAGE table for all messages later than its LAST_MONITOR_TIME. The Monitor would then be responsible for noting that some of the rows it had selected had already been processed. The Monitor would only process MESSAGE_ID values that were not in its cache.
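A minimal Python sketch of Option 2, under the assumption that the Monitor re-queries a window reaching back `overlap` time units before its last monitor time and uses the cache to discard rows it has already processed. The class name Monitor, the integer timestamps, and the in-memory list standing in for the MESSAGE table are all hypothetical.

```python
class Monitor:
    def __init__(self, overlap):
        # overlap must be at least as long as the longest-running transaction
        self.overlap = overlap
        self.last_monitor_time = None
        self.processed = {}  # message guid -> insert_time (the recent cache)

    def poll(self, committed_rows):
        """committed_rows: list of (guid, insert_time) currently visible."""
        if self.last_monitor_time is None:
            cutoff = float("-inf")  # initial fetch: everything
        else:
            cutoff = self.last_monitor_time - self.overlap
        candidates = [(g, t) for g, t in committed_rows if t >= cutoff]
        # Skip rows that were already handled in an earlier poll.
        new = [(g, t) for g, t in candidates if g not in self.processed]
        for g, t in new:
            self.processed[g] = t
        if candidates:
            newest = max(t for _, t in candidates)
            if self.last_monitor_time is None or newest > self.last_monitor_time:
                self.last_monitor_time = newest
        if self.last_monitor_time is not None:
            # Entries older than the window can never be selected again.
            keep_from = self.last_monitor_time - self.overlap
            self.processed = {
                g: t for g, t in self.processed.items() if t >= keep_from
            }
        return [g for g, _ in new]


monitor = Monitor(overlap=5)
first = monitor.poll([("B", 2)])             # only B has committed so far
second = monitor.poll([("A", 1), ("B", 2)])  # A commits late, earlier time
third = monitor.poll([("A", 1), ("B", 2)])
print(first, second, third)
```

The late-committing row A is picked up on the second poll because it still falls inside the overlap window, and B is not processed twice because it is in the cache.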
Option 3
Just like Option 1, you set up subscriptions for each Monitor but you use some queuing technology to deliver the messages to the Monitor. This is less portable than the other two options but most databases can deliver messages to applications via queues of some sort (i.e. JMS queues if your Monitor is a Java application). This saves you from reinventing the wheel by building your own queue table and gives you a standard interface in the application tier to code against.
You need to be able to identify all rows added since the last time you checked (i.e. the monitor checks). You have a "time of insert" column. However, as you spell it out, that time of insert column cannot be used with "greater than [last check]" logic to reliably identify subsequently inserted new items. Commits do not occur in the same order as (initial) inserts. I am not aware of anything that works on all major RDBMSs that would clearly and safely apply such an "as of" tag at the actual time of commit. [This is not to say I would know it if such a thing existed, but it seems a pretty safe guess to me.] Thus, you will have to use something other than a "greater than last check" algorithm.
That leads to filtering. Upon insert, an item (row) is flagged as "not yet checked"; when a monitor logs in, it reads all not yet checked items, returns that set, and flips the flag to "checked" (and if there are multiple monitors, this must all be done within its own transaction). The monitors' queries will have to read all the data and pick out which have not yet been checked. The implication is, however, that this will be a fairly small set of data, at least relative to the entire set of data. From here, I see two likely options:
Add a column, perhaps "Checked". Store a binary 1/0 value for checked/not checked. The distribution of this value will be extremely skewed (roughly 99.9% Checked and only a tiny fraction Unchecked), so finding the unchecked rows through an index should be rather efficient. (Some RDBMSs provide filtered indexes, such that the Checked rows won't even be in the index; once flipped to checked, a row will presumably never be flipped back, so the overhead to support this shouldn't be too great.)
Add a separate table to identify those rows in the "primary" table that have not yet been checked. When a monitor logs in, it reads and deletes the items from that table. This doesn't seem efficient... but again, if the data set involved is small, the overall performance pain might be acceptable.
You should use Oracle AQ with a multi-subscriber queue.
This is Oracle specific, but you can create an abstraction layer of stored procedures (or abstract in Java if you like) so that you have a common API to enqueue the new messages and have each subscriber (monitor) dequeue any pending messages. Behind that API, for Oracle you use AQ.
I am not sure if there is a queuing solution for other databases.
I don't think you will be able to come up with a totally database agnostic approach that meets your requirements. You could extend the example above that included the 'checked' column, to have a second table called monitor_checked - that would contain one row per message per monitor. That is basically what AQ does behind the scenes, so it is sort of reinventing the wheel.
With PostgreSQL, use PgQ. It has all those little details worked out for you.
I doubt you will find a robust and manageable database-agnostic solution for this.

Global Temporary Tables - locking rows + Concurrency question

I have a list of 100 entries that I want to process with multiple threads. Each thread will take up to 20 entries to process.
I'm currently using global temp tables to store the entries that meet certain criteria -- I also do not want threads to overlap entries to process.
How do I do this (preventing the overlap)?
Thanks!
If on 11g, I'd use the SELECT ... FOR UPDATE SKIP LOCKED.
If on a previous version, I'd use Advanced Queuing to populate a queue with the primary key values of the entries to be processed, and have your threads dequeue those keys to process those records. Because the dequeue can be (but doesn't have to be, if memory serves) within the processing transactional scope, the dequeue commits or rolls back with the processing, and no two threads can get the same records to process.
There are two issues here, so let's handle them separately:
How do you split some work among several threads/sessions?
You could use Advanced Queuing or the SKIP LOCKED feature as suggested by Adam.
You could also use a column that contains processing information, for example a STATE column that is empty when not processed. Each thread would start work on a row with:
UPDATE your_table
SET state='P'
WHERE STATE IS NULL
AND rownum = 1
RETURNING id INTO :id;
At this point the thread would commit to prevent other threads from being blocked. Then you would do your processing and select another row when you're done.
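The claim-then-commit step can be tried locally with Python's sqlite3 module. This is a sketch under assumptions, not the Oracle statement above: SQLite has no rownum, so the row is picked with a SELECT ... LIMIT 1 and then claimed with a guarded UPDATE. The function name claim_one and the three sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE your_table (id INTEGER PRIMARY KEY, state TEXT);
    INSERT INTO your_table (id) VALUES (1), (2), (3);
""")


def claim_one(conn):
    """Mark one unprocessed row as 'P' and return its id, or None."""
    with conn:  # commit immediately so other workers are not blocked
        row = conn.execute(
            "SELECT id FROM your_table WHERE state IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing left to process
        # The state IS NULL guard means a row claimed by someone else in
        # the meantime is not claimed twice; rowcount tells us who won.
        cur = conn.execute(
            "UPDATE your_table SET state = 'P' WHERE id = ? AND state IS NULL",
            (row[0],),
        )
        return row[0] if cur.rowcount == 1 else None


claimed = [claim_one(conn) for _ in range(4)]
print(claimed)
```

With several real workers, a None caused by losing the race would call for a retry rather than stopping; this single-connection sketch does not distinguish the two cases.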
Alternatively, you could also split the work beforehand and assign each process with a range of ids that need to be processed.
How will temporary tables behave with multiple threads?
Most likely each thread will have its own Oracle session (else you couldn't run queries in parallel). This means that each thread will have its own virtual copy of the temporary table. If you stored data in this table beforehand, the threads will not be able to see it (the temp table will always be empty at the beginning of a session).
You will need regular tables if you want to store data accessible to multiple sessions. Temporary tables are fine for storing data that is private to a single session, for example intermediate data in a complex process.
Easiest will be to use DBMS_SCHEDULER to schedule a job for every row that you want to process. You have to pass a key to a permanent table to identify the row that you want to process, or put the full row in the arguments for the job, since a temporary table's content is not visible in different sessions. The number of concurrent jobs is controlled by the resource manager, mostly limited by the number of CPUs.
Why would you want to process row by row anyway? Set operations are in most occasions a lot faster.

SQL SERVER Procedure Inconsistent Performance

I am working on a SQL Job which involves 5 procs, a few while loops and a lot of Inserts and Updates.
This job processes around 75000 records.
Now, the job works fine for 10000/20000 records with speed of around 500/min. After around 20000 records, execution just dies. It loads around 3000 records every 30 mins and stays at same speed.
I was suspecting network, but don't know for sure. These kind of queries are difficult to analyze through SQL Performance Monitor. Not very sure where to start.
Also, there is a single cursor in one of the procs, which executes for very few records.
Any suggestions on how to speed this process up on the full-size data set?
I would check if your updates are within a transaction. If they are, it could explain why it dies after a certain amount of "modified" data. You might check how large your "tempdb" gets as an indicator.
Also I have seen cases when during long-running transactions the database would die when there are other "usages" at the same time, again because of transactionality and improper isolation levels used.
If you can split your job into independent non-overlapping chunks, you might want to do it: like doing the job in chunks by dates, ID ranges of "root" objects etc.
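As one way to picture chunking by ID ranges, here is a small Python helper that produces inclusive, non-overlapping ranges. The name id_ranges and the chunk size of 20000 are arbitrary choices for illustration; each range would then be processed and committed as its own transaction.

```python
def id_ranges(min_id, max_id, chunk_size):
    """Yield inclusive (low, high) ranges that cover min_id..max_id once."""
    low = min_id
    while low <= max_id:
        high = min(low + chunk_size - 1, max_id)
        yield low, high
        low = high + 1


# 75000 records, as in the question, split into chunks of 20000.
ranges = list(id_ranges(1, 75000, 20000))
print(ranges)
```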
I suspect your whole process is flawed. I import a datafile that contains 20,000,000 records and hits many more tables and does some very complex processing in less time than you are describing for 75000 records. Remember looping is every bit as bad as using cursors.
I think if you set this up as an SSIS package you might be surprised to find the whole thing can run in just a few minutes.
With your current set-up consider if you are running out of room in the temp database or maybe it is trying to grow and can't grow fast enough. Also consider if at the time the slowdown starts, is there some other job running that might be causing blocking? Also get rid of the loops and process things in a set-based manner.
Okay...so here's what I am doing in steps:
Loading a file in a TEMP table, just an intermediary.
Do some validations on all records using SET-Based transactions.
Actual Processing Starts NOW.
TRANSACTION BEGIN HERE......
LOOP STARTS HERE
a. Pick Records based in TEMP tables PK (say customer A).
b. Retrieve data from existing tables (e.g. employer information)
c. Validate information received/retrieved.
d. Check if record already exists - UPDATE. else INSERT. (THIS HAPPENS IN SEPARATE PROCEDURE)
e. Find ALL Customer A family members (PROCESS ALL IN ANOTHER **LOOP** - SEPARATE PROC)
f. Update status for Customer A and his family members.
LOOP ENDS HERE
TRANSACTION ENDS HERE
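Step d (UPDATE if the record exists, else INSERT) is the kind of per-row work that can usually be done for all rows at once. A minimal sketch using Python's sqlite3 module, with hypothetical customer and staging tables standing in for the real target and TEMP tables; the actual job is SQL Server, where the same two statements (or a MERGE) would be written in T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE staging  (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer VALUES (1, 'old A'), (2, 'old B');
    INSERT INTO staging  VALUES (2, 'new B'), (3, 'new C');
""")

with conn:  # one transaction, two set-based statements, no loop
    # Update every customer that has a matching staging row.
    conn.execute("""
        UPDATE customer
        SET name = (SELECT s.name FROM staging s WHERE s.id = customer.id)
        WHERE EXISTS (SELECT 1 FROM staging s WHERE s.id = customer.id)
    """)
    # Insert every staging row that has no matching customer yet.
    conn.execute("""
        INSERT INTO customer (id, name)
        SELECT s.id, s.name
        FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM customer c WHERE c.id = s.id)
    """)

result = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
print(result)
```

The number of statements stays at two no matter how many rows are staged, whereas the loop issues at least one lookup and one write per record.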