How do I get the size of a Service Broker Queue quickly when the table is >500k rows? - service-broker

I'm trying to determine the total size of a Service Broker queue and the transmission queue when the queue is very large. The problem is that traditional queries like the ones below don't work, since a queue isn't a regular table. Any ideas?
EXEC sp_spaceused 'sys.transmission_queue'
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('ConfigMgrDrsQueue') AND indid < 2

Remus explains how to do it on his blog. Basically you need to query the row count of the underlying b-tree.
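A minimal sketch of that idea: the queue's hidden internal table shows up in sys.objects with the queue as its parent, and sys.partitions carries the row count of its clustered index (the queue name here is the one from the question):
SELECT p.rows
FROM sys.objects AS o
JOIN sys.partitions AS p ON p.object_id = o.object_id
JOIN sys.objects AS q ON o.parent_object_id = q.object_id
WHERE p.index_id = 1
  AND q.name = 'ConfigMgrDrsQueue';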

Related

Using database transactions to ensure only one process picks up a job from jobs table

I want to ensure that only one worker node (out of a cluster of nodes) picks up any given job from a Jobs table and processes it. I'm using the following database transaction on my Jobs table to achieve this:
BEGIN TRANSACTION;
SELECT id FROM jobs
WHERE status = 'NEW'
ORDER BY created_at
LIMIT 1;
-- rollback if id is null
UPDATE jobs
SET status = 'IN_PROCESS'
WHERE id = 'id';
COMMIT;
Will the above TRANSACTION ensure that only one node will pick up any given job? Is it possible that two nodes will read the SELECT statement simultaneously as NEW, and then both will run the UPDATE statement (after the first one releases the row lock) and start processing the same job?
In other words, will the TRANSACTION provide a lock for the SELECT statement as well, or only for the UPDATE statement?
No, a transaction alone won't help you here unless you raise the isolation level to SERIALIZABLE, and that is an option of last resort that I would avoid if possible.
The possibilities I see are:
Pessimistic Locks. Add FOR UPDATE to the SELECT. These limit performance.
Optimistic Locks. Seems like the best fit for your needs.
Use a predefined queue manager linked to your table.
Implement a queue, or a queue process.
Seems to me that #2 is the best fit here. If you go that route you'll need to add an extra column to the table, version, and change your queries to:
SELECT id, version FROM jobs
WHERE status = 'NEW'
ORDER BY created_at
LIMIT 1;
UPDATE jobs
SET status = 'IN_PROCESS', version = version + 1
WHERE id = 'id_retrieved_before' and version = 'version_retrieved_before';
The update above returns the number of updated rows. If the count is 1 then this thread got the row. If it's 0, another competing thread got the row first and you'll need to retry the whole strategy.
As you can see:
No transactions are necessary.
All engines return the number of updated rows, so this strategy will work in virtually any database.
No explicit locks are necessary, which is great for performance.
The downside is that if another thread got the row first, the logic has to start again from the beginning. That's no big deal, but it doesn't guarantee optimal response time when there are many competing threads.
Finally, if your database is PostgreSQL, you can pack both SQL statements into a single one. Isn't PostgreSQL awesome?
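For example, a sketch of that single-statement form using UPDATE ... RETURNING (table and column names are the ones from the question; the repeated status = 'NEW' check is what stops two workers claiming the same row under READ COMMITTED):
UPDATE jobs
SET status = 'IN_PROCESS', version = version + 1
WHERE status = 'NEW'
  AND id = (SELECT id FROM jobs
            WHERE status = 'NEW'
            ORDER BY created_at
            LIMIT 1)
RETURNING id, version;
If no row comes back, another worker won the race and you simply retry.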

Breaking down a large number of rows into smaller queries? Parallelism

I want to create an external application which will query one table in a large Oracle database.
The query will run daily and I am expecting to handle 30,000+ rows.
To break this workload down, I would like to create a new thread/process for every 10,000 rows. Going by the figure above, that would be 3 threads to process all those rows.
I don't want each thread's row set to overlap the others', so I know I will need to add a column to the table to act as a range marker, a row_position.
Logic
Get row_count of data set in query parameters
Get first_row_pos
While (row_count > 10,000)
{
Create thread with 10,000 rows starting from first_row_pos
row_count = row_count - 10,000
first_row_pos = first_row_pos + 10,000
}
create thread for remaining rows
all threads run their queries concurrently.
This is basic logic at the moment, however I do not know how feasible this is.
Is this a good way or is there a better way?
Can this be done through one database connection shared by all the threads, or is it better to have a separate DB connection for each thread?
Any other advice is welcome.
Note: I just realised a do-while loop would be better if there are fewer than 10,000 rows in this case.
Thanks
Oracle provides a parallel hint for situations such as this, where you have a full table scan or similar problem and want to make use of multiple cores to divide the workload. Further details here.
The syntax is very simple, you specify the table (or alias) and the number of cores (I usually leave as default) e.g.:
select /*+ parallel(a, default) */ *
from table_a a
You can also use this with multiple tables e.g.
select /*+ parallel(a, default) parallel(b,default) */ *
from table_a a, table_b b
where a.some_id = b.some_id
A database connection is not thread-safe, so if you are going to query the database from several threads, you will need a separate connection for each of them. You can either create the connections yourself or get them from a pool.
Before you implement your approach, take some time to analyze where the time is spent. Oracle is overall pretty good at utilizing multiple cores, and the database interaction is usually the most time-consuming part. By splitting the query in three you might actually slow things down.
If indeed your application is spending most of the time performing calculations on that data, your best approach might be loading all data in a single thread and then splitting processing into multiple threads.

Rails/SQL How do I get different workers to work on different records?

I have (for argument's sake) 1000 records and 10 Heroku workers running.
I want to have each worker work on a different set of records.
What I've got right now is quite good, but not quite complete.
sql = 'update products set status = 2 where id in
(select id from products where status = 1 limit 100) returning *'
records = connection.execute(sql)
This works rather well. I get 100 records, and at the same time I make sure my other workers don't get the same 100.
If I throw it in a while loop then even if I have 20000 records and 2 workers, eventually they will all get processed.
My issue is that if there's a crash or exception, those 100 records look like they're being processed by another worker, but they aren't.
I can't use a transaction, because the other selects will pick up the same records.
My question
What strategies do others use to have many workers working on the same dataset, but on different records?
I know this is a conversational question... I'd put it as community wiki, but I don't see that ability any more.
Building a task queue in a RDBMS is annoyingly hard. I recommend using a queueing system that's designed for the job instead.
Check out PGQ, Celery, etc.
I have used queue_classic by Heroku to schedule jobs stored in a Postgres database.
If I were to do this, it would be with something other than a db-side queue. It sounds like standard client processing, but what you really want is parallel processing of the result set.
The simplest solution might be to do what you are doing but lock them on the client side, and divide them between workers there (spinlocks etc). You can then commit the transaction and re-run after these have finished processing.
The difficulty is that if you have records you are processing for things that are supposed to happen outside the server, and there is a crash, you never really know which records were processed. It is probably safer to roll back, but just keep that in mind.

Using SQL Server as a DB queue with multiple clients

Given a table that is acting as a queue, how can I best configure the table/queries so that multiple clients process from the queue concurrently?
For example, the table below indicates a command that a worker must process. When the worker is done, it will set the processed value to true.
| ID | COMMAND | PROCESSED |
| 1 | ... | true |
| 2 | ... | false |
| 3 | ... | false |
The clients might obtain one command to work on like so:
select top 1 COMMAND
from EXAMPLE_TABLE
with (UPDLOCK, ROWLOCK)
where PROCESSED=false;
However, if there are multiple workers, each tries to get the row with ID=2. Only the first will get the pessimistic lock, the rest will wait. Then one of them will get row 3, etc.
What query/configuration would allow each worker client to get a different row each and work on them concurrently?
EDIT:
Several answers suggest variations on using the table itself to record an in-process state. I thought that this would not be possible within a single transaction. (i.e., what's the point of updating the state if no other worker will see it until the txn is committed?) Perhaps the suggestion is:
# start transaction
update to 'processing'
# end transaction
# start transaction
process the command
update to 'processed'
# end transaction
Is this the way people usually approach this problem? It seems to me that the problem would be better handled by the DB, if possible.
I recommend you go over Using tables as Queues.
Properly implemented queues can handle thousands of concurrent users and service as many as half a million enqueue/dequeue operations per minute. Before SQL Server 2005 the solution was cumbersome and involved mixing a SELECT and an UPDATE in a single transaction with just the right mix of lock hints, as in the article linked by gbn. Luckily, since SQL Server 2005 and the advent of the OUTPUT clause, a much more elegant solution is available, and MSDN now recommends using the OUTPUT clause:
You can use OUTPUT in applications that use tables as queues, or to hold intermediate result sets. That is, the application is constantly adding or removing rows from the table.
Basically there are 3 parts of the puzzle you need to get right in order for this to work in a highly concurrent manner:
You need to dequeue atomically. You have to find the row, skip any locked rows, and mark it as 'dequeued' in a single, atomic operation, and this is where the OUTPUT clause comes into play:
with CTE as (
SELECT TOP(1) COMMAND, PROCESSED
FROM EXAMPLE_TABLE WITH (READPAST)
WHERE PROCESSED = 0)
UPDATE CTE
SET PROCESSED = 1
OUTPUT INSERTED.*;
You must structure your table with the leftmost clustered index key on the PROCESSED column. If ID is used as a primary key, make it the second column in the clustered key. Whether to keep a non-clustered index on the ID column is debatable, but I strongly favor not having any secondary non-clustered indexes on queues:
CREATE CLUSTERED INDEX cdxTable ON EXAMPLE_TABLE (PROCESSED, ID);
You must not query this table by any other means but by Dequeue. Trying to do Peek operations or trying to use the table both as a Queue and as a store will very likely lead to deadlocks and will slow down throughput dramatically.
The combination of an atomic dequeue, the READPAST hint when searching for elements to dequeue, and a leftmost clustered-index key on the processing bit ensures very high throughput under a highly concurrent load.
My answer here shows you how to use tables as queues... SQL Server Process Queue Race Condition
You basically need "ROWLOCK, READPAST, UPDLOCK" hints
If you want to serialize your operations for multiple clients, you can simply use application locks.
BEGIN TRANSACTION
EXEC sp_getapplock @Resource = 'app_token', @LockMode = 'Exclusive'
-- perform operation
EXEC sp_releaseapplock @Resource = 'app_token'
COMMIT TRANSACTION
Rather than using a boolean value for Processed you could use an int to define the state of the command:
1 = not processed
2 = in progress
3 = complete
Each worker would then get the next row with Processed = 1, update Processed to 2 and then begin work. When the work is complete, Processed is updated to 3. This approach also allows for additional Processed outcomes; for example, rather than just recording that a command is complete, you may add separate statuses for 'Completed Successfully' and 'Completed with Errors'.
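A sketch of that claim step in T-SQL (the table and ID/COMMAND columns are borrowed from the question's example, with Processed converted to the int states above):
UPDATE TOP (1) EXAMPLE_TABLE
SET PROCESSED = 2                    -- 2 = in progress
OUTPUT INSERTED.ID, INSERTED.COMMAND
WHERE PROCESSED = 1;                 -- 1 = not processed
-- when the work finishes: UPDATE EXAMPLE_TABLE SET PROCESSED = 3 WHERE ID = @id   -- 3 = complete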
Probably the better option is to use a tri-state Processed column along with a version/timestamp column. The three values in the Processed column then indicate whether the row is unprocessed, under processing or processed.
For example
CREATE TABLE Queue (ID INT NOT NULL PRIMARY KEY,
Command NVARCHAR(100),
Processed INT NOT NULL CHECK (Processed IN (0, 1, 2)),
Version TIMESTAMP)
You grab the top 1 unprocessed row, set its status to 'under processing', and set the status to 'processed' when things are done. Base your UPDATE on the Version and primary key columns; if the update affects no rows, someone has already been there.
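A sketch of that claim cycle against the table above (the mapping 0 = unprocessed, 1 = under processing, 2 = processed is an assumption; Version is the rowversion/timestamp column):
DECLARE @id INT, @version BINARY(8);
SELECT TOP (1) @id = ID, @version = Version
FROM Queue
WHERE Processed = 0;
UPDATE Queue
SET Processed = 1
WHERE ID = @id AND Version = @version;
-- if @@ROWCOUNT is 0 here, another worker claimed the row first; pick another one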
You might want to add a client identifier as well, so that if the client dies while processing a row, it can restart, look at its last row and continue from where it was.
I would stay away from messing with locks in a table. Just create two extra columns like IsProcessing (bit/boolean) and ProcessingStarted (datetime). When a worker crashes, or doesn't update its row within a timeout, you can have another worker try to process the data.
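A sketch of that reclaim-on-timeout idea (IsProcessing and ProcessingStarted are the columns suggested above, added to the question's EXAMPLE_TABLE with PROCESSED treated as a 0/1 bit; the 10-minute timeout is an arbitrary assumption):
UPDATE TOP (1) EXAMPLE_TABLE
SET IsProcessing = 1, ProcessingStarted = GETUTCDATE()
OUTPUT INSERTED.ID, INSERTED.COMMAND
WHERE PROCESSED = 0
  AND (IsProcessing = 0
       OR ProcessingStarted < DATEADD(MINUTE, -10, GETUTCDATE()));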
One way is to mark the row with a single update statement. If you read the status in the where clause and change it in the set clause, no other process can come in between, because the row will be locked. For example:
declare @pickup_id int
set @pickup_id = -1
set rowcount 1
update YourTable
set status = 'picked up'
, @pickup_id = id
where status = 'new'
set rowcount 0
return @pickup_id
This uses SET ROWCOUNT to update one row at most. If no row was found, @pickup_id will remain -1.

Getting a Chunk of Work

Recently I had to deal with a problem that I imagined would be pretty common: given a database table with a large (million+) number of rows to be processed, and various processors running on various machines/threads, how do you safely allow each processor instance to get a chunk of work (say 100 items) without interfering with one another?
The reason I am getting a chunk at a time is for performance reasons - I don't want to go to the database for each item.
There are a few approaches - you could associate each processor with a token, and have a SPROC that sets that token against the next [n] available items; perhaps something like:
(note - needs suitable isolation-level; perhaps serializable: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE)
(edited to fix TSQL)
UPDATE TOP (1000) WORK
SET [Owner] = @processor, Expiry = @expiry
OUTPUT INSERTED.Id -- etc
WHERE [Owner] IS NULL
You'd also want a timeout (@expiry) on this, so that when a processor goes down you don't lose work. You'd also need a task to clear the owner on things that are past their Expiry.
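A sketch of that clean-up task (Owner and Expiry are the columns from the statement above; comparing Expiry against GETUTCDATE() is an assumption about how it is stored):
UPDATE WORK
SET [Owner] = NULL
WHERE [Owner] IS NOT NULL
  AND Expiry < GETUTCDATE();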
You can have a special table to queue work up, where the consumers delete (or mark) work as being handled, or use a middleware queuing solution, like MSMQ or ActiveMQ.
Middleware comes with its own set of problems, so if possible I'd stick with a special table (keep it as small as possible, hopefully just with an id, so the workers can fetch the rest of the information from the main database themselves and not lock the queue table up for too long).
You'd fill this table up at regular intervals and let processors grab what they need from the top.
Related questions on SQL table queues:
Queue using table
Working out the SQL to query a priority queue table
Related questions on queuing middleware:
Building a high performance and automatically backupped queue
Messaging platform
You didn't say which database server you're using, but there are a couple of options.
MySQL includes an extension to SQL99's UPDATE that limits the number of rows updated. You can assign each worker a unique token, update a number of rows, then query to get that worker's batch. Marc used the UPDATE TOP syntax, but didn't specify the database server.
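A sketch of that token approach in MySQL (work_items, worker_token and the batch size are illustrative names, not from the question):
UPDATE work_items
SET worker_token = 'worker-42'
WHERE worker_token IS NULL
ORDER BY id
LIMIT 100;
SELECT id FROM work_items WHERE worker_token = 'worker-42';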
Another option is to designate a table used for locking. Don't use the same table with the data, since you don't want to lock it for reading. Your lock table likely only needs a single row, with the next ID needing work. A worker locks the table, gets the current ID, increments it by whatever your batch size is, updates the table, then releases the lock. Then it can go query the data table and pull the rows it reserved. This option assumes the data table has a monotonically increasing ID, and isn't very fault-tolerant if a worker dies or otherwise can't finish a batch.
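A sketch of that lock-table idea (batch_cursor and next_id are illustrative names; the batch size of 100 matches the chunk size in the question):
START TRANSACTION;
SELECT next_id FROM batch_cursor FOR UPDATE;   -- lock the single bookkeeping row and read the current position
UPDATE batch_cursor SET next_id = next_id + 100;
COMMIT;
-- this worker now owns IDs [next_id, next_id + 100) in the data table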
Quite similar to this question: SQL Server Process Queue Race Condition
You run a query to assign 100 rows to a given ProcessorID. If you use these locking hints then it's "safe" in the concurrency sense. And it's a single SQL statement, with no SET statements needed.
This is taken from the other question:
UPDATE TOP (100)
foo
SET
ProcessorID = @PROCID
FROM
OrderTable foo WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE
ProcessorID = 0 --Or whatever unassigned is