Rails/SQL: How do I get different workers to work on different records?

I have (for argument's sake) 1000 records and 10 Heroku workers running.
I want to have each worker work on a different set of records.
What I've got right now is quite good, but not quite complete.
sql = 'UPDATE products SET status = 2 WHERE id IN
(SELECT id FROM products WHERE status = 1 LIMIT 100) RETURNING *'
records = connection.execute(sql)
This works rather well. I get 100 records, and at the same time I make sure my other workers don't get the same 100.
If I throw it in a while loop then even if I have 20000 records and 2 workers, eventually they will all get processed.
My issue is that if there's a crash or exception, the 100 records look like they're being processed by another worker, but they aren't.
I can't use a transaction, because the other SELECTs will pick up the same records.
My question
What strategies do others use to have many workers working on the same dataset, but on different records?
I know this is a conversational question... I'd put it as community wiki, but I don't see that ability any more.

Building a task queue in an RDBMS is annoyingly hard. I recommend using a queueing system that's designed for the job instead.
Check out PGQ, Celery, etc.

I have used queue_classic by Heroku to schedule jobs stored in a Postgres database.

If I were to do this, it would be something other than a db-side queue. It sounds like standard client processing, but what you really want is parallel processing of the result set.
The simplest solution might be to do what you are doing, but lock the records on the client side and divide them between workers there (spinlocks, etc.). You can then commit the transaction and re-run after they have finished processing.
The difficulty is that if you are processing records for things that are supposed to happen outside the server and there is a crash, you never really know which records were processed. It is probably safer to roll back, but keep that in mind.
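As a rough sketch of how that claiming step could look in Postgres (SKIP LOCKED requires PostgreSQL 9.5 or later, and the comment placeholders are just that, placeholders):
-- claim a batch and hold the row locks until processing finishes
BEGIN;
SELECT id FROM products
WHERE status = 1
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED;   -- other workers skip rows this worker has locked instead of waiting
-- ... process the claimed rows in the worker ...
UPDATE products SET status = 2 WHERE id IN (/* the claimed ids */);
COMMIT;   -- a crash before COMMIT rolls back, so nothing is left half-claimed
Since the status only changes at COMMIT, a worker that dies mid-batch leaves its rows claimable again, which addresses the crash concern in the question.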

Related

How can data be synchronized between processes in SQL?

I'm wondering about something perhaps extremely stupid, but I can't seem to find an answer (which is usually not a good sign).
Assume we have a SQL server (MySQL, PostgreSQL; this question even applies to SQLite3, though there's no server) and several clients connected to it. I've seen countless queries that, in my opinion, might be hard to synchronize.
So let's assume we have a table (usage statistics, say) with a row per day.
statistic (
day,
num_requests
)
(I avoid mentioning data types, since it's not the point, but the number of requests should be a number of some sort.)
So when a new web request comes in, the web server will read today's row from this table and increase the number of requests. No biggie, right?
cursor.execute("""
SELECT num_requests FROM statistic
WHERE ...
""")
number = cursor.fetchone()[0]   # read the current value (a plain, non-blocking read)
number += 1                     # increment in application code
cursor.execute("""
UPDATE statistic SET num_requests=?
WHERE ...
""", (number, ))                # write the (possibly stale) value back
But what happens if two requests are handled more or less simultaneously, perhaps by different clients or different processes? Each one reads today's row (just a read operation, non-blocking), gets the current number of requests (this step doesn't involve the server), and then increments it by 1. If both requests are running at roughly the same time, they have each incremented the same number once, and each sends an UPDATE with its result.
In the end, the number of requests for today's statistic has increased by one, although there were two requests. I know there are mechanisms to ensure proper data synchronization, but I fail to see how they address this case. A read is usually non-blocking as far as I know. A write can be blocking, but since the other process's read has already happened, the second write simply overwrites the first with a stale value, and I don't see any way to express that constraint.
In other words, this seems like the point where, in most programming languages, we would lock the row and say "from this point onward, you can neither read nor write it, I'm working on it". The first request would read (taking the lock), increment and write, and then unlock. The second request would have to wait patiently for the lock to be released. I don't see that mechanism in SQL. Is it transparent and not even necessary? If so, how does it work? Or have we lived our entire lives with problems like this?
Thanks!
Let the database do the increment atomically, in a single statement:
cursor.execute("""
UPDATE statistic SET num_requests = num_requests + 1
WHERE ...
""")

"Trigger" on frequent changes with time range?

I need to monitor a particular table in a SQL database and get notified (ideally by email) if there are a large number of changes within any 30-second period (for example). I'm unsure whether this is best done in SQL Server or by some other method.
I've been hunting and can't seem to find anything relevant and thought I'd ask here to get pointed in the right direction.
So logically, something like this:
If the number of UPDATEs or INSERTs exceeds 500 within any 30-second timeframe, send an email to x@xxxxxx.com
Any thoughts on how best to attack this would be greatly appreciated.
I'd split this into separate concerns.
Firstly, how do you detect changes? The easiest way is a "last updated" column; you can then run a SELECT query counting the changes in the last 30 seconds and compare that count to your threshold.
The second question is how you execute that query and send your emails. Your question title mentions triggers - these are specific SQL Server objects. I have a personal dislike of triggers - they are "side effects", and tend to make code hard to understand, and lead to entertaining bugs. In your case, they're probably a bad solution - you want triggers to be incredibly fast, as they hold up your data changes while they execute. Sending an email is probably slower than you want.
The simplest option is probably to use an agent job, scheduled to run at whatever frequency you need. Again, performance may be a challenge here - having a high-load query run every 30 seconds may slow down your database dramatically.
If you are using an audit/log table which keeps track of transactions (INSERT, UPDATE or DELETE) on that particular table, then it should be fairly easy.
Use a SQL Server Agent job to execute two tasks (a rough sketch of both follows below):
Execute a script to count the number of transactions recorded in that audit/log table in the last 30 seconds
If the count exceeds 500, run a PowerShell script or .exe file to send the email
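A rough sketch of both tasks in T-SQL; audit_log, its ChangedAt column and the AlertProfile Database Mail profile are all placeholders (and Database Mail is shown here instead of a PowerShell step):
DECLARE @changes int;

-- task 1: count the changes recorded in the last 30 seconds
SELECT @changes = COUNT(*)
FROM audit_log
WHERE ChangedAt >= DATEADD(SECOND, -30, GETDATE());

-- task 2: notify if the threshold is exceeded
IF @changes > 500
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'AlertProfile',
        @recipients = 'x@xxxxxx.com',
        @subject = 'More than 500 changes in the last 30 seconds';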

Postgres: How to fire multiple queries at the same time?

I have one procedure which updates record values, and I want to run it against all records in a table (over 30k records). Procedure execution time is from 2 up to 10 seconds, because it depends on network load.
Now I'm doing UPDATE table SET field = procedure_name(paramns); but with that number of records it takes up to 40 minutes to process the whole table.
Now I'm using 4 different connections which fork to the background and fire the query with a WHERE clause that partitions the rows by the modulo of their ids (WHERE id_field % 4 = ...), and this works well and cuts the time to populate the table down to ~10 minutes.
But I want to avoid using cron, shell jobs and multiple connections for this. I know that it can be done with libpq, but is there a way to fire a query (4 different non-blocking queries) and not wait until it finishes execution, within a single connection?
Or can anyone point me to some clues on how to write such a function, using Postgres internals or simply in C, and bind it as a stored procedure?
Cheers Darius
I've got a sure answer for this question - IF you will share with us what your ab workout is!!! I'm getting fat by the minute and I need answers myself...
OK I'll answer anyway.
If you are updating one table, on one database server, in 40 minutes 'single threaded' and in 10 minutes with 4 threads, the bottleneck is not the database server; otherwise, it would be bogged down in I/O. If you are executing a bunch of UPDATEs, one call per record, the network round-trip time is killing you.
I'm pretty sure this is the case, and not an I/O bottleneck on the DB or procedure_name(paramns); itself taking a long time. (If the procedure really took 2-10 seconds per record, it would take something like 2,500 minutes to do 30K records.) The reason I am sure is that starting 4 concurrent processes cuts the time to a quarter, so in particular it is not an I/O issue on the DB server.
This might be the one excuse for putting business logic in an SP on the server. Optimization unfortunately means breaking the rules; the consequence is difficult maintenance. But, duh!!
However, the best solution would be to get this set up to use 'bulk update' queries. That might mean you have to take several strange and unintuitive steps such as this:
This will require a lot of modification if multiple users can run it concurrently.
Refactor the system so procedure_name(paramns) can get all the data it needs to process all records via a SELECT statement. You may need to use creative joins. If it's an SP, of course, you are now moving the logic to the client.
Have the program create an XML or other importable flat file with the PK of each record to update and the new field value or values. Write all the updates to this file instead of executing them on the DB.
Have a temp table on the database that matches the layout of this flat file.
Run an import on the database: clear the temp table and import the file.
Do an update on a join of the temp table and the table to be updated, e.g. UPDATE mytbl, mytemp WHERE myPK = mytempPK SET myval = mytempnewval (use the right join syntax for your database, of course; a sketch of these last three steps follows at the end of this answer).
You can try some of these things 'by hand' first before you bother coding, to see if it's worth the speed increase.
If possible, you can still put this all in an SP!
I'm not making any guarantees, especially as I look down at my ever-fattening belly, but, this has the potential to melt your update job down to under a minute.
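To make the flat-file steps concrete, here is a rough sketch in Postgres, reusing the mytbl/mytemp names from the example above; the column layout and file path are made up:
-- temp table matching the flat file layout
CREATE TEMP TABLE mytemp (id int PRIMARY KEY, newval text);

-- bulk-load the prepared file instead of issuing one UPDATE per record
COPY mytemp FROM '/path/to/updates.csv' WITH (FORMAT csv);

-- one set-based update joining the temp table to the real table
UPDATE mytbl
SET myval = mytemp.newval
FROM mytemp
WHERE mytbl.id = mytemp.id;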
It is possible to update multiple rows at once. Below is an example in Postgres:
UPDATE
table_name
SET
column_name = temp.column_name
FROM
(VALUES
(<id1>, <value1>),
(<id2>, <value2>),
(<id3>, <value3>)
) AS temp("id", "column_name")
WHERE
table_name.id = temp.id
PHP has some functions for asynchronous queries:
pg_send_execute()
pg_send_prepare()
pg_send_query()
pg_send_query_params()
No idea about other programming languages; you'll have to dig into the manuals.
I think you can't. A single connection can handle only a single query at a time. This is described in the libpq documentation chapter "Asynchronous Command Processing":
"After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done."

SQL SERVER Procedure Inconsistent Performance

I am working on a SQL Job which involves 5 procs, a few while loops and a lot of Inserts and Updates.
This job processes around 75000 records.
Now, the job works fine for 10000/20000 records at a speed of around 500/min. After around 20000 records, execution just dies: it loads around 3000 records every 30 minutes and stays at that speed.
I was suspecting the network, but don't know for sure. These kinds of queries are difficult to analyze through SQL Performance Monitor. I'm not very sure where to start.
Also, there is a single cursor in one of the procs, which executes for very few records.
Any suggestions on how to speed this process up on the full-size data set?
I would check if your updates are within a transaction. If they are, it could explain why it dies after a certain amount of "modified" data. You might check how large your "tempdb" gets as an indicator.
Also, I have seen cases where, during long-running transactions, the database would die when there is other activity at the same time, again because of transactionality and improper isolation levels.
If you can split your job into independent, non-overlapping chunks, you might want to do that: for example, doing the job in chunks by dates, by ID ranges of "root" objects, etc.
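As a rough sketch of what chunking can look like in T-SQL (the table, column and chunk size are made up):
DECLARE @rows int;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    -- process one chunk per iteration instead of everything in one giant transaction
    UPDATE TOP (5000) dbo.TargetTable
    SET Processed = 1
    WHERE Processed = 0;

    SET @rows = @@ROWCOUNT;

    COMMIT;  -- each chunk commits on its own, keeping locks and log growth small
END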
I suspect your whole process is flawed. I import a data file that contains 20,000,000 records, hits many more tables and does some very complex processing in less time than you are describing for 75,000 records. Remember that looping is every bit as bad as using cursors.
I think if you set this up as an SSIS package you might be surprised to find the whole thing can run in just a few minutes.
With your current set-up, consider whether you are running out of room in the temp database, or whether it is trying to grow and can't grow fast enough. Also consider whether, at the time the slowdown starts, there is some other job running that might be causing blocking. And get rid of the loops and process things in a set-based manner.
Okay...so here's what I am doing in steps:
Loading a file into a TEMP table, just an intermediary.
Do some validations on all records using set-based statements.
Actual Processing Starts NOW.
TRANSACTION BEGIN HERE......
LOOP STARTS HERE
a. Pick records based on the TEMP table's PK (say customer A).
b. Retrieve data from existing tables (e.g. employer information)
c. Validate information received/retrieved.
d. Check if the record already exists: if so UPDATE, else INSERT. (THIS HAPPENS IN A SEPARATE PROCEDURE)
e. Find ALL of Customer A's family members (PROCESS ALL IN ANOTHER **LOOP** - SEPARATE PROC)
f. Update status for Customer A and his family members.
LOOP ENDS HERE
TRANSACTION ENDS HERE

Getting a Chunk of Work

Recently I had to deal with a problem that I imagine is pretty common: given a database table with a large number of rows (a million plus) to be processed, and various processors running on various machines/threads, how do you safely allow each processor instance to grab a chunk of work (say 100 items) without interfering with the others?
The reason I am getting a chunk at a time is for performance reasons - I don't want to go to the database for each item.
There are a few approaches - you could associate each processor with a token and have a sproc that sets that token against the next [n] available items; perhaps something like:
(note - needs suitable isolation-level; perhaps serializable: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE)
UPDATE TOP (1000) WORK
SET [Owner] = @processor, Expiry = @expiry
OUTPUT INSERTED.Id -- etc
WHERE [Owner] IS NULL
You'd also want a timeout (@expiry) on this, so that when a processor goes down you don't lose work. You'd also need a task to clear the owner on things that are past their Expiry.
You can have a special table to queue work up, where the consumers delete (or mark) work as being handled, or use a middleware queuing solution, like MSMQ or ActiveMQ.
Middleware comes with its own set of problems, so, if possible, I'd stick with a special table (keep it as small as possible, hopefully with just an id, so the workers can fetch the rest of the information themselves from the rest of the database and not lock the queue table for too long).
You'd fill this table up at regular intervals and let processors grab what they need from the top.
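For the "consumers delete work" variant, a minimal T-SQL sketch might look like this, assuming a hypothetical WorkQueue table with an ItemId column; READPAST lets concurrent consumers skip rows another consumer has already locked:
-- each consumer atomically claims (deletes) up to 100 items and gets their ids back
DELETE TOP (100)
FROM dbo.WorkQueue WITH (ROWLOCK, READPAST)
OUTPUT DELETED.ItemId;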
Related questions on SQL table queues:
Queue using table
Working out the SQL to query a priority queue table
Related questions on queuing middleware:
Building a high performance and automatically backupped queue
Messaging platform
You didn't say which database server you're using, but there are a couple of options.
MySQL includes an extension to SQL99's UPDATE to limit the number of rows that are updated. You can assign each worker a unique token, update a number of rows, then query to get that worker's batch. Marc used the UPDATE TOP syntax, but didn't specify the database server.
Another option is to designate a table used for locking. Don't use the same table as the data, since you don't want to lock it for reading. Your lock table likely only needs a single row holding the next ID needing work. A worker locks the table, gets the current ID, increments it by whatever your batch size is, updates the table, then releases the lock. Then it can go query the data table and pull the rows it reserved. This option assumes the data table has a monotonically increasing ID, and isn't very fault-tolerant if a worker dies or otherwise can't finish a batch.
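A rough sketch of that lock-table approach in T-SQL; BatchCursor, NextId, DataTable and the batch size are all hypothetical:
DECLARE @start int, @batch int;
SET @batch = 100;

BEGIN TRANSACTION;

-- UPDLOCK/HOLDLOCK serialize workers on the single counter row
SELECT @start = NextId FROM dbo.BatchCursor WITH (UPDLOCK, HOLDLOCK);
UPDATE dbo.BatchCursor SET NextId = NextId + @batch;

COMMIT;  -- this worker now owns ids in the range [@start, @start + @batch)

-- then pull the reserved rows from the data table, e.g.:
-- SELECT * FROM dbo.DataTable WHERE Id >= @start AND Id < @start + @batch;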
Quite similar to this question: SQL Server Process Queue Race Condition
You run a query to assign 100 rows to a given processor ID. If you use these locking hints, then it's "safe" in the concurrency sense. And it's a single SQL statement with no SET statements needed.
This is taken from the other question:
UPDATE TOP (100)
foo
SET
ProcessorID = @PROCID
FROM
OrderTable foo WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE
ProcessorID = 0 --Or whatever unassigned is