How to ensure data consistency across multiple related tables with concurrent queries - sql

My data model is similar to an assignment problem. Let's assume we have a firm that provides suitable workers for requested jobs.
For now, I have such relations:
Customer (id)
Job (id)
Worker (id, available)
Jobs In Progress (customer_id, job_id)
Busy Workers (customer_id, worker_id)
There are many-to-many relationships between Customer and Job and between Customer and Worker. This data is close to real time, so it's highly dynamic.
We want to maintain such queries:
Request a worker for a job.
Return the worker when the job has finished.
These queries require reading, updating, deleting and inserting data in several tables.
For example, if a customer requests a worker, we have to check whether this customer already exists in the Customer table; whether he already owns a suitable worker in Busy Workers; if not, find a suitable available worker in Worker; and check whether such a job is already registered in Job. In the worst case, we have to atomically insert the customer into Customer, insert the job into Job, insert a corresponding row into Jobs In Progress, decrement Worker.available and insert a row into Busy Workers.
For the second query, we have to do all of this in reverse order: increment Worker.available, delete the customer if he has no remaining jobs, delete the job if no customer needs it anymore, and so on.
So we have a lot of consistency rules: the number of busy workers has to be consistent with Worker.available, a customer has to be present in the table only if he has requested jobs that are not yet finished, and a job has to be present in the table only if at least one customer has requested it.
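For example, here is the worst case of the first query sketched as a single transaction (only a sketch: table names with spaces are written as JobsInProgress and BusyWorkers, parameters are placeholders, and BEGIN/COMMIT syntax varies between RDBMSs):

BEGIN;
INSERT INTO Customer (id) VALUES (:customer_id);
INSERT INTO Job (id) VALUES (:job_id);
INSERT INTO JobsInProgress (customer_id, job_id) VALUES (:customer_id, :job_id);
UPDATE Worker SET available = available - 1 WHERE id = :worker_id;
INSERT INTO BusyWorkers (customer_id, worker_id) VALUES (:customer_id, :worker_id);
COMMIT;  -- all five changes should become visible together, or not at all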
I have read a lot about isolation levels and locking in databases, but I still don't understand how to ensure consistency across multiple tables. It seems like isolation levels don't help, because multiple tables are involved and the data may become inconsistent between selects from two tables. And it seems like locks don't work either, because AFAIK SQL Server can't atomically acquire locks on multiple tables, so the data may become inconsistent between acquiring the locks.
Actually, I'm looking for a solution, or the idea of a solution, in general, without reference to a concrete RDBMS; it should be something applicable one way or another to the most popular RDBMSs like MySQL, PostgreSQL, SQL Server, and Oracle. It doesn't have to be a complete solution with examples for all of these RDBMSs; some practices, tips or references would be enough.
I apologize for my English and thank you in advance.

First: Think about your model a bit more. I would not keep so much redundant information. "Decrement Worker.available and insert a row in Busy Workers" is completely superfluous, because you can get the same information easily by asking the other tables. You might say that is more costly to query; I would call that premature optimization. Redundancy is very costly in itself.
Second: Think of locks as exclusive resources that only one session can hold. The simplest way to ensure consistency would therefore be to let all database users lock just one record in the database using select ... for update; all changes would then be serialized. If you use an MVCC DBMS like PostgreSQL, Oracle or even SQL Server, readers would still always see a consistent state.
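A minimal sketch of that idea, assuming a hypothetical single-row table app_lock that serves only as the mutual-exclusion record:

-- Every writer takes this row lock first, so all changes are serialized.
BEGIN;
SELECT id FROM app_lock WHERE id = 1 FOR UPDATE;  -- blocks until the previous writer commits

-- ... read and modify Customer, Job, Worker, Jobs In Progress and Busy Workers here ...

COMMIT;  -- releases the lock; MVCC readers saw the old, consistent state until this point

On SQL Server, which has no SELECT ... FOR UPDATE, a locking hint such as WITH (UPDLOCK, HOLDLOCK) plays the same role.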
Third: When making your change, perhaps you just need to detect whether another user/transaction has already changed a certain record. This can be done by maintaining so-called version attributes and checking during updates whether those attributes were changed. If a change is detected, you have to repeat the complete transaction. That is called optimistic locking.
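A sketch of that optimistic check, assuming a hypothetical version column on Worker (the values are illustrative):

-- Read the row together with its version.
SELECT available, version FROM Worker WHERE id = 42;   -- suppose it returns available = 3, version = 7

-- Write back only if nobody changed the row in the meantime.
UPDATE Worker
SET available = 2, version = version + 1
WHERE id = 42 AND version = 7;

-- 0 affected rows means another transaction won the race:
-- roll back and repeat the whole unit of work.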
Fourth, or rather the most important point: I hope you have understood the concept of a DBMS transaction as a means of bringing a database from one consistent state to another.

Related

Is using triggers the best solution for this scenario

A large SQL transactional database has more than 100 tables (and it will grow). One of them is called Order. Then there is another table, WorkLoad, which derives from Order and many other joined tables and contains a list of all active orders. Every time an order record is created, if it meets certain conditions, it should be instantly inserted into the WorkLoad table. Finally, there is a third table, WorkLoadAggregation, which displays aggregated data grouped by date and shop and is completely built from the WorkLoad table. WorkLoadAggregation should also display live data, meaning that if a record is inserted into the WorkLoad table then the matching date/shop aggregation should also be updated.
My idea was to handle this with the following triggers:
When a record is inserted into the Order table, a trigger calls a stored procedure which inserts the record into the WorkLoad table
When an Order record is deleted, a trigger deletes the record from the WorkLoad table
When an Order record is updated in a way that it no longer meets the WorkLoad conditions, a trigger deletes the record from the WorkLoad table
When a record is inserted/deleted/updated in the WorkLoad table, a trigger calls a stored procedure which updates the matching date/shop aggregated record in the WorkLoadAggregation table
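For example, the first trigger would look roughly like this (only a sketch: it inlines the insert instead of calling a stored procedure, and the condition and column names are illustrative placeholders):

CREATE TRIGGER trg_Order_Insert ON [Order]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- copy qualifying new orders into WorkLoad
    INSERT INTO WorkLoad (order_id, shop, order_date)
    SELECT i.id, i.shop, i.order_date
    FROM inserted AS i
    WHERE i.status = 'Active';   -- placeholder for the real WorkLoad conditions
END;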
I haven't used triggers that much in such large transactional DBs and for such frequent calls. Is there anything bad about this approach? My biggest concern is the use of "chained triggers", meaning that a trigger on one table activates a trigger on another table. I've been reading a few articles which state that developers should be very cautious when using triggers. Are there any better solutions? Should I consider a NoSQL solution?
Database is hosted on SQL Server 2012.
Note: In the last case (the WorkLoadAggregation trigger), the stored procedure that's called contains a CTE (in case someone suggests using an indexed view)
It is a little difficult to provide a more concrete opinion, but based on the presentation of the problem space, I would not recommend this solution, as it would be difficult to test effectively and I can see it causing issues under times of high load. It is also really hard to quantify the total impact, as I am not sure what the read load would look like or how many other processes may need information out of those tables.
Based on how you have described the problem, and the fact that you asked about NoSQL approaches, I assume that eventual consistency is not much of a concern, so I would recommend a more event-driven architecture. Keep in mind that this may mean a significant rewrite of your current implementation, but it would definitely allow for better domain decomposition and scaling.

Locking database table between query and insert

Forgive me if this is a silly question (I'm new to databases and SQL), but is it possible to lock a table, similar to the lock keyword in C#, so that I can query the database to see if a condition is met, then insert a row afterwards while ensuring the state of the table has not changed between the two actions?
In this case, I have a table transactions which has two columns: user and product. This is a many-to-one relationship; multiple users can have the same product. However, the number of products is limited.
When a user adds a product to their account, I want to first check whether the total number of rows with the same product value is under a certain threshold, and then add the transaction afterwards. However, since this is a multithreaded application, multiple transactions can come in at the same time. I want to make sure that in that case one of them is rejected and one succeeds, so that the number of transactions with the same product value can never be higher than the limit.
Rough pseudo-code for what I am trying to do:
my_user, my_product = ...

my_product_count = 0
for each transaction in transactions:
    if product == my_product:
        my_product_count += 1

if my_product_count < LIMIT:
    insert my_user, my_product into transactions
    return SUCCESS
else:
    return FAILURE
I am using SQLAlchemy with SQLite3, if that matters.
Not needed if you do both operations in a transaction - which is supported by databases. Databases do maintain locks to guarantee transactional integrity. In fact, that is one of the four pillars of what a database does - the so-called ACID guarantees (Atomicity, Consistency, Isolation, Durability).
So, in your case, to ensure consistency you would make both operations in one transaction and set the transaction parameters in such a way as to block reads on the already-read rows.
SQL locking is WAY more powerful than the lock statement because, among other things, databases by definition have multiple threads (users) hitting the same data - something that is exceedingly rare in programming (where access to shared data is avoided in multithreaded programming as much as possible).
I suggest a good book about SQL - because you need to simply LEARN some fundamental concepts at one point, or you will make mistakes that cost money.
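One way that idea can look in SQLite (which the question mentions) - a sketch, not the only option: BEGIN IMMEDIATE takes the write lock up front, so no other writer can slip in between the check and the insert.

BEGIN IMMEDIATE;                         -- acquire the write lock before reading

SELECT COUNT(*) FROM transactions
WHERE product = :my_product;             -- check the current count in application code

-- only if the count above is below the limit:
INSERT INTO transactions (user, product)
VALUES (:my_user, :my_product);

COMMIT;                                  -- otherwise ROLLBACK and report failure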
Transactions allow you to use multiple SQL statements atomically.
(SQLite implements transactions by locking the entire database, but the exact mechanism doesn't matter, and you might want to use another database later anyway.)
However, you don't even need to bother with explicit transactions if your desired algorithm can be handled with a single SQL statement, like this:
INSERT INTO transactions(user, product)
SELECT #my_user, #my_product
WHERE (SELECT COUNT(*)
FROM transactions
WHERE product = #my_product) < #LIMIT;

SQL transaction affecting a big amount of rows

The situation is as follows:
A big production client/server system has one central database table with a certain column that used to have NULL as its default value but now has 0 as the default. But all the rows created before that change of course still have NULL in that column, and that generates a lot of unnecessary error messages in this system.
The solution is of course as simple as this:
update theTable set theColumn = 0 where theColumn is null
But I guess it's gonna take a lot of time to complete this transaction? Apart from that, will there be any other issues I should think of before I do this? Will this big transaction block the whole database, or that particular table during the whole update process?
This particular table has about 550k rows, and 500k of them have a null value and will be affected by the above SQL statement.
The impact on the performance of other connected clients depends on:
How fast the server hardware is
How many indexes containing the column your update statement has to update
Which transaction isolation settings the other clients use when connecting to the database
The DB engine will acquire write locks, so if your clients only need read access to the table, it should not be a big problem.
500,000 records doesn't sound like too much to me, but as I said, the time and resources the update takes depend on many factors.
Do you have a similar test system where you can try out the update?
Another solution is to split the one big update into many small ones and call them in a loop.
When you have clients writing frequently to that table, your update statement might get blocked "forever". I have seen databases where performing the update row by row was the only way of getting it through. But that was a table with about 200,000,000 records and about 500 very active clients!
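A sketch of the batched variant; the batch size of 5000 is just an illustrative choice, and the exact syntax depends on the RDBMS:

-- Repeat until the statement reports 0 affected rows; each run is its own short transaction.
UPDATE theTable
SET theColumn = 0
WHERE theColumn IS NULL
LIMIT 5000;            -- MySQL syntax; SQL Server would use UPDATE TOP (5000) theTable ...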
it's gonna take a lot of time to complete this transaction
There's no definite way to say. It depends a lot on the hardware, the number of concurrent sessions, whether the table is already locked, the number of interdependent triggers, et al.
Will this big transaction block the whole database, or that particular table during the whole update process
If the "whole database" is dependent on this table then it might.
will there be any other issues I should think of before I do this
If the table has been locked by another transaction, you might run into a row-lock situation, and in rare cases perhaps a deadlock. It would be best to ensure that no one is using the table, check for any pre-existing locks, and then run the statement.
Locking issues are vendor specific.
Assuming no triggers on the table, half a million rows is not much for a dedicated database server, even with many indexes on the table.

Should I be using InnoDB for this?

I am developing a personal PHP/MySQL app, and I came across this particular scenario in my project:
I have various comment threads. This is handled by two tables - 'Comments' and 'Threads' - with each comment in the 'Comments' table having a 'thread_id' attribute indicating which thread the comment belongs to. When the user deletes a comment thread, I currently run two separate DELETE SQL queries:
First, delete all the comments belonging to the thread from the 'Comments' table
Then, clear the thread record from the 'Threads' table.
I also have another situation, where I need to insert data from a form into two separate tables.
Should I be using transactions for these kinds of situations? If so, is it a general rule of thumb to use transactions whenever I need to perform multiple SQL queries like this?
It depends on your actual needs. Transactions are just a way of ensuring that all the data manipulation that forms a single transaction gets executed successfully, and that transactions happen sequentially (a new transaction cannot start until the previous one has either succeeded or failed). If one of the queries fails for whatever reason, the whole transaction fails and the previous state is restored.
If you absolutely need to make sure that no thread is deleted unless all of its comments have been deleted beforehand, go for transactions. If you need all the speed you can get, go for MyISAM.
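A minimal sketch of the transactional variant, assuming InnoDB and the table/column names from the question (the id value is illustrative):

START TRANSACTION;

-- remove the comments first, then the thread: either both deletes succeed or neither does
DELETE FROM Comments WHERE thread_id = 123;
DELETE FROM Threads WHERE id = 123;

COMMIT;   -- on any error, issue ROLLBACK instead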
Yes, it is a general rule of thumb to use transactions when doing multiple operations that are related. If you do switch to InnoDB (usually a good idea, but not always - we didn't really discuss any requirements besides transactions, so I won't comment further), I'd also suggest setting a constraint on Comments that points to Threads, since it sounds like a comment must be assigned to a thread. Deleting the thread would then remove the associated comments in a single atomic statement.
If you want ACID transactions, you want InnoDB. If having one DELETE succeed and the other fail means having to manually DELETE the failed attempt, I'd say that's a hardship better handled by the database. Those situations call for transactions.
For the first part of your question, I would recommend declaring thread_id as a foreign key in your Comments table, referencing the id column of the Threads table. You can then set ON DELETE CASCADE, which means that when an id is removed from the Threads table, all comments that reference that id will also be deleted.
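A sketch of that constraint, assuming the id/thread_id columns mentioned above (the constraint name and id value are illustrative):

ALTER TABLE Comments
ADD CONSTRAINT fk_comments_thread
FOREIGN KEY (thread_id) REFERENCES Threads (id)
ON DELETE CASCADE;

-- a single statement now removes the thread and, via the cascade, all of its comments
DELETE FROM Threads WHERE id = 123;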

Best practices for multithreaded processing of database records

I have a single process that queries a table for records where PROCESS_IND = 'N', does some processing, and then updates the PROCESS_IND to 'Y'.
I'd like to allow for multiple instances of this process to run, but don't know what the best practices are for avoiding concurrency problems.
Where should I start?
The pattern I'd use is as follows:
Create columns "lockedby" and "locktime" which are a thread/process/machine ID and timestamp respectively (you'll need the machine ID when you split the processing between several machines)
Each task would do a query such as:
UPDATE taskstable SET lockedby=(my id), locktime=now() WHERE lockedby IS NULL ORDER BY ID LIMIT 10
Where 10 is the "batch size".
Then each task does a SELECT to find out which rows it has "locked" for processing, and processes those
After each row is complete, you set lockedby and locktime back to NULL
All this is done in a loop for as many batches as exist.
A cron job or scheduled task, periodically resets the "lockedby" of any row whose locktime is too long ago, as they were presumably done by a task which has hung or crashed. Someone else will then pick them up
The LIMIT 10 is MySQL-specific, but other databases have equivalents. The ORDER BY is important to avoid the query being nondeterministic.
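Putting one loop iteration together as a MySQL-flavored sketch (the placeholders and the 10-minute timeout are illustrative):

-- 1. Claim a batch atomically: only one worker wins each row.
UPDATE taskstable
SET lockedby = :my_id, locktime = NOW()
WHERE lockedby IS NULL
ORDER BY id
LIMIT 10;

-- 2. Fetch the rows this worker just claimed.
SELECT * FROM taskstable WHERE lockedby = :my_id;

-- 3. After processing a row, release it (or mark it done).
UPDATE taskstable
SET lockedby = NULL, locktime = NULL
WHERE id = :row_id AND lockedby = :my_id;

-- 4. Recovery job: free rows claimed too long ago by a hung or crashed worker.
UPDATE taskstable
SET lockedby = NULL, locktime = NULL
WHERE lockedby IS NOT NULL
  AND locktime < NOW() - INTERVAL 10 MINUTE;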
Although I understand the intention, I would disagree with going to row-level locking immediately. This will reduce your response time and may actually make your situation worse. If, after testing, you are seeing concurrency issues with APL, you should make an iterative move to "datapage" locking first!
To really answer this question properly, more information would be required about the table structure and the indexes involved, but to explain further:
DOL datarow locking uses a lot more locks than allpage/page-level locking. The overhead of managing all those locks, and hence the decrease in available memory due to requests for more lock structures within the cache, will decrease performance and counter any gains you may get by moving to a more concurrent approach.
Test your approach without the move first on APL (all-page locking, the default); then, if issues are seen, move to DOL (datapages first, then datarows). Keep in mind that when you switch a table to DOL, all responses on that table become slightly slower, the table uses more space, and the table becomes more prone to fragmentation, which requires regular maintenance.
So, in short: don't move to datarows straight off. Try your concurrency approach first; then, if there are issues, use datapage locking first and datarows only as a last resort.
You should enable row level locking on the table with:
CREATE TABLE mytable (...) LOCK DATAROWS
Then you:
Begin the transaction
Select your row with FOR UPDATE option (which will lock it)
Do whatever you want.
No other process can do anything to this row until the transaction ends.
P. S. Some mention overhead problems that can result from using LOCK DATAROWS.
Yes, there is overhead, though I'd hardly call it a problem for a table like this.
But if you switch to DATAPAGES, then you can effectively lock only one row per page (2 KB by default), because the lock covers the whole page, and processes whose rows reside in the same page will not be able to run concurrently.
If we are talking about a table with a dozen rows being locked at once, there will hardly be any noticeable performance drop.
Process concurrency is of much more importance for a design like that.
The most obvious way is locking. If your database doesn't have locks, you could implement it yourself by adding a "Locked" field.
One way to simplify the concurrency handling is to randomize access to the unprocessed items: instead of all workers competing for the first item, they spread their access across the set randomly.
Convert the procedure to a single SQL statement and process multiple rows as a single batch. This is how databases are supposed to work.
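A sketch of that set-based idea, assuming the per-row processing can be expressed in SQL (the table name and the computed columns are purely illustrative; only PROCESS_IND comes from the question):

-- process every pending row in one statement instead of row-by-row in application code
UPDATE records
SET result = amount * rate,      -- stand-in for the real per-row processing
    PROCESS_IND = 'Y'
WHERE PROCESS_IND = 'N';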