Locking database table between query and insert - sql

Forgive me if this is a silly question (I'm new to databases and SQL), but is it possible to lock a table, similar to the lock keyword in C#, so that I can query the database to see if a condition is met, then insert a row afterwards while ensuring the state of the table has not changed between the two actions?
In this case, I have a table transactions which has two columns: user and product. This is a many-to-one relationship; multiple users can have the same product. However, the number of products is limited.
When a user adds a product to their account, I want to first check whether the total number of rows with the same product value is under a certain threshold, and then add the transaction afterwards. However, since this is a multithreaded application, multiple transactions can come in at the same time. When they would push the count over the limit, I want to make sure that one of them is rejected and the other succeeds, so that the number of transactions with the same product value can never be higher than the limit.
Rough pseudo-code for what I am trying to do:
my_user, my_product = ....
my_product_count = 0
for each transaction in transactions:
    if transaction.product == my_product:
        my_product_count += 1
if my_product_count < LIMIT:
    insert my_user, my_product into transactions
    return SUCCESS
else:
    return FAILURE
I am using SQLAlchemy with SQLite3, if that matters.

This is not needed if you do both operations in one transaction, which databases support. Databases maintain locks internally to guarantee transactional integrity; in fact, that is one of the four pillars of what a database does. They are called the ACID guarantees (Atomicity, Consistency, Isolation, Durability).
So, in your case, to ensure consistency you would perform both operations in one transaction and set the transaction parameters (e.g. the isolation level) so that the rows you have already read cannot be changed by others until you are done (for example, something like the sketch below).
SQL locking is far more powerful than the lock statement because, among other things, databases by definition have multiple threads (users) hitting the same data, something that is exceedingly rare in application programming (where access to shared data is avoided in multithreaded code as much as possible).
I suggest a good book about SQL, because at some point you simply need to learn some fundamental concepts, or you will make mistakes that cost money.
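A minimal sketch of that idea for SQLite (the asker's database), using the table and columns from the question; the limit check itself happens in application code, and the named parameters stand in for real values:

BEGIN IMMEDIATE;                    -- take the write lock up front (SQLite-specific)
SELECT COUNT(*) FROM transactions
 WHERE product = :my_product;       -- the application inspects this count
-- only if the count is below the limit:
INSERT INTO transactions (user, product) VALUES (:my_user, :my_product);
COMMIT;                             -- or ROLLBACK if the check failed

Because every writer has to take the write lock before counting, the check and the insert of two concurrent requests cannot interleave. Other databases would achieve the same with an appropriate isolation level or SELECT ... FOR UPDATE.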

Transactions allow you to use multiple SQL statements atomically.
(SQLite implements transactions by locking the entire database, but the exact mechanism doesn't matter, and you might want to use another database later anyway.)
However, you don't even need to bother with explicit transactions if your desired algorithm can be handled with a single SQL statement, like this:
INSERT INTO transactions(user, product)
SELECT #my_user, #my_product
WHERE (SELECT COUNT(*)
       FROM transactions
       WHERE product = #my_product) < #LIMIT;
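Since the question mentions SQLAlchemy and SQLite, note that you can tell which branch was taken by checking how many rows were affected (for example, the rowcount on the SQLAlchemy result); in plain SQLite an equivalent check right after the statement would be:

SELECT changes();   -- 1 means the row was inserted, 0 means the limit had already been reached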

Related

How to ensure data consistency across multiple related tables with concurrent queries

My data model is similar to an assignment problem. Let's assume we have a firm that provides suitable workers for requested jobs.
For now, I have such relations:
Customer (id)
Job (id)
Worker (id, available)
Jobs In Progress (customer_id, job_id)
Busy Workers (customer_id, worker_id)
There are many-to-many relationships between Customer and Job and between Customer and Worker. This data is close to real-time, so it is highly dynamic.
We want to support the following operations:
Request a worker for a job.
Return the worker when the job has finished.
These queries require reading, updating, deleting, and inserting data in several tables.
For example, if a customer requests a worker, we have to check whether this customer already exists in the table; whether he already has a suitable worker in Busy Workers; if not, find a suitable available worker in Worker; and check whether such a job is already registered in Job. In the worst case, we have to atomically insert the customer into Customer, insert the job into Job, insert a corresponding row into Jobs In Progress, decrement Worker.available, and insert a row into Busy Workers.
In the second query, we have to do all of this in reverse order: increment Worker.available, delete the customer if he has no jobs, delete the job if no customer needs it, and so on.
So we have a lot of consistency rules: the number of busy workers has to be consistent with Worker.available, a customer has to be present in the table only if he has unfinished requested jobs, and a job has to be present in the table only as long as some customer has requested it.
I have read a lot about isolation levels and locking in databases, but I still don't understand how to ensure consistency across multiple tables. It seems like isolation levels don't help, because multiple tables are involved and data may become inconsistent between the selects from two tables. And it seems like locks don't work either, because, AFAIK, SQL Server can't atomically acquire locks on multiple tables, and therefore data may become inconsistent between the lock acquisitions.
Actually, I'm looking for a solution, or the idea of a solution, in general terms, without reference to a concrete RDBMS; it should be something applicable one way or another to the most popular RDBMSs like MySQL, PostgreSQL, SQL Server, and Oracle. It does not have to be a complete solution with examples for all of these systems; some practices, tips, or references would be enough.
I apologize for my English and thank you in advance.
First: Think about your model a bit more. I would not keep so much redundant information. "Decrement Worker.available and insert a row in Busy Workers" is completely superfluous, because you can easily get that information by querying the other tables. You might say that is more costly to query; I would call that premature optimization. Redundancy is very costly in itself.
Second: Think of locks as exclusive resources that only one session can hold at a time. The simplest way to ensure consistency would be to let all database users lock just one designated record using SELECT ... FOR UPDATE. All changes would then be serialized. If you use an MVCC DBMS like PostgreSQL, Oracle, or even SQL Server, readers would still always see a consistent state.
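A rough sketch of that single-lock-record idea, where app_lock is an assumed one-row table (PostgreSQL-style syntax):

BEGIN;
SELECT id FROM app_lock WHERE id = 1 FOR UPDATE;   -- every writer queues up here
-- ... all reads and writes across Customer, Job, Worker, Jobs In Progress
--     and Busy Workers for this request happen inside this transaction ...
COMMIT;                                            -- releases the lock

Because every writer has to acquire the same row lock first, the multi-table changes are serialized and cannot interleave.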
Third: When making your change, perhaps you only need to detect whether another user/transaction has already changed a certain record. This can be done by maintaining so-called version attributes and checking during updates whether those attributes have changed. If a change is detected, you have to repeat the complete transaction. That is called optimistic locking.
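As a sketch, assuming an extra version column on Worker:

UPDATE Worker
   SET available = available - 1,
       version   = version + 1
 WHERE id = :worker_id
   AND version = :version_read_earlier;
-- 0 rows affected means another transaction changed the worker: re-read and retry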
Fourth, and most important: I hope you have understood the concept of a DBMS transaction as the means of taking a DBMS from one consistent state to another.

Microsoft SQL, understand transaction locking on low level

I had a weird case when I got a deadlock between a select statement that joined two tables and a transaction that performed multiple updates on those two tables.
I am coming from the Java world, so I thought that using a transaction would lock all the tables involved in it. What I have come to understand now is that a lock is only requested when you actually access a table from within your transaction, and if someone else is running a heavy select on that table at the same time, you might get a deadlock. To be fair, I should also say that I had multiple connections making the same sequence of calls: they performed a heavy query on two tables and then created a transaction to update those tables, so whatever edge case you might think of, there is a big chance I have been running into it.
With that being said, can you please give a low-level explanation of the situations in which you might get a deadlock between a select statement and a transaction?

Several same time requested queries execution sequence

For example, one user executes a query like this:
UPDATE table SET column = 100;
And a second user:
UPDATE table SET column = 200;
Let's say these two queries are issued at exactly the same time, down to the same second and nanosecond (or whatever the smallest time unit is for this DB). Which query will be executed first, and which second?
Will the database in this case choose the queries' order just randomly?
p.s. I am not tagging a concrete database; I think this mechanism is similar for all major RDBMSs. Or maybe not?
RDBMSs implement a set of properties abbreviated as ACID; Wikipedia explains the concept.
Basically, ACID-compliant databases lock the data at some level (table, page, and row locks are typical). In principle, only one write lock can be acquired for the same object at the same time. So, the database will arbitrarily lock the row for one of the transactions.
What happens to the other transaction? That depends, but one of two things should happen:
The transaction waits until the lock is available. So "in the end", it will assign the value (lose the lock, win the war ;).
The transaction will "timeout" because the appropriate row(s) are not available.
Your case is rather more complicated, because all rows in a table are affected. In the end, though, all rows should have the same value in an ACID-compliant database.
I should note that major databases are (usually) ACID-compliant. However, even though they have locks and transactions and similar mechanisms, the details can and do vary among databases.
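Sketched with the statements from the question ("table" and "column" stand in for real names):

-- Session A:
BEGIN;
UPDATE table SET column = 100;   -- A acquires the write locks on the rows
-- Session B, at (almost) the same moment:
UPDATE table SET column = 200;   -- B blocks here, waiting for A's locks (or times out)
-- Session A:
COMMIT;
-- Session B's update now proceeds; whichever transaction commits last determines the final value.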
Usually, DML operations acquire DML locks, which is how the operations are made atomic and consistent. So, in your case, one of the queries will be granted the lock and executed, and then the second one will proceed in the same fashion. Which one goes first and which second is not defined.

Is using triggers best solution for this scenario

A large SQL transactional database has more than 100 tables (and it will grow). One of them is called Order. Then there is another table, WorkLoad, which is derived from Order and many other joined tables and contains a list of all active orders. Every time an order record is created, if it meets certain conditions, it should be instantly inserted into the WorkLoad table. Finally, there is a third table, WorkLoadAggregation, which displays aggregated data grouped by date and shop and is built entirely from the WorkLoad table. WorkLoadAggregation should also display live data, meaning that if a record is inserted into the WorkLoad table, the matching date/shop aggregation should also be updated.
My idea was to handle this with the following triggers:
When a record is inserted into the Order table, a trigger calls a stored procedure which inserts the record into the WorkLoad table (a rough T-SQL sketch of this trigger is shown below)
When an Order record is deleted, a trigger deletes the record from the WorkLoad table
When an Order record is updated so that it no longer meets the WorkLoad conditions, a trigger deletes the record from the WorkLoad table
When a record is inserted/deleted/updated in the WorkLoad table, a trigger calls a stored procedure which updates the matching date/shop aggregated record in the WorkLoadAggregation table
I haven't used triggers much in such large transactional DBs or for such frequent calls. Is there anything bad about this approach? My biggest concern is the use of "chained triggers", meaning that a trigger on one table activates a trigger on another table. I've been reading a few articles which state that developers should be very cautious when using triggers. Are there any better solutions? Should I consider a NoSQL solution?
Database is hosted on SQL Server 2012.
Note: in the last case (the WorkLoad-to-WorkLoadAggregation trigger), the stored procedure that's called contains a CTE (in case someone suggests using an indexed view)
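For concreteness, the first trigger might look roughly like this in T-SQL (the column names and the filter condition are assumptions):

CREATE TRIGGER trg_Order_AfterInsert ON [Order]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO WorkLoad (order_id, shop, order_date)
    SELECT i.id, i.shop, i.order_date
    FROM inserted AS i
    WHERE i.is_active = 1;   -- stands in for the "certain conditions"
END;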
It is a little difficult to provide a more concrete opinion based only on the presentation of the problem space, but I would not recommend this solution: it would be difficult to test effectively, and I can see it causing issues under times of high load. It is also really hard to quantify the total impact, as I am not sure what the read load would look like or how many other processes may need information out of those tables.
Based on how you have described the problem, and the fact that you asked about NoSQL approaches, I assume that eventual consistency is not much of a concern, so I would recommend a more event-driven architecture. Keep in mind that this may mean a significant rewrite of your current implementation, but it would definitely allow for better domain decomposition and scaling.

Table design and Querying

I have a table design that is represented by this awesome hand drawn image.
Basically, I have an account event, which can be either a Transaction (Payment to or from a third party) or a Transfer (transfer between accounts held by the user).
All common data is held in the event table (Date, CreatedBy, Source Account Id...) and then if it's a transaction, then transaction specific data is held in the Account Transaction table (Third Party, transaction type (Debit, Credit)...). If the event is a transfer, then transfer specific data is in the account_transfer table (Amount, destination account id...).
Note, something I forgot to draw, is that the Event table has an event_type_id. If event_type_id = 1, then it's a transaction. If it's a 2, then it's a Transfer.
Both the transfer and transaction tables are linked to the event table via an event id foreign key.
Note though that a transaction doesn't have an amount, as the transaction can be split into multiple payment lines, so it has a child account_transaction_line table. To get the amount of the transaction, you sum its child lines.
Foreign keys are all set up, with an index on primary keys...
My question is about design and querying. If I want to list all events for a specific account, I can either:
Select
from Event,
where event_type = 1 (transaction),
then INNER join to the Transaction table,
and INNER join to the transaction line (to sum the total)...
and then UNION to another selection,
selecting
from Event,
where event_type = 2 (transfer),
INNER join to transfer table...
and producing a list of all events.
or
Select
from Event,
then LEFT join to transaction,
then LEFT join to transaction line,
then LEFT join to transfer ...
and sum up totals (because of the transaction lines).
Which is more efficient? I think option 1 is best, as it avoids the LEFT joins (scans?). A sketch of option 1 follows below.
OR...
An Indexed View of option 1?
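For reference, option 1 might look roughly like this (table and column names are assumptions based on the description above):

SELECT e.id, e.event_date, SUM(l.amount) AS total
FROM account_event e
INNER JOIN account_transaction t ON t.event_id = e.id
INNER JOIN account_transaction_line l ON l.account_transaction_id = t.id
WHERE e.event_type_id = 1 AND e.source_account_id = @account_id
GROUP BY e.id, e.event_date
UNION ALL
SELECT e.id, e.event_date, tr.amount AS total
FROM account_event e
INNER JOIN account_transfer tr ON tr.event_id = e.id
WHERE e.event_type_id = 2 AND e.source_account_id = @account_id;

Option 2 would instead be a single pass over account_event with LEFT JOINs to all three child tables and a SUM over the line amounts.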
On performance
For performance analysis in SQL server, there are quite a few factors at play, e.g.
How many queries are you going to run, especially on the same data? For example, if 80% of your queries touch around 20% of your data, then caching may help significantly. (See the design section below on how this can matter.)
Are your databases distributed or collocated on the same server? I assume it's a single server system, but if they were distributed, the design and optimization might vary.
Are these queries executed in a background process, or on demand with a user expecting to get the results quickly?
Without these answers (and perhaps some other follow-up questions once they are provided), it would be unwise to state that one option is preferable over the other.
Having said that, based on my personal experience, your best bet specifically for SQL Server is to use the query analyzer, which is actually pretty reasonable, as your first stop. After that, you can do some performance analysis to find the optimal solution. Typically, this is done by modeling the query traffic as it would be when the system is under regular load. (FYI: the modeling link is to ASP.NET performance modeling, but various core concepts apply to SQL as well.) You typically put the system under load and then:
Look at how many connections are lost; this can increase if the queries are expensive.
Check performance counters on the server(s) to see how the system is dealing with the load.
Watch the responses from the queries to see if some start failing to return a valid result, although this is unlikely to happen.
FYI: this is based on my personal experience, after having done various types of performance analysis for multiple projects. We expect to do it again for our current project, although this time around we're using AD and Azure tables instead of SQL, so the methodology is not specific to SQL Server, although the tools, traffic profiles, and what to measure vary.
On design
Introducing event id in the account transaction line:
Although you do not explicitly say so, it seems that the event ID and transaction ID are not going to change after the first entry has been made. If that's the case, and you are only interested in getting the totals for a transaction in this query, then another option (which would optimize your queries) would be to add to the transaction line table a foreign key to AccountEvent's primary key (which I think is the event id). In the strictest DB sense you are de-normalizing the table a bit, but in practice it often helps with performance.
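For example (assuming SQL Server syntax; the table and column names are assumptions):

ALTER TABLE account_transaction_line ADD event_id INT;

ALTER TABLE account_transaction_line
    ADD CONSTRAINT FK_transaction_line_event
    FOREIGN KEY (event_id) REFERENCES account_event (id);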
Computing totals on inserts:
The other approach that I have taken in a past project (just because I was using FoxPro in the previous century, and FoxPro tended to be extremely slow at joins) was to keep total amounts in the primary table, the equivalent of your transactions table. This would be quite useful if your reads heavily outweigh your writes, and in SQL you can use a transaction to make the entries in the other tables and update the totals simultaneously (hence my question about your query profiles).
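As a sketch (the table names and the total_amount column are assumptions):

BEGIN TRANSACTION;

INSERT INTO account_transaction_line (account_transaction_id, amount)
VALUES (@transaction_id, @amount);

UPDATE account_transaction
   SET total_amount = total_amount + @amount
 WHERE id = @transaction_id;

COMMIT;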
Join transaction & transfers tables:
Keep a value to indicate which is which, and keep the totals there; this is similar to the previous option but at a different level. It will decrease the joins on queries, while still summing totals on inserts. I would prefer the previous option over this one.
De-normalize completely:
This is yet another approach that folks have used (especially in the NoSQL space), but it gives me shivers when applied in SQL Server, so I have a personal bias against it; still, you could very well search for it and read about it.