How to deal with race conditions among jobs with e.g. beanstalkd

I want to set up a job queue with multiple workers. Right now I am looking at beanstalkd, but I believe this is more of a conceptual problem: how can you ensure that jobs related to a single entity get handled in order?
Let's say the workers manage an email platform for some UI. For a given mailbox, jobs need to be performed serially. For example, sometimes a user will want to re-push their password into the mail platform while troubleshooting. So, they change their password, then change it back right away. That's two password-change jobs submitted to beanstalkd.
Now, most of the time this will go fine, as beanstalkd will hand those jobs out to workers in order. However, some transient error like a DNS lookup delay could cause the second password change (back to the proper one) to go through before the first, leaving the mailbox with an incorrect password.
I have thought about introducing semaphores/mutexes, and having a 1:1 worker-machine:beanstalkd-server ratio, but even that would only work if the lock requests are granted in the order requested, which doesn't seem fully reliable. Having a queue per entity opens some other options, but this needs to support hundreds of thousands of entities.
Judging by how little discussion around this topic I've found, this must not be as common of a scenario as I initially thought. Does anyone have experience dealing with this problem?

A couple of potential methods come to mind.
As you point out, unless you are changing priorities, Beanstalkd is a FIFO queue. This means that, if only one worker is dealing with changing the password, it would handle the jobs in order.
If there are multiple workers, then you could store metadata alongside the password: a last-modified time (more precisely, the time the password change request was made). That time would be set from the job. If the time already in the database (alongside the password) is newer than that of the incoming request, the incoming request is dropped as out of date.
Depending on the user data storage, you may need additional locking around the database (with an SQL database, this is quite easy, but a file-based store would need additional locking to avoid potential file corruption).
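A minimal sketch of that timestamp check, assuming an SQL store (SQLite here, only so the example is self-contained). The table name `mailbox`, the column `changed_at` and the function `apply_password_change` are made up for illustration; the job is assumed to carry the time at which the change was requested. The comparison and the write happen in a single UPDATE statement, so the database does the locking.

```python
import sqlite3

def apply_password_change(conn, mailbox_id, new_password, requested_at):
    """Apply the change only if it is newer than what is already stored.

    Returns True if the row was updated, False if the job was out of date.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE mailbox SET password = ?, changed_at = ? "
            "WHERE id = ? AND changed_at < ?",
            (new_password, requested_at, mailbox_id, requested_at),
        )
    return cur.rowcount == 1

# Hypothetical schema: one row per mailbox, with the request time of the
# password that is currently stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mailbox (id INTEGER PRIMARY KEY, password TEXT, changed_at REAL)"
)
conn.execute("INSERT INTO mailbox VALUES (1, 'original', 0.0)")
conn.commit()

# The second job (requested at t=20) happens to finish before the
# first one (requested at t=10), e.g. because of a slow DNS lookup.
second_applied = apply_password_change(conn, 1, "proper", 20.0)
first_applied = apply_password_change(conn, 1, "temporary", 10.0)
final_password = conn.execute(
    "SELECT password FROM mailbox WHERE id = 1"
).fetchone()[0]
```

The late-arriving first job matches no row and is dropped, so the mailbox keeps the password from the most recent request regardless of which worker finished first.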

Related

Fix inconsistent state right away or lazily when data is requested

Our users go through several steps of workflow - the further they go the more objects we create. We also allow users to go back to Step#1 and change one of the existing objects. Which may cause inconsistencies so we must update/delete some of the objects at Step#2. I see 2 options:
1. Update/delete objects from Step#2 right away. This leads to:
   - Operation that's supposed to be a simple PATCH of an entity field becomes complicated. And it's a shared object between multiple workflows - so we'll have to add if-statements and do different things depending on the workflow.
   - Circular dependencies. Operations on Step#1 have to know about objects/operations on Step#2.
   - On each request in Step#1 we'd have to load data for Step#2 in order to determine whether Step#2 really needs to be updated. Which slows down operations on Step#1. So to change 1 record in DB we'll have to load hundreds (or even thousands) records for Step#2.
   - Many actions on Step#1 may need fixing state at Step#2. So we have to ensure we don't forget anything today and in the future.
2. Fix Step#2 lazily - when user goes there (our current approach). Step#2 will recognize that objects are inconsistent and fix them. Which leads to just 1 place where we need to care, but:
   - Until user opens Step#2 - DB will contain inconsistent objects. This hasn't resulted in any problems so far. But I can imagine it may complicate future SQL migrations.
   - We update DB state on GET request. This one doesn't seem like that big of a deal since GET stays idempotent anyway. But still it feels awkward.
Does anyone know better approaches? Or maybe improvements to these two?
Update
I haven't found a perfect solution, but eventually we implemented an improved version of #1. When updating state on Step#1 we also set a flag "need to rebuild Step#2". When the UI opens Step#2, it first checks this flag and issues a PUT to rebuild the state, and only then GETs Step#2.
This still means that DB state is inconsistent for some period of time. But at least we'll know this for sure from the flag in DB. And if needed - we could write migrations taking this flag into account. This also allows (if needed in the future) to create an async job to fix the state.
I think it is more flexible to separate the state from the context in which the objects are stored. Any creation of a new object, at any step, must preserve the invariants and consistency of the context.
The state has its own rules: which transitions from one state to another are allowed, and which objects may be created in each state. The context has separate rules for its consistency, which are enforced every time it changes.
What about dirty data asynchronous cleanup?
1) Whenever the user goes back to Step #1 and changes something, mark all related data as "dirty" (e.g. add links to it in a "DirtyData" table) and be done for now.
2) Have a DataCleanup worker (e.g. a separate thread or something similar) that constantly looks for data to be cleaned up.
3) Before editing data for Step #2, check that the data is not dirty.
Depending on your logic, 3) might result in user error (e.g. user would need to repeat Step #2). If DataCleanup worker has enough resources (i.e. it processes DirtyData table almost instantaneously), that should happen only on very rare occasions. If that is not OK, you could opt for checking for dirty data on each fetch, but that could be expensive.
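A rough in-memory sketch of those three steps. Everything here is a stand-in: the set `dirty_ids` plays the role of the "DirtyData" table, `cleanup_once` is a single pass of what the DataCleanup worker would run in a loop, and "rebuilt" stands for whatever fixing an object means in your domain.

```python
step2_objects = {"a": "derived from old step 1", "b": "derived from old step 1"}
dirty_ids = set()  # stand-in for the "DirtyData" table

def change_step1(affected_ids):
    """Step 1 edit: only mark the dependent Step 2 objects, do not fix them."""
    dirty_ids.update(affected_ids)

def cleanup_once():
    """One pass of the cleanup worker: fix every object currently marked dirty."""
    for obj_id in list(dirty_ids):
        step2_objects[obj_id] = "rebuilt"
        dirty_ids.discard(obj_id)

def edit_step2(obj_id, value):
    """Refuse to edit an object that the worker has not cleaned up yet."""
    if obj_id in dirty_ids:
        raise RuntimeError(f"object {obj_id!r} is dirty, retry after cleanup")
    step2_objects[obj_id] = value

# The user changes Step 1, then reaches Step 2 before the worker has run.
change_step1(["a"])
try:
    edit_step2("a", "user edit")
    blocked_while_dirty = False
except RuntimeError:
    blocked_while_dirty = True

# After the worker has caught up, the same edit goes through.
cleanup_once()
edit_step2("a", "user edit")
```

In a real system the worker would run in its own thread or process and the dirty check would be a query, but the control flow stays the same.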
It sounds like you're familiar with the HTTP spec regarding GET requests, but for future readers:
Why shouldn't a GET request change data on the server?
Why is using a HTTP GET to update state on the server in a RESTful call incorrect?
For the other bullet under 2, we probably don't need a specification to agree that persisting valid data is preferable to persisting invalid data.
So what can we do for the bullets under 1 to avoid complex branching logic in a particular step and also circular dependencies? My suggestion is an event-driven design. When step #1 changes it should fire a change event. In this scenario, step #1 has no knowledge of the concrete listener(s) who may receive its events, so it remains decoupled from any complex handling logic.
There's probably no way to guarantee you don't forget anything in the future; but if every step in the workflow is defined as a listener, it forces you to consider change events to some extent every time you implement a new step.
One side note on granularity: if a step has many changes, it can batch up its events rather than fire each one individually. You can adjust the size for efficiency.
In summary, I would strongly consider the Observer design pattern.
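To make the Observer idea concrete, here is a small sketch. The class `EventBus`, the event name `"object_changed"` and the listener `rebuild_dependent_objects` are all invented for the example; the point is only that the code performing the edit publishes an event and never imports or calls the steps that react to it.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Listener = Callable[[dict], None]

class EventBus:
    """Minimal publish/subscribe hub; listeners register per event name."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, event_name: str, listener: Listener) -> None:
        self._listeners[event_name].append(listener)

    def publish(self, event_name: str, payload: dict) -> None:
        for listener in self._listeners[event_name]:
            listener(payload)

bus = EventBus()
rebuilt_for = []  # records which object ids a dependent step reacted to

def rebuild_dependent_objects(payload: dict) -> None:
    """Listener owned by a dependent step; the publisher never sees it."""
    rebuilt_for.append(payload["object_id"])

bus.subscribe("object_changed", rebuild_dependent_objects)

def patch_object(object_id: int, field: str, value: str) -> None:
    """The simple PATCH stays simple: store the change, then announce it."""
    # ... persist the field change here ...
    bus.publish(
        "object_changed",
        {"object_id": object_id, "field": field, "value": value},
    )

patch_object(42, "name", "new name")
```

Adding a new workflow step then means registering another listener, not adding another if-statement to `patch_object`.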

Keeping multi-user state across DB sessions

The situation
Suppose we have a web application connected to a (Postgre)SQL database whose task can be summarized as:
A SELECT operation to visualize the data.
An UPDATE operation that stores modifications based on the visualized data.
Simple, but... the data involved isn't user specific, so it might potentially be changed during the process by other users. The editing task may take a long time (perhaps more than an hour), meaning that the probability of these collisions happening isn't low: it makes sense to implement a robust solution to the problem.
The approach
The idea would be that, once the user tries to submit the changes (i.e. firing the UPDATE operation), a number of database checks will be triggered to ensure that the involved data didn't change in the meantime.
Assuming we have timestamped every change on the data, it would be as easy as keeping the access time when the data was SELECTed and ensuring that no new changes were done after that time on the involved data.
The problem
We could easily just keep that access time in the frontend application while the user performs the editing, and later provide it as an argument to the trigger function when performing the UPDATE, but that's not desirable for security reasons. The database should store the user's access time.
An intuitive solution could be a TEMPORARY TABLE associated with the database session. But, again, the user might take a long time doing the task, so capturing a connection from the pool and keeping it idle for such a long time doesn't seem like a good option either. The SELECT and the UPDATE operations will be performed under different sessions.
The question
Is there any paradigm or canonical way to address and solve this problem efficiently?
This problem is known as the "lost update" problem.
There are several solutions that depend on whether a connection pool is used or not and on the transaction isolation level used:
pessimistic locking with SELECT ... FOR UPDATE without connection pool
optimistic locking with timestamp column if connection pool is used.
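A sketch of the optimistic variant, which works across sessions because nothing is held open between the SELECT and the UPDATE. It uses an integer `version` column instead of a timestamp column (a common variation; the check is the same, the counter just avoids two changes landing on the same timestamp). The table `document`, the exception `StaleDataError` and the functions `load` and `save` are illustrative names, and SQLite is used only to keep the example runnable.

```python
import sqlite3

class StaleDataError(Exception):
    """Raised when the row changed after the caller read it."""

def load(conn, row_id):
    """Return (body, version) as seen at read time."""
    return conn.execute(
        "SELECT body, version FROM document WHERE id = ?", (row_id,)
    ).fetchone()

def save(conn, row_id, new_body, seen_version):
    """Write only if nobody else has written since seen_version was read."""
    with conn:
        cur = conn.execute(
            "UPDATE document SET body = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_body, row_id, seen_version),
        )
    if cur.rowcount == 0:
        raise StaleDataError(f"document {row_id} was modified by someone else")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)"
)
conn.execute("INSERT INTO document VALUES (1, 'initial', 1)")
conn.commit()

# Two users open the same record, then both try to submit.
_, alice_version = load(conn, 1)
_, bob_version = load(conn, 1)

save(conn, 1, "alice's edit", alice_version)
try:
    save(conn, 1, "bob's edit", bob_version)
    bob_conflict = False
except StaleDataError:
    bob_conflict = True

final_body, final_version = load(conn, 1)
```

What to do on conflict (reject, show a diff, merge) is an application decision; the database only tells you that a collision happened.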

How can I deal with the webserver UI of one machine being out of sync with backend/API of another?

The system my company sells is software for a multi-machine solution. In some cases, there is a UI on one of the machines and a backend/API on another. These systems communicate and both use their own clocks for various operations and storage values.
When the UI's system clock gets ahead of the backend by 30 seconds or more, the queries start to misbehave due to the UI's timestamp being sent over as key information to the REST request. There is a "what has been updated by me" query that happens every 30 seconds and the desync will cause the updated data to be missed since they are outside the timing window.
Since I do not have any control over the systems that my software is installed on, I need a solution on my code's side. I can't force customers to keep their clocks in sync.
Possible solutions I have considered:
The UI can query the backend for its system time and cache that.
The backend/API can reach back further in time when looking for updates. This will give the clocks some room to slip around, but will cause a much heavier query load on systems with large sets of data.
Any ideas?
Your best bet is to restructure your API somewhat.
First, even though NTP is a good idea, you can't actually guarantee it's in use. Additionally, even when it is enabled, OSs (Windows at least) may reject packets that are too far out of sync, to prevent certain attacks (on the order of minutes, though).
When dealing with distributed services like this, the mantra is "do not trust the client". This applies even when you actually control the client, too, and doesn't necessarily mean the client is attempting anything malicious - it just means that the client isn't the authoritative source.
This should include timestamps.
Consider: the timestamps are a problem here because you're trying to use the client's time to query the server, except that we shouldn't trust the client. Instead, the server should return a timestamp of when the request was processed, or the update stamp of the latest entry in the database, which can then be used in subsequent queries to retrieve new updates (how far back you go on the initial query is up to you).
Dealing with concurrent updates safely is a little harder, and depends on what is supposed to happen on collision. There's nothing really different here from most of the questions and answers dealing with database-centric versions of the problem, I'm just mentioning it to note you may need to add extra fields to your API to correctly handle or detect the situation, if you haven't already.
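A small sketch of that API shape. It uses a server-assigned sequence number as the "update stamp" rather than a wall-clock timestamp, which is a substitution on my part; the important property is the same, namely that the value originates on the server and the client only echoes it back. `UpdateLog`, `record` and `changes_since` are hypothetical names.

```python
import itertools

class UpdateLog:
    """Backend side: every change gets a server-assigned, increasing stamp."""

    def __init__(self):
        self._next_stamp = itertools.count(1)
        self._entries = []  # list of (stamp, payload)

    def record(self, payload):
        self._entries.append((next(self._next_stamp), payload))

    def changes_since(self, cursor):
        """Return (payloads newer than cursor, cursor for the next poll)."""
        newer = [(stamp, p) for stamp, p in self._entries if stamp > cursor]
        next_cursor = newer[-1][0] if newer else cursor
        return [p for _, p in newer], next_cursor

backend = UpdateLog()
backend.record("a")
backend.record("b")

# UI side: it never consults its own clock, it only stores what the
# backend handed back on the previous poll.
cursor = 0
first_poll, cursor = backend.changes_since(cursor)
second_poll, cursor = backend.changes_since(cursor)

backend.record("c")
third_poll, cursor = backend.changes_since(cursor)
```

Because the cursor never depends on the UI's clock, a 30-second skew between the machines no longer affects which rows the poll returns.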

Multiple applications on a network with the same SQL database

I will have multiple computers on the same network with the same C# application running, connecting to a SQL database.
I am wondering if I need to use the service broker to ensure that if I update record A in table B on Machine 1, the change is pushed to Machine 2. I have seen applications that need to use messaging servers to accomplish this before but I was wondering why this is necessary, surely if they connect to the same database, any changes from one machine will be reflected on the other?
Thanks :)
This is mostly about consistency and latency.
If your applications always perform atomic operations on the database, and they always read whatever they need with no caching, everything will be consistent.
In practice, this is seldom the case. There are plenty of hidden opportunities for caching, like when you have an edit form - it has the values the entity had before you started the edit process, but what if someone modified those in the meantime? You'd just overwrite their changes with your data.
Solving this is a bunch of architectural decisions. Different scenarios require different approaches.
Once data is committed in the database, everyone reading it will see the same thing - but only if they actually get around to reading it, and the two reads aren't separated by another commit.
Update notifications are mostly concerned with invalidating caches, and perhaps some push-style processing (e.g. IM client might show you a popup saying you got a new message). However, SQL Server notifications are not reliable - there is no guarantee that you'll get the notification, and even less so that you'll get it in time. This means that to ensure consistency, you must not depend on the cached data, and you have to force an invalidation once in a while anyway, even if you didn't get a change notification.
Remember, even if you're actually using a database that's close enough to ACID, it's usually not the default setting (for performance and availability, mostly). You need to understand what kind of guarantees you're getting, and how to write code to handle this. Even the most perfect ACID database isn't going to help your consistency if your application introduces those inconsistencies :)
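To illustrate the point about not relying on notifications alone, here is a sketch of a cache that expires entries by age and also accepts explicit invalidation. `ExpiringCache` is an invented helper, not part of any SQL Server or .NET API, and it is written in Python rather than C# purely for brevity; the `clock` parameter exists so the behaviour can be shown without sleeping.

```python
import time

class ExpiringCache:
    """Cache that re-reads from the source once an entry is older than max_age."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader          # function that reads from the database
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}             # key -> (value, loaded_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now - entry[1] >= self._max_age:
            value = self._loader(key)
            self._entries[key] = (value, now)
            return value
        return entry[0]

    def invalidate(self, key):
        """Call this when a change notification does arrive."""
        self._entries.pop(key, None)

# Simulated database and clock.
database = {"A": 1}
now = [0.0]
cache = ExpiringCache(database.get, max_age_seconds=30, clock=lambda: now[0])

before_change = cache.get("A")
database["A"] = 2                 # another machine commits; the notification is lost
still_stale = cache.get("A")      # served from cache within the 30 s window
now[0] = 31.0
after_expiry = cache.get("A")     # forced re-read picks up the change
```

The maximum age bounds how long a missed notification can leave a machine showing old data; it does not replace a version or timestamp check at write time.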

Transaction-Style HTTP requests

I recently ran into the following problem:
For each user, I need to do the following on server side:
First
(SQL) Insert user's record with a Unique constraint on ID
Then Parallel
(Http) Subscribe user to Service A, get subscription_id_A
(Http) Subscribe user to Service B, get subscription_id_B
Finally
(SQL) Update user's record with both subscription ids
Ideally I want this entire operation to be transactional, e.g. if any of the HTTP requests or the SQL fails, it would be as if nothing happened. Added: if request A fails but B succeeds, I would be stuck: do I cancel the transaction and end up with an untracked subscription, or do I commit it and end up with a user missing a subscription?
Given that this is likely impossible to achieve, what would be the next best thing I can do?
Services A and B do provide APIs to check for the existence of subscriptions and to modify or delete a subscription too, but I want to avoid the Check Then Act style. The SQL server uses the highest isolation level.
This is indeed a standard problem. (Often, developers are not aware of this problem and only find out in production.) There is no standard solution. It is impossible to solve in general (see the Two Generals' Problem, http://en.wikipedia.org/wiki/Two_Generals%27_Problem: two systems can never agree with 100% certainty on whether they should commit or abort).
Maybe you can perform all the SQL work first. Insert the user but without subscription IDs. You then try to add the subscriptions one by one and add their IDs in separate transactions once you got them.
Install a background job that periodically checks for users that were created a while ago but still do not have their subscriptions. If you find any discrepancies, fix them and log this fact.
This periodic cleanup ensures that temporary failures (which will occur due to network glitches, timeouts, redeployments, bugs, ...) are temporary. It also ensures that they are being detected and reported to developers if you like.
This would be an eventually consistent system. The idea is to first transactionally record the target state (the user and the goal to create two subscriptions) and then have a background job try to converge the data to the target state.
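A sketch of that flow under some assumptions: the `users` table with nullable `sub_a` and `sub_b` columns, and the functions `create_user`, `reconcile`, `fake_subscribe_a` and `fake_subscribe_b` are all invented for illustration, and the two fake functions stand in for the HTTP calls to services A and B. A missing subscription id is how the target state "should be subscribed" is recorded.

```python
import sqlite3

def create_user(conn, user_id):
    """Step 1: record the user transactionally, with no subscription ids yet."""
    with conn:
        conn.execute("INSERT INTO users (id) VALUES (?)", (user_id,))

def reconcile(conn, subscribe_a, subscribe_b):
    """One pass of the background job: fill in whatever is still missing."""
    rows = conn.execute(
        "SELECT id, sub_a, sub_b FROM users WHERE sub_a IS NULL OR sub_b IS NULL"
    ).fetchall()
    for user_id, sub_a, sub_b in rows:
        pending = (("sub_a", sub_a, subscribe_a), ("sub_b", sub_b, subscribe_b))
        for column, current, subscribe in pending:
            if current is not None:
                continue
            try:
                new_id = subscribe(user_id)
            except ConnectionError:
                continue  # leave it NULL; the next pass retries
            with conn:  # each id is stored in its own small transaction
                conn.execute(
                    f"UPDATE users SET {column} = ? WHERE id = ?",
                    (new_id, user_id),
                )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, sub_a TEXT, sub_b TEXT)")
create_user(conn, 7)

calls_to_b = []

def fake_subscribe_a(user_id):
    return f"A-{user_id}"

def fake_subscribe_b(user_id):
    """Fails on the first call to simulate a network glitch."""
    calls_to_b.append(user_id)
    if len(calls_to_b) == 1:
        raise ConnectionError("service B timed out")
    return f"B-{user_id}"

reconcile(conn, fake_subscribe_a, fake_subscribe_b)
after_first_pass = conn.execute(
    "SELECT sub_a, sub_b FROM users WHERE id = 7"
).fetchone()

reconcile(conn, fake_subscribe_a, fake_subscribe_b)
after_second_pass = conn.execute(
    "SELECT sub_a, sub_b FROM users WHERE id = 7"
).fetchone()
```

One gap this sketch leaves open: if a service creates the subscription but the response is lost, a blind retry would create a second one. Since services A and B expose an existence check, the real job would query that before subscribing again, or pass an idempotency key if the services support one.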