Although I couldn't find anything on it, I thought I would double check - does memcache support transactions?
If not, which I'm betting is the likely answer, then what is the correct method to work with memcache in an environment with transactions? Wouldn't you have to read from the DB every time you plan to update, even if the data is in cache, just so you can set your locks? For example, a script that updates some data would look like this:
BEGIN; SELECT ... FOR UPDATE;
Calculate...
UPDATE TABLE ...;
Update cache
COMMIT;
I think you have to update cache after you run your update query, in case you hit a deadlock and need to rollback. But you should also update cache before committing, in case any other thread is waiting to read your data and may accidentally update it's cache with even newer data before you, resulting in your now-out-of-date data overwriting it.
Is this the correct sequence of steps? Is there any way to not have to hit the db on reads for update?
Memcache does have an operator called CAS (Check And Set - or Compare And Swap) which may help you. The PHP manual has some documentation on it, Memcached::cas, but it should also be supported in other libraries and languages.
Memcached does not support transactions in this sense, although its operations are atomic. You can use your database transactioning mechanism and update cache manually (as you specified), or use transactioned wrapper like this.
Related
I am trying to use Redis as a cache that sits in front of an SQL database. At a high level I want to implement these operations:
Read value from Redis, if it's not there then generate the value via querying SQL, and push it in to Redis so we don't have to compute that again.
Write value to Redis, because we just made some change to our SQL database and we know that we might have already cached it and it's now invalid.
Delete value, because we know the value in Redis is now stale, we suspect nobody will want it, but it's too much work to recompute now. We're OK letting the next client who does operation #1 compute it again.
My challenge is understanding how to implement #1 and #3, if I attempt to do it with StackExchange.Redis. If I naively implement #1 with a simple read of the key and push, it's entirely possible that between me computing the value from SQL and pushing it in that any number of other SQL operations may have happened and also tried to push their values into Redis via #2 or #3. For example, consider this ordering:
Client #1 wants to do operation #1 [Read] from above. It tries to read the key, sees it's not there.
Client #1 calls to SQL database to generate the value.
Client #2 does something to SQL and then does operation #2 [Write] above. It pushes some newly computed value into Redis.
Client #3 comes a long, does some other operation in SQL, and wants to do operation #3 [Delete] to Redis knowing that if there's something cached there, it's no longer valid.
Client #1 pushes its (now stale) value to Redis.
So how do I implement my operation #1? Redis offers a WATCH primitive that makes this fairly easy to do against the bare metal where I would be able to observe other things happened on the key from Client #1, but it's not supported by StackExchange.Redis because of how it multiplexes commands. It's conditional operations aren't quite sufficient here, since if I try saying "push only if key doesn't exist", that doesn't prevent the race as I explained above. Is there a pattern/best practice that is used here? This seems like a fairly common pattern that people would want to implement.
One idea I do have is I can use a separate key that gets incremented each time I do some operation on the main key and then can use StackExchange.Redis' conditional operations that way, but that seems kludgy.
It looks like question about right cache invalidation strategy rather then question about Redis. Why i think so - Redis WATCH/MULTI is kind of optimistic locking strategy and this kind of
locking not suitable for most of cases with cache where db read query can be a problem which solves with cache. In your operation #3 description you write:
It's too much work to recompute now. We're OK letting the next client who does operation #1 compute it again.
So we can continue with read update case as update strategy. Here is some more questions, before we continue:
That happens when 2 clients starts to perform operation #1? Both of them can do not find value in Redis and perform SQL query and next both of then write it to Redis. So we should have garanties that just one client would update cache?
How we can be shure in the right sequence of writes (operation 3)?
Why not optimistic locking
Optimistic concurrency control assumes that multiple transactions can frequently complete without interfering with each other. While running, transactions use data resources without acquiring locks on those resources. Before committing, each transaction verifies that no other transaction has modified the data it has read. If the check reveals conflicting modifications, the committing transaction rolls back and can be restarted.
You can read about OCC transactions phases in wikipedia but in few words:
If there is no conflict - you update your data. If there is a conflict, resolve it, typically by aborting the transaction and restart it if still need to update data.
Redis WATCH/MULTY is kind of optimistic locking so they can't help you - you do not know about your cache key was modified before try to work with them.
What works?
Each time your listen somebody told about locking - after some words you are listen about compromises, performance and consistency vs availability. The last pair is most important.
In most of high loaded system availability is winner. Thats this means for caching? Usualy such case:
Each cache key hold some metadata about value - state, version and life time. The last one is not Redis TTL - usually if your key should be in cache for X time, life time
in metadata has X + Y time, there Y is some time to garantie process update.
You never delete key directly - you need just update state or life time.
Each time your application read data from cache if should make decision - if data has state "valid" - use it. If data has state "invalid" try to update or use absolete data.
How to update on read(the quite important is this "hand made" mix of optimistic and pessisitic locking):
Try set pessimistic locking (in Redis with SETEX - read more here).
If failed - return absolete data (rememeber we still need availability).
If success perform SQL query and write in to cache.
Read version from Redis again and compare with version readed previously.
If version same - mark as state as "valid".
Release lock.
How to invalidate (your operations #2, #3):
Increment cache version and set state "invalid".
Update life time/ttl if need it.
Why so difficult
We always can get and return value from cache and rarely have situatiuon with cache miss. So we do not have cache invalidation cascade hell then many process try to update
one key.
We still have ordered key updates.
Just one process per time can update key.
I have queue!
Sorry, you have not said before - I would not write it all. If have queue all becomes more simple:
Each modification operation should push job to queue.
Only async worker should execute SQL and update key.
You still need use "state" (valid/invalid) for cache key to separete application logic with cache.
Is this is answer?
Actualy yes and no in same time. This one of possible solutions. Cache invalidation is much complex problem with many possible solutions - one of them
may be simple, other - complex. In most of cases depends on real bussines requirements of concrete applicaton.
I am looking for a (SQL/RDB) database setup that works something like this:
I will have 3+ databases in an active/active/active configuration
prior to doing any insert, the database will communicate with atleast a majority of the others, such that they all either insert at the same time or rollback (transaction)
this way I can write and read from any of the databases, and always get the same results (as long as the field wasn't updated very recently)
note: this is for a use case that will be very read-heavy and have few writes (and delay on the writes is an OK situation)
does anything like this exist? I see all sorts of solutions with database HA configurations, but most of them suggest writing to a primary node or having a passive backup
alternatively I could setup a custom application, and have each application talk to exactly 1 database, and achieve a similar result, but I was hoping something similar would already exist
So my questions is: does something like this exist? if not, are there any technical/architectural reasons why not?
P.S. - I will NOT be using a SAN where all databases can store/access the same data
edit: more clarifications as far as what I am looking for:
1. I have no database picked out yet, but I am more familiar with MySQL / SQL Server / Oracle, so I would have a minor inclination towards on of those
2. If a majority of the nodes are down (or a single node can't communicate with the collective), then I expect all writes from that node to fail, and accept that it may provide old/outdated information
failure / recover scenario expectations:
1. A node goes down: it will query and get updates from the other nodes when it comes back up
2. A node loses connection with the collective: it will provide potentially old data to read request, and refuse any writes
3. A node is in disagreement with the data stores in others: majority rule
4. 4. majority rule does not work: go with whomever has the latest data (although this really shouldn't happen)
5. The entries are insert/update/read only, i.e. there will be no deletes (except manually ofc), so I shouldn't need to worry about an update after a delete, however in that case I would choose to delete the record and ignore the update
6. Any other scenarios I missed?
update: I the closest I see to what I am looking for seems to be using a quorum + 2 DBs, however I would prefer if I could have 3 DBs instead, such that I can query any of them at any time (to further distribute the reads, and also to keep another copy of the data)
You need to define "very recently". In most environments with replication for inserts, all the databases will have the same data within a few seconds of an insert (and a few seconds seems pessimistic).
An alternative approach is a "read-from-one/write-to-all" approach. In this case, reads are spread through the system. Writes are then sent to all nodes by the application (or a common layer that the application uses).
Remember, though, that the issue with database replication is not how it works when it works. The issue is how it recovers when it fails and even how failures are identified. You need to decide what happens when nodes go down, how they recover lost transactions, how you decide that nodes are really synchronized. I would suggest that you peruse the documentation of the database that you are actually using and understand the replication mechanisms provided by that platform.
I have a farm of servers, each server is regularly making an identical query to the database.
These query results are cached in a shared cache, accessible to all servers.
How can I ensure that a newer query does not get overwritten by an older query?
Is there a way of versioning the queries somehow, by time for example, so that this
doesn't happen? How to deal with concurrent queries?
Thanks!
Edit: db is SQL Server. Query is a select statement. And, caching mechanism is very simple: simple write, with no locking. And that is because there is currently no way of telling the order of the select queries.
One approach is to have a global update counter in the database, either for updates or for reads (updates is more efficient, but also harder to get right).
So on each update, also increment the global counter. On each read, read the global counter, and put the data into the cache along with the counter value. Only overwrite the cache contents if the counter value is larger.
Because of database isolation, transactions should appear as if they happened in a serial manner (assuming you have chosen the SERIALIZABLE isolation level). That, in turn, will mean that strictly higher counter numbers relate to more recent data.
Where is the cache? Is it a disk file? Or a table in the database?
And what does it mean when you say an older query might overwrite a newer one? Isn't it true that the most-recently-completed query (or maybe the most-recently-started) is the newest one?
You should lock the cache container before performing each query, whether that's a file or a table. Each server should only perform the query if it can obtain the lock, otherwise it should wait for the locked resource. That way the cache will contain only the most recent results.
Is that along the lines of what you are asking?
Do you know of any ORM tool that offers deadlock recovery? I know deadlocks are a bad thing but sometimes any system will suffer from it given the right amount of load. In Sql Server, the deadlock message says "Rerun the transaction" so I would suspect that rerunning a deadlock statement is a desirable feature on ORM's.
I don't know of any special ORM tool support for automatically rerunning transactions that failed because of deadlocks. However I don't think that a ORM makes dealing with locking/deadlocking issues very different. Firstly, you should analyze the root cause for your deadlocks, then redesign your transactions and queries in a way that deadlocks are avoided or at least reduced. There are lots of options for improvement, like choosing the right isolation level for (parts) of your transactions, using lock hints etc. This depends much more on your database system then on your ORM. Of course it helps if your ORM allows you to use stored procedures for some fine-tuned command etc.
If this doesn't help to avoid deadlocks completely, or you don't have the time to implement and test the real fix now, of course you could simply place a try/catch around your save/commit/persist or whatever call, check catched exceptions if they indicate that the failed transaction is a "deadlock victim", and then simply recall save/commit/persist after a few seconds sleeping. Waiting a few seconds is a good idea since deadlocks are often an indication that there is a temporary peak of transactions competing for the same resources, and rerunning the same transaction quickly again and again would probably make things even worse.
For the same reason you probably would wont to make sure that you only try once to rerun the same transaction.
In a real world scenario we once implemented this kind of workaround, and about 80% of the "deadlock victims" succeeded on the second go. But I strongly recommend to digg deeper to fix the actual reason for the deadlocking, because these problems usually increase exponentially with the number of users. Hope that helps.
Deadlocks are to be expected, and SQL Server seems to be worse off in this front than other database servers. First, you should try to minimize your deadlocks. Try using the SQL Server Profiler to figure out why its happening and what you can do about it. Next, configure your ORM to not read after making an update in the same transaction, if possible. Finally, after you've done that, if you happen to use Spring and Hibernate together, you can put in an interceptor to watch for this situation. Extend MethodInterceptor and place it in your Spring bean under interceptorNames. When the interceptor is run, use invocation.proceed() to execute the transaction. Catch any exceptions, and define a number of times you want to retry.
An o/r mapper can't detect this, as the deadlock is always occuring inside the DBMS, which could be caused by locks set by other threads or other apps even.
To be sure a piece of code doesn't create a deadlock, always use these rules:
- do fetching outside the transaction. So first fetch, then perform processing then perform DML statements like insert, delete and update
- every action inside a method or series of methods which contain / work with a transaction have to use the same connection to the database. This is required because for example write locks are ignored by statements executed over the same connection (as that same connection set the locks ;)).
Often, deadlocks occur because either code fetches data inside a transaction which causes a NEW connection to be opened (which has to wait for locks) or uses different connections for the statements in a transaction.
I had a quick look (no doubt you have too) and couldn't find anything suggesting that hibernate at least offers this. This is probably because ORMs consider this outside of the scope of the problem they are trying to solve.
If you are having issues with deadlocks certainly follow some of the suggestions posted here to try and resolve them. After that you just need to make sure all your database access code gets wrapped with something which can detect a deadlock and retry the transaction.
One system I worked on was based on “commands” that were then committed to the database when the user pressed save, it worked like this:
While(true)
start a database transaction
Foreach command to process
read data the command need into objects
update the object by calling the command.run method
EndForeach
Save the objects to the database
If not deadlock
commit the database transaction
we are done
Else
abort the database transaction
log deadlock and try again
EndIf
EndWhile
You may be able to do something like with any ORM; we used an in house data access system, as ORM were too new at the time.
We run the commands outside of a transaction while the user was interacting with the system. Then rerun them as above (when you use did a "save") to cope with changes other people have made. As we already had a good ideal of the rows the command would change, we could even use locking hints or “select for update” to take out all the write locks we needed at the start of the transaction. (We shorted the set of rows to be updated to reduce the number of deadlocks even more)
Everyone has accidentally forgotten the WHERE clause on a DELETE query and blasted some un-backed up data once or twice. I was pondering that problem, and I was wondering if the solution I came up with is practical.
What if, in place of actual DELETE queries, the application and maintenance scripts did something like:
UPDATE foo SET to_be_deleted=1 WHERE blah = 50;
And then a cron job was set to go through and actually delete everything with the flag? The downside would be that pretty much every other query would need to have WHERE to_be_deleted != 1 appended to it, but the upside would be that you'd never mistakenly lose data again. You could see "2,349,325 rows affected" and say, "Hmm, looks like I forgot the WHERE clause," and reset the flags. You could even make the to_be_deleted field a DATE column, so the cron job would check to see if a row's time had come yet.
Also, you could remove DELETE permission from the production database user, so even if someone managed to inject some SQL into your site, they wouldn't be able to remove anything.
So, my question is: Is this a good idea, or are there pitfalls I'm not seeing?
That is fine if you want to do that, but it seems like a lot of work. How many people are manually changing the database? It should be very few, especially if your users have an app to work with.
When I work on the production db I put EVERYTHING I do in a transaction so if I mess up I can rollback. Just having a standard practice like that for me has helped me.
I don't see anything really wrong with that though other than ever single point of data manipulation in each applicaiton will have to be aware of this functionality and not just the data it wants.
This would be fine as long as your appliction does not require that the data is immediately deleted since you have to wait for the next interval of the cron job.
I think a better solution and the more common practice is to use a development server and a production server. If your development database gets blown out, simply reload it. No harm done. If you're testing code on your production database, you deserve anything bad that happens.
A lot of people have a delete flag or a row status flag. But if someone is doing a change through the back end (and they will be doing it since often people need batch changes done that can't be accomplished through the front end) and they make a mistake they will still often go for delete. Ultimately this is no substitute for testing the script before applying it to a production environment.
Also...what happens if the following query gets executed "UPDATE foo SET to_be_deleted=1" because they left off the where clause. Unless you have auditing columns with a time stamp how do you know which columns were deleted and which ones were done in error? But even if you have auditing columns with a time stamp, if the auditing is done via a stored procedure or programmer convention then these back end queries may not supply information letting you know that they were just applied.
Too complicated. The standard approach to this is to do all your work inside a transaction, so if you screw up and forget a WHERE clause, then you simply roll back when you see the "2,349,325 rows affected" result.
It may be easier to create a parallel table for deleted rows. A DELETE trigger (and UPDATE too if you want to undo changes as well) on the original table could copy the affected rows to the parallel table. Adding a datetime column to the parallel table to record the date & time of the change would let you permanently remove rows past a certain age using your cron job.
That way, you'd use normal DELETE statements on the original table, so there's no chance you'll forget to run your special "DELETE" statement. You also sidestep the to_be_deleted != 1 expression, which is just a bug waiting to happen when someone inevitably forgets.
It looks like you're describing three cases here.
Case 1 - maintenance scripts. Risk can be minimized by developing them and testing them in an environment other than your production box. For quick maintenance, do the maintenance in a single transaction, and check everything before committing. If you made a mistake, issue the rollback command. For more serious maintenance that you can't necessarily wait around for, or do in a single transaction, consider taking a backup directly before running the maintenance job, so that you can always restore back to the point before you ran your script if you encounter serious problems.
Case 2 - SQL Injection. This is an architecture issue. Your application shouldn't pass SQL into the database, access should be controlled through packages / stored procedures / functions, and values that are going to come from the UI and be used in a DDL statement should be applied using bind variables, rather than by creating dynamic SQL by appending strings together.
Case 3 - Regular batch jobs. These should have been tested before being deployed to production. If you delete too much, you have a bug, and are going to have to rely on your backup strategy.
Everyone has accidentally forgotten
the WHERE clause on a DELETE query and
blasted some un-backed up data once or
twice.
No. I always prototype my DELETEs as SELECTs and only if the latter gives the results I want to delete change the statement before WHERE to a DELETE. This let's me inspect in any needed detail the rows I want to affect before doing anything.
You could set up a view on that table that selects WHERE to_be_deleted != 1, and all of your normal selects are done on that view - that avoids having to put the WHERE on all of your queries.
The pitfall is that it's unnecessarily complicated and someone will inadvertently forget too check the flag in their query. There's also the issue of potentially needing to delete something immediately instead of wait for the scheduled job to run.
To avoid the to_be_deleted WHERE clause you could create a trigger before the delete command fires off to insert the deleted rows into a separate table. This table could be cleared out when you're sure everything in it really needs to be deleted, or you could keep it around for archive purposes.
You also get a "soft delete" feature so you can give the(certain) end-users the power of "undo" - there would have to be a pretty strong downside in the mix to cancel the benefits of soft deleting.
The "WHERE to_be_deleted <> 1" on every other query is a huge one. Another is once you've ran your accidentally rogue query, how will you determine which of the 2,349,325 were previously marked as deleted?
I think the practical solution is regular backups, and failing that, perhaps a delete trigger that captures the tuples to be axed.
The other option would be to create a delete trigger on each table. When anything is deleted, it would insert that "to be deleted" record into another table, ideally named TABLENAME_deleted.
The downside would be that the db would have twice as many tables.
I don't recommend triggers in general, but it might be what you are looking for.
This is why, whenever you are editing data by hand, you should BEGIN TRAN, edit your data, check that it looks good (for instance that you didn't delete more data than you were expecting) and then END TRAN. If you're using Postgres then you want to create lots of savepoints as well so that a typo doesn't wipe out your intermediate work.
But that said, in many applications it does make sense to have software mark records as invalid rather than deleting them. Add a last_modified date that is automatically updated, and you are all prepared to set up incremental updates into a data warehouse. Even if you don't have a data warehouse now, it never hurts to prepare for the future when preparing is cheap. Plus in the event of manual mistakes you still have the data, and can just find all of the records that got "deleted" when you made your mistake and fix them. (You should still use transactions though.)