I have a farm of servers, each server is regularly making an identical query to the database.
These query results are cached in a shared cache, accessible to all servers.
How can I ensure that newer query results do not get overwritten by older ones?
Is there a way of versioning the results somehow, by time for example, so that this
doesn't happen? How should I deal with concurrent queries?
Thanks!
Edit: The DB is SQL Server. The query is a SELECT statement. And the caching mechanism is very simple: a plain write with no locking, because there is currently no way of telling the order of the SELECT queries.
One approach is to have a global update counter in the database, incremented either on updates or on reads (counting updates is more efficient, but also harder to get right).
So on each update, also increment the global counter. On each read, read the global counter, and put the data into the cache along with the counter value. Only overwrite the cache contents if the counter value is larger.
Because of database isolation, transactions should appear as if they happened in a serial manner (assuming you have chosen the SERIALIZABLE isolation level). That, in turn, will mean that strictly higher counter numbers relate to more recent data.
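A minimal sketch of that scheme, assuming pyodbc against SQL Server, a one-row global_counter table, and a shared-cache client exposing get/set (all names invented):

```python
import pyodbc

# conn is assumed to be a pyodbc connection opened with autocommit=False,
# so both reads below happen in one transaction and the counter value is
# consistent with the result set.

def update_with_counter(conn, new_x, row_id):
    cur = conn.cursor()
    cur.execute("UPDATE my_table SET x = ? WHERE id = ?", new_x, row_id)
    # Bump the global counter in the same transaction as the data change.
    cur.execute("UPDATE global_counter SET counter = counter + 1")
    conn.commit()

def read_and_cache(conn, cache, cache_key):
    cur = conn.cursor()
    cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
    version = cur.execute("SELECT counter FROM global_counter").fetchval()
    rows = cur.execute("SELECT id, x, y FROM my_table").fetchall()
    conn.commit()

    cached = cache.get(cache_key)
    # Only overwrite the cache if this snapshot is newer than what's there.
    if cached is None or cached["version"] < version:
        cache.set(cache_key, {"version": version, "rows": rows})
    return rows
```

Note that the final get/compare/set is itself a small race unless the cache offers an atomic compare-and-set (memcached's cas, for instance); without that, two servers can still interleave between the comparison and the write.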
Where is the cache? Is it a disk file? Or a table in the database?
And what does it mean when you say an older query might overwrite a newer one? Isn't it true that the most-recently-completed query (or maybe the most-recently-started) is the newest one?
You should lock the cache container before performing each query, whether that's a file or a table. Each server should only perform the query if it can obtain the lock, otherwise it should wait for the locked resource. That way the cache will contain only the most recent results.
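If the cache container is a table in the same SQL Server instance, one possible sketch uses sp_getapplock so that every server contends on the same named lock; table and lock names here are invented:

```python
# sp_getapplock is a real SQL Server primitive that takes a named,
# server-wide application lock; everything else here is illustrative.

def refresh_cache(conn):
    cur = conn.cursor()
    # Block until we own the lock (or give up after 10 seconds).
    cur.execute("EXEC sp_getapplock @Resource = 'query_cache', "
                "@LockMode = 'Exclusive', @LockOwner = 'Session', "
                "@LockTimeout = 10000")
    try:
        rows = cur.execute("SELECT id, x, y FROM my_table").fetchall()
        cur.execute("DELETE FROM query_cache")
        cur.executemany("INSERT INTO query_cache (id, x, y) VALUES (?, ?, ?)",
                        [tuple(r) for r in rows])
        conn.commit()
    finally:
        cur.execute("EXEC sp_releaseapplock @Resource = 'query_cache', "
                    "@LockOwner = 'Session'")
```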
Is that along the lines of what you are asking?
Let's say I want to use an off-the-shelf solution such as Redis or Memcached to cache database rows (as an example), to avoid recurrent costly trips to the database.
For the sake of the argument, let's assume I have a TABLE(id, x, y) and that I want to cache all rows so I never have to read directly from the database.
Questions:
1. Consider the following case: NodeA tries to update a given row's field x while NodeB tries to update y, then both simultaneously try to update the cache line. If each of them "manually" updates only the field it just changed, then under the typical last-write-wins policy one of the fields is going to be discarded, which is catastrophic. This makes me think I need to always fill the cache's rows with a full row read from the database.
2. But this by itself won't necessarily help me. If NodeA writes to x and loads the entire row in memory, then NodeB writes to y and reads the entire row in memory, and NodeB writes to the cache before NodeA, then NodeB's changes will be overwritten! This makes me believe I need to always somehow version the rows both in the DB and in the cache. Is this the case? Memcached seems to have a compare-and-set primitive, but I see no such thing in Redis.
3. Even if 1. and 2. are not an issue, I still need to guarantee that my write / read has read-after-write consistency, otherwise it may happen that what I'm reading and intending to put in the cache is not necessarily the most up-to-date version. If that's the case, how can I make sure of this? By requiring w + r > n?
This seems to be a very common use-case, I'd guess it's pretty much a solved problem. What can I try to resolve this?
Key-value stores such as Redis support advanced data structures, such as hashes.
If you're doing partial updates to cached entities (only a subset of the fields is updated), and given your goal is to avoid time-consuming database reads, simply save the table row as a hash of key/value pairs (using HSET) and then use HGETALL for reading.
Redis operations are atomic by nature, so that should solve your problems, if I understood them correctly.
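For instance, a minimal redis-py sketch of the per-field approach (key and field names invented):

```python
import redis

r = redis.Redis()

# NodeA updates only field x; NodeB updates only field y. Each HSET
# touches just its own field, so neither write can clobber the other.
r.hset("row:42", "x", "10")   # NodeA, after its UPDATE ... SET x = 10
r.hset("row:42", "y", "20")   # NodeB, after its UPDATE ... SET y = 20

row = r.hgetall("row:42")     # {b'x': b'10', b'y': b'20'}
```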
On a side note, if you're caching an entire entity yet doing partial updates, you should consider a simpler caching approach, such as read-through (making cache validity a reader-only concern).
Unlike database accesses, Redis cache accesses from different locations, unless somehow serialized, always have the potential of arriving out of order in a distributed system, since the execution environment (network, threading) can always introduce delays.
Doing read-through caching will ensure data is always updated after the most recent write without the need to synchronize anything else.
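A minimal read-through sketch, assuming redis-py and invented table/key names:

```python
import json
import redis

r = redis.Redis()

def get_row(conn, row_id, ttl=60):
    # Read-through: only readers populate the cache, on a miss. Writers
    # just update the database and invalidate with r.delete(key).
    key = f"row:{row_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    cur = conn.cursor()
    db_row = cur.execute(
        "SELECT id, x, y FROM my_table WHERE id = ?", row_id).fetchone()
    data = {"id": db_row[0], "x": db_row[1], "y": db_row[2]}
    r.set(key, json.dumps(data), ex=ttl)
    return data
```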
This is how Facebook solved the issue with Memcached: http://nil.csail.mit.edu/6.824/2020/papers/memcache-faq.txt.
The idea is to use the concept of a lease: when a request for a cached value is received and there is no data for that key, a lease token (a 64-bit id) is returned.
When the webserver fetches the data from the database, it can then store the data in the cache with that token. Every time an invalidation request is invoked on a key, a new lease token is created, so if a put is attempted with an old token, the put is rejected.
As far as I understand, it's not really possible to (easily) replicate this behavior with Redis without resorting to Lua scripts.
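For what it's worth, here is a rough sketch of such a Lua-backed lease with redis-py - a toy approximation of the concept, not the paper's protocol; key names and the 10-second lease TTL are invented:

```python
import uuid
import redis

r = redis.Redis()

# A put succeeds only if the caller still holds the current lease token.
PUT_IF_LEASED = r.register_script("""
if redis.call('GET', KEYS[2]) == ARGV[1] then
  redis.call('SET', KEYS[1], ARGV[2])
  redis.call('DEL', KEYS[2])
  return 1
end
return 0
""")

def get_or_lease(key):
    value = r.get(key)
    if value is not None:
        return value, None
    token = str(uuid.uuid4())
    # Only the first miss wins the lease (NX); a loser's later put
    # will simply be rejected by the script above.
    r.set(f"{key}:lease", token, nx=True, ex=10)
    return None, token

def invalidate(key):
    # Deleting the lease means any put with an outstanding token fails.
    r.delete(key, f"{key}:lease")

def put(key, token, value):
    return PUT_IF_LEASED(keys=[key, f"{key}:lease"], args=[token, value]) == 1
```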
We have a pretty big table with hundreds of millions of rows. Removing the rows for a specific foreign key value takes about 5-15 minutes; for example, removing 8 million rows takes 15 minutes.
The question is: does removing the rows actually free up space, given that the database has transaction logging on? And can I remove rows while bypassing transaction logging for that operation?
In simple terms, you can't get around the transaction logging. That's just how the database ensures consistency - if the transaction fails halfway through (or the server's power fails, for example), the database engine needs to know how to get into a consistent state again. Also, appending the things to be changed into the transaction log is much faster than actually performing a change on the data files of the DB, especially in cases like yours.
There are a few special cases where it's safe to get around those things - TRUNCATE TABLE will remove all the rows at once, but only if the table is not referenced by any foreign keys, which makes it rather trivial. You can't limit it in any way, though.
The newly free space will be reclaimed as part of the database maintenance cycle. During each database backup, the database is synchronized to have all the data written in the data files, and the transaction log is backed up and emptied in the DB itself (I'm oversimplifying, since there's a lot of possible configurations - in any case, this is something your DBA should care about).
If this is posing a problem to you, the solution wouldn't be to get around the transaction logging anyway. You probably want to ask why (and how often) you need to delete millions of rows at a time.
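If the deletes are unavoidable, a common mitigation - it doesn't bypass logging, it just keeps each logged transaction small - is deleting in batches. A hedged sketch with pyodbc, with invented table and column names:

```python
def delete_in_batches(conn, fk_value, batch_size=50000):
    # Each small DELETE commits on its own, so the log can be truncated or
    # reused between batches instead of holding 8 million rows' worth of
    # changes in a single long-running transaction.
    cur = conn.cursor()
    while True:
        cur.execute("DELETE TOP (?) FROM big_table WHERE fk_id = ?",
                    batch_size, fk_value)
        conn.commit()
        if cur.rowcount < batch_size:
            break
```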
I am working on a multithreaded .NET 4 application which acquires data continuously and writes it into a SQL database (MySQL or SQL Server - not yet sure).
Every time an INSERT is executed, at least one prior SELECT is necessary in order to synchronize with the database. This means the application gets a block which contains new and old data, and then has to check which data sets are new and which are already in the database.
This means a lot of SELECTs which return more or less the same data every time.
Would it be a good idea to have a copy of the last x entries per table within the application?
This way the synchronization could be done on the copy instead of the database.
Pro:
Faster
Contra:
Uses a lot of memory
Risk of becoming unsynchronized with the database
What do you think? What is the best practice for such a use case?
Any other pros and cons?
Unless you have an external program writing to your database at the same time, you can use buffering.
But instead of buffering SELECT results, just add to the insert method a buffer of the last X (a reasonable number) insertion requests, and only insert the new one if it isn't in that list.
You might also want to lock the insertion method, to make sure the inclusion check is always correct.
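A minimal sketch of that buffered, locked insert path (in Python for illustration, even though the app is .NET; table and column names invented):

```python
import threading
from collections import OrderedDict

class RecentInsertBuffer:
    """Remembers the keys of the last `capacity` insertion requests so a
    duplicate insert can be skipped without a SELECT."""

    def __init__(self, capacity=1000):
        self._seen = OrderedDict()
        self._capacity = capacity
        self._lock = threading.Lock()  # the locking suggested above

    def insert_if_new(self, conn, key, row):
        with self._lock:
            if key in self._seen:
                return False                    # seen recently: skip insert
            self._seen[key] = True
            if len(self._seen) > self._capacity:
                self._seen.popitem(last=False)  # evict the oldest entry
        cur = conn.cursor()
        cur.execute("INSERT INTO my_table (id, x, y) VALUES (?, ?, ?)", row)
        conn.commit()
        return True
```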
If you have multiple processes writing to the database, it is non-trivial to maintain perfect synchronization between in-memory data and the database. In fact, the only way to confirm you are synchronized is by making a SELECT query against the database. So you have a trade-off between perfect synchronization and a much more efficient synchronization with some tolerance.
My suggestions, which may help in both cases, would be:
Tune your SELECT queries. Add indexes if necessary.
Create metadata, like version numbers, so that you only have to check something very trivial to determine whether you need to synchronize.
Write a stored proc which implements your SELECT and INSERT logic. Then you do not have to worry about making multiple calls to the database.
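As a sketch of those last two suggestions combined, here is a single round trip that checks and inserts on the server side; names are invented, and a real stored procedure would simply wrap the same statement:

```python
def insert_if_missing(conn, row_id, x, y):
    # One round trip instead of a SELECT followed by an INSERT. Under
    # concurrency you'd still want a unique key on id as a backstop.
    cur = conn.cursor()
    cur.execute(
        "IF NOT EXISTS (SELECT 1 FROM my_table WHERE id = ?) "
        "    INSERT INTO my_table (id, x, y) VALUES (?, ?, ?)",
        row_id, row_id, x, y)
    conn.commit()
```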
What is the best way to paginate an FTS query? LIMIT and OFFSET spring to mind. However, I am concerned that by using limit and offset I'd be running the same query over and over (i.e., once for page 1, another time for page 2, etc.).
Will PostgreSQL be smart enough to transparently cache the query result, thus satisfying the subsequent pagination queries from a cache? If not, how do I paginate efficiently?
Edit: The database is for single-user desktop analytics, but I still want to know what the best way would be if this were a live OLTP application. I have addressed the problem in the past with SQL Server by creating an ordered set of document ids and caching the query parameters against the IDs in a separate table, clearing the cache every few hours (so as to allow new documents to enter the result set).
Perhaps this approach is viable for Postgres too, but I still want to know what mechanisms are present in the database and how best to leverage them. If I were a DB developer, I'd make the query-response cache work with the FTS system.
A server-side SQL cursor can be effectively used for this if a client session can be tied to a specific db connection that stays open during the entire session. This is because cursors cannot be shared between different connections. But if it's a desktop app with a unique connection per running instance, that's fine.
The doc for DECLARE CURSOR explains how the resultset is going to be materialized when the cursor is declared WITH HOLD in a committed transaction.
Locking shouldn't be a concern at all. Should the data be modified while the cursor is already materialized, it wouldn't affect the reader nor block the writer.
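A minimal psycopg2 sketch of that approach; the documents table, tsv column, and cursor name are all invented:

```python
import psycopg2

conn = psycopg2.connect("dbname=docs")  # connection details invented

def open_search_cursor(query):
    cur = conn.cursor()
    # WITH HOLD materializes the result set when the transaction commits,
    # so the cursor outlives the transaction and can be paged through later.
    cur.execute(
        "DECLARE search_results CURSOR WITH HOLD FOR "
        "SELECT id, title FROM documents "
        "WHERE tsv @@ plainto_tsquery(%s) ORDER BY id", (query,))
    conn.commit()  # the result set is materialized here

def fetch_page(page_size=20):
    cur = conn.cursor()
    cur.execute(f"FETCH FORWARD {int(page_size)} FROM search_results")
    return cur.fetchall()
```

The cursor lives until the session ends or you issue CLOSE search_results, so it needs explicit cleanup.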
Other than that, there is no implicit query cache in PostgreSQL. The LIMIT/OFFSET technique implies a new execution of the query for each page, which may be as slow as the initial query depending on the complexity of the execution plan and the effectiveness of the buffer cache and disk cache.
Well, to be honest, what you may want is for your query to return a live cursor that you can then reuse to fetch portions of the results it represents. I don't know if PostgreSQL supports this; MongoDB does, and I've tried going down that road, but it's not great. For example: do you know how much time will pass between when a query is run and when a second page of its results is requested? Can the cursor stay open for that amount of time? And if it can, what does that mean exactly - will it hold resources, so that if you have many lazy users who start queries but take a long time to navigate through the pages, your server might get bogged down by open cursors?
Honestly, I think re-running the paginated query each time someone asks for a certain page is OK. First of all, you'll be returning a small number of entries (no need to display more than 10-20 at a time), which is going to be pretty fast; and second, you should tune your server so that it executes frequent requests quickly (add indexes, put it behind a Solr server if necessary, etc.) rather than cache slow queries.
Finally, if you really want to speed up full-text searches, with fancy indexing features such as case-insensitive, prefix, and suffix matching, you should take a look at Lucene, or better yet Solr (which is Lucene on steroids), as a search and indexing layer between your users and your persistence tier.
In an app, Users and Cases have a many-to-many relationship. Users pull their list of Cases often, Users can update a single case at a time (a 1-10 second operation, requiring more than one UPDATE). Under READCOMMITTED, any in-use Case would block all associated Users from pulling their list of Cases. Also, the most recent data is a hotspot for both reads and writes to the Cases table.
I think I want to employ dirty reads to keep the experience snappy. READPAST on Cases won't work for this purpose. NOLOCK will work, but I'd like to be able to show which records are dirty when they are listed.
I don't know of any native way to show which records are dirty, so I'm thinking that for each update or insert to Cases, an INUSE flag will be set. This flag must be cleared by the end of the updating transaction such that under READCOMMITTED, this flag will never appear to be set. Note that this is NOT to replace concurrency management, only to show which records are potentially dirty to the User.
My question is whether this is reliable - if we UPDATE two or more fields (INUSE plus the other fields) in a single statement, is it possible that a concurrent NOLOCK query would read some of the new values but not others? If so, is it possible to guarantee that INUSE be set first?
And if I'm thinking about this all wrong, please enlighten me. My ideal situation would be to, in a manageable way, be able to show the values as they were PRIOR to any related transaction, so the data is immediately available and always consistent (but partially outdated). But I don't think this is available - especially in the more complex actual database.
Thanks!
Restating the problem just to be sure: User A on connection A updates two columns (col1, col2) in MyTable. While this is going on, user B on connection B issues a dirty read, selecting data from that row. You are wondering if user B could get, say, the updated value in col1 AND the old/not updated value in col2. Correct?
I have to say: no way could this happen. As I understand it, an update is indeed an atomic operation, and if you're writing data to the page (in memory), the entire row update would have to finish on that set of bytes before anything else (another thread) could get access to them.
But I don't know for sure, and I can't imagine how to set up a test to confirm or deny this. The only answer I'd rely on would have to come from someone who actually had a hand in writing the code, or perhaps a Microsoft technician who has similar access. If you don't get any good answers here, posting the question on the appropriate MSDN forum (link) might get a good answer.
Have you considered using SNAPSHOT isolation level? When used for a query, it requires no locks whatsoever, and it gives precisely the semantics that you're asking for:
show the values as they were PRIOR to any related transaction so the data is immediately available and always consistent (but partially out-dated)
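A hedged sketch of what that would look like with pyodbc; the Cases/UserCases schema below is invented:

```python
import pyodbc

# Assumes snapshot isolation has been enabled once per database with:
#   ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;

def list_cases(conn, user_id):
    cur = conn.cursor()
    cur.execute("SET TRANSACTION ISOLATION LEVEL SNAPSHOT")
    # Takes no shared locks: never blocks on, nor is blocked by, the
    # 1-10 second Case updates; sees the last committed version of each
    # row as of the start of the transaction.
    rows = cur.execute(
        "SELECT c.* FROM Cases c "
        "JOIN UserCases uc ON uc.CaseId = c.Id "
        "WHERE uc.UserId = ?", user_id).fetchall()
    conn.commit()
    return rows
```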