Is there an extra cost to cache misses on Redis - redis

Is there an advantage to set a default value for an entry that will be heavily queried in Redis or will querying for the unset key take the same time?
Given the keys are stored in a distributed hash, it will have to check that the key is not in the bucket before returning on a miss, which may be a bit slower than finding and stopping at a hit. Is the bucket sorted of linear? Does anything else make it slower either way?
Redis is setup in a cluster and has many million entries in this case.

I'm assuming you're just talking about strings & hashes here here (so the only operations you care about are set/get, maybe hget/hset) - From Redis' perspective, a cache hit and cache miss have the same time complexity, if anything, a cache miss will be faster because redis will not have to transfer any data back over the socket to your app.

Related

Making sure (distributed) caches only store the latest value in a distributed system

Let's say I want to use a built-in solution such as Redis or Memcached to cache database rows (as an example), to avoid recurrent costly trips to the database.
For the sake of the argument, let's assume I have a TABLE(id, x, y) and that I want to cache all rows so I never have to read directly from the database.
Questions:
Consider the following case: NodeA tries to update a given row's field x while NodeB tries to update y, then both simultaneously try to update the cache line. If they try to "manually" update the field they just changed to the row in the cache, if we follow the typical last-write-wins, one of the fields is going to be discarded, which is catastrophic. This makes me think I need to always fill the cache's rows with a full row read from the database.
But this by itself won't necessarily help me. If NodeA writes to x and loads the entire row in memory and then NodeB writes to y and reads the entire row in memory, if NodeB writes to the cache before NodeA then NodeB's changes will be overwritten! This makes me believe I need to always somehow version the rows both in the DB and in the cache. Is this the case? Memcached seems to have a compare and set primitive, but I see no such thing in Redis.
Even if 1. and 2. are not an issue, I still need to guarantee that my write / read has read-after-write consistency, otherwise it may happen that what I'm reading and intending to put in the cache is not necessarily the most up-to-date version. If that's the case, how can I make sure of this? By requiring w + r > n?
This seems to be a very common use-case, I'd guess it's pretty much a solved problem. What can I try to resolve this?
Key value stores as redis support advance data structures, such as HASHs.
If you're doing partial updates to cached entities (only a set of fields is updated as part of the super set), and given your goal is to avoid time-consuming database reads, simply save the table entry as a HASH K/V pairs (using HSET) and the use HGETALL for reading.
Redis OPS are atomic by nature, so that should solve your problems, if I got them right.
On a side note, if you're caching an entire entity yet doing partial updates, you should consider a simpler caching approach, such as read-through (making cache validity a reader-only concern).
As opposed to Database accesses. Redis cache access from different location unless somehow serialized, will always have the potential of being out of order when it comes to distributed systems, as there's always the execution environment (network, threading) to introduce possible delays.
Doing read-through caching will ensure data is always updated after the most recent write without the need to synchronize anything else.
This is how Facebook solved the issue with Memcached: http://nil.csail.mit.edu/6.824/2020/papers/memcache-faq.txt.
The idea is to use the concept of a lease: when a request for a cached value is received and there is no data for such key, a lease token (64 bits id) is returned.
When the webserver fetches the data from the database it can then store the data in the cache with that token. Every time an invalidation request is invoked on a key, a new lease token is created, and as such, if a put is attempted for an old token, the put ends up rejected.
As far as I understand, it's not really possible to (easily) replicate this behavior with Redis without resorting to LUA scripts.

Redis: Dump db and delete dumped key / value pairs

I have multiple servers that all store set members in a shared Redis cache. When the cache fills up, I need to persist the data to disk to free up RAM. I then plan to parse the dumped data such that I will be able to combine all of the values that belong to a given key in MongoDB.
My first plan was to have each server process attempt an sadd operation. If the request fails because Redis has reached maxmemory, I planned to query for each of my set keys, and write each to disk.
However, I am wondering if there is a way to use one of the inbuilt persistence methods in Redis to write the Redis data to disk and delete the key/value pairs after writing. If this is possible I could just parse the rdb dump and work with the data in that fashion. I'd be grateful for any help others can offer on this question.
Redis' persistence is meant to be used for whatever's in the RAM. Put differently, you can't persist what ain't in RAM.
To answer your question: no, you can't use persistence to "offload" data from RAM.

Redis Persistence Partial

I have multiple keys in redis most of which are insignificant and can be lost in case my redis server goes down.
However I have one or two keys, which I cannot afford to lose.
Hence I would like that whenever the server restarts, redis reads only these few keys from its persistent storage, and keep on persisting these as and when they change.
Does redis have this feature? If yes what command makes a key persisted to file and how to differ b/w persisted and unpersisted keys.
If no, What approach can I use such that I need not make my own persistent file before writing to Redis
Limitations(If the answer is no)
I do not want to change client code that enters in redis.
I cannot add more servers to redis(if any such solution exists, would like to know about it though).
EDIT
Another reason I would not want to save most keys as persistence because it is huge data, hundreds of records per second- Most of which expires in 10 minutes.

Does Redis persist data?

I understand that Redis serves all data from memory, but does it persist as well across server reboot so that when the server reboots it reads into memory all the data from disk. Or is it always a blank store which is only to store data while apps are running with no persistence?
I suggest you read about this on http://redis.io/topics/persistence . Basically you lose the guaranteed persistence when you increase performance by using only in-memory storing. Imagine a scenario where you INSERT into memory, but before it gets persisted to disk lose power. There will be data loss.
Redis supports so-called "snapshots". This means that it will do a complete copy of whats in memory at some points in time (e.g. every full hour). When you lose power between two snapshots, you will lose the data from the time between the last snapshot and the crash (doesn't have to be a power outage..). Redis trades data safety versus performance, like most NoSQL-DBs do.
Most NoSQL-databases follow a concept of replication among multiple nodes to minimize this risk. Redis is considered more a speedy cache instead of a database that guarantees data consistency. Therefore its use cases typically differ from those of real databases:
You can, for example, store sessions, performance counters or whatever in it with unmatched performance and no real loss in case of a crash. But processing orders/purchase histories and so on is considered a job for traditional databases.
Redis server saves all its data to HDD from time to time, thus providing some level of persistence.
It saves data in one of the following cases:
automatically from time to time
when you manually call BGSAVE command
when redis is shutting down
But data in redis is not really persistent, because:
crash of redis process means losing all changes since last save
BGSAVE operation can only be performed if you have enough free RAM (the amount of extra RAM is equal to the size of redis DB)
N.B.: BGSAVE RAM requirement is a real problem, because redis continues to work up until there is no more RAM to run in, but it stops saving data to HDD much earlier (at approx. 50% of RAM).
For more information see Redis Persistence.
It is a matter of configuration. You can have none, partial or full persistence of your data on Redis. The best decision will be driven by the project's technical and business needs.
According to the Redis documentation about persistence you can set up your instance to save data into disk from time to time or on each query, in a nutshell. They provide two strategies/methods AOF and RDB (read the documentation to see details about then), you can use each one alone or together.
If you want a "SQL like persistence", they have said:
The general indication is that you should use both persistence methods if you want a degree of data safety comparable to what PostgreSQL can provide you.
The answer is generally yes, however a fuller answer really depends on what type of data you're trying to store. In general, the more complete short answer is:
Redis isn't the best fit for persistent storage as it's mainly performance focused
Redis is really more suitable for reliable in-memory storage/cacheing of current state data, particularly for allowing scalability by providing a central source for data used across multiple clients/servers
Having said this, by default Redis will persist data snapshots at a periodic interval (apparently this is every 1 minute, but I haven't verified this - this is described by the article below, which is a good basic intro):
http://qnimate.com/redis-permanent-storage/
TL;DR
From the official docs:
RDB persistence [the default] performs point-in-time snapshots of your dataset at specified intervals.
AOF persistence [needs to be explicitly configured] logs every write operation received by the server, that will be played again at server startup, reconstructing the
original dataset.
Redis must be explicitly configured for AOF persistence, if this is required, and this will result in a performance penalty as well as growing logs. It may suffice for relatively reliable persistence of a limited amount of data flow.
You can choose no persistence at all.Better performance but all the data lose when Redis shutting down.
Redis has two persistence mechanisms: RDB and AOF.RDB uses a scheduler global snapshooting and AOF writes update to an apappend-only log file similar to MySql.
You can use one of them or both.When Redis reboots,it constructes data from reading the RDB file or AOF file.
All the answers in this thread are talking about the possibility of redis to persist the data: https://redis.io/topics/persistence (Using AOF + after every write (change)).
It's a great link to get you started, but it is defenently not showing you the full picture.
Can/Should You Really Persist Unrecoverable Data/State On Redis?
Redis docs does not talk about:
Which redis providers support this (AOF + after every write) option:
Almost none of them - redis labs on the cloud does NOT provide this option. You may need to buy the on-premise version of redis-labs to support it. As not all companies are willing to go on-premise, then they will have a problem.
Other Redis Providers does not specify if they support this option at all. AWS Cache, Aiven,...
AOF + after every write - This option is slow. you will have to test it your self on your production hardware to see if it fits your requirements.
Redis enterpice provide this option and from this link: https://redislabs.com/blog/your-cloud-cant-do-that-0-5m-ops-acid-1msec-latency/ let's see some banchmarks:
1x x1.16xlarge instance on AWS - They could not achieve less than 2ms latency:
where latency was measured from the time the first byte of the request arrived at the cluster until the first byte of the ‘write’ response was sent back to the client
They had additional banchmarking on a much better harddisk (Dell-EMC VMAX) which results < 1ms operation latency (!!) and from 70K ops/sec (write intensive test) to 660K ops/sec (read intensive test). Pretty impresive!!!
But it defenetly required a (very) skilled devops to help you create this infrastructure and maintain it over time.
One could (falsy) argue that if you have a cluster of redis nodes (with replicas), now you have full persistency. this is false claim:
All DBs (sql,non-sql,redis,...) have the same problem - For example, running set x 1 on node1, how much time it takes for this (or any) change to be made in all the other nodes. So additional reads will receive the same output. well, it depends on alot of fuctors and configurations.
It is a nightmare to deal with inconsistency of a value of a key in multiple nodes (any DB type). You can read more about it from Redis Author (antirez): http://antirez.com/news/66. Here is a short example of the actual ngihtmare of storing a state in redis (+ a solution - WAIT command to know how much other redis nodes received the latest change change):
def save_payment(payment_id)
redis.rpush(payment_id,”in progress”) # Return false on exception
if redis.wait(3,1000) >= 3 then
redis.rpush(payment_id,”confirmed”) # Return false on exception
if redis.wait(3,1000) >= 3 then
return true
else
redis.rpush(payment_id,”cancelled”)
return false
end
else
return false
end
The above example is not suffeint and has a real problem of knowing in advance how much nodes there actually are (and alive) at every moment.
Other DBs will have the same problem as well. Maybe they have better APIs but the problem still exists.
As far as I know, alot of applications are not even aware of this problem.
All in all, picking more dbs nodes is not a one click configuration. It involves alot more.
To conclude this research, what to do depends on:
How much devs your team has (so this task won't slow you down)?
Do you have a skilled devops?
What is the distributed-system skills in your team?
Money to buy hardware?
Time to invest in the solution?
And probably more...
Many Not well-informed and relatively new users think that Redis is a cache only and NOT an ideal choice for Reliable Persistence.
The reality is that the lines between DB, Cache (and many more types) are blurred nowadays.
It's all configurable and as users/engineers we have choices to configure it as a cache, as a DB (and even as a hybrid).
Each choice comes with benefits and costs. And this is NOT an exception for Redis but all well-known Distributed systems provide options to configure different aspects (Persistence, Availability, Consistency, etc). So, if you configure Redis in default mode hoping that it will magically give you highly reliable persistence then it's team/engineer fault (and NOT that of Redis).
I have discussed these aspects in more detail on my blog here.
Also, here is a link from Redis itself.

How to share the APC user cache between CLI and Web Server instances?

I am using PHP's APC to store a large amount of information (with apc_fetch(), etc.). This information occasionally needs analyzed and dumped elsewhere.
The story goes, I'm getting several hundred hits/sec. These hits increase various counters (with apc_inc(), and friends). Every hour, I would like to iterate over all the values I've accumulated, and do some other processing with them, and then save them on disk.
I could do this as a random or time-based switch in each request, but it's a potentially long operation (may require 20-30 sec, if not several minutes) and I do not want to hang a request for that long.
I thought a simple PHP cronjob would do the task. However, I can't even get it to read back cahe information.
<?php
print_r(apc_cache_info());
?>
Yeilds a seemingly different APC memory segment, with:
[num_entries] => 1
(The single entry seems to be a opcode cache of itself)
While my webserver, powered by nginx/php5-fpm, yields:
[num_entries] => 3175
So, they are obviously not sharing the same chunk of memory. How can I either access the same chunk of memory in the CLI script (preferred), or if that is simply not possible, what would be the absolute safest way to execute a long running sequence on say, a random HTTP request every hour?
For the latter, would using register_shutdown_function() and immediately set_time_limit(0) and ignore_user_abort(true) do the trick to ensure execution completes and doesn't "hang" anyone's browser?
And yes, I am aware of redis, memcache, etc that would not have this problem, but I am stuck to APC for now as neither could demonstrate the same speed as APC.
This is really a design issue and a matter of selecting preferred costs vs. payoffs.
You are thrilled by the speed of APC since you do not spend time to persist the data. You also want to persist the data but now the performance hit is too big. You have to balance these out somehow.
If persistence is important, take the hit and persist (file, DB, etc.) on every request. If speed is all you care about, change nothing - this whole question becomes moot. There are cache systems with persistent storage that can optimize your disk writes by aggregating what gets written to disk and when but you will generally always have a payoff between the two with varying tipping points. You just have to choose which of those suits your objectives.
There will probably never exist an enduring, wholesome technological solution to the wolf being sated and the lamb being whole.
If you really must do it your way, you could have a cron that CURLs a special request to your application which would trigger persisting your cache to disk. That way you control the request, its timeout, etc., and don't have to worry about everything users might do to kill their requests.
Potential risks in this case, however, are data integrity (as you will be writing the cache to disk while it is being updated by other requests in the meantime) as well as requests being served while you are persisting the cache paying the performance hit of your server being busy.
Essentially, we introduced a bundle of hay to the wolf/lamb dilemma ;)