I am planning to implement a caching layer in my application using Redis. Right now the application fetches a huge amount of data from the DB whenever a user initiates a certain plan load. Behind the scenes, this plan load triggers a few heavyweight data accesses and orchestrates all the calls into a final result.
Data access currently happens through JPA repositories against my Oracle DB. When I introduced the Redis layer, it did not populate the cache on first access; instead, the application tried to fetch data from the empty cache.
My questions are:
Would my design work, given that I want to keep the CRUD operations as they are in the JPA repositories? I just want to introduce Redis for caching, not for CRUD operations.
I have a huge amount of data (probably 2 GB) that should sit in the cache layer. What is the maximum amount of data Redis can hold?
My questions are:
Would my design work, given that I want to keep the CRUD operations as they are in the JPA repositories? I just want to introduce Redis for caching, not for CRUD operations.
It is going to work; however, your main problem will be cache invalidation.
When you do a CRUD operation, your Redis cache will still hold the old data, so you will have inconsistency. The general way of using Redis as a cache is to set a TTL (time-to-live) on each key. But you can also address the inconsistency by evicting the corresponding key in Redis whenever you perform a CRUD operation.
Depending on your workload, you may run into a low cache hit rate.
For example, if you rarely access the keys in the cache, then all of them will have expired before the next access. Frankly, the cache will not work effectively in this case. This can be avoided by warming the cache, or by using Redis not as a cache but as secondary storage with data replicated into it.
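To make the TTL-plus-eviction idea above concrete, here is a minimal cache-aside sketch in Python. For illustration it uses a plain dict with expiry timestamps in place of a real Redis client (with redis-py the equivalent calls would be `get`, `setex`, and `delete`); the `plan:<id>` key format and the 300-second TTL are assumptions, not anything from the question.

```python
import time

class TtlCache:
    """Stand-in for a Redis client: dict entries carry an expiry time."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired, behaves like a Redis TTL
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, time.monotonic() + ttl)

    def delete(self, key):
        self._store.pop(key, None)

cache = TtlCache()

def load_plan(plan_id, fetch_from_db):
    """Cache-aside read: try the cache first, fall back to the DB."""
    key = f"plan:{plan_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    plan = fetch_from_db(plan_id)  # the existing JPA/DB call goes here
    cache.set(key, plan, ttl=300)  # illustrative 5-minute TTL
    return plan

def update_plan(plan_id, save_to_db, new_plan):
    """Write path: update the DB, then evict the stale cache entry."""
    save_to_db(plan_id, new_plan)
    cache.delete(f"plan:{plan_id}")  # next read repopulates from the DB
```

The key point is that writes go through the DB as before; the cache is only ever evicted, never written to as the source of truth.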
I have a huge amount of data (probably 2 GB) that should sit in the cache layer. What is the maximum amount of data Redis can hold?
Redis is extremely efficient and is limited by your physical resources (RAM) and by the maximum size of a key and of the value stored under a key, which is 512 MB.
You also have to account for the fact that Redis fragments data in memory, so your source 2 GB of data, represented as keys and values, can end up occupying 3 GB of RAM.
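On sizing: if the instance is dedicated to caching, Redis can be told how much memory it may use and what to evict when the limit is reached. A sketch of the relevant redis.conf directives (the 3gb figure just follows the rough estimate above):

```conf
# Cap Redis memory usage; leave headroom above the raw 2 GB of data
# for per-key overhead and fragmentation.
maxmemory 3gb

# When the limit is hit, evict the least-recently-used keys.
# allkeys-lru suits a pure cache; volatile-lru only evicts keys with a TTL.
maxmemory-policy allkeys-lru
```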
I understand that Redis serves all data from memory, but does it also persist across server reboots, so that when the server reboots it reads all the data from disk into memory? Or is it always a blank store that only holds data while the apps are running, with no persistence?
I suggest you read about this at http://redis.io/topics/persistence . Basically, you lose the guaranteed persistence when you increase performance by storing only in memory. Imagine a scenario where you INSERT into memory but lose power before it gets persisted to disk: there will be data loss.
Redis supports so-called "snapshots". This means that it makes a complete copy of what is in memory at certain points in time (e.g. every full hour). If you lose power between two snapshots, you lose the data from the time between the last snapshot and the crash (which doesn't have to be a power outage...). Redis trades data safety for performance, like most NoSQL DBs do.
Most NoSQL databases follow a concept of replication among multiple nodes to minimize this risk. Redis is considered more of a speedy cache than a database that guarantees data consistency. Therefore its use cases typically differ from those of real databases:
You can, for example, store sessions, performance counters, or whatever in it with unmatched performance and no real loss in case of a crash. But processing orders/purchase histories and so on is considered a job for traditional databases.
The Redis server saves all its data to disk from time to time, thus providing some level of persistence.
It saves data in one of the following cases:
automatically from time to time
when you manually call BGSAVE command
when redis is shutting down
But data in Redis is not truly persistent, because:
a crash of the Redis process means losing all changes since the last save
a BGSAVE operation can only be performed if you have enough free RAM (the amount of extra RAM needed is equal to the size of the Redis DB)
N.B.: The BGSAVE RAM requirement is a real problem, because Redis keeps working until there is no more RAM to run in, but it stops saving data to disk much earlier (at approx. 50% of RAM).
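The "automatically from time to time" case above is driven by the save directives in redis.conf. As a sketch, these are the long-standing default snapshot trigger points:

```conf
# Take an RDB snapshot if at least N changes occurred within M seconds.
save 900 1       # after 900 s if at least 1 key changed
save 300 10      # after 300 s if at least 10 keys changed
save 60 10000    # after 60 s if at least 10000 keys changed

# File the snapshot is written to.
dbfilename dump.rdb
```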
For more information see Redis Persistence.
It is a matter of configuration. You can have no, partial, or full persistence of your data in Redis. The best decision will be driven by the project's technical and business needs.
According to the Redis documentation about persistence, you can, in a nutshell, set up your instance to save data to disk from time to time or on each query. They provide two strategies/methods, AOF and RDB (read the documentation for the details about them); you can use each one alone or both together.
If you want "SQL-like persistence", they say:
The general indication is that you should use both persistence methods if you want a degree of data safety comparable to what PostgreSQL can provide you.
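As a sketch, enabling AOF alongside the default RDB snapshots looks like this in redis.conf; appendfsync everysec is the usual compromise, while always gives the strongest durability at a notable write-performance cost:

```conf
# Turn on the append-only file in addition to RDB snapshots.
appendonly yes
appendfilename "appendonly.aof"

# How often to fsync the AOF:
#   always   - fsync on every write (safest, slowest)
#   everysec - fsync once per second (good compromise, the default)
#   no       - let the OS decide (fastest, least safe)
appendfsync everysec
```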
The answer is generally yes, however a fuller answer really depends on what type of data you're trying to store. In general, the more complete short answer is:
Redis isn't the best fit for persistent storage as it's mainly performance focused
Redis is really more suitable for reliable in-memory storage/caching of current-state data, particularly for allowing scalability by providing a central source for data used across multiple clients/servers
Having said this, by default Redis will persist data snapshots at a periodic interval (apparently this is every 1 minute, but I haven't verified it; this is described in the article below, which is a good basic intro):
http://qnimate.com/redis-permanent-storage/
TL;DR
From the official docs:
RDB persistence [the default] performs point-in-time snapshots of your dataset at specified intervals.
AOF persistence [needs to be explicitly configured] logs every write operation received by the server, which will be played again at server startup, reconstructing the original dataset.
Redis must be explicitly configured for AOF persistence, if this is required, and this results in a performance penalty as well as growing logs. It may suffice for relatively reliable persistence of a limited volume of writes.
You can also choose no persistence at all: better performance, but all the data is lost when Redis shuts down.
Redis has two persistence mechanisms: RDB and AOF. RDB takes scheduled global snapshots, and AOF writes updates to an append-only log file, similar to MySQL.
You can use either of them or both. When Redis reboots, it reconstructs the data by reading the RDB file or the AOF file.
All the answers in this thread talk about the possibility of Redis persisting the data: https://redis.io/topics/persistence (using AOF + fsync after every write (change)).
It's a great link to get you started, but it is definitely not showing you the full picture.
Can/Should You Really Persist Unrecoverable Data/State On Redis?
The Redis docs do not talk about:
Which Redis providers support this (AOF + fsync on every write) option:
Almost none of them. Redis Labs in the cloud does NOT provide this option. You may need to buy the on-premise version of Redis Labs to get it. As not all companies are willing to go on-premise, they will have a problem.
Other Redis providers do not specify whether they support this option at all: AWS Cache, Aiven, ...
AOF + fsync on every write is slow. You will have to test it yourself on your production hardware to see if it fits your requirements.
Redis Enterprise provides this option, and from this link: https://redislabs.com/blog/your-cloud-cant-do-that-0-5m-ops-acid-1msec-latency/ let's look at some benchmarks:
1x x1.16xlarge instance on AWS: they could not achieve less than 2 ms latency:
where latency was measured from the time the first byte of the request arrived at the cluster until the first byte of the 'write' response was sent back to the client
They ran an additional benchmark on a much better hard disk (Dell-EMC VMAX), which achieved < 1 ms operation latency (!!) and from 70K ops/sec (write-intensive test) to 660K ops/sec (read-intensive test). Pretty impressive!
But it definitely requires a (very) skilled devops engineer to help you create this infrastructure and maintain it over time.
One could (falsely) argue that if you have a cluster of Redis nodes (with replicas), you now have full persistence. This is a false claim:
All DBs (SQL, NoSQL, Redis, ...) have the same problem. For example, after running set x 1 on node1, how much time does it take for this (or any) change to reach all the other nodes, so that subsequent reads receive the same output? Well, it depends on a lot of factors and configurations.
It is a nightmare to deal with inconsistency of a key's value across multiple nodes (in any DB type). You can read more about it from the Redis author (antirez): http://antirez.com/news/66. Here is a short example of the actual nightmare of storing state in Redis (plus a solution: the WAIT command, to learn how many other Redis nodes received the latest change):
def save_payment(payment_id)
  redis.rpush(payment_id, "in progress") # returns false on exception
  if redis.wait(3, 1000) >= 3
    redis.rpush(payment_id, "confirmed") # returns false on exception
    if redis.wait(3, 1000) >= 3
      return true
    else
      redis.rpush(payment_id, "cancelled")
      return false
    end
  else
    return false
  end
end
The above example is not sufficient and has a real problem: knowing in advance how many nodes actually exist (and are alive) at any given moment.
Other DBs have the same problem as well. Maybe they have better APIs, but the problem still exists.
As far as I know, a lot of applications are not even aware of this problem.
All in all, picking more DB nodes is not a one-click configuration. It involves a lot more.
To conclude this research, what to do depends on:
How many devs does your team have (so this task won't slow you down)?
Do you have a skilled devops engineer?
What are the distributed-systems skills on your team?
Money to buy hardware?
Time to invest in the solution?
And probably more...
Many not-so-well-informed and relatively new users think that Redis is only a cache and NOT an ideal choice for reliable persistence.
The reality is that the lines between DB, Cache (and many more types) are blurred nowadays.
It's all configurable and as users/engineers we have choices to configure it as a cache, as a DB (and even as a hybrid).
Each choice comes with benefits and costs. This is NOT unique to Redis: all well-known distributed systems provide options to configure different aspects (persistence, availability, consistency, etc.). So, if you configure Redis in its default mode hoping that it will magically give you highly reliable persistence, then that is the team's/engineer's fault (and NOT Redis's).
I have discussed these aspects in more detail on my blog here.
Also, here is a link from Redis itself.
According to this link from the JBoss documentation, I understand that Infinispan is a better product than JBoss Cache and a kind of improvement on it, which is why they recommend migrating from JBoss Cache to Infinispan, which is supported by JBoss as well. Am I right in what I understood? If not, what are the differences?
One more question: talking about replication and distribution, can either one be better than the other depending on the need?
Thank you
Question:
Talking about replication and distribution, can any one of them be better than the other according to the need?
Answer:
I am quoting directly from Clustering modes - Infinispan:
Distributed:
Number of copies represents the tradeoff between performance and durability of data
The more copies you maintain, the lower performance will be, but also the lower the risk of losing data due to server outages
Uses a consistent hash algorithm to determine where in the cluster entries should be stored
No need to replicate data to each node, which takes more time than just communicating the hash code
Best suited when the number of nodes is high
Best suited when the size of the data stored in the cache is high
Replicated:
Entries added to any of these cache instances will be replicated to all other cache instances in the cluster
This clustered mode provides a quick and easy way to share state across a cluster
Replication practically only performs well in small clusters (under 10 servers), due to the number of replication messages that need to happen as the cluster size increases
Practical Experience:
We are using the Infinispan cache in a live application running on a JBoss server with 8 nodes. Initially we used a replicated cache, but it took much longer to respond due to the large size of the data. Eventually we went back to distributed, and now it's working fine.
Use a replicated or distributed cache only for data specific to a user session. If the data is common regardless of the user, prefer a local cache that is created separately on each node.
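For reference, the three modes discussed above are selected in the Infinispan configuration file. A minimal sketch (element names follow the newer Infinispan XML schema, and the cache names here are made up for illustration):

```xml
<infinispan>
  <cache-container name="app-caches">
    <!-- Distributed: each entry kept on 'owners' nodes, placed by consistent hashing -->
    <distributed-cache name="plan-cache" owners="2" mode="SYNC"/>

    <!-- Replicated: every entry copied to every node; fine for small clusters and small data -->
    <replicated-cache name="config-cache" mode="SYNC"/>

    <!-- Local: per-node cache, no clustering traffic -->
    <local-cache name="session-scratch"/>
  </cache-container>
</infinispan>
```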
At present we are using the KahaDB store for message persistence in ActiveMQ, and so far so good.
As per the release notes of ActiveMQ 5.6, LevelDB provides enhanced performance.
Has anyone tried using LevelDB? If so, could you share the pros and cons?
FYI: Here's a link to the official docs for the ActiveMQ LevelDB Store
Cons:
It's a brand new store, so may still have some bugs left in it.
LevelDB indexes need to 'compact' occasionally, which MIGHT stall new writes.
You can't just delete the index and rebuild it from the data files like you can with KahaDB.
KahaDB handles disk corruption much more gracefully, recovering what it can and discarding corrupted records.
Pros:
Append-mostly disk access patterns improve performance on rotational disks.
Fewer disk syncs than KahaDB
Fewer index entries need to be inserted per message stored
Fewer index lookups needed to load a message from disk into memory
Uses Snappy compression to reduce the on disk size of index entries
Optional Snappy compression of data logs.
A send to a composite destination only stores the message on disk once.
Faster and more frequent data file GCs.
Has a 'Replicated' variation where it can replicate itself to 'slave' brokers to ensure message-level HA.
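Switching stores is a one-element change in activemq.xml. A sketch, assuming an otherwise standard broker configuration (the directory path is just an example):

```xml
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="broker1">
  <persistenceAdapter>
    <!-- Replace the default <kahaDB .../> adapter with LevelDB -->
    <levelDB directory="${activemq.data}/leveldb"/>
  </persistenceAdapter>
</broker>
```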
We've been using the LevelDB store for a month or two now in production on NFS (with standard file-lock failover configured). We've had a corrupt store several times in the last few weeks, with no errors in the logs... just queues piling up and very low throughput. The only thing we could do to resolve this was to throw away the store and start over.
So we've switched back to the old and reliable KahaDB store for now.
Please see this link: https://github.com/fusesource/fuse-extra/tree/master/fusemq-leveldb#how-to-use-with-activemq-56
There's a small comparison of LevelDB vs. KahaDB.
I am currently trying it out on a system with high message throughput, and I see better results already. I still need to see if it is stable, but so far so good.
Most of the performance claims made for LevelDB appear to be empty. It is supposed to support highly concurrent reads, but multi-threaded testing shows no concurrency gains. https://github.com/ayende/raven.voron/pull/9#issuecomment-29764803
(In contrast, LMDB shows perfectly linear read performance gains across multiple CPUs. https://github.com/ayende/raven.voron/pull/9#issuecomment-29780359 )
I did extensive testing of AMQ performance and was not able to find any statistically significant difference between LevelDB and KahaDB in my tests: http://whywebsphere.com/2015/03/12/ibm-mq-vs-apache-activemq-performance-comparison-update/
I have a requirement to keep the RavenDB database running when the disk holding the main database and index storage is full. I know I can provide a drive for index storage with the config option Raven/IndexStoragePath.
But I need to design for the corner case of this disk becoming full. What is the usual pattern in this situation? One way is to halt all access while shutting the service down, updating the config file programmatically, and then starting the service again, but that is a bit risky.
I am aware of sharding, and this question is not related to that; assume that sharding is enabled, I have multiple shards, and I want to increase storage for each shard by adding a new drive to each. Is there an elegant solution to this?
user544550,
In a disk-full scenario, RavenDB will continue to operate, but it will refuse to accept further writes.
Indexing will fail as well, and eventually the indexes will be marked as permanently failing.
What is your actual scenario?
Note that in RavenDB, indexes tend to be significantly smaller than the actual data size, so the major cause for disk space utilization is actually the main database, not the indexes.