Currently I'm evaluating Aerospike for my project. I require simple key/value store with strong consistency (under any circumstances no value version conflicts must occur) and maximized durability (data loss is extremely harmful) for the data size that doesn't fit into RAM. Aerospike seems to be one of the most suitable options. The only my concern is if strong consistency is really supported.
According to Aerospike whitepaper https://www.aerospike.com/docs/architecture/assets/AerospikeACIDSupport.pdf CP mode is not supported:
To
enable Aerospike to be used in more domains, we plan to add a configuration for
operating the cluster in CP mode in addition to the AP mode that is supported now
In the same time Aeospike provides different consistency guaranties http://www.aerospike.com/docs/architecture/consistency.html but it is not clear if e.g. write.commit_level=all will make inconsistencies impossible since it is more about durability rather than consistency.
So is there a way to use Aerospike cluster without value conflicts under any circumstances (such as replica failures, cluster partitioning, network latencies etc.) in single DC/region deployment? How the configuration should look like in this case?
CP mode is something that we're actively working on right now. What do you mean when you say 'strong consistency'? Right now we have conflict resolution policies such that in a split brain either the the generation of a record or the TTL (time to live) can be used to decide which record 'wins' the conflict.
Latest Aerospike Release (Ver 4.x +) has CP mode - its called Strong Consistency mode - evaluated by Jepsen. (See http://jepsen.io)
Related
I learned etcd for a few hours, but a question suddenly came into me. I found that redis is fully capable of covering functions which etcd owns.Like key/value CRUD && watch, and redis is very simple to use. why people choose etcd instead of redis?
why?
I googled a few posts, but no post told me the reason.
Thanks!
Redis stores data in memory, which makes it very high performance but not very durable. If the redis server dies, it's easy to lose data. Etcd stores data in files on disc, and performs fsync across multiple nodes before resolving to guarantee consistency, which makes it very durable but not very performant.
That's a good trade-off for kubernetes, which is using etcd for cluster state and configuration, not user data. It would not be a good trade-off for something like user session data which you might be using redis for in your app because you need extremely fast response times and can tolerate a bit of data loss or inconsistency.
A major difference which is affecting my choice of one vs the other is:
etcd keeps the data index in RAM and the data store on disk
redis keeps both data index and data store in RAM
Theoretically, this means etcd ought to be a good fit for large data / small memory scenarios, where redis would require large RAM.
In practice, etcd's current behaviour is that it allocates some memory per transaction when data is accessed. Under heavy load, the memory footprint of the etcd server balloons unboundedly (appears limited by the rate of read requests), and the Go runtime eventually OOM's, killing the server.
In contrast, the redis design requires a virtual address space sized in relation to the dataset, or to the partition of the dataset stored locally.
Memory footprint examples
Eg, with redis, a 8GB dataset partition with an index size of 0.5GB requires 8.5GB of virtual address space (ie, could be handled with 1GB of RAM and 7.5GB of swap), but not less, and the requirement has an upper bound.
The same 8GB dataset, with etcd, would require only 0.5GB of virtual address space, but not less (ie, could be handled with 500MB of RAM and no swap), in theory. In practice, under high load, etcd's memory use is unbounded.
Other considerations
There are other considerations like data consistency, or supported languages, that have to be evaluated separately.
In my case, the language the server is written in is a factor, as I have in-house C expertise, but no Go expertise. This means I can maintain/diagnose/customize redis (written in C) in-house if needed, but cannot do the same with etc (written in Go), I'd have to use it as released by the maintainers.
My conclusion
Unfortunately, the memory behaviour of etcd, whereby it needs to allocate memory to access the indexed data, negates the memory advantages it might have theoretically, and the risk of crash by OOM due to high load make it unsuitable in applications that might experience unexpected usage spikes. Github bug 14362, Github bug 14352, other OOM reports
Furthermore, the ability to customize the server in-house (ie, available C vs Go expertise) is a business consideration that weighs in redis's favour, in my case.
According to redis docs, it's advisable to disable Transparent Huge Pages.
Would the guidance be the same if the machine was shared between the redis server and the application.
Moreover, for other technologies, I've also read guidance that THP should be disabled for all production environments when setting up the server. Is this kind of pre-emptiveness applicable to redis as well, or one must first strictly monitor latency issues before deciding to turn off THP?
Turn it off. The problem lies in how THP shifts memory around to try and keep or create contiguous pages. Some applications can tolerate this, most databases cannot and it causes intermittent performance problems, some pretty bad. This is not unique to Redis by any means.
For your application, especially if it is JAVA, set up real HugePages and leave the transparent variety out of it. If you do that just make sure you alocate memory correctly for the app and redis. Though I have to say, I probably would not recommend running both the app and redis on the same instance/server/vm.
Turning off transparent hugepages is a bad idea, and redis no longer recommends it.
What you should do instead is make sure transparent_hugepage is not set to always. (This is what recent versions of redis check for.) You can check the current value of the setting with:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
And correct it like so:
# echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
Although no action is likely to be necessary, since madvise is typically the default setting in recent linux distros.
Some background:
transparent_hugepage = always: can force applications to use hugepages unless they opt out with madvise. This has a lot of problems and is rarely enabled.
transparent_hugepage = never: does not fulfill allocations with hugepages, even if the application requests it with madvise
transparent_hugepage = madvise: allows applications to opt-in to hugepages. This is normally a good idea because hugepages can improve performance in some applications, but this setting doesn't force them on applications that, like redis, don't opt in
It is rather annoying that searching for "transparent huge pages" yields top results about how to disable them because Redis and some other databases cannot handle transparent huge pages without performance degradation.
These applications should do any of:
Use prctl(PR_SET_THP_DISABLE, ...) call to opt out from using transparent huge pages.
Be started by a script which does this call for them prior to fork/exec the database process. PR_SET_THP_DISABLE get inherited by child processes/threads for exactly this scenario when an existing application cannot be modified.
prctl(PR_SET_THP_DISABLE, ...) has been available since Linux 3.15 circa 2014, so that there is little excuse for those databases to not mention this solution, instead of giving this poor/panic advice to their users to disable transparent huge pages for the entire system.
3 years after this question was asked, Redis got disable-thp config option to make prctl(PR_SET_THP_DISABLE, ...) call on its own, by default.
My production memory-intensive processes go 5-15% faster with /sys/kernel/mm/transparent_hugepage/enabled set to always. Many popular desktop applications benefit from always transparent huge pages immensely.
This is why I cannot appreciate those search results for "transparent huge pages" spammed with Redis adviŅe to disable them. That's a panic advice from Redis, not the best practice.
The overhead THP imposes occurs only during memory allocation, because of defragmentation costs.
If your redis instance has a (near-)constant memory footprint, you can only benefit from THP. Same applies to java or any other long-lived service that does its own memory management. Pre-allocate memory once and benefit.
why playing such echo-games when there is a kernel-param you can boot with?
transparent_hugepage=never
We have distributed cluster weblogic setup.
Our Use Case was whenever Device Contact our system need to compute Parameter and provision to the device. There can be concurrent request from devices. We cant reject any request from devices.So we are going with Async Processing approach.
Here problem we are facing is whenever device contacts we need to lock the source device as well as neighbor devices to provision optimized parameter.
Since we have cluster system, we require a distributed locking system which provides high performance.
Could you suggest us any framework/suggestion in java for distributed locking which suits to our requirement ?
Regards,
Sakumar
Typically, when you sense a need for distributed locking, that indicates a design flaw. Distributed locking is usually either slow or unsafe. It's slow when done correctly because strong consistency guarantees are required to ensure two processes can't hold the same lock at the same time, and unsafe when consistency constraints are relaxed in favor of performance gains.
Often you can find a better solution than distributed locking by doing something like consistent hashing to ensure related requests are handled by the same process. Similarly, leader election can be a more performant alternative to distributed locking if you can elect a leader and route related requests to the leader. But certainly there must be some cases where these solutions are not possible, and so I'd better answer your question...
Assuming fault tolerance is a requirement, and considering the performance and safety concerns mentioned above, Hazelcast may be a good option for your use case. It's a fast embedded in-memory data grid that has a distributed Lock implementation. Often it's nice to use an embedded system like Hazelcast rather than relying on another cluster, but Hazelcat does have the potential for consistency issues in certain partition scenarios, and that could result in two processes acquiring a lock. TBH I've heard more than a few complaints about locks in Hazelcast, but no doubt others have had positive experiences.
Alternatively, ZooKeeper is likely the most common system for distributed locking in Java. However, ZooKeeper tends to be significantly slower for writes than reads since its quorum based - though it is relatively fast and very mature - and locking is a write-heavy work load. Also, in contrast to Hazelcast, one major downside to ZooKeeper is that it's a separate cluster and thus a dependency on another external system. I think ZooKeeper's stability and maturity makes it worth a look.
There doesn't currently seem to be many proven projects in between Hazelcast (an embedded eventually strongly consistent framework) and ZooKeeper (a strongly consistent external service) which is why (disclaimer: self promotion incoming) I created Atomix to provide safe distributed locking and leader elections as an embedded system for Java. It's a decent option if you need a framework like Hazelcast that has the same (actually stronger) consistency guarantees as ZooKeeper.
If performance and scalability is paramount and you're expecting high rates of requests, you will likely have to sacrifice consistency and look at a Hazelcast or something similar.
Alternatively, if fault tolerance is not a requirement (I don't think you spshould cities that it is) you can even just use a Redis instance :-)
I have a small cluster of servers I need to keep in sync. My initial thought on this was to have one server be the "master" and publish updates using redis's pub/sub functionality (since we are already using redis for storage) and letting the other servers in the cluster, the slaves, poll for updates in a long running task. This seemed to be a simple method to keep everything in sync, but then I thought of the obvious issue: What if my "master" goes down? That is where I started looking into techniques to make sure there is always a master, which led me to reading about ideas like leader election. Finally, I stumbled upon Apache Zookeeper (through python binding, "pettingzoo"), which apparently takes care of a lot of the fault tolerance logic for you. I may be able to write my own leader selection code, but I figure it wouldn't be close to as good as something that has been proven and tested, like Zookeeper.
My main issue with using zookeeper is that it is just another component that I may be adding to my setup unnecessarily when I could get by with something simpler. Has anyone ever used redis in this way? Or is there any other simple method I can use to get the type of functionality I am trying to achieve?
More info about pettingzoo (slideshare)
I'm afraid there is no simple method to achieve high-availability. This is usually tricky to setup and tricky to test. There are multiple ways to achieve HA, to be classified in two categories: physical clustering and logical clustering.
Physical clustering is about using hardware, network, and OS level mechanisms to achieve HA. On Linux, you can have a look at Pacemaker which is a full-fledged open-source solution coming with all enterprise distributions. If you want to directly embed clustering capabilities in your application (in C), you may want to check the Corosync cluster engine (also used by Pacemaker). If you plan to use commercial software, Veritas Cluster Server is a well established (but expensive) cross-platform HA solution.
Logical clustering is about using fancy distributed algorithms (like leader election, PAXOS, etc ...) to achieve HA without relying on specific low level mechanisms. This is what things like Zookeeper provide.
Zookeeper is a consistent, ordered, hierarchical store built on top of the ZAB protocol (quite similar to PAXOS). It is quite robust and can be used to implement some HA facilities, but it is not trivial, and you need to install the JVM on all nodes. For good examples, you may have a look at some recipes and the excellent Curator library from Netflix. These days, Zookeeper is used well beyond the pure Hadoop contexts, and IMO, this is the best solution to build a HA logical infrastructure.
Redis pub/sub mechanism is not reliable enough to implement a logical cluster, because unread messages will be lost (there is no queuing of items with pub/sub). To achieve HA of a collection of Redis instances, you can try Redis Sentinel, but it does not extend to your own software.
If you are ready to program in C, a HA framework which is often forgotten (but can be quite useful IMO) is the one coming with BerkeleyDB. It is quite basic but support off-the-shelf leader elections, and can be integrated in any environment. Documentation can be found here and here. Note: you do not have to store your data with BerkeleyDB to benefit from the HA mechanism (only the topology data - the same ones you would put in Zookeeper).
I'm looking for a distributed key/value store that supports a balanced load of reads and writes.
Necessary Features:
Get, Set, Incr
Disk backed
Blazingly fast (i.e. eventual consistency is OK)
High availability (i.e. rebalancing load upon node failures)
Nice to have Features:
Overflow to disk (Assuming the load has nice locality properties)
Platform-agnostic (e.g. java based)
Because a lot of the distributed caching solutions support get/set but not incr, it looks like the only option that fits the requirements is terracotta. (Though Redis has a cluster model in their unstable branch).
Any Suggestions?
I can speak namely for redis.
Necessary Features:
Yes, support also for other advanced data structures like hashed, (ordered) sets and lists
Yes, by default redis saves snapshot of the data set on disk.
Yes.
Rebalancing load upon node failures is rather a partition tolerance than high availability in terms of CAP theorem. Redis support replication and cluster is in development.
Nice to have Features:
Read the article about virtual memory.
Most of the POSIX systems.
Maybe your can try to take a look also on membase or couchbase server.
http://www.basho.com/ Riak will do this for you.