Load balancing Titan nodes

Let's say I have a Titan cluster with Cassandra as storage backend and an application that talks with RexPro (or the new Gremlin Server). How should I distribute the queries from my applications to the Titan nodes without knowing which node actually holds the data?
Is a simple round-robin a good choice here?

On a really large graph, round-robin isn't the best option, but there really aren't other strategies ready for use I'm afraid. Experiments were done with more advanced query routing with reasonable success, but none have made it to production Titan/TinkerPop yet.
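For reference, plain round-robin over the Titan nodes can be sketched in a few lines. This is only an illustration of the strategy discussed above, not anything shipped with Titan or TinkerPop; the class name and the host names are made up for the example.

```python
import itertools


class RoundRobinBalancer:
    """Hands out endpoints in a fixed rotating order."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = itertools.cycle(list(endpoints))

    def next_endpoint(self):
        # Each call returns the next endpoint, wrapping around at the end.
        return next(self._cycle)


# Hypothetical Gremlin Server addresses; replace with the real nodes.
balancer = RoundRobinBalancer(["titan-1:8182", "titan-2:8182", "titan-3:8182"])
picks = [balancer.next_endpoint() for _ in range(4)]
print(picks)
```

The balancer knows nothing about where the data lives, which is exactly why it is simple and also why it is not ideal on a really large graph.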

Related

How to save memory from unpopular/cold Redis?

We have a lot of Redis instances, consuming TBs of memory and hundreds of machines.
As our business activity goes up and down, some Redis instances are no longer used that frequently; they have become "unpopular" or "cold". But Redis stores everything in memory, so a lot of infrequently accessed data that could sit on cheap disk is occupying expensive memory.
We are exploring ways to reclaim memory from these unpopular/cold Redis instances in order to reduce our machine usage.
We cannot delete data, nor can we migrate to another database. Is there some way to achieve our goal?
PS: We are thinking of a Redis-compatible product that can "mix" memory and disk, i.e. one that stores hot data in memory and cold data on disk, while USING LIMITED RESOURCES. We know of RedisLabs' "Redis on Flash (ROF)" solution, but it uses RocksDB, which is very memory-unfriendly. What we want is a product with a very restrained memory footprint. Besides, ROF is not open source :(
Thanks in advance!
ElastiCache Redis now supports data tiering. Data tiering provides a new cost optimal option for storing data in Redis by utilizing lower-cost local NVMe SSDs in each cluster node in addition to storing data in memory. It is ideal for workloads that access up to 20 percent of their overall dataset regularly, and for applications that can tolerate additional latency when accessing data on SSD. More details about data tiering can be found here.
Your problem might be solved with an orchestrator approach: scale down when not in use, scale up when in demand.
The implementation depends heavily on your infrastructure, but a basic requirement is proper monitoring of Redis instance usage.
Based on that, if you are running on Kubernetes, you can leverage pod autoscaling.
Otherwise you can deploy Consul and use HAProxy to handle the shutdown/spin-up logic. A starting point for that strategy is this article.
Of course Reiner's idea of using swap is a quick win if it works the intended way!
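As a rough idea of what the monitoring step could feed into, here is a minimal sketch that flags cold instances from sampled operations-per-second readings. The readings could come from the `instantaneous_ops_per_sec` field of Redis `INFO`; the function name, the instance names, and the threshold are assumptions made up for the example.

```python
from statistics import mean


def find_cold_instances(ops_samples, threshold_ops_per_sec=1.0):
    """Return the names of instances whose average ops/sec is below the threshold.

    ops_samples maps an instance name to a list of ops/sec readings.
    Instances without any readings are skipped rather than guessed at.
    """
    cold = []
    for name, samples in ops_samples.items():
        if samples and mean(samples) < threshold_ops_per_sec:
            cold.append(name)
    return sorted(cold)


# Hypothetical readings collected by the monitoring system.
samples = {
    "redis-orders": [850.0, 910.0, 780.0],
    "redis-campaign-2019": [0.0, 0.2, 0.1],
    "redis-sessions": [120.0, 95.0, 130.0],
}
cold = find_cold_instances(samples)
print(cold)
```

The resulting list is what the orchestrator would act on, for example by scaling those instances down or moving them to cheaper machines.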

Zookeeper: how many nodes can Zookeeper support?

Background:
In our team, we use Zookeeper to react when configuration changes.
Question:
1. How many nodes can Zookeeper support?
2. What factors affect the number of nodes Zookeeper can support: CPU, disk, or network?
Zookeeper can support as many nodes as you wish. It is a distributed coordination service, and the number of nodes in the network is not what limits it; what matters is how many reads and writes the application performs. Zookeeper is best suited to read-heavy workloads, although it works well with writes too.
Zookeeper isn't affected by the CPU or the disk. Its underlying consensus protocol is ZAB, which places it on the CP side of the CAP theorem: network partitions are tolerated while consistency is maintained using the atomic broadcast protocol.
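If "nodes" here means the servers in the Zookeeper ensemble, the majority rule behind ZAB gives a quick way to reason about sizing: a write is committed once a majority of servers acknowledges it, so the ensemble keeps working as long as a majority is reachable. The arithmetic can be sketched as follows (the function names are just for illustration):

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Number of servers that can be lost while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5, 7):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that going from 3 to 4 servers raises the quorum size without tolerating any extra failure, which is why odd ensemble sizes are the usual choice.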

Redis performance on a multi core CPU

I am looking at Redis to provide intermediate cache storage with a lot of computation around set operations like intersection and union.
I have looked at the Redis website and found that Redis is not designed for multi-core CPUs. My question is: why is that so?
Also, if so, how can we reach 100% utilization of CPU resources with Redis on a multi-core CPU?
I have looked at the Redis website and found that Redis is not designed for multi-core CPUs. My question is: why is that so?
It is a design decision.
Redis is single-threaded with epoll/kqueue and scales indefinitely in terms of I/O concurrency. -- @antirez (creator of Redis)
One reason for choosing an event-driven approach is that synchronization between threads comes at a cost at both the software level (code complexity) and the hardware level (context switching). Add to this that the bottleneck of Redis is usually the network or the memory, not the CPU. On the other hand, a single-threaded architecture has its own benefits (for example the guarantee of atomicity).
Therefore event loops seem like a good design for an efficient & scalable system like Redis.
Also, if so, how can we reach 100% utilization of CPU resources with Redis on a multi-core CPU?
The Redis approach to scale over multiple cores is sharding, mostly together with Twemproxy.
However if for some reason you still want to use a multi-threaded approach, take a look at Thredis but make sure you understand the implications of what its author did (you can not use it as a replication master, for instance).
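To make the sharding idea concrete, here is a minimal client-side sketch that routes each key to one of several Redis instances, one per core. It uses a plain CRC32-modulo scheme for brevity, which is an assumption of this example and not how Twemproxy distributes keys by default; the class name and addresses are hypothetical.

```python
import zlib


class ShardRouter:
    """Maps a key to one of several Redis instances by hashing the key."""

    def __init__(self, shards):
        if not shards:
            raise ValueError("at least one shard is required")
        self.shards = list(shards)

    def shard_for(self, key):
        # The same key always hashes to the same shard.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]


# Hypothetical layout: four Redis processes on one four-core machine.
router = ShardRouter([
    "127.0.0.1:6379",
    "127.0.0.1:6380",
    "127.0.0.1:6381",
    "127.0.0.1:6382",
])
placement = {key: router.shard_for(key) for key in ("set:a", "set:b", "set:c")}
print(placement)
```

One caveat for the set-operation use case in the question: an intersection or union can only be computed server-side when the sets involved live on the same instance, so keys that are combined together need to be routed to the same shard.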
The Redis server is single-threaded, but you can still reach full utilization of CPU resources by running several Redis nodes (master and/or slave).
Read operations can be scaled with a master/slave configuration with a single master: one CPU core is used for the master node and all the others for slaves.
Write operations can be scaled with a multi-master Redis Cluster configuration: several CPU cores are used for master nodes and the rest for slaves.
Redisson is a Redis Java client that provides full support for Redis Cluster. It works with AWS ElastiCache and Azure Redis Cache, and includes master/slave discovery and topology updates.
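For the multi-master cluster configuration, it helps to know how Redis Cluster decides which master owns a key: it takes a CRC16 of the key modulo 16384 hash slots, and if the key contains a non-empty `{...}` section (a hash tag), only that section is hashed. The sketch below reproduces that rule with Python's standard library; the function name is made up for the example, and a real client library does this for you.

```python
import binascii

SLOT_COUNT = 16384  # number of hash slots in Redis Cluster


def hash_slot(key):
    """Compute the Redis Cluster hash slot for a key, honoring hash tags."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty {...} section counts as a hash tag.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 variant used here.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


# Keys sharing a hash tag land in the same slot, hence on the same master.
same_slot = hash_slot("{user1000}.following") == hash_slot("{user1000}.followers")
print(same_slot)
```

Hash tags are what make multi-key commands such as set intersections usable in a cluster, since those commands require all keys involved to be in the same slot.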

zookeeper vs redis server sync

I have a small cluster of servers I need to keep in sync. My initial thought on this was to have one server be the "master" and publish updates using redis's pub/sub functionality (since we are already using redis for storage) and letting the other servers in the cluster, the slaves, poll for updates in a long running task. This seemed to be a simple method to keep everything in sync, but then I thought of the obvious issue: What if my "master" goes down? That is where I started looking into techniques to make sure there is always a master, which led me to reading about ideas like leader election. Finally, I stumbled upon Apache Zookeeper (through python binding, "pettingzoo"), which apparently takes care of a lot of the fault tolerance logic for you. I may be able to write my own leader selection code, but I figure it wouldn't be close to as good as something that has been proven and tested, like Zookeeper.
My main issue with using zookeeper is that it is just another component that I may be adding to my setup unnecessarily when I could get by with something simpler. Has anyone ever used redis in this way? Or is there any other simple method I can use to get the type of functionality I am trying to achieve?
More info about pettingzoo (slideshare)
I'm afraid there is no simple method to achieve high availability. It is usually tricky to set up and tricky to test. There are multiple ways to achieve HA, which fall into two categories: physical clustering and logical clustering.
Physical clustering is about using hardware, network, and OS level mechanisms to achieve HA. On Linux, you can have a look at Pacemaker which is a full-fledged open-source solution coming with all enterprise distributions. If you want to directly embed clustering capabilities in your application (in C), you may want to check the Corosync cluster engine (also used by Pacemaker). If you plan to use commercial software, Veritas Cluster Server is a well established (but expensive) cross-platform HA solution.
Logical clustering is about using fancy distributed algorithms (like leader election, PAXOS, etc ...) to achieve HA without relying on specific low level mechanisms. This is what things like Zookeeper provide.
Zookeeper is a consistent, ordered, hierarchical store built on top of the ZAB protocol (quite similar to PAXOS). It is quite robust and can be used to implement some HA facilities, but it is not trivial, and you need to install the JVM on all nodes. For good examples, you may have a look at some recipes and the excellent Curator library from Netflix. These days, Zookeeper is used well beyond pure Hadoop contexts, and IMO, it is the best solution for building an HA logical infrastructure.
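To give a feel for what the leader election recipe does, here is a toy in-memory model of it. In the real recipe each candidate creates an ephemeral sequential znode, the candidate with the lowest sequence number is the leader, and the znode disappears when its owner's session ends. The class below only imitates that bookkeeping; it is not a Zookeeper client, and all names in it are made up for the example.

```python
class ToyElection:
    """In-memory stand-in for an election znode (not a real client)."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> candidate name

    def join(self, name):
        # Mimics creating an ephemeral sequential znode.
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = name
        return seq

    def leave(self, seq):
        # Mimics the znode vanishing when the owner's session ends.
        self._members.pop(seq, None)

    def leader(self):
        # The candidate holding the lowest sequence number leads.
        if not self._members:
            return None
        return self._members[min(self._members)]


election = ToyElection()
a = election.join("server-a")
b = election.join("server-b")
c = election.join("server-c")
first = election.leader()
election.leave(a)  # the current leader crashes
second = election.leader()
print(first, second)
```

What the toy leaves out is the hard part that Zookeeper and Curator actually provide: agreeing on the sequence numbers across machines, detecting dead sessions, and notifying the remaining candidates through watches.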
Redis pub/sub mechanism is not reliable enough to implement a logical cluster, because unread messages will be lost (there is no queuing of items with pub/sub). To achieve HA of a collection of Redis instances, you can try Redis Sentinel, but it does not extend to your own software.
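The message-loss problem can be illustrated with a toy fire-and-forget broker. This is a simplified model written for this explanation, not Redis itself: a message is handed to whoever is subscribed at publish time and is kept nowhere else.

```python
class ToyPubSub:
    """Fire-and-forget broker: messages are not stored for absent subscribers."""

    def __init__(self):
        self.subscribers = {}  # subscriber name -> list of received messages

    def subscribe(self, name):
        self.subscribers[name] = []

    def unsubscribe(self, name):
        # Returns what the subscriber had received up to this point.
        return self.subscribers.pop(name, [])

    def publish(self, message):
        for inbox in self.subscribers.values():
            inbox.append(message)
        # Number of subscribers that received the message.
        return len(self.subscribers)


bus = ToyPubSub()
bus.subscribe("worker")
bus.publish("config-v1")
seen_before = bus.unsubscribe("worker")  # simulate a dropped connection
delivered = bus.publish("config-v2")     # nobody is listening
bus.subscribe("worker")                  # the worker reconnects
bus.publish("config-v3")
seen_after = bus.subscribers["worker"]
print(seen_before, delivered, seen_after)
```

The worker never sees `config-v2`, and nothing tells it that an update was missed, which is the gap a cluster-sync mechanism cannot afford.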
If you are ready to program in C, an HA framework which is often forgotten (but can be quite useful IMO) is the one that comes with BerkeleyDB. It is quite basic but supports off-the-shelf leader elections, and can be integrated into any environment. Documentation can be found here and here. Note: you do not have to store your data with BerkeleyDB to benefit from the HA mechanism (only the topology data, the same data you would put in Zookeeper).

Distributed Cache that supports incr

I'm looking for a distributed key/value store that supports a balanced load of reads and writes.
Necessary Features:
Get, Set, Incr
Disk backed
Blazingly fast (i.e. eventual consistency is OK)
High availability (i.e. rebalancing load upon node failures)
Nice to have Features:
Overflow to disk (Assuming the load has nice locality properties)
Platform-agnostic (e.g. java based)
Because a lot of the distributed caching solutions support get/set but not incr, it looks like the only option that fits the requirements is terracotta. (Though Redis has a cluster model in their unstable branch).
Any Suggestions?
I can only speak for Redis.
Necessary Features:
Get, Set, Incr: yes, and it also supports other advanced data structures such as hashes, (sorted) sets, and lists.
Disk backed: yes, by default Redis saves snapshots of the data set to disk.
Blazingly fast: yes.
High availability: rebalancing load upon node failures is a matter of partition tolerance rather than high availability in CAP-theorem terms. Redis supports replication, and cluster support is in development.
Nice to have Features:
Overflow to disk: read the article about virtual memory.
Platform-agnostic: runs on most POSIX systems.
You could also take a look at Membase or Couchbase Server.
http://www.basho.com/ Riak will do this for you.