Based on below latentcy comparisons given at https://gist.github.com/jboner/2841832 SSD Read is almost similar to Network Read in same datacenter in terms of cost.
I am trying to understand if Redis deployment on separate node/cluster will be performant due to network latency introduced? Won't deploying Redis on app nodes itself be a better option? This is assuming app nodes are using SSD disks and data is sharded across app nodes.
This is for a large deployment with more than 10 app nodes.
Obviously if you can run Redis on the same node as your app you'll get better latency than over the network (and you can also use Unix socket to reduce it more).
But the questions you need to ask your self:
How are you going to shard the data between the app nodes?
What about high availability?
Are there cases where one app node will need data from another node?
Can you be sure the load will be evenly distributed between the nodes so no Redis node will get out of memory?
What about scale out? How are you going to reshard the data?
Related
I am working on a migration project from Oracle to Redis, my Oracle DB size is 1 TB, can you please suggest the hardware configuration for Redis. I am planning to have a master with 2 slaves for the Redis server.
What is the best option for Redis to have high availability?
Is the master-slave architecture is fine? If yes can I have all the master and slaves on the same server? If yes what are the disadvantages will occur?
Please suggest me the best option for high availability for my Redis server.
Considering the data size you can utilize redis cluster to store your data.
When designed properly, this is expected to provide the high availability and partitioning your data among multiple masters in the cluster.
To identify its suitability, you need to perform some kind of benchmarks with the real data and real queries expected from your application.
You can use redis-benckmark utility provided by redis out of the box and simulate the expected data and calls to get a picture of what's expected
I was wondering how I could setup a hadoop cluster (say 5 nodes) through AWS. I know how to create the cluster on EC2 but I don't know how to face the following challenges.
What happens if I lose my spot instance. How do I keep the cluster going.
I am working with some datasets of Size 1TB. Would it be possible to setup the EBS accordingly. How can I access the HDFS in this scenario.
Any help will be great!
Depending on your requirements, these suggestions would change. However, assuming a 2 Master and 3 Worker setup, you can probably use r3 instances for Master nodes as they are memory intensive app optimized and go for d2 instances for the worker nodes. d2 instances have multiple local disks and thus can withstand some disk failures while still keeping your data safe.
To answer your specific questions,
treat Hadoop machines as any linux applications. What would happen if your general centOS spot instances are lost? Hwnce, generally it is advised to use reserved instances.
Hadoop typically stores data by maintaining 3 copies and distributing them across the worker nodes in forms of 128 or 256 MB blocks. So, you will have 3TB data to store across the three worker nodes. Obviously, you have to consider some overhead while calculating space requirements.
You can use AWS's EMR service - it is designed especially for Hadoop clusters on top of EC2 instances.
It it fully managed, and it comes pre-packed with all the services you need in Hadoop.
Regarding your questions:
There are three main types of nodes in hadoop:
Master - a single node, don't need to spot it.
Core - a node that handle tasks, and have part of the HDFS
Task - a node that handle tasks, but does not have any part of the HDFS
If Task nodes are lost (if they are spot instances) the cluster will continue to work with no problems.
Regarding storage, the default replication factor in EMR is as follows:
1 for clusters < four nodes
2 for clusters < ten nodes
3 for all other clusters
But you can change it - http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hdfs-config.html
I have created RabbitMQ cluster on single windows machine with HA policy to all and created two DISC and two RAM node and 1 STAT node. I then ran the PerfTest (rabbitmq client test utility), the result were disappointing, it was around 5000m/sec. But when I ran the same test with single RabbitMQ node it gave me good result i.e. 25000m/sec. I am unable to get what wrong is happening, its result should be impressive if run within cluster, but it is opposite. Anyone have encounter the same or if know the reason behind it.
Thanks
A RabbitMQ Cluster with Mirrored Queues won't go faster than a single node. Why? Clustering is there to improve reliability and fault tolerance, not to improve throughput.
What's the reason for this? When you enable mirrored queues, RabbitMQ needs to coordinate state between nodes, that is, it needs to coordinate publishes, consumers and acks, to not deliver the same message more than once, or to more than one consumer. All this coordination affects performance, but that's the tradeoff with this kind of replication.
If you need decentralised replication, then you could use the Federation Plugin
The throughput rate would depend on couple of factors. In our perf tests for RabbitMQ in a cluster we observed that the rate varied depending on RabbitMQ nodes were DISC or RAM, but a big chunk of the performance variation was observed when running RabbitMQ Cluster with Mirrored Queues vs without. With Mirroring enabled we were seeing a rate of 3500 m/sec, while without it was 5000 m/sec. Also what is your message size when you run your perftest.
As is typical with RabbitMQ, it really depends. Here are a few ways that I have found to improve performance with RabbitMQ clustering:
Push the messages to a set of appropriately sized memory nodes only using a load balancer
Keep the message size very small
Do not use amqp transactions or Publisher Confirms
Only use HA Mirrored queues for a small set of queues that you absolutely have to have the data saved
Set a TTL on all messages or queues using a policy
Just to addon to above comments.. Putting it as FYI
http://www.rabbitmq.com/blog/2012/05/11/some-queuing-theory-throughput-latency-and-bandwidth/
http://www.rabbitmq.com/blog/2012/04/25/rabbitmq-performance-measurements-part-2/
The problems is that you are running a cluster on the same machine with the same resources.
The purpose of a rabbit cluster is to scale out and not scale in.
In other words, to have more network connections available, more disk power of course more CPU power to handle more messages.
When adding nodes on a single machine you don't scale your resources plus you are adding overheads of using a cluster. (As stated above)
I'm reading about production topology for the Analytics part of Worklight 6.2.
https://www-01.ibm.com/support/knowledgecenter/api/content/SSZH4A_6.2.0/com.ibm.worklight.monitor.doc/monitor/t_setting_up_production_cluster.html
It explains that nodes can act both as Master Node or as Data Node or only as one of them.
My question is why we should configure dedicated nodes, Master OR Data instead of configuring all the nodes for both Master AND Data.
I assume the the node (only one) acting as master will provide worst performance in its Data role but on the other hand the configuration will be simpler and the high availability will be higher.
Thank you.
Your assumption is correct.
A master node is responsible for handling communication between the data nodes. The data nodes will be responsible for indexing data. Having dedicated master and data nodes will allow them to focus their processing time and memory on their specific tasks. However, as you mentioned, in some cases its not worth doing this to complicate the configuration.
Another reason is that its not necessary to put a master node on a high performing machine. You can reserve the better machines for the data nodes.
The analytics console uses Elasticsearch under the covers. It would be worth looking up the benefits and drawbacks of choosing master and data nodes in Elasticsearch since it is an open source library and there are several resources available for it.
Edit:
As you can imagine, there is no one size fits all configuration. The configuration depends on several factors such as:
How long you wish to keep data stored
How many machines you have to dedicate to analytics
How verbose your client logs have been set
Your preferences between availability and performance
In my personal tests, I typically keep each node as a data and master node. Its possible that in the future we will document how the different configurations affect performance.
According to this link which belongs to JBoss documentation, I understood that Infinispan is a better product than JBoss Cache and kind of improvement the reason for which they recommend to migrate from JBoss Cache to Infinispan, that is supported by JBoss as well. Am I right in what I understood? Otherwise, are there differences?
One more question : Talking about replication and distribution, can any one of them be better than the other according to the need?
Thank you
Question:
Talking about replication and distribution, can any one of them be better than the other according to the need?
Answer:
I am taking a reference directly from Clustering modes - Infinispan
Distributed:
Number of copies represents the tradeoff between performance and durability of data
The more copies you maintain, the lower performance will be, but also the lower the risk of losing data due to server outages
use of a consistent hash algorithm to determine where in a cluster entries should be stored
No need to replicate data on each node that takes more time than just communicating hash code
Best suitable if no of nodes are high
Best suitable if size of data stored in cache is high.
Replicated:
Entries added to any of these cache instances will be replicated to all other cache instances in the cluster
This clustered mode provides a quick and easy way to share state across a cluster
replication practically only performs well in small clusters (under 10 servers), due to the number of replication messages that need to happen - as the cluster size increases
Practical Experience:
I are using Infinispan cache in my running live application on Jboss server having 8 nodes. Initially I used replicated cache but it took much longer time to respond due to large size of data. Finally we come back to Distributed and now its working fine.
Use replicated or distributed cache only for data specific to any user session. If data is common regardless of any user than prefer Local cache that's created separately for each node.