apache hadoop, hbase and nutch components distribution for a 4-server cluster

I have 4 systems and I want to crawl some data. For that I first need to configure a cluster, and I am confused about the placement of components.
Should I place all components (Hadoop, Hive, HBase, Nutch) on one machine and add the other machines as nodes in Hadoop?
Should I place HBase on one machine, Nutch on another, Hadoop on a third, and add the fourth machine as a Hadoop slave?
Should HBase be in pseudo-distributed mode or fully distributed?
How many slaves should I add in HBase if I run it in fully distributed mode?
What would be the best approach? Please guide me step by step (for HBase and Hadoop).

Say you have 4 nodes n1, n2, n3 and n4.
You can install hadoop and hbase in distributed mode.
If you are using Hadoop 1.x -
n1 - hadoop master [NameNode and JobTracker]
n2, n3 and n4 - hadoop slaves [DataNodes and TaskTrackers]
For HBase, you can choose n1 or any other node as the Master node. Since master processes are usually not CPU/memory intensive, all masters can be deployed on a single node in a test setup; in production, however, it's good to give each master deployment a separate node.
Let's say n2 is the HBase Master; the remaining 3 nodes can act as RegionServers.
Hive and Nutch can reside on any node.
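As a rough sketch, the key hbase-site.xml settings for fully distributed mode on this layout would look something like the following (the n1..n4 hostnames are the hypothetical nodes above, and the NameNode is assumed to be on its default port 9000):

<!-- hbase-site.xml: minimal fully-distributed setup (hypothetical hostnames) -->
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://n1:9000/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>n2,n3,n4</value>
</property>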
Hope this helps; for a test setup this should be good to go.
Update -
For Hadoop 2.x, since your cluster size is small, a NameNode HA deployment can be skipped. NameNode HA would require:
two nodes, one each for the active and the standby NameNode;
a ZooKeeper quorum, which requires an odd number of nodes, so a minimum of three;
a journal quorum, which again requires a minimum of 3 nodes.
But for a cluster this small, HA might not be a major concern. So you can keep
n1 - NameNode
n2 - ResourceManager (YARN)
and the remaining nodes can act as DataNodes; try not to deploy anything else on the YARN node.
The rest of the deployment for HBase, Hive and Nutch would remain the same.
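As a minimal sketch (assuming default ports and the hypothetical n1..n4 hostnames above), the matching configuration would be:

<!-- core-site.xml on every node: point HDFS clients at the NameNode -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://n1:9000</value>
</property>

<!-- yarn-site.xml on every node: point NodeManagers at the ResourceManager -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>n2</value>
</property>

and the slaves file on n1 would simply list the DataNodes:

n3
n4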

In my opinion, you should install Hadoop in fully distributed mode so that jobs can run in parallel and much faster, as the MapReduce tasks will be distributed across the 4 machines. Of course, Hadoop's master node should run on a single machine.
If you need to process a big amount of data, it's a good choice to install HBase on one machine and Hadoop on the other 3.
You can make all of the above very easy using tools/platforms with a friendly GUI, such as Cloudera Manager and Hortonworks. They will help you control and maintain your cluster better, and they also provide health monitoring, cluster analytics, and e-mail notifications for every error that occurs in your cluster.
Cloudera Manager
http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html
Hortonworks
http://hortonworks.com/
These two links provide more guidance on how to construct your cluster.

Related

How can I maintain a list of constant masters and workers under conf/masters and conf/workers in a managed Scaling cluster?

I am using an AWS EMR cluster with Alluxio installed on every node. I now want to deploy Alluxio in High Availability mode.
https://docs.alluxio.io/os/user/stable/en/deploy/Running-Alluxio-On-a-HA-Cluster.html#start-an-alluxio-cluster-with-ha
I am following the above documentation, and see that "On all the Alluxio master nodes, list all the worker hostnames in the conf/workers file, and list all the masters in the conf/masters file".
My concern is that since I have an AWS-managed scaling cluster, the worker nodes keep getting added and removed based on cluster load. How can I maintain a list of constant masters and workers under conf/masters and conf/workers in a managed scaling cluster?
The conf/workers and conf/masters files are only used for the initial setup through the provided scripts. Once the cluster is running, you don't need to update them any more.
E.g., in an EMR cluster you can add a new slave node as an Alluxio worker, and as long as you specify the correct Alluxio master address, this new Alluxio worker will be able to register itself and serve in the fleet like the other workers.
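As a sketch (assuming an HA setup with three masters on hypothetical hostnames, and the default master RPC port 19998), a new worker only needs the master addresses in its alluxio-site.properties to register itself:

# conf/alluxio-site.properties on the new worker (hypothetical hostnames)
alluxio.master.rpc.addresses=master1:19998,master2:19998,master3:19998

then start the worker process on that node (plus whatever mount option your ramdisk setup needs):

./bin/alluxio-start.sh worker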

redis cluster total size

I have a quick question about redis cluster.
I'm setting up a Redis cluster on Google Kubernetes Engine. I'm using the n1-highmem-2 machine type with 13GB of RAM, but I'm slightly confused about how to calculate the total available size of the cluster.
I have 3 nodes, each with 13GB of RAM. I'm running 6 pods (2 on each node), 1 master and 1 slave per node. This all works. I've assigned 6GB of RAM to each pod in my pod definition YAML file.
Is it correct to say that my total cluster size would be 18GB (3 masters * 6GB), or can I count the slaves' size toward the total size of the Redis cluster?
Redis Cluster master-slave model
In order to remain available when a subset of master nodes are failing or are not able to communicate with the majority of nodes, Redis Cluster uses a master-slave model where every hash slot has from 1 (the master itself) to N replicas (N-1 additional slave nodes).
So, slaves are read-only replicas of the read-write masters, kept for availability; hence your total workable size is the size of your master pods.
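A quick worked example with your numbers (in practice you'd set maxmemory a bit below the pod limit to leave headroom):

3 masters * 6GB = 18GB of writable capacity
3 slaves  * 6GB = 18GB holding only copies of the masters' data

You can confirm how keys and slots are spread across the masters with (hypothetical address):

redis-cli --cluster info <any-node-ip>:6379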
Keep in mind, though, that keeping a master and its slave on the same Kubernetes node only protects against pod failure, not node failure, so you should consider redistributing them.
You didn't mention how you are installing Redis, but I'd like to mention the Bitnami Redis Helm Chart, as it's built for production use: it deploys 1 master and 3 slaves, providing good fault tolerance, and it has tons of configuration options that are easily personalized via the values.yaml file.
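For instance, a sketch of installing it (value names vary between chart versions, so check the chart's values.yaml first):

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-redis bitnami/redis --set replica.replicaCount=3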

Redis Sentinel with 2 App Servers and 1 Additional Sentinel Node Setup

We have 2 app/web servers running an HA application, and we need to set up Redis with high availability/replication to support our app.
Considering the minimum Sentinel setup requirement of 3 nodes:
We are planning to prepare the first app server with the Redis master and 1 Sentinel, the second app server with the Redis slave and 1 Sentinel, and to add one additional server to hold the third Sentinel node, achieving the quorum-2 Sentinel setup.
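For reference, each of the three Sentinels would run with a sentinel.conf roughly like this (hypothetical master address, quorum of 2):

sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000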
Is this a valid setup? What could be the risks?
Thanks!
Well, it looks like it's not recommended to put the Redis nodes on the app servers (whereas it is recommended to put the Sentinel nodes there).
We ended up with a setup based on KeyDB (a fork of Redis), which claims to be faster and to support high availability/replication (and much more), creating two nodes within the app servers.
Of course, we had to modify the client side a little to support some advanced Lua scripts (there was some binary serialized data not getting replicated to the other node).
But after some effort, it worked as expected.
Hope this helps ...

Redis advantages of Sentinel and Cluster

I'm planning to create a highly available Redis setup. After reading many articles about building Redis clusters I'm confused. So what exactly are
the advantages of a Redis Sentinel Master1 Slave1 Slave2 setup? Is it more reliable than a Redis multi-node sharded cluster?
the advantages of a Redis multi-node sharded cluster? Is it more reliable than a Redis Sentinel Master1 Slave1 Slave2 setup?
Further questions about the Redis Sentinel Master1 Slave1 Slave2 setup:
When I have 1 master and two slaves, and traffic gets higher and higher so this cluster becomes too small, how can I make the cluster bigger?
Further questions about the Redis multi-node sharded cluster:
Why are there so many demos running a cluster on a single instance but on different ports? That makes no sense to me.
When I have a cluster with 4 masters and 4 replicas, how can an application or a client be sure to write to the cluster? When Master1 and Slave1 die but my application always writes to the IP of Master1, it will not work anymore. Which solutions are out there to implement a sharded cluster well, so that applications can reach it via a single IP and port? Keepalived? HAProxy?
When I use e.g. Keepalived for a 4-master setup - doesn't that cancel out the different masters?
Furthermore, I need to understand why the multi-node cluster is only recommended for solutions where more data needs to be written than memory is available. Why? To me a multi-master setup sounds good for scalability.
Is it right that the sharded cluster setup does not support multi-key operations when the cluster is not in caching mode?
I'm unsure whether these two solutions are the only ones. Hopefully you guys can help me understand the architectures of Redis. Sorry for so many questions.
I will try to answer some of your questions but first let me describe the different deployment options of Redis.
Redis has three basic deployments: single node, sentinel and cluster.
Single node - the basic solution, where you run a single process running Redis.
It is not scalable and not highly available.
Redis Sentinel - a deployment that consists of multiple nodes where one is elected master and the rest are slaves.
It adds high availability, since in case of master failure one of the slaves will automatically be promoted to master.
It is not scalable, since the master node is the only node that can write data.
You can configure the clients to direct read requests to the slaves, which will take some of the load off the master. However, in this case slaves might return stale data, since they replicate the master asynchronously.
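For example, clients can discover the current master and the replicas by asking any Sentinel (a sketch using redis-cli against a hypothetical Sentinel on the default port 26379; the second command was called SENTINEL slaves before Redis 5):

redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
redis-cli -p 26379 SENTINEL replicas mymaster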
Redis Cluster - a deployment that consists of at least 6 nodes (3 masters and 3 slaves), where data is sharded between the masters. It is highly available since, in case of master failure, one of its slaves will automatically be promoted to master. It is scalable since you can add more nodes and reshard the data so that the new nodes take some of the load.
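Creating such a minimal cluster is a single command (hypothetical hostnames; --cluster-replicas 1 gives every master one slave):

redis-cli --cluster create n1:6379 n2:6379 n3:6379 n4:6379 n5:6379 n6:6379 --cluster-replicas 1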
So to answer your questions:
The advantages of Sentinel over Redis Cluster are:
Hardware - you can set up a fully working Sentinel deployment with three nodes. Redis Cluster requires at least six nodes.
Simplicity - it is usually easier to maintain and configure.
The advantage of Redis Cluster over Sentinel is that it is scalable.
The decision between the two deployments should be based on your expected load.
If your write load can be managed with a single Redis master node, you can go with Sentinel deployment.
If one node cannot handle your expected load, you must go with Cluster deployment.
Redis Sentinel deployment is not scalable so making the cluster bigger will not improve your performance. The only exception is that adding slaves can improve your read performance (in case you direct read requests to the slaves).
Redis Cluster running on a single node with multiple ports is only for development and demo purposes. In production it is useless.
In a Redis Cluster deployment, clients should have network access to all nodes (and not only Master1). This is because data is sharded between the masters.
If a client tries to write data to Master1 but Master2 is the owner of the data, Master1 will return a MOVED response to the client, redirecting it to send the request to Master2.
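You can see this with redis-cli: without -c it surfaces the redirect, with -c it follows it (the address is made up):

redis-cli -h master1 SET foo bar
(error) MOVED 12182 master2:6379

redis-cli -c -h master1 SET foo bar
OK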
You cannot have a single HAProxy in front of all Redis nodes.
Same answer as in 5: in the cluster deployment, clients should have a direct connection to all masters and slaves, not through an LB or Keepalived.
Not sure I totally understood your question, but Redis Cluster is the only Redis solution that is scalable.
Redis Cluster deployments support multi-key operations only when all keys map to the same hash slot. You can use "hash tags" to force multiple keys to be handled by the same master.
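For example, keys that share the same {...} hash tag are guaranteed to map to the same hash slot, so multi-key commands on them work:

SET {user:1000}:name Alice
SET {user:1000}:email alice@example.com
MGET {user:1000}:name {user:1000}:email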
Some good links that can help you understand it better:
Description on the different Redis deployment options: https://blog.octo.com/en/what-redis-deployment-do-you-need
Detailed explanation on the architecture of Redis Cluster: https://blog.usejournal.com/first-step-to-redis-cluster-7712e1c31847

minimum activemq cluster size with replicated leveldb store

What is the rationale behind requiring at least 3 ActiveMQ instances and 3 ZooKeeper servers for running a master/slave setup with replicated LevelDB storage? If the requirement is imposed by the use of ZooKeeper, which requires at least 3 servers, what is the rationale for ZooKeeper requiring at least 3 servers to provide reliability?
Is it for guaranteeing consistency in case of network partitions (by sacrificing availability on the smaller partition), since in a 2-node primary/backup configuration it is impossible to distinguish between a failed peer and the two nodes being in different network partitions?
Is it for providing tolerance against Byzantine failures, where you need 2f+1 nodes to survive f faulty nodes (while tolerating ONLY crash failures requires just f+1 nodes to survive f faults)?
Or is there any other reason?
Thanks!
ZooKeeper requires at least 3 servers because of how it elects a new ActiveMQ master. ZooKeeper requires a majority (n/2+1) to elect a new master; if it does not have that majority, no master will be selected and the system will fail. This is also the reason to use an odd number of ZooKeeper servers: 3 servers give you the same failure tolerance as 4, because with 4 servers a majority is 3, so you can still only lose 1 server.
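The majority arithmetic makes this concrete (integer division):

3 servers: majority = 3/2+1 = 2 -> tolerates 1 failure
4 servers: majority = 4/2+1 = 3 -> still tolerates only 1 failure
5 servers: majority = 5/2+1 = 3 -> tolerates 2 failures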
For ActiveMQ, the necessity of at least 3 servers derives from how messages are synced, and from the fact that when a new master is elected, at least a quorum of nodes (N/2+1) is required to identify the latest updates. ActiveMQ will sync messages with 1 slave and then respond with an OK; it then syncs asynchronously with all other slaves. If a quorum is not present when a node fails, ZooKeeper has no way to distinguish which node is the most up to date. This is what happens when you start with only 2 nodes, so at least 3 are recommended.
From ActiveMQ site, under How it Works:
All messaging operations which require a sync to disk will wait for the update to be replicated to a quorum of the nodes before completing. So if you configure the store with replicas="3" then the quorum size is (3/2+1)=2. The master will store the update locally and wait for 1 other slave to store the update before reporting success. Another way to think about it is that the store will do synchronous replication to a quorum of the replication nodes and asynchronous replication to any additional nodes.
When a new master is elected, you also need at least a quorum of nodes online to be able to find a node with the latest updates. The node with the latest updates will become the new master. Therefore, it's recommended that you run with at least 3 replica nodes so that you can take one down without suffering a service outage.
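For reference, this is roughly what the replicas="3" setup quoted above looks like in activemq.xml (a sketch; hostnames, ports and paths are hypothetical):

<persistenceAdapter>
  <replicatedLevelDB
      directory="activemq-data"
      replicas="3"
      bind="tcp://0.0.0.0:61619"
      zkAddress="zk1:2181,zk2:2181,zk3:2181"
      zkPath="/activemq/leveldb-stores"/>
</persistenceAdapter>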