I have a Hortonworks 3.1 cluster. Originally I had 6 data nodes and 3 master nodes; the 6 data nodes all have 96 GB of memory. I was able to spin up 3 LLAP nodes just fine. Now I have 3 new nodes with 256 GB of memory each, and I want to use those for the LLAP nodes instead.
I added a node label "llap" to the 3 new nodes, then assigned that same label to the llap queue (all other nodes have the default label).
For some reason, when I start up LLAP, I get the following error:
Failed: org.apache.hadoop.yarn.exceptions.YarnException: Component llap: specified memory size (224256) is larger than configured max container memory size (94208)
I've tried everything I can think of, but it seems that LLAP wants to spin things up on the old nodes, not the new ones...
I've googled my life away, but I keep finding the same references from Cloudera / Hortonworks, which I've messed with over and over and over again... Any help would be appreciated!
Your error is due to the YARN configuration parameter that determines the maximum container size:
yarn.scheduler.maximum-allocation-mb
In your case, you are trying to allocate 224256 MB to a single container, while your yarn.scheduler.maximum-allocation-mb is set to 94208 MB.
There is a very good article detailing how to set up your LLAP configuration:
https://community.cloudera.com/t5/Community-Articles/LLAP-sizing-and-setup/ta-p/247425
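If you do want containers of that size on the new 256 GB nodes, you will need to raise that limit to at least the requested 224256 MB. A minimal sketch of the relevant yarn-site.xml property (the value is only an example; it must not exceed yarn.nodemanager.resource.memory-mb on the llap-labeled nodes, and with Ambari you would change it through the YARN configs UI rather than editing the file by hand):

<property>
  <!-- example value only: must be >= the LLAP container size (224256)
       and <= yarn.nodemanager.resource.memory-mb on the llap-labeled nodes -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>229376</value>
</property>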
An Ignite cluster was running with 3 replicas and an 80 GB persistent data store. Backups are set to one. Each node has 3 CPU cores and 6 GB of RAM.
Half of the caches are created in partitioned mode and are collocated. One of these caches is loaded with 70% of the data. The remaining caches are replicated.
We scaled the Ignite cluster up to 5 replicas with 10 GB of RAM each. All nodes were up, but the newly created nodes were not participating in the existing topology.
We observed that one node was using much more CPU and RAM while the other nodes were using very little.
Can you help us resolve this scenario?
When you add nodes to a cluster with persistence enabled, you need to adjust the "baseline topology." There are multiple ways, but the easiest is to use the control.sh script:
./control.sh --baseline add consistentId1,consistentId2 --yes
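If you are not sure of the new nodes' consistent IDs, you can print the current cluster state and baseline first (a sketch, assuming control.sh is run from the bin directory of any server node):

./control.sh --baseline

The newly joined nodes should appear outside the current baseline list with their consistent IDs, which you then pass to --baseline add as above.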
We have a CDH cluster (version 5.14.4) with 6 worker servers with a total of 384 vcores (64 cores per server).
We are running some ETL processes using dask version 2.8.1, dask-yarn version 0.8, and skein 0.8.
Currently we are having problems allocating the maximum number of workers.
We are not able to run a job with more than 18 workers! (We can see the actual number of workers in the dask dashboard.)
The definition of the cluster is as follows:
cluster = YarnCluster(environment='path/to/my/env.tar.gz',
                      n_workers=24,
                      worker_vcores=4,
                      worker_memory='64GB')
Even when increasing the number of workers to 50, nothing changes, although when changing worker_vcores or worker_memory we can see the changes in the dashboard.
Any suggestions?
Update
Following @jcrist's answer, I realized that I didn't fully understand the terminology connecting the YARN web UI application dashboard and the YarnCluster parameters.
From my understanding:
A YARN container is equal to a dask worker.
Whenever a YARN cluster is generated, there are 2 additional workers/containers running (one for a scheduler and one for a logger, each with 1 vCore).
The limitation between n_workers * worker_vcores vs. n_workers * worker_memory is something I still need to fully grok.
There is another issue: while optimizing, I tried using cluster.adapt(). The cluster was running with 10 workers, each with 10 threads and a limit of 100 GB, but the YARN web UI showed only 2 containers running (my cluster has 384 vCores and 1.9 TB, so there is still plenty of room to expand). This is probably worth opening as a different question.
There are many reasons why a job may be denied more containers. Do you have enough memory across your cluster to allocate that many 64 GiB chunks? Further, does 64 GiB tile evenly across your cluster nodes? Is your YARN cluster configured to allow jobs that large in this queue? Are there competing jobs that are also taking resources?
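For the tiling question, a rough back-of-the-envelope check (just a sketch: the per-node memory is an assumption based on the ~1.9 TB across 6 nodes mentioned in the update, and YARN's usable memory per node, yarn.nodemanager.resource.memory-mb, is usually lower than the physical total):

# How many 64 GiB worker containers tile onto each node?
# node_memory_gib is an assumption; replace it with the real
# yarn.nodemanager.resource.memory-mb of your worker nodes.
node_memory_gib = 1.9 * 1024 / 6                 # ~324 GiB per node (assumed)
worker_memory_gib = 64                           # worker_memory='64GB'
workers_per_node = int(node_memory_gib // worker_memory_gib)
print(workers_per_node, workers_per_node * 6)    # -> 5 per node, 30 in total under these assumptions

Note that 18 is exactly 3 containers × 6 nodes, which would be consistent with each NodeManager advertising somewhere between 3×64 and 4×64 GiB to YARN rather than the full physical memory.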
You can see the status of all containers using the ApplicationClient.get_containers method.
>>> cluster.application_client.get_containers()
You could filter on state REQUESTED to see just the pending containers:
>>> cluster.application_client.get_containers(states=['REQUESTED'])
This should give you some insight as to what's been requested but not allocated.
If you suspect a bug in dask-yarn, feel free to file an issue (including logs from the application master for a problematic run), but I suspect this is more an issue with the size of containers you're requesting, and how your queue is configured/currently used.
Ignite Version: 2.5
Ignite Cluster Size: 10 nodes
One of our Spark jobs writes data to an Ignite cache every hour, at a rate of 530 million records per hour. Another Spark job reads the cache, but when it tries to read it, we get the error "Failed to Execute the Query (all affinity nodes left the grid)".
Any pointers will be helpful.
If you are using an "embedded mode" deployment, it means that Ignite nodes are started when jobs are run and stopped when jobs finish. If you do not have enough backups, you will lose data when this happens. Any chance this may be your problem? Be sure to connect to the Ignite cluster with client=true.
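A minimal sketch of what that looks like from a Spark job (assuming the ignite-spark module, an existing SparkContext sc, and a hypothetical cache name; not your exact setup):

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext

// Join the already-running Ignite cluster as *client* nodes, so the
// data-holding server (affinity) nodes are not tied to the Spark job's lifecycle.
val ic = new IgniteContext(sc, () => new IgniteConfiguration().setClientMode(true))
val cacheRdd = ic.fromCache[Long, String]("hourlyRecords")   // hypothetical cache name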
I am a new user of Apache Hadoop. There is one thing I do not understand. I have a simple cluster (3 nodes). Every node has about 30 GB of free space. When I look at the Overview page of Hadoop I see DFS Remaining: 90.96 GB. I set the replication factor to 1.
Then I create a 50 GB file and try to upload it to HDFS, but it runs out of space. Why? Can I not upload a file that is larger than the free space of a single node of the cluster?
According to Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
This logic makes sense as it decreases the network chatter between the different nodes.
I think it depends on whether the client is itself a Hadoop (DataNode) node or not. If the client is running on a DataNode, then all the blocks will be placed on that same node. This doesn't provide any better read/write throughput in spite of having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each block, so the blocks are spread across the nodes in the cluster, which provides better read/write throughput.
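One practical way to test this (a sketch: the file and HDFS paths are placeholders): upload from a machine that is not running a DataNode, keep replication at 1, and then check where the blocks landed:

hdfs dfs -D dfs.replication=1 -put bigfile.bin /data/bigfile.bin
hdfs fsck /data/bigfile.bin -files -blocks -locations

fsck should then show the blocks spread over different DataNodes rather than piled onto a single one.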
I am doing database scaling using PostgreSQL.
Currently I am using pg_shard for scaling and am able to do sharding and replication. I have tested the example mentioned in the README file of pg_shard.
But I need to dynamically scale the cluster as new machines are added or old ones are retired. I am using Google Cloud VMs to set up the database, so once one VM is filled with data I want to set up a new instance with the same configuration.
I.e., if the current machine has 4 GB and runs out of space, it should create one more VM of 4 GB and the next entries should go there.
I have gone through http://slideplayer.com/slide/4896815/ and after reading it I understood that this is possible, but the steps are not mentioned anywhere.
How to achieve this using pg_shard?
I got the answer myself.
We can use CitusDB for this.
CitusDB ships with an extension called "shard_rebalancer", which helps you move shards around when new nodes are added to the cluster. For this, you need to follow the installation instructions for CitusDB.
In the CitusDB documentation you can find the related information for the shard rebalancer functions (i.e., rebalance_table_shards and replicate_table_shards).
In simpler words, you must follow these steps (a consolidated sketch follows the list):
Add CitusDB node(s) to the cluster
Add the IPs (or host names) to pg_worker_list.conf
Reload the master node configuration, so that the master becomes aware of the new worker node(s)
Run "SELECT rebalance_table_shards('tablename')" on the master node.