YARN ResourceManager not utilising entire cluster resources - hadoop-yarn

We recently moved the YARN ResourceManager daemon from a static machine to a cloud machine. The same configuration is being used: yarn-site.xml and capacity-scheduler.xml are unchanged apart from the updated machine name.
The rest of the cluster is a mix of static and cloud machines.
The ResourceManager UI shows all nodes as live, and the available memory and vcore capacity is reported accurately.
root.default stats: default: capacity=0.9,
absoluteCapacity=0.9,
usedResources=<memory:12288, vCores:6>
usedCapacity=0.043290164,
absoluteUsedCapacity=0.038961038,
numApps=7,
numContainers=6
Queue configuration details:
Queue State: RUNNING
Used Capacity: 3.6%
Absolute Used Capacity: 3.2%
Absolute Capacity: 90.0%
Absolute Max Capacity: 95.0%
Used Resources: <memory:10240, vCores:5>
Num Schedulable Applications: 5
Num Non-Schedulable Applications: 0
Num Containers: 5
Max Applications: 9000
Max Applications Per User: 9000
Max Schedulable Applications: 29260
Max Schedulable Applications Per User: 27720
Configured Capacity: 90.0%
Configured Max Capacity: 95.0%
Configured Minimum User Limit Percent: 100%
Configured User Limit Factor: 1.0
Active users: thirdeye <Memory: 10240 (100.00%), vCores: 5 (100.00%), Schedulable Apps: 5, Non-Schedulable Apps: 0>
At any point in time only 4-6 containers are being allocated, and they are allocated on both the static and cloud NodeManagers of the cluster.
What could be the reason for this underutilization of the cluster resources?
Submitted jobs are piling up (right now at 7K).
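For reference, the same queue and headroom numbers can also be pulled from the ResourceManager REST API instead of the UI, which makes it easier to watch pending applications and per-queue usage while jobs pile up. A minimal sketch, assuming the default RM web port 8088 and a placeholder hostname:

import json
import urllib.request

RM = "http://resourcemanager.example.com:8088"  # placeholder RM address, default web port

def get(path):
    with urllib.request.urlopen(RM + path) as resp:
        return json.load(resp)

# Cluster-wide headroom: allocated vs. available memory/vcores and pending apps.
m = get("/ws/v1/cluster/metrics")["clusterMetrics"]
print("memory MB (allocated/available):", m["allocatedMB"], "/", m["availableMB"])
print("vcores (allocated/available):", m["allocatedVirtualCores"], "/", m["availableVirtualCores"])
print("apps pending:", m["appsPending"])

# Per-queue view from the CapacityScheduler; field names follow the Hadoop 2.x REST layout.
info = get("/ws/v1/cluster/scheduler")["scheduler"]["schedulerInfo"]
for q in info["queues"]["queue"]:
    print(q["queueName"], "usedCapacity:", q.get("usedCapacity"),
          "numApplications:", q.get("numApplications"))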

Related

Kubernetes - Benefit of Reducing pods

What would be the main benefits of reducing the number of replicas:
deployment:
  replicaCount: 100
  maxReplicaCount: 1000
  rollingUpdate:
    maxSurge: 150
Throughput (rpm) is not that high, and I'm planning to reduce the replica count.
In my case, which is still in the dev stage, reducing the replica count lets you save some resources for your prioritized services or spin up new deployments for new features.
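If you do reduce it, the change does not have to go through the chart in one step; here is a minimal sketch of scaling the Deployment down with the official Kubernetes Python client, where the deployment name, namespace, and target replica count are placeholders:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

# Patch only the replica count; rollingUpdate settings such as maxSurge stay
# as defined in the Deployment spec.
apps.patch_namespaced_deployment_scale(
    name="my-deployment",   # placeholder
    namespace="default",    # placeholder
    body={"spec": {"replicas": 50}},
)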

Downtime of volume-backed live migration between two compute nodes (different versions: Liberty to Mitaka) is too high

I'm working on upgrading OpenStack from Liberty to Mitaka. I've upgraded my controller to Mitaka, and the Mitaka controller now manages both Liberty and Mitaka computes. After that I live-migrate VMs from a Liberty compute to a Mitaka compute. When live-migrating between two computes of different versions, I noticed the downtime was much higher (30 ICMP packets lost at a 200 ms interval) than between two computes of the same version (5 ICMP packets lost at a 200 ms interval). Summary:
live migration between Liberty -> Liberty computes: 5 ICMP packets lost at a 200 ms interval
live migration between Liberty -> Mitaka computes: 30 ICMP packets lost at a 200 ms interval
I don't know why this happens.
My ENV:
1 controller mitaka
2 compute liberty
1 compute mitaka
OVS ML2 plugin with DVR
Ceph Backend Storage
Thanks
The issue is not related to the OpenStack version.
The following variables may affect live migration speed:
The number of modified pages on the virtual machine to be migrated: the larger the number of modified pages, the longer the virtual machine will remain in a migrating state.
Available network bandwidth between source and destination servers.
Hardware configuration of source and destination servers.
Load on source and destination servers.
Available bandwidth between servers and shared storage.
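To compare those variables between runs, the downtime itself can be measured the same way as in the question, by counting ICMP probes lost at a 200 ms interval during the migration; a minimal sketch, with the VM address as a placeholder:

import subprocess

TARGET = "203.0.113.10"  # placeholder: floating IP of the VM being migrated
COUNT = 600              # 600 probes at 0.2 s each = 2 minutes of observation

# -i 0.2 gives a 200 ms interval (the minimum allowed for non-root with Linux iputils ping).
proc = subprocess.run(
    ["ping", "-i", "0.2", "-c", str(COUNT), TARGET],
    capture_output=True, text=True,
)
# The summary line reports packet loss; each lost probe is roughly 200 ms of downtime.
for line in proc.stdout.splitlines():
    if "packet loss" in line:
        print(line)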

Hive LLAP low Vcore allocation

Problem Statement:
Hive LLAP daemons are not consuming the cluster vCPU allocation: 80-100 cores are available for the LLAP daemons, but only 16 are being used.
Summary:
I am testing Hive LLAP on Azure using 2 D14_v2 head nodes, 16 D14_v2 worker nodes, and 3 A-series ZooKeeper nodes (D14_v2 = 112 GB RAM / 12 vCPU).
15 nodes of the 16-node cluster are dedicated to LLAP.
The distribution is HDP 2.6.3.2-14.
Currently the cluster has a total of 1.56 TB of RAM available and 128 vCPUs. The LLAP daemons are allocated the proper amount of memory, but they use only 16 vCPUs in total (1 vCPU per daemon + 1 vCPU for Slider).
Configuration:
My relevant hive configs are as follows:
hive.llap.daemon.num.executors = 10 (10 of the 12 available vCPUs per node)
YARN max vcores per container = 8
Other:
I have been load testing the cluster but have been unable to get any more vCPUs engaged in the process. Any thoughts or insights would be greatly appreciated.
The ResourceManager UI will only show you the query coordinators' and Slider's core and memory allocation; each query coordinator in LLAP occupies 1 core and the minimum allotted Tez AM memory (tez.am.resource.memory.mb). To check real-time core usage by the LLAP service on HDP 2.6.3, follow these steps:
Ambari -> Hive -> Quick Links -> Grafana -> Hive LLAP Overview -> Total Execution Slots
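Besides Grafana, the daemon and executor picture can also be checked from the command line with the llapstatus tool shipped with Hive on HDP 2.6; a rough sketch, where the LLAP application name llap0 is the usual default and may differ in your install:

import subprocess

# Assumes the hive client is on PATH and the LLAP application is named "llap0".
proc = subprocess.run(
    ["hive", "--service", "llapstatus", "--name", "llap0"],
    capture_output=True, text=True, check=True,
)
# The tool prints a status block with desired/live instance counts; the vcores used
# inside each daemon are governed by hive.llap.daemon.num.executors, not by the
# single-core containers shown in the ResourceManager UI.
for line in proc.stdout.splitlines():
    if "Instances" in line or "state" in line.lower():
        print(line)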

Celery halts after 16000 tasks

I am using Celery 3.1.17 with RabbitMQ as the broker and the result backend disabled.
Celery settings:
CELERY_CREATE_MISSING_QUEUES = True
CELERY_IGNORE_RESULT = True
CELERY_DISABLE_RATE_LIMITS = True
CELERY_ACKS_LATE = True
CELERY_TASK_RESULT_EXPIRES = 1
The broker is hosted on a separate server.
RabbitMQ server configuration: 2 GB RAM, dual core
Codebase server: 8 GB RAM, quad core
Celery worker settings:
Workers: 4
Concurrency: 12
Memory consumption in the codebase server: 2 GB (max)
CPU load in the codebase server: 4.5, 2.1, 1.5
Tasks fired: 50,000
Publishing rate: 1000/s
Problem:
After every 16,000 tasks, the worker halts for about 1-2 minutes and then restarts.
Also, about 200-300 tasks failed.
CPU and memory consumption are NOT bottlenecks in this case.
Is it a ulimits thing?
How do I ensure a constant execution rate of tasks and prevent messages from getting lost?
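On the ulimits question, the limit that matters is the one the worker process itself runs under rather than the shell default; a minimal way to check it (the worker PID below is a placeholder):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-files limit in this process (soft/hard):", soft, "/", hard)

# For an already-running worker on Linux, the effective limits are in /proc.
pid = 12345  # placeholder: PID of a celery worker process
try:
    with open("/proc/%d/limits" % pid) as f:
        for line in f:
            if "open files" in line:
                print(line.rstrip())
except FileNotFoundError:
    pass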

WSO2 ESB High Availability / Clustering Environment system requirements

I want information on the WSO2 ESB clustering system requirements for production deployment on Linux.
I went through the following link: ESB clustering.
I understand that more than one copy of WSO2 ESB would be extracted and set up on a single server for the worker nodes, and similarly on the other server for the manager (DepSync and admin) and worker nodes.
Can someone suggest what the system requirements of each server would be in this case?
The system prerequisites link suggests:
Memory - 2 GB, 1 GB heap size
Disk - 1 GB
presumably to handle one ESB instance (worker or manager node).
Thanks in advance,
Sai.
As a minimum, the requirement would be 2 GB for the ESB worker JVM, plus appropriate memory for the OS (assume 2 GB for Linux in this case), which gives a 4 GB minimum per server. Of course, depending on the type of work done and the load, this requirement might increase.
The worker/manager separation is for separation of concerns. Hence, in a typical production deployment you might have a single manager node (with the same specs) and two worker nodes, where only the worker nodes handle traffic.