Hazelcast is using a high number of JVM threads

I am using Hazelcast embedded in the JVM in my application, which runs as 2 replicas in Kubernetes. Hazelcast in both containers has formed a cluster and sync is working perfectly fine.
But my application has started using 20% more threads since we started using Hazelcast. On analyzing a thread dump, I found that Hazelcast accounts for that extra 20%.
Is it okay for Hazelcast to use this many threads, and if this can be reduced, how do I go about it?

Hazelcast will self-size the number of threads it uses, based on the number of processors available to it.
(In Java, see Runtime.availableProcessors().)
How many does your container have allocated?
You can override the threading if you are sure it's inappropriate. Look for system properties like hazelcast.*.thread.count from here. There are many options, and tuning them is not a casual task; if you tune the numbers down, you risk very poor performance.
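As a rough sketch (not the asker's actual configuration), the following prints how many processors the JVM sees inside the container and shows how a couple of Hazelcast thread pools could be capped via system properties; the specific property names and values are assumptions to check against the Hazelcast version in use.

```java
// Sketch: inspect the processor count Hazelcast will size itself from, and
// (only if you are sure it is appropriate) cap some thread pools explicitly.
// Property names/values below are assumptions to verify for your Hazelcast version.
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class HazelcastThreadTuning {
    public static void main(String[] args) {
        System.out.println("Processors visible to the JVM: "
                + Runtime.getRuntime().availableProcessors());

        Config config = new Config();
        // Hazelcast self-sizes from available processors; override only deliberately.
        config.setProperty("hazelcast.operation.thread.count", "2");
        config.setProperty("hazelcast.io.thread.count", "2");

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        System.out.println("Cluster members: " + hz.getCluster().getMembers());
    }
}
```

On recent, container-aware JVMs, availableProcessors() generally reflects the container's CPU limit, so that limit is worth checking against the 20% thread growth you observed.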

Related

Profiling Rebus with RabbitMQ shows high CPU usage

I'm using Rebus to communicate with RabbitMQ, and I've configured it to use 4 workers with a max parallelism of 4. I also notice that the prefetch count is 1 (probably to get even balancing between consumers).
My issue happens when there is a spike of, say, 1,000 messages: I notice high CPU usage in a Kubernetes environment with 2 containers where the CPU is limited to, say, 0.5 CPU.
I've profiled the application with dotTrace, and it shows that the cause is a wait on cancellationToken.WaitHandle inside the DefaultBackOffStrategy class.
My question is whether the initial setup of workers/parallelism is causing this, or whether I need to tweak something in Rebus. I've also tried to change the prefetch count for each consumer, but in the RabbitMQ management UI this doesn't change the default value (which is one).
Also, with the CPU profiler from Visual Studio and looking at the .NET counters, I notice some lock contention counters (can this be related?).
Why the CPU usage is so high is my question at this point, and how to solve it properly.
Thanks for any help given.

How does the YARN container use the allocated CPU?

I am struggling to understand how yarn containers are limited to allocated resources, especially the CPU.
I am running Spark or Flink jobs in the YARN cluster. Each executor or task manager requests a yarn container that has 1 CPU. Basically, the number of containers is equal to the number of CPUs available in the host.
I understand that YARN monitors the memory usage, and if the container exceeds the limit, it sends a kill signal. I am wondering about how CPU scheduling really works.
My JVM job in the YARN container (1 CPU) can try to create multiple CPU-bound worker threads. Will the JVM be limited to 1 CPU core to execute those threads, or will it steal resources from other containers? Can a YARN container technically affect other containers' CPU performance?
Let's say I have 10 CPUs in the host and I created a single container. Will that container's CPU performance be 10% of the host's CPU performance?
By default, YARN only allocates resources by RAM, so it hopes everyone plays nicely, and you can be affected by CPU-hungry jobs. You can change this:
From Apache:
yarn.scheduler.capacity.resource-calculator: The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default, i.e. org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, only uses Memory, while DominantResourceCalculator uses Dominant-resource to compare multi-dimensional resources such as Memory, CPU etc. A Java ResourceCalculator class name is expected.
In general it's enough to estimate by memory. Most people actually estimate their requirements for memory and threads very poorly. It's usually best to ignore [threads] unless you encounter issues. If it continues to be an issue, then maybe consider looking at DominantResourceCalculator. If/when you turn on DominantResourceCalculator, be ready for a lot of people to feel the impact. You may have grossly over-allocated threads, and once threads start being counted, users will suddenly have to account for what they've asked for. (Or at least this was my experience.) This can appear to drastically shrink the capacity of your cluster, as space is reserved where it wasn't before.
TL;DR: Don't touch this unless you have a good reason. (Wait until it's a problem; don't optimize until there is a bottleneck.) Users can make innocent mistakes in their resource estimation, and it can be painful to grow their ability to correctly estimate what they need.
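If you do decide to switch calculators, the change amounts to setting yarn.scheduler.capacity.resource-calculator to DominantResourceCalculator and refreshing the scheduler. As a purely illustrative sketch (the real change is made in capacity-scheduler.xml on the ResourceManager, not from client code), the property key and value look like this through Hadoop's Configuration API:

```java
// Illustrative only: shows the property key and the DominantResourceCalculator class
// name via Hadoop's Configuration API. In a real cluster this setting belongs in
// capacity-scheduler.xml, followed by a scheduler refresh.
import org.apache.hadoop.conf.Configuration;

public class ResourceCalculatorSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("yarn.scheduler.capacity.resource-calculator",
                 "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator");
        System.out.println("yarn.scheduler.capacity.resource-calculator = "
                + conf.get("yarn.scheduler.capacity.resource-calculator"));
    }
}
```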

The mechanics behind the mapping of redis instances to separate CPU cores

It's documented that separate redis instances map to separate CPU cores. If I have 8 redis instances running on a Debian/Ubuntu machine with 8 cores, all of them would map to a core each.
1) What happens if I scale this machine down to 4 cores?
2) Do the changes happen automatically (by default), or is some explicit configuration involved?
3) Is there any way to control the behavior? If so, to what extent?
Would love to understand the technical details behind this, and an illustrative example is most welcome. I run an app hosted in the cloud which uses redis as a back-end. Scaling the machine's CPU cores up (and down) is one of the things I have to do, but I'd like to know what I'm getting into first.
Thanks in advance!
There is no magic. Since redis is single-threaded, a single instance of redis will only occupy a single core at once. Running multiple instances creates the possibility that more than one of them will be executing at once, on different cores (if you have them). How this is done is left entirely up to the operating system. redis itself doesn't do anything to "map" instances to specific cores.
In practice, it's possible that running 8 instances on 8 cores might give you something that looks like a direct mapping of instances to cores, since a smart OS will spread processes across cores (to maximize available resources), and should show some preference for running a process on the same core that it recently vacated (to make best use of cache). But at best, this is only true for the simple case of a 1:1 mapping, with no other processes on the system, all processes equally loaded, no influence from network drivers, etc.
In the general case, all you can say is that the OS will decide how to give CPU time to all of the instances that you run, and it will probably do a pretty good job, because the scheduling parts of the OS were written by people who know what they're doing.
Redis is a (mostly) single-threaded process, which means that an instance of the server will use a single CPU core.
The server process is mapped to a core by the operating system - that's one of the main tasks an OS is in charge of. To reiterate, assigning resources, including CPU, is an OS decision and a very complex one at that (e.g. try reading the code of the kernel's scheduler ;)).
If I have 8 redis instances running on a Debian/Ubuntu machine with 8 cores, all of them would map to a core each.
Perhaps, that's up to the OS' discretion. There is no guarantee that every instance will get a unique core, and it is possible that one core may be used by several instances.
1) What happens if I scale this machine down to 4 cores?
Scaling down like this means a restart. Once the Redis servers are restarted, the OS will assign them to the available cores.
2) Do the changes happen automatically (by default), or is some explicit configuration involved?
There are no changes involved - every process, Redis or not, gets a core. Cores are shared between processes, with the OS orchestrating the entire thing.
3) Is there any way to control the behavior? If so, to what extent?
Yes, most operating systems provide interfaces for controlling the allocation of resources. Specifically, the taskset Linux command can be used to set or get a process's CPU affinity.
Note: you should leave CPU affinity setting to the OS - it is supposed to be quite good at that. Instead, make sure that you provision your server correctly for the load.
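If you do choose to override the OS despite that advice, the usual tool is taskset. Below is a hypothetical Java sketch that launches a few Redis instances, each pinned to its own core by wrapping taskset; the ports, core numbers, and the assumption that redis-server is on the PATH are all illustrative.

```java
// Hypothetical sketch: start one redis-server per core, pinned with `taskset -c <core>`.
// Ports, core numbers, and binary location are assumptions for illustration only.
import java.io.IOException;

public class PinnedRedisLauncher {
    public static void main(String[] args) throws IOException {
        int instances = 4; // e.g. a 4-core machine with 4 instances
        for (int core = 0; core < instances; core++) {
            int port = 6379 + core;
            new ProcessBuilder(
                    "taskset", "-c", String.valueOf(core),
                    "redis-server", "--port", String.valueOf(port))
                .inheritIO()
                .start();
        }
    }
}
```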

Dataproc set number of vcores per executor container

I'm building a spark application which will run on Dataproc. I plan to use ephemeral clusters, and spin a new one up for each execution of the application. So I basically want my job to eat up as much of the cluster resources as possible, and I have a very good idea of the requirements.
I've been playing around with turning off dynamic allocation and setting up the executor instances and cores myself. Currently I'm using 6 instances and 30 cores a pop.
Perhaps it's more of a yarn question, but I'm finding the relationship between container vCores and my spark executor cores a bit confusing. In the YARN application manager UI I see that 7 containers are spawned (1 driver and 6 executors) and each of these use 1 vCore. Within spark however I see that the executors themselves are using the 30 cores I specified.
So I'm curious if the executors are trying to do 30 tasks in parallel on what is essentially a 1 core box. Or maybe the vCore displayed in the AM gui is erroneous?
If it's the former, I'm wondering what the best way is to set this application up so I end up with one executor per worker node, with all the CPUs used.
The vCore displayed in the YARN GUI is erroneous; this is a not-well-documented but known issue with the capacity-scheduler, which is Dataproc's default. Notably, with the default settings on Dataproc, YARN is only doing resource bin-packing based on memory rather than CPUs; the benefit is that this is more versatile for oversubscribing CPUs to varying degrees as desired per-workload, especially if something is IO bound, but the downside is that YARN won't be responsible for carving out CPU usage in a fixed manner.
See https://stackoverflow.com/a/43302303/3777211 for some discussion of changing to fair-scheduler to see the vcores allocation accurately represented in YARN. However, in your case there's probably no benefit to doing so; making YARN do bin-packing across both dimensions is more of a "shared multitenant cluster" issue, and only complicates the scheduling problem.
In your case, the best way to set your application up is just to ignore what YARN says about vcores; if you want just one executor per worker node, then set the executor memory size to the maximum that will fit in YARN per node, and make cores per executor equal to the total number of cores per node.
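For illustration, a "one fat executor per node" setup along those lines might be expressed like this; the instance, core, and memory numbers below are assumptions based on the figures mentioned in the question, not verified values, and the same properties can equally be passed via spark-submit.

```java
// Hypothetical sketch: one executor per worker node, using all of a node's cores.
// Instance count, core count, and memory size are assumptions, not measured values.
import org.apache.spark.sql.SparkSession;

public class OneExecutorPerNode {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("one-executor-per-node-sketch")
                .config("spark.dynamicAllocation.enabled", "false")
                .config("spark.executor.instances", "6")   // one per worker node
                .config("spark.executor.cores", "30")      // all cores on a node
                .config("spark.executor.memory", "80g")    // ~max that fits in YARN per node (assumed)
                .getOrCreate();

        // Confirm what the running application actually ended up with.
        System.out.println("spark.executor.cores = "
                + spark.conf().get("spark.executor.cores"));
        spark.stop();
    }
}
```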

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded-dual-quad-core-Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is in the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one, global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the evented-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service object (say for each cpu core), each run by a single thread, you will not have this problem. See the http server example 2 on the boost ASIO page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.
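The underlying recommendation (one single-threaded event loop per core, with each connection owned by exactly one loop) is not ASIO-specific. A rough, hypothetical Java sketch of that ownership pattern, with arbitrary loop counts and task IDs, looks like this:

```java
// Rough analogue of "one io_service per core, each run by a single thread":
// N independent single-threaded loops; every "connection" is handed to exactly one
// loop, so each loop's internal state never needs cross-thread locking.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerCoreEventLoops {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService[] loops = new ExecutorService[cores];
        for (int i = 0; i < cores; i++) {
            loops[i] = Executors.newSingleThreadExecutor();
        }
        // Round-robin assignment of connections (here just task IDs) to loops.
        for (int conn = 0; conn < 16; conn++) {
            final int id = conn;
            loops[id % cores].submit(() ->
                    System.out.println("connection " + id + " handled on "
                            + Thread.currentThread().getName()));
        }
        for (ExecutorService loop : loops) {
            loop.shutdown();
        }
    }
}
```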