JVM cannot use 8 CPUon Linux - jvm

I have observed that JVM cannot user 8 CPU advantage. Because when a thread runs more than 1 secs, other threds are waiting for it. there is no lock beetween these threds is there any jvm option for this ?

The JVM should have no internal locks that inhibit scaling like this. There are many benchmarks (specifically SPECjbb2000 and SPECjbb2005) that show single JVMs scaling to a great number of cores. I would say that you ARE somehow locking between threads, even if you don't know how.
You don't list your JVM level, vendor, or OS. Additionally, the evidence showing lack of scaling would be good. All of those would be necessary to answer the question.

Related

How does the YARN container use the allocated CPU?

I am struggling to understand how yarn containers are limited to allocated resources, especially the CPU.
I am running Spark or Flink jobs in the YARN cluster. Each executor or task manager requests a yarn container that has 1 CPU. Basically, the number of containers is equal to the number of CPUs available in the host.
I understand that YARN monitors the memory usage, and if the container exceeds the limit, it sends a kill signal. I am wondering about how CPU scheduling really works.
My JVM job in the YARN container(1CPU) can try to create multiple CPU-bound work threads. Will JVM be limited to 1 CPU core to execute those threads, or will it steal resources from other containers? Can technically a YARN container affect other containers' CPU performance?
Let's say I have 10 CPU in the host and I created a single container. Will that containers CPU performance be 10% of the host CPU performance?
By Default, yarn only allocates resources by RAM. so by default it hopes everyone plays nicely and you can get affected by CPU hungry jobs. You can change this:
From Apache:
yarn.scheduler.capacity.resource-calculator The ResourceCalculator
implementation to be used to compare Resources in the scheduler. The
default i.e.
org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator only
uses Memory while DominantResourceCalculator uses Dominant-resource to
compare multi-dimensional resources such as Memory, CPU etc. A Java
ResourceCalculator class name is expected.
In general it's enough to estimate by Memory. Most people actually estimate they're requirements for memory and threads very poorly. It's usually best to ignore [threads] unless you encounter issues. If it maintains to be an issue then maybe consider looking at DominantResourceCalculator. If/when you turn on resourceDominantCalculator be ready for a lot of people to feel the impact. You may have grossly over allocated threads and when we start counting threads, they will suddenly have to account for what they've asked for. (Or at least this was my experience.) This could grossly appear to shrink capacity of your cluster as space is reserved where it wasn't before.
TLDF: Don't touch this unless you have a good reason. (Wait until it's a problem, don't optimize until there is a bottleneck ). Users can make innocent mistakes in their resource estimation and it can be painful to grow their ability to correctly estimate what they need.

Is it useful to create many threads for a specific process to increase the chance of this process executed?

This is an interview question I encountered today. I have some knowledge about OS but not really proficient at it. I think maybe there are limited threads for each process can create?
Any ideas will help.
This question can be viewed [at least] in two ways:
Can your process get more CPU time by creating many threads that need to be scheduled?
or
Can your process get more CPU time by creating threads to allow processing to continue when another thread(s) is blocked?
The answer to #1 is largely system dependent. However, any rationally-designed system is going to protect again rogue processes trying this. Generally, the answer here is NO. In fact, some older systems only schedule processes; not threads. In those cases, the answer is always NO.
The answer to #2 is generally YES. One of the reasons to use threads is to allow a process to continue processing while it has to wait on some external event.
The number of threads that can run in parallel depends on the number of CPUs on your machine
It also depends on the characteristic of the processes you're running, if they're consuming CPU - it won't be efficient to run more threads than the number of CPUs on your machine, on the other hand, if they do a lot of I/O, or any other kind of tasks that blocks a lot - it would make sense to increase the number of threads.
As for the question "how many" - you'll have to tune your app, make measurements and decide based on actual data.
Short answer: Depends on the OS.
I'd say it depends on how the OS scheduler is implemented.
From personal experience with my hobby OS, it can certainly happen.
In my case, the scheduler is implemented with a round robin algorithm, per thread, independent on what process they belong to.
So, if process A has 1 thread, and process B has 2 threads, and they are all busy, Process B would be getting 2/3 of the CPU time.
There are certainly a variety of approaches. Check Scheduling_(computing)
Throw in priority levels per process and per thread, and it really depends on the OS.

Shinking JVM memory and Swap

Virtual Machine:
4CPU
10GB RAM
10GB swap
Java 1.7
-Xms=-Xmx=6144m
Tomcat 7
We observed a very strange behaviour with the JVM. The JVm resident memory began to shrink and the swap usage shot up to over 50%.
Please see below stats from monitoring tools.
http://i44.tinypic.com/206n6sp.jpg
http://i44.tinypic.com/m99hl0.jpg
Any pointers to understand this is grateful.
Thanks!
Or maybe your Java program was idle and it didn't need that memory, and you have high swappiness? In such situation your OS would free RAM just in case and leave only used part.
In my opinion, that is actually good behaviour, why should you waste RAM for process that won't use it?
Unless you run only this one process on VM, then it would be quite good idea to set swappiness to 0 or other small number - this memory was given to this single process, so we may disable swapping it.
Thanks for the response. Yes this is more close to a system troubleshooting than Java but I thought this the right forum to initiate this topic incase anybody has seen such a phenomena with JVM.
Anyways, I had already checked the top and no there was no other process than Java which was hungry for memory. Actually the second top process was utilizing 72MB (RSS).
No the swappiness is not aggressive set on this system but at default 60. One additional information I missed to share is we have 4 app servers in cluster and all showed this behaviour exactly at the same time. AFAIK, JVM does not swap out but the OS would. But all of it is what confusing me.
All these app servers are production and busy serving request so not idle. The used Heap size was at Avg 5 GB used of the the 6GB.
The other interesting thing I found out were some failed messages in the Vmware logs at the same time which is what I'm investigating.

is there a limit for running multiple 32 bit JVMs?

I am working on a remote server with 64G of ram, I am using a platform which is using 32bit JVM and what I have to do is to create multiple JVMs (around 500). what happens is that after creating 190 or so I get the OOM error from java which says unable to create new native thread. Each JVM occupies around 20M of RAM so 20*190 is around 4G.
So is there any limit on the memory used by all the JVMs together? BTW my process limit in Linux is around 10000 and the limit in /proc/sys/kernel/pid_max is 65000, and also I don't get this lack of resources with other processes. Another point, changing the heap size doesn't help either. Any thoughts?
Your problem is not related to heap size. It is related to the number of threads you are able to create.
When you run a JVM, you have a lot of threads that are created (and active). I can count at least 25 of them. For instance, there are threads for Timer tasks, compiler threads, Finalizer threads and of course GC threads.
Apart from SerialGC, every garbage collector creates a number of thread proportional to the number of cores you have, so it can have a huge impact on the number of threads per JVM.
Some things to do :
Increase your process limit
Set a maximum number of threads (-XX:ConcGCThreads=N, -XX:ParallelGCThreads=N)
Do some thread dumps to check the number of threads in a JVM and deduce the right number for your platform
More JVM options : http://jvm-options.tech.xebia.fr/
Hope that helps !

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a farily massively parallel application, running on a hyperthreaded-dual-quad-core-Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is on the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest as idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one, global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the evented-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service object (say for each cpu core), each run by a single thread, you will not have this problem. See the http server example 2 on the boost ASIO page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.