I am working on a remote server with 64 GB of RAM, on a platform that uses a 32-bit JVM, and I have to create multiple JVMs (around 500). What happens is that after creating 190 or so, I get an OutOfMemoryError from Java saying "unable to create new native thread". Each JVM occupies around 20 MB of RAM, so 20 × 190 is only around 4 GB.
So is there any limit on the total memory used by all the JVMs together? By the way, my process limit in Linux is around 10,000, the limit in /proc/sys/kernel/pid_max is 65,000, and I don't run out of resources like this with other kinds of processes. One more point: changing the heap size doesn't help either. Any thoughts?
Your problem is not related to heap size. It is related to the number of threads you are able to create.
When you run a JVM, a lot of threads are created (and active); I can count at least 25 of them. For instance, there are threads for timer tasks, JIT compiler threads, the Finalizer thread, and of course GC threads.
Apart from the serial GC, every garbage collector creates a number of threads proportional to the number of cores you have, so the collector can have a huge impact on the number of threads per JVM.
Some things to do:
Increase your process limit
Cap the number of GC threads (-XX:ConcGCThreads=N, -XX:ParallelGCThreads=N)
Take some thread dumps to check the number of threads in a JVM and deduce the right number for your platform (see the sketch below)
More JVM options: http://jvm-options.tech.xebia.fr/
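For the thread-dump suggestion above, you can also count threads from inside the process with the standard ThreadMXBean. A minimal sketch (the class name is made up, and the exact thread list varies by JVM vendor and flags):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadCensus {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            System.out.println("Live threads: " + threads.getThreadCount());
            // List thread names so you can see which subsystems (GC, JIT
            // compiler, Finalizer, ...) account for them.
            for (long id : threads.getAllThreadIds()) {
                ThreadInfo info = threads.getThreadInfo(id);
                if (info != null) {
                    System.out.println("  " + info.getThreadName());
                }
            }
        }
    }

Running it with different GC flags (for example -XX:ParallelGCThreads=2) shows how much of the per-JVM thread count the collector accounts for, which in turn tells you how many JVMs the box can realistically host before the native-thread limit bites.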
Hope that helps!
Related
I'm experimenting with C++ AMP; one thing that's unclear from the MS documentation is this:
If I dispatch a parallel_for_each with an extent of, say, 1000, then that would mean that it spawns 1000 threads. If the GPU is unable to take on those 1000 threads at the same time, it completes them 300 at a time, or 400, or whatever number it can manage. Then there was some vague stuff on warps and tiles, out of which I got this impression:
Regardless of how the threads are tiled together (or not at all), the whole group must finish before taking on new tasks, so if the internally assigned group has a size of 128 and 30 of its threads finish, the 30 cores will idle until the other 98 are done too. Is that true? Also, how do I find out what this internal group size is?
During my experimentation it certainly appears to have some truth to it, because assigning more evenly sized chunks of work to the threads seems to speed things up, even if there is slightly more work overall.
The reason I'm trying to figure this out is that I'm deciding whether or not to engage in another lengthy experiment based on threads getting uneven amounts of work (sometimes by a factor of 10), but all the threads would be independent, so data-wise the cores would be free to pick up another thread.
In practice, the underlying execution model of AMP on the GPU is the same as that of CUDA, OpenCL, Compute Shaders, etc. The only thing that changes is the naming of each concept. So if you feel that the AMP documentation is lacking, consider reading up on CUDA or OpenCL; those are significantly more mature APIs, and the knowledge you gain from them applies to AMP as well.
If I dispatch a parallel_for_each with an extent of, say, 1000, then that would mean that it spawns 1000 threads. If the GPU is unable to take on those 1000 threads at the same time, it completes them 300 at a time, or 400, or whatever number it can manage.
Maybe. From the high-level view of parallel_for_each, you don't have to care about this. The threads may as well be executed sequentially, one at a time.
If you launch 1000 threads without specifying a tile size, the AMP runtime will choose a tile size for you, based on the underlying hardware. If you specify a tile size, then AMP will use that one.
GPUs are made of multiprocessors (in CUDA parlance, or compute units in OpenCL), each composed of a number of cores.
Tiles are assigned per multiprocessor: all threads within the same tile will be run by the same multiprocessor until all threads within that tile run to completion. Then the multiprocessor will pick another available tile (if any) and run it, until all tiles have been executed. Multiprocessors can execute multiple tiles simultaneously.
if the internally assigned group has a size of 128 and 30 of its threads finish, the 30 cores will idle until the other 98 are done too. Is that true?
Not necessarily. As mentioned earlier, a multiprocessor may have multiple active tiles, so it can schedule threads from other tiles to stay busy.
Important note: on a GPU, threads are not executed at a granularity of 1. For example, NVIDIA hardware executes 32 threads at once (a group it calls a warp), so a 128-thread tile runs as 4 warps.
To keep this answer from getting needlessly lengthy, I encourage you to read up on warps elsewhere.
The GPU certainly won't run 1000 threads at the same time, but it also won't complete them 300 at a time.
It uses multithreading, which means that, just like a CPU, it shares run time among the 1000 threads, allowing them to complete seemingly at the same time.
Keep in mind that creating a lot of threads may not be worthwhile, for several reasons. For instance, if you must complete all 1000 tasks in step 1 before doing step 2, you might as well distribute them over a number of threads equal to the number of cores in your GPU and no more than that.
Using more threads than the number of cores only makes sense if you want to dispatch tasks that are not being waited on, or because writing your code that way is easier. But keep in mind that thread management costs time too and may drag your performance down.
Sorry for the vagueness, but I'm just trying to understand WebSphere memory management at a high level.
This is really a question about JVM behavior. As far as I know, no JVM will block a thread waiting for another thread to finish just because that other thread is holding a large amount of memory. I expect both threads to consume memory continuously, and if both are able to allocate at the same rate, I would expect both to get an OutOfMemoryError as soon as their combined allocations exceed the max heap size.
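To make that concrete, here is a minimal sketch of the race described above (the class name, the 1 MB chunk size, and the suggested -Xmx value are arbitrary choices for the experiment):

    import java.util.ArrayList;
    import java.util.List;

    public class AllocRace {
        public static void main(String[] args) {
            Runnable allocator = () -> {
                List<byte[]> hoard = new ArrayList<>();
                try {
                    while (true) {
                        hoard.add(new byte[1024 * 1024]); // grab 1 MB at a time
                    }
                } catch (OutOfMemoryError e) {
                    hoard.clear(); // free our share so the println below can run
                    System.out.println(Thread.currentThread().getName()
                            + " hit OutOfMemoryError");
                }
            };
            // Neither thread ever blocks waiting for the other; whichever
            // allocation pushes combined usage past -Xmx fails first.
            new Thread(allocator, "alloc-1").start();
            new Thread(allocator, "alloc-2").start();
        }
    }

Run it with a small heap, e.g. java -Xmx64m AllocRace. Whichever thread's allocation pushes the combined usage past the limit fails first; neither thread is ever paused to protect the other.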
We're trying to figure out the optimal number of threads to use for our NServiceBus service. We're running it on a machine with 2 quad-core CPUs. We've been having problems with the queue backing up. We started with 100 threads, then bumped it to 200, and things got worse. We backed it down to 75, then 50, and things seemed even better. Is there some optimal number based on how many CPUs we have, or some rule of thumb we should use to determine the number of threads to run?
Every thread you have running has an overhead attached to it. If you have 2 quad-core CPUs, then you can have at most 8 threads actually running at any one time, each consuming a core.
If you have more than 8 threads, there is a chance you will start to do LESS useful work, not more. This is because every time Windows decides to give a thread that is not currently consuming a core a turn at doing something, it needs to store the state of one of the running threads, restore the saved state of the thread that is about to run, and then let that thread go at it. If you have a huge number of threads, you're going to spend a large amount of time just switching between them and doing nothing useful.
If you have a bunch of threads that are blocked waiting for I/O (for instance, for a message to finish writing to disk so it can be read), then you might be able to run more threads than you have cores and still get useful work done, since a number of those threads will be sitting waiting for something else to complete. It's a complex subject, and there is no universal answer to 'how many threads should I use'. A good rule of thumb is to have a thread per core, then experiment a bit if you want more throughput. Testing under real conditions is the only real way to find the sweet spot (a toy version of such a test is sketched below). You might find that you only need one thread to process the messages, and that half the time that thread is blocked waiting for a message to come in...
Obviously, even what I've described is oversimplified. Windows needs access to the cores to do OS-level things, so even with 8 cores your 8 threads won't always all be running, because the OS's own threads take their turns... and then you have I/O threads, etc.
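NServiceBus itself runs on .NET, but the context-switching effect is runtime-agnostic. As a hypothetical sketch of the kind of test described above (in Java for illustration; the task count and the CPU-bound busy loop are made up, so substitute your real message handling, including its I/O):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class PoolSizeProbe {
        // Stand-in for a message handler; purely CPU-bound here.
        static void handleMessage() {
            long x = 0;
            for (int i = 0; i < 1_000_000; i++) {
                x += i * 31L;
            }
            if (x == 42) System.out.println(); // keep the loop from being optimized away
        }

        public static void main(String[] args) throws InterruptedException {
            for (int threads : new int[] {1, 4, 8, 16, 50, 100, 200}) {
                ExecutorService pool = Executors.newFixedThreadPool(threads);
                long start = System.nanoTime();
                for (int i = 0; i < 10_000; i++) {
                    pool.submit(PoolSizeProbe::handleMessage);
                }
                pool.shutdown();
                pool.awaitTermination(10, TimeUnit.MINUTES);
                System.out.printf("%3d threads -> %d ms%n",
                        threads, (System.nanoTime() - start) / 1_000_000);
            }
        }
    }

On a purely CPU-bound load like this, throughput typically plateaus near the core count and then degrades as switching overhead grows; with real I/O in the tasks, the plateau moves higher.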
Why are there so many wait states among the VMS/VAX process states?
All of the waits except one have to do with memory swapping or thread swapping.
The VAX architecture had virtual addressing. A program could access up to 1 gigabyte of address space, which was huge in 1977. If I remember correctly, 32 or 64 megabytes of memory was the standard. This meant that programs could access more memory than the machine actually had. VAX managed this virtual memory by paging memory to and from a disk drive.
Multiple users could use the VAX. This was accomplished with multiple user threads. Since the processor could only execute one instruction at a time, only one thread could be active at a time. Generally, a thread would run until an I/O instruction was encountered. The thread would be swapped out, and other threads allowed to execute, while the I/O instruction completed.
If you want to really feel what it was like back in the olden days, read Tracy Kidder's "The Soul of a New Machine". It's the story of the team that developed the Data General Eclipse MV/8000.
Because each one of them has its own purpose...
I have a WebSphere Portal application running four instances on a single box, and after about 7 days of runtime there is only 130-150 MB of address space free in native memory (measured with pmap). Somewhere in another 7-10 days the figure drops well below 100 MB (which we deem dangerous, and we start to recycle the JVM). If we don't do the recycle, the JVM eventually crashes with a SIGSEGV signal.
We've done some initial investigation into class counts and the size of JIT code. Class counts grow, but slowly, from 50k onwards: about a couple of hundred per day. JIT code size reaches about 210 MB after 7 days and grows by about 1 MB per day after that. In our previous experience, we don't find these to be sinister values.
What we need is to be able to break down what is in the native heap: whether it is threads (all thread counts appear normal, and we have fixed thread pools), string pools, constant pools, bytecode, or whatever else.
One lead we are trying now is reducing the reflection threshold to 0 (shutting off the bytecode accessors for reflectively created classes). This app uses a lot of pointcutting and a lot of reflection, so we're hoping there's a good chance this helps.
Any advice is welcome.
Might be a bit of back-and-forth, but have you enabled GC logging and made sure the Java heap isn't growing over time? Have you looked at your perm space? The SIGSEGV is an interesting one, though; I'd expect a more JVM-ish crash for any in-Java memory issue.
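For the heap and perm-space questions, the standard MXBeans give a cheap way to trend the numbers over days without attaching a profiler. A minimal sketch (the class name and one-minute interval are made up; it won't decompose the native heap itself, so keep pmap for that):

    import java.lang.management.ClassLoadingMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;

    public class MemTrend {
        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            ClassLoadingMXBean cls = ManagementFactory.getClassLoadingMXBean();
            while (true) {
                // A heap that keeps returning to the same level after GC is
                // healthy; steadily climbing non-heap usage or class counts
                // point at JIT code / class data instead.
                System.out.printf("heapUsed=%dKB nonHeapUsed=%dKB loadedClasses=%d%n",
                        mem.getHeapMemoryUsage().getUsed() / 1024,
                        mem.getNonHeapMemoryUsage().getUsed() / 1024,
                        cls.getLoadedClassCount());
                Thread.sleep(60_000);
            }
        }
    }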
After lengthy investigation, this ended up being a WebSphere bug: PK72252: CALLS TO CLASSLOADER.GETRESOURCEASSTREAM ARE SLOW. Fixed in 6.0.2.33.