I have a question regarding resources and threads (it's not clear to me from the documentation).
Are the resources per thread?
That's the case with various HPC job submission systems; for example, that's how jobs work with LSF's bsub:
If I request 64 threads with 1024 MiB each, bsub will schedule a job with 64 processes, each consuming 1024 MiB individually, and thus consuming 64 GiB in total.
(That total memory may or may not be on the same machine, as the 64 processes may or may not be on the same machine depending on the span[...] parameters. For OpenMPI uses it might well be 64 different machines, each allocating its own local 1024 MiB chunk. But with span[hosts=1], it's going to be a single machine with 64 threads and 64 GiB of memory.)
When looking at the LSF profile, mem_mb seems to be passed from resources to bsub with only unit conversion but otherwise the same value;
thus it seems that Snakemake and LSF both assume that total_memory = threads * mem_mb.
I just wanted to make sure this assumption is correct.
Upon further analysis, the resources accounting in jobs.py is in contradiction with the above.
Filing a bug report.
Related
This is an interview question I encountered today. I have some knowledge about operating systems but am not really proficient. I think maybe there is a limit to how many threads each process can create?
Any ideas will help.
This question can be viewed [at least] in two ways:
Can your process get more CPU time by creating many threads that need to be scheduled?
or
Can your process get more CPU time by creating threads to allow processing to continue when other threads are blocked?
The answer to #1 is largely system dependent. However, any rationally-designed system is going to protect against rogue processes trying this. Generally, the answer here is NO. In fact, some older systems only schedule processes, not threads. In those cases, the answer is always NO.
The answer to #2 is generally YES. One of the reasons to use threads is to allow a process to continue processing while it has to wait on some external event.
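As a rough illustration of #2 (a minimal C++ sketch, with sleep_for standing in for a blocking call such as a disk or network read), the process keeps computing on one thread while another thread is blocked:

    #include <chrono>
    #include <iostream>
    #include <thread>

    int main() {
        std::thread io_thread([] {
            // Pretend this is a blocking read that takes 500 ms.
            std::this_thread::sleep_for(std::chrono::milliseconds(500));
            std::cout << "I/O finished\n";
        });

        long long sum = 0;
        for (int i = 0; i < 100000000; ++i)   // CPU work that continues in the meantime
            sum += i;
        std::cout << "compute result: " << sum << "\n";

        io_thread.join();
    }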
The number of threads that can run in parallel depends on the number of CPUs on your machine.
It also depends on the characteristics of the processes you're running. If they're CPU-bound, it won't be efficient to run more threads than the number of CPUs on your machine; on the other hand, if they do a lot of I/O, or any other kind of task that blocks a lot, it can make sense to increase the number of threads.
As for the question "how many" - you'll have to tune your app, make measurements and decide based on actual data.
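As a starting point before measuring, something like the following is a common first guess (a sketch only; std::thread::hardware_concurrency() just reports the number of hardware threads and can return 0):

    #include <iostream>
    #include <thread>

    int main() {
        unsigned n = std::thread::hardware_concurrency();  // number of hardware threads, or 0 if unknown
        if (n == 0) n = 4;                                  // arbitrary fallback
        std::cout << "starting " << n << " worker threads\n";
        // ... create n workers for CPU-bound work; go higher for I/O-heavy work and measure ...
    }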
Short answer: Depends on the OS.
I'd say it depends on how the OS scheduler is implemented.
From personal experience with my hobby OS, it can certainly happen.
In my case, the scheduler is implemented with a round-robin algorithm, per thread, independent of which process they belong to.
So, if process A has 1 thread, and process B has 2 threads, and they are all busy, Process B would be getting 2/3 of the CPU time.
There are certainly a variety of approaches. Check Scheduling_(computing)
Throw in priority levels per process and per thread, and it really depends on the OS.
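To make the 2/3 example concrete, here is a toy model (not my scheduler's actual code, just an illustration of per-thread round robin, where the scheduler only sees threads):

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        // Process A has 1 busy thread, process B has 2.
        std::vector<std::string> thread_owner = { "A", "B", "B" };
        std::map<std::string, int> slices;

        const int total = 3000;                   // hand out 3000 equal time slices round-robin
        for (int t = 0; t < total; ++t)
            ++slices[thread_owner[t % thread_owner.size()]];

        for (const auto& p : slices)
            std::printf("process %s: %.1f%% of CPU time\n",
                        p.first.c_str(), 100.0 * p.second / total);
        // Prints roughly: process A: 33.3%, process B: 66.7%
    }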
I have been confused about the issue of context switches between processes, given a round-robin scheduler with a certain time slice (which is what Unix/Windows both use in a basic sense).
So, suppose we have 200 processes running on a single core machine. If the scheduler is using even 1ms time slice, each process would get its share every 200ms, which is probably not the case (imagine a Java high-frequency app, I would not assume it gets scheduled every 200ms to serve requests). Having said that, what am I missing in the picture?
Furthermore, Java and other languages allow you to put the running thread to sleep for e.g. 100 ms. Am I correct in saying that this does not cause a context switch, and if so, how is this achieved?
So, suppose we have 200 processes running on a single core machine. If the scheduler is using even 1ms time slice, each process would get its share every 200ms, which is probably not the case (imagine a Java high-frequency app, I would not assume it gets scheduled every 200ms to serve requests). Having said that, what am I missing in the picture?
No, you aren't missing anything. The same applies in the case of non-pre-emptive systems. Processes with pre-emptive rights (meaning a higher priority compared to other processes) can easily displace less important processes, to the extent that a high-priority process might run, say, ten times as often as the lowest-priority one (the actual ratio depends entirely on the situation and the implementation), as long as the former doesn't starve the lowest-priority process.
For processes of similar priority, it depends entirely on the round-robin algorithm you've mentioned, though which process gets picked first is again implementation-dependent. Windows and Unix use broadly similar, round-robin-style process scheduling, but the Linux task scheduler is the Completely Fair Scheduler (CFS).
Furthermore, Java and other languages allow you to put the running thread to sleep for e.g. 100 ms. Am I correct in saying that this does not cause a context switch, and if so, how is this achieved?
Programming languages and libraries implement "sleep" functionality with the aid of the kernel. Without kernel-level support, they'd have to busy-wait, spinning in a tight loop, until the requested sleep duration elapsed. This would wastefully consume the processor.
As for threads that are put to sleep (Thread.sleep(long millis)), the following is generally done in most systems:
Suspend execution of the process and mark it as not runnable.
Set a timer for the given wait time. Systems provide hardware timers that let the kernel register to receive an interrupt at a given point in the future.
When the timer hits, mark the process as runnable.
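A minimal C++ analogue of the contrast (Thread.sleep in Java behaves the same way conceptually): the busy-wait version occupies a core for the whole wait, while sleep_for asks the kernel to park the thread and wake it from a timer interrupt:

    #include <chrono>
    #include <thread>

    void busy_wait_100ms() {
        auto deadline = std::chrono::steady_clock::now() + std::chrono::milliseconds(100);
        while (std::chrono::steady_clock::now() < deadline) {
            // spinning: consumes a full core for the entire 100 ms
        }
    }

    void kernel_sleep_100ms() {
        // Thread is marked not runnable; a kernel timer makes it runnable again.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    int main() {
        kernel_sleep_100ms();   // the thread is off the CPU for ~100 ms
        busy_wait_100ms();      // the thread occupies a core for ~100 ms
    }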
I hope you are aware of threading models like one-to-one, many-to-one, and many-to-many. So I am not getting into much detail, just a reference for yourself.
It might appear to you as if this increases the overhead/complexity. But that's how threads (user threads created in the JVM) are operated on, and the mapping is then based upon the threading models I mentioned above. Check this Quora question and the answers to it, and please go through the best answer, given by Robert Love.
For further reading, I'd suggest the Scheduling Algorithms explanation on OSDev.org and the book Operating System Concepts by Silberschatz, Galvin, and Gagne.
I have a long running (5-10 hours) Mac app that processes 5000 items. Each item is processed by performing a number of transforms (using Saxon), running a bunch of scripts (in Python and Racket), collecting data, and serializing it as a set of XML files, a SQLite database, and a CoreData database. Each item is completely independent from every other item.
In summary, it does a lot, takes a long time, and appears to be highly parallelizable.
After loading up all the items that need processing, the app uses GCD to parallelize the work, using dispatch_apply:
dispatch_apply(numberOfItems, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^(size_t i) {
    @autoreleasepool {
        ...
    }
});
I'm running the app on a Mac Pro with 12 cores (24 virtual). So I would expect to have 24 items being processed at all times. However, I found through logging that the number of items being processed varies between 8 and 24. This is literally adding hours to the run time (assuming it could work on 24 items at a time).
On the one hand, perhaps GCD is really, really smart and it is already giving me the maximum throughput. But I'm worried that, because much of the work happens in scripts that are spawned by this app, maybe GCD is reasoning from incomplete information and isn't making the best decisions.
Any ideas how to improve performance? After correctness, the number one desired attribute is shortening how long it takes this app to run. I don't care about power consumption, hogging the Mac Pro, or anything else.
UPDATE: In fact, this looks alarming in the docs: "The actual number of tasks executed by a concurrent queue at any given moment is variable and can change dynamically as conditions in your application change. Many factors affect the number of tasks executed by the concurrent queues, including the number of available cores, the amount of work being done by other processes, and the number and priority of tasks in other serial dispatch queues." (emphasis added) It looks like having other processes doing work will adversely affect scheduling in the app.
It'd be nice to be able to just say "run these blocks concurrently, one per core, don't try to do anything smarter".
If you are bound and determined, you can explicitly spawn 24 threads using the NSThread API, and have each of those threads pull from a synchronized queue of work items. I would bet money that performance would get noticeably worse.
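For what it's worth, the shape of that approach is roughly the following (sketched with portable std::thread rather than NSThread; processItem() is a stand-in for your per-item work):

    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Stand-in for the per-item work (transforms, scripts, serialization).
    void processItem(int /*item*/) { /* ... */ }

    int main() {
        std::queue<int> work;
        for (int i = 0; i < 5000; ++i) work.push(i);

        std::mutex m;
        std::vector<std::thread> workers;
        unsigned n = std::thread::hardware_concurrency();  // e.g. 24 on the Mac Pro
        if (n == 0) n = 4;

        for (unsigned t = 0; t < n; ++t) {
            workers.emplace_back([&] {
                for (;;) {
                    int item;
                    {
                        std::lock_guard<std::mutex> lock(m);
                        if (work.empty()) return;   // queue drained: worker exits
                        item = work.front();
                        work.pop();
                    }
                    processItem(item);              // the long-running work happens outside the lock
                }
            });
        }
        for (auto& w : workers) w.join();
    }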
GCD works at its most efficient when the work items submitted to it never block. That said, the workload you're describing is rather complex and rife with opportunities for your threads to block. For starters, you're spawning a bunch of other processes. Right here, this means that you're already relying on the OS to divvy up time/resources between your master task and these slave tasks. Other than setting the OS priority of each subprocess, the OS scheduler has no way to know which processes are more important than others, and by default, your subprocesses are going to have the same priority as their parent. That said, it doesn't sound like you have anything to gain by tweaking process priorities. I'm assuming you're blocking the master task thread that's waiting for the slave tasks to complete. That is effectively parking that thread -- it can do no useful work. But like I said, I don't think there's much to be gained by tweaking the OS priorities of your slave tasks, because this really sounds like it's an I/O bound workflow...
You go on to describe three I/O-heavy operations ("serializing it as a set of XML files, a SQLite database, and a CoreData database.") So now you have all these different threads and processes vying for what is presumably a shared bulk storage device. (i.e. unless you're writing to 24 different databases, on 24 separate hard drives, one for each core, your process is ultimately going to be serialized at the disk accesses.) Even if you had 24 different hard drives, writing to a hard drive (even an SSD) is comparatively slow. Your threads are going to be taken off of the CPU they were running on (so that another thread that's waiting can run) for virtually any blocking disk write.
If you wanted to maximize the performance you're getting out of GCD, you would probably want to rewrite all the stuff you're doing in subtasks in C/C++/Objective-C, bringing them in-process, and then conducting all the associated I/O using dispatch_io primitives. For API where you don't control the low-level reads and writes, you would want to carefully manage and tune your workload to optimize it for the hardware you have. For instance, if you have a bunch of stuff to write to a single, shared SQLite database, there's no point in ever having more than one thread trying to write to that database at once. You'd be better off making one thread (or a serial GCD queue) to write to SQLite and submitting tasks to that after pre-processing is done.
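A sketch of that last idea (hypothetical names, shown with std::thread rather than a GCD serial queue): worker threads submit write jobs, and one dedicated thread drains them, so the shared SQLite database is only ever touched from a single thread:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <utility>

    class SerialWriter {
    public:
        SerialWriter() : done_(false), thread_([this] { run(); }) {}
        ~SerialWriter() {
            {
                std::lock_guard<std::mutex> lock(m_);
                done_ = true;
            }
            cv_.notify_one();
            thread_.join();
        }
        // job performs the actual write (e.g. the SQLite INSERT).
        void submit(std::function<void()> job) {
            {
                std::lock_guard<std::mutex> lock(m_);
                jobs_.push(std::move(job));
            }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lock(m_);
                    cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                    if (jobs_.empty()) return;      // done and fully drained
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job();                              // only this thread ever touches the database
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> jobs_;
        bool done_;
        std::thread thread_;
    };

Workers would then call writer.submit(...) with a closure that performs the actual write, instead of opening the database themselves.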
I could go on for quite a while here, but the bottom line is that you've got a complex, seemingly I/O bound workflow here. At the highest-level, CPU utilization or "number of running threads" is going to be a particularly poor measure of performance for such a task. By using sub-processes (i.e. scripts), you're putting a lot of control into the hands of the OS, which knows effectively nothing about your workload a priori, and therefore can do nothing except use its general scheduler to divvy up resources. GCD's opaque thread pool management is really the least of your problems.
On a practical level, if you want to speed things up, go buy multiple, faster (i.e. SSD) hard drives, and rework your task/workflow to utilize them separately and in parallel. I suspect that would yield the biggest bang for your buck (for some equivalence relation of time == money == hardware.)
I have a program that takes about 1 second to run and takes a file as input and produces another file as output. Problem is I have to be able to process about 30 files a second. The files to process will be available as a queue (implemented over memcached) and don't have to be processed exactly in order, so basically an instance of the program checks out a file to process and does so. I could use a process manager that automatically launches instances of the program when system resources are available.
At the simple end, "system resources" will simply mean "up to two processes at a time," but if I move to a different machine this could be 2 or 10 or 100 or whatever. I could use a utility to handle this, at least. And at the complex end, I would like to bring up another process whenever CPU is available, since these machines will be dedicated. CPU time seems to be the constraining resource - the program isn't memory intensive.
What tool can accomplish this sort of process management?
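For illustration, this is roughly what I could hand-roll if no ready-made tool exists (a C++ sketch; the queue contents, fetchNextFile(), and ./process_one are placeholders for the memcached queue and the real program):

    #include <cstdlib>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <vector>

    std::mutex queue_mutex;
    std::vector<std::string> pending = { "a.dat", "b.dat" };  // stand-in for the memcached queue

    // Check a file out of the queue; an empty string means nothing is left.
    std::string fetchNextFile() {
        std::lock_guard<std::mutex> lock(queue_mutex);
        if (pending.empty()) return "";
        std::string f = pending.back();
        pending.pop_back();
        return f;
    }

    int main() {
        const unsigned n = 2;   // "up to two processes at a time"; would be tuned per machine
        std::vector<std::thread> workers;

        for (unsigned i = 0; i < n; ++i) {
            workers.emplace_back([] {
                for (;;) {
                    std::string file = fetchNextFile();
                    if (file.empty()) break;
                    // Run the existing ~1 s program as a child process (placeholder name).
                    std::system(("./process_one " + file).c_str());
                }
            });
        }
        for (auto& w : workers) w.join();
    }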
Storm - Without knowing more details, I would suggest Backtype Storm. But it would probably mean a total rewrite of your current code. :-)
More details at the Tutorial, but it basically takes tuples of work and distributes them through a topology of worker nodes. A "spout" emits work into the topology and a "bolt" is a step/task in the graph where some bit of work takes place. When a bolt finishes its work, it emits the same or a new tuple back into the topology. Bolts can do work in parallel or in series.
TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded-dual-quad-core-Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is on the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest as idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one, global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the evented-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service objects (say, one for each CPU core), each run by a single thread, you will not have this problem. See the HTTP server example 2 on the Boost.Asio page.
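A rough sketch of that pattern (using the older io_service/io_service::work spelling to match the Boost version discussed here, plus C++11 threads; newer Boost would use io_context and a work guard):

    #include <boost/asio.hpp>
    #include <memory>
    #include <thread>
    #include <vector>

    int main() {
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 2;

        std::vector<std::unique_ptr<boost::asio::io_service>> services;
        std::vector<std::unique_ptr<boost::asio::io_service::work>> keep_alive;
        std::vector<std::thread> threads;

        for (unsigned i = 0; i < n; ++i) {
            services.emplace_back(new boost::asio::io_service);
            // The work object keeps run() from returning while nothing is queued yet.
            keep_alive.emplace_back(new boost::asio::io_service::work(*services.back()));
            boost::asio::io_service* svc = services.back().get();
            threads.emplace_back([svc] { svc->run(); });  // exactly one thread per io_service
        }

        // ... accept on services[0]; give connection k a socket constructed from
        //     *services[k % n], so all of its handlers run on one fixed thread ...

        for (auto& t : threads) t.join();
    }

The acceptor then assigns each new connection to one of the io_service objects round-robin, so a given connection's handlers never contend with another thread inside the same reactor.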
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.
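For reference, the macro has to be visible before any Asio header is included (or be passed as a compiler define); it compiles out Asio's internal locking, which is only safe when the io_service is used from exactly one thread:

    // Either define it here, before the first Asio include,
    // or pass -DBOOST_ASIO_DISABLE_THREADS on the compiler command line.
    #define BOOST_ASIO_DISABLE_THREADS
    #include <boost/asio.hpp>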