How to allocate more CPU and RAM to SUMO (Simulation of Urban Mobility)

I have downloaded and unzipped sumo-win64-0.32.0 and am running sumo.exe on a powerful machine (64 GB RAM, Xeon E5-1650 v4 @ 3.6 GHz) for about 140k trips, 108k edges, and 25k vehicle types, all of which depart in the first 30 minutes of the simulation. I have noticed that my CPU is utilized only 30% and memory only 38%. Is there any way to increase the speed by forcing SUMO to use more CPU and RAM, or possibly to run it in parallel? The FAQ entry "Can SUMO be run in parallel (on multiple cores or computers)?" says:
The simulation itself always runs on a single core.
So it appears that parallel processing is not possible, but what about dedicating more CPU and RAM?

Windows usually shows CPU utilization such that 100% means all cores are used, so 30% is probably already more than one core, and there is no way of increasing that with a single-threaded application such as SUMO. Also, if your scenario fits within RAM completely, there is no point in increasing that either. You might want to try one of the several parallelization approaches for SUMO, but none of them got further than some toy examples (and none is in the official distribution), and the speed improvements are sometimes only marginal. Probably the best you can do is some profiling to find the performance bottlenecks and/or send your results to the developers.
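One way to check what that 30% actually means is to watch per-core utilization while the simulation runs. Below is a minimal sketch using the psutil package; the configuration file name scenario.sumocfg is a placeholder. On a 6-core/12-thread E5-1650 v4, one fully busy core shows up as only about 8% of total CPU in Task Manager.

```python
# Sketch: launch SUMO and print per-core CPU usage while it runs.
# Requires: pip install psutil. "scenario.sumocfg" is a placeholder name.
import subprocess
import psutil

proc = subprocess.Popen(["sumo.exe", "-c", "scenario.sumocfg"])

while proc.poll() is None:
    # Utilization per logical core over a 1-second window.
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)
    busiest = max(per_core)
    average = sum(per_core) / len(per_core)
    print(f"busiest core: {busiest:5.1f}%   average over all cores: {average:5.1f}%")
```

If one core sits near 100% while the rest idle, the simulation is already CPU-bound on a single thread, and giving it more RAM or more cores will not help.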

Related

Does low GPU utilization indicate a bad fit for GPU acceleration?

I'm running some GPU-accelerated PyTorch code and training it against a custom dataset, but while monitoring the state of my workstation during the process, I see GPU usage along the following lines:
I have never written my own GPU primitives, but I have a long history of doing low-level optimizations for CPU-intensive workloads and my experience there makes me concerned that while pytorch/torchvision are offloading the work to the GPU, it may not be an ideal workload for GPU acceleration.
When optimizing CPU-bound code, the goal is to get the CPU to perform as much (meaningful) work as possible in a unit of time: a supposedly CPU-bound task that shows only 20% CPU utilization (of a single core or of all cores, depending on whether the task is parallelizable) is a task that is not being performed efficiently, because the CPU is sitting idle when ideally it would be working towards your goal. Low CPU usage means that something other than number crunching is taking up your wall-clock time, whether it's inefficient locking, heavy context switching, pipeline flushes, blocking IO in the main loop, etc., which prevents the workload from properly saturating the CPU.
When looking at the GPU utilization in the chart above, and again speaking as a complete novice when it comes to GPU utilization, it strikes me that the GPU usage is extremely low and appears to be limited by the rate at which data is being copied into the GPU memory. Is this assumption correct? I would expect to see a spike in copy (to GPU) followed by an extended period of calculations/transforms, followed by a brief copy (back from the GPU), repeated ad infinitum.
I notice that despite the low (albeit constant) copy utilization, the GPU memory is constantly peaking at the 8GB limit. Can I assume the workload is being limited by the low GPU memory available (i.e. not maxing out the copy bandwidth because there's only so much that can be copied)?
Does that mean this is a workload better suited for the CPU (in this particular case with this RTX 2080 and in general with any card)?
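One way to test the copy-vs-compute assumption above is to time a single batch's host-to-device transfer and its forward pass separately with CUDA events. This is only an illustrative sketch; the batch shape and the tiny model are placeholders for whatever the real training loop uses.

```python
# Illustrative: compare time spent copying a batch to the GPU vs. computing on it.
# The model and batch shape are placeholders, not the actual training code.
import torch

device = torch.device("cuda")
model = torch.nn.Sequential(torch.nn.Conv2d(3, 64, 3), torch.nn.ReLU()).to(device)
batch_cpu = torch.randn(64, 3, 224, 224)  # pretend this came from the DataLoader

copy_start = torch.cuda.Event(enable_timing=True)
copy_end = torch.cuda.Event(enable_timing=True)
comp_start = torch.cuda.Event(enable_timing=True)
comp_end = torch.cuda.Event(enable_timing=True)

copy_start.record()
batch_gpu = batch_cpu.to(device, non_blocking=True)
copy_end.record()

comp_start.record()
out = model(batch_gpu)
comp_end.record()

torch.cuda.synchronize()
print("host->device copy:", copy_start.elapsed_time(copy_end), "ms")
print("forward pass:     ", comp_start.elapsed_time(comp_end), "ms")
```

If the copy time dominates, pinned memory (pin_memory=True in the DataLoader) and more loader workers are usually the first things to try.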

Why does higher core count lead to higher CPI?

I'm looking at a chart that shows that, in reality, increasing the core count of a CPU usually results in a higher CPI for most instructions, and that it usually also increases the total number of instructions the program executes. Why is this happening?
From my understanding, CPI should only increase when increasing the clock frequency, so the CPI increase doesn't make much sense to me.
What chart? What factors are they holding constant while increasing core count? Perhaps total transistor budget, so each core has to be simpler to have more cores?
Making a single core larger has diminishing returns, but building more cores has linear returns for embarrassingly parallel problems; hence Xeon Phi having lots of simple cores, and GPUs being very simple pipelines.
But CPUs that also care about single-thread performance / latency (instead of just throughput) will push into those diminishing returns and build wider cores. Many problems that we run on CPUs are not trivial to parallelize, so lots of weak cores is worse than fewer, faster cores. For a given problem size, the more threads you have, the more of its total time each thread spends communicating with other threads (and maybe waiting for data from them).
If you do keep each core identical when adding more cores, their CPI generally stays the same when running the same code. e.g. SPECint_rate scales nearly linearly with number of cores for current Intel/AMD CPUs (which do scale up by adding more of the same cores).
So that must not be what your chart is talking about. You'll need to clarify the question if you want a more specific answer.
You don't get perfectly linear scaling because cores do compete with each other for memory bandwidth, and space in the shared last-level cache. (Although most current designs scale up the size of last-level cache with the number of cores. e.g. AMD Zen has clusters of 4 cores sharing 8MiB of L3 that's private to those cores. Intel uses a large shared L3 that has a slice of L3 with each core, so the L3 per core is about the same.)
But more cores also means a more complex interconnect to wire them all together and to the memory controllers. Intel's many-core Xeon CPUs notably have worse single-thread bandwidth than quad-core "client" chips of the same microarchitecture, even though the cores are the same in both. See: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?
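A quick way to see the shared-resource effect described above is to run an identical, memory-bandwidth-bound job on an increasing number of processes and watch the per-process time grow. The sketch below is a toy experiment with made-up array sizes, not anything taken from the answer itself.

```python
# Toy experiment: identical memory-bound work on 1..N processes. Per-process
# time typically rises with the process count because the copies contend for
# shared memory bandwidth and last-level cache, even though each core runs
# exactly the same code.
import multiprocessing as mp
import time
import numpy as np

def stream_sum(_):
    a = np.ones(25_000_000, dtype=np.float64)  # ~200 MB, far larger than any LLC
    t0 = time.perf_counter()
    a.sum()                                    # bandwidth-bound reduction
    return time.perf_counter() - t0

if __name__ == "__main__":
    for n in (1, 2, 4, mp.cpu_count()):
        with mp.Pool(n) as pool:
            times = pool.map(stream_sum, range(n))
        print(f"{n:2d} processes: mean per-process time {sum(times) / len(times):.3f} s")
```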

Optimizing TensorFlow for a 32-core computer

I'm running TensorFlow code on an Intel Xeon machine with two physical CPUs, each with 8 cores and hyperthreading, for a grand total of 32 available virtual cores. However, when I run the code with the system monitor open, I notice that only a small fraction of these 32 vCores is used and that the average CPU usage is below 10%.
I'm quite the TensorFlow beginner and I haven't configured the session in any way. My question is: should I somehow tell TensorFlow how many cores it can use? Or should I assume that it is already trying to use all of them but there is a bottleneck somewhere else (for example, slow access to the hard disk)?
TensorFlow will attempt to use all available CPU resources by default. You don't need to configure anything for it. There can be many reasons why you might be seeing low CPU usage. Here are some possibilities:
The most common case, as you point out, is a slow input pipeline.
Your graph might be mostly linear, i.e. a long, narrow chain of operations on relatively small amounts of data, each depending on the outputs of the previous one. When a single operation is running on smallish inputs, there is little benefit in parallelizing it.
You can also be limited by memory bandwidth.
Each session.run() call may take very little time, so you end up spending most of your time going back and forth between Python and the execution engine.
You can find useful suggestions here.
Use the timeline to see what is executed when.
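For reference, here is a minimal sketch of the two knobs mentioned above (thread-pool sizing and the timeline), written against the TensorFlow 1.x Session API that this question is using. The 4096x4096 matmul graph is just a stand-in for a real model.

```python
# Sketch: configure TensorFlow 1.x thread pools and dump a Chrome-trace timeline.
# The matmul workload is a placeholder, not the asker's model.
import tensorflow as tf
from tensorflow.python.client import timeline

config = tf.ConfigProto(
    intra_op_parallelism_threads=32,  # threads available to a single op (e.g. one big matmul)
    inter_op_parallelism_threads=2,   # independent ops that may run concurrently
)

a = tf.random_normal([4096, 4096])
b = tf.random_normal([4096, 4096])
c = tf.matmul(a, b)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session(config=config) as sess:
    sess.run(c, options=run_options, run_metadata=run_metadata)

# Open chrome://tracing and load timeline.json to see which ops ran when,
# and on how many threads.
trace = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())
```

If the trace shows one long op surrounded by idle threads, the graph itself is the bottleneck; if it shows mostly gaps between tiny ops, the input pipeline or Python overhead is.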

How fast is PhysX on GPU compared to physics engines on CPU?

I have an application that is written to use the Bullet physics engine. I am running it on an Intel i7-2600K CPU (4 cores, 8 hardware threads). The application has to process millions of chunks of physics work, each of which can be done independently. It currently runs with 8 processes, each working through its share of the total independently. In summary, this work has a lot of easy parallelism.
Assuming that I can acquire the best NVIDIA consumer graphics card (say Titan), what is the ballpark improvement in the physics engine performance I can see by switching from Bullet on CPU to Physx on GPU? That is, approximately how much faster will this application run if rewritten for Physx?
I found a few papers that compare the result quality between Bullet and Physx, but could not find anything about the performance comparison.
Pierre Terdiman has done an extensive series of performance comparisons between Bullet 2.81 and PhysX 2.8.4, 3.2 and 3.3 here. These are comparisons between Bullet and PhysX, both running on the CPU. It can be seen that the performance difference between the two depends on which features of the engine are being used. For a few features the performance is about the same, while for most others there is a 3-5x speedup.
He also mentions in the addendum that not all physics features have been ported to PhysX on the GPU. Cloth and particles can be GPU-accelerated, while rigid body simulation is currently being ported in a feature called GPU Rigid Bodies (GRB). If a feature is GPU-accelerated, you can expect it to be faster than on the CPU, but by how much is not clear.
I found this; it's not a comparison against any specific CPU physics engine, but one hopes they are comparing like with like and running PhysX on the CPU.
It's rather unspecific, and it comes from a FAQ by the makers of PhysX, so take it with a pinch of salt.
From here:
"Running PhysX on a mid-to-high-end GeForce GPU will enable 10-20 times more effects and visual fidelity than physics running on a high-end CPU."
Let's say PhysX is doing particle interactions such as gravity or fluid movement. Then cache control is very important, since those workloads are embarrassingly parallel. You cannot directly control your CPU's cache, but you can explicitly use the on-chip shared memory of a Titan, which can make it maybe 100x faster than an 8-thread CPU.
If the work is not so parallel, has a lot of branching, and doesn't involve heavy computation, then the speedup is more like 5-10x (or roughly the bandwidth ratio of graphics RAM to main RAM).
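To put a rough number on that bandwidth-ratio argument, here is a back-of-the-envelope calculation using approximate published peak figures for the hardware mentioned in the question; treat both numbers as ballpark values, not measurements.

```python
# Ballpark arithmetic for the bandwidth-ratio argument above.
# Both figures are approximate published peak numbers, not measurements.
titan_bandwidth_gbps = 288.0   # GTX Titan GDDR5 bandwidth, GB/s (approx.)
ddr3_bandwidth_gbps = 21.0     # i7-2600K dual-channel DDR3-1333, GB/s (approx.)

print(f"bandwidth ratio ~ {titan_bandwidth_gbps / ddr3_bandwidth_gbps:.1f}x")  # ~13.7x
```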

Hadoop cluster: 2 fast, 4 medium, or 8 slower machines?

We're going to purchase some new hardware to use just for a Hadoop cluster and we're stuck on what we should buy. Say we have a budget of $5k: should we buy two super nice machines at $2500 each, four at around $1200 each, or eight at around $600 each? Will Hadoop work better with more, slower machines or with fewer, much faster machines? Or, as with most things, "it depends"? :-)
You're generally better off with Hadoop getting a few extra machines that are less beefy. You almost never see datanodes with more than 16GB ram and dual quad-core CPUs, and often they are smaller than that.
You always have to run one as the namenode (master), and generally you don't also run a datanode (worker/slave) on the same box, although you could since your cluster is small. Assuming you don't, though, getting 2 machines will leave you only 1 worker node, which somewhat defeats the purpose. (Not entirely, because you can still run 4-8 jobs in parallel on the slave, but still.)
At the same time, you don't want to have a cluster of 1000 486s. If your budget is $5k, I would strike a balance and do 4 $1200 machines. Those will provide a decent baseline in terms of individual performance, you'll have 3 datanodes to distribute work to, and you'll have room to grow your cluster if you need.
Things to keep in mind: you'll want to run multiple map or reduce tasks per datanode, and that means multiple JVMs running simultaneously. I would try to get at least 4 GB, and preferably 8 GB, of RAM. CPU is less important, as most MR jobs are IO-bound. You could likely get a machine like this for your $1200 price target, so that's my vote.
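As a rough back-of-the-envelope check on those RAM numbers, you can estimate task slots per datanode as sketched below; the 1 GB heap per task and the 2 GB reserved for the OS and daemons are illustrative assumptions, not Hadoop defaults.

```python
# Back-of-the-envelope task-slot sizing for one datanode.
# The heap size per task and the OS/daemon reservation are assumptions.
ram_gb = 8
cores = 8
heap_per_task_gb = 1.0   # assumed JVM heap per map/reduce task
reserved_gb = 2.0        # assumed for the OS, DataNode/TaskTracker daemons, disk buffers

slots_by_ram = int((ram_gb - reserved_gb) / heap_per_task_gb)
slots = min(cores, slots_by_ram)
print(f"map+reduce slots per node: {slots}")  # 6 with these numbers
```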
In a nutshell, you want to max out the number of processor cores and disks. You can sacrifice reliability and quality, but don't get the cheapest hardware out there, as you will have too many reliability problems.
We went with dual-CPU, quad-core Dell servers, so 8 cores per box, and 16 GB of memory per box, which is 2 GB per core: a bit low, as you need memory both for your tasks and for disk buffering. 5 x 500 GB hard drives, and I wish we'd gone for terabyte or larger drives instead.
For drives, my opinion is to buy more cheap, slow, unreliable, high-capacity drives as opposed to more expensive, faster, smaller, reliable drives. If you're having problems with disk throughput, more memory will help with buffering.
This is probably a beefier configuration than you're looking at, but maxing out cores and drives versus buying more boxes is generally a good choice: lower power costs, easier to administer, and faster for some operations.
More drives means more simultaneous disk throughput per core, so having as many drives as cores is a good thing. Benchmarking seems to indicate that RAID configurations are slower than a JBOD configuration (just mounting the drives and having Hadoop spread load across them), and JBOD is also more reliable.
LAST! Be sure to get ECC memory. Hadoop pushes terabytes of data through memory, and some users have found that non-ECC memory configurations can occasionally introduce single-bit errors into terabyte-sized datasets. Debugging these errors is a nightmare.
I recommend having a look at this presentation: http://www.cloudera.com/hadoop-training-thinking-at-scale
The various pros and cons are described there.
I think the answer also depends on your expectations of cluster growth and on the networking technology you are using. If you are OK with 1 Gbit Ethernet, then the type of machine is less significant. At the same time, if you want 10 Gbit Ethernet, you should opt for a smaller number of better machines to reduce the cost of the networking.
Another reference: http://hadoopilluminated.com/hadoop_book/Hardware_Software.html
(Disclaimer: I am a co-author of this free Hadoop book.)