Designing a Computer for Spatially-Explicit Modeling in NetLogo - hardware

I have done various searches and have yet to find a forum or article that discusses how to approach building a modeling computer for use with NetLogo. I was hoping to start such a discussion, and since the memory usage of NetLogo is proportional to the size of the world and number of simulations run in parallel with BehaviorSpace, it seems reasonable that a formula exists relating sufficient hardware to NetLogo demands.
As an example, I am planning to run a metapopulation model in a landscape approximately 12km x 12km, corresponding to a NetLogo world of 12,000x12,000 at a pixel size of 1, for a 1-meter resolution (relevant for the animal's movement behavior). An earlier post described a large world (How to model a very large world in NetLogo?), and provided a discussion for potential ways to reduce needing large worlds (http://netlogo-users.18673.x6.nabble.com/Re-Rumors-of-Relogo-td4869241.html#a4869247). Another post described a world of 3147x5141 and was using a Linux computer with 64GB of RAM (http://netlogo-users.18673.x6.nabble.com/Java-OutofMemory-errors-for-large-NetLogo-worlds-suggestions-requested-td5002799.html). Clearly, the capability of computers to run large NetLogo worlds is becoming increasingly important.
Presumably, the "best" solution for researchers at universities with access to Windows-based machines would be to run 16GB to 64GB of RAM with a six- or eight-core processor such as the Intel Xeon capable of hyperthreading for running multiple simulations in parallel with BehaviorSpace. As an example, I used SELES (Fall & Fall 2001) on a machine with a 6-core Xeon processor with hyperthreading enabled and 8GB of RAM to run 12,000 replicates of a model with a 1-meter resolution raster map of 1580x1580. This used the computer to its full capacity and it took about a month to run the simulations.
So - if I were to run 12,000 replicates of a 12,000x12,000 world in NetLogo, what would be the "best" option for a computer? Without reaching for the latest and greatest processing power out there, I would presume the most fiscally reasonable option to be a server board with dual Xeon processors (likely 8-core Ivy Bridge) with 64GB of RAM. Would this be a sufficient design, or are there alternatives that are cheaper (or not) for modeling at this scale? And additionally, do there exist "guidelines" of processor/RAM combinations to cope with the increasing demand of NetLogo on memory as the size of worlds and the number of parallel simulations increase?
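A minimal sketch of what such a formula might look like, assuming memory grows linearly with the number of patches and with the number of runs held in memory at once. The function name and the 100 bytes-per-patch figure are hypothetical placeholders, not measured NetLogo values; the real per-patch cost depends on the model's patch variables and on JVM overhead, and would have to be profiled on a small world first.

```python
def estimate_world_memory_gb(width, height, bytes_per_patch, parallel_runs=1):
    """Estimate memory needed for patches alone, in GB (1 GB = 1e9 bytes).

    bytes_per_patch is an ASSUMPTION to be measured per model; turtles,
    links and JVM overhead are ignored.
    """
    patches = width * height
    return patches * bytes_per_patch * parallel_runs / 1e9


if __name__ == "__main__":
    # 12,000 x 12,000 world from the question; 100 bytes/patch is hypothetical.
    one_run = estimate_world_memory_gb(12_000, 12_000, bytes_per_patch=100)
    four_runs = estimate_world_memory_gb(12_000, 12_000, 100, parallel_runs=4)
    print(f"patches: {12_000 * 12_000:,}")
    print(f"one run: {one_run:.1f} GB, four parallel runs: {four_runs:.1f} GB")
```

Under these assumed numbers a single run already needs more memory for patches than many desktop machines have, so the number of runs that can execute in parallel may be limited by RAM rather than by core count.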

Related

Is There An Exhaustive Profiler?

Relatively often I see people benchmark/profile (or advise someone else to benchmark/profile) a specific piece of code in one specific circumstance on one specific CPU in one specific computer; and then (potentially falsely) assume that this result applies to code in different circumstances (e.g. other logical CPUs in the same core under different loads) on a wide variety of very different CPUs (e.g. "all 64-bit 80x86") in a wide variety of different computers (e.g. with different RAM timings, etc).
What I'm looking for is a kind of profiler that is able to generate profiling results for many CPUs under many conditions (primarily through interpreting the code rather than direct measurement); and then combine all of the results using weighting factors (where weighting factors represent how much the user cares about each measured case) to create a result that is actually useful and not misleading.
Is there any profiling tool that fits this description?
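To make the "weighting factors" idea concrete, here is a minimal sketch of the combining step only, assuming the per-configuration estimates have already been produced somehow. The configuration names and cycle counts are invented for illustration and do not come from any real tool.

```python
def weighted_profile_score(results, weights):
    """Combine per-configuration cost estimates into one weighted mean.

    results: {config_name: estimated_cycles}
    weights: {config_name: how much the user cares about that case}
    """
    total_weight = sum(weights[name] for name in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(results[name] * weights[name] for name in results)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical cycle estimates for one function on three CPU/load cases.
    results = {"cpu_a_idle": 1000.0, "cpu_a_loaded": 1500.0, "cpu_b_idle": 2000.0}
    weights = {"cpu_a_idle": 1.0, "cpu_a_loaded": 2.0, "cpu_b_idle": 1.0}
    print(weighted_profile_score(results, weights))
```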
I don't think any universal performance-forecasting tool has been published on the Internet, though CPU vendors may have such tools in-house for tuning their next microarchitectures.
There is the Valgrind binary instrumentation platform, with its (slow) simple-model profilers callgrind and cachegrind. Callgrind counts basic-block executions under a model in which one instruction costs roughly one CPU clock; cachegrind additionally instruments memory accesses with a two-level cache model and can also model a simple branch predictor. Neither tool has any knowledge or model of the wide decode/execute/retire capabilities of modern out-of-order (OOO) CPUs from Vendor 1 and Vendor 2 of "all 64-bit 80x86"-compatible CPUs (and OOO CPUs are similar in their basic OOO capabilities and performance).
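To illustrate the kind of simple model described here, the following is a toy sketch only: one cycle per instruction plus fixed penalties from a two-level cache. It is not Valgrind's implementation; the direct-mapped caches, their sizes, and the 10/100-cycle penalties are all assumptions chosen to keep the example short.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that only tracks which line sits in each slot."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.slots = [None] * num_lines
        self.misses = 0

    def access(self, address):
        """Return True on a hit; on a miss, load the line and return False."""
        line = address // self.line_size
        index = line % self.num_lines
        if self.slots[index] == line:
            return True
        self.slots[index] = line
        self.misses += 1
        return False


def estimate_cycles(addresses, l1, l2, l1_penalty=10, l2_penalty=100):
    """Each trace entry is one memory instruction: 1 cycle plus miss penalties."""
    cycles = 0
    for address in addresses:
        cycles += 1  # assumed cost: one instruction is about one clock
        if not l1.access(address):
            cycles += l1_penalty  # hypothetical L1 miss penalty
            if not l2.access(address):
                cycles += l2_penalty  # hypothetical L2 miss penalty
    return cycles


if __name__ == "__main__":
    l1 = DirectMappedCache(num_lines=4, line_size=64)
    l2 = DirectMappedCache(num_lines=64, line_size=64)
    trace = list(range(0, 256, 8))  # 32 accesses spanning four cache lines
    print("cold pass:", estimate_cycles(trace, l1, l2))
    print("warm pass:", estimate_cycles(trace, l1, l2))
```

Even this toy shows why such results do not transfer between machines: the estimate depends entirely on the cache geometry and penalties plugged in, and nothing in it models out-of-order execution.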
There have been several open-source OOO CPU simulators (from slow to very slow), such as MARSSx86 (http://marss86.org/, 2012), based on PTLsim (http://www.facom.ufms.br/~ricardo/Courses/CompArchII/Tools/PTLSim/PTLsimManual.pdf, 2007), or the Sniper Multi-Core Simulator (built with the help of the Graphite framework). There is also the DRAMSim/DRAMSim2 memory simulator, which is needed for accurate system simulation and is used in several other simulator projects; it can optionally be used in the RISC-V Rocket-Chip simulator.
You may be interested in a cycle-accurate or microarchitecture simulator (very, very slow: tens of KIPS), but there are not many open-source ones. There are some commercial simulators (for example, in the ARM world, ARM Cycle Models / CPAKs; ARC nSIM, ...) and simplescalar.com. There are also in-house proprietary simulators, to which we have no access.
The only public approximation to a microarchitecture/cycle simulator is Vendor 1's IACA: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer (an inexact, partial model of OOO port planning for short code sequences such as inner loops, without any memory hierarchy modeling). There is also another tool from Vendor 1, "SDE", which uses the PIN binary rewriting tool to estimate and debug future CPU instruction extensions on older CPUs: https://software.intel.com/en-us/articles/intel-software-development-emulator.

Which kinds of low level facilities aren't typically supported on multi-core machines?

I'm looking at some optimized, low level, cross platform, concurrency code designed to run on multi-core machines, and want to check some of its assumptions.
Some kinds of hardware optimization are probably not supported on multi-core designs (for example, Out of Order Execution support [wikipedia] seems like a good candidate: it takes a lot of surface area to implement and can be a power hog). Does anyone have a list of other such facilities, ones typically available on machines with one or a few cores but typically left out of machines with a larger number of cores?
Today, multicore machines are warmed-over die shrinks of uniprocessors. You could almost imagine sawing a 4-core die into 4 1-core dice. I exaggerate only a little bit.
In future, multicore machines will be more thoughtfully designed for energy efficiency and area efficiency. You may see the same ISA, but with different mixes of resources (more or fewer duplicated functional units), and even with some sharing of resources between cores (e.g. AMD Bulldozer). And, as you say, designers will back off from the complexity and energy overhead of no-holds-barred out-of-order execution. This will most likely show up as differences in instructions per clock (IPC), that is, more or less performance, on the same instruction set architecture.
Also as vendors have to juggle a hypothetical portfolio of big out-of-order serial performance optimized cores and small in-order or less-out-of-order (OoO) and narrower, more energy efficient "throughput" cores, they will be challenged to keep these different implementations in sync with the evolutions of their ISAs. Some cores may support new instructions, new state, new coprocessors, virtualization, security, etc. earlier than others. This leads to a challenge of coding to the common denominator while also lighting up the new facilities for better perf or energy efficiency (or whatever) on those cores that have the new capabilities.
So to answer your specific question, all the traditional computer architecture techniques for trading gates for expressive-power, or performance, or energy efficiency may be rethought and selectively removed in future small throughput-oriented cores:
Hardware multithreading
Aggressive OoO -> humble OoO or even in-order execution
High degrees of microarchitectural speculation
Fancy branch predictors
Big TLBs
Fancy memory prefetchers
Deep pipelines
Wide issue / many copies of functional units
Big caches, wide buses to caches
...
But it goes both ways. It may also be that the new small throughput-optimized energy-optimized cores have new features not present in the older OoO cores. For example, the Larrabee New Instructions (LRBni) (http://www.drdobbs.com/high-performance-computing/216402188) were proposed for a machine with dozens of simpler cores. As another example, the small cores may turn to hardware multithreading to afford better memory latency tolerance to compensate for smaller private caches.
Also, having lots of small energy frugal cores means you may be willing to dedicate and therefore customize some of the cores to optimize performance for particular valuable workloads. For example, the Tensilica custom processors and tools anticipate that some of your small cores will have additional instructions and custom problem-specific datapaths (accelerating an inner loop of video decoding, for example). So in these cases the little core may (counter-intuitively) have much better performance than the much larger core.
Makes sense?
Happy hacking!

CPU Cards for Parallel Computation?

I remember reading some time ago that there were cpu cards for systems to add additional processing power to do mass parallelization. Anyone have any experience on this and any resources to get looking into the hardware and software aspects of the project? Is this technology inferior to a traditional cluster? Is it more power conscious?
There are two cool options. One is the use of GPUs, as Mitch mentions. The other is to get a PS3, which has a multicore Cell processor.
You can also set up multiple inexpensive motherboard PCs and run Linux and Beowulf.
GPGPU is probably the most practical option for an enthusiast. However, DSPs are another option, such as those made by Texas Instruments, Freescale, Analog Devices, and NXP Semiconductors. Granted, most of those are probably targeted more towards industrial users, but you might look into the Storm-1 line of DSPs, some of which are supposed to go for as low as $60 a piece.
Another option for data parallelism is a Physics Processing Unit such as the Nvidia (formerly Ageia) PhysX. The most obvious use of these coprocessors is for games, but they're also used for scientific modeling, cryptography, and other vector processing applications.
ClearSpeed Attached Processors are another possibility. These are basically SIMD co-processors designed for HPC applications, so they might be out of your price range, but I'm just guessing here.
All of these suggestions are based around data parallelism since I think that's the area with the most untapped potential. A lot of currently CPU-intensive applications could be performed much faster at much lower clock rates (and using less power) by simply taking advantage of vector processing and more specialized SIMD instruction sets.
Really, most computer users don't need more than an Intel Atom processor for the majority of their casual computing needs: e-mail, browsing the web, and playing music/video. And for the other 10% of computing tasks that actually do require lots of processing power, a general-purpose scalar processor typically isn't the best tool for the job anyway.
Even most people who do have serious processing needs only need it for a narrow range of applications; a physicist doesn't need a PC capable of playing the latest FPS; a sound engineer doesn't need to do scientific modeling or perform statistical analysis; and a graphic designer doesn't need to do digital signal processing. Domain-specific vector processors with highly specialized instruction sets (like modern GPUs for gaming) would be able to handle these tasks much more efficiently than a high power general-purpose CPU.
Cluster computing is no doubt very useful for a lot of high end industrial applications like nuclear research, but I think vector processing has much more practical uses for the average person.
Have you looked at the various GPU computing options? Nvidia (and probably others) are offering personal supercomputers based around utilising the power of graphics cards.
OpenCL is an industry-wide standard for HPC across different vendors and processor types (single-core, multi-core, graphics cards, Cell, etc.); see http://en.wikipedia.org/wiki/OpenCL.
The idea is that, using a single code base, you can use all the spare processing capacity on the machine regardless of the type of processor.
Apple has implemented this standard in the next version of Mac OS X. There will also be offerings from nVIDIA, ATI, Intel etc.
Mercury Computing offers a Cell Accelerator Board: a PCIe card that has a Cell processor and runs Yellow Dog Linux, or Mercury's flavor of YDL. Fixstars offers a more powerful Cell PCIe board called the GigaAccel. I called up Mercury, and they said their board is about $5000 USD, without software. I'd guess the GigaAccel is up to twice as expensive.
I found one of the Mercury boards used, but it didn't come with a power cable, so I haven't been able to use it yet, sadly.

What makes a modern commodity cluster?

What would be the most cost-effective way of implementing a terabyte distributed memory cache using commodity hardware these days? What would count as a piece of commodity hardware?
Commodity hardware is considered hardware that:
Is off the shelf (nothing custom)
Is available in substantially similar versions from many manufacturers
There are many motherboards that can hold 8 or 16 GB of RAM. Fewer server motherboards can hold 32 and even 64GB.
But they fit the definition of commodity, and can therefore be built into very large clusters for a very large sum of money.
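As a back-of-the-envelope sketch of what "very large" means for a terabyte cache, the snippet below counts nodes for each of the board sizes mentioned above. It assumes 1 TB = 1024 GB; the `reserved_per_node_gb` parameter (RAM held back for the OS and cache software) is a hypothetical knob, not a measured figure.

```python
import math


def nodes_needed(target_gb, ram_per_node_gb, reserved_per_node_gb=0):
    """Number of nodes needed so usable RAM across the cluster reaches target_gb."""
    usable = ram_per_node_gb - reserved_per_node_gb
    if usable <= 0:
        raise ValueError("each node must have usable RAM left after the reservation")
    return math.ceil(target_gb / usable)


if __name__ == "__main__":
    for ram in (8, 16, 32, 64):
        print(f"{ram:>2} GB boards: {nodes_needed(1024, ram)} nodes")
```

Holding back even a couple of gigabytes per node nudges the count upward, for example from 16 to 17 nodes with 64 GB boards and a 2 GB reservation.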
Note, however, that in many access patterns a striped RAID HD array doesn't go much slower than a gigabit ethernet link - so a RAM cluster might not have significant improvement (except in latency) depending on how you're actually using it.
-Adam

Hardware requirements for a Virtual Server

We have decided to go with a virtualization solution for a few of our development servers. I have an idea of what the hardware specs would be like if we bought separate physical servers, but I have no idea how to consolidate that information into the specification for a generalized virtual server.
I know intuitively that the specs are not additive - I shouldn't just add up all the RAM requirements from each machine to get the RAM required for the virtual server. I can't really treat them as parallel systems either because no matter how good the virtualization software is, it can't abstract away two servers trying to peg the CPU at the same time.
So my question is - is there a standard method for estimating the hardware requirements of a virtualized system, given hardware requirement estimates for the underlying virtual machines? Is there a +C constant for VMWare/MS Virtual Server overhead (and if so, what is C)?
P.S. I promise to move this over to serverfault once it goes into beta (Promise kept)
Yes, add 25% additional resources to manage the VMs. So if I need 4 servers that are each equivalent to a single-core 2 GHz machine with 2 GB of RAM, I will need 10 GHz of processing power plus 10 GB of RAM (4 x 2 = 8, plus 25%). This will allow all systems to redline and still be OK.
In the real world this will never happen, though: your servers will not all be running flat out all the time. You can get a feel for usage by profiling your current servers to determine their exact requirements and then adding 25% in resources.
Check out this software for profiling utilization http://confluence.atlassian.com/display/JIRA/Profiling+Memory+and+CPU+usage+with+YourKit
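The 25% rule above is simple enough to write down as a one-line calculation. This is only a sketch of that arithmetic; the function name is made up, and the 25% default is this answer's rule of thumb rather than a vendor figure.

```python
def with_overhead(per_vm, vm_count, overhead=0.25):
    """Total resource needed for vm_count identical VMs plus a fractional overhead."""
    return per_vm * vm_count * (1 + overhead)


if __name__ == "__main__":
    # Four servers, each equivalent to a single-core 2 GHz machine with 2 GB RAM.
    print("CPU (GHz):", with_overhead(2.0, 4))
    print("RAM (GB): ", with_overhead(2.0, 4))
```

Feeding in profiled peak usage instead of the nominal per-server spec gives the "real world" variant of the same estimate.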
The requirements are in fact additive. You should add up the memory requirements for each VM, and the disk requirements, and have at least one processor core per VM. Then add on whatever you need for the host system.
VMs can share a CPU, to some extent, if you have really low performance requirements, but they cannot share disk space or memory.
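A minimal sketch of this additive approach: sum RAM and disk over the VMs, allow one core per VM, then add the host's own needs. The VM names and sizes are invented, and the single core set aside for the host is an assumption; the answer only says to add "whatever you need".

```python
from dataclasses import dataclass


@dataclass
class VmSpec:
    name: str
    ram_gb: float
    disk_gb: float


def host_requirements(vms, host_ram_gb, host_disk_gb, host_cores=1):
    """Additive sizing: sum per-VM RAM and disk, one core per VM, plus the host."""
    return {
        "ram_gb": sum(vm.ram_gb for vm in vms) + host_ram_gb,
        "disk_gb": sum(vm.disk_gb for vm in vms) + host_disk_gb,
        "cores": len(vms) + host_cores,  # host_cores is an assumed allowance
    }


if __name__ == "__main__":
    # Hypothetical development servers.
    vms = [VmSpec("web", 2, 40), VmSpec("db", 4, 80), VmSpec("build", 2, 40)]
    print(host_requirements(vms, host_ram_gb=2, host_disk_gb=20))
```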
The answers above are far too high; the second (1 core per VM) is closer. You can either 1) plan ahead and probably over-purchase, or 2) add capacity just in time. Do you have some reason that you must know well ahead (a yearly budget? your chosen host platform doesn't cluster hosts, so you can't add later?)
Unless you have an incredibly simple usage profile, it will be hard to predict beforehand and you'll over-purchase. The answer above (+25%) would be several times more than you need for modern server virtualization software (VMware, Xen, etc.) that manages resources smartly. It's accurate only for desktop products like VPC. I chose to rough it out on a napkin and profile my first environment (set of machines) on the host. I'm happy.
Examples of things that will confound your estimation
Disk space: Some systems (Lab Manager) use only the difference in space from the base template. 10 deployed machines with 10 GB drives use about 10 GB (template) + 200 MB.
Disk space: You'll then find you don't like the deltas in specific scenarios.
CPU / Memory: This is a dev shop, so you'll have erratic load. Smart hosts don't reserve memory and CPU.
CPU / Memory: But then you'll want to do perf testing, and want to reserve CPU cycles (not all hosts can do that).
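The disk-space point can be put in numbers with a small sketch. It compares full copies against template-plus-delta storage using the figures quoted above, and it treats the 200 MB as the combined delta across all ten machines, which is an interpretation rather than something stated. The function names are made up for illustration.

```python
def full_clone_disk_gb(vm_count, disk_gb):
    """Disk used when every deployed machine keeps a full copy of its drive."""
    return vm_count * disk_gb


def delta_clone_disk_gb(template_gb, total_delta_gb):
    """Disk used when machines store only their differences from one template."""
    return template_gb + total_delta_gb


if __name__ == "__main__":
    full = full_clone_disk_gb(vm_count=10, disk_gb=10)
    # 0.2 GB = the 200 MB figure, assumed here to be the total across machines.
    delta = delta_clone_disk_gb(template_gb=10, total_delta_gb=0.2)
    print(f"full copies: {full} GB, template + deltas: {delta:.1f} GB")
```

The gap between the two results is why disk needs are so hard to predict up front: the delta grows with how far each guest drifts from the template.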
We all virtualize for different reasons. Many of the guests in our environment don't have much work. We want them there to see how something behaves with a cluster of 3 servers of type X. Or, we have a bundle of weird client desktops waiting around, being used one at a time by a tester. They rarely consume many host resources.
So, if you are using something that doesn't do delta disks, disk space might be somewhat calculable. If you are using Lab Manager (delta disks), disk space is really hard to predict.
Memory and processor usage: You'll have to profile or over-purchase heavily. I have many more guest CPUs than host CPUs, and don't have perf problems - but that's because of the choppy usage in our QA environments.