What makes a modern commodity cluster? - hardware

What would be the most cost-effective way of implementing a terabyte distributed memory cache using commodity hardware these days? And what would class as a piece of commodity hardware?

Commodity hardware is generally considered hardware that:
Is off the shelf (nothing custom)
Is available in substantially similar versions from many manufacturers
There are many motherboards that can hold 8 or 16 GB of RAM. Fewer server motherboards can hold 32 GB or even 64 GB.
But they still fit the definition of commodity, and can therefore be assembled into very large clusters, for a correspondingly large sum of money.
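As a rough sketch of what that implies for the terabyte cache in the question (the 80% usable-RAM figure is an assumption for OS and cache-daemon overhead, not something from the original post):

    # Back-of-envelope node count for a ~1 TiB distributed memory cache.
    import math

    TARGET_BYTES = 1024**4              # 1 TiB of cached data
    USABLE_FRACTION = 0.8               # assumed: OS + cache daemon overhead

    for node_ram_gib in (16, 32, 64):
        usable = node_ram_gib * 1024**3 * USABLE_FRACTION
        nodes = math.ceil(TARGET_BYTES / usable)
        print(f"{node_ram_gib:3d} GiB/node -> about {nodes} nodes")

With those assumptions you land around 80 nodes at 16 GiB each, 40 at 32 GiB, or 20 at 64 GiB, which is why the larger (but still commodity) boards tend to win on total cost.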
Note, however, that for many access patterns a striped RAID HDD array isn't much slower than a gigabit Ethernet link, so a RAM cluster might not bring a significant improvement (except in latency) depending on how you're actually using it.
-Adam
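To put rough numbers on the RAID-versus-gigabit point above (the per-disk sequential rate is an assumption, not a measured figure):

    # Rough comparison: ideal striped HDD throughput vs. a gigabit link.
    GIGABIT_MB_S = 1000 / 8            # ~125 MB/s of raw line rate
    DISK_MB_S = 100                    # assumed sequential rate per spinning disk

    for disks in (1, 2, 4):
        raid0 = disks * DISK_MB_S      # ideal RAID-0 striping, sequential access
        print(f"{disks} striped disks: ~{raid0} MB/s vs gigabit link: ~{GIGABIT_MB_S:.0f} MB/s")

Even two striped disks can saturate the network link for sequential work, so the RAM cluster's win is mostly latency and random access.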

Related

Performance Counter for DRAM Per-Rank Memory Access

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. I need to retrieve the number of accesses to each DRAM rank, over time, to estimate its power consumption. Based on page 261 of the chipset documentation (i.e., Datasheet, volume 2 (M- and H-processor lines)), I could use the 32-bit value in the DRAM_ENERGY_STATUS register as a DRAM energy estimation. But I need rank-level energy estimates. I could also use core and offcore DRAM access performance counters to estimate power consumption, but, as mentioned before, I need per-rank statistics. Besides that, they report whole-system stats, while energy is calculated per rank. They also fail to capture many DRAM accesses.
Therefore, IMC counters (which are uncore counters) should be the ideal choice. Perf does not support per-rank counters. I tried to use PCM-Memory to access IMC counter information, but /sys/bus/event_source/devices/uncore_imc is not mounted by the kernel (the version is 5.0.0-37-generic) and the tool does not detect the CPU. I tried to access the uncore performance counters manually. Whole-system DRAM access counters are documented here (they were not documented in the above-mentioned chipset manual). I can retrieve total DRAM read and write accesses using these counters, but there is no information about channel- or rank-level access stats. How can I find the offsets associated with these counters? Should I use trial and error?
P.S.: This question is also asked at Intel Software Tuning, Performance Optimization & Platform Monitoring Forum.
The MSR_DRAM_ENERGY_STATUS register always reports an estimate of the energy consumed by all memory channels. There is no easy way to break it down into per-rank energy. This register reports a highly accurate estimate on Haswell.
The 5.0.0-37-generic kernel is an Ubuntu kernel and does support the uncore_imc/data_reads/ and uncore_imc/data_writes/ events on Haswell, which represent a data read CAS command and a data write CAS command from the IMC, respectively. A full cache-line read and a full cache-line write each cause a single burst 64-byte transaction on the memory bus to a single rank. A partial read is also executed as a single full-line read on the bus, but a partial write may require a full-line read followed by a full-line write due to restrictions in the protocol. Partial writes are generally negligible.
The uncore_imc/data_reads/ and uncore_imc/data_writes/ events occur for requests targeting DRAM memory generated by any unit, not just cores. These names are given by perf and they correspond to UNC_IMC_DRAM_DATA_READS and UNC_IMC_DRAM_DATA_WRITES, respectively, which are mentioned in the Intel article you've cited. The other three events mentioned there allow you to count requests (not CAS commands!) for each of the three possible sources separately (GT, IA, and IO). You won't find them listed under /sys/bus/event_source/devices/uncore_imc/events on your old kernel; they are supported in perf starting with mainline kernel v5.9-rc2.
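As a side note, a minimal sketch of how those two counts translate into aggregate bandwidth, using the 64 bytes per CAS command mentioned above; the counter values below and the perf invocation in the comment are placeholders, not measurements:

    # Turning uncore_imc/data_reads/ and uncore_imc/data_writes/ counts into
    # bandwidth. Collect real counts with something like:
    #   perf stat -a -e uncore_imc/data_reads/,uncore_imc/data_writes/ -- sleep 10
    CACHE_LINE = 64                     # bytes moved per counted CAS command

    def dram_bandwidth_mb_s(data_reads, data_writes, interval_s):
        """Approximate total DRAM bandwidth over all channels."""
        return (data_reads + data_writes) * CACHE_LINE / interval_s / 1e6

    # Example with made-up counter values over a 10-second window:
    print(f"{dram_bandwidth_mb_s(150_000_000, 40_000_000, 10.0):.1f} MB/s")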
By the way, PCM does support these events as well, and uses them to report read and write bandwidth over all channels, but you should use the tool pcm.x, not pcm-memory.x, which only works on server processors.
A Haswell H-processor line processor has a single on-die memory controller with two DDR3L 64-bit channels. Each channel can contain zero, one, or two DIMMs, with a total capacity of up to 32 GB over all channels. Moreover, each DIMM can contain up to two ranks, so a single channel can contain anywhere between zero and four ranks. The i7-4720HQ is a high-end mobile processor. You're probably on a laptop with 8 GB or 16 GB of memory. If the memory topology has not changed since purchase, it probably has only two 4 GB or 8 GB DIMMs, one in each channel, with one remaining free slot per channel available for expansion if desired by the user. This means that there are either one or two ranks per channel.
You can approximate the number of accesses to each rank given the knowledge of how physical addresses are mapped to ranks. If each channel is populated with a single rank DIMM of the same capacity, the mapping is simple on your processor. Bit 6 of the physical address (i.e., the seventh bit) determines which channel, and therefore which rank, a request is mapped to. You can collect a set of samples of physical addresses of requests at the IMC by running perf record on MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM with the --phys-data option. Obviously this set of samples may only be representative of core-originated retired loads that reach the IMC, which are a small subset of all requests at the IMC.
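A minimal sketch of that bit-6 classification, assuming the single-rank-DIMM-per-channel topology described above; the sampled physical addresses are assumed to come from the perf record --phys-data output, and the parsing of that output is not shown:

    # With one single-rank DIMM per channel, bit 6 of the physical address
    # selects the channel (and therefore the rank) a request maps to.
    from collections import Counter

    def rank_of(phys_addr: int) -> int:
        # Bit 6 (the seventh bit) selects channel 0 or channel 1.
        return (phys_addr >> 6) & 1

    def per_rank_counts(phys_addrs):
        counts = Counter(rank_of(a) for a in phys_addrs)
        return counts[0], counts[1]

    # Example with a handful of made-up sampled addresses:
    samples = [0x1234_5000, 0x1234_5040, 0x1234_5080, 0x1234_50C0]
    print(per_rank_counts(samples))   # -> (2, 2)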
It appears to me that you want to measure the number of memory accesses per rank in order to estimate the per-rank energy from the total DRAM energy, but this is not trivial at all, for the following reasons:
Not all CAS commands are of the same energy cost. Precharge and activate commands are not counted by any event and may consume significant energy, especially with high row buffer miss rates.
Even if there are zero requests in the IMC, as long as there is at least one active core, the memory channels are powered and do consume energy.
The amount of time it takes to process a request of the same type and to the same address may vary depending on surrounding requests, due to timing delays required by rank-to-rank turnarounds and read-write switching.
Despite all of that, I imagine it may be possible to build a good model of upper and lower bounds on per-rank energy given a representative estimate of the number of requests to each rank (as discussed above).
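For illustration, a deliberately naive version of such a model, which simply apportions the total DRAM energy by per-rank request counts; the caveats above are exactly what it ignores, so treat the result as a rough bound rather than a measurement:

    # Naive apportionment of package-level DRAM energy by per-rank request share.
    # Ignores activate/precharge energy, background power, and timing effects.
    def apportion_energy(total_dram_joules, rank_counts):
        total = sum(rank_counts)
        if total == 0:
            return [0.0 for _ in rank_counts]
        return [total_dram_joules * c / total for c in rank_counts]

    # Example: 12.5 J over the interval, ~70%/30% request split between two ranks.
    print(apportion_energy(12.5, [700_000, 300_000]))   # -> [8.75, 3.75]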
The bottom line is that there is no easy way to get the luxury of per-rank counting like on server processors.

Which hardware to choose in Neo4j

I'm a beginner with Neo4j and I would like to store more than 500 million nodes and more than 20 billion relationships.
Which hardware is best to deal with all this data?
Thanks a lot.
Maxime
Neo4j does not restrict users to particular hardware specifications. However, it recommends minimum specifications for RAM, CPU, and disk, which are as follows:
RAM:
Must have at least 2 GB
Good to have around 16 GB
CPU:
Must have an Intel Core i3 processor
Good to have an Intel Core i7 processor
Disk:
Must have SATA drives with 15k RPM
Good to have SSDs
Also have a look at these: Neo4j : Advices for hardware sizing and config and https://neo4j.com/developer/guide-sizing-and-hardware-calculator/
Just for general recommendations, the top two things to look for are plenty of memory and fast SSDs (especially for larger graphs).
Neo4j has a pagecache for caching the node and relationship graph topology, and the more of it you can fit into the pagecache the better. We typically recommend between 8 and 31 GB of heap in addition to the pagecache, depending on the volume and kind of queries you expect to run.
SSDs help with Neo4j's index-free adjacency structure, as traversal involves pointer chasing across the disk. This matters mostly when you can't fit all of the graph in the pagecache, but it also speeds up lookups of node and relationship properties.
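To get a feel for the numbers in this particular question, a rough sizing sketch; the per-record sizes below are the approximate fixed store record sizes Neo4j has used (check them against your version), and indexes and transaction logs are ignored:

    # Rough store-size estimate for 500M nodes / 20B relationships.
    NODE_RECORD = 15         # bytes, approximate
    REL_RECORD = 34          # bytes, approximate
    PROP_RECORD = 41         # bytes, approximate

    def store_size_gib(nodes, rels, props_per_entity=1.0):
        props = (nodes + rels) * props_per_entity
        total = nodes * NODE_RECORD + rels * REL_RECORD + props * PROP_RECORD
        return total / 1024**3

    size = store_size_gib(500_000_000, 20_000_000_000)
    print(f"~{size:.0f} GiB of store files; pagecache should ideally approach this")

At 20 billion relationships the store alone comes out to well over a terabyte, so fitting everything in the pagecache is unrealistic on commodity hardware, which is why fast SSDs matter so much here.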

Designing a Computer for Spatially-Explicit Modeling in NetLogo

I have done various searches and have yet to find a forum or article that discusses how to approach building a modeling computer for use with NetLogo. I was hoping to start such a discussion. Since the memory usage of NetLogo is proportional to the size of the world and the number of simulations run in parallel with BehaviorSpace, it seems reasonable that a formula exists relating sufficient hardware to NetLogo's demands.
As an example, I am planning to run a metapopulation model in a landscape approximately 12 km x 12 km, corresponding to a NetLogo world of 12,000x12,000 at a pixel size of 1, for a 1-meter resolution (relevant for the animal's movement behavior). An earlier post described a large world (How to model a very large world in NetLogo?) and provided a discussion of potential ways to reduce the need for large worlds (http://netlogo-users.18673.x6.nabble.com/Re-Rumors-of-Relogo-td4869241.html#a4869247). Another post described a world of 3147x5141 running on a Linux computer with 64 GB of RAM (http://netlogo-users.18673.x6.nabble.com/Java-OutofMemory-errors-for-large-NetLogo-worlds-suggestions-requested-td5002799.html). Clearly, the capability of computers to run large NetLogo worlds is becoming increasingly important.
Presumably, the "best" solution for researchers at universities with access to Windows-based machines would be to run 16 GB to 64 GB of RAM with a six- or eight-core processor, such as an Intel Xeon capable of hyperthreading, for running multiple simulations in parallel with BehaviorSpace. As an example, I used SELES (Fall & Fall 2001) on a machine with a 6-core Xeon processor with hyperthreading enabled and 8 GB of RAM to run 12,000 replicates of a model with a 1-meter-resolution raster map of 1580x1580. This used the computer to its full capacity, and it took about a month to run the simulations.
So, if I were to run 12,000 replicates of a 12,000x12,000 world in NetLogo, what would be the "best" option for a computer? Without reaching for the latest and greatest processing power out there, I would presume the most fiscally reasonable option to be a server board with dual Xeon processors (likely 8-core Ivy Bridge) and 64 GB of RAM. Would this be a sufficient design, or are there alternatives that are cheaper (or not) for modeling at this scale? Additionally, do there exist "guidelines" for processor/RAM combinations to cope with the increasing demands of NetLogo on memory as the size of worlds and the number of parallel simulations increase?
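One way to put rough numbers on the memory side of this question, treating the per-patch cost as an assumption to calibrate against a small test world rather than a NetLogo constant:

    # Rough memory estimate for a 12,000 x 12,000 world under an assumed
    # per-patch footprint. The real cost depends on how many patches-own
    # variables the model defines and on JVM overhead.
    BYTES_PER_PATCH = 400          # assumed: base patch object + a few variables

    def world_gib(width, height, parallel_runs=1, bytes_per_patch=BYTES_PER_PATCH):
        return width * height * bytes_per_patch * parallel_runs / 1024**3

    one_run = world_gib(12_000, 12_000)
    print(f"one run: ~{one_run:.0f} GiB; "
          f"12 parallel BehaviorSpace runs: ~{world_gib(12_000, 12_000, 12):.0f} GiB")

Under that assumed footprint even a single 12,000x12,000 run approaches the 64 GB figure, which argues for benchmarking a scaled-down world first and extrapolating before committing to hardware.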

Which kinds of low level facilities aren't typically supported on multi-core machines?

I'm looking at some optimized, low level, cross platform, concurrency code designed to run on multi-core machines, and want to check some of its assumptions.
Hardware optimizations of some kinds probably aren't supported on multi-core designs (for example, out-of-order execution [wikipedia] seems like a good candidate: it takes a lot of die area to implement and can be a power hog). Does anyone have a list of other such facilities, ones typically available on machines with one or a small number of cores, but typically left out of machines with a larger number of cores?
Today, multicore machines are warmed-over die shrinks of uniprocessors. You could almost imagine sawing a 4-core die into 4 1-core dice. I exaggerate only a little bit.
In the future, multicore machines will be more thoughtfully designed for energy efficiency and area efficiency. You may see the same ISA, but with different mixes of resources (more or fewer duplicated functional units), and even with some sharing of resources between cores (e.g. AMD Bulldozer). And, as you say, backing off from the complexity and energy overhead of no-holds-barred out-of-order execution. This will most likely show up as instructions-per-clock (IPC) differences (more or less performance) on the same instruction set architecture.
Also, as vendors have to juggle a hypothetical portfolio of big out-of-order serial-performance-optimized cores and small in-order or less-out-of-order (OoO), narrower, more energy-efficient "throughput" cores, they will be challenged to keep these different implementations in sync with the evolution of their ISAs. Some cores may support new instructions, new state, new coprocessors, virtualization, security, etc. earlier than others. This leads to the challenge of coding to the common denominator while also lighting up the new facilities for better performance or energy efficiency (or whatever) on those cores that have the new capabilities.
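As a small illustration of that common-denominator problem (a sketch, assuming a Linux system where /proc/cpuinfo exposes per-CPU feature flags):

    # Compare ISA feature flags across logical CPUs. On a heterogeneous part
    # the per-core flag sets need not be identical; dispatch code would use
    # the intersection by default and enable extras only where present.
    def per_cpu_flags(path="/proc/cpuinfo"):
        cpus, flags = [], None
        with open(path) as f:
            for line in f:
                if line.startswith("processor"):
                    flags = set()
                    cpus.append(flags)
                elif line.startswith("flags") and flags is not None:
                    flags.update(line.split(":", 1)[1].split())
        return cpus

    cpus = per_cpu_flags()
    common = set.intersection(*cpus) if cpus else set()
    for i, flagset in enumerate(cpus):
        extra = flagset - common
        if extra:
            print(f"cpu{i} extra features: {sorted(extra)}")
    print(f"{len(common)} features common to all {len(cpus)} logical CPUs")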
So to answer your specific question, all the traditional computer architecture techniques for trading gates for expressive power, performance, or energy efficiency may be rethought and selectively removed in future small throughput-oriented cores:
Hardware multithreading
Aggressive OoO -> humble OoO or even in-order execution
High degrees of microarchitectural speculation
Fancy branch predictors
Big TLBs
Fancy memory prefetchers
Deep pipelines
Wide issue / many copies of functional units
Big caches, wide buses to caches
...
But it goes both ways. It may also be that the new small throughput-optimized energy-optimized cores have new features not present in the older OoO cores. For example, the Larrabee New Instructions (LRBni) (http://www.drdobbs.com/high-performance-computing/216402188) were proposed for a machine with dozens of simpler cores. As another example, the small cores may turn to hardware multithreading to afford better memory latency tolerance to compensate for smaller private caches.
Also, having lots of small energy frugal cores means you may be willing to dedicate and therefore customize some of the cores to optimize performance for particular valuable workloads. For example, the Tensilica custom processors and tools anticipate that some of your small cores will have additional instructions and custom problem-specific datapaths (accelerating an inner loop of video decoding, for example). So in these cases the little core may (counter-intuitively) have much better performance than the much larger core.
Makes sense?
Happy hacking!

Hadoop cluster. 2 Fast, 4 Medium, 8 slower machines?

We're going to purchase some new hardware to use just for a Hadoop cluster, and we're stuck on what we should buy. Say we have a budget of $5k: should we buy two super nice machines at $2500 each, four at around $1200 each, or eight at around $600 each? Will Hadoop work better with more, slower machines or fewer, much faster machines? Or, as with most things, "it depends"? :-)
You're generally better off with Hadoop getting a few extra machines that are less beefy. You almost never see datanodes with more than 16GB ram and dual quad-core CPUs, and often they are smaller than that.
You always have to run one as the namenode (master), and generally you don't also run a datanode (worker/slave) on the same box, although you could since your cluster is small. Assuming you don't, though, getting 2 machines will leave you only 1 worker node, which somewhat defeats the purpose. (Not entirely, because you can still run 4-8 jobs in parallel on the slave, but still.)
At the same time, you don't want a cluster of 1,000 486s. If your budget is $5k, I would strike a balance and get four $1200 machines. Those will provide a decent baseline in terms of individual performance, you'll have 3 datanodes to distribute work to, and you'll have room to grow your cluster if you need to.
Things to keep in mind: you'll want to run multiple map or reduce tasks per datanode, and that means multiple JVMs running simultaneously. I would try to get at least 4 GB, and preferably 8 GB, of RAM. CPU is less important, as most MR jobs are IO-bound. You could likely get a machine like this for your $1200 price target, so that's my vote.
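As a rough sketch of how that RAM advice translates into concurrent task slots on an MRv1-style datanode; the per-task heap and the memory reserved for the daemons and OS are assumptions here, not Hadoop defaults:

    # Cap concurrent map/reduce tasks by both cores and available RAM,
    # since each task runs in its own JVM.
    def max_task_slots(cores, ram_gib, heap_per_task_gib=1.0, reserved_gib=2.0):
        by_cpu = int(cores * 1.5)                    # common rule of thumb
        by_ram = int((ram_gib - reserved_gib) / heap_per_task_gib)
        return max(0, min(by_cpu, by_ram))

    for ram in (4, 8, 16):
        print(f"{ram:2d} GiB, 4 cores -> ~{max_task_slots(4, ram)} task slots")

With these assumptions, 4 GB leaves you only a couple of task slots while 8 GB lets the cores become the limit, which is the reasoning behind preferring 8 GB.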
In a nutshell, you want to max out the number of processor cores and disks. You can sacrifice reliability and quality, but don't get the cheapest hardware out there, as you will have too many reliability problems.
We went with Dell servers with two 4-core CPUs, so 8 cores per box. 16 GB of memory per box, which is 2 GB per core, is a bit low, as you need memory both for your tasks and for disk buffering. We used 5x500 GB hard drives, and I wish we'd gone for terabyte or larger drives instead.
For drives, my opinion is to buy more cheap, slow, unreliable, high-capacity drives as opposed to more expensive, faster, smaller, reliable drives. If you're having problems with disk throughput, more memory will help with buffering.
This is probably a beefier configuration than you're looking at, but maxing out cores and drives versus buying more boxes is generally a good choice: lower power costs, easier to administer, and faster for some operations.
More drives means more simultaneous disk throughput per core, so having as many drives as cores is a good thing. Benchmarking seems to indicate that RAID configurations are slower than JBOD configuration (just mounting the drives and having Hadoop spread load across them) and JBOD is also more reliable.
LAST! Be sure to get ECC memory. Hadoop pushes terabytes of data through memory, and some users have found that non-ECC memory configurations can occasionally introduce single bit errors in terabyte-sized datasets. Debugging these errors is a nightmare.
I recommend having a look at this presentation: http://www.cloudera.com/hadoop-training-thinking-at-scale
The various pros and cons are described there.
I think the answer also depends on your expectations for cluster growth and the networking technology you are using. If you are OK with 1 Gbit Ethernet, then the type of machine is less significant. At the same time, if you want 10 Gbit Ethernet, you should opt for a smaller number of better machines to reduce the cost of networking.
Another reference: http://hadoopilluminated.com/hadoop_book/Hardware_Software.html
(Disclaimer: I am a co-author of this free Hadoop book.)