Virtualize specific environment (CPU, cache, clock) - virtual-machine

I have written some code that's supposed to run on a certain hardware setup. I'd like to test it to get some preliminary metrics, but without buying the hardware, since it's very expensive.
At first I naively thought I could specify the platform details when creating a virtual machine through a manager such as VMware Workstation, but it seems that's not possible.
What ways do you believe would be best to emulate a certain environment? Of course, RAM, disk space and OS should be fairly easy, but limiting the CPU seems to be the general issue.
I'm trying to simulate the Intel Atom® Processor E3845, so I have requirements on the maximum number of cores, the cache size and, of course, the clock frequency.
The closest I've found so far would be to install VMware ESXi on a piece of hardware and limit the CPU, but I'm unsure whether this is the best way. Further, I've never really worked with this before, which is why I'm unsure whether I can limit the cache and so forth. Simply "down-scaling" the metrics does not feel like a good solution when we are rather dependent on the cache (that is, we've seen issues with certain sizes and speeds).
I would love to hear any input you might have.

Related

How would I simulate running code on different hardware while using only one machine?

As the title says, I'm looking to approximate the performance of a piece of code on different hardware setups. Are there any tools out there to do this?
I'm looking to run my code and perform measurements by limiting the resources available to the process. I would like to control things such as the total memory available as well as CPU usage, but it would be better if I had more granularity. Are there any tools out there that would allow me to emulate different speeds of RAM, rate-limit the CPU (to, say, X gigaflops), slow down disk reads, etc.?
I've already been looking at the setrlimit system call in Linux, but I don't think it will let me emulate things like latency. I considered using VMs to run the code and just tweaking the memory and CPU, but I'm not sure it's granular enough. I also considered hooking some of the syscalls and just spinning for x nanoseconds before allowing a read/write syscall, but that feels kind of clunky. The other issue is that this code primarily runs on Windows, and if possible it would be preferable to do this on Windows.
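For reference, here is a minimal sketch of what setrlimit actually gives you on Linux, and why it can't emulate latency: it caps totals such as CPU seconds or address space, but nothing in it slows the processor down.

    #include <sys/resource.h>
    #include <cstdio>

    int main() {
        rlimit cpu_limit = {10, 10};               // soft/hard cap: 10 s of total CPU time
        rlimit as_limit  = {1UL << 30, 1UL << 30}; // cap address space at 1 GiB
        if (setrlimit(RLIMIT_CPU, &cpu_limit) != 0) perror("RLIMIT_CPU");
        if (setrlimit(RLIMIT_AS,  &as_limit)  != 0) perror("RLIMIT_AS");
        // ... run the workload here; exceeding RLIMIT_CPU raises SIGXCPU,
        // but nothing above makes the CPU slower or memory higher-latency.
        return 0;
    }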
Just for some background, I'm trying to provide some reasonably accurate estimates of things like runtime and resource utilization on different hardware setups without having to actually buy, assemble, and test said hardware.
Thanks for any help you can provide.
If you wish to get very detailed control of every possible part of a machine, use a software-emulated machine such as Bochs. Bochs will emulate, in software, an x86 CPU, hard drive, video card, network card, everything.
In order to do what you want to do you would need to build your own version of Bochs with changes to the emulator to control the speed of the different pieces.

Why is RDTSC a virtualized instruction on modern processors?

I am studying RDTSC and learning about how it is virtualized for the purposes of virtual machines like VirtualBox and VMware. Why did Intel/AMD go to all the trouble of virtualizing this instruction?
I feel like it can be easily simulated with a trap, and it's not exactly a super-common instruction (I tested, and there's no noticeable slow-down for general usage in a virtual machine where hardware RDTSC virtualization is disabled).
However, I know Intel/AMD wouldn't have gone to all the trouble of adding this instruction to the virtualizing hardware unless it was important for it to be able to execute very fast.
Does anyone know why?
It's common to use RDTSC to get fine-grained timing information, where the overhead of a virtualization trap would be quite significant. The most common use is to have two RDTSC instructions with a small amount of code between them, taking the difference of the times as the elapsed time (number of cycles) for the code sequence. So even the overhead of pipeline drains/flushes is quite significant.
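A minimal sketch of that pattern, using the compiler intrinsic (x86intrin.h on GCC/Clang; MSVC has the same __rdtsc in intrin.h):

    #include <x86intrin.h>   // __rdtsc()
    #include <cstdio>

    int main() {
        unsigned long long start = __rdtsc();
        volatile int sink = 0;
        for (int i = 0; i < 1000; ++i) sink += i;  // short code sequence under test
        unsigned long long end = __rdtsc();
        // A trap per RDTSC would cost thousands of cycles -- far more than
        // the few hundred being measured between the two reads.
        std::printf("elapsed: %llu cycles\n", end - start);
        return 0;
    }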
Also, since all the instruction does is read a continuously running counter, virtualizing it is quite easy -- the hardware only needs to allow saving/reloading the counter value on VM context switches, and not anything special for the RDTSC instruction itself.
VMs should be able to have separate TSCs because they start up at different times. The physical CPU just has one, so something is needed to at least get individual, per-VM TSC offsets.
Also, since VMs don't fully own the underlying physical CPUs -- they don't get to execute on them all the time -- their TSCs should somehow reflect these "on/off" periods. At the same time, it is desirable that a VM's TSC doesn't change abruptly in value with respect to actual time (which the VMs should still get right from the host OS), because there's a lot of software that is virtualization-unready and can break when the numbers are too far off.
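Conceptually, that per-VM offset is about all the hardware support amounts to (for example, the TSC-offset field Intel VT-x keeps in the VMCS); the names below are illustrative only:

    #include <cstdint>

    // What the CPU effectively computes on each RDTSC inside a guest,
    // with no trap needed:
    uint64_t guest_tsc(uint64_t host_tsc, int64_t vm_tsc_offset) {
        return host_tsc + vm_tsc_offset;
    }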
I think these are the reasons why RDTSC is virtualized. But whatever you do, meeting conflicting requirements is tough, and they complicate matters. You can't hide virtualization and have VMs run at near-native speed at the same time. There are trade-offs, and some things have to give.

Virtual memory beyond page tables

I am working on a research project to develop an OS for a many-core (1000+) chip. We are looking into implementing a virtual-memory-style system for memory permissions (read/write/execute) that would allow memory to be safely shared across cores.
Basically, we want a system that would allow us to mark a 'page' as being readable by some subset of cores, writeable by another, etc. We are not going to be doing address translation (at least at this point), but we need a way to efficiently set and query permissions. It is going to be a software-filled data structure with a simple TLB-style cache.
Our intuition is that simply replicating page tables for each core will be too expensive (in terms of memory usage).
What data structures would be efficient for this kind of problem?
Thanks
Have you looked at how common multi-core (2-12 core) CPUs address this problem?
Do you know where/when/why/how the solution used in those common multi-core CPUs will not scale to 1,000+ cores?
In other words, can you quantify what's wrong with the existing solution, which is working, and has been working, in common CPUs with core counts <= 12?
If you know that, then the answer is closer than you think, because it just requires understanding how AMD/Intel solved the problem on a lesser scale, and what's needed to make their solution work on a greater one (maybe more memory for tables, algorithm tweaks, etc.).
Look at AMD's/Intel's data structures -- then build a software simulator for 1,000+ cores with those data structures, and see where/when/why and how your simulation fails -- if it fails...
Ideally, build your simulator with a user-selectable number of cores, then TEST, TEST, TEST with different numbers of cores -- working your way up, noting bottlenecks along the way.
Your simulator should work EXACTLY as well as AMD (if you're using AMD data structures) or Intel (if you're using Intel data structures) at the same core count as one of their chips, because that proves that THEY (AMD/Intel) are doing what they're doing correctly (because they are), and because it helps prove that your simulation program is doing its simulation correctly at a specific number of cores.
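As a starting point for such a simulator, here is a minimal sketch (all names and sizes are illustrative, not AMD's or Intel's actual structures) of the per-page record the question describes: one bit per core, per permission, behind a software-filled master table:

    #include <bitset>
    #include <cstdint>
    #include <unordered_map>

    constexpr int kCores = 1024;

    struct PagePermissions {
        std::bitset<kCores> read;     // which cores may read this page
        std::bitset<kCores> write;    // which cores may write it
        std::bitset<kCores> execute;  // which cores may execute from it
    };

    // Software-filled master table, keyed by page number; each core would
    // sit behind a small TLB-style cache of recently queried entries.
    std::unordered_map<uint64_t, PagePermissions> permission_table;

    bool core_may_write(uint64_t page, int core) {
        auto it = permission_table.find(page);
        return it != permission_table.end() && it->second.write.test(core);
    }

Note that at 1,024 cores this is already 384 bytes of masks per page, which makes the memory-cost intuition in the question easy to measure as you scale the core count up.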
Wishing you luck!

How To Simulate Lower CPU Processor Machines For Browser Testing

We have some users with lower-powered CPU machines who are encountering slow response times using our web application. Is there any way for me to do testing so that I can simulate lower CPU rates?
For example, I have 2.3 GHz of computing power; can I lower it to 1.6 GHz or lower so that I may be able to test it?
BTW, our customers are using Windows. I have to simulate low computing power in Internet Explorer as the browser.
The multiplier on most new CPUs can easily be lowered (Intel: SpeedStep, AMD: PowerNow!); this is used to save power. With RMclock you can manually adjust your multiplier and thus lower your frequency and make your PC slower. I use this tool myself, so I can tell you that it works.
http://cpu.rightmark.org/products/rmclock.shtml
The virtual machine Bochs (pronounced "boxes") allows you to set an instructions-per-second directive. It's probably the slowest emulator out there as it is, though...
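For reference, that directive lives in the bochsrc configuration file; the numbers here are illustrative:

    # bochsrc fragment: emulate a single CPU at roughly 10 million
    # instructions per second
    cpu: count=1, ips=10000000
    megs: 256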
Create some virtual machines. You can use VirtualPC or VirtualBox; both are free.
I would recommend starting something in the background which eats up all your processor cycles: a program which finds prime numbers, or something similar.
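For example, a trivial burner along these lines (run one instance per core you want to saturate):

    #include <cstdio>

    static bool is_prime(unsigned long long n) {
        if (n < 2) return false;
        for (unsigned long long d = 2; d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    int main() {
        unsigned long long found = 0;
        for (unsigned long long n = 2; ; ++n)        // spin forever, eating cycles
            if (is_prime(n) && ++found % 1000000 == 0)
                std::printf("%llu primes so far\n", found);
    }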
Another slight option, in addition to those above, is to boot Windows in a lower-resource configuration. Go to the Start menu, select Run, and type MSCONFIG. You can go to the Boot tab, click on Advanced Options, and limit the memory and number of processors. It's not as robust as the options above, but it does give you another choice.
Lowering the CPU clock doesn't always give expected results.
Newer CPUs feature architecture improvements which make them more efficient, on an equivalent-clock basis, than older chips. Incidentally, because of this, virtual machines are a bad way of testing performance for "older" tech as well.
Your best bet is to simply buy a couple of older machines, with similar RAM (type and amount), processor, motherboard chipset, hard drive, and video card to your clients'. All of these feed into the total performance of the machine itself.
I bring the other components up because changing just one of them can have an impact even on browser performance. A prime example is memory: if your clients are constrained to something like 512 MB of RAM, their machines could be doing a lot of hard drive access for virtual-memory swapping, even just running the browser. In that situation, downgrading the clock speed on your processor while still retaining your (assumed) 2 GB of RAM would not perform anywhere near the same, even if everything else were equal.
Isak Savo's answer works, but can be a bit finicky, as modern power management is going to try to limit CPU load as much as possible. When I tested it out, it was hard (though possible with some testing) to consistently get the kinds of CPU usage I wanted.
Then I remembered http://www.cpukiller.com/, which does this already. Highly recommended. As an aside, I found this utility while playing old '90s games on modern machines, back when frame rates were pegged to the CPU clock, making them way too fast on modern computers. Great utility.
Another big difference between high-performance and low-performance CPUs is the number of cores available. This can realistically differ by a factor of 4, way more than the difference in clock frequency you're likely to encounter.
You can solve this by setting the thread affinity. Even IE6 will use 13 threads just to show google.com. That means it will benefit from a multi-core CPU. But if you set the thread affinity to one core only, all 13 IE threads will have to share that one core.
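A minimal sketch of doing that programmatically on Windows (pass the browser's PID on the command line; mask bit 0 pins the process to CPU 0):

    #include <windows.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char* argv[]) {
        if (argc < 2) { std::fprintf(stderr, "usage: pin <pid>\n"); return 1; }
        HANDLE process = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION,
                                     FALSE, std::atoi(argv[1]));
        if (!process) { std::fprintf(stderr, "OpenProcess failed\n"); return 1; }
        // Affinity mask 1 = only bit 0 set, so every thread in the process
        // must share CPU 0.
        if (!SetProcessAffinityMask(process, 1))
            std::fprintf(stderr, "SetProcessAffinityMask failed\n");
        CloseHandle(process);
        return 0;
    }

Task Manager (Set Affinity in the process's context menu) does the same thing interactively.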
I understand that this question is pretty old, but here are some recipes I personally use (not only for web development):
BES. I'm getting some weird results while using it.
Go to Control Panel\All Control Panel Items\Power Options\Edit Plan Settings\Change Advanced Power Settings, then go to the "Processor" section and set its maximum state to 5% (or something else). This works only if your processor supports dynamic multiplier changes and the ACPI driver is installed correctly.
Run Task Manager and set the processor affinity of your browser's (or any other) process to a single core (or whatever number of cores you want). Not best practice for browsers, because JavaScript implementations are usually single-threaded, but, as far as I can see, modern browsers actually DO use multiple cores.
There are a few different methods to accomplish this.
If you're using VirtualBox, go into the Settings for the VM you want to slow the CPU speed for. Go to System > Processor, then set the Execution Cap. The percentage controls how slow it will go: lower values are slower relative to the regular speed. In practice, I've noticed the results to be choppy, although it does technically work.
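The same Execution Cap can be set from the command line with VBoxManage, if you'd rather script it (the VM name is a placeholder):

    VBoxManage modifyvm "My VM" --cpuexecutioncap 50    # VM powered off
    VBoxManage controlvm "My VM" cpuexecutioncap 50     # while it is running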
It is also possible to set the CPU speed for the whole system. In the Windows 10 Settings app, go to System > Power & Sleep. Then click Additional Power Settings on the right hand side. Go to Change Plan Settings for the currently selected plan, then click Change Advanced Power Plan Settings. Scroll down to Processor Power Management and set the Maximum Processor State. Again, this is a percentage. Although this does work, I find that in practice, it doesn't have a big impact even when the percentage is set very low.
If you're dealing with a videogame that uses DirectX or OpenGL and doesn't have a framerate cap, another common method is to force Vsync on in your graphics driver settings. This will usually slow the rendering to about 60 FPS which may be enough to play at a reasonable rate. However, it will only work for applications using 3D hardware rendering specifically.
Finally: if you'd rather not use a VM, and don't want to change a system global setting, but would rather simulate an old CPU for one specific process only, then I have my own program to do that called Old CPU Simulator.
The main brain of the operation is a command line tool written in C++, but there is also a GUI wrapper written in C#. The GUI requires .NET Framework 4.0. The default settings should be fine in most cases - just select the CPU you'd like to simulate under Target Rate, then hit New and browse for the program you'd like to run.
https://github.com/tomysshadow/OldCPUSimulator (click the Releases tab on the right for binaries.)
The concept is to suspend and resume the process at a precise rate, and because it happens so quickly the process will appear to just be running slowly. For example, by suspending a process for 3 milliseconds, then resuming it for 1 millisecond, it will appear to be running at 25% speed. By controlling the ratio of time suspended vs. time resumed, it is possible to simulate different speeds. This is completely API agnostic (it doesn't hook DirectX, OpenGL, etc. it'll work with a command line program if you want.)
Old CPU Simulator does not ask for a percentage, but rather, the clock speed to simulate (which it calls the Target Rate.) It then automatically determines, based on your CPU's real clock speed, the percentage to use. Although clock speed is not the only factor that has improved computer performance over time (there are also SSDs, faster GPUs, more RAM, multithreaded performance, etc.) it's a good enough approximation to get fairly consistent results across machines given the same Target Rate. It also supports other options that may help with consistency, such as setting the process affinity to one.
It implements three different methods of suspending and resuming a process and will use the best available: NtSuspendProcess, NtQuerySystemInformation, or Toolhelp Snapshots. It also uses timeBeginPeriod and timeEndPeriod to achieve high precision timing without busy looping. Note that this is not an emulator; the binary still runs natively. If you like, you can view the source to see how it's implemented - it's not a large project. On my machine, Old CPU Simulator uses less than 1% CPU and less than 1 MB of memory, so the program itself is quite efficient (unlike running intensive programs to intentionally slow the CPU.)
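A stripped-down sketch of the suspend/resume idea (this is not Old CPU Simulator's actual code; NtSuspendProcess and NtResumeProcess are undocumented ntdll exports, so they're resolved dynamically here):

    #include <windows.h>
    #include <cstdio>
    #include <cstdlib>
    // link with winmm.lib for timeBeginPeriod

    typedef LONG (NTAPI *NtProcessFn)(HANDLE);

    int main(int argc, char* argv[]) {
        if (argc < 2) { std::fprintf(stderr, "usage: throttle <pid>\n"); return 1; }
        HMODULE ntdll = GetModuleHandleA("ntdll.dll");
        NtProcessFn suspend = (NtProcessFn)GetProcAddress(ntdll, "NtSuspendProcess");
        NtProcessFn resume  = (NtProcessFn)GetProcAddress(ntdll, "NtResumeProcess");
        HANDLE process = OpenProcess(PROCESS_SUSPEND_RESUME, FALSE, std::atoi(argv[1]));
        if (!process || !suspend || !resume) return 1;
        timeBeginPeriod(1);              // 1 ms timer resolution, so short Sleeps are accurate
        for (;;) {
            suspend(process); Sleep(3);  // frozen for 3 ms...
            resume(process);  Sleep(1);  // ...running for 1 ms: roughly 25% speed
        }
    }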

Multi core programming

I want to get into multi core programming (not language specific) and wondered what hardware could be recommended for exploring this field.
My aim is to upgrade my existing desktop.
If at all possible, I would suggest getting a dual-socket machine, preferably with quad-core chips. You can certainly get a single-socket machine, but dual-socket would let you start seeing some of the effects of NUMA memory that are going to be exacerbated as the core counts get higher and higher.
Why do you care? There are two huge problems facing multi-core developers right now:
The programming model. Parallel programming is hard, and there is (currently) no getting around this. A quad-core system will let you start playing around with real concurrency and all of the popular paradigms (threads, UPC, MPI, OpenMP, etc).
Memory. Whenever you start having multiple threads, there is going to be contention for resources, and the memory wall is growing larger and larger. A recent article at arstechnica outlines some (very preliminary) research at Sandia that shows just how bad this might become if current trends continue. Multicore machines are going to have to keep everything fed, and this will require that people be intimately familiar with their memory system. Dual-socket adds NUMA to the mix (at least on AMD machines), which should get you started down this difficult road.
If you're interested in more info on performance inconsistencies with multi-socket machines, you might also check out this technical report on the subject.
Also, others have suggested getting a system with a CUDA-capable GPU, which I think is also a great way to get into multithreaded programming. It's lower level than the stuff I mentioned above, but throw one of those on your machine if you can. The new Portland Group compilers have provisional support for optimizing loops with CUDA, so you could play around with your GPU even if you don't want to learn CUDA yourself.
Quad-core, because it'll permit you to work on problems where the number of concurrent processes is greater than 2, which often turns otherwise simple problems into genuinely non-trivial ones.
I would also, for sheer geek squee, pick up a nice NVidia card and use the CUDA API. If you have the bucks, there's a stand-alone CUDA workstation that plugs into your main computer via a cable and an expansion slot.
It depends what you want to do.
If you want to learn the basics of multithreaded programming, then you can do that on your existing single-core PC. (If you have 2 threads, then the OS will switch between them on a single-core PC. Then when you move to a dual-core PC they should automatically run in parallel on separate cores, for a 2x speedup). This has the advantage of being free! The disadvantages are that you won't see a speedup (in fact a parallel implementation is probably slightly slower due to overheads), and that buggy code has a slightly higher chance of working.
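For instance, a two-thread program like this (modern C++ shown for brevity) runs interleaved on one core and in parallel on two, with no source changes:

    #include <thread>
    #include <vector>
    #include <numeric>
    #include <cstdio>

    int main() {
        std::vector<int> data(1000000, 1);
        long long lo = 0, hi = 0;
        // Each thread sums one half of the array.
        std::thread a([&] { lo = std::accumulate(data.begin(), data.begin() + 500000, 0LL); });
        std::thread b([&] { hi = std::accumulate(data.begin() + 500000, data.end(), 0LL); });
        a.join();
        b.join();
        std::printf("sum = %lld\n", lo + hi);
        return 0;
    }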
However, although you can learn multithreaded programming on a single-core box, a dual-core (or even HyperThreading) CPU would be a great help.
If you want to really stress-test the code you're writing, then as "blue tuxedo" says, you should go for as many cores as you can easily afford, and if possible get hyperthreading too.
If you want to learn about algorithms for running on graphics cards -- which is a very different area from x86 multicore -- then get CUDA and buy a normal nVidia graphics card that supports it.
I'd recommend at least a quad-core processor.
You could try tinkering with CUDA. It's free, not that hard to use and will run on any recent NVIDIA card.
Alternatively, you could get a PlayStation 3 and the Linux SDK and work out how to program a Cell processor. Note that the next cheapest option for Cell BE development is an order of magnitude more expensive than a PS3.
Finally, any modern motherboard that will take a Core Quad or quad-core Opteron (get a good one from Asus or some other reputable manufacturer) will let you experiment with a multi-core PC system for a reasonable sum of money.
The difficult thing with multithreaded/core programming is that it opens a whole new can of worms. The bugs you'll be faced with are usually not the one you're used to. Race conditions can remain dormant for ages until they bite and your mainstream language compiler won't assist you in any way. You'll get random data and/or crashes that only happen once a day/week/month/year, usually under the most mysterious conditions...
One thing remains true, fortunately: the higher the concurrency exhibited by a computer, the more race conditions you'll unveil.
So if you're serious about multithreaded/core programming, then go for as many CPU cores as possible. Keep in mind that neither hyperthreading nor SMT allows for the level of concurrency that multiple cores provide.
I would agree that, depending on what you ultimately want to do, you can probably get by with just your current single-core system. Multi-core programming is basically multi-threaded programming, and you can certainly do that on a single-core chip.
When I was a student, one of our projects was to build a thread-safe implementation of the malloc library for C. Even on a single-core processor, that was more than enough to cure me of my desire to get into multi-threaded programming. I would try something small like that before you start thinking about spending lots of money.
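Reduced to its simplest (and slowest) possible form, that exercise is just serializing every heap operation behind one lock; real allocators use per-arena locks or lock-free free lists, and getting from here to there is exactly where the multi-threaded pain starts:

    #include <mutex>
    #include <cstddef>

    // Toy fixed-pool bump allocator made thread-safe with one global lock.
    static unsigned char g_pool[1 << 20];   // 1 MiB arena
    static std::size_t g_next = 0;
    static std::mutex g_heap_lock;

    void* ts_malloc(std::size_t size) {
        std::lock_guard<std::mutex> guard(g_heap_lock);  // serialize allocation
        size = (size + 15) & ~std::size_t(15);           // 16-byte alignment
        if (g_next + size > sizeof(g_pool)) return nullptr;
        void* p = g_pool + g_next;
        g_next += size;
        return p;
    }

    void ts_free(void*) { /* bump allocators never reclaim; a real one would */ }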
I agree with the others where I would upgrade to a quad-core processor. I am also a BIG FAN of ASUS Motherboards (the P5Q Pro is excellent for Core2Quad and Core2Duo processors)!
The draw for multi-core programming is that you have more resources to get things done faster. If you are serious about multi-core programming, then I would absolutely get a quad-core processor. I don't believe that you should get the new i7 architecture from Intel to take advantage of multi-core processing because anything written to take advantage of the Core2Duo or Core2Quad will just run better on the newer architecture.
If you are going to dabble in multi-core programming, then I would get a good Core2Duo processor. Remember, it's not just how many cores you have, but also how FAST the cores are to process the jobs. My Core2Duo running at 4GHz routinely completes jobs faster than my Core2Quad running at 2.4GHz even with a multi-core program.
Let me know if this helps!
JFV