How To Simulate Lower CPU Processor Machines For Browser Testing

We have some users who are on lower-powered CPUs, and they're encountering slow response times in our web application. Is there any way for me to simulate a slower CPU for testing?
For example, I have a 2.3 GHz processor. Can I lower it to 1.6 GHz or less so that I can test under those conditions?
BTW, our customers are using Windows. I have to simulate low computing power with Internet Explorer as the browser.

On most new CPUs the multiplier can easily be lowered (Intel: SpeedStep, AMD: PowerNow!). This is normally used to save power. With RMClock you can manually adjust the multiplier, which lowers the frequency and makes your PC slower. I use this tool myself, so I can tell you that it works.
http://cpu.rightmark.org/products/rmclock.shtml

The virtual machine Bochs (pronounced "boxes") allows you to set an instructions-per-second directive. It's probably the slowest emulator out there as it is, though...

Create some virtual machines.
You can use Virtual PC or VirtualBox; both are free.

I would recommend starting something in the background that eats up all your processor cycles.
A program that finds prime numbers, or something similar.
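A minimal sketch of such a background load, assuming Python is available on the test machine. The names `count_primes`, `burn` and `start_load` are made up for illustration, and the load runs for a fixed number of seconds so it cannot be left running by accident:

```python
import multiprocessing
import time


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def burn(seconds):
    """Keep one core busy for roughly `seconds` seconds."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        count_primes(2000)


def start_load(workers, seconds):
    """Start `workers` busy processes and return the Process objects."""
    procs = [
        multiprocessing.Process(target=burn, args=(seconds,))
        for _ in range(workers)
    ]
    for proc in procs:
        proc.start()
    return procs
```

On Windows, call `start_load(multiprocessing.cpu_count(), 60)` from under an `if __name__ == "__main__":` guard, then run the browser test while the load is active. This steals cycles from the browser rather than slowing the clock, so treat the result as a rough approximation.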

Another option, in addition to those above, is to boot Windows in a lower-resource configuration. Go to the Start menu, select Run and type MSCONFIG. On the Boot tab, click Advanced options and limit the memory and the number of processors. It's not as robust as the above, but it does give you another option.

Lowering the CPU clock doesn't always give expected results.
Newer CPUs feature architecture improvements which make them more efficient at an equivalent clock speed than older chips. Incidentally, because of this, virtual machines are a bad way of testing performance for "older" tech as well.
Your best bet is to simply buy a couple of older machines with similar RAM (types and amounts), processors, motherboard chipsets, hard drives, and video cards, all of which feed into the total performance of the machine itself.
I bring the other components up because changing just one of them can have an impact on even browser performance. A prime example is memory. If your clients are constrained to something like 512MB of RAM, the machines could be performing a lot of hard drive access for VM swaps, even for just running the browser. In this situation downgrading the clock speed on your processor while still retaining your 2GB (assuming) of RAM would still not perform anywhere near the same even if everything else was equal.

Isak Savo's answer works, but can be a bit finicky, as the modern TPL is going to try to limit CPU load as much as possible. When I tested it out, it was hard (though possible with some testing) to consistently get the CPU usage I wanted.
Then I remembered http://www.cpukiller.com/, which does this already. Highly recommended. As an aside, I found this utility while playing old 90s games on modern machines. Their frame rate was pegged to the CPU clock, which makes them run way too fast on modern computers. Great utility.

Another big difference between high-performance and low-performance CPUs is the number of cores available. This can realistically differ by a factor of 4, way more than the difference in clock frequency you're likely to encounter.
You can solve this by setting the thread affinity. Even IE6 will use 13 threads just to show google.com. That means it will benefit from a multi-core CPU. But if you set the thread affinity to one core only, all 13 IE threads will have to share that one core.
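Task Manager can set the affinity by hand. For scripted tests, here is a rough sketch using the Win32 `SetProcessAffinityMask` call through `ctypes`. The helper names `affinity_mask` and `pin_process` are hypothetical, and error handling is kept minimal:

```python
import ctypes
import sys


def affinity_mask(cores):
    """Build the bit mask Windows expects: bit n set means core n is allowed."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def pin_process(pid, cores):
    """Restrict process `pid` to the given cores (Windows only)."""
    if sys.platform != "win32":
        raise OSError("SetProcessAffinityMask is a Windows API")
    PROCESS_SET_INFORMATION = 0x0200
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.OpenProcess.restype = ctypes.c_void_p
    kernel32.SetProcessAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    kernel32.CloseHandle.argtypes = [ctypes.c_void_p]
    handle = kernel32.OpenProcess(PROCESS_SET_INFORMATION, False, pid)
    if not handle:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        if not kernel32.SetProcessAffinityMask(handle, affinity_mask(cores)):
            raise ctypes.WinError(ctypes.get_last_error())
    finally:
        kernel32.CloseHandle(handle)
```

For example, `pin_process(browser_pid, [0])` would force every thread of that process onto core 0, where `browser_pid` is the process ID shown in Task Manager.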

I understand that this question is pretty old, but here are some recipes I personally use (not only for Web development):
BES. I'm getting some weird results while using it.
Go to Control Panel\All Control Panel Items\Power Options\Edit Plan Settings\Change Advanced Power Settings, then go to the "Processor" section and set its maximum state to 5% (or some other value). This works only if your processor supports dynamic multiplier changes and the ACPI driver is installed correctly.
Run Task Manager and set processor affinity to a single core (or whatever number of cores you want) for your browser's (or any other's) process. Not a best practice for browsers, because JavaScript implementations are usually single-threaded, but, as far as I see, modern browsers actually DO use multiple cores.

There are a few different methods to accomplish this.
If you're using VirtualBox, go into the Settings for the VM you want to slow the CPU speed for. Go to System > Processor, then set the Execution Cap. The percentage controls how slow it will go: lower values are slower relative to the regular speed. In practice, I've noticed the results to be choppy, although it does technically work.
It is also possible to set the CPU speed for the whole system. In the Windows 10 Settings app, go to System > Power & Sleep. Then click Additional Power Settings on the right hand side. Go to Change Plan Settings for the currently selected plan, then click Change Advanced Power Plan Settings. Scroll down to Processor Power Management and set the Maximum Processor State. Again, this is a percentage. Although this does work, I find that in practice, it doesn't have a big impact even when the percentage is set very low.
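The same setting can be changed from a script instead of the Settings app. A minimal sketch, assuming the `powercfg` aliases `SCHEME_CURRENT`, `SUB_PROCESSOR` and `PROCTHROTTLEMAX` are available (run `powercfg /aliases` to confirm on your system). The function names are made up for illustration:

```python
import subprocess


def max_processor_state_commands(percent):
    """Return the powercfg commands that cap the Maximum Processor State."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    value = str(percent)
    return [
        # Plugged-in (AC) value of the active plan.
        ["powercfg", "/setacvalueindex", "SCHEME_CURRENT",
         "SUB_PROCESSOR", "PROCTHROTTLEMAX", value],
        # On-battery (DC) value of the active plan.
        ["powercfg", "/setdcvalueindex", "SCHEME_CURRENT",
         "SUB_PROCESSOR", "PROCTHROTTLEMAX", value],
        # Re-activate the plan so the new values take effect.
        ["powercfg", "/setactive", "SCHEME_CURRENT"],
    ]


def apply_max_processor_state(percent):
    """Run the commands; needs Windows and usually an elevated prompt."""
    for command in max_processor_state_commands(percent):
        subprocess.run(command, check=True)
```

Calling `apply_max_processor_state(100)` afterwards restores the usual default.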
If you're dealing with a videogame that uses DirectX or OpenGL and doesn't have a framerate cap, another common method is to force Vsync on in your graphics driver settings. This will usually slow the rendering to about 60 FPS which may be enough to play at a reasonable rate. However, it will only work for applications using 3D hardware rendering specifically.
Finally: if you'd rather not use a VM, and don't want to change a system global setting, but would rather simulate an old CPU for one specific process only, then I have my own program to do that called Old CPU Simulator.
The main brain of the operation is a command line tool written in C++, but there is also a GUI wrapper written in C#. The GUI requires .NET Framework 4.0. The default settings should be fine in most cases - just select the CPU you'd like to simulate under Target Rate, then hit New and browse for the program you'd like to run.
https://github.com/tomysshadow/OldCPUSimulator (click the Releases tab on the right for binaries.)
The concept is to suspend and resume the process at a precise rate, and because it happens so quickly the process will appear to just be running slowly. For example, by suspending a process for 3 milliseconds, then resuming it for 1 millisecond, it will appear to be running at 25% speed. By controlling the ratio of time suspended vs. time resumed, it is possible to simulate different speeds. This is completely API agnostic: it doesn't hook DirectX, OpenGL, etc., and it'll work with a command line program if you want.
Old CPU Simulator does not ask for a percentage, but rather, the clock speed to simulate (which it calls the Target Rate.) It then automatically determines, based on your CPU's real clock speed, the percentage to use. Although clock speed is not the only factor that has improved computer performance over time (there are also SSDs, faster GPUs, more RAM, multithreaded performance, etc.) it's a good enough approximation to get fairly consistent results across machines given the same Target Rate. It also supports other options that may help with consistency, such as setting the process affinity to one.
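The arithmetic behind the ratio can be sketched as follows. This is not taken from the Old CPU Simulator source; `slowdown_schedule` and the 4 ms period are assumptions chosen to match the 3 ms / 1 ms example above:

```python
def slowdown_schedule(target_mhz, real_mhz, period_ms=4.0):
    """Return (run_ms, suspend_ms) for one suspend/resume period.

    The process runs for target/real of each period and is suspended
    for the remainder. A target above the real speed is clamped to 100%.
    """
    if target_mhz <= 0 or real_mhz <= 0:
        raise ValueError("clock speeds must be positive")
    fraction = min(target_mhz / real_mhz, 1.0)
    run_ms = period_ms * fraction
    return run_ms, period_ms - run_ms
```

For instance, simulating 575 MHz on a 2300 MHz CPU gives a fraction of 25%, so `slowdown_schedule(575, 2300)` yields 1 ms running and 3 ms suspended per period.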
It implements three different methods of suspending and resuming a process and will use the best available: NtSuspendProcess, NtQuerySystemInformation, or Toolhelp Snapshots. It also uses timeBeginPeriod and timeEndPeriod to achieve high precision timing without busy looping. Note that this is not an emulator; the binary still runs natively. If you like, you can view the source to see how it's implemented - it's not a large project. On my machine, Old CPU Simulator uses less than 1% CPU and less than 1 MB of memory, so the program itself is quite efficient (unlike running intensive programs to intentionally slow the CPU.)

Related

Virtualize specific environment (CPU, cache, clock)

I have written some code that's supposed to run on a certain hardware-setup. I'd like to test it to get some preliminary metrics, but without buying the hardware setup, since it's very expensive.
At first I, naively, thought I could set some specifications to the platform when creating a virtual machine through a manager such as VMware Workstation, but it seems like it's not possible.
What ways do you believe would be the best to emulate a certain environment? Of course, RAM, disk space and OS should be fairly easy, but limiting the CPU seems to be the general issue.
I'm trying to simulate the Intel Atom® Processor E3845, so I have some requirements for the maximum number of cores, the cache size and of course the clock frequency.
The closest I've found so far would be to install VMware ESXi on a piece of hardware and limit the CPU. But I'm unsure if this is the best way. Further, I've never really worked with this before, so I'm unsure whether I can limit the cache and so forth. Simply "down-scaling" the metrics does not feel like a good solution when we are rather dependent on the cache (that is, we've seen issues with certain sizes and speeds).
I would love to hear any input you have.

Is it possible to change the guest wall clock speed in a virtualized environment?

We're undertaking a large project that is focused on delivering automated testing of the software that we produce.
We have a lot of "events" that trigger certain behavior at specific times. Ideally, we would be able to exercise these tests in an automated fashion without the need to move the system clock in intervals to specific points in time.
To that end, I'm wondering if there is a way (with VMware, or any other virtualization software) to increase the speed of the system clock of the guest operating system. I'm not interested in measuring performance in these tests, only functionality.
Is there anything out there that would allow for this behavior?
It works for VirtualBox:
VBoxManage setextradata "VM name" "VBoxInternal/TM/WarpDrivePercentage" x
where x is the percentage you want (for instance, 200 is doubling, 50 is halving)
You can also find more information here, in the section "Accelerate or slow down the guest clock". Regards.
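If the tests are driven from a script, the same command can be built and run programmatically. A small sketch, with hypothetical helper names `warp_drive_command` and `set_warp_drive`:

```python
import subprocess


def warp_drive_command(vm_name, percent):
    """Build the VBoxManage call that sets the guest clock speed.

    200 doubles the guest clock rate and 50 halves it.
    """
    if percent <= 0:
        raise ValueError("percent must be positive")
    return [
        "VBoxManage", "setextradata", vm_name,
        "VBoxInternal/TM/WarpDrivePercentage", str(percent),
    ]


def set_warp_drive(vm_name, percent):
    """Run the command; VBoxManage must be on PATH."""
    subprocess.run(warp_drive_command(vm_name, percent), check=True)
```

Passing the arguments as a list avoids shell quoting problems with VM names that contain spaces.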
I was able to work around this using the Win32 API SetSystemTimeAdjustment()
This allows you to increase the amount of time added to the system clock for each OS tick interval. It's meant generally for addressing clock skew, but can be used outside of that particular context.
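A rough sketch of that approach through `ctypes`, assuming the caller already holds the system-time privilege (enabling it is omitted here). `scaled_adjustment` and `speed_up_clock` are illustrative names, not part of the Win32 API:

```python
import ctypes
import sys


def scaled_adjustment(time_increment, speed_factor):
    """100 ns units to add per tick so the clock runs `speed_factor` times as fast."""
    return round(time_increment * speed_factor)


def speed_up_clock(speed_factor):
    """Make the Windows system clock advance faster (Windows only)."""
    if sys.platform != "win32":
        raise OSError("SetSystemTimeAdjustment is a Windows API")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    adjustment = ctypes.c_ulong()
    increment = ctypes.c_ulong()
    disabled = ctypes.c_int()
    # Read the normal amount of time added to the clock per tick interval.
    if not kernel32.GetSystemTimeAdjustment(
            ctypes.byref(adjustment), ctypes.byref(increment),
            ctypes.byref(disabled)):
        raise ctypes.WinError(ctypes.get_last_error())
    new_value = scaled_adjustment(increment.value, speed_factor)
    # False means "use my adjustment value" rather than disabling adjustment.
    if not kernel32.SetSystemTimeAdjustment(new_value, False):
        raise ctypes.WinError(ctypes.get_last_error())
    return new_value
```

With a tick increment of 156250 (15.625 ms), a factor of 2 adds 312500 units per tick, so the clock gains two seconds for every real second.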
I don't see what the benefits are of testing this in a fast-forwarding VM instead of unit testing the event trigger using a mock implementation of the date/time dependency.
The only thing you "gain" by testing this in a fast-forwarding VM is that you test both the system's and the programming language's date/time implementation, which I think you are safe to trust because it has been used, developed and tested by so many people for such a long time.
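A minimal sketch of the mock approach: the code under test takes its clock as a dependency, and the test swaps in a fake one. `EventScheduler` and `FakeClock` are hypothetical stand-ins for your own classes:

```python
import datetime


class EventScheduler:
    """Holds timed events; the clock is injected so tests can replace it."""

    def __init__(self, now=datetime.datetime.now):
        self._now = now
        self._events = []

    def schedule(self, when, name):
        self._events.append((when, name))

    def due(self):
        """Return the names of all events whose time has arrived."""
        current = self._now()
        return [name for when, name in self._events if when <= current]


class FakeClock:
    """A callable clock that only moves when the test tells it to."""

    def __init__(self, start):
        self.current = start

    def __call__(self):
        return self.current

    def advance(self, **kwargs):
        self.current += datetime.timedelta(**kwargs)
```

A test can then jump a day ahead with `clock.advance(days=1)` and check `scheduler.due()` immediately, without touching the system clock.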

How would I simulate running code on different hardware while using only one machine?

As the title says I'm looking to approximate the performance of a piece of code on different hardware setups. Are there any tools out there to do this?
I'm looking to run my code and perform measurements by limiting the resources available to the process. I would like to control things such as total memory available as well as cpu usage, but it would be better if I had more granularity. Are there any tools out there that would allow me to emulate different speeds of RAM, rate limit the cpu (to say X gigaflops), slow down disk reads, etc?
I've already been looking at setrlimit on Linux, but I don't think it will let me emulate things like latency. I considered using VMs to run the code and just tweaking the memory and CPU, but I'm not sure it's granular enough. I also considered things like hooking some of the syscalls and just spinning for x nanoseconds before allowing a read/write syscall, but it feels kind of clunky. The other issue is that this code primarily runs on Windows, and if possible it would be preferable to do this on Windows.
Just for some background, I'm trying to provide some reasonably accurate estimates of things like runtime and resource utilization on different hardware setups without having to actually buy, assemble, and test said hardware.
Thanks for any help you can provide.
If you wish to get very detailed control of every possible part of a machine, use a software emulated machine such as Bochs. Bochs will emulate, in software, an x86 CPU, hard drive, video card, network card, everything.
In order to do what you want to do you would need to build your own version of Bochs with changes to the emulator to control the speed of the different pieces.

Which takes longer time? Switching between the user & kernel modes or switching between two processes?

Which takes longer time?
Switching between the user & kernel modes (or) switching between two processes?
Please explain the reason too.
EDIT : I do know that whenever there is a context switch, it takes some time for the dispatcher to save the status of the previous process in its PCB, and then reload the next process from its corresponding PCB. And for switching between the user and the kernel modes, I know that the mode bit has to be changed. Isn't it all, or is there more to it?
Switching between processes (given you actually switch, not run them in parallel) by an order of oh-my-god.
Trapping from userspace to kernelspace used to be done with a processor interrupt. Around 2005 (don't remember the kernel version), and after a discussion on the mailing list where someone found that trapping was slower (in absolute measures!) on a high-end Xeon processor than on an earlier Pentium II or III (again, my memory), they implemented it with a new CPU instruction, sysenter (which had actually existed since the Pentium Pro, I think). This is done in the Virtual Dynamic Shared Object (vdso) page in each process (cat /proc/pid/maps to find it), IIRC.
So, nowadays, a kernel trap is basically just a couple of CPU instructions, hence rather few cycles, compared to tens or hundreds of thousands when using an interrupt (which is really slow on modern CPUs).
A context switch between processes is heavy. It means storing all processor state (registers, etc.) to RAM (at a magic memory location in the user process space actually, guess where!), in practice dirtying all cached memory in the CPU, and reading back the process state for the new process. The new process will (likely) have nothing left in the CPU cache from the last time it ran, so each memory read will be a cache miss and will need to be read from RAM. This is rather slow. When I was at the university, I "invented" (well, I did come up with the idea, knowing that there is plenty of die area in a CPU, but not enough cooling if it's constantly powered) a cache in the CPU that was of infinite size but unpowered when unused (i.e. only used on context switches), and implemented this in Simics. I implemented support for this magic cache, which I called CARD (Context-switch Active, Run-time Drowsy), in Linux, and benchmarked it rather heavily. I found that it could speed up a Linux machine with lots of heavy processes sharing the same core by about 5%. This was at relatively short (low-latency) process time slices, though.
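To see where the vdso page is mapped without reading the maps file by hand, here is a small sketch. It assumes a Linux `/proc` filesystem and returns an empty list elsewhere; `find_vdso` is a made-up helper name:

```python
def find_vdso(pid="self"):
    """Return the address ranges of the [vdso] mapping for a process."""
    path = f"/proc/{pid}/maps"
    try:
        with open(path) as maps:
            # Each line starts with "start-end" and ends with the mapping name.
            return [
                line.split()[0]
                for line in maps
                if line.rstrip().endswith("[vdso]")
            ]
    except OSError:
        # No /proc, no such process, or no permission to read it.
        return []
```

Calling `find_vdso()` inspects the current process, while `find_vdso(1234)` inspects another one if you have permission to read its maps.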
Anyway. A context switch is still pretty heavy, while a kernel trap is basically free.
Answer to at which memory location in user-space, for each process:
At address zero. Yep, the null pointer! You can't read from this entire page from user-space anyway :) This was back in 2005, but it's probably the same now unless the CPU state information has grown larger than a page size, in which case they might have changed the implementation.

Why is RDTSC a virtualized instruction on modern processors?

I am studying RDTSC and learning about how it is virtualized for the purposes of virtual machines like VirtualBox and VMWare. Why did Intel/AMD go to all the trouble of virtualizing this instruction?
I feel like it can be easily simulated with a trap, and it's not exactly a super-common instruction (I tested, and there's no noticeable slow-down for general usage in a virtual machine where hardware RDTSC virtualization is disabled).
However, I know Intel/AMD wouldn't have gone to all the trouble of adding this instruction to the virtualization hardware unless it was important for it to execute very fast.
Does anyone know why?
It's common to use RDTSC to get fine-grained timing information, where the overhead of a virtualization trap would be quite significant. The most common use is to have two RDTSC instructions with a small amount of code between them, taking the difference of the two readings as the elapsed time (number of cycles) for the code sequence. So even the overhead of pipeline drains/flushes is quite significant.
Also, since all the instruction does is read a continuously running counter, virtualizing it is quite easy -- the hardware only needs to allow saving/reloading the counter value on VM context switches, and not anything special for the RDTSC instruction itself.
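The two-reads pattern can be illustrated in Python, with the caveat that Python cannot issue RDTSC directly. `time.perf_counter_ns` stands in for the counter here, and `measure_ns` and `read_overhead_ns` are illustrative names:

```python
import time


def measure_ns(func, repeats=1000):
    """Smallest observed time for func(), taken between two counter reads."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()  # first read
        func()                          # code sequence being timed
        end = time.perf_counter_ns()    # second read
        elapsed = end - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def read_overhead_ns(repeats=1000):
    """Cost of the measurement itself: time an empty code sequence."""
    return measure_ns(lambda: None, repeats)
```

When the timed sequence is short, the cost of the reads themselves is a large share of the result, which is why a trap on every read would distort this kind of measurement.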
VMs should be able to have separate TSCs because they start up at different times. The physical CPU just has one, so something is needed to at least get individual, per-VM TSC offsets.
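A toy model of a per-VM offset, purely to illustrate the idea and not how any particular hypervisor implements it. `GuestTsc` is a hypothetical class, and the counter is treated as a 64-bit value:

```python
class GuestTsc:
    """Models a guest counter as the host counter plus a fixed offset."""

    def __init__(self, host_tsc_at_start):
        # Choose the offset so the guest sees zero at the moment it starts.
        self.offset = -host_tsc_at_start

    def read(self, host_tsc):
        """Value the guest would see for a given host counter value."""
        return (host_tsc + self.offset) % 2**64
```

Two guests started at host counts 1000 and 5000 would read 5000 and 1000 respectively when the host counter is at 6000, even though they share one physical counter.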
Also, since VMs don't own the underlying physical CPUs fully, that is, they don't get to execute on them all the time, their TSCs should also somehow reflect the "on/off" periods and it is desirable that they don't change abruptly in value w.r.t. actual time, which the VMs should still get right from the host OS, because there's a lot of software that is virtualization-unready and can break when the numbers are too off.
I think these are the reasons why RDTSC is virtualized. But whatever you do, meeting conflicting requirements is tough, and they complicate matters. You can't hide virtualization and have VMs run at near-native speed at the same time. There are trade-offs, and some things have to give.