How do computers switch between different processes (is it mainly an OS thing)? - process

Early computers, such as the ENIAC, had to have their program memory (a.k.a. instruction memory) changed manually in order for different programs to run. This would involve changing the tape or punchcard on which instructions were stored, so that every time a new program was to be run, the tape or punchcard had to be changed.
This limitation of early computers was in part due to the low informational density of rolls of tape compared to modern HDDs, but it was also partly due to the idea (and please correct me if I'm wrong) that each roll of tape was supposed to store only a single program.
In contrast, modern computers can switch between many different programs without having to physically replace ROM. It's easy to switch from one active window to another, or to start running a new program with a few mouse clicks. We now have HDDs and SSDs rather than punchcards and tape, so that we can simply have all the programs we want to run stored on a single SSD and a few HDDs which are connected to our computers all the time. And we never need to change our memory-storage devices until they break.
I hope the above is enough to motivate the following question:
What are some typical low-level features (w.r.t. hardware and/or software) that enable modern computers to switch between different processes or programs stored in ROM, as opposed to simply treating ROM as a container for single programs as did the computers of yore?

HDDs and SSDs are not equivalent to ROM. There is also a distinction between a program and a process: a process is a program in execution. Multiple programs are stored on HDDs and SSDs. When a program is loaded into RAM it becomes a process; a loader, which is part of the OS, does that. Multiple programs can be loaded into RAM at the same time. The context switch is an OS function. A context switch involves not only the program's instructions but its data as well. A lot of low-level features are involved, and it is not possible to list all of them. In short: yes, it is mainly an OS thing.
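To make the program/process distinction concrete, here is a minimal Python sketch. It asks the OS to load the same program twice and shows that two separate processes result, each with its own process ID. The names `PROGRAM_SOURCE` and `launch_twice` are made up for this illustration.

```python
import subprocess
import sys

# The "program": a fixed piece of code. On its own it is passive data.
PROGRAM_SOURCE = "import os; print(os.getpid())"


def launch_twice():
    """Start the same program twice and return the PID each process reports."""
    # Popen asks the OS to create a new process and load the program into it.
    # Both children are started before either is waited for, so both
    # processes exist at the same time.
    children = [
        subprocess.Popen(
            [sys.executable, "-c", PROGRAM_SOURCE],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(2)
    ]
    # communicate() waits for the child to finish and returns (stdout, stderr).
    return [int(child.communicate()[0]) for child in children]
```

Calling `launch_twice()` returns two different PIDs: one program on disk, two processes in memory.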

Related

Does laptop Embedded Controller have limited writes?

I am wondering if I should be worried about excessive writes to the embedded controller registers on my laptop. I am guessing that if they are true registers, they probably act more like RAM rather than flash memory so this isn't a problem.
However, I have a script to modify the registers in my laptop's EC to better control the fan speed curve. It has to be re-applied after each power change event such as sleep/wake as well as power cable events, so it happens fairly often. I just want to make sure I am not burning out my chips in the process.
The script I am using to write to the EC is located here:
https://github.com/RayfenWindspear/perl-acpi-fanspeed
Well, it seems you're writing to ACPI registers. Registers here do not refer to any specific hardware; it just means it's a specific address that you can reach using a specific bus. It is, however, highly unlikely that something you have to re-write after every power cycle is overwriting permanent storage, so for all practical purposes I'd assume that you can rely on this for as long as your laptop lives.
Hardware peripherals are almost universally implemented as SRAM cells. They will not wear out first. The fan you are controlling will have a limited number of start/stop cycles. So it is much more likely that the act of toggling these registers will wear something else out prematurely (than the SRAM type memory cell itself).
To your particular case, correctly driving a fan/motor can significantly improve its lifetime. Overdriving a fan/motor does not always make it go faster, but instead creates heat. The heat weakens the wiring and eventually the coils will short, reducing drive and eventually wearing out. That said, the element being cooled can be damaged by excess heat, so tuning things just to reduce sound may not make sense.
Background details
Generally, the element is called a flip-flop, which comes in various forms. SystemRDL is one example, along with SystemC and others, of a language in which digital engineers model these. In digital hardware, the flip-flops have default or reset values. This is fixed like ROM on each chip and is not normally re-programmable; alternatively it uses EEPROM technology (see Note 1), or is often configured via input lines which the hardware designer can pull high or low with a resistor or connect to another element's GPIO.
It is analogous to 'initdata'. Program values that aren't zero get copied from flash, disk, etc to memory at program startup. So the flip-flops normally do not hold state over a power cycle; something else does this.
The 'Flash' technology is based on a floating gate and uses 'quantum tunnelling' to program the floating gates. This process is slightly destructive. The tunnelling mechanism is named after Fowler and Nordheim, who described it in 1928, and the floating-gate transistor dates from 1967, but the electronics industry did not start to produce flash widely until the early 90s, with NOR flash followed by NAND flash and many variants. The underlying physics is the same; just the digital connections are different. So, as well as having the wear defect you are concerned about, flash technology actually arrived after many hardware chips such as the 68k, i386, etc. By then 'flip-flops' were well established. The 'register' part of the logic is typically not a large part of a chip, and a flip-flop uses the same logic (gates) as the rest of the chip logic, meaning that using flash would be an extra overhead with little benefit.
Some additional take-away is that the startup and shutdown of chips is usually the most destructive time. Often poor hardware designers do not put in proper voltage supervision, and some lines may be floating with the expectation that system programs will set them immediately. Reset events, ESD, overheating, etc. will all be more harmful than just the act of writing a peripheral register.
Note 1: EEPROM typically has 100,000+ cycles. These features are typically only used once at manufacture time to set a chip configuration for the system. These are actually quite rare, but possible.
The MLC (multi-level) NAND flash in SSD has pathetically low cycles like 8,000 in some cases. The SLC (single level) old school flash have 10,000+ cycles, but people demand large data formats.

Bootloader Working

I am working on Uboot bootloader. I have some basic question about the functionality of Bootloader and the application it is going to handle:
Q1: As per my knowledge, bootloader is used to download the application into memory. Over internet I also found that bootloader copies the application to RAM and then the application runs from RAM. I am confused with the working of Bootloader...When application is provided to bootloader through serial or TFTP, What happens next, whether Bootloader copies it to RAM first or whether it writes directly to Flash.
Q2: Why there is a need for Bootloader to copy application to RAM and then run the application from RAM? What difficulties we will face if our application runs from FLASH?
Q3: What is the meaning of the statement "My application is running from RAM/FLASH"? Does it mean that our application's .text segment or .code segment is in RAM/FLASH? And are we not concerned about the .bss section because it is designed to be in RAM?
Thanks
Phogat
When any hardware system is designed, the designer must consider where the executable code will be located. The answer depends on the microcontroller, the included memory types, and the system requirements. So the answer varies from system to system. Some systems execute code located in RAM. Other systems execute code located in flash. You didn't tell us enough about your system to know what it is designed to do.
A system might be designed to execute code from RAM because RAM access times are faster than flash so code can execute faster. A system might be designed to execute code from flash because flash is plentiful and RAM may not be. A system might be designed to execute code from flash so that it boots more quickly. These are just some examples and there are other considerations as well.
RAM is volatile so it does not retain code through a power cycle. If the system executes code located in RAM then a bootloader is required to obtain and write the code to RAM at powerup. Flash is non-volatile so execution can start right away at powerup and a bootloader is not necessary (but may still be useful).
Regarding Q3, the answer is yes. If the system is running from RAM then the .text will be located in RAM (but not until after the bootloader has copied it to there). If the system is running from flash then the .text section will be located in flash. The .bss section is variables and will be in RAM regardless of where the .text section is.
Yes, in general a bootloader boots the system, but it might also provide a mechanism for interrupting the default boot path and allow alternate firmware to be downloaded and run instead, as well as other features (like flashing).
Traditional rom had a traditional ram-like interface: address, data, chip select, read/write, etc. And you can still buy rom that way, but it is cheaper from a pin real estate perspective to use something spi or i2c based, which is slower. Not desirable to run from, but tolerable to read once then run from ram. Newer flash technologies can have (and have had) problems with read-disturb, where if your code is in a tight loop reading the same instructions, or for any other reason the flash is being read too fast, the charge can drop such that a read returns the wrong data, potentially causing the program to change course or crash. Also your PC and other linux platforms are used to copying the kernel from NV storage (hard disk) to ram and then running from there, so the copy from flash to ram and run from ram has a comfort level, and is often faster than flash. So there are many potential reasons to not run from flash, but depending on the system it may be possible to run from flash just fine (on some systems the flash in question is not accessible directly and not executable; of course SOME rom in that system needed to be executable/bootable).
It simplifies the coding challenges if you program the flash with something that is in ram. You can create and debug the code one time that reads from ram and writes to flash and reads from flash and writes to ram. DONE. Now you can work on separate code that receives data from serial to ram, or from ram to serial. DONE. Then work on code that does the same over ethernet or usb or whatever. DONE. You don't have to deal with inventing a protocol or solving the problem of timing. Flash writing is very slow, and even xmodem at a moderate speed can be way too fast, so you have to buffer that data in ram anyway. You might as well make the tasks completely separate: instead of an xmodem or any other serial based flash loader with a big ram based fifo, just move the data to ram, then separately go from ram to flash. Same for other interfaces.
It is technically possible to buffer the data and give the illusion of going from the download interface straight to flash, and depending on the protocol it is technically possible to hold off the sender so that as little as one flash page is required in ram before programming flash.
With the older parallel flashes you could do something pretty cool which I don't think most people figured out. When you stop writing to the flash page for some known period of time, the flash would automatically start to program that page, and you have to wait for 10ms or something like that before it is done. What folks assumed was that you had to program sequential addresses and had to get the new data for the next address in that period of time, which would demand high serial port speeds, etc. The reality is you can pound the same address over and over again with the same data and the flash won't start to program the page, so the download interface can be infinitely slow. Serial flashes work differently and either don't need tricks or have different tricks.
RAM/FLASH is not some industry term. It likely means that .text is in rom (flash) and .data and .bss are in ram. A copy of the initial state of .data will probably be on flash as well and copied to ram before main() is called; likewise .bss will be zeroed before main() is called. Look at crt0.S for most platforms in gnu sources (glibc, or is it gcc, I don't know) to get the gist of how the bootstrap works in a generic fashion.
A bootloader is not required to run linux or other operating systems. You don't NEED uboot, but it is quite useful. Linux is pretty easy: you copy the kernel and root file system, either set some registers or some tags in memory or both, then branch to the entry point in the kernel and linux takes over from there. Because linux is so complicated it is desirable to have a complicated bootloader that can capitalize on high speed interfaces like ethernet (rather than being limited to serial or slower).
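As a rough illustration of the start-up steps just described (copy the initial .data image from flash to ram, zero .bss, then call main()), here is a toy Python model. It is not real start-up code: the memory layout, sizes and contents are all invented for the sketch, and a real crt0 does this in assembly using addresses supplied by the linker script.

```python
# Toy model of what start-up code does before main().
# Every offset, size and value below is hypothetical.

TEXT_SIZE = 16   # pretend .text (code) occupies the first 16 bytes of flash
DATA_SIZE = 4    # initialised variables (.data)
BSS_SIZE = 8     # zero-initialised variables (.bss)

# Flash image: code first, then the stored initial values for .data.
FLASH = bytes([0xAA] * TEXT_SIZE) + bytes([1, 2, 3, 4])


def startup(flash: bytes) -> bytearray:
    """Return the RAM contents as main() would see them."""
    # RAM holds unpredictable values at power-up; 0xFF stands in for that.
    ram = bytearray([0xFF] * (DATA_SIZE + BSS_SIZE))
    # Step 1: copy the initial .data values from flash into RAM.
    ram[0:DATA_SIZE] = flash[TEXT_SIZE:TEXT_SIZE + DATA_SIZE]
    # Step 2: clear .bss so uninitialised globals start at zero.
    ram[DATA_SIZE:DATA_SIZE + BSS_SIZE] = bytes(BSS_SIZE)
    return ram
```

In this model .text never leaves flash, which corresponds to "running from flash"; a system "running from ram" would add a third step that copies the code as well.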
I would add something regarding your question Q2.
Q2: Why there is a need for Bootloader to copy application to RAM and then run the application from RAM? What difficulties we will face if our application runs from FLASH?
It is not only about having SPI or similar serial external code memory (which is not that often anyway).
Even an external ROM/FLASH/EPROM connected to the usual high speed parallel bus will prevent a system from running at its maximum clock (with zero wait states), even on relatively slow MCUs, due to the external memory access time. You would need a 10 ns FLASH access time for a 100 MHz clock, which is not so easy to get (if economically possible at all). And you would agree that 100 MHz is not such a brain spinning speed any more :-)
That is why many MCU/CPU architectures are doing tricks such as reading multiple instructions at once, or having an internal cache, or doing whatever is needed to compensate for a slow external code memory. Mostly it is only the older 8-bit architectures that can execute the code directly from the flash memory ('in place').
Even if your only code memory was the internal Flash, something needs to be done to speed it up. Take a look for example at this article:
http://www.iqmagazineonline.com/magazine/pdf/v_3_2_pdf/pg14-15-18-19-9Q6Phillips-Z.pdf
It describes how the ARM7 has incorporated something they called MAM (Memory Accelerator Module). It is a good read, and you will find some measures there to speed up the code memory access for the specific ARM7 architecture (they go for most others too):
Limit maximum clock frequency (from 80 MHz to about 20 MHz for the example in the article)
Insert wait-cycles during flash accesses
Use an instruction cache
Copy the program code from flash to RAM
Obviously, if the instruction cache was not an option (too small, or the clock too high) you are really left only with execution from the RAM, after relocating the code there at the start up.
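The access-time arithmetic above (10 ns for a 100 MHz clock) can be written out as a small Python sketch. It uses a deliberately simplified model in which one memory access must complete within a whole number of clock periods; the 50 ns flash access time in the usage note is an assumed figure chosen to give round numbers, not a value taken from the article.

```python
import math


def clock_period_ns(clock_hz: float) -> float:
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def wait_states(clock_hz: float, access_time_ns: float) -> int:
    """Extra cycles the CPU must stall so one access fits in whole cycles.

    Simplified model: an access needs ceil(access_time / period) cycles,
    and every cycle beyond the first is a wait state.
    """
    cycles = math.ceil(access_time_ns / clock_period_ns(clock_hz))
    return max(cycles - 1, 0)
```

With these assumptions, `wait_states(100e6, 10)` is 0, matching the 10 ns requirement quoted above, while a hypothetical 50 ns flash gives `wait_states(100e6, 50) == 4` and `wait_states(80e6, 50) == 3`, and runs without wait states only once the clock drops to 20 MHz.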
There is also an option to run only specific sections of code from RAM, which can be specified to the linker. For DSP (Digital Signal Processing) systems, there was really no option to run from the EPROM/FLASH even in the old days with clocks of only a few tens of MHz, let alone now.
Another issue is debugging: the options for debugging code placed in ROM, or even Flash, are very limited (you have to move sections of the code to RAM to be able to set a breakpoint on most systems).
Regarding Q2, one of the difficulties you may face executing from Flash is another code update. If you are executing from the same block of Flash you are trying to update, the system will crash. This depends on your system architecture (how your application and bootloader are organized in Flash) but may be particularly hard to avoid if you are trying to update the bootloader itself.

Which takes longer time? Switching between the user & kernel modes or switching between two processes?

Which takes longer time?
Switching between the user & kernel modes (or) switching between two processes?
Please explain the reason too.
EDIT : I do know that whenever there is a context switch, it takes some time for the dispatcher to save the status of the previous process in its PCB, and then reload the next process from its corresponding PCB. And for switching between the user and the kernel modes, I know that the mode bit has to be changed. Isn't it all, or is there more to it?
Switching between processes (given you actually switch, not run them in parallel) by an order of oh-my-god.
Trapping from userspace to kernelspace used to be done with a processor interrupt earlier. Around 2005 (don't remember the kernel version), and after a discussion on the mailing list where someone found that trapping was slower (in absolute measures!) on a high-end xeon processor than on an earlier Pentium II or III (again, my memory), they implemented it with a new cpu instruction sysenter (which had actually existed since Pentium Pro I think). This is done in the Virtual Dynamic Shared Object (vdso) page in each process (cat /proc/pid/maps to find it) IIRC.
So, nowadays, a kernel trap is basically just a couple of cpu instructions, hence rather few cycles, compared to tens or hundreds of thousands when using an interrupt (which is really slow on modern CPUs).
A context switch between processes is heavy. It means storing all processor state (registers, etc) to RAM (at a magic memory location in the user process space actually, guess where!), in practice dirtying all cached memory in the cpu, and reading back the process state for the new process. It will (likely) have nothing still in the cpu cache from the last time it ran, so each memory read will be a cache miss and will need to be read from RAM. This is rather slow. When I was at the university, I "invented" (well, I did come up with the idea, knowing that there is plenty of die area in a CPU, but not enough cooling if it's constantly powered) a cache that was of infinite size although unpowered when unused (i.e. only used on context switches) in the CPU, and implemented this in Simics. I implemented support for this magic cache, which I called CARD (Context-switch Active, Run-time Drowsy), in Linux, and benchmarked it rather heavily. I found that it could speed up a Linux machine with lots of heavy processes sharing the same core by about 5%. This was at relatively short (low-latency) process time slices, though.
Anyway. A context switch is still pretty heavy, while a kernel trap is basically free.
Answer to at which memory location in user-space, for each process:
At address zero. Yep, the null pointer! You can't read from this entire page from user-space anyway :) This was back in 2005, but it's probably the same now unless the CPU state information has grown larger than a page size, in which case they might have changed the implementation.
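If you want a rough feel for the difference yourself, the following Python sketch times a cheap system call against a ping-pong between two processes over pipes. It is a crude approximation under several assumptions: it needs a POSIX system (it uses `os.fork`), the interpreter overhead inflates both numbers, each round trip includes four system calls in addition to the process switches, and on a multi-core machine the two processes may run on different cores, so the second figure is wake-up latency more than a pure context switch. The function names are made up for this example.

```python
import os
import time


def time_syscalls(n: int = 20000) -> float:
    """Average seconds per os.getppid() call (a cheap trip into the kernel)."""
    start = time.perf_counter()
    for _ in range(n):
        os.getppid()
    return (time.perf_counter() - start) / n


def time_process_pingpong(n: int = 2000) -> float:
    """Average seconds per round trip between two processes over pipes."""
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: echo every byte straight back, then exit without cleanup.
        try:
            for _ in range(n):
                os.write(to_parent_w, os.read(to_child_r, 1))
        finally:
            os._exit(0)
    # Parent: send a byte and block until the child answers.
    start = time.perf_counter()
    for _ in range(n):
        os.write(to_child_w, b"x")
        os.read(to_parent_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    for fd in (to_child_r, to_child_w, to_parent_r, to_parent_w):
        os.close(fd)
    return elapsed / n
```

Comparing `time_syscalls()` with `time_process_pingpong()` on an otherwise idle machine gives an order-of-magnitude impression only; pinning both processes to one core makes the second number closer to a real context-switch cost.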

Virtual Processors and Logical Partitions

I basically wanted to know what exactly a virtual processor is. At IBM's site they define it as:
"A virtual processor is a representation of a physical processor core to the operating system of a logical partition that uses shared processors. "
I understand that if there are x processors, each of which can simultaneously perform two operations, then the system can perform 2x operations simultaneously. But where does a virtual processor fit into this? I also tried looking up the difference between a logical partition and other partitions such as primary, but wasn't really sure.
I'd like to draw an analogy between virtual memory and virtual processors.
Start with expectations:
A user program is written against a set of expectations about what the memory looks like (a nice, flat, large, continuous memory model is the best...)
An OS is written against a set of expectations about how the hardware performs (what CPU protection modes are available, how interrupts arrive and are blocked and handled, how to talk to IO devices, etc...)
Realize that expectations can be met directly by the hardware, or by an abstraction layer
Virtual memory is a set of (specialized, not found in simple chips) hardware tools and OS services that fake a user program into thinking that it has that nice, flat, large, continuous memory space, even while the OS is busily dividing the real memory into little pieces, and storing some of them on disk, bringing others back, and otherwise making a real hash of it. But your code doesn't care. Everything just works.
A virtual processor system is a set of (specialized, not found in consumer CPUs) hardware tools and hypervisor services that allow your OS to believe it has direct access to one or more processors with the expected protection modes, interrupts, etc. even though the hypervisor is busily swapping whole OS contexts onto and off of one or more real processors, starting and stopping access to IO busses, and so on and so forth. But the OS doesn't care. Everything just works.
The hardware support to do this has only recently started to be available in "desktop" CPUs, but Big Iron has had it for ages. It is useful for a couple of reasons
Protection. In a properly protected OS, it is tough for one process or user to spy on another. But since they can be resident in the same context, it may still be possible. Virtualizing OSs divides them by another, even thinner channel and makes it that much harder for data to leak, and for malicious things to be done.
Robustness. If you can swap OS contexts in and out, you can migrate them from one machine to another, and checkpoint and restart them. This allows for computers that detect failures on their own processors and recover gracefully.
These are the things (aside from millions of LOC of heavily debugged, mission critical code) that have kept people paying for Big Iron.
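To make the virtual memory half of the analogy concrete, here is a toy Python model of address translation. The page size, the page table contents and the names are all invented for the illustration; real hardware does this lookup in the MMU, and the OS services the fault by bringing the page back from disk.

```python
PAGE_SIZE = 4096

# Hypothetical page table for one process:
# virtual page number -> physical frame number, or None if swapped to disk.
PAGE_TABLE = {0: 7, 1: 3, 2: None}


class PageFault(Exception):
    """Raised when the page is not in physical memory right now."""


def translate(vaddr: int, table=PAGE_TABLE) -> int:
    """Map a virtual address to a physical address, as the MMU would."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = table.get(page)
    if frame is None:
        # The OS would now load the page from disk and retry the access.
        raise PageFault(f"virtual page {page} is not resident")
    return frame * PAGE_SIZE + offset
```

The program sees addresses 0, 1, 2, ... as one continuous range, while `translate(0)` and `translate(4096)` land in unrelated physical frames. A hypervisor plays the same trick one level up, with whole processors instead of pages.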

How To Simulate Lower CPU Processor Machines For Browser Testing

We have some users which are using lower-CPU powered machines and they're encountering slow response times using our web application. Is there any way for me to do testing so that I can simulate lower CPU rates?
For example, I have 2.3 GHz of computing power; can I lower it to 1.6 GHz or lower so that I can test with it?
BTW, our customers are using Windows. I have to simulate low computing power on Internet Explorer as browser.
On most new CPUs the multiplier can easily be lowered (Intel: SpeedStep, AMD: PowerNow!). This is used to save power. With RMClock you can manually adjust your multiplier and thus lower your frequency and make your PC slower. I use this tool myself, so I can tell you that it works.
http://cpu.rightmark.org/products/rmclock.shtml
The virtual machine Bochs (pronounced "boxes") allows you to set an instructions-per-second directive. It's probably the slowest emulator out there as it is, though...
Create some virtual machines.
You can use VirtualPC or VirtualBox; both are free.
I would recommend starting something in the background which eats up all your processor cycles.
A program which finds primenumbers or something similar.
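A minimal Python sketch of such a CPU-eating program is below. It is only an illustration of the idea: the function names, the inner limit of 20,000 and the choice of one worker per core are arbitrary assumptions, and the load it produces is not a calibrated simulation of any particular slower CPU.

```python
import multiprocessing
import time


def count_primes(limit: int) -> int:
    """Count the primes below `limit` by trial division (slow on purpose)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def burn(seconds: float) -> None:
    """Keep one core busy with prime counting for roughly `seconds`."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        count_primes(20_000)


def run_burners(seconds: float, workers: int) -> None:
    """Start `workers` busy processes and wait for them to finish."""
    procs = [
        multiprocessing.Process(target=burn, args=(seconds,))
        for _ in range(workers)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
```

For example, calling `run_burners(60, multiprocessing.cpu_count())` from a script's `if __name__ == "__main__":` block loads every core for about a minute while you test the browser. Because the OS scheduler decides how time is shared, the slowdown you observe will vary.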
Another slight option in addition to those above is to boot Windows in a lower resource config. Go to the start menu, select Run and type MSCONFIG. You can go to the boot tab, click on advanced options and limit the memory and the number of processors. It's not as robust as the above, but it does give you another option.
Lowering the CPU clock doesn't always give expected results.
Newer CPUs feature architecture improvements which make them more efficient on an equivalent clock basis than older chips. Incidentally, because of this, virtual machines are a bad way of testing performance for "older" tech as well.
Your best bet is to simply buy a couple of older machines. Using similar RAM (types and amounts), processor, motherboard chipsets, hard drives, and video cards. All of which feed into the total performance of the machine itself.
I bring the other components up because changing just one of them can have an impact on even browser performance. A prime example is memory. If your clients are constrained to something like 512MB of RAM, the machines could be performing a lot of hard drive access for VM swaps, even for just running the browser. In this situation downgrading the clock speed on your processor while still retaining your 2GB (assuming) of RAM would still not perform anywhere near the same even if everything else was equal.
Isak Savo's answer works, but can be a bit finicky, as the modern TPL is going to try to limit cpu load as much as possible. When I tested it out, it was hard (though possible with some testing) to consistently get the types of cpu usage I wanted.
Then I remembered, http://www.cpukiller.com/, which does this already. Highly recommended. As an aside, I found this util from playing old 90s games on modern machines, back when frame rate was pegged to cpu clock time, making playing them on modern computers way too fast. Great utility.
Another big difference between high-performance and low-performance CPUs is the number of cores available. This can realistically differ by a factor of 4, way more than the difference in clock frequency you're likely to encounter.
You can solve this by setting the thread affinity. Even IE6 will use 13 threads just to show google.com. That means it will benefit from a multi-core CPU. But if you set the thread affinity to one core only, all 13 IE threads will have to share that one core.
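On Windows, affinity is expressed as a bit mask with one bit per logical core, both in the `SetProcessAffinityMask` API and in the hexadecimal argument of the `start /affinity` command. The Python sketch below only computes that mask and builds a command line; the helper names and the example program name are assumptions for illustration, and it does not change any process's affinity by itself.

```python
def affinity_mask(cores) -> int:
    """Return a bit mask with one bit set for each core index in `cores`."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core indexes start at 0")
        mask |= 1 << core
    return mask


def start_command(program: str, cores) -> str:
    """Build a cmd.exe line that launches `program` restricted to `cores`.

    `start` expects the mask in hexadecimal; the empty "" is the window title.
    """
    return f'start "" /affinity {affinity_mask(cores):X} {program}'
```

For instance, `start_command("iexplore.exe", [0])` yields `start "" /affinity 1 iexplore.exe`, which confines all of the browser's threads to the first core.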
I understand that this question is pretty old, but here are some recipes I personally use (not only for Web development):
BES. I'm getting some weird results while using it.
Go to Control Panel\All Control Panel Items\Power Options\Edit Plan Settings\Change Advanced Power Settings, then go to the "Processor" section and set its maximum state to 5% (or something else). It works only if your processor supports dynamic multiplier changes and the ACPI driver is installed correctly.
Run Task Manager and set processor affinity to a single core (or whatever number of cores you want) for your browser's (or any other's) process. Not a best practice for browsers, because JavaScript implementations are usually single-threaded, but, as far as I see, modern browsers actually DO use multiple cores.
There are a few different methods to accomplish this.
If you're using VirtualBox, go into the Settings for the VM you want to slow the CPU speed for. Go to System > Processor, then set the Execution Cap. The percentage controls how slow it will go: lower values are slower relative to the regular speed. In practice, I've noticed the results to be choppy, although it does technically work.
It is also possible to set the CPU speed for the whole system. In the Windows 10 Settings app, go to System > Power & Sleep. Then click Additional Power Settings on the right hand side. Go to Change Plan Settings for the currently selected plan, then click Change Advanced Power Plan Settings. Scroll down to Processor Power Management and set the Maximum Processor State. Again, this is a percentage. Although this does work, I find that in practice, it doesn't have a big impact even when the percentage is set very low.
If you're dealing with a videogame that uses DirectX or OpenGL and doesn't have a framerate cap, another common method is to force Vsync on in your graphics driver settings. This will usually slow the rendering to about 60 FPS which may be enough to play at a reasonable rate. However, it will only work for applications using 3D hardware rendering specifically.
Finally: if you'd rather not use a VM, and don't want to change a system global setting, but would rather simulate an old CPU for one specific process only, then I have my own program to do that called Old CPU Simulator.
The main brain of the operation is a command line tool written in C++, but there is also a GUI wrapper written in C#. The GUI requires .NET Framework 4.0. The default settings should be fine in most cases - just select the CPU you'd like to simulate under Target Rate, then hit New and browse for the program you'd like to run.
https://github.com/tomysshadow/OldCPUSimulator (click the Releases tab on the right for binaries.)
The concept is to suspend and resume the process at a precise rate, and because it happens so quickly the process will appear to just be running slowly. For example, by suspending a process for 3 milliseconds, then resuming it for 1 millisecond, it will appear to be running at 25% speed. By controlling the ratio of time suspended vs. time resumed, it is possible to simulate different speeds. This is completely API agnostic (it doesn't hook DirectX, OpenGL, etc. it'll work with a command line program if you want.)
Old CPU Simulator does not ask for a percentage, but rather, the clock speed to simulate (which it calls the Target Rate.) It then automatically determines, based on your CPU's real clock speed, the percentage to use. Although clock speed is not the only factor that has improved computer performance over time (there are also SSDs, faster GPUs, more RAM, multithreaded performance, etc.) it's a good enough approximation to get fairly consistent results across machines given the same Target Rate. It also supports other options that may help with consistency, such as setting the process affinity to one.
It implements three different methods of suspending and resuming a process and will use the best available: NtSuspendProcess, NtQuerySystemInformation, or Toolhelp Snapshots. It also uses timeBeginPeriod and timeEndPeriod to achieve high precision timing without busy looping. Note that this is not an emulator; the binary still runs natively. If you like, you can view the source to see how it's implemented - it's not a large project. On my machine, Old CPU Simulator uses less than 1% CPU and less than 1 MB of memory, so the program itself is quite efficient (unlike running intensive programs to intentionally slow the CPU.)
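The arithmetic behind the suspend/resume approach described above can be sketched in a few lines of Python. This is not Old CPU Simulator's actual code, only an illustration of the ratio calculation: the function names and the default 4 ms period are assumptions, and the sketch does no suspending itself.

```python
def speed_fraction(target_mhz: float, real_mhz: float) -> float:
    """Fraction of real time the process should be allowed to run."""
    if target_mhz <= 0 or real_mhz <= 0:
        raise ValueError("clock speeds must be positive")
    # A target faster than the real CPU cannot be simulated; cap at 100%.
    return min(target_mhz / real_mhz, 1.0)


def duty_cycle(fraction: float, period_ms: float = 4.0):
    """Split one period into (resumed_ms, suspended_ms) for a given fraction."""
    resumed = period_ms * fraction
    return resumed, period_ms - resumed
```

With a 2300 MHz CPU and a 575 MHz target, `speed_fraction(575, 2300)` is 0.25 and `duty_cycle(0.25)` returns `(1.0, 3.0)`: run for 1 ms, suspend for 3 ms, which is the 25% example given earlier.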