Simulate and optimize a scheduler job to drain data from a data center - optimization

I have an assignment to simulate a problem which we currently have, and that is draining data out of old hard drives. Imagine we have 5 hard disks H1 ... H5. Each has a specific capacity Ci and remaining space Ri. We don't want disks to reach their full capacity, so we need to come up with a scheduler job which frequently drains data out of a disk and relocates it on some other disks. Now the problem is that this draining process impacts the workflow of our system. The performance of the system can be measured by some metrics, let's say M1 and M2. Now, how do I design a draining scheduler which tells me when and how much data should be relocated out of which disk such that it minimizes the impact on M1 and M2?
I use SimPy to simulate this system in Python.

For any realistic and practical scenario, the performance metrics (M1 and M2) will have nothing to do with CPU time or (CPU) scheduling whatsoever. All modern (and most "not modern") disk controllers use DMA/bus mastering to transfer data to/from disk themselves (without using any CPU time to do the transfer), so M1 and M2 will (primarily) depend on disk IO bandwidth and not CPU time.
The device driver for the disk controller should/will support some kind of IO priorities; allowing "when disk controller has nothing more important to do (no higher priority transfers), disk controller driver asks disk controller to transfer data to drain the disk (as pre-arranged by file system layer)". In other words "drain disks when disk is idle" can be achieved merely by using a low IO priority.
However, this alone does not work, and the "only drain when (disk) is idle" idea is fundamentally flawed. The problem is that if the disk is pounded for a long time it can still become full (because the disk controller continually had higher priority work to do), leading to a "no free disk space" critical condition (likely failure). The solution is to make the IO priority of draining depend on how full the disk is. If there's enough remaining space on the disk (more than some threshold), then "IO priority of draining" is the lowest priority (so that it doesn't ruin the performance of normal disk IO); and as free space drops below the threshold the IO priority of draining rises in proportion, until you reach "IO priority of draining is the highest possible priority because there is no free disk space" (sacrificing performance of normal disk IO to prevent a "no free space at all" critical condition as you approach this extreme). Basically, maybe something like "if (Ri >= threshold) { draining_IO_priority = min_IO_priority; } else { draining_IO_priority = min_IO_priority + (1.0 - Ri / threshold) * (max_IO_priority - min_IO_priority); }"
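In Python that mapping might look like the following (a minimal sketch; the priority range and the "higher number = more urgent" convention are assumptions, and "remaining"/"threshold" correspond to Ri and the threshold above):

    MIN_IO_PRIORITY = 0    # background / lowest urgency (assumed: higher number = more urgent)
    MAX_IO_PRIORITY = 10   # highest urgency

    def drain_io_priority(remaining, threshold):
        """Map remaining free space to a draining IO priority."""
        if remaining >= threshold:
            return MIN_IO_PRIORITY                     # plenty of space: stay out of the way
        urgency = 1.0 - remaining / threshold          # 0.0 at the threshold, 1.0 when full
        return MIN_IO_PRIORITY + urgency * (MAX_IO_PRIORITY - MIN_IO_PRIORITY)

    print(drain_io_priority(remaining=500, threshold=200))  # 0   (idle-priority draining)
    print(drain_io_priority(remaining=50, threshold=200))   # 7.5 (getting urgent)
    print(drain_io_priority(remaining=0, threshold=200))    # 10  (drain at all costs)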
Also note that the file system layer (and the disk controller driver and almost everything else except some old user-space APIs) is primarily event driven. When the file system receives a request that would cause disk space to be allocated (e.g. resulting from a process doing a "write()"), it responds to the event by deciding if it needs to send a "drain request" to the disk controller (in addition to allocating some disk space) or deciding if a previous request needs an IO priority boost; when the file system receives a "drain request completed" reply event from the disk controller, it decides if it needs to send another drain request; etc. With this in mind, the file system layer should run at a high CPU scheduler priority so it can respond to events quickly (but that has nothing to do with disk IO priorities).
Finally; yes there is an "IO scheduler" (e.g. possibly built into the disk controller's driver); but this is hopefully an extremely trivial "when one transfer completes, find the highest priority pending transfer and do that next" algorithm that doesn't require much thought or complexity. However, for some cases it depends on the device (e.g. for old "rotating mechanical disk" hard drives an attempt to reduce/optimize seek times may be involved).
I guess what I'm trying to say is that, for a well designed system, a "draining scheduler" should not exist at all.
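That said, since you are already simulating this in SimPy, you can model exactly that "priority follows free space" behaviour and then measure its effect on whatever M1/M2 stand for. A rough sketch under made-up numbers (capacity, arrival rate, chunk size and threshold are all assumptions, not values from your system; note that SimPy's PriorityResource treats lower numbers as more urgent):

    import random
    import simpy

    CAPACITY = 1000      # Ci (arbitrary units, assumption)
    THRESHOLD = 200      # free-space threshold below which draining gets urgent
    LOW, HIGH = 10, 0    # SimPy PriorityResource: lower number = higher priority

    def drain_priority(remaining):
        # Same mapping as above, expressed in SimPy's "lower = more urgent" terms.
        if remaining >= THRESHOLD:
            return LOW
        urgency = 1.0 - remaining / THRESHOLD
        return round(LOW + urgency * (HIGH - LOW))

    def workload(env, disk, io):
        # Normal writes arrive at random and always get the highest IO priority.
        while True:
            yield env.timeout(random.expovariate(1 / 5.0))
            size = random.randint(5, 30)
            with io.request(priority=HIGH) as req:
                yield req
                yield env.timeout(size * 0.01)           # transfer time
                space = disk.capacity - disk.level
                if space > 0:
                    yield disk.put(min(size, space))

    def drainer(env, disk, io):
        # Relocate data off the disk; urgency grows as free space shrinks.
        while True:
            yield env.timeout(1)                         # re-evaluate periodically
            if disk.level == 0:
                continue
            prio = drain_priority(disk.capacity - disk.level)
            chunk = min(20, disk.level)
            with io.request(priority=prio) as req:
                yield req
                yield env.timeout(chunk * 0.01)
                yield disk.get(chunk)

    env = simpy.Environment()
    disk = simpy.Container(env, capacity=CAPACITY, init=400)
    io = simpy.PriorityResource(env, capacity=1)         # the disk's shared IO channel
    env.process(workload(env, disk, io))
    env.process(drainer(env, disk, io))
    env.run(until=500)
    print("level:", disk.level, "free:", disk.capacity - disk.level)

From there you can attach your M1/M2 measurements to the workload process (for example, record how long each normal write waited for the IO channel) and compare runs with different thresholds and chunk sizes.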

Related

Scheduling on multiple cores with each list in each processor vs one list that all processes share

I have a question about how scheduling is done. I know that when a system has multiple CPUs, scheduling is usually done on a per-processor basis. Each processor runs its own scheduler, accessing a ready list of only those processes that are running on it.
So what would be the pros and cons when compared to an approach where there is a single ready list that all processors share?
Like what issues are there when assigning processes to processors, and what issues might be caused if a process always lives on one processor? In terms of the mutex locking of data structures and the time spent waiting for the locks, are there any issues with that?
Generally there is one giant problem when it comes to multi-core CPU systems: cache coherency.
What does cache coherency mean?
Access to main memory is hard. Depending on the memory frequency and how far the request has to travel, it can take on the order of a few hundred CPU cycles to access some data in RAM - that's a whole lot of time the CPU is doing no useful work. It'd be significantly better if we minimized this time as much as possible, but the hardware required to do this is expensive, and typically must be in very close proximity to the CPU itself (we're talking within a few millimeters of the core).
This is where the cache comes in. The cache keeps a small subset of main memory in close proximity to the core, allowing accesses to this memory to be several orders of magnitude faster than main memory. For reading this is a simple process - if the memory is in the cache, read from cache, otherwise read from main memory.
Writing is a bit more tricky. Writing to the cache is fast, but now main memory still holds the original value. We can update that memory, but that takes a while, sometimes even longer than reading depending on the memory type and board layout. How do we minimize this as well?
The most common way to do so is with a write-back cache, which, when written to, will flush the data held in the cache back to main memory at some later point. Depending on the CPU architecture, this could happen when the cache line is evicted, when the memory bus is otherwise idle, or interleaved with other CPU work (this is up to the designer/fabricator of the CPU).
Why is this a problem?
In a single core system, there is only one path for reads and writes to take - they must go through the cache on their way to main memory, meaning the programs running on the CPU only see what they expect - if they read a value, modified it, then read it back, it would be changed.
In a multi-core system, however, there are multiple paths for data to take when going back to main memory, depending on the CPU that issued the read or write. This presents a problem with write-back caching, since that "later time" introduces a gap in which one CPU might read memory that hasn't yet been updated.
Imagine a dual core system. A job starts on CPU 0 and reads a memory block. Since the memory block isn't in CPU 0's cache, it's read from main memory. Later, the job writes to that memory. Since the cache is write-back, that write will be made to CPU 0's cache and flushed back to main memory later. If CPU 1 then attempts to read that same memory, CPU 1 will attempt to read from main memory again, since it isn't in the cache of CPU 1. But the modification from CPU 0 hasn't left CPU 0's cache yet, so the data you get back is not valid - your modification hasn't gone through yet. Your program could now break in subtle, unpredictable, and potentially devastating ways.
Because of this, cache synchronization is done to alleviate the problem. Bus snooping, directory-based address monitoring, and other hardware mechanisms exist to synchronize the caches between multiple CPUs. All of these methods have one common problem - they all force the CPU to take time doing bookkeeping rather than actual, useful computations.
The best method of avoiding this is actually keeping processes on one processor as much as possible. If the process doesn't migrate between CPUs, you don't need to keep the caches synchronized, as the other CPUs won't be accessing that memory at the same time (unless the memory is shared between multiple processes, but we'll not go into that here).
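As a concrete illustration of "keep the process on one processor": most operating systems expose a processor-affinity knob for exactly this reason. A small Linux-only Python sketch (pinning to CPU 0 is just an example):

    import os

    # Pin this process to CPU 0 so the scheduler will not migrate it between
    # cores, keeping its working set hot in one core's cache (Linux only).
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {0})               # pid 0 means "this process"
        print("allowed CPUs:", os.sched_getaffinity(0))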
Now we come to the issue of how to design our scheduler, and the three main problems there - avoiding process migration, maximizing CPU utilization, and scalability.
Single Queue Multiprocessor scheduling (SQMS)
Single Queue Multiprocessor schedulers are what you suggested - one queue containing available processes, and each core accesses the queue to get the next job to run. This is fairly simple to implement, but has a couple of major drawbacks - it can cause a whole lot of process migration, and does not scale well to larger systems with more cores.
Imagine a system with four cores and five jobs, each of which takes about the same amount of time to run, and each of which is rescheduled when completed. On the first run through, CPU 0 takes job A, CPU 1 takes B, CPU 2 takes C, and CPU 3 takes D, while E is left on the queue. Let's then say CPU 0 finishes job A, puts it on the back of the shared queue, and looks for another job to do. E is currently at the front of the queue, so CPU 0 takes E and goes on. Now, CPU 1 finishes job B, puts B on the back of the queue, and looks for the next job. It now sees A, and starts running A. But since A was on CPU 0 before, CPU 1 now needs to sync its cache with CPU 0, resulting in lost time for both CPU 0 and CPU 1. In addition, if two CPUs both finish their operations at the same time, they both need to write to the shared list, which has to be done sequentially or the list will get corrupted (just like in multi-threading). This requires that one of the two CPUs wait for the other to finish its writes and sync its cache back to main memory, since the list is in shared memory! This problem gets worse and worse the more CPUs you add, resulting in major problems with large servers (where there can be 16 or even 32 CPU cores), and being completely unusable on supercomputers (some of which have upwards of 1000 cores).
Multi-queue Multiprocessor Scheduling (MQMS)
Multi-queue multiprocessor schedulers have a single queue per CPU core, ensuring that all local core scheduling can be done without having to take a shared lock or synchronize the cache. This allows for systems with hundreds of cores to operate without interfering with one another at every scheduling interval, which can happen hundreds of times a second.
The main issues with MQMS are CPU utilization, where one or more CPU cores end up doing the majority of the work, and scheduling fairness, where one of the processes on the computer is scheduled more often than other processes with the same priority.
CPU utilization is the biggest issue - no CPU should ever be idle if a job is scheduled. However, if all CPUs are busy when a job arrives, it has to be queued on some CPU; if a different CPU later becomes idle, it should "steal" the queued job from the original CPU to ensure every CPU is doing real work. Doing so, however, requires that we lock both CPUs' queues and potentially sync the caches, which may eat into any speedup we gained by stealing the job.
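If it helps to see where the locking pressure sits, here is a toy Python sketch of the two queue layouts (purely illustrative - this is not how any real kernel implements its run queues, and the class and method names are made up):

    import collections
    import threading

    class SQMS:
        """One shared ready queue; every CPU contends on the same lock."""
        def __init__(self):
            self.queue = collections.deque()
            self.lock = threading.Lock()

        def add(self, job):
            with self.lock:
                self.queue.append(job)

        def next_job(self, cpu_id):
            with self.lock:                        # every core serializes here
                return self.queue.popleft() if self.queue else None

    class MQMS:
        """One queue per CPU; a remote lock is only taken when stealing."""
        def __init__(self, n_cpus):
            self.queues = [collections.deque() for _ in range(n_cpus)]
            self.locks = [threading.Lock() for _ in range(n_cpus)]

        def add(self, job, cpu_id):
            with self.locks[cpu_id]:
                self.queues[cpu_id].append(job)

        def next_job(self, cpu_id):
            with self.locks[cpu_id]:               # usually uncontended
                if self.queues[cpu_id]:
                    return self.queues[cpu_id].popleft()
            for victim in range(len(self.queues)):      # local queue empty: steal
                if victim == cpu_id:
                    continue
                with self.locks[victim]:
                    if self.queues[victim]:
                        return self.queues[victim].pop()   # take from the tail
            return None

In the SQMS class every core funnels through one lock (and drags the queue's cache lines between cores); in the MQMS class a core only touches another core's queue, lock, and cache lines when it has run out of local work.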
In conclusion
Both methods exist in the wild - Linux actually has three different mainstream scheduler algorithms, one of which is an SQMS. The choice of scheduler really depends on the way the scheduler is implemented, the hardware you plan to run it on, and the types of jobs you intend to run. If you know you only have two or four cores to run jobs, SQMS is likely perfectly adequate. If you're running a supercomputer where overhead is a major concern, then an MQMS might be the way to go. For a desktop user - just trust the distro, whether that's a Linux OS, Mac, or Windows. Generally, the programmers for the operating system you've got have done their homework on exactly what scheduler will be the best option for the typical use case of their system.
This whitepaper describes the differences between the two types of scheduling algorithms in more detail.

Write time to hard drive

I realize this number will change based on many factors, but in general, when I write data to a hard-drive (e.g. copy a file), how long does it take for that data to actually be written to the platter after Windows says the copy is done?
Could anyone point me in the right direction to discover more on this topic?
If you are looking for a hard number, that is pretty much unknowable. Generally it is on the order of tens to a few hundred milliseconds for the data to start reaching the disk platters, but it can be as high as several seconds in a large server disk array with RAID and de-duplication.
The flow of events goes something like this.
1) The application calls a function like fwrite().
2) This call is handled by the filesystem layer in your Operating System, which has to figure out what specific disk sectors are to be manipulated.
3) The SATA/IDE driver in your OS will talk to the hard drive controller hardware. On a modern PC, it typically uses DMA to feed the data to the disk.
4) The data sits in a write cache inside the hard disk (RAM).
5) When the physical platters and heads have made it into position, the drive begins to transfer the contents of the cache onto the platters.
6) Steps 3-5 may repeat several times depending on how much data is to be written and where on the disk it is to be written. Additionally, there is usually filesystem metadata that must be updated (e.g. free space counters), which will trigger more writes to the disk.
The time it takes to get through steps 1-3 can be unpredictable in a general purpose OS like Windows due to task scheduling, background threads, and the fact that your disk write is probably queued up behind writes from a few dozen other processes. I'd say it is usually on the order of 10-100 msec on a typical PC. If you go to the Windows Resource Monitor and click the Disk tab, you can get an idea of the average disk queue length. You can use the Performance Monitor to produce more finely-controlled graphs.
Steps 3-4 are largely controlled by the disk controller and disk interface (SATA, SAS, etc). In the server world, you can be talking about a SAN with FC or iSCSI network switches, which impose their own latencies.
Step 5 will be controlled by the physical performance of the disk. Many consumer-grade HDD manufacturers do not post average seek times anymore, but 10-20 msec is common.
Interesting detail about Step 5: Some HDDs lie about flushing their write cache to get better benchmark scores.
Step 6 will depend on your filesystem and how much data you are writing.
You are right that there can be a delay between Windows indicating that data writing is finished and the last of the data actually being written. Things to consider are:
Device Manager, Disk Drive, Properties, Policies - Options for disabling Write Caching.
You might be better off using Direct I/O so that Windows does not save it temporarily in File Cache.
If your program writes the data, you can log what has been copied.
If you are sending the data over a network, you are likely to have no control of when the remote system has finished.
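If you are the one writing the data (the "your program writes the data" point above), the usual way to shrink that window is to flush and fsync explicitly. A minimal Python sketch (the file name and payload are just examples):

    import os

    def write_through(path, data):
        # Push the data through the OS file cache toward the device.
        with open(path, "wb") as f:
            f.write(data)
            f.flush()                 # Python's user-space buffer -> OS file cache
            os.fsync(f.fileno())      # OS file cache -> the drive
        # Caveat: a drive with a volatile write cache enabled may still hold the
        # data briefly after fsync() returns unless write caching is disabled.

    write_through("copy_test.bin", b"example payload")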
To see what is happening, you can set up Perfmon logging. One of my examples of monitoring:
http://www.roylongbottom.org.uk/monitor1.htm#anchor2

How does memory use affect battery life?

How does memory allocation affect battery usage? Does holding lots of data in variables consume more power than performing many iterations of basic calculations?
P.S. I'm working on a scientific app for Mac and want to optimize it for battery consumption.
The amount of data you hold in memory doesn't influence battery life, as all of the memory has to be refreshed all the time whether you store something in it or not (the memory controller doesn't know which parts are "unused", AFAIK).
By contrast, calculations do require power. Especially if they might wake up the CPU from an idle or low power state.
I believe RAM power consumption is identical regardless of whether it's full or empty. However, the more physical RAM you have in the machine, the more power it will consume.
On a Mac, you will want to avoid hitting the hard drive, so try to make sure you don't read the disk very often and definitely don't consume so much RAM that you start using virtual memory (or push other apps into virtual memory).
Most modern Macs will also partially power down the CPU(s) when they aren't very busy, so reducing CPU usage will actually reduce power consumption.
On the other hand, when your app uses more memory it pushes other apps' cached data out of memory, and re-doing that work can have some battery cost if the user decides to switch from one app to the other, but I think that will be negligible.
It's best to minimize your application's memory footprint once it transitions to the background, simply to allow more applications to hang around and not be terminated. Also, applications are terminated in descending order of memory size, so if your application is the largest one sitting in the background, it will be killed first.

Off-chip memcpy?

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking: why doesn't Intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is, every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.
That's a big issue that includes network stack efficiency, but I'll stick to your specific question about the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using "rep movs".
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different than the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of OS time quantum and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy instruction any time, without regard for what other processes have done or will do. The core must have some limit to the number of asynch memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The asynch copy lacks any kind of determinism and developers would simply fall back to the old-fashioned synchronous copy.
3) Cache locality has a first order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient since at least the destination buffer remains local to the core's L1. In the case of network copy, the copy from kernel to a user buffer occurs just before handing the user buffer to the application. So, the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The asynch memcpy instruction must return some sort of token that identifies the copy for use later to ask if the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kinds of operations are better handled by software than core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted at in srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero-copy architecture where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers by the network hardware. I've seen this done in embedded/real-time network stacks.
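For what it's worth, a little of this is reachable from user space in ordinary programs. A hedged Python sketch (it assumes you already have a connected TCP socket to pass in): socket.sendfile() lets the kernel hand a file's pages to the network stack without bouncing them through your own buffers where the platform supports it, and memoryview slices let you split buffers in-process without copying:

    import socket

    def send_file_without_copies(sock, path):
        # Kernel-side copy avoidance: uses os.sendfile() under the hood on
        # Linux/macOS and quietly falls back to ordinary send() elsewhere.
        with open(path, "rb") as f:
            return sock.sendfile(f)

    # User-space copy avoidance: memoryview slices share the underlying buffer,
    # so splitting a packet into header and payload copies no bytes at all.
    buf = bytearray(1 << 16)
    view = memoryview(buf)
    header, payload = view[:64], view[64:]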
Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?

Direct memory access DMA - how does it work?

I read that if DMA is available, then the processor can route long read or write requests for disk blocks to the DMA controller and concentrate on other work. But the DMA-to-memory data/control channel is busy during this transfer. What else can the processor do during this time?
First of all, DMA (per se) is almost entirely obsolete. As originally defined, DMA controllers depended on the fact that the bus had separate lines to assert for memory read/write, and I/O read/write. The DMA controller took advantage of that by asserting both a memory read and I/O write (or vice versa) at the same time. The DMA controller then generated successive addresses on the bus, and data was read from memory and written to an output port (or vice versa) each bus cycle.
The PCI bus, however, does not have separate lines for memory read/write and I/O read/write. Instead, it encodes one (and only one) command for any given transaction. Instead of using DMA, PCI normally does bus-mastering transfers. This means instead of a DMA controller that transfers memory between the I/O device and memory, the I/O device itself transfers data directly to or from memory.
As for what else the CPU can do at the time, it all depends. Back when DMA was common, the answer was usually "not much" -- for example, under early versions of Windows, reading or writing a floppy disk (which did use the DMA controller) pretty much locked up the system for the duration.
Nowadays, however, the memory typically has considerably greater bandwidth than the I/O bus, so even while a peripheral is reading or writing memory, there's usually a fair amount of bandwidth left over for the CPU to use. In addition, a modern CPU typically has a fairly large cache, so it can often execute quite a few instructions without using main memory at all.
Well, the key point to note is that the CPU bus is only partly used by the DMA transfer, and the rest of its bandwidth is free for other jobs/processes to use. This is the key advantage of DMA over programmed I/O. Hope this answered your question :-)
But, DMA to memory data/control channel is busy during this transfer.
Being busy doesn't mean you're saturated and unable to do other concurrent transfers. It's true the memory may be a bit less responsive than normal, but CPUs can still do useful work, and there are other things they can do unimpeded: crunch data that's already in their cache, receive hardware interrupts, etc. And it's not just about the quantity of data, but the rate at which it's generated: some devices create data in hard real-time and need it consumed promptly, otherwise it's overwritten and lost. To handle this without DMA, the software may have to nail itself to a CPU core and then spin waiting and reading - avoiding being swapped onto some other task for an entire scheduler time slice - even though most of the time further data's not even ready.
During a classic DMA transfer, the CPU gives up control of the memory bus: its bus lines are put into a high-impedance (tri-state) mode while the DMA controller drives the bus. The CPU is not necessarily idle - as noted above, it can keep executing from its cache - but it cannot use the memory bus during those cycles.