The drawback of the FCFS scheduling algorithm is that if a process P1 with a large burst time arrives before processes P2, P3, ... with much smaller burst times, the average waiting time and average completion time become quite high.
A solution to this problem is the Shortest Job First (SJF) algorithm.
But how is the burst time computed in advance? Does the developer specify a formula by which, according to the resources available, the burst time of a job is computed ahead of time?
Estimating the burst time of a process is a very large topic in itself.
In general, the scheduler estimates the length of the next CPU burst based on the lengths of recent CPU bursts: basically, we guess the next CPU burst time by assuming it will be related to the past CPU bursts of that process.
A quick Google search led me to this article, which will give you a basic idea, and here is a more detailed article.
This can be done using an exponential-average estimation formula:
Estimated CPU burst time for the (n+1)th burst = alpha * (actual CPU burst time for the nth burst) + (1 - alpha) * (estimated CPU burst time for the nth burst)
where:
alpha is a constant with 0 <= alpha <= 1;
the actual CPU burst time for the nth burst is the most recent measured CPU burst of the process/job;
the estimated CPU burst time for the nth burst carries the history of the process/job, i.e. how we have estimated its CPU bursts so far.
For the very first run (alpha = 1) we have to execute the process/job once; this gives us an actual CPU burst time to start from. After that, we can estimate each upcoming CPU burst, with alpha controlling how much weight recent bursts get relative to the history.
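As a rough sketch (my own illustration, with made-up burst lengths and alpha = 0.5, not code from any scheduler), the recurrence above can be coded directly in C:

#include <stdio.h>

/* Exponential-average estimate of the next CPU burst:
 *   tau(n+1) = alpha * t(n) + (1 - alpha) * tau(n)
 * where t(n) is the actual length of the nth burst and tau(n) is the
 * previous estimate.
 */
static double next_estimate(double actual_burst, double prev_estimate, double alpha)
{
    return alpha * actual_burst + (1.0 - alpha) * prev_estimate;
}

int main(void)
{
    /* Hypothetical measured burst lengths (ms) and an initial guess of 10 ms. */
    double bursts[] = { 6.0, 4.0, 6.0, 4.0, 13.0, 13.0, 13.0 };
    double tau = 10.0;     /* current estimate for the upcoming burst */
    double alpha = 0.5;

    for (int i = 0; i < (int)(sizeof bursts / sizeof bursts[0]); i++) {
        printf("burst %d: predicted = %5.2f ms, actual = %5.2f ms\n",
               i + 1, tau, bursts[i]);
        tau = next_estimate(bursts[i], tau, alpha);
    }
    printf("estimate for the next burst: %.2f ms\n", tau);
    return 0;
}

With alpha closer to 1 the estimate tracks the most recent burst almost exactly; with alpha closer to 0 it changes slowly and mostly reflects the history.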
Can someone explain the meaning of time complexity in distributed networking algorithms? The definition given in the DNA book by Panduranga is as follows:
"In the synchronous model, time is measured by the number of clock ticks called rounds, i.e., processors are said to compute in “lock step”. When running a distributed algorithm, different nodes might take a different number of rounds to finish. In that case, the maximum time needed over all nodes is taken as the time complexity"
Can you explain the above definition with a simple example?
Let's say you want to compute the sum of a really large list (say, 1 billion numbers). To speed things up, you use 4 threads, each computing the sum of 250 million numbers, and the partial sums can then be added together to get the total. If the time taken for each thread to run is:
thread1 takes 43 seconds
thread2 takes 39 seconds
thread3 takes 40 seconds
thread4 takes 41 seconds
Then you would say that the runtime of this operation is bounded by the thread that takes the longest, in this case 43 seconds. It doesn't matter if the other threads take 2 seconds; the longest task determines the runtime of your algorithm.
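As a rough sketch of the same idea (my own illustration, assuming POSIX threads; the list is scaled down to 10 million numbers so it runs quickly), each thread times its own partial sum and the parallel phase is bounded by the slowest one:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_THREADS 4
#define N_ITEMS   10000000L   /* scaled down from 1 billion for the example */

static int *data;

struct chunk {
    long start, end;     /* half-open range [start, end) summed by this thread */
    long long sum;       /* partial sum */
    double seconds;      /* wall time taken by this thread */
};

static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long long s = 0;
    for (long i = c->start; i < c->end; i++)
        s += data[i];
    c->sum = s;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    c->seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return NULL;
}

int main(void)   /* build with: cc -O2 -pthread sum.c */
{
    data = malloc(N_ITEMS * sizeof *data);
    if (!data)
        return 1;
    for (long i = 0; i < N_ITEMS; i++)
        data[i] = 1;

    pthread_t tid[N_THREADS];
    struct chunk chunks[N_THREADS];
    long per_thread = N_ITEMS / N_THREADS;

    for (int t = 0; t < N_THREADS; t++) {
        chunks[t].start = t * per_thread;
        chunks[t].end   = (t == N_THREADS - 1) ? N_ITEMS : (t + 1) * per_thread;
        pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]);
    }

    long long total = 0;
    double slowest = 0.0;
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunks[t].sum;
        if (chunks[t].seconds > slowest)
            slowest = chunks[t].seconds;
        printf("thread%d took %.3f seconds\n", t + 1, chunks[t].seconds);
    }

    /* The parallel phase is bounded by the slowest thread, as described above. */
    printf("total = %lld, runtime bounded by slowest thread: %.3f s\n", total, slowest);
    free(data);
    return 0;
}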
Two questions:
According to Nsight Compute, my kernel is compute bound. The SM utilization relative to peak performance is 74% and the memory utilization is 47%. However, when I look at each pipeline's utilization percentage, LSU utilization is way higher than the others (75% vs 10-15%). Wouldn't that be an indication that my kernel is memory bound? If the utilization of compute and memory resources doesn't correspond to pipeline utilization, I don't know how to interpret those terms.
The schedulers are only issuing every 4 cycles; wouldn't that mean that my kernel is latency bound? People usually define that in terms of utilization of compute and memory resources. What is the relationship between the two?
In Nsight Compute on CC 7.5 GPUs:
SM% is defined by sm__throughput, and
Memory% is defined by gpu__compute_memory_throughput.
sm__throughput is the MAX of the following metrics:
sm__instruction_throughput
    sm__inst_executed
    sm__issue_active
    sm__mio_inst_issued
    sm__pipe_alu_cycles_active
    sm__inst_executed_pipe_cbu_pred_on_any
    sm__pipe_fp64_cycles_active
    sm__pipe_tensor_cycles_active
    sm__inst_executed_pipe_xu
    sm__pipe_fma_cycles_active
    sm__inst_executed_pipe_fp16
    sm__pipe_shared_cycles_active
    sm__inst_executed_pipe_uniform
    sm__instruction_throughput_internal_activity
sm__memory_throughput
    idc__request_cycles_active
    sm__inst_executed_pipe_adu
    sm__inst_executed_pipe_ipa
    sm__inst_executed_pipe_lsu
    sm__inst_executed_pipe_tex
    sm__mio_pq_read_cycles_active
    sm__mio_pq_write_cycles_active
    sm__mio2rf_writeback_active
    sm__memory_throughput_internal_activity
gpu__compute_memory_throughput is the MAX of the following metrics:
gpu__compute_memory_access_throughput
    l1tex__data_bank_reads
    l1tex__data_bank_writes
    l1tex__data_pipe_lsu_wavefronts
    l1tex__data_pipe_tex_wavefronts
    l1tex__f_wavefronts
    lts__d_atomic_input_cycles_active
    lts__d_sectors
    lts__t_sectors
    lts__t_tag_requests
    gpu__compute_memory_access_throughput_internal_activity
gpu__compute_memory_request_throughput
    l1tex__lsuin_requests
    l1tex__texin_sm2tex_req_cycles_active
    l1tex__lsu_writeback_active
    l1tex__tex_writeback_active
    l1tex__m_l1tex2xbar_req_cycles_active
    l1tex__m_xbar2l1tex_read_sectors
    lts__lts2xbar_cycles_active
    lts__xbar2lts_cycles_active
    lts__d_sectors_fill_device
    lts__d_sectors_fill_sysmem
    gpu__dram_throughput
    gpu__compute_memory_request_throughput_internal_activity
In your case the limiter is sm__inst_executed_pipe_lsu, which is an instruction throughput. If you review sections/SpeedOfLight.py, latency bound is defined as having both sm__throughput and gpu__compute_memory_throughput < 60%.
Some instruction pipelines, such as fp64, xu, and lsu, have lower throughput than others (this varies with the chip). Pipeline utilization is part of sm__throughput. In order to improve performance, the options are:
Reduce instructions to the oversubscribed pipeline, or
Issue instructions of a different type to use the empty issue cycles.
GENERATING THE BREAKDOWN
As of Nsight Compute 2020.1 there is not a simple command line to generate the list without running a profiling session. For now you can collect one throughput metric using breakdown:<throughput metric>.avg.pct_of_peak_sustained_elapsed and parse the output to get the sub-metric names.
For example:
ncu.exe --csv --metrics breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed --details-all -c 1 cuda_application.exe
generates:
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
...
The keyword breakdown can be used in Nsight Compute section files to expand a throughput metric. This is used in the SpeedOfLight.section.
I want to know how each batch is processed on the GPU, so I output the start and end times for each batch. But the processing time for the first batch is much longer than for the others.
Here is my code to get the processing time for each batch:
Here is the processing time for the first four batches on the GPU:
For comparison, I also output the processing time on the CPU:
We can see that although the total time on the CPU is longer than on the GPU, the processing time for the first batch is shorter.
From my reading, there is a timer interrupt triggered by the hardware that fires fairly often and transfers control from the running process back to the kernel/scheduler, which can then determine whether the running process has exceeded its time quantum and, if so, run another task.
This seems imprecise.
For example:
If a timer interrupt fires every 1 unit,
and the scheduler determines a CPU-bound process's time quantum to be 1.5 units, the process would actually get 2 units of CPU time.
Or does the scheduler only hand out time quanta to processes in whole multiples of the timer interrupt?
Linux's scheduler (CFS) allocates time slices to threads by first defining a time period in which every thread will run once. This time period is computed by the sched_slice() function and depends on the number of threads on the CPU, and 2 variables that can be set from user space (sysctl_sched_latency and sysctl_sched_min_granularity):
If the number of threads is greater than sysctl_sched_latency / sysctl_sched_min_granularity, then the period will be nr_threads * sysctl_sched_min_granularity; otherwise, the period will be sysctl_sched_latency.
For example, on my laptop, I have the following values:
% cat /proc/sys/kernel/sched_latency_ns
18000000
% cat /proc/sys/kernel/sched_min_granularity_ns
2250000
Therefore, sysctl_sched_latency / sysctl_sched_min_granularity = 8. Now, if I have 8 or fewer threads on a CPU, the period will be 18,000,000 nanoseconds (i.e. 18 milliseconds), shared between the threads; with more than 8 threads, each will be allocated 2,250,000 ns (2.25 ms).
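Here is a minimal sketch in C (my own code, not the kernel's) of the period computation described above, using the values read from /proc on my laptop; the per-thread slice shown assumes all threads have the same weight:

#include <stdio.h>

/* Simplified version of CFS's __sched_period(): if there are more runnable
 * threads than sched_latency / sched_min_granularity, stretch the period so
 * every thread still gets at least the minimum granularity.
 * The two constants below are the values read from /proc above.
 */
static const long long sched_latency_ns         = 18000000;  /* 18 ms   */
static const long long sched_min_granularity_ns =  2250000;  /* 2.25 ms */

static long long sched_period_ns(int nr_threads)
{
    long long nr_latency = sched_latency_ns / sched_min_granularity_ns;  /* 8 */
    if (nr_threads > nr_latency)
        return nr_threads * sched_min_granularity_ns;
    return sched_latency_ns;
}

int main(void)
{
    for (int n = 1; n <= 12; n++) {
        long long period = sched_period_ns(n);
        printf("%2d threads: period = %5.2f ms, slice ~ %5.2f ms each\n",
               n, period / 1e6, period / 1e6 / n);
    }
    return 0;
}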
Now, with those values in mind, if we look at the tick frequency (defined at compile time of the kernel) with this command:
% zcat /proc/config.gz | grep CONFIG_HZ
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300
So, on my laptop, I have 300 ticks per second, which means a tick every ~3.33 ms. This means that in my case, with more than 8 threads on a CPU, I lose a little bit of precision in my time slices (a thread that should run for 2.25 ms will run for about 3.33 ms), but I could fix it by recompiling my kernel with more frequent ticks.
However, it should be noted that this is actually not a problem because, as indicated by its name, CFS (Completely Fair Scheduler) aims at being fair, which will be the case here.
I have seen many posts about using the clock() function to determine the amount of elapsed time in a program with code looking something like:
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start_time = clock();
    /* ... code to be timed ... */
    clock_t end_time = clock();
    double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
    printf("elapsed: %f seconds\n", elapsed_time);
    return 0;
}
The value of CLOCKS_PER_SEC is almost surely not the actual number of clock ticks per second, so I am a bit wary of the result. Without worrying about threading and I/O, is the output of the clock() function scaled in some way so that this division produces the correct wall-clock time?
The answer to your question is yes.
clock() in this case refers to a wallclock rather than a CPU clock so it could be misleading at first glance. For all the machines and compilers I've seen, it returns the time in milliseconds since I've never seen a case where CLOCKS_PER_SEC isn't 1000. So the precision of clock() is limited to milliseconds and the accuracy is usually slightly less.
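If you want to check what CLOCKS_PER_SEC is on your own platform, a couple of lines will print it:

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* CLOCKS_PER_SEC is a compile-time constant provided by <time.h>. */
    printf("CLOCKS_PER_SEC = %ld\n", (long)CLOCKS_PER_SEC);
    return 0;
}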
If you're interested in the actual cycles, this can be hard to obtain.
The rdtsc instruction will let you access the number of "pseudo"-cycles since the CPU was booted. On older systems (like Intel Core 2), this counter usually ticks at the actual CPU frequency. But on newer systems, it doesn't.
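For example, with GCC or Clang on x86 you can read this counter through the __rdtsc() intrinsic (a minimal sketch; converting the count to seconds requires knowing the TSC rate of your machine):

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() with GCC/Clang; MSVC has it in <intrin.h> */

int main(void)
{
    unsigned long long start = __rdtsc();

    /* Placeholder for the work being measured. */
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;

    unsigned long long end = __rdtsc();
    printf("elapsed: %llu TSC \"pseudo\"-cycles\n", end - start);
    return 0;
}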
To get a more accurate timer than clock(), you will need to use the hardware performance counters, which are specific to the OS. These are internally implemented using the rdtsc instruction mentioned above.