Calculating total number of program instructions? - process

So if a program takes 5.7 seconds to execute on a processor with a clock frequency of 1.8 GHz, where each instruction takes 7 clock cycles, what's the total number of instructions in the program?
I thought I could calculate it like this:
Total number of clock cycles = 5.7 seconds * 1.8 GHz = 10,260,000,000 cycles.
Then divide the total number of cycles by the number of cycles per instruction: 10,260,000,000 / 7, which gives 1,466,428,571.
But apparently this is wrong? It's part of a quiz and I got this question wrong. I wonder why that is?
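For what it's worth, the method itself (cycles = time x frequency, instructions = cycles / CPI) can be checked with a few lines of Python. This is only a minimal sketch; the variable names are made up for illustration, and integer arithmetic is used to avoid floating-point noise.

```python
# Values taken from the question above.
exec_time_ms = 5_700              # 5.7 seconds, expressed in milliseconds
clock_hz = 1_800_000_000          # 1.8 GHz
cycles_per_instruction = 7

# Total clock cycles spent while the program runs.
total_cycles = exec_time_ms * clock_hz // 1000

# Number of instructions, assuming every instruction costs exactly 7 cycles.
instructions = total_cycles / cycles_per_instruction

print(f"{total_cycles:,} cycles")
print(f"{instructions:,.0f} instructions")
```

Run as written, the division comes out at roughly 1,465,714,286 instructions, which is not the same as the 1,466,428,571 quoted above, so the last step of the calculation is worth redoing.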

Related

How can I check my CPI calculation is correct?

I got a CPI of 20.15 on the exercise below, but this value seems too high to me. How can I check if this is correct?
A processor with a 7.5 GHz clock frequency runs a program with 8000 million instructions (8*10^9) in 21.5 seconds.
What's the average CPI, assuming that the program above is representative of the average kind of program that runs on this computer?
I've tried it in many different ways but I keep getting a CPI of 20.15. Is this correct?
Instructions per second = (8000 ÷ 21.5) million
Clock frequency = 7.5 GHz
So, CPI = (7.5 × 1000) ÷ (8000 ÷ 21.5) = 20.15.
So, yes 20.15 is correct.
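The same check can be written as CPI = total cycles / instruction count. A minimal sketch, with illustrative variable names and the figures from the exercise:

```python
# Figures from the exercise.
clock_hz = 7_500_000_000          # 7.5 GHz
exec_time_tenths_s = 215          # 21.5 seconds, in tenths of a second
instruction_count = 8_000_000_000 # 8 * 10^9 instructions

# Cycles elapsed during the run (integer arithmetic keeps this exact).
total_cycles = clock_hz * exec_time_tenths_s // 10

# Average cycles per instruction.
cpi = total_cycles / instruction_count

print(cpi)
```

This gives 20.15625, which agrees with the 20.15 above up to rounding.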

How does the kernel scheduler maintain time quanta precision with timer interrupts?

From my reading, there's a timer interrupt raised by the hardware that fires quite often and transfers control from the running process back to the kernel/scheduler, which can then determine whether the running process has exceeded its time quantum and, if so, run another task.
This seems imprecise.
For example:
If a timer interrupt fired every 1 unit
and the scheduler algorithm determined a CPU-bound process's time quantum to be 1.5 units, it would actually get 2 units of CPU time.
Or does the scheduler only give time quanta to processes in units of timer interrupts?
Linux's scheduler (CFS) allocates time slices to threads by first defining a time period in which every thread will run once. This time period is computed by the sched_slice() function and depends on the number of threads on the CPU, and 2 variables that can be set from user space (sysctl_sched_latency and sysctl_sched_min_granularity):
If the number of threads is greater than sysctl_sched_latency / sysctl_sched_min_granularity; then the period will be nr_threads * sysctl_sched_min_granularity; else the period will be sysctl_sched_latency.
For example, on my laptop, I have the following values:
% cat /proc/sys/kernel/sched_latency_ns
18000000
% cat /proc/sys/kernel/sched_min_granularity_ns
2250000
Therefore, sysctl_sched_latency / sysctl_sched_min_granularity = 8. Now, if I have 8 or fewer threads on a CPU, the period is 18,000,000 nanoseconds (i.e. 18 milliseconds), shared among those threads; otherwise, each thread is allocated 2,250,000 ns (2.25 ms).
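The rule for the period can be sketched in Python as follows. This is not kernel code, just an illustration of the rule stated above; the function name and default arguments are hypothetical, with the defaults taken from the values shown for this laptop.

```python
def sched_period_ns(nr_threads,
                    latency_ns=18_000_000,
                    min_granularity_ns=2_250_000):
    """Period in which every runnable thread gets to run once (sketch)."""
    # Number of threads that fit in one latency window (8 with these values).
    nr_latency = latency_ns // min_granularity_ns
    if nr_threads > nr_latency:
        # Too many threads: stretch the period so each still gets the minimum.
        return nr_threads * min_granularity_ns
    return latency_ns


print(sched_period_ns(4))    # few threads: the period stays at the latency value
print(sched_period_ns(10))   # many threads: the period grows with the thread count
```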
Now, with those values in mind, if we look at the tick frequency (defined at compile time of the kernel) with this command:
% zcat /proc/config.gz | grep CONFIG_HZ
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300
So, on my laptop, I have 300 ticks per second, which means a tick roughly every 3.33 ms. In my case, with more than 8 threads on a CPU, I will lose a little precision in my time slices (a thread that should run for 2.25 ms will run for about 3.33 ms), but I could fix that by recompiling my kernel with more frequent ticks.
However, it should be noted that this is actually not a problem because, as indicated by its name, CFS (Completely Fair Scheduler) aims at being fair, which will be the case here.
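To put numbers on the loss of precision, here is a small sketch. It assumes, as a simplification, that a thread can only be preempted on a tick boundary, so a requested slice is rounded up to a whole number of ticks; the function name is hypothetical.

```python
def effective_slice_ns(slice_ns, hz):
    """Round a requested slice up to a whole number of ticks (simplified model)."""
    ns_per_s = 1_000_000_000
    # Ceiling division: how many ticks are needed to cover the slice.
    ticks = -((-slice_ns * hz) // ns_per_s)
    # Length of that many ticks, in nanoseconds.
    return ticks * ns_per_s / hz


# A 2.25 ms slice with CONFIG_HZ=300 (one tick lasts 1/300 s, about 3.33 ms).
print(effective_slice_ns(2_250_000, 300))
```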

I/O Disk Drive Calculations

So I am studying for an upcoming exam, and one of the questions involves calculating various disk drive properties. I have spent a fair while researching sample questions and formulas, but because I'm a bit unsure about what I have come up with, I was wondering whether you could help confirm my formulas / answers?
Information Provided:
Rotation Speed = 6000 RPM
Surfaces = 6
Sector Size = 512 bytes
Sectors / Track = 500 (average)
Tracks / Surface = 1,000
Average Seek Time = 8ms
One Track Seek Time = 0.4 ms
Maximum Seek Time = 10ms
Questions:
Calculate the following
(i) The capacity of the disk
(ii) The maximum transfer rate for a single track
(iii) Calculate the amount of cylinder skew needed (in sectors)
(iv) The Maximum transfer rate (in bytes) across cylinders (with cylinder skew)
My Answers:
(i) Sector Size x Sectors per Track x Tracks per Surface x No. of surfaces
512 x 500 x 1000 x 6 = 1,536,000,000 bytes
(ii) Sectors per Track x Sector Size x Rotation Speed per sec
500 x 512 x (6000/60) = 25,600,000 bytes per sec
(iii) (Track to Track seek time / Time for 1 Rotation) x Sectors per Track + 4
(0.4 ms / 10 ms) x 500 + 4 = 24
(iv) Really unsure about this one to be honest, any tips or help would be much appreciated.
I'm fairly sure a similar question will appear on my paper, so it really would be a great help if any of you guys could confirm my formulas and derived answers for this sample question. Also, if anyone could provide a bit of help on that last question, it would be great.
Thanks.
(iv) The Maximum transfer rate (in bytes) across cylinders (with cylinder skew)
500 sectors/track (1 rotation = 500 sectors) x 512 bytes/sector x 6 (reading across all 6 heads at most)
1 rotation yields 1,536,000 bytes across 6 heads.
You are doing 6000 RPM, so that is 6000/60 or 100 rotations per second,
so 153,600,000 bytes per second (divided by 1 million, that is 153.6 megabytes per second).
It takes 1/100th of a second, or 10 ms, to read a track;
then you need a 0.4 ms shift of the heads to read the next track.
10.0/10.4 gives you a 96.2 percent effective read rate, moving the heads perfectly.
You would be able to read at 96.2% of the 153.6, or about 147.7 MB/s, optimally after the first seek,
where 1 MB = 1,000,000 bytes.
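The arithmetic for all four parts can be reproduced with a short Python sketch. The constant names are made up for illustration, the "+ 4" margin is simply the one used in the question's skew formula, and part (iv) follows this answer's assumption that all 6 heads are read in parallel.

```python
# Figures from the question (times in microseconds to keep the math in integers).
SECTOR_BYTES = 512
SECTORS_PER_TRACK = 500
TRACKS_PER_SURFACE = 1_000
SURFACES = 6
RPM = 6_000
TRACK_SEEK_US = 400                               # 0.4 ms one-track seek

rotations_per_s = RPM // 60                       # 100 rotations per second
rotation_us = 1_000_000 // rotations_per_s        # 10,000 us = 10 ms per rotation

# (i) capacity in bytes
capacity = SECTOR_BYTES * SECTORS_PER_TRACK * TRACKS_PER_SURFACE * SURFACES
# (ii) maximum transfer rate for a single track, bytes per second
track_rate = SECTOR_BYTES * SECTORS_PER_TRACK * rotations_per_s
# (iii) cylinder skew in sectors, using the question's formula
skew = TRACK_SEEK_US * SECTORS_PER_TRACK // rotation_us + 4
# (iv) rate across cylinders: one rotation of reading, then one track seek
cylinder_rate = track_rate * SURFACES
effective_rate = cylinder_rate * rotation_us / (rotation_us + TRACK_SEEK_US)

print(capacity, track_rate, skew, round(effective_rate))
```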

Total execution time of a program with conditional branches in a five-stage pipeline

A CPU has a five-stage pipeline and runs at 1 GHz frequency. Instruction fetch
happens in the first stage of the pipeline. A conditional branch instruction
computes the target address and evaluates the condition in the third stage of the
pipeline. The processor stops fetching new instructions following a conditional
branch until the branch outcome is known. A program executes 10^9 instructions
out of which 20% are conditional branches. If each instruction takes one cycle to
complete on average, the total execution time of the program is:
(A) 1.0 second
(B) 1.2 seconds
(C) 1.4 seconds
(D) 1.6 seconds
Total_execution_time = (1 + stall_cycles * stall_frequency) * base_exec_time
base_exec_time = 1 s [at 1 GHz a cycle takes 1 ns, so 10^9 instructions at one cycle each take 1 s]
stall_frequency = 20% = 0.20
stall_cycles = 2
[the branch outcome is known in the 3rd stage of the pipeline, so each branch causes 2 stall cycles]
Therefore Total_execution_time = (1 + 2 * 0.20) * 1 = 1.4 seconds, i.e. option (C).
I don't know how to explain it better but hope it helps a bit :)
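If it helps, the same calculation can be done by counting cycles. A minimal sketch with illustrative variable names, assuming 2 stall cycles per conditional branch as argued above:

```python
# Figures from the question.
clock_hz = 10**9                  # 1 GHz
instructions = 10**9
branch_percent = 20               # 20% of instructions are conditional branches
stall_cycles_per_branch = 2       # outcome known in stage 3, so 2 fetch slots are lost

# One cycle per instruction, plus the stalls caused by branches.
branches = instructions * branch_percent // 100
total_cycles = instructions + branches * stall_cycles_per_branch

# Convert cycles to seconds.
exec_time_s = total_cycles / clock_hz

print(exec_time_s)
```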

Using `overlap`, `kernel time` and `utilization` to optimize one's kernels

My kernel achieves 100% utilization, but the kernel time is only 3% and there is no time overlap between memory copies and kernels.
The combination of high utilization and low kernel time in particular doesn't make sense to me.
So how should I proceed in optimizing my kernel?
I already made sure that I only have coalesced and pinned memory accesses, as the profiler recommended.
`Quadro FX 580 utilization = 100.00% (62117.00/62117.00)`
Kernel time = 3.05 % of total GPU time
Memory copy time = 0.9 % of total GPU time
Kernel taking maximum time = Pinned (0.7% of total GPU time)
Memory copy taking maximum time = memcpyHtoD (0.5% of total GPU time)
There is no time overlap between memory copies and kernels on GPU
Furthermore, I have no warp serialization, no divergent branches, and no occupancy limiting factor.
Kernel details: Grid size: [4 1 1], Block size: [256 1 1]
Register Ratio: 0.9375 ( 7680 / 8192 ) [10 registers per thread]
Shared Memory Ratio: 0.09375 ( 1536 / 16384 ) [60 bytes per Block]
Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)
Active threads per SM: 768 (Maximum Active threads per SM: 768)
Potential Occupancy: 1 ( 24 / 24 )
Achieved occupancy: 0.333333 (on 4 SMs)
Occupancy limiting factor: None
p.s. I don't claim that I wrote wundercode, but I just don't know how to proceed from here.
It seems the grid size of your kernel is too small to make full use of the SMs.
Why not decrease the block size and increase the grid size?
I think that will help.
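To see where the 0.333 achieved occupancy comes from, here is a rough Python sketch. It assumes blocks are spread evenly over the SMs and ignores register and shared-memory limits; the function name is hypothetical, and the default limits (4 SMs, 768 threads per SM, 8 blocks per SM) are the ones shown in the profiler output above.

```python
import math


def estimated_occupancy(grid_blocks, block_threads, num_sms=4,
                        max_threads_per_sm=768, max_blocks_per_sm=8):
    """Rough achieved-occupancy estimate for one kernel launch (sketch)."""
    # Blocks each SM receives if the grid is spread evenly.
    blocks_per_sm = math.ceil(grid_blocks / num_sms)
    # Blocks an SM can hold at once, limited by its thread and block caps.
    resident_limit = min(max_blocks_per_sm, max_threads_per_sm // block_threads)
    active_blocks = min(blocks_per_sm, resident_limit)
    return active_blocks * block_threads / max_threads_per_sm


print(estimated_occupancy(4, 256))    # the launch from the question
print(estimated_occupancy(12, 256))   # a larger grid with the same block size
```

Under these assumptions the launch from the question comes out at about 0.33, and a grid of 12 blocks of 256 threads comes out at 1.0; the estimate only rises when the launch puts more threads on each SM.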