Using `overlap`, `kernel time` and `utilization` to optimize one's kernels - optimization

My kernel archive 100% utilization, but the kernel time is at only 3% and there is no time overlap between memory copies and kernels.
Especially the high utilization and the low kernel time don't make sense to me.
So how should I proceed in optimizing my kernel?
I already made sure, that I only have coalesced and pinned memory access, like the profiler recommended.
`Quadro FX 580 utilization = 100.00% (62117.00/62117.00)`
Kernel time = 3.05 % of total GPU time
Memory copy time = 0.9 % of total GPU time
Kernel taking maximum time = Pinned (0.7% of total GPU time)
Memory copy taking maximum time = memcpyHtoD (0.5% of total GPU time)
There is no time overlap between memory copies and kernels on GPU
Furtermore I have no warp serialization, no divergent branches, and no occupancy limiting factor.
Kernel details: Grid size: [4 1 1], Block size: [256 1 1]
Register Ratio: 0.9375 ( 7680 / 8192 ) [10 registers per thread]
Shared Memory Ratio: 0.09375 ( 1536 / 16384 ) [60 bytes per Block]
Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)
Active threads per SM: 768 (Maximum Active threads per SM: 768)
Potential Occupancy: 1 ( 24 / 24 )
Achieved occupancy: 0.333333 (on 4 SMs)
Occupancy limiting factor: None
p.s. I don't claim that I wrote wundercode, but I just don't know how to proceed from here.

it seems the grid size of your kernel is too small to make full use of SM.
why not decrease block size and increase the grid size.
i think it will do some help.

Related

How to compute GPU memory bus?

I'm learning OpenCL/CUDA for GPU computing.
When I study the GDDR5 architecture, I'm told that
memory bus = quntity of memory channel * memory channel width
I see an AMD GPU has 16 memory channels with 32-bits wide, so I get the memory pin width = 16 * 32 = 512bits.
But I found that the mainstream graphic card has only 256/384-bits memory bus.
What's going wrong with it?
For GPUs, the number of memory channels usually is not explicitely stated, but rather the total bus width (in bits) for all channels combined. The bus width varies greatly depending on how many memory modules are on the PCB and the bus width per memory module. GPUs with 256bit total bus width typically have 8 memory modules with 1GB capacity each and GPUs with 384bit have 12.
For CPUs or integrated GPUs which share main memory:
memory bus width per channel = 64bit
numer of memory channels = 2 (mainstream plattforms) / 4 or 8 (high-end desktop / workstation)
memory clock = 1600MHz (DDR3) - 3200+MHz (DDR4)
memory bandwidth = 0.125 * memory bus width per channel * numer of memory channels * memory clock
For dedicated GPUs:
total memory bus width = 64bit (GDDR3) - 256bit (GDDR5) - 5120bit (HBM2)
effective memory clock = <5GHz (GDDR5) - 19.5GHz (GDDR6X)
memory bandwidth = 0.125 * total memory bus width * effective memory clock

What will be the best choice for batch size for one device (using Mirrored Strategy in TF)?

Question:
Suppose you have 4 GPUs (having 2GB memory each) to train your deep learning model. You have 1000 data points in your dataset that takes around 10 GB of storage. What will be the best choice for batch size for one device (using Mirrored Strategy in TF)?
Can someone help me to solve this assignment problem? Thanks in advance.
Each GPU has a memory of 2GB and there are 4 GPUs which means you have a total of 8 GB memory to work with.
Now you can't divide 10 GB of data into 8 GB in one go, so you split 10GB into halves, and have an overall batch size of 500 data points(or rather 512 to be closer to a power of 2)
Now you distribute these 500 data points across the 4 GPUs, getting a batch size of ~128 data points per device.
So overall batch size would be 512 data points, and per GPU batch size would 128.

Tensorflow strange CPU usage

my problem is with Tensorflow and the usage of the CPU.
My System:
CPU => AMD FX 8320 (8 Cores รก 3,5ghz) and 8 Threads
Grafik => GTX 970
RAM => 16Gb and i belive ddr3 2600
I want to run a A3C algorithm for Starcraft 2 (pysc2) on my pc what works fine but the usage of the cpu ist somewhat strange.
If i start the algorithm with 4 Workers i get something about 150k Steps in 1h
and all cpu's are used about 25-30%
If i start the same algorithm with 8 Workers i get something about 120k Steps in 1h and all cpu's are used about 25-30%
If i now start the algorithm with 4 workers twice i get each 150k steps 1h and the cpu usage is 60-70%
Why cant i start the algorithm with 8 Workers, get the double amount of steps in 1H and the cpu is used to 70%?

GTX 970 bandwidth calculation

I am trying to calculate the theoretical bandwidth of gtx970. As per the specs given in:-
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications
Memory clock is 7Gb/s
Memory bus width = 256
Bandwidth = 7*256*2/8 (*2 because it is a DDR)
= 448 GB/s
However, in the specs it is given as 224GB/s
Why is there a factor 2 difference? Am i making a mistake, if so please correct me.
Thanks
The 7 Gbps seems to be the effective clock, i.e. including the data rate. Also note that the field explanation for this Wikipedia list says that "All DDR/GDDR memories operate at half this frequency, except for GDDR5, which operates at one quarter of this frequency", which suggests that all GDDR5 chips are in fact quad data rate, despite the DDR abbreviation.
Finally, let me point out this note from Wikipedia, which disqualifies the trivial effective clock * bus width formula:
For accessing its memory, the GTX 970 stripes data across 7 of its 8 32-bit physical memory lanes, at 196 GB/s. The last 1/8 of its memory (0.5 GiB on a 4 GiB card) is accessed on a non-interleaved solitary 32-bit connection at 28 GB/s, one seventh the speed of the rest of the memory space. Because this smaller memory pool uses the same connection as the 7th lane to the larger main pool, it contends with accesses to the larger block reducing the effective memory bandwidth not adding to it as an independent connection could.
The clock rate reported is an "effective" clock rate and already takes into account the transfer on both rising and falling edges. The trouble is the factor of 2 for DDR.
Some discussion on devtalk here: https://devtalk.nvidia.com/default/topic/995384/theoretical-bandwidth-vs-effective-bandwidth/
In fact, your format is correct, but the memory clock is wrong. GeForce GTX 970's memory clock is 1753MHz(refers to https://www.techpowerup.com/gpu-specs/geforce-gtx-970.c2620).

I/O Disk Drive Calculations

So I am studying for an up and coming exam, one of the questions involves calculating various disk drive properties. I have spent a fair while researching sample questions and formula but because I'm a bit unsure on what I have come up with I was wondering could you possibly help confirm my formulas / answers?
Information Provided:
Rotation Speed = 6000 RPM
Surfaces = 6
Sector Size = 512 bytes
Sectors / Track = 500 (average)
Tracks / Surfaces = 1,000
Average Seek Time = 8ms
One Track Seek Time = 0.4 ms
Maximum Seek Time = 10ms
Questions:
Calculate the following
(i) The capacity of the disk
(ii) The maximum transfer rate for a single track
(iii) Calculate the amount of cylinder skew needed (in sectors)
(iv) The Maximum transfer rate (in bytes) across cylinders (with cylinder skew)
My Answers:
(i) Sector Size x Sectors per Track x Tracks per Surface x No. of surfaces
512 x 500 x 1000 x 6 = 1,536,000,000 bytes
(ii) Sectors per Track x Sector Size x Rotation Speed per sec
500 x 512 x (6000/60) = 25,600,000 bytes per sec
(iii) (Track to Track seek time / Time for 1 Rotation) x Sectors per Track + 4
(0.4 / 0.1) x 500 + 4 = 24
(iv) Really unsure about this one to be honest, any tips or help would be much appreciated.
I fairly sure a similar question will appear on my paper so it really would be a great help if any of you guys could confirm my formulas and derived answers for this sample question. Also if anyone could provide a bit of help on that last question it would be great.
Thanks.
(iv) The Maximum transfer rate (in bytes) across cylinders (with cylinder skew)
500 s/t (1 rpm = 500 sectors) x 512 bytes/sector x 6 (reading across all 6 heads maximum)
1 rotation yields 1536000 bytes across 6 heads
you are doing 6000 rpm so that is 6000/60 or 100 rotations per second
so, 153,600,000 bytes per second (divide by 1 million is 153.6 megabytes per second)
takes 1/100th of a second or 10ms to read in a track
then you need a .4ms shift of the heads to then read the next track.
10.0/10.4 gives you a 96.2 percent effective read rate moving the heads perfectly.
you would be able to read at 96% of the 153.6 or 147.5 Mb/s optimally after the first seek.
where 1 Mb = 1,000,000 bytes