How to compute GPU memory bus width?

I'm learning OpenCL/CUDA for GPU computing.
While studying the GDDR5 architecture, I was told that
memory bus width = number of memory channels * memory channel width
I see an AMD GPU with 16 memory channels, each 32 bits wide, so I get a memory bus width of 16 * 32 = 512 bits.
But I found that mainstream graphics cards only have a 256- or 384-bit memory bus.
What am I getting wrong?

For GPUs, the number of memory channels is usually not stated explicitly; instead, the total bus width (in bits) for all channels combined is given. The bus width varies greatly depending on how many memory modules are on the PCB and the bus width per module. GPUs with a 256-bit total bus width typically have 8 memory modules of 1 GB each, and GPUs with a 384-bit bus have 12.
For CPUs or integrated GPUs which share main memory:
memory bus width per channel = 64 bit
number of memory channels = 2 (mainstream platforms) / 4 or 8 (high-end desktop / workstation)
memory clock = 1600 MHz (DDR3) - 3200+ MHz (DDR4)
memory bandwidth = 0.125 * memory bus width per channel * number of memory channels * memory clock
For dedicated GPUs:
total memory bus width = 64 bit (GDDR3) - 256 bit (GDDR5) - 5120 bit (HBM2)
effective memory clock = <5 GHz (GDDR5) - 19.5 GHz (GDDR6X)
memory bandwidth = 0.125 * total memory bus width * effective memory clock
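As a quick sanity check, here is a short Python sketch that plugs example numbers into the two formulas above; the dual-channel DDR4-3200 system and the 256-bit GDDR5 card are illustrative assumptions, not specific products:

# Memory bandwidth estimates using the formulas above.
# bandwidth [MB/s] = 0.125 * bus width [bit] * channels * effective clock [MHz]

def cpu_bandwidth_mb_s(bus_width_per_channel, channels, effective_clock_mhz):
    return 0.125 * bus_width_per_channel * channels * effective_clock_mhz

def gpu_bandwidth_mb_s(total_bus_width, effective_clock_mhz):
    return 0.125 * total_bus_width * effective_clock_mhz

# Example: dual-channel DDR4-3200 (assumed figures)
print(cpu_bandwidth_mb_s(64, 2, 3200))   # 51200 MB/s ~= 51.2 GB/s

# Example: 256-bit GDDR5 card at 7000 MHz effective (assumed figures)
print(gpu_bandwidth_mb_s(256, 7000))     # 224000 MB/s ~= 224 GB/s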

Related

OpenCL Maximum Size of Private Memory per Work Item

I have an AMD RX 570 4G.
OpenCL tells me that I can use a maximum of 256 work-groups and 256 work-items per group...
Let's say I use all 256 work-groups with 256 work-items in each of them.
Now, what is the maximum size of private memory per work-item?
Is private memory equal to the total VRAM (4 GB) divided by the total number of work-items (256x256)?
Or is it equal to the cache? If so, how?
VRAM is represented in OpenCL as global memory.
Private memory is initially allocated from the register file. Your RX 570 is from AMD's Polaris architecture, a.k.a. GCN 4, where each compute unit (64 shader processors) has access to 256 vector (SIMD) registers (each 64 x 32 bits wide) and 512 32-bit scalar registers. That works out to about 66 KiB per CU, but it's not as simple as just quoting that total.
A workgroup will always be scheduled on a single compute unit, so if you assign it 256 work items, then it will have to perform every vector instruction 4 times in sequence (64 x 4 = 256) and the vector registers will (simplifying slightly) effectively have to be treated as 64 256-entry registers.
Scalar registers are used for data and calculations which are identical on each work item, e.g. incrementing a loop counter, holding buffer base pointers, etc.
Private memory will usually spill to global if you use more than will fit in your register file. So performance simply drops.
So essentially, on GCN, your optimal workgroup size is usually 64. Use as little private memory as possible; definitely aim for less than half of the available register file, so that more than one workgroup can be scheduled per CU and memory-access latency can be hidden. Otherwise your shader cores will spend a lot of time just waiting for data to arrive or be written out.
Cache is used for OpenCL local and constant memory spaces. (Constant will again spill to global if you try to use too much. The size of local memory can be checked via the OpenCL API and again is divided among workgroups scheduled on the same compute unit, so if you use more than half, only one group can run on a CU, etc.)
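If it helps, the relevant device limits can be queried through the OpenCL API. A minimal sketch using pyopencl (assuming pyopencl is installed; the attribute names map directly onto the CL_DEVICE_* queries):

import pyopencl as cl

# Print the memory-related limits for every OpenCL device found.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(device.name)
        print("  global memory (VRAM):", device.global_mem_size, "bytes")
        print("  local memory per CU:", device.local_mem_size, "bytes")
        print("  max work-group size:", device.max_work_group_size)
        print("  compute units:", device.max_compute_units)

Note that standard OpenCL does not report the private-memory (register file) size; that comes from the architecture documentation, as described above.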
I don't know where you're getting a limit of 256 workgroups from; the limit is essentially set by whether the GPU uses 32-bit or 64-bit addressing. Most applications won't get close to 4 billion work items even in the 32-bit case.
Private memory space is registers on the GPU die (0 cycle access latency) and not related to the amount of VRAM (global memory space) at all. The amount of private memory depends on the device (private memory per compute unit).
I don't know private memory size for the RX 570, but for older HD7000 series GPUs it is 256kB per CU. If you have a work group size of 256, you get 1kB per work item, which is equal to 256 float variables.
Cache size determines the size of local and constant memory space.

Loading a large set of images kills the process

Loading 1500 images of size (1000, 1000, 3) breaks the code: the process is killed with signal 9 without any further error. Memory used before this line of code is 16% of total system memory. The total size of the images directory is 7.1 GB.
X = np.asarray(images).astype('float64')
y = np.asarray(labels).astype('float64')
system spec is:
OS: macOS Catalina
processor: 2.2 GHz 6-Core Intel Core i7
memory: 16 GB 2400 MHz DDR4
Update:
Getting the below error while running the code on 32 vCPUs, 120 GB memory.
MemoryError: Unable to allocate 14.1 GiB for an array with shape (1200, 1024, 1024, 3) and data type float32
You would have to provide some more info/details for an exact answer, but assuming that this is a memory error (incredibly likely), the size of the images on disk does not tell you much: image files are compressed, and once decoded into arrays, plus the surrounding objects and pointers, they occupy far more space in memory. Intuitively I would say that 16 GB of RAM is nowhere near enough to load 7 GB of images this way. It's impossible to tell you exactly how much you would need, but from experience I would say you'd need to bump it up to 64 GB. If you are using Keras, I would suggest looking into the DirectoryIterator.
Edit:
As Cris Luengo pointed out, I missed the fact that you stated the size of the images.
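For reference, the required allocation can be estimated up front from the array shape and dtype alone. A small sketch (the shapes are taken from the question; everything else is illustrative):

import numpy as np

def array_footprint_gib(shape, dtype):
    # Bytes needed for a dense array of this shape and dtype, in GiB.
    return np.prod(shape) * np.dtype(dtype).itemsize / 2**30

# Original attempt: 1500 images of (1000, 1000, 3) as float64
print(array_footprint_gib((1500, 1000, 1000, 3), np.float64))  # ~33.5 GiB

# The failing allocation from the update: matches the 14.1 GiB in the error
print(array_footprint_gib((1200, 1024, 1024, 3), np.float32))  # ~14.1 GiB

Using float32 instead of float64 halves the footprint, but loading in batches (for example with a generator such as Keras's DirectoryIterator) avoids holding everything in RAM at once.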

Physical CPU in AIX

Can someone let me know why the number of physical CPUs is greater than the number of virtual CPUs in AIX?
Online Virtual CPUs : 8,
Active Physical CPUs in system : 48,
Desired Virtual CPUs : 8
Partition Number : 30
Type : Shared-SMT-4
Mode : Uncapped
Entitled Capacity : 0.80
Partition Group-ID : 32798
Shared Pool ID : 0
**Online Virtual CPUs : 8**
Maximum Virtual CPUs : 160
Minimum Virtual CPUs : 1
Online Memory : 84992 MB
Maximum Memory : 127488 MB
Minimum Memory : 256 MB
Variable Capacity Weight : 128
Minimum Capacity : 0.10
Maximum Capacity : 16.00
Capacity Increment : 0.01
Maximum Physical CPUs in system : 48
**Active Physical CPUs in system : 48**
Active CPUs in Pool : 48
Shared Physical CPUs in system : 48
Maximum Capacity of Pool : 4800
Entitled Capacity of Pool : 1190
Unallocated Capacity : 0.00
Physical CPU Percentage : 10.00%
Unallocated Weight : 0
Memory Mode : Dedicated
Total I/O Memory Entitlement : -
Variable Memory Capacity Weight : -
Memory Pool ID : -
Physical Memory in the Pool : -
Hypervisor Page Size : -
Unallocated Variable Memory Capacity Weight: -
Unallocated I/O Memory entitlement : -
Memory Group ID of LPAR : -
**Desired Virtual CPUs : 8**
Desired Memory : 84992 MB
Desired Variable Capacity Weight : 128
Desired Capacity : 0.80
Target Memory Expansion Factor : -
Target Memory Expansion Size : -
Power Saving Mode : Disabled
Sub Processor Mode : -
Your "Entitled Capacity" is 0.8. And the each fraction of a single processor equals 0.1 of one physical processor. So you get 8 virtual processors. Here you can get more information about this:
What is the capacity entitlement?
Physical processors are presented to a logical partition's operating system as virtual processors. Physical processors are virtualized into portions or fractions: each fraction of a single processor equals 0.1 of one processor, with a further granularity of 0.01. The number of cores assigned to a partition is represented by the Capacity Entitlement. To display the assigned capacity entitlement for a shared partition, use the command `# lparstat | awk -F "ent=" '/ent\=/ {print $NF}'`. The output is the number of processors this partition is entitled to use. This is the upper threshold the partition can have from the processor pool (capped mode); the partition can use more than the assigned capacity entitlement in uncapped mode. Capped and uncapped modes are detailed later in this document. The number of virtual processors and processing units that are assigned to a partition can be changed through the HMC.
Capacity Entitlement considerations:
Capacity entitlement should be correctly configured for normal production operation and to cover the workload during peak time. Having enough capacity entitlement is important so as not to impact operating system performance and processor affinity. Running over entitled capacity can cause bad affinity and noticeable performance degradation affecting business operation.
Virtual Processors:
A virtual processor is a representation of a physical processor core to the operating system of a partition that uses shared processors. It is the number of physical processors that the logical partition can spread out across, and it represents the upper threshold for the number of physical processors that can be used. We recommend not increasing the ratio of virtual processors to entitled capacity beyond 1.6. Each partition has its own assigned virtual processors, and the partition will work only on the virtual processors needed for its workload; unneeded virtual processors assigned to a partition will be folded away using the processor folding feature. To display the currently assigned virtual processors, use the command `# lparstat -i | grep -i "Desired Virtual CPUs"`. Using an HMC, you can change the number of virtual processors and processing units that are assigned to the partition.
The Physical CPU figure is the number of physical CPUs installed in the Power machine on which this LPAR is hosted. The Virtual CPU figure is the number of virtual CPUs allocated to this particular LPAR.
Also, Desired Virtual CPUs and Online Virtual CPUs are the same thing.
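To connect the numbers in the lparstat output above: the Entitled Capacity is spread across the Online Virtual CPUs, which is where the reported Physical CPU Percentage comes from. A small sketch of that arithmetic, with values copied from the output in the question:

# Values from the lparstat -i output in the question.
entitled_capacity = 0.80     # Entitled Capacity
online_virtual_cpus = 8      # Online Virtual CPUs
active_physical_cpus = 48    # Active Physical CPUs in system (whole machine)

# Each virtual CPU is backed by this fraction of a physical core on average.
physical_cpu_percentage = entitled_capacity / online_virtual_cpus * 100
print(physical_cpu_percentage)   # 10.0 -> matches "Physical CPU Percentage : 10.00%"

# The 48 physical CPUs belong to the whole Power machine, not to this LPAR,
# which is why they can exceed the LPAR's 8 virtual CPUs.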

GTX 970 bandwidth calculation

I am trying to calculate the theoretical bandwidth of the GTX 970, as per the specs given at:
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications
Memory clock is 7 Gbps
Memory bus width = 256 bits
Bandwidth = 7 * 256 * 2 / 8 (* 2 because it is DDR)
= 448 GB/s
However, in the specs it is given as 224 GB/s.
Why is there a factor-of-2 difference? Am I making a mistake? If so, please correct me.
Thanks
The 7 Gbps seems to be the effective rate, i.e. the data-rate multiplier is already included. Also note that the field explanation for this Wikipedia list says that "All DDR/GDDR memories operate at half this frequency, except for GDDR5, which operates at one quarter of this frequency", which suggests that all GDDR5 chips are in fact quad data rate, despite the DDR abbreviation.
Finally, let me point out this note from Wikipedia, which disqualifies the trivial effective clock * bus width formula:
For accessing its memory, the GTX 970 stripes data across 7 of its 8 32-bit physical memory lanes, at 196 GB/s. The last 1/8 of its memory (0.5 GiB on a 4 GiB card) is accessed on a non-interleaved solitary 32-bit connection at 28 GB/s, one seventh the speed of the rest of the memory space. Because this smaller memory pool uses the same connection as the 7th lane to the larger main pool, it contends with accesses to the larger block reducing the effective memory bandwidth not adding to it as an independent connection could.
The clock rate reported is an "effective" clock rate and already takes into account transfers on both rising and falling edges, so the extra factor of 2 for DDR is what causes the discrepancy.
Some discussion on devtalk here: https://devtalk.nvidia.com/default/topic/995384/theoretical-bandwidth-vs-effective-bandwidth/
In fact, your formula is correct, but the memory clock is wrong. The GeForce GTX 970's actual memory clock is 1753 MHz (see https://www.techpowerup.com/gpu-specs/geforce-gtx-970.c2620); 7 Gbps is the effective data rate.
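Putting the answers above together, here is a short sketch of the correct arithmetic versus the double-counted version, with numbers taken from the question and the TechPowerUp page:

bus_width_bits = 256
effective_rate_gbps = 7.0   # 7 Gbps per pin, data-rate multiplier already included
real_clock_mhz = 1753       # actual GDDR5 clock; 1753 MHz * 4 ~= 7 GHz effective

# Correct: the effective rate already contains the data-rate factor.
print(effective_rate_gbps * bus_width_bits / 8)        # 224.0 GB/s, matches the spec

# Mistake from the question: applying the DDR factor a second time.
print(effective_rate_gbps * bus_width_bits * 2 / 8)    # 448.0 GB/s, too high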

Using `overlap`, `kernel time` and `utilization` to optimize one's kernels

My kernel achieves 100% utilization, but the kernel time is only 3% and there is no time overlap between memory copies and kernels.
Especially the high utilization and the low kernel time don't make sense to me.
So how should I proceed in optimizing my kernel?
I already made sure that I only have coalesced and pinned memory access, as the profiler recommended.
`Quadro FX 580 utilization = 100.00% (62117.00/62117.00)`
Kernel time = 3.05 % of total GPU time
Memory copy time = 0.9 % of total GPU time
Kernel taking maximum time = Pinned (0.7% of total GPU time)
Memory copy taking maximum time = memcpyHtoD (0.5% of total GPU time)
There is no time overlap between memory copies and kernels on GPU
Furthermore, I have no warp serialization, no divergent branches, and no occupancy-limiting factor.
Kernel details: Grid size: [4 1 1], Block size: [256 1 1]
Register Ratio: 0.9375 ( 7680 / 8192 ) [10 registers per thread]
Shared Memory Ratio: 0.09375 ( 1536 / 16384 ) [60 bytes per Block]
Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)
Active threads per SM: 768 (Maximum Active threads per SM: 768)
Potential Occupancy: 1 ( 24 / 24 )
Achieved occupancy: 0.333333 (on 4 SMs)
Occupancy limiting factor: None
P.S. I don't claim that I wrote wonder-code, I just don't know how to proceed from here.
It seems the grid size of your kernel is too small to make full use of the SMs: with a grid of [4 1 1] and 256 threads per block you launch only 4 blocks, so each of the 4 SMs gets at most one block and runs 256 of its 768 possible active threads, which is exactly the achieved occupancy of 0.33 reported above.
Why not decrease the block size and increase the grid size?
I think it will help.
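A small sketch of that occupancy arithmetic, using the figures from the profiler output above (the alternative launch configuration at the end is only an illustrative suggestion, not taken from the question):

# Figures from the profiler output in the question.
num_sms = 4
max_threads_per_sm = 768
grid_size = 4
block_size = 256

blocks_per_sm = min(grid_size / num_sms, max_threads_per_sm // block_size)
achieved_occupancy = blocks_per_sm * block_size / max_threads_per_sm
print(achieved_occupancy)        # 0.333... -> matches the reported 0.333333

# Hypothetical alternative: more, smaller blocks so every SM can be filled
# (ignores register and shared-memory limits for simplicity).
grid_size, block_size = 24, 128
blocks_per_sm = min(grid_size / num_sms, max_threads_per_sm // block_size)
print(blocks_per_sm * block_size / max_threads_per_sm)   # 1.0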