my problem is with Tensorflow and the usage of the CPU.
My System:
CPU => AMD FX 8320 (8 Cores รก 3,5ghz) and 8 Threads
Grafik => GTX 970
RAM => 16Gb and i belive ddr3 2600
I want to run a A3C algorithm for Starcraft 2 (pysc2) on my pc what works fine but the usage of the cpu ist somewhat strange.
If i start the algorithm with 4 Workers i get something about 150k Steps in 1h
and all cpu's are used about 25-30%
If i start the same algorithm with 8 Workers i get something about 120k Steps in 1h and all cpu's are used about 25-30%
If i now start the algorithm with 4 workers twice i get each 150k steps 1h and the cpu usage is 60-70%
Why cant i start the algorithm with 8 Workers, get the double amount of steps in 1H and the cpu is used to 70%?
Related
I am trying to load a relatively large batch of float16 multispectral images (BxCxHxW=800x12x256x256) to train a deep learning model. The code for the DataLoader is extremely simple:
import torch
import os
paths = os.listdir("/home/bla/data")
class MultiSpectralImageDataset(Dataset):
def __init__(self, paths):
self.paths = np.array(self.paths)
self.l = len(self.paths)
def __len__(self):
return self.l
def __getitem__(self, idx):
path = self.paths[idx]
image = np.load(path)
return image
dataset = MultiSpectralImageDataset(paths)
loader = DataLoader(dataset, batch_size=800, shuffle=True, pin_memory=True, num_workers=16, drop_last=True)
for i, X in enumerate(loader):
X = X.cuda(non_blocking=True).float()
The images are individual files on a very fast NVME SSD. I can verify the read speed of the SSD with sudo hdparm -tT /dev/nvme1n1. This gives me:
/dev/nvme1n1:
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
readonly = 0 (off)
readahead = 256 (on)
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
geometry = 1907729/64/32, sectors = 3907029168, start = 0
bla#bla:~/workspace$ sudo hdparm -tT /dev/nvme1n1
/dev/nvme1n1:
Timing cached reads: 59938 MB in 2.00 seconds = 30041.04 MB/sec
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
Timing buffered disk reads: 6308 MB in 3.00 seconds = 2102.35 MB/sec
This confirms the read speed of the SSD is over 2GB/s. However, when using PyTorch DataLoader, I am not nearly able to match this IO speed. During training, the GPU is idle (0% utilization) most of the time, and the CPU is hardly used (htop shows most cores at 0% usage, some cores at at 0.5-1.5% usage). Running iotop shows
The Total Disk Read speed never surpasses 300MB/s. If I decrease num_workers (say by half), the Total Disk Read emains the same (~200MB/s), and each individual thread doubles in read speed. In particular, I observe that every num_workers iterations, the iteration is extremely slow (takes ~1 minute). This apparently simply means that the loading from disk is too slow, as discussed in the PyTorch forum here
What's weird is that I am 99.9% confident it used to work. I remember constistently reaching almost 100% GPU utilization with the same data-loading procedure.
Things I've tried, but with no successs:
Updating Ubuntu, updating everything with apt udpate & upgrade, rebooting, powering off and restarting
Updating the SSD firmware using fwupd (no updates available)
Giving higher priority to the process by running Python using sudo and using os.nice(-10)
Making space on the SSD (30% of the storage is empty, I have run fstrim -v.
Using memmap, i.e. using the keyword in np.load(path, memmap_mode='r')
I really appreciate any help, as I've been stuck with this problem for weeks now, and what used to take 13 minutes per epoch now takes approximately 1h45 per epoch, making things infeasible to train.
Loading 1500 images of size (1000,1000,3) breaks the code and throughs kill 9 without any further error. Memory used before this line of code is 16% of system total memory. Total size of images direcotry is 7.1G.
X = np.asarray(images).astype('float64')
y = np.asarray(labels).astype('float64')
system spec is:
OS: macOS Catalina
processor: 2.2 GHz 6-Core Intel Core i7 16 GB 2
memory: 16 GB 2400 MHz DDR4
Update:
getting the bellow error while running the code on 32 vCPUs, 120 GB memory.
MemoryError: Unable to allocate 14.1 GiB for an array with shape (1200, 1024, 1024, 3) and data type float32
You would have to provide some more info/details for an exact answer but, assuming that this is a memory error(incredibly likely, size of the images on disk does not represent the size they would occupy in memory, so that is irrelevant. In 100% of all cases, the images in memory will occupy a lot more space due to pointers, objects that are needed and so on. Intuitively I would say that 16GB of ram is nowhere nearly enough to load 7GB of images. It's impossible to tell you how much you would need but from experience I would say that you'd need to bump it up to 64GB. If you are using Keras, I would suggest looking into the DirectoryIterator.
Edit:
As Cris Luengo pointed out, I missed the fact that you stated the size of the images.
I am trying to calculate the theoretical bandwidth of gtx970. As per the specs given in:-
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications
Memory clock is 7Gb/s
Memory bus width = 256
Bandwidth = 7*256*2/8 (*2 because it is a DDR)
= 448 GB/s
However, in the specs it is given as 224GB/s
Why is there a factor 2 difference? Am i making a mistake, if so please correct me.
Thanks
The 7 Gbps seems to be the effective clock, i.e. including the data rate. Also note that the field explanation for this Wikipedia list says that "All DDR/GDDR memories operate at half this frequency, except for GDDR5, which operates at one quarter of this frequency", which suggests that all GDDR5 chips are in fact quad data rate, despite the DDR abbreviation.
Finally, let me point out this note from Wikipedia, which disqualifies the trivial effective clock * bus width formula:
For accessing its memory, the GTX 970 stripes data across 7 of its 8 32-bit physical memory lanes, at 196 GB/s. The last 1/8 of its memory (0.5 GiB on a 4 GiB card) is accessed on a non-interleaved solitary 32-bit connection at 28 GB/s, one seventh the speed of the rest of the memory space. Because this smaller memory pool uses the same connection as the 7th lane to the larger main pool, it contends with accesses to the larger block reducing the effective memory bandwidth not adding to it as an independent connection could.
The clock rate reported is an "effective" clock rate and already takes into account the transfer on both rising and falling edges. The trouble is the factor of 2 for DDR.
Some discussion on devtalk here: https://devtalk.nvidia.com/default/topic/995384/theoretical-bandwidth-vs-effective-bandwidth/
In fact, your format is correct, but the memory clock is wrong. GeForce GTX 970's memory clock is 1753MHz(refers to https://www.techpowerup.com/gpu-specs/geforce-gtx-970.c2620).
I have 6 GPUs of RX470. It should be mining average 25-27 mh/s each but it's only 20 mh/s. Overall is 120 instead of 150-170. I think the problem is GPU bios configuration but can't figure out any other thing. Any suggestions?
25 mh/s is what you would expect from an RX 480 stock. To get the same hashrate for RX 470, you'd be looking at overclocking memory speed (+600). In terms of how to overclock, it depends on whether your running linux or windows.
My kernel archive 100% utilization, but the kernel time is at only 3% and there is no time overlap between memory copies and kernels.
Especially the high utilization and the low kernel time don't make sense to me.
So how should I proceed in optimizing my kernel?
I already made sure, that I only have coalesced and pinned memory access, like the profiler recommended.
`Quadro FX 580 utilization = 100.00% (62117.00/62117.00)`
Kernel time = 3.05 % of total GPU time
Memory copy time = 0.9 % of total GPU time
Kernel taking maximum time = Pinned (0.7% of total GPU time)
Memory copy taking maximum time = memcpyHtoD (0.5% of total GPU time)
There is no time overlap between memory copies and kernels on GPU
Furtermore I have no warp serialization, no divergent branches, and no occupancy limiting factor.
Kernel details: Grid size: [4 1 1], Block size: [256 1 1]
Register Ratio: 0.9375 ( 7680 / 8192 ) [10 registers per thread]
Shared Memory Ratio: 0.09375 ( 1536 / 16384 ) [60 bytes per Block]
Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)
Active threads per SM: 768 (Maximum Active threads per SM: 768)
Potential Occupancy: 1 ( 24 / 24 )
Achieved occupancy: 0.333333 (on 4 SMs)
Occupancy limiting factor: None
p.s. I don't claim that I wrote wundercode, but I just don't know how to proceed from here.
it seems the grid size of your kernel is too small to make full use of SM.
why not decrease block size and increase the grid size.
i think it will do some help.