Running Roboflow's notebook, 'Roboflow-Train-YOLOv5', stops after the epochs loop completes.
Instead of reporting the results, I get the following lines, with a ^C at the end of the third line from the end.
I would like to know the reason for the failure, and whether there is a way to fix it.
10 epochs completed in 0.191 hours.
Optimizer stripped from runs/train/yolov5s_results2/weights/last.pt, 14.9MB
Optimizer stripped from runs/train/yolov5s_results2/weights/best.pt, 14.9MB
Validating runs/train/yolov5s_results2/weights/best.pt...
Fusing layers...
my_YOLOv5s summary: 213 layers, 7015519 parameters, 0 gradients, 15.8 GFLOPs
Class Images Labels P R mAP@.5 mAP@.5:.95: 20% 1/5 [00:01<00:04, 1.03s/it]^C
CPU times: user 7.01 s, sys: 830 ms, total: 7.84 s
Wall time: 12min 31s
My Colab plan is Colab Pro, so I guess it is not a problem of resources.
I am trying to load a relatively large batch of float16 multispectral images (BxCxHxW=800x12x256x256) to train a deep learning model. The code for the DataLoader is extremely simple:
import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# build full paths to the image files
data_dir = "/home/bla/data"
paths = [os.path.join(data_dir, p) for p in os.listdir(data_dir)]

class MultiSpectralImageDataset(Dataset):
    def __init__(self, paths):
        self.paths = np.array(paths)
        self.l = len(self.paths)

    def __len__(self):
        return self.l

    def __getitem__(self, idx):
        path = self.paths[idx]
        image = np.load(path)
        return image

dataset = MultiSpectralImageDataset(paths)
loader = DataLoader(dataset, batch_size=800, shuffle=True, pin_memory=True,
                    num_workers=16, drop_last=True)

for i, X in enumerate(loader):
    X = X.cuda(non_blocking=True).float()
The images are individual files on a very fast NVMe SSD. I can verify the read speed of the SSD with sudo hdparm -tT /dev/nvme1n1. This gives me:
/dev/nvme1n1:
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
readonly = 0 (off)
readahead = 256 (on)
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
geometry = 1907729/64/32, sectors = 3907029168, start = 0
bla#bla:~/workspace$ sudo hdparm -tT /dev/nvme1n1
/dev/nvme1n1:
Timing cached reads: 59938 MB in 2.00 seconds = 30041.04 MB/sec
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
Timing buffered disk reads: 6308 MB in 3.00 seconds = 2102.35 MB/sec
This confirms the read speed of the SSD is over 2 GB/s. However, when using the PyTorch DataLoader, I am not nearly able to match this I/O speed. During training, the GPU is idle (0% utilization) most of the time, and the CPU is hardly used (htop shows most cores at 0% usage, some cores at 0.5-1.5% usage). Running iotop shows that the Total Disk Read speed never surpasses 300 MB/s. If I decrease num_workers (say by half), the Total Disk Read remains the same (~200 MB/s), and each individual thread doubles in read speed. In particular, I observe that every num_workers iterations, one iteration is extremely slow (it takes ~1 minute). This apparently simply means that the loading from disk is too slow, as discussed in the PyTorch forum here.
What's weird is that I am 99.9% confident it used to work. I remember consistently reaching almost 100% GPU utilization with the same data-loading procedure.
Things I've tried, but with no success:
Updating Ubuntu, updating everything with apt update & upgrade, rebooting, powering off and restarting
Updating the SSD firmware using fwupd (no updates available)
Giving higher priority to the process by running Python with sudo and using os.nice(-10)
Making space on the SSD (30% of the storage is empty; I have also run fstrim -v)
Using memory mapping, i.e. passing the keyword in np.load(path, mmap_mode='r')
I really appreciate any help, as I've been stuck on this problem for weeks now, and what used to take 13 minutes per epoch now takes approximately 1h45 per epoch, making training infeasible.
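One diagnostic that may help narrow this down (a suggested sketch, not part of the original setup) is to time raw np.load reads outside of the DataLoader; if the raw reads already fall far short of the ~2 GB/s that hdparm reports, the bottleneck is in the per-file reads themselves rather than in the DataLoader workers. A minimal sketch, assuming the same directory of .npy files as above:

import os
import time
import numpy as np

data_dir = "/home/bla/data"   # assumption: same directory as in the question
paths = [os.path.join(data_dir, p) for p in os.listdir(data_dir)][:800]

# raw read throughput, bypassing the DataLoader entirely
t0 = time.perf_counter()
total_bytes = 0
for p in paths:
    arr = np.load(p)
    total_bytes += arr.nbytes
dt = time.perf_counter() - t0
print(f"raw np.load: {total_bytes / dt / 1e6:.0f} MB/s over {len(paths)} files")

Comparing this number with the iotop reading should show whether the DataLoader itself adds significant overhead.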
UPDATE: In my initial post I stupidly applied stats.mode patch-wise rather than along the axis of the patches. Fixing this increased my speed by a factor of 4. However, it's still slow, and my original questions still stand: (1) Can I increase the speed? (2) Are there different/better/standard approaches to cleaning up noisy categorical data? Back to the post:
I have some image segmentation results that are noisy and I want to clean them up. My idea was to take the mode value over (3,3) patches. This code works, but it's too slow:
import numpy as np
from scipy import stats
from sklearn.feature_extraction import image

def _mode(a, axis=None):
    m, _ = stats.mode(a, axis=axis)
    return m

def mode_smoothing(data, kernel=(3, 3)):
    patches = image.extract_patches_2d(data, kernel)
    nb_patches = patches.shape[0]
    patches = patches.reshape(nb_patches, -1)
    return _mode(patches, 1).reshape(int(np.sqrt(nb_patches)), -1)
""" original method (new version is ~ 5 times faster, but still slow)
def _mode(arr):
m,_=stats.mode(arr,axis=None)
return m
def mode_smoothing(data,kernel=(3,3)):
patches=image.extract_patches_2d(data,kernel)
nb_patches=patches.shape[0]
w=int(np.sqrt(nb_patches))
o=np.array([_mode(patches[p]) for p in range(nb_patches)])
return o.reshape(w,-1)
"""
Questions:
Is there a way to do this that is much, much faster? Eliminate the for loop / vectorize in numpy? Port to C directly, or use numba, etc.? I struggled to get anything working along these paths.
Are there better / more standard methods for accomplishing denoising like this on categorical image data?
Here is a before/after example from the mode_smoothing method above
Below I present two answers to my question:
by expanding my initial attempt into a function that numba can handle
by using the above suggestion by Alex Alex, which I'll call "categorical smoothing" (is there a standard name for this method?)
I haven't written out a mathematical proof yet, but it appears this patch-wise mode smoothing is equivalent to categorical smoothing for the correct choice of parameters: convolving each one-hot channel with a ones kernel counts how often that category appears in the window, so the argmax over channels recovers the patch-wise mode (and boosting the kernel's center weight plays the role of center_boost). Both lead to a big speed boost, but the categorical-smoothing solution is cleaner, faster, and doesn't involve numba, so it wins.
NUMBA
import numpy as np
from numba import njit

@njit
def mode_smoothing(data, kernel=(3, 3), step=(1, 1), edges=False, high_value=False, center_boost=False):
    """ mode smoothing over patches

    Args:
        data<np.array>: numpy array (ndim=2)
        kernel<tuple[int]>: (height,width) of patch window
        step<tuple[int]>: (y-step,x-step)
        edges<bool>:
            - if true
                * include edge patches by taking the mode over a smaller patch window
                * the returned image will be the same shape as the input data
            - if false
                * only run over patches with the full kernel size
                * the returned image will be reduced in size by the radius of the kernel
        high_value<bool>:
            when there are multiple possible mode values choose the highest if true,
            otherwise choose the lowest value
        center_boost<int|bool>:
            if true, instead of using the pure mode value, increase the count on the center pixel

    Return:
        <np.array> of patch-wise mode values. shape may be different than the input; see `edges` above
    """
    h, w = data.shape
    ry = int(kernel[0] // 2)
    rx = int(kernel[1] // 2)
    sy, sx = step
    _mode_vals = []
    if edges:
        j0, j1 = 0, h
        i0, i1 = 0, w
    else:
        j0, j1 = ry, h - ry
        i0, i1 = rx, w - rx
    for j in range(j0, j1, sy):
        for i in range(i0, i1, sx):
            # patch around (j, i), clipped at the image borders
            ap = data[
                max(j - ry, 0):j + ry + 1,
                max(i - rx, 0):i + rx + 1]
            cv = data[j, i]
            values = np.unique(ap)
            count = 0
            for v in values:
                newcount = (ap == v).sum()
                if center_boost and (v == cv):
                    newcount += center_boost
                if high_value:
                    test = newcount >= count
                else:
                    test = newcount > count
                if test:
                    count = newcount
                    mode_value = v
            _mode_vals.append(mode_value)
    return np.array(_mode_vals).reshape(j1 - j0, i1 - i0)
CATEGORICAL SMOOTHING
import numpy as np
from scipy.signal import convolve2d

KERNEL = np.ones((3, 3))

def categorical_smoothing(data, nb_categories, kernel=KERNEL):
    # one-hot encode: shape becomes (nb_categories, height, width)
    data = np.eye(nb_categories)[:, data]
    # count each category in the neighborhood defined by the kernel
    for i in range(nb_categories):
        data[i] = convolve2d(data[i], kernel, mode='same')
    # the most frequent category in each neighborhood wins
    return data.argmax(axis=0)
EQUIVALENCE/SPEED-CHECK
This is probably easy to prove but...
S=512
N=19
data=np.random.randint(0,N,(S,S))
%time o1=mode_smoothing(data,edges=True,center_boost=False)
kernel=np.ones((3,3))
%time o2=categorical_smoothing(data,N,kernel=kernel)
print((o1==o2).all())
print()
data=np.random.randint(0,N,(S,S))
%time o1=mode_smoothing(data,edges=True,center_boost=1)
kernel=np.ones((3,3))
kernel[1,1]=2
%time o2=categorical_smoothing(data,N,kernel=kernel)
print((o1==o2).all())
""" OUTPUT
CPU times: user 826 ms, sys: 0 ns, total: 826 ms
Wall time: 825 ms
CPU times: user 416 ms, sys: 7.83 ms, total: 423 ms
Wall time: 423 ms
True
CPU times: user 825 ms, sys: 3.78 ms, total: 829 ms
Wall time: 828 ms
CPU times: user 422 ms, sys: 3.91 ms, total: 426 ms
Wall time: 425 ms
True
"""
My problem is with TensorFlow and CPU usage.
My system:
CPU => AMD FX 8320 (8 cores at 3.5 GHz, 8 threads)
Graphics => GTX 970
RAM => 16 GB, I believe DDR3 2600
I want to run an A3C algorithm for StarCraft 2 (pysc2) on my PC, which works fine, but the CPU usage is somewhat strange.
If I start the algorithm with 4 workers, I get about 150k steps in 1 hour, and all CPUs are used at about 25-30%.
If I start the same algorithm with 8 workers, I get about 120k steps in 1 hour, and all CPUs are used at about 25-30%.
If I instead start the algorithm with 4 workers twice, I get 150k steps in 1 hour from each, and the CPU usage is 60-70%.
Why can't I start the algorithm with 8 workers, get double the number of steps in 1 hour, and have the CPU used at 70%?
I have an original HDF5 file with a dataset of shape (3737, 224, 224, 3) that was not extendable, i.e. no maxshape argument was passed during its creation.
I decided to create a new HDF5 file and create the dataset with maxshape=(None, 224, 224, 3) so that I can resize it later. I then just copied the dataset from the original HDF5 file to this new one and saved it.
The contents of the two HDF5 files are exactly the same. I then tried to read all the data back, and I found significant performance degradation with the resizable version.
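For reference, a minimal sketch of the kind of copy described above, using h5py with hypothetical file and dataset names ('original.h5', 'resizable.h5', 'images'):

import h5py

# hypothetical file and dataset names, for illustration only
with h5py.File('original.h5', 'r') as src, h5py.File('resizable.h5', 'w') as dst:
    data = src['images'][...]                         # read the full (3737, 224, 224, 3) array
    dst.create_dataset('images', data=data,
                       maxshape=(None, 224, 224, 3))  # resizable along the first axis

The read timings below compare loading everything back from the original file versus the resizable copy.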
Original:
CPU times: user 660 ms, sys: 2.58 s, total: 3.24 s
Wall time: 6.08 s
Resizable:
CPU times: user 18.6 s, sys: 4.41 s, total: 23 s
Wall time: 49.5 s
That's almost 10 times as slow. Is this to be expected? The file size difference is less than 2 MB. Are there optimization tips/tricks I need to be aware of?
Upon reading the HDF5 docs carefully, it seems that specifying a maxshape when creating a dataset (which enables it to be resized in the future) also turns chunking on; this appears to be mandatory. The "default" chunk size it gave me is dataset.chunks = (234, 14, 28, 1).
According to the docs, this means the data are not stored contiguously but "haphazardly" in a B-tree-like structure. This most likely explains the slowness I observed; it is probably doing much more extra I/O than I thought.
I set the chunk size to be the entire dataset by passing chunks=(3737, 224, 224, 3), and this time I got:
CPU times: user 809 µs, sys: 837 ms, total: 838 ms
Wall time: 914 ms
That's a big speed-up in loading my (3737, 224, 224, 3) tensor. I sort of understand why chunking is a scalability solution, but the fact that it magically assigns a chunk size is confusing. My context is mini-batch training for deep learning, so the optimal setup is probably for each chunk to be a mini-batch.
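A hedged sketch of what that could look like in h5py, assuming a hypothetical batch size of 32 and the same dataset shape; the chunk shape is chosen to match one mini-batch so each batch read touches exactly one chunk:

import h5py

batch_size = 32  # assumption: hypothetical mini-batch size
with h5py.File('resizable.h5', 'w') as f:
    f.create_dataset('images',
                     shape=(3737, 224, 224, 3),
                     maxshape=(None, 224, 224, 3),      # still resizable along axis 0
                     chunks=(batch_size, 224, 224, 3),  # one chunk per mini-batch
                     dtype='uint8')                     # assumption: image dtype

Whether this beats one giant chunk depends on the access pattern: whole-file reads favor a single chunk, while random mini-batch reads favor chunks aligned to the batch size.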
I'm trying to build a simple neural network in TensorFlow, but I have a question about gradient optimization.
It might be a naive question, but do I have to set conditions to stop the optimizer? Below is a sample printout from my network, and you can see that after iteration (batch gradient descent over all the data) 66, the cost begins to increase again. So is it up to me to make sure the optimizer stops at this point? (Note: I didn't include all the output here, but the cost begins to increase exponentially as the number of iterations increases.)
Thanks for any guidance.
iteration 64 with average cost of 654.621 and diff of 0.462708
iteration 65 with average cost of 654.364 and diff of 0.257202
iteration 66 with average cost of 654.36 and diff of 0.00384521
iteration 67 with average cost of 654.663 and diff of -0.302368
iteration 68 with average cost of 655.328 and diff of -0.665161
iteration 69 with average cost of 656.423 and diff of -1.09497
iteration 70 with average cost of 658.011 and diff of -1.58826
That's correct - the TensorFlow tf.train.Optimizer classes expose an operation that you can run to take one (gradient descent-style) step, but they do not monitor the current value of the cost or decide when to stop, so you may see increasing cost once the network begins to overfit.
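As an illustration (not part of the original answer), the stopping condition therefore lives in your own training loop. A minimal TF1-style sketch, assuming hypothetical cost and train_op tensors and a simple patience rule; in practice the monitored quantity would usually be a held-out validation cost rather than the training cost:

import tensorflow as tf  # TF1-style API, matching tf.train.Optimizer above

# assumptions: `cost` is the scalar loss tensor and `train_op` the optimizer step,
# e.g. train_op = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

def train(sess, train_op, cost, max_iters=1000, patience=3):
    best_cost = float('inf')
    bad_iters = 0
    for it in range(max_iters):
        # add feed_dict=... here if the graph uses placeholders
        _, c = sess.run([train_op, cost])
        if c < best_cost:
            best_cost, bad_iters = c, 0
        else:
            bad_iters += 1              # cost did not improve this iteration
        print("iteration", it, "with average cost of", c)
        if bad_iters >= patience:       # stop once the cost keeps rising or stalling
            break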