I am trying to train a model for image recognition using Yolo version 3 with this notebook:
https://drive.google.com/file/d/1YnZLp6aIl-iSrL4tzVQgxJaE1N2_GfFH/view?usp=sharing
Everything works fine except the final training step. The training starts, and after 5-10 minutes (the timing varies) it stops working. The browser tab becomes unresponsive (I am unable to do anything inside it), and after several minutes Colab disconnects completely.
I have tried this more than 10 times. I tried both Chrome Canary and regular Chrome (latest versions), as well as incognito windows, and I always get the same result.
Any ideas? Why is that happening?
Eager to know your thoughts about this.
All the best,
Fab.
Problem solved. I tried the same process in Firefox and discovered that the auto-saving feature of Google Drive was conflicting with the process! So... I simply had to use the "playground" mode of Colab instead, as explained here:
https://stackoverflow.com/questions/58207750/how-to-disable-autosave-in-google-colab#:~:text=1%20Answer&text=Open%20the%20notebook%20in%20playground,Save%20a%20copy%20in%20Drive.
No idea why Chrome didn't give me any feedback about that, but Firefox saved my day!
Following @fabrizio-ferrari's answer, I disabled output saving, but the problem persisted.
Runtime -> Change runtime type -> Omit code cell output when saving this notebook
I moved to Firefox and the problem disappeared.
Related
I was trying out dlib's deep learning-based face detector (MMOD), and it worked perfectly fine without any errors. After the weekend, I reran my Google Colab notebook and got the following error:
RuntimeError: Error while calling cudnnConvolutionBiasActivationForward( context(), &alpha1, descriptor(data), data.device(), (const cudnnFilterDescriptor_t)filter_handle, filters.device(), (const cudnnConvolutionDescriptor_t)conv_handle, (cudnnConvolutionFwdAlgo_t)forward_algo, forward_workspace, forward_workspace_size_in_bytes, &alpha2, out_desc, out, descriptor(biases), biases.device(), identity_activation_descriptor(), out_desc, out) in file /tmp/pip-install-fdw8qrx_/dlib_e3176ea453c4478d8dbecc372b81297e/dlib/cuda/cudnn_dlibapi.cpp:1237. code: 9, reason: CUDNN_STATUS_NOT_SUPPORTED
This is literally the same code, previously saved in GitHub and now run in Google Colab.
Any ideas about what could have happened over the weekend, and how to fix it? Thank you!
So after I tried EVERYTHING I could come up with (running the code on a different machine, on a different platform, checking whether there were any library updates), I went through my committed version on GitHub and realized that the dlib library had been updated, but the update was not announced anywhere...
So yeah, note to my future self: always include the .version after importing the tools. It might save DAYS of trying to figure out what on earth happened.
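One way to follow that advice is to log the installed package versions at the top of the notebook. This is only a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` and the package list are illustrative assumptions, not something from the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None instead of raising.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The package names below are examples; list whatever the notebook imports.
for name, version in installed_versions(["dlib", "numpy"]).items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Keeping this output next to the committed notebook makes a silent library update visible when the two runs are compared.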
In the last week or two I have seen frequent disconnects while trying to run a lengthy training run. A month or two ago this seemed to be working pretty reliably. My code has definitely changed but those internal details seem unrelated to the operation of Colab.
(On the other hand, I did switch my local machine from an Intel MacBook Pro running Big Sur to an M1 (Apple Silicon) MacBook Pro running Monterey. I assume that does not matter to Colab running in the cloud, via a Chrome browser.)
I see two kinds of disconnects:
There are “faux disconnects” which seem like false positives from
the disconnect detector. These last less than a second, then the
computation continues apparently unscathed. A black notification
slides up from the lower left corner of the window, then slides
back. See a link to a video of this below.
Then there are “real disconnects.” I start a computation that I
expect to run for several hours. I see “faux disconnects” happen
frequently. But less than an hour into the computation, I find
the Colab window idle, no status information, and a Reconnect button
in the upper right corner.
Link to video. I started this session around 1:03 pm. This video was recorded at 1:35 pm. Normally the training session should have run for several hours. Instead it died at 1:52 pm (~50 minutes into the run). See some additional comments in an issue at GitHub.
Can anyone help me understand how to get past this? I am currently unable to make progress in my work because I cannot complete a training run before my Colab runtime decides to disconnect.
Edit:
FYI: since once a “real disconnect” happens it is too late to look at the (no longer connected) runtime's log, and since this seems to run for about an hour before disconnecting, I saved a log file when a run was about 10 minutes in.
Edit on August 1, 2022:
My real problem is the “real disconnect” on my real Colab notebook. But my notebook is overly complicated, so it is not a good test case. I tried to make a small test case, see Colab notebook: DisconnectTest.ipynb. It contains a generic MNIST-based Keras/TensorFlow benchmark from the innertubes. I made a screen grab video of the first 2.5 minutes of a run. While this run completes OK (that is, there are no “real disconnects”), it had several “faux disconnects.” The first one is at 1:36. These seem fairly benign, but they do disrupt the Resources panel on the right. This makes it hard to know whether the “real disconnect” has anything to do with exhausting resources.
As I described in a parallel post on Colab's Issue #2965 on Github, this appears to be “some interaction between Colab and Chrome (and perhaps macOS Monterey (Version 12.5.1), and perhaps M1 Apple Silicon). Yet Colab seems to work fine on M1/Monterey/Safari.”
As described there, a trivial Colab example fails in the Chrome browser but works fine in Safari.
I am using Colab Pro. About 4 months ago, I began experiencing slow training of my TensorFlow model. Training is very slow, and when I checked today I confirmed that the GPU was detected normally, but GPU power was reported as off. Volatile GPU-Util also shows 0, so it looks like the GPU is not being used for training. When I looked for the cause, I read that it could be a data I/O bottleneck, so I also modified the data loader. When I ran the same code and dataset in a different Colab account, GPU allocation worked well and the training time was shorter. If there is a problem with the OS settings, or something I need to fix, please let me know. Have a good day.
I figured out that the problem was simply a path problem. As earlier feedback suggested, there seems to have been a bottleneck in loading images from folders.
It was solved by setting the dataset path to content/.
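For readers in a similar situation, one possible reading of this fix is that the images were copied to the runtime's local disk before training, so they are no longer read from a slower mounted folder. The sketch below assumes exactly that; the helper name `copy_dataset_to_local` and both example paths are hypothetical, not taken from the original post.

```python
import shutil
from pathlib import Path


def copy_dataset_to_local(source_dir, local_dir):
    """Copy a dataset folder to a local directory and return the new path.

    Existing files in the target are overwritten (requires Python 3.8+
    for dirs_exist_ok).
    """
    source = Path(source_dir)
    target = Path(local_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"Dataset folder not found: {source}")
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target


# Hypothetical usage in a Colab cell (both paths are assumptions):
# data_dir = copy_dataset_to_local("/content/drive/MyDrive/dataset", "/content/dataset")
```

The data loader would then be pointed at the returned local path instead of the original folder.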
You may have heard that Google Colab offers the P100 GPU, which is much faster than the other available GPUs except the V100 (the V100 is available only in Colab Pro). Because it is so powerful, the P100 is pretty rare on the free tier, and I have never been assigned a "Tesla P100" before. So I tried to write a program that factory-resets the runtime until the text "Tesla P100-PCIE..." appears in the nvidia-smi output (a cell containing !nvidia-smi prints your GPU model). I tried to do it with Selenium, but it failed with a "This browser may not be secure" error. Then I tried JavaScript in the browser's developer console, but it failed with an error I don't understand. So I'm here.
[Q] How can I get a "Tesla P100" in Google Colab programmatically?
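The detection half of this idea can be done from inside the notebook without scraping the cell output. The sketch below only checks which GPU the current runtime has; it does not reset the runtime, and the helper names are illustrative. It assumes `nvidia-smi` is on the PATH, as it is on a GPU runtime.

```python
import subprocess


def parse_gpu_names(output):
    """Turn the text output of the nvidia-smi name query into a list of names."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def query_gpu_names():
    """Ask nvidia-smi for the GPU model names (one per line, no CSV header)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_names(result.stdout)


def has_gpu(names, wanted="P100"):
    """Return True if any GPU name contains the wanted substring."""
    return any(wanted.lower() in name.lower() for name in names)


# Usage on a GPU runtime:
# print(has_gpu(query_gpu_names()))
```

Splitting the parsing from the subprocess call keeps the check easy to try on sample text before running it on a real runtime.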
Recently, Google Colab has been consuming too much internet data: approximately 4 GB in 6 hours of training for a single notebook. What could be the issue?
Yes, I have the same issue. It normally works fine, but there are sudden spikes in data usage. Check this. In the process it used up 700 MB in just 20 minutes, and I am on mobile internet, so this sometimes creates a problem. I didn't find the answer, but it seems like some kind of synchronization is going on between the browser and the Colab platform.
One thing you could do is open the notebook in Playground mode, as shown in this link: How to remove the autosave option in Colab. The spike happens because Colab saves continuously, which keeps generating network traffic. That becomes a problem when you only have mobile data. So it is safer to open the notebook in Playground mode, where that synchronization does not run.