I have been running a program for AI image training on Google Colaboratory on consecutive days, with a total training time of about 3-4 hours per run. When I tried to execute another program while one was already running, the following message was output:
"Cannot connect to a GPU at this time because the usage limit in Colab has been reached."
Are there any points to watch out for when using Google Colaboratory, to avoid the following symptoms that occurred after the message was output?
I could not reconnect at all for the rest of that day.
From the next day onward, merely connecting to a GPU, without even executing a program, was enough to reach the GPU usage limit again.
Even when I could connect to a GPU, executing a program reached the usage limit in a short time.
I have to run a load test for 32,000 users with a duration of 15 minutes, and I ran it in command-line mode with threads=300, ramp-up=100, loop count=1. But after showing some data it freezes, so I can't get the full report/HTML. I can't even run it for 50 users. How can I get rid of this? Please let me know.
From 0 to 2147483647 threads depending on various factors including but not limited to:
Hardware specifications of the machine where you run JMeter
Operating system limitations (if any) of the machine where you run JMeter
JMeter Configuration
The nature of your test (protocol(s) in use, the size of request/response, presence of pre/post processors, assertions and listeners)
Application response time
Phase of the moon
etc.
There is no answer like "on my MacBook I can have about 3000 threads" because it varies from test to test: for GET requests returning a small amount of data the number will be higher; for POST requests uploading huge files and getting huge responses it will be lower.
The approach is the following:
Make sure to follow JMeter Best Practices
Set up monitoring of the machine where you run JMeter (CPU, RAM, Swap usage, etc.), if you don't have a better idea you can go for JMeter PerfMon Plugin
Start your test with 1 user and gradually increase the load while watching resource consumption
When consumption of any monitored resource exceeds a reasonable threshold, e.g. 80% of maximum available capacity, stop your test and see how many users were online at that point. That is how many users you can simulate from this particular machine for this particular test.
For another machine or another test, repeat from the beginning.
Most probably, for 32000 users you will have to go for distributed testing
If your test "hangs" even for a smaller number of users (300 can be simulated even with default JMeter settings, and maybe even in GUI mode):
take a look at jmeter.log file
take a thread dump and see what threads are doing
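The steps above can be sketched on the command line; the file names below are placeholders for your own test plan and output paths:

```shell
# Run the plan in non-GUI mode (recommended for load tests) and
# generate the HTML dashboard in one pass:
jmeter -n -t test.jmx -l results.jtl -e -o report/

# If the run appears to hang, locate the JMeter JVM and capture a
# thread dump to see what the threads are doing (standard JDK tools):
jps -l                               # find the ApacheJMeter.jar pid
jstack <pid> > jmeter-threads.txt    # inspect for blocked/waiting threads
```

Replace `<pid>` with the process id printed by `jps`; the dump shows each thread's state and stack, which usually reveals what the stuck run is waiting on.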
I've looked at some questions, and mostly they discussed dataset uploading, but one did state that Google Colab only uses our internet connection to run the code. I am confused by this: does this mean our internet speed also affects the training time of our model? Or, once we run our code, does the Google server take care of it without needing our connection?
I think yes, Google Colab's speed is affected by your Internet connection. I'm not absolutely sure why, but you can check in this link. Obviously, when the model is being trained, Internet data usage rises considerably. My guess is that our computer, as the client, needs to save some hidden internal information about the running state. The Colab server has to send this information every time a line of code is executed, and the next line can only be executed once this information reaches the client. So if the Internet connection is slow, it takes more time for the client to receive the information, and the whole process slows down.
There is another piece of evidence that the Internet connection really affects Google Colab's speed: on the coffee shop's Internet, which is significantly faster than mine at home, the same block of code executes more than twice as fast as over my home wifi.
Do I need to remain connected to Colab and the internet while training a dataset model (Darknet) for object detection on Google Colab? The training is running on Colab and connected to my Drive, so the weight files will be saved in my Google Drive folder. Can I disconnect my internet and exit Colab?
This is the last screen shown when I started my training process on Colab, and now I am waiting for my weight files to be saved in my Drive.
You absolutely must remain connected to Colab to ensure that your code continues to run. Just because your weight files are being saved to Google Drive does not mean that you can disconnect or close the browser and it will continue to run. Keep in mind that Google Drive is only mounted for storage space and is not an alternative to an active Colab session.
However, if your Colab session gets disconnected suddenly due to internet/server issues, it will automatically try to reconnect after a short while and continue execution from the interruption point (if you're back online). If the time to reconnect after the timeout is too long, you have to run all cells from the start, and it can't continue its most recent operation.
Note that this does not apply if you have exhausted the permitted usage limit for a given time (supposed to be 12 hrs). In this case, you may have to wait many hours before being allowed to use Colab again.
I am running managed instance groups whose overall CPU usage is always below 30%, but if I check the instances individually, I find some running above 70% and others as low as 15%.
Keep in mind that Managed Instance Groups don't consider individual instances when deciding whether a machine should be removed from the pool. GCP's MIGs keep a running average of the last 10 minutes of activity of all instances in the group and use that metric to make scaling decisions. You can find more details here.
Identifying instances with lower CPU usage than the group doesn't seem like the right goal here; instead, I would suggest focusing on why some machines sit at 15% usage while others are at 70%. How is work distributed to your instances? Are you using the right load-balancing strategy for your workload?
Maybe your application has specific endpoints that cause large amounts of CPU usage while the majority are basic CRUD operations; one machine generating a report and showing higher usage is fine. If all instances render HTML pages from templates and return the results, one machine performing much less work than the others is a distribution issue. Maybe you're using an RPS algorithm when you want a CPU-utilization one.
In your use case, the best option is to create an alert notification that fires when an instance goes over the desired CPU usage. Once you receive the notification, you can manually delete the VM instance. As it is part of the Managed Instance Group, the VM instance will be recreated automatically.
I have attached an article on how to create an alert notification here.
There is no metric within Stackdriver that will call the GCE API to delete a VM instance.
There is currently no such automation in place. It shouldn't be too difficult to implement yourself, though. You can write a small script that runs on all your machines (started from cron or similar) and monitors CPU usage. If it decides usage is too low, the instance can delete itself from the MIG (you can use e.g. gcloud compute instance-groups managed delete-instances --instances ).
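A minimal sketch of such a script, assuming a Linux VM; the MIG name, zone, and threshold are made-up values you would replace, and the final command is echoed as a dry run:

```shell
#!/bin/bash
# Sketch: if the 5-minute load average per vCPU is below a threshold,
# remove this instance from its MIG. All names here are assumptions.
THRESHOLD=15          # percent
MIG="my-mig"          # hypothetical group name
ZONE="us-central1-a"  # hypothetical zone

CORES=$(nproc)
LOAD5=$(cut -d' ' -f2 /proc/loadavg)   # 5-minute load average

# Express load as an approximate CPU percentage across all cores
USAGE=$(awk -v load="$LOAD5" -v cores="$CORES" \
    'BEGIN { printf "%d", load / cores * 100 }')

if [ "$USAGE" -lt "$THRESHOLD" ]; then
    # The instance deletes itself; the MIG then recreates a replacement.
    # (echoed as a dry run -- drop the echo to actually perform it)
    echo gcloud compute instance-groups managed delete-instances "$MIG" \
        --zone "$ZONE" --instances "$(hostname)"
fi
```

Note that load average is only a rough proxy for CPU utilization; for a production check you would query the same Stackdriver metric the MIG autoscaler uses.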
On a Compute Engine VM in us-west1-b, I run 16 vCPUs at near 99% usage. After a few hours, the VM crashes on its own. This is not a one-time incident, and I have to restart the VM manually.
There are a few instances of CPU usage suddenly dropping to around 30%, then bouncing back to 99%.
There are no logs for the VM at the time of the crash. Is there any other way to get the error logs?
How do I prevent VMs from crashing?
CPU usage graph
This could be your process manager killing processes that have run out of resources. You might want to look into kernel tuning, where you can increase the limits on the number of active processes on your VM/OS and the resources available to them. Or you can try a bigger machine with more physical resources. In short, your machine is falling short on resources, and to keep the OS up, the process manager shuts down processes; SSH is one of those processes. Once you reset the machine, everything comes back to normal.
How the process manager/kernel decides to kill a process varies in many ways. It could simply be that a process has stayed up long enough to consume too many resources. Also note that the OS images used to create a VM on GCP are custom-hardened by Google to limit the malicious capabilities of processes running on such machines.
One of the best ways to tackle this is:
increase the resources of your VM
then go back to the code and find out whether something in the process is leaking memory or other resources
if all else fails, you might want to do some kernel tuning to give your processes higher priority than other system processes. This is risky, though, since you could end up creating a zombie VM.
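Before tuning anything, it may help to check the current limits and whether the kernel OOM killer is the culprit; these are read-only checks on a Linux VM:

```shell
# Per-user limits for the current shell (values vary by distro):
ulimit -u    # max user processes
ulimit -v    # max virtual memory in KB, often "unlimited"

# Kernel-wide ceiling on process/thread IDs:
cat /proc/sys/kernel/pid_max

# If a busy VM "crashes" without application logs, the OOM killer is a
# common culprit -- its activity shows up in the kernel ring buffer
# (reading dmesg may require root on some systems):
dmesg | grep -i -E 'out of memory|oom-killer' || echo "no OOM events found"
```

If the OOM killer appears in `dmesg`, the fix is reducing memory pressure (smaller workload or bigger machine) rather than raising process limits.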