Google Earth Engine: how to resume a specific failed batch - batch-processing

I'm using Google Earth Engine for the first time, via this repo:
https://github.com/kratzert/Caravan/blob/main/code/Caravan_part1_Earth_Engine.ipynb
Everything was working well, but at one point I forgot to download the finished batches to my PC to free up some space on Google Drive.
So 2 batches failed due to a lack of storage space.
Can I just resume these 2 specific failed batches?
Or do I have to run the code again to download all batches (7 days)?
If I run it again, can I stop the process once the missing batches are done and reuse the other batches from the first try?

It's generally good practice to chunk out larger tasks so that you can pick up where you left off and avoid the mysterious failures that seem common with massive requests.
That's been my experience, anyway.

There is no way to resume a task.
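Since a failed Earth Engine task cannot be resumed, the practical route is to free up Drive space and re-submit the exports for just the two failed batches; the outputs from the batches that succeeded on the first run remain valid and only need to be downloaded. As a minimal sketch using the Earth Engine Python API (the batch naming is whatever the notebook's export loop uses), you could first list which tasks actually failed and then re-run that export loop for only those batch numbers:

import ee

ee.Initialize()

# List recent Earth Engine tasks and print the ones that failed (for example
# because Google Drive ran out of space), so only those batches need re-exporting.
for task in ee.batch.Task.list():
    status = task.status()
    if status.get("state") == "FAILED":
        print(status.get("description"), "-", status.get("error_message"))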

Related

Is Google Colab training speed affected by our internet connection?

I've looked at some questions, and mostly they discussed dataset uploading, but one did state that Google Colab only uses our internet connection to run the code. I am confused by this: does this mean our internet speed also affects the training time of our model? Or is it that, once we run our code, the Google server takes care of it and does not need our connection?
I think yes, Google Colab's speed is affected by our internet connection. I'm not absolutely sure why, but you can check in this link. Obviously, when the model is being trained, internet data usage rises considerably. My guess is that our computer, as a client, needs to save some hidden internal information relevant to the running state. The Colab server therefore has to send this information every time a line of code is executed, and the next line can only be executed once this information reaches the client. So if the internet connection is slow, it takes more time for the client to receive the information and the whole process slows down.
There is also another way to check whether the internet connection really affects Google Colab's speed. On a coffee shop's internet connection, which is significantly faster than the one at my house, the same block of code executes more than twice as fast as on my home wifi.

Data not showing up intermittently on the OpenTSDB UI

We are running some high-volume tests by pushing metrics to OpenTSDB (2.3.0) with Bigtable, and a curious problem surfaces from time to time. For some metrics, an hour of data stops showing up on the web UI when we run a query. The span of "missing" data is very clear-cut and aligns with hour boundaries (UTC). After a while, rerunning the same query, the data shows up. There does not seem to be any pattern that we can deduce here, other than the hour-long span. Any pointers on what to look for to debug this?
How long do you have to wait before the data shows up? Is it always the most recent hour that is missing?
Have you tried using the OpenTSDB CLI when this is happening and issuing a scan to see if the data is available that way?
http://opentsdb.net/docs/build/html/user_guide/cli/scan.html
You could also check via an HBase shell scan to see if you can get the raw data that way (here's information on how it's stored in HBase):
http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html
If you can verify the data is there then it seems likely to be a web UI problem. If not, the next likely culprit is something getting backed up in the write pipeline.
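As a complement to the CLI scan, one way to check whether the data points exist independently of the web UI is to query OpenTSDB's HTTP /api/query endpoint directly. A minimal sketch in Python; the host, time range, metric name, and tags are placeholders and would need to match the affected metric and UTC hour:

import json
import urllib.request

# Query the hour that appears empty in the UI (all values below are placeholders).
query = {
    "start": "2016/08/01-14:00:00",
    "end": "2016/08/01-15:00:00",
    "queries": [{"aggregator": "sum", "metric": "my.metric.name", "tags": {}}],
}

req = urllib.request.Request(
    "http://opentsdb-host:4242/api/query",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    series_list = json.loads(resp.read())

# Each series carries a "dps" map of timestamp -> value; if points come back
# here while the UI shows a gap, the problem is likely on the rendering side.
for series in series_list:
    print(series["metric"], len(series.get("dps", {})), "data points")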
I am not aware of any particular issue in the Google Cloud Bigtable backend layer that would cause this behavior, but I believe some folks have encountered issues with OpenTSDB compactions during periods of high load that result in degraded performance.
It's worth checking in the Google Cloud Console to see if there are any outliers in the latency, CPU, or throughput graphs that correlate with the times during which you experience the issue.

Google Cloud ML, extend previous run of hyperparameter tuning

I am running hyperparameter tuning using Google Cloud ML. I am wondering if it is possible to benefit from (possibly partial) previous runs.
One application would be:
I launch a hyperparameter tuning job
I stop it because I want to change the type of cluster I am using
I want to restart my hypertune job on a new cluster, but I want to benefit from the previous runs I already paid for.
Or another application:
I launch a hypertune campaign
I want to extend the number of trials afterwards, without starting from scratch
And then, for instance, I want to remove one degree of freedom (e.g. training_rate), focusing on the other parameters
Basically, what I need is "how can I have a checkpoint for hypertune?"
Thanks!
Yes, this is an interesting workflow -- it's not exactly possible with the current set of APIs, so it's something we'll need to consider in future planning.
However, I wonder if there are some workarounds that could approximate your intended workflow right now.
Start with a higher number of trials, given that you can cancel a job but not extend one.
Finish a training job early based on some external input -- e.g. once you've arrived at a fixed training_rate, you could record that in a file in GCS and mark subsequent trials with a different training rate as infeasible, so those trials end fast.
To go further, e.g. to launch another job (to add runs, or change the scale tier), you could potentially use the same output directory and, this time, look up previous results for a given set of hyperparameters together with an objective metric (you'll need to record them somewhere you can look them up -- e.g. create GCS files to track the trial runs), so that the particular trial completes early and training moves on to the next trial. Essentially rolling your own "checkpoint for hypertune".
As I mentioned, all of these are workarounds, and exploratory thoughts on what might be possible from your end with current capabilities.
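A rough sketch of the "roll your own checkpoint for hypertune" idea from the last workaround, in Python. Everything specific here is an assumption (the cache bucket, the GCS file layout, the hyperparameter names); the point is just to look up a previously recorded objective for the current hyperparameter combination before training and to finish the trial immediately when one is found:

import argparse
import hashlib
import json

from google.cloud import storage

CACHE_BUCKET = "my-hypertune-cache"   # placeholder bucket name

def _cache_blob(params):
    # GCS blob keyed by a hash of the hyperparameter combination.
    key = hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()
    return storage.Client().bucket(CACHE_BUCKET).blob("hypertune_cache/%s.json" % key)

def lookup_objective(params):
    # Return the objective recorded for these params in a previous job, or None.
    blob = _cache_blob(params)
    return json.loads(blob.download_as_string())["objective"] if blob.exists() else None

def record_objective(params, objective):
    _cache_blob(params).upload_from_string(
        json.dumps({"params": params, "objective": objective}))

parser = argparse.ArgumentParser()
parser.add_argument("--training_rate", type=float)
parser.add_argument("--hidden_units", type=int)
args = parser.parse_args()
params = vars(args)

previous = lookup_objective(params)
if previous is not None:
    # Report `previous` through whatever channel the trainer already uses for the
    # tuning objective (e.g. the summary the service reads), then exit, so the
    # trial completes almost instantly instead of retraining.
    print("Already evaluated in a previous job, objective:", previous)
else:
    # ... run the real training here and compute the tuning objective ...
    objective = 0.0   # placeholder value
    record_objective(params, objective)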

Getting "Backend error. Job aborted" while trying to export a big query table to GCS

For the past couple of weeks I have been continuously getting a "Backend error. Job aborted" error while trying to export a BigQuery table to Google Cloud Storage in CSV format.
The table was created using a bq select * statement (with the allowLargeResults option).
The target bucket name doesn't seem to be the problem either.
Here's a sample extract.
Errors:
Backend error. Job aborted.
Job ID: kiwiup.com:kiwi-bigquery:job_mk90xJqtyinbzRqIfWVjM2mHLP0
Start Time: 2:53pm, 8 Aug 2014
End Time: 8:53pm, 8 Aug 2014
The job runs for almost six hours and then fails. Previously it used to complete in a couple of minutes. Any help would be appreciated.
Your export job hit a timeout. We're currently investigating why; the date of your job coincides with a bandwidth issue we were having that should have been resolved. We're currently adding more instrumentation and monitoring so it will be easier to debug in the future.
As a workaround, if you give multiple extraction URI patterns, BigQuery will spin up more workers in parallel. See the "Multiple Wildcard URIs" example here.
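A rough sketch of what "multiple extraction URI patterns" looks like with the BigQuery Python client; the project, dataset, table, and bucket names below are placeholders (and see the next answer for when this actually helps):

from google.cloud import bigquery

client = bigquery.Client()

# Several wildcard patterns let BigQuery shard the extract across more workers.
destination_uris = [
    "gs://my-export-bucket/my_table-1-*.csv",
    "gs://my-export-bucket/my_table-2-*.csv",
    "gs://my-export-bucket/my_table-3-*.csv",
]

extract_job = client.extract_table("my-project.my_dataset.my_table", destination_uris)
extract_job.result()   # block until the export finishes (or raise on failure)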
As Jordan said, this coincided with a bandwidth problem. Sorry for the inconvenience.
In some cases, giving multiple wildcard URIs will increase parallelism, but this applies only to fairly large (tens of GB) tables, and it can actually decrease parallelism. Multiple wildcard URIs are designed to support Hadoop jobs, not to control parallelism.

Will doing fork multiple times affect performance?

I need to read log files (.CSV) using FasterCSV and save their contents in a DB (each cell value is a record). The thing is, there are around 20-25 log files which have to be read daily, and those log files are really large (each CSV file is more than 7 MB). I had forked the reading process so that the user doesn't have to wait a long time, but reading 20-25 files of that size still takes time (more than 2 hours). Now I want to fork the reading of each file, i.e. there will be around 20-25 child processes getting created. My question is: can I do that? If yes, will it affect performance, and can FasterCSV handle this?
ex:
@reports.each do |report|
  pid = fork do
    # read the CSV log file and insert its rows into the database here
  end
  Process.detach(pid)   # detach so the finished child doesn't become a zombie
end
PS: I'm using Rails 3.0.7, and this will run on a server on an Amazon large instance (7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform).
If the storage is all local (and I'm not sure you can really say that if you're in the cloud), then forking isn't likely to provide a speedup, because the slowest part of the operation is going to be disc I/O (unless you're doing serious computation on your data). Hitting the disc from several processes at once isn't going to speed that up, though I suppose if the disc had a big cache it might help a bit.
Also, 7 MB of CSV data isn't really that much - you might get a better speedup if you found a quicker way to insert the data. Some databases provide a bulk load function where you can load in formatted data directly, or you could turn each row into an INSERT and feed that straight into the database. I don't know how you're doing it at the moment, so these are just guesses.
Of course, having said all that, the only way to be sure is to try it!