Google Cloud ML, extend previous run of hyperparameter tuning - tensorflow

I am running hyperparameter tuning using Google Cloud ML. I am wondering if it is possible to benefit from (possibly partial) previous runs.
One application would be:
I launch a hyperparameter tuning job
I stop it because I want to change the type of cluster I am using
I want to restart my hypertune job on a new cluster, but I want to benefit from the previous runs I have already paid for.
or another application:
I launch a hypertune campaign
I want to extend the number of trials afterwards, without starting from scratch
and then, for instance, I want to remove one degree of freedom (e.g. training_rate), focusing on the other parameters
Basically, what I need is: "how can I have a checkpoint for hypertune?"
Thanks!

Yes, this is an interesting workflow -- it's not exactly possible with the current set of APIs, so it's something we'll need to consider in future planning.
However, I wonder if there are some workarounds that could approximate your intended workflow right now.
Start with a higher number of trials, given that you can cancel a job but not extend one.
Finish a training job early based on some external input: e.g. once you've arrived at a fixed training_rate, you could record that in a file in GCS and mark subsequent trials with a different training rate as infeasible, so those trials end fast.
To go further, e.g. to launch another job (to add trials, or to change the scale tier), you could try using the same output directory and, this time, look up previous results for a given set of hyperparameters together with their objective metric (you'll need to record them somewhere you can look them up, e.g. GCS files that track the trial runs), so that a repeated trial completes early and training moves on to the next trial. Essentially you'd be rolling your own "checkpoint for hypertune"; a sketch of this idea is shown below.
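To make that last idea concrete, here is a minimal sketch (not an official API) of caching each trial's objective in GCS, keyed by its hyperparameters, so that a repeated combination finishes immediately in a later job. The bucket name, prefix, and the train_fn callable are placeholders for your own setup, and it assumes the google-cloud-storage client library is installed.

```python
# Rough sketch of a self-made "hypertune checkpoint": cache each trial's
# objective in GCS so a repeated hyperparameter combination completes fast.
import hashlib
import json

from google.cloud import storage

BUCKET = "my-hypertune-bucket"   # assumption: a bucket you own
PREFIX = "hypertune-cache"       # assumption: where cached trial results live


def _trial_blob(client, params):
    """Blob whose name is a stable hash of the hyperparameter combination."""
    key = hashlib.sha1(json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()
    return client.bucket(BUCKET).blob("%s/%s.json" % (PREFIX, key))


def run_trial(params, train_fn):
    """Return the objective for `params`, reusing a previous job's result if present."""
    client = storage.Client()
    blob = _trial_blob(client, params)
    if blob.exists():
        # Already evaluated in an earlier job: skip training so the trial ends fast.
        return json.loads(blob.download_as_string())["objective"]
    objective = train_fn(params)  # your real training loop goes here
    blob.upload_from_string(json.dumps({"params": params, "objective": objective}))
    return objective
```

You would call run_trial() with the hyperparameters the service passes to each trial and report the returned objective the same way you do today; the tuning job still runs every trial, but repeated combinations complete in seconds.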
As I mentioned, all of these are workarounds and exploratory thoughts on what might be possible on your end with current capabilities.

Related

SSAS Process Default behavior

I'm trying to make sense of Process Default behavior on SSAS 2017 Enterprise Edition.
My cube is processed daily in this standard sequence:
Loop through 30 dimensions, performing Process Add or Process Update as required.
Process approximately 80 partitions for the previous day.
Exec a Process Default as the final step.
Everything works just fine and, for the amount of data involved, performs really well. However, I have observed that after the Process Default completes, if I re-run the Process Default step manually (with no other activity having occurred whatsoever), it takes exactly the same time as the first run.
My understanding was that this step basically scans the cube looking for unprocessed objects and will process any objects found to be unprocessed. Given the flow of dimension processing, and subsequent partition processing, I'd certainly expect some objects to be unprocessed on the first run - particularly aggregations and indexes.
The end to end processing time is around 65 mins, but 10 mins of this is the final process default step.
What would explain this is if the Process Default isn't actually finding anything to do, and the elapsed time is just the cost of scanning the metadata. Firstly, that seems an excessive amount of time; but also, if I don't run the step, the cube doesn't come online, which suggests it is definitely doing something.
I've had a trawl through Profiler to try to find events to capture what process default is doing, but I'm not able to find anything that would capture the event specifically. I've also monitored the server performance during the step, and nothing is under any real load.
Any suggestions or clarifications?

How to delete an instance if CPU is low?

I am running managed instance groups whose overall CPU usage is always below 30%, but if I check the instances individually, I find some running above 70% and others running as low as 15%.
Keep in mind that Managed Instance Groups don't consider individual instances when deciding whether a machine should be removed from the pool. GCP's MIGs keep a running average of the last 10 minutes of activity of all instances in the group and use that metric to make scaling decisions. You can find more details here.
Identifying instances with lower CPU usage than the group doesn't seem like the right goal here; instead, I would suggest focusing on why some machines are at 15% usage while others are at 70%. How is work distributed to your instances? Are you using the correct load-balancing strategy for your workload?
Maybe your application has specific endpoints that cause large amounts of CPU usage while the majority of requests are basic CRUD operations; in that case, one machine generating a report and showing higher usage is fine. If all instances render HTML pages from templates and return the results, one machine performing much less work than the others is a distribution issue. Maybe you're using an RPS (requests-per-second) balancing algorithm when you want a CPU-utilization one.
In your use case, the best option is to create an alerting policy that notifies you when an instance goes over the desired CPU usage. Once you receive the notification, you can then manually delete the VM instance. As it is part of the Managed Instance Group, the VM instance will automatically be recreated.
I have attached an article on how to create an alert notification here.
There is no metric within Stackdriver that will call the GCE API to delete a VM instance.
There is currently no such automation in place. It shouldn't be too difficult to implement it yourself, though. You can write a small script that runs on all your machines (started from cron or something) and monitors CPU usage. If it decides usage is too low, the instance can delete itself from the MIG (you can use e.g. gcloud compute instance-groups managed delete-instances --instances); a rough sketch is shown below.
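For example, such a script (run from cron on each VM) could look like the following. It assumes psutil and the gcloud CLI are installed on the instance, that the VM's service account is allowed to modify the (zonal) instance group, and MIG_NAME / CPU_THRESHOLD are values you would pick yourself.

```python
# Rough sketch of a self-deleting instance, e.g. run from cron on each VM.
import subprocess
import urllib.request

import psutil

MIG_NAME = "my-mig"     # assumption: the managed instance group's name
CPU_THRESHOLD = 20.0    # assumption: percent below which this VM removes itself


def metadata(path):
    """Read a value from the GCE metadata server."""
    req = urllib.request.Request(
        "http://metadata.google.internal/computeMetadata/v1/" + path,
        headers={"Metadata-Flavor": "Google"},
    )
    return urllib.request.urlopen(req).read().decode()


def main():
    # Sample CPU over a full minute so a momentary dip doesn't trigger deletion.
    usage = psutil.cpu_percent(interval=60)
    if usage >= CPU_THRESHOLD:
        return
    name = metadata("instance/name")
    zone = metadata("instance/zone").rsplit("/", 1)[-1]
    # Ask the MIG to remove this instance (the command mentioned above).
    subprocess.run(
        ["gcloud", "compute", "instance-groups", "managed", "delete-instances",
         MIG_NAME, "--instances", name, "--zone", zone],
        check=True,
    )


if __name__ == "__main__":
    main()
```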

How to get each step logged in Pentaho 8.0

How can I get details for each step in each transformation inside a job logged separately, so I can check each step's runtime? I would like to get the steps in order, but it would also work for me if I could just get a timestamp for each record written to the database. The problem is that the transformation runs in parallel and writes all the data in one go, which means all steps end up with the same datetime. Is there any way to achieve this? Otherwise I am not able to identify bottlenecks, performance problems, etc.
On the transformation properties, every transformation has a "Logging" tab where you can specify a DB connection to store the logging data/metadata, as well as the fields you want to save.
Regarding the timestamp associated with the execution time: yes, transformation steps run in parallel, and that's the way a transformation is designed to work. You can alter that behavior by placing blocking steps, but that will result in a performance loss.
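If it helps, here is a minimal sketch of reading such a logging table back to compare step results. It assumes you pointed the transformation's step log table at a table named TRANS_STEP_LOG and left the default field names (TRANSNAME, STEPNAME, LINES_WRITTEN, ERRORS, LOG_DATE) enabled in the Logging tab; psycopg2/PostgreSQL and the credentials are also just example choices, so adjust everything to your own setup.

```python
# Minimal sketch: read the PDI step log table back to inspect per-step results.
# TRANS_STEP_LOG and the column names are assumptions based on the defaults
# offered in the transformation's Logging tab; adjust them to what you enabled.
import psycopg2  # example driver; use whichever DB-API driver matches your logging DB

conn = psycopg2.connect(host="localhost", dbname="pdi_logs",
                        user="pentaho", password="secret")  # placeholder credentials

with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT TRANSNAME, STEPNAME, LINES_WRITTEN, ERRORS, LOG_DATE
        FROM TRANS_STEP_LOG
        ORDER BY LOG_DATE, STEPNAME
        """
    )
    for transname, stepname, lines_written, errors, log_date in cur.fetchall():
        print(transname, stepname, lines_written, errors, log_date)
```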

Data not showing up intermittently on the OpenTSDB UI

We are running some high volume tests by pushing metrics to OpenTSDB (2.3.0) with BigTable, and a curious problem surfaces from time to time. For some metrics, an hour of data stops showing up on the web UI when we run a query. The span of "missing" data is very clear-cut and borders on the hour (UTC). After a while, when rerunning the same query, the data shows up. There does not seem to be any pattern that we can deduce here, other than the hour span. Any pointers on what to look for to debug this?
How long do you have to wait before the data shows up? Is it always the most recent hour that is missing?
Have you tried using OpenTSDB CLI when this is happening and issuing a scan to see if the data is available that way?
http://opentsdb.net/docs/build/html/user_guide/cli/scan.html
You could also check via an HBase shell scan to see if you can get the raw data that way (here's information on how it's stored in HBase):
http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html
If you can verify the data is there then it seems likely to be a web UI problem. If not, the next likely culprit is something getting backed up in the write pipeline.
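Besides the CLI scan and the HBase shell, one more way to check outside the web UI is to query the TSD's HTTP /api/query endpoint directly for the missing hour. Here is a rough sketch; the host, metric name and time range are placeholders you'd replace with your own values.

```python
# Rough sketch: ask OpenTSDB's HTTP API for the "missing" hour to see whether
# the data is readable outside the web UI.
import json
import urllib.request

TSDB = "http://opentsdb.example.com:4242"   # assumption: your TSD address
METRIC = "my.app.metric"                    # assumption: a metric showing the gap

query = {
    "start": "2018/01/15-10:00:00",  # placeholder: start of the missing hour (UTC)
    "end": "2018/01/15-11:00:00",    # placeholder: end of the missing hour (UTC)
    "queries": [{"aggregator": "sum", "metric": METRIC, "tags": {}}],
}

req = urllib.request.Request(
    TSDB + "/api/query",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    results = json.loads(resp.read())

# Zero points suggests the data really isn't readable yet; non-zero points
# hint at a problem in the web UI / query path instead.
points = sum(len(series.get("dps", {})) for series in results)
print("data points returned for the missing hour:", points)
```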
I am not aware of any particular issue in the Google Cloud Bigtable backend layer that would cause this behavior, but I believe some folks have encountered issues with OpenTSDB compactions during periods of high load that result in degraded performance.
It's worth checking in the Google Cloud Console to see if there are any outliers in the latency, CPU or throughput graphs that correlate with the times during which you experience the issue.

What is this vague accusation of RRD data loss about?

I want to use CollectD to gather some statistics (about storage) and have Graphite display them nicely. Apparently this can be done either by
having CollectD store the data as RRD files and pointing Graphite at those, or
using a CollectD plugin to push the data to Graphite's Carbon API, which will store the data in a Whisper database (which is similar to RRD but not compatible).
I think I want to go with RRDs, but I found this statement in the Whisper docs that concerns me:
In many cases (depending on configuration) if an update is made to an RRD series but is not followed up by another update soon, the original update will be lost.
Hmmm. That's a bit scary, but the accusation is so vague that I don't know what to make of it. What is the configuration they are talking about, and the situation in which it causes data loss?
My situation is that the metrics data I am gathering will be available in chunks -- periodically I will go get the latest data and make as many entries into the database as there are new samples available. So, for example, I might grab some data and update the database with the values from 3 minutes ago, 2 minutes ago, and 1 minute ago, one right after the other. In fact, I might have dozens of new samples to put in the database at once. Does using RRD this way have anything to do with the Whisper accusation?
NOTE: I do not need to back-fill data; I will always be adding newer data than what has already been stored.
One scenario where I can see this happening is if you have an AVERAGE RRA set up and have the xff (xfiles factor) value set to a low percentage. When the data is consolidated over time, you could end up with an unknown value and "lose" all the data that was averaged. If you are using an RRD for what it was designed for, and have it set up with the proper type and settings, I wouldn't think you will run into a problem.
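To make the xff point concrete, here is a small sketch using the rrdtool CLI from Python. The data source and RRA layout are purely illustrative, not a recommendation for your setup, and it also shows feeding several recent samples in one update call, as in your batch-ingest pattern.

```python
# Rough sketch of the xff scenario, using the rrdtool CLI via subprocess.
import subprocess
import time

RRD = "storage.rrd"  # placeholder file name

# 1-minute step; the second RRA averages 60 primary points into hourly values.
# The "0.5" after AVERAGE is the xfiles factor (xff): up to 50% of the hour may
# be UNKNOWN before the consolidated average itself becomes UNKNOWN. A very low
# xff makes the consolidated value turn UNKNOWN much more easily, which is the
# kind of "lost update" the Whisper docs allude to.
subprocess.run(
    ["rrdtool", "create", RRD, "--step", "60",
     "DS:bytes_used:GAUGE:120:0:U",
     "RRA:AVERAGE:0.5:1:1440",     # 1 day of per-minute values
     "RRA:AVERAGE:0.5:60:720"],    # 30 days of hourly averages
    check=True,
)

# Feeding several recent samples in one go (3, 2 and 1 minutes ago), as in the
# question; rrdtool accepts multiple timestamp:value pairs per update call.
now = int(time.time())
samples = ["%d:%d" % (now - 180, 1000),
           "%d:%d" % (now - 120, 1010),
           "%d:%d" % (now - 60, 1020)]
subprocess.run(["rrdtool", "update", RRD] + samples, check=True)
```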
I would recommend taking an in-depth look at the RRD documentation found HERE to answer questions about how RRDs and RRAs handle the data, and the different storage techniques that are available to you.