I was successful in getting a simple cluster with one of each task type (chief, parameter server, worker, evaluator) to work. I followed the directions on the tf.estimator.train_and_evaluate page and set a stopping condition by setting train_spec.max_steps to 20000. My problem is that this is sufficient to stop the chief and worker, but the parameter server and evaluator keep waiting in a loop. Since I'm running under a batch scheduler and requested a fixed time limit, I have to wait until that limit expires to get my output. Is there any way to signal the parameter server and evaluator to exit once training is done?
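For reference, the setup is roughly the following (a simplified sketch, not the actual model: the input pipeline and estimator below are trivial placeholders, and each task gets its TF_CONFIG from the batch scheduler):

# Simplified sketch of the setup described above; the dataset and LinearRegressor
# are placeholders for the real input pipeline and model.
import tensorflow as tf

def train_input_fn():
    features = {"x": [[1.0], [2.0]]}
    labels = [[1.0], [2.0]]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

def eval_input_fn():
    return tf.data.Dataset.from_tensor_slices(({"x": [[1.0]]}, [[1.0]])).batch(1)

estimator = tf.estimator.LinearRegressor(
    feature_columns=[tf.feature_column.numeric_column("x")],
    model_dir="model_dir",
)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=20000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

# Every task (chief, parameter server, worker, evaluator) runs this with its own
# TF_CONFIG; after max_steps the chief and worker exit, but the parameter server
# and evaluator keep waiting.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)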
Related
I'm running a training job on Google AI Platform, just training a simple tf.Estimator. Is there a way to prevent the whole job from completing while there's still an evaluation task running?
I remember someone using Kubeflow on GCP who needed to use the '--stream-logs' flag when submitting an AI Platform training job with the gcloud command (1). Otherwise, the job would get stopped before completion.
According to the documentation, 'with the --stream-logs flag, the job will continue to run after this command exits and must be cancelled with gcloud ai-platform jobs cancel JOB_ID'.
It is worth giving it a try and checking whether, in your case, this flag also keeps the job running instead of terminating prematurely.
If the issue keeps happening even with the flag enabled, you might want to inspect the job's logs to better understand the root cause of this behaviour.
The task is to run a defined number of transformations (.ktr) in parallel.
Each transformation opens its own database connection to read data.
But there is a limitation on the given user, who is allowed only 5 parallel connections to the DB, and let's assume this cannot be changed.
So when I start the job depicted below, only 5 transformations finish their work successfully, and the other 5 fail with a DB connection error.
I know there is an option to redraw the job scheme to have only 5 parallel sequences, but I don't like this approach, as it requires reimplementation whenever the thread count changes.
Is it possible to configure some kind of pool of executors, so that the Pentaho job understands that even if 10 transformations are provided, only 5 of them (any 5) are processed in parallel at a time?
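Conceptually, what I am after is something like the bounded pool below (a Python sketch only, to illustrate the idea, not Pentaho): 10 tasks are submitted, but no more than 5 ever run, and therefore hold a database connection, at the same time.

# Conceptual sketch only (Python, not Pentaho): a pool capped at 5 workers, so even
# if 10 "transformations" are submitted, at most 5 run (and hold a DB connection)
# at any moment. run_transformation() is just a stand-in for running a .ktr.
from concurrent.futures import ThreadPoolExecutor
import time

def run_transformation(name):
    # stand-in for: open a DB connection, read data, close the connection
    print(name, "started")
    time.sleep(1)
    print(name, "finished")

transformations = ["transformation_%d" % i for i in range(10)]

with ThreadPoolExecutor(max_workers=5) as pool:   # never more than 5 in flight
    list(pool.map(run_transformation, transformations))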
I am assuming that you know the number of parallel database connections available. If you know this, use a switch/case component to select the number of transformations to run in parallel. The second option is to use the Job Executor: in the Job Executor you can set a variable which in turn calls the corresponding job. For example, you call a job using the Job Executor with the value
c:/data-integrator/parallel_job_${5}.kjb, where 5 is the number of connections available,
or
c:/data-integrator/parallel_job_${7}.kjb, where 7 is the number of connections available.
Does this make sense to you?
The concept is the following:
Catch the database connection error when a transformation run is attempted
Wait a couple of seconds
Retry the transformation run
Look at the attached transformation picture. It works for me.
Disadvantages:
A lot of connection errors in the logs, which can be confusing.
The solution could turn into an infinite loop (but it can be amended to avoid that; a rough sketch of the retry loop with a cap follows below)
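To make the retry idea concrete, here is a Python sketch of the same pattern (not Pentaho itself), with a retry cap so it cannot loop forever; run_transformation() is a stand-in for a .ktr run that may fail to obtain a connection.

# Sketch of the catch/wait/retry pattern with a cap, so it cannot loop forever.
import random
import time

def run_transformation():
    # stand-in for a .ktr run; raises when no DB connection slot is free
    if random.random() < 0.5:
        raise RuntimeError("too many connections")

def run_with_retry(max_attempts=10, wait_seconds=2):
    for attempt in range(1, max_attempts + 1):
        try:
            run_transformation()
            return                          # success
        except RuntimeError as err:         # step 1: catch the connection error
            print("attempt %d failed (%s); waiting %ds" % (attempt, err, wait_seconds))
            time.sleep(wait_seconds)        # step 2: wait a couple of seconds
    raise RuntimeError("still failing after %d attempts" % max_attempts)

run_with_retry()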
We were running two TrainingJob instances of type (1) ml.p3.8xlarge and (2) ml.p3.2xlarge.
Each training job is running a custom algorithm with TensorFlow plus a Keras backend.
Instance (1) is running fine, while instance (2), after a reported training time of 1 hour and without any logging in CloudWatch (no text was logged), exits with this error:
Failure reason
CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.
I'm not sure what this message means.
This message means that SageMaker tried to launch the instance, but EC2 did not have enough capacity for this instance type, so after waiting for some time (in this case 1 hour) SageMaker gave up and failed the training job.
For more information about capacity issues on the EC2 side, please visit:
troubleshooting-launch-capacity
To solve this, you can either try running the job with a different instance type, as suggested in the failure reason, or wait a few minutes and then submit your request again, as suggested by EC2.
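If it helps, here is a rough sketch of that advice using boto3 (only an illustration, not an official SageMaker pattern: the image URI, role ARN and S3 path are placeholders to replace with your own job definition). It submits the job, waits for a terminal state, and resubmits on a fallback instance type when the failure reason contains CapacityError.

# Illustration only: submit the training job, and if it fails with a CapacityError,
# resubmit it on the next candidate instance type. Placeholders must be replaced.
import time
import boto3

sm = boto3.client("sagemaker")

def wait_for_job(job_name):
    while True:
        desc = sm.describe_training_job(TrainingJobName=job_name)
        if desc["TrainingJobStatus"] in ("Completed", "Failed", "Stopped"):
            return desc
        time.sleep(60)

def run_with_fallback(base_name, instance_types):
    for i, instance_type in enumerate(instance_types):
        job_name = "%s-%d" % (base_name, i)
        sm.create_training_job(
            TrainingJobName=job_name,
            AlgorithmSpecification={
                "TrainingImage": "<your-training-image-uri>",       # placeholder
                "TrainingInputMode": "File",
            },
            RoleArn="<your-sagemaker-execution-role-arn>",          # placeholder
            OutputDataConfig={"S3OutputPath": "s3://<your-bucket>/output"},
            ResourceConfig={
                "InstanceType": instance_type,
                "InstanceCount": 1,
                "VolumeSizeInGB": 50,
            },
            StoppingCondition={"MaxRuntimeInSeconds": 24 * 3600},
        )
        desc = wait_for_job(job_name)
        if desc["TrainingJobStatus"] == "Completed":
            return desc
        if "CapacityError" not in desc.get("FailureReason", ""):
            raise RuntimeError(desc.get("FailureReason", "training failed"))
        # CapacityError: try the next instance type in the list
    raise RuntimeError("all candidate instance types hit capacity errors")

# e.g. prefer ml.p3.2xlarge and fall back to ml.p3.8xlarge if capacity is unavailable:
# run_with_fallback("my-training-job", ["ml.p3.2xlarge", "ml.p3.8xlarge"])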
I am running hyperparameter tuning using Google Cloud ML. I am wondering if it is possible to benefit from (possibly partial) previous runs.
One application would be:
I launch a hyperparameter tuning job
I stop it because I want to change the type of cluster I am using
I want to restart my hypertune job on a new cluster, but I want to benefit from the previous runs I already paid for.
Or another application:
I launch a hypertune campaign
I want to extend the number of trials afterwards, without starting from scratch
and then, for instance, I want to remove one degree of freedom (e.g. training_rate), focusing on the other parameters
Basically, what I need is "how can I have a checkpoint for hypertune?"
Thanks!
Yes, this is an interesting workflow. It's not exactly possible with the current set of APIs, so it's something we'll need to consider in future planning.
However, there are a few workarounds that might approximate your intended workflow right now.
Start with a higher number of trials, given that you can cancel a job but you cannot extend one.
Finish a training job early based on some external input: e.g. once you've arrived at a fixed training_rate, you could record it in a file in GCS and mark subsequent trials with a different training_rate as infeasible, so those trials end fast.
To go further, e.g. to launch another job (to add runs or change the scale tier), you could try using the same output directory and, this time, look up previous results for a given set of hyperparameters together with their objective metric (you'll need to record them somewhere you can look them up, e.g. create GCS files that track the trial runs), so that a repeated trial completes early and training moves on to the next trial. Essentially, rolling your own "checkpoint for hypertune".
As I mentioned, all of these are workarounds and exploratory thoughts on what might be possible on your end with current capabilities.
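To sketch that last workaround (only an illustration of the idea, not an official API: the bucket name, metric tag and the training function are assumptions; the metric is reported through the cloudml-hypertune package):

# Illustration of "rolling your own checkpoint for hypertune": before training a
# trial, look its hyperparameters up in a GCS results cache; if a previous job
# already evaluated them, report the cached metric and finish the trial early.
# The bucket name, metric tag and train_and_evaluate_model() are assumptions.
import hashlib
import json

from google.cloud import storage
import hypertune   # pip package: cloudml-hypertune

BUCKET = "my-hypertune-cache"            # assumed bucket for cached trial results
cache_bucket = storage.Client().bucket(BUCKET)
ht = hypertune.HyperTune()

def cache_key(params):
    return hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()

def train_and_evaluate_model(params):
    # placeholder for the real training code; returns the objective metric value
    return 0.0

def run_trial(params):
    blob = cache_bucket.blob("trials/%s.json" % cache_key(params))
    if blob.exists():
        # a previous job already ran these hyperparameters: reuse its result
        metric = json.loads(blob.download_as_text())["objective"]
    else:
        metric = train_and_evaluate_model(params)
        blob.upload_from_string(json.dumps({"params": params, "objective": metric}))
    ht.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="objective",
        metric_value=metric,
        global_step=1,
    )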
Currently I can run through my TensorFlow graph correctly, but the running time is longer than I expected, so I'd like to know how to profile the execution time of each node in the graph.
You could probably use fields recorded in step_stats. TimelineTest shows an example of how to get stats about execution.
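For example (a TF 1.x-style sketch; the matmul graph here is just a placeholder), you can collect step_stats through RunMetadata and write a Chrome trace with the Timeline class, then open the resulting file in chrome://tracing to see per-node execution times:

# Collect step_stats via RunMetadata and dump a Chrome trace of per-node timings.
import tensorflow as tf
from tensorflow.python.client import timeline

a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(c, options=run_options, run_metadata=run_metadata)

    # run_metadata.step_stats holds the per-node execution stats mentioned above
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open("timeline.json", "w") as f:
        f.write(trace.generate_chrome_trace_format())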