I have a bunch of summary nodes (scalars, histograms, etc.) that are constantly writing to the log. Checkpointing is not as frequent, and so I often have situations in which I'm recovering from a checkpoint that is earlier than the events that have been written to the log. When I resume from the checkpoint and start writing to the log again, what exactly happens? Do the old events get overwritten? The documentation is not very clear on this. Looking in TensorBoard, it appears as if the "future" events are still there. Ideally I'd like to flush everything ahead of the current global_step and just start over.
TensorBoard does have logic to handle this case - it looks for restart events and tries to purge everything with a global_step greater than the restart step. See this code. If you are still seeing the orphaned events, that means something isn't working - maybe the SessionLog.START event isn't being written when your job restarts from the checkpoint?
Can you create a simple repro of this and file an issue on GitHub?
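If the START event really is missing, one option is to write it yourself right after restoring. A minimal TF 1.x-style sketch (the logdir and step value are placeholders); writing a SessionLog.START event at the restored step is the marker TensorBoard's purge logic looks for:

import tensorflow as tf

# After restoring the checkpoint, emit a START marker at the restored step.
# On reload, TensorBoard should discard events beyond this step.
writer = tf.summary.FileWriter("/tmp/logdir")   # the same logdir the old events live in
restored_step = 1234                            # the global_step the checkpoint was restored to
writer.add_session_log(
    tf.SessionLog(status=tf.SessionLog.START),
    global_step=restored_step)
writer.flush()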
I have the following case:
I'm creating transport documents in a LOOP (using BAPI_CREATE). After this loop, if everything is fine, I call BAPI_TRANSACTION_COMMIT (with wait = 'X').
After that, I do another loop over the created transports to change them. But I cannot always change the first transport (the LAST one created). Could it be that the commit work has not actually completed yet? I used WAIT UP TO 3 SECONDS before the second loop and it worked, but I would like to find out the real problem and how to solve it.
Thanks.
These run in different processes (the update task), even with commit and wait.
Try experimenting with
SET UPDATE TASK LOCAL.
I have read the standard (and the javadoc) but still have some questions.
My use case is simple:
A batchlet fetches data from an external source and acknowledges the data (meaning that the data is deleted from the external source after acknowledgement).
Before acknowledging the data, the batchlet produces relevant output (an in-memory object) that is to be passed to the next chunk-oriented step.
Questions:
1) What is the best practice for passing data between a batchlet and a chunk step?
It seems that I can do that by calling jobContext#setTransientUserData in the batchlet and then accessing that data in my chunk step by calling jobContext#getTransientUserData.
I understand that both jobContext and stepContext are implemented in a thread-local manner.
What worries me here is the "Transient"-part.
What will happen if the batchlet succeeds but my chunk-step fails?
Will the "TransientUserData"-data still be available or will it be gone if the job/step is restarted?
For my use case it is important that the batchlet is run just once.
So even if the job or the chunk step is restarted, it is important that the output data from the successfully run batchlet is preserved - otherwise the batchlet would have to run once more. (I have already acknowledged the data and it is gone, so running the batchlet again would not help me.)
2) Follow-up question
In stepContext there are a couple of methods: getPersistentUserData and setPersistentUserData.
What is these methods' intended usage?
What does the "Persistent"-part refer to?
Are these methods relevant only for partitioning?
Thank you!
/ Daniel
Transient user data is just transient, and will not be available during a job restart. A job restart can happen in a different process or on a different machine, so users cannot count on transient data from a previous run being available at restart.
Step persistent user data is the application data that batch job developers deem necessary to save/persist for the purpose of restarting, monitoring, or auditing. It will be available at restart, but it is typically scoped to the current step (not shared across steps).
From reading your brief description, I get the feeling that your two steps are so tightly coupled that you can almost consider them a single unit of work. You want them to either both succeed or both fail in order to maintain the integrity of your application state. I think that could be the root of the problem.
I am running hyperparameter tuning using Google Cloud ML. I am wondering if it is possible to benefit from (possibly partial) previous runs.
One application would be :
I launch a hyperparameter tuning job
I stop it because I want to change the type of cluster I am using
I want to restart my hypertune job on a new cluster, but I want to benefit from previous runs I already paid for.
Or another application:
I launch a hypertune campaign
I want to extend the number of trials afterwards, without starting from scratch
and then, for instance, I want to remove one degree of freedom (e.g. training_rate), focusing on the other parameters
Basically, what I need is "how can I have a checkpoint for hypertune?"
Thx!
Yes, this is an interesting workflow -- it's not exactly possible with the current set of APIs, so it's something we'll need to consider in future planning.
However, I wonder if there are some workarounds that can pan out to approximate your intended workflow, right now.
Start with a higher number of trials, given that you can cancel a job but not extend one.
Finish a training job early based on some external input - e.g. once you've arrived at a fixed training_rate, you could record that in a file in GCS and mark subsequent trials with a different training_rate as infeasible, so those trials end fast.
To go further, e.g. to launch another job (to add runs or to change the scale tier), you could potentially try using the same output directory, and this time look up previous results for a given set of hyperparameters together with the objective metric (you'll need to record them somewhere you can look them up -- e.g. create GCS files to track the trial runs), so that the particular trial completes early and training moves on to the next trial. Essentially rolling your own "checkpoint for hypertune" (a rough sketch of this follows below).
As I mentioned, all of these are workarounds, and exploratory thoughts on what might be possible from your end with current capabilities.
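For the last workaround, here is a rough sketch of what "rolling your own checkpoint for hypertune" could look like from the trainer's side. Nothing here is a Cloud ML API: the bucket name, JSON record format, and helper names are hypothetical. The idea is simply to record each finished trial's hyperparameters and objective metric in GCS, and to check that record before doing any real work:

import hashlib
import json
from google.cloud import storage

RESULTS_BUCKET = "my-hypertune-results"   # hypothetical bucket for trial records

def _trial_key(params):
    # Stable key for a hyperparameter combination, e.g. {"training_rate": 0.01, "layers": 3}.
    return hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest() + ".json"

def lookup_previous_result(params):
    # Return the stored objective metric for these params, or None if unseen.
    blob = storage.Client().bucket(RESULTS_BUCKET).blob(_trial_key(params))
    if not blob.exists():
        return None
    return json.loads(blob.download_as_string())["metric"]

def record_result(params, metric):
    # Persist the finished trial so a later job can skip it.
    blob = storage.Client().bucket(RESULTS_BUCKET).blob(_trial_key(params))
    blob.upload_from_string(json.dumps({"params": params, "metric": metric}))

# In the trial's entry point: if lookup_previous_result(params) is not None,
# report that metric and exit early; otherwise train normally, then call record_result.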
I have a program where the user inputs data, but every now and then the program crashes for unforeseen reasons. I'm re-coding to fix the errors as they surface, but I would like to prevent the user from losing any unsaved work when the crash occurs.
I've designed a solution that autosaves to a file every hour, but...
Is there a way to save app data to a file before it force closes after the Error Window?
What is the standard method of handling these situations?
Thank you for reading my Question =]
Depending on the type of crash, you may not have any control over it, so your program may not be given a chance to fail gracefully. In that case you need to be logging all "important" user activity as the user works.
If you notice all users suddenly fail on or near Action A, then Action A is most likely the cause of the error. Or it could be a specific time of day when, for example, backups run.
By analyzing the average duration of operations, you may find that duration increases before your app fails. This could lead you to memory leaks or CPU bottlenecks.
Based on what you find, you may add more detailed logging for Action A, or try something else. It's hard to say without knowing the nature of your application.
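To make that concrete, here is a minimal Python sketch of the two ideas combined (the recovery-file format, file names, and what counts as an "important" action are all hypothetical; adapt it to whatever language your app is written in): log each user action as it happens, and install a last-chance handler that tries to save before the process dies.

import logging
import sys

logging.basicConfig(filename="activity.log",
                    format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

def log_action(name, **details):
    # Call this from every "important" user action; crashes can then be
    # correlated with what the user was doing and how long operations took.
    logging.info("action=%s details=%s", name, details)

def save_snapshot(app_state):
    # Hypothetical: serialize unsaved work to a recovery file the app
    # offers to restore on the next start.
    with open("recovery.autosave", "w") as f:
        f.write(repr(app_state))

def install_crash_handler(get_app_state):
    def handler(exc_type, exc, tb):
        logging.critical("unhandled error", exc_info=(exc_type, exc, tb))
        try:
            save_snapshot(get_app_state())   # best effort; may itself fail
        finally:
            sys.__excepthook__(exc_type, exc, tb)
    sys.excepthook = handler

Note that a hard crash (segfault, out-of-memory kill) bypasses any in-process handler, which is why the activity log still matters even with a last-chance save in place.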
I've got one specific job that seems to hang my celery workers every so often. I'm using rabbitmq as a broker. I've tried a couple things to fix this, to no avail:
Autoscaling the workers to allow the hung ones plenty of time to finish execution
Setting a global timeout
So I've come up a little short on what's causing this problem, and how I can fix it. Can anyone give me any pointers? The task in question is simply inserting a record into the database (MongoDB in this case).
Update: I've added CELERYD_FORCE_EXECV. We'll see if that fixes it.
Update 2: nope!
A specific job making the child processes hang is often a symptom of IO that never completes, e.g. a web request or socket read without a timeout.
Most libraries support setting a timeout, but if not, you can always use socket.setdefaulttimeout:
import socket

import requests
from celery import Celery

app = Celery("tasks")   # broker configuration omitted

@app.task
def http_get(url, timeout=1.0, retry_after=3.0, max_retries=None):
    # Temporarily lower the global default socket timeout so the request
    # cannot block forever, then restore it afterwards.
    prev_timeout = socket.getdefaulttimeout()
    socket.setdefaulttimeout(timeout)
    try:
        return requests.get(url)
    except (socket.timeout, requests.exceptions.Timeout) as exc:
        # requests may wrap socket timeouts in its own exception type.
        raise http_get.retry(exc=exc, countdown=retry_after, max_retries=max_retries)
    finally:
        socket.setdefaulttimeout(prev_timeout)
You are most likely hitting an infinite loop bug in Celery / Kombu (see https://github.com/celery/celery/issues/3712) that only got fixed very recently. The fix has not made it into a release yet; see https://github.com/celery/kombu/pull/760 for details. If you cannot use a repo build for your installation, a workaround for now is to either switch to Redis or to set CELERY_WORKER_PREFETCH_MULTIPLIER=0 and run the worker with -P solo.
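For reference, a minimal sketch of that workaround (setting name copied from the answer above; double-check it against the config names of the Celery version you actually run, and replace "yourapp" with your own app module):

# celeryconfig.py
CELERY_WORKER_PREFETCH_MULTIPLIER = 0

# start the worker with the solo pool:
#   celery -A yourapp worker -P solo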