How to take a checkpoint at a given tick and then restore using the gem5 Python API? - gem5

I had always done this with the m5 checkpoint m5op + fs.py -r. I have since also learned that fs.py has --take-checkpoints, which can select the tick.
But today I needed to do it for an integration Linux boot test (tests/gem5/fs/linux/arm/run.py) so I could start running closer to the point of interest. I don't want to modify the kernel to add the m5op, and the runner script does not have -r/--take-checkpoint options. I wish this stuff were gem5.opt options available to all runs rather than Python script options, but it isn't.

On gem5 71b450fc46ca5888971acf3160b813bf24784604, the original script does:
m5.instantiate()
exit_event = m5.simulate()
so to take the checkpoint I can hack it to:
m5.instantiate()
# Run up to desired tick.
exit_event = m5.simulate(100000)
m5.checkpoint('m5out/mycpt')
and to restore hack it to:
m5.instantiate('m5out/mycpt')
exit_event = m5.simulate()
# Optionally dump a further checkpoint at the end; m5.checkpoint() requires a directory argument.
m5.checkpoint('m5out/mycpt2')
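For reference, here is a minimal sketch of how both hacks could be wired into the script behind environment variables, so the test file only needs one edit (M5_CPT_TICK and M5_CPT_RESTORE are made-up names, not real gem5 options):
import os
import m5

cpt_dir = 'm5out/mycpt'
take_tick = os.environ.get('M5_CPT_TICK')   # e.g. M5_CPT_TICK=100000 to take a checkpoint at that tick
restore = os.environ.get('M5_CPT_RESTORE')  # e.g. M5_CPT_RESTORE=1 to restore from cpt_dir

if restore:
    # Restore from a previously taken checkpoint.
    m5.instantiate(cpt_dir)
else:
    m5.instantiate()

if take_tick:
    # Run up to the desired tick, then dump the checkpoint.
    exit_event = m5.simulate(int(take_tick))
    m5.checkpoint(cpt_dir)
else:
    exit_event = m5.simulate()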

Related

Ray Tune Stuck with multiple runs

Hi, I am trying hyperparameter optimization with Ray Tune.
Below is my code implementation.
However, I get stuck and can't get the result back, even though there aren't any error messages.
import numpy as np
import ray
from ray import tune

@ray.remote
def main(config, run=0):
    loss = do_something(config, run)  # placeholder for the actual training run
    return loss

def ray_pick_best_hypter(config):
    runs = 10
    loss_avg = np.mean(ray.get([main.remote(config, run=x) for x in range(runs)]))
    tune.report(loss_avg=loss_avg)

config = load_config()
analysis = ray.tune.run(ray_pick_best_hypter, config=config, progress_reporter=reporter)
The below code works fine, but I want to run multiple experiments and get the mean value.
def ray_pick_best_hypter(config):
    loss_avg = ray.get(main.remote(config, run=0))
    tune.report(loss_avg=loss_avg)
What is the problem in the code?
It seems you are starting multiple distributed training processes from within your trainable. Each call to main.remote() will start a new distributed task. Since you're starting 10 of them at the same time, they will try to run in parallel.
However, the default resource allocation for each trial is usually just 1 CPU - so the remote tasks cannot be scheduled.
What you can do to resolve this is to pass resources_per_trial={"cpu": 11} - that way each of your remote tasks will have their own CPU to run on.
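For example, a sketch of that change against the question's code (config, reporter, and ray_pick_best_hypter as defined above):
analysis = ray.tune.run(
    ray_pick_best_hypter,
    config=config,
    progress_reporter=reporter,
    # 1 CPU for the trainable itself plus 10 for the parallel main.remote() tasks
    resources_per_trial={"cpu": 11},
)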

How do you save progress during a Keras Tuner run?

I am currently sifting through a large search space with Keras Tuner on a free Google Colab instance. Because of usage limits, my search run will be interrupted before completion. I would like to save the progress of my search periodically in anticipation of those interruptions, and simply resume from the last checkpoint when the Colab resources become available to me again.
I found documentation on how to save specific models from a run, but I want to save the entire state of the search, including what was already tried and the results of those experiments.
Can I just call Tuner.get_state(), save the result, and then resume from where I left off with Tuner.set_state()? Or is there another way?
You don't need to call tuner.get_state() and tuner.set_state(). While instantiating a Tuner, say a RandomSearch, as mentioned in the example,
# While creating the tuner for the first time
tuner = RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=3,
    executions_per_trial=2,
    directory="my_dir",
    project_name="helloworld",
)
You need to set the directory and project_name arguments. The checkpoints are saved in this directory. You may save this directory as a ZIP file and download it using files.download().
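For instance, a minimal sketch of the download step on the Colab side (the path assumes the directory value from the snippet above):
import shutil
from google.colab import files

# Archive the tuner's directory and download it before the Colab instance is reclaimed.
shutil.make_archive("my_dir", "zip", root_dir=".", base_dir="my_dir")  # -> my_dir.zip
files.download("my_dir.zip")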
When you get a new Colab instance, unzip that archive and restore the my_dir directory. Instantiate a Tuner again using,
# While loading the Tuner
tuner = RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=3,
    executions_per_trial=2,
    directory="my_dir",  # <----- Use the directory as you did earlier
    overwrite=False,  # <-------
    project_name="helloworld",
)
Now start the search and you'll notice that the best params so far haven't changed. Also, tuner.results_summary() shows all search results.
See the documentation for the Tuner class here.

Incorrect Broadcast input array shape error when trying to use Pretraining

I am trying to use spaCy's 'pre-train' feature for an NER task, so here is what I tried doing (I am still trying to get it to work):
Step 1: I started by initializing the model with 'en_core_web_lg'. Next I saved this model to disk and tested its NER capability on a few lines to see if it recognizes the tags in those test lines (and made a note of the ignored tags).
Step 2: Next I created a .jsonl file with new data to train on (about 20 new lines; I wanted to see whether, given new data around an entity (the ignored tags found earlier), the model would be able to correctly identify the tags after transfer learning). Using this .jsonl file and the model I saved earlier, I ran the 'spacy pre-train' command; this created a tok2vec .bin file for me (model999.bin).
Step 3: Then I created a function that takes the location of the earlier saved model (from step 1) and the location of the tok2vec weights (the model999.bin file obtained in step 2). Inside the function it loads the model, creates/gets the pipe, disables the rest of the pipes, and uses (pipe_name).model.tok2vec.from_bytes(file_.read()) to read model999.bin and load the learned vectors into the base model (a sketch is shown below).
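A minimal sketch of what such a function might look like, based purely on the description above (the paths and the pipe name are assumptions):
import spacy

def load_pretrained_tok2vec(model_dir, tok2vec_path, pipe_name="ner"):
    # Load the base model saved in step 1.
    nlp = spacy.load(model_dir)
    # Load the pre-trained tok2vec weights from step 2 into the pipe's model.
    with open(tok2vec_path, "rb") as file_:
        nlp.get_pipe(pipe_name).model.tok2vec.from_bytes(file_.read())
    return nlp

nlp = load_pretrained_tok2vec("saved_base_model", "model_saves/model999.bin")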
But when I run this function, I get this error:
ValueError: could not broadcast input array from shape (96,3,384) into shape (96,3,480)
(I have uploaded the entire notebook here: https://github.com/pratikdk/ner_test/blob/master/base_model_contextual_TF.ipynb).
In order to pre-train I used this command:
python -m spacy pre-train ub.jsonl model_saves w2s
Here are the 20 lines I tried training on top of the base model:
https://github.com/pratikdk/ner_test/blob/master/ub.jsonl
What am I doing wrong here exactly? Please can you also point out the fix; I am sure many would need insight on this.
Environment
Operating System: CentOS
Python Version Used: 3.7.3
spaCy Version Used: 2.1.3
Environment Information: Anaconda Jupyter Lab
So I was able to fix this; the developer (on GitHub) answered my question.
Here is the answer:
https://github.com/explosion/spaCy/issues/3616

How to report algorithm running time?

I am running a variational auto-encoder in TensorFlow, which could take a long time. Thus I want to report the time the algorithm has been running for as a scalar on TensorBoard.
One dirty way is to hard-code the start time of the compilation into a global variable, or pass it as an argument to the model function, and compute the difference from the current time.
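For reference, a minimal sketch of that workaround using the TF 2.x tf.summary API (the log directory and tag names are made up):
import time
import tensorflow as tf

start_time = time.time()  # hard-coded start time, as described above
writer = tf.summary.create_file_writer("logs/runtime")

def log_elapsed_time(step):
    # Report how long the run has been going as a TensorBoard scalar.
    with writer.as_default():
        tf.summary.scalar("elapsed_seconds", time.time() - start_time, step=step)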
Does Tensorflow have a native way to do it?
There is the tf.train.ProfilerHook. Comes with release 1.14.
Example usage:
estimator = tf.estimator.LinearClassifier(...)
hooks = [tf.train.ProfilerHook(output_dir=model_dir, save_secs=600, show_memory=False)]
estimator.train(input_fn=train_input_fn, hooks=hooks)
Executing the hook will generate files timeline-xx.json in output_dir.
Then open chrome://tracing/ in the Chrome browser and load the file. You will get a time usage timeline.

TensorBoard doesn't show all data points

I was running a very long training (reinforcement learning with 20M steps) and writing a summary every 10k steps. Between step 4M and 6M, I saw 2 peaks in my TensorBoard scalar chart for game score, then I let it run and went to sleep. In the morning, it was running at about step 12M, but the peaks between step 4M and 6M that I saw earlier had disappeared from the chart. I tried to zoom in and found out that TensorBoard (randomly?) skipped some of the data points. I also tried to export the data, but some data points, including the peaks, are also missing in the exported .csv.
I looked for answers and found this in TensorFlow github page:
TensorBoard uses reservoir sampling to downsample your data so that it can be loaded into RAM. You can modify the number of elements it will keep per tag in tensorboard/backend/server.py.
Has anyone ever modified this server.py file? Where can I find the file, and if I installed TensorFlow from source, do I have to recompile it after I modify the file?
You don't have to change the source code for this; there is a flag called --samples_per_plugin.
Quoting from the help output:
--samples_per_plugin: An optional comma separated list of plugin_name=num_samples pairs to explicitly
specify how many samples to keep per tag for that plugin. For unspecified plugins, TensorBoard
randomly downsamples logged summaries to reasonable values to prevent out-of-memory errors for long
running jobs. This flag allows fine control over that downsampling. Note that 0 means keep all
samples of that type. For instance, "scalars=500,images=0" keeps 500 scalars and all images. Most
users should not need to set this flag.
(default: '')
So if you want to have a slider of 100 images, use:
tensorboard --samples_per_plugin images=100
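Similarly, for the question's case, to keep every scalar sample (0 means keep all samples of that type, per the help text above):
tensorboard --samples_per_plugin scalars=0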
The comment is out of date - it can actually be modified in tensorboard/backend/application.py, in the DEFAULT_SIZE_GUIDANCE dictionary. By default, it stores 1000 scalars. You can increase that limit arbitrarily, or set it to 0 to store every scalar.
You don't need to recompile TensorBoard, or even download it from source. You could just modify this file in your TensorBoard yourself.
If you install TensorFlow using pip in a virtualenv (Ubuntu, Mac), then within your virtualenv directory the path to application.py should be something like lib/python2.7/site-packages/tensorflow/tensorboard/backend. If you modify that file, you should get the new setting in your TensorBoard (when you run tensorboard in that virtualenv). If you're like me, you'll put a print statement in too, so you can be sure that you're running the modified code :)
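For orientation, the dictionary in question looked roughly like this in TensorBoard builds of that era (key names and default values vary between versions, so check your local copy rather than relying on this sketch):
# tensorboard/backend/application.py (illustrative; exact contents differ by version)
DEFAULT_SIZE_GUIDANCE = {
    event_accumulator.COMPRESSED_HISTOGRAMS: 500,
    event_accumulator.IMAGES: 10,
    event_accumulator.AUDIO: 10,
    event_accumulator.SCALARS: 1000,  # raise this, or set it to 0 to keep every scalar
    event_accumulator.HISTOGRAMS: 50,
}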