How to get the statistics of different checkpoints in separate sections in gem5?

I have created some (say 10) checkpoints at a fixed interval within the ROI of a gem5 simulation of a PARSEC benchmark.
Then I tried restoring the checkpoints with the following command:
./build/ALPHA/gem5.opt configs/example/fs.py -r 1
but in the stats.txt file in the m5out directory, the statistics for all the checkpoints come out in combined form.
Q: How can I get the output for each checkpoint in a separate section, so that the stats for each checkpoint are visible individually?
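If the combined stats.txt uses gem5's usual "Begin/End Simulation Statistics" delimiters between dumps, one option is to split it into per-checkpoint files after the fact. A minimal sketch (not gem5 code); the delimiter pattern and output file names are assumptions:
# split_stats.py: split a combined m5out/stats.txt into one file per stats dump.
# Assumes gem5's default "Begin/End Simulation Statistics" delimiter lines.
import re

def split_stats(path="m5out/stats.txt"):
    with open(path) as f:
        text = f.read()
    # Each dump sits between a Begin and an End delimiter line.
    sections = re.findall(
        r"-+\s*Begin Simulation Statistics\s*-+\n(.*?)\n-+\s*End Simulation Statistics",
        text, flags=re.S)
    for i, body in enumerate(sections, start=1):
        out = "stats_section_%d.txt" % i
        with open(out, "w") as f:
            f.write(body + "\n")
        print("wrote", out)

if __name__ == "__main__":
    split_stats()
Alternatively, restoring each checkpoint in its own gem5 run with a separate output directory (the --outdir option of the gem5 binary) keeps each checkpoint's stats in its own stats.txt from the start.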

Related

I trained a neural network but I cannot find where it was saved; I cannot find any .meta, .index, or .data files

I was trying to follow this page https://www.tensorflow.org/tutorials/sequences/audio_recognition
I successfully executed the following command:
python tensorflow/examples/speech_commands/train.py
I used a virtual environment in Anaconda, with TensorFlow 14 and Python 3.6.
It took about 22 hours to train. It printed "/tmp/speech_commands_train/conv.ckpt-100" after every 100 iterations
(there were 18000 in total),
but now when I try to find conv.ckpt-18000.meta, or even just the speech_commands_train directory, I cannot find it.
I am very new to this. This is my first effort in deep learning.
(Screenshot: how the terminal looked when training ended.)
First, what do you mean by "where it saved"? By "it", do you mean the logs, the trained model, or the weights?
In your case, you are just storing the weights at the given checkpoints, so you can access them at the paths reported during training, as described in the tutorial:
I0730 16:54:41.813438 55030 train.py:252] Saving to "/tmp/speech_commands_train/conv.ckpt-100"
*This is saving out the current trained weights to a checkpoint file. If your training script gets interrupted, you can look for the last saved checkpoint and then restart the script with -*
You can also store logs using a summary FileWriter, and save the model using save_model or a TensorBoard callback with a log directory.
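For reference, here is a minimal TF 1.x-style sketch (not part of the tutorial) for locating and inspecting those conv.ckpt-* checkpoints; the directory is the one printed in the training log, everything else is illustrative:
# Locate and inspect the checkpoints written by train.py (TF 1.x style sketch).
# The directory below is the one the training log reports; note that /tmp is
# often cleared on reboot, which would explain the files going missing.
import tensorflow as tf

ckpt_dir = "/tmp/speech_commands_train"

# latest_checkpoint reads the small "checkpoint" state file in that directory
# and returns a prefix such as ".../conv.ckpt-18000", or None if nothing is there.
latest = tf.train.latest_checkpoint(ckpt_dir)
print("latest checkpoint prefix:", latest)

if latest is not None:
    # On disk the checkpoint is <prefix>.meta, <prefix>.index and
    # <prefix>.data-*; the prefix alone is what you pass to Saver.restore().
    reader = tf.train.NewCheckpointReader(latest)
    for name, shape in reader.get_variable_to_shape_map().items():
        print(name, shape)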

Trained TensorFlow with my own images successfully, but still have problems

I am using Ubuntu 16.04 with a GeForce 1080 GPU (8 GB of GPU memory).
I have properly created the TFRecord files and trained the model successfully. However, I still have two problems.
Here are the steps I followed; please tell me what I am missing:
I used VOCdevkit and properly created two files: pascal_train.record and pascal_val.record.
Then,
1- From this link, I used the raccoon images and placed them in the directory models/object_detection/VOCdevkit/VOC2012/JPEGImages (after I deleted the previous images).
Then I used the raccoon annotations and placed them in the directory models/object_detection/VOCdevkit/VOC2012/Annotations (after I deleted the previous ones).
2- I modified models/object_detection/data/pascal_label_map.pbtxt and wrote a single class name, 'raccoon'.
3- I used ssd_mobilenet_v1_pets.config and modified it: the number of classes is one, and I did not train from scratch; I used ssd_mobilenet_v1_coco_11_06_2017/model.ckpt:
fine_tune_checkpoint: "/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_coco_11_06_2017/model.ckpt"
from_detection_checkpoint: true
4- From this link, I arranged my data structure like this:
models
    1.1 model
        1.1.1 ssd_mobilenet_v1_pets.config
        1.1.2 train
        1.1.3 evaluation
        1.1.4 ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
    1.2 object_detection
        1.2.1 data (contains pascal_train.record, pascal_val.record, and pascal_label_map.pbtxt)
        1.2.2 VOCdevkit
            1.2.2.1 VOC2012
                1.2.2.1.1 JPEGImages (my own images)
                1.2.2.1.2 Annotations (raccoon annotations)
                1.2.2.1.3 ImageSets
                    1.2.2.1.3.1 Main (raccoon_train.txt, raccoon_val.txt, raccoon_train_val.txt)
5- Now, I will train my model
(abdu-py2) jesse#jesse-System-Product-Name:~/abdu-py2/models$ python object_detection/train.py --logtostderr --pipeline_config_path=/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_pets.config --train_dir=/home/jesse/abdu-py2/models/model/train
Everything looks fine; it created many files, such as the checkpoint file and events.out.tfevents.1503337171 (among others), after many thousands of training steps.
However, my two problems are:
1- Based on this link, I cannot run the evaluation script eval.py at the same time as train.py (for memory reasons).
2- I tried to use the events.out.tfevents.1503337171 file created during the training steps, but it seems it was not created correctly.
So I don't know where I went wrong. I think my data structure is not correct; I tried to arrange it based on my own understanding.
Thanks in advance
Edit:
Regarding question 2:
I figured out how to convert the events files and model.ckpt files (created during the training process) to inference_graph_.pb. The inference_graph_.pb can be tested later with object_detection_tutorial.ipynb. In my case I tried it, but I could not detect anything, since I made a mistake somewhere in the train.py process.
The following command converts the trained files to a .pb file:
(abdu-py2) jesse#jesse-System-Product-Name:~/abdu-py2/models$ python object_detection/export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path /home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix /home/jesse/abdu-py2/models/model/train/model.ckpt-27688 \
--output_directory /home/jesse/abdu-py2/models/model
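If the export succeeds, the resulting graph can be loaded for a quick sanity check. A minimal TF 1.x sketch, assuming export_inference_graph.py wrote a frozen_inference_graph.pb into the --output_directory above (adjust the path if your file is named differently):
# Load the exported graph and list a few of its operations (TF 1.x sketch).
# The path below assumes the default frozen_inference_graph.pb output name.
import tensorflow as tf

pb_path = "/home/jesse/abdu-py2/models/model/frozen_inference_graph.pb"

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

# The detection tensors used by object_detection_tutorial.ipynb (image_tensor,
# detection_boxes, detection_scores, ...) should appear among the operations.
for op in graph.get_operations()[:10]:
    print(op.name)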
Question 1 - this is just a problem that you'll encounter because of your hardware. Once you get to a point where you'd like to evaluate the model, just stop your training and run your eval command (it seems as though you've successfully evaluated your model, so you know the command). It will provide some metrics for the most recent model checkpoint. You can iterate through this process until you're comfortable with the performance of your model.
Question 2 - These event files are used as input to TensorBoard. The event files are in binary format and thus are not human-readable. Start a TensorBoard instance while your model is training and/or evaluating. To do so, run something like this:
tensorboard --logdir=train:/home/grasp001/abdu-py2/models/object_detection/train1/train,eval:/home/grasp001/abdu-py2/models/object_detection/train1/eval
Once you have TensorBoard running, use your web browser to navigate to localhost:6006 to check out your metrics. You can use this during training as well, to monitor loss and other metrics for each step of training.
In trainer.py, around line 370, after the session_config is created, limit the GPU memory the process may use:
session_config.gpu_options.per_process_gpu_memory_fraction = 0.5
and then you can run eval.py at the same time. By default, TensorFlow takes all the free GPU memory whether or not it actually needs it.
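For context, the same setting expressed as a standalone session config; a minimal TF 1.x sketch, not the exact code in trainer.py:
# Cap per-process GPU memory so train.py and eval.py can share one GPU
# (TF 1.x sketch, not the exact trainer.py code).
import tensorflow as tf

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
session_config = tf.ConfigProto(gpu_options=gpu_options)

# An alternative is gpu_options.allow_growth = True, which lets the allocation
# grow on demand instead of reserving a fixed fraction up front.
with tf.Session(config=session_config) as sess:
    pass  # build and run the graph here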

Which file to be used for eval step in TEXTSUM?

I am working on the textsum model of TensorFlow, which does text summarization. I was following the commands specified in the README at github/textsum. It says that a file named validation, present in the data folder, is to be used in the eval step, but there was no validation file in the data folder.
I thought I would make one myself, and later realized that it has to be a binary file. So I need to prepare a text file that will be converted to binary.
But that text file has to have a specific format. Will it be the same as that of the file used in the train step? Can I use the same file for the train step and the eval step?
The sequence of steps I followed:
Step 1: Train the model using the vocab file which was mentioned as "updated" for the toy dataset.
Step 2: Training continued for a while and then got "Killed" at running_avg_loss: 3.590769.
Step 3: Using the same data and vocab files for the eval step as had been used for training, I ran eval. It keeps running with running_avg_loss between 6 and 7.
I am doubtful about step 3, i.e. whether the same files should be used or not.
You don't have to run eval unless you are in fact testing your model after you have trained it, to determine how it does against another set of data it has never seen before. I have also been using it to determine whether I am starting to overfit the data.
You will usually take 20-30% of your overall dataset and use it for the eval process. You then go about training against your training data. Once complete, you can run decode right away if you wish, or you can run eval against the 20-30% of the dataset you set aside from the start. Once you feel comfortable with the results, you can then run decode to get the results.
Your binary format should be the same as your training data.
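If you do want a separate eval set rather than reusing the training file, the split itself is simple. A minimal sketch, assuming one example per line in a plain-text file before conversion to binary; all file names here are illustrative:
# Hold out ~20% of the examples for eval before converting each split to the
# binary format the model expects. File names and the one-example-per-line
# layout are illustrative assumptions.
import random

with open("data/all_examples.txt") as f:
    examples = f.readlines()

random.seed(0)
random.shuffle(examples)

split = int(0.8 * len(examples))
with open("data/train_examples.txt", "w") as f:
    f.writelines(examples[:split])
with open("data/eval_examples.txt", "w") as f:
    f.writelines(examples[split:])
# Convert both outputs to the same binary format used for training, then point
# the train step at the first file and the eval step at the second.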

How are weights saved in the CIFAR10 tutorial for tensorflow?

In the TensorFlow tutorial to train a network on CIFAR-10, where and how do they save the weights/parameters between running training and evaluation? I cannot see any files saved to my project directory.
Here are the links to the tutorial and the code:
https://www.tensorflow.org/versions/r0.11/tutorials/deep_cnn/index.html
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/image/cifar10
It saves the logs and checkpoints to the /tmp/ folder by default.
The weights are included in the checkpoint files.
As you can see in both the eval and train scripts, each takes a checkpoint directory as a parameter.
cifar10_train.py:
tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train',
                           """Directory where to write event logs """
                           """and checkpoint.""")
cifar10_eval.py:
tf.app.flags.DEFINE_string('eval_dir', '/tmp/cifar10_eval',
                           """Directory where to write event logs.""")
tf.app.flags.DEFINE_string('eval_data', 'test',
                           """Either 'test' or 'train_eval'.""")
tf.app.flags.DEFINE_string('checkpoint_dir', '/tmp/cifar10_train',
                           """Directory where to read model checkpoints.""")
You can call those scripts with custom values for those flags. For my project using Inception I had to change them, since the main hard drive did not have enough space for the bottlenecks created by Inception.
It may be good practice to explicitly set those values, since the /tmp/ folder is not persistent and you might therefore lose your training data.
The following command will save the training data into a custom folder:
python cifar10_train.py --train_dir="/home/username/train_folder"
and then, to evaluate:
python cifar10_eval.py --checkpoint_dir="/home/username/train_folder"
It also applies to the other examples.
Let's assume you're running cifar10_train; saving happens on this line:
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_train.py#L122
And the default location is defined in this line (it's "/tmp/cifar10_train"):
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_train.py#L51
In cifar10_eval, restoring the weights happens on this line:
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_eval.py#L75
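For reference, the save/restore pattern those two lines implement looks roughly like this; a minimal sketch in the old tf.train.Saver style, not a copy of the tutorial code:
# Rough sketch of the Saver pattern used by cifar10_train.py / cifar10_eval.py.
# Paths, the dummy variable, and the step value are illustrative.
import tensorflow as tf

train_dir = "/tmp/cifar10_train"
tf.gfile.MakeDirs(train_dir)

# A variable so the Saver has something to write; the tutorial saves all of
# the model's variables the same way.
weights = tf.get_variable("weights", shape=[10], initializer=tf.zeros_initializer())
saver = tf.train.Saver()

# Training side: periodically write checkpoints into train_dir.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    step = 1000  # in the tutorial this comes from the training loop
    saver.save(sess, train_dir + "/model.ckpt", global_step=step)

# Eval side: locate and restore the most recent checkpoint from train_dir.
with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state(train_dir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)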

Checkpoint file not found, restoring evaluation graph

I have a model which runs in distributed mode for 4000 steps. Every 120 seconds the accuracies are calculated (as is done in the provided examples). However, at times the last checkpoint file is not found.
Error:
Couldn't match files for checkpoint gs://path-on-gcs/train/model.ckpt-1485
The checkpoint file is present at the location. A local run for 2000 steps runs perfectly.
last_checkpoint = tf.train.latest_checkpoint(train_dir(FLAGS.output_path))
I assume that the checkpoint is still in the process of being saved, so the files have not actually been written yet. I tried introducing a wait before the accuracies are calculated, as shown below. Although this seemed to work at first, the model still failed with a similar issue.
saver.save(session, sv.save_path, global_step)
time.sleep(2) #wait for gcs to be updated
From your comment I think I understand what is going on. I may be wrong.
The cloud_ml distributed sample
https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/mnist/hptuning/trainer/task.py#L426
uses a temporary file by default. As a consequence, it works locally on /tmp. Once the training is complete, it copies the result to gs://, but it does not correct the checkpoint file, which still contains references to the local model files on /tmp. Basically, this is a bug.
In order to avoid this, you should launch the training process with --write_to_tmp 0, or modify the task.py file directly to disable this option. TensorFlow will then work directly on gs://, and the resulting checkpoint will therefore be consistent. At least it worked for me.
One way of checking whether my assumption is correct is to copy the resulting checkpoint file from gs:// to your local filesystem using gsutil and then inspect its contents.
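To do that check programmatically, the small text file named checkpoint can also be read through TensorFlow itself. A minimal sketch, assuming a TensorFlow build that can read gs:// paths directly (otherwise gsutil-copy the file locally and pass the local directory instead):
# Inspect what the 'checkpoint' state file points at. Assumes a TensorFlow
# build with GCS support; otherwise copy the file locally first with gsutil.
import tensorflow as tf

ckpt = tf.train.get_checkpoint_state("gs://path-on-gcs/train")
if ckpt is None:
    print("no checkpoint state found")
else:
    print("latest:", ckpt.model_checkpoint_path)
    for p in ckpt.all_model_checkpoint_paths:
        print("known checkpoint:", p)
    # Paths starting with /tmp/... rather than gs://... indicate the state file
    # was written against the local temporary copy, matching the bug described above.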