TF record corrupted after several successful training epochs

TF record corrupted after several successful training epochs - tensorflow

I was training a neural network and had run over all the training data for several epochs successfully.
However, the tfrecord corrputed error suddenly came out as follows:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/tf_record.py", line 77, in tf_record_iterator
reader.GetNext(status)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 106241330
I checked the data file again and it was indeed corrupted at that line. But the data was intact before I ran the training code and I simply just read the data by following code:
batch_data = []
record_iterator = tf.python_io.tf_record_iterator(path=file, options=options)
for string_record in record_iterator:
example = tf.train.Example()
example.ParseFromString(string_record)
data = generate_data_from_record(example) # record parsing code
batch_data.append(data)
if len(batch_data) == batch_size:
yield batch_data
batch_data = []
I am wondering why the data file was corrupted and how can I remain the integrity of the data file.

You should make a clean copy of your tfrecord files. Whenever your working copy get corrupted, replace from the clean copy. The dataLoss error seems to be as a result of several reading of the same record, and its also dependent on the disk.

If someone is facing this problem, the above answer by #nwoye-cid worked for me plus the link below to install everything properly.
Also, restart your kernel from scratch if nothing works then only go for other solutions.
Link!

Related

OSError: Unable to open file (can't retrieve stat info for file)

I am training a neural network in Keras on a dataset of MRI files in hdf5 format. After running few steps of an epoch successfully, suddenly the above error comes up.
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (can't retrieve stat info for file)
I am running my code on Google Colab, and one observation is that it does not come while using a CPU and the execution speed is slow, it occurs when GPU is being used and execution is faster.
Could anyone please suggest how to debug it?
Following is the last statement of the error -
HDF5: infinite loop closing library
L,T_top,P,P,Z,FD,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL

SentencePiece in Google Colab

I want to use sentencepiece, from https://github.com/google/sentencepiece in a Google Colab project where I am training an OpenNMT model. I'm a little confused with how to set up the sentencepiece binaries in Google Colab. Do I need to build with cmake?
When I try and install using pip install sentencepiece and try to include sentencepiece in my "transforms" in my script, I get this following error
After running this script (matched from the OpenNMT translation tutorial)
!onmt_build_vocab -config en-sp.yaml -n_sample -1
I get:
Traceback (most recent call last):
File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 63, in main
build_vocab_main(opts)
File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 32, in build_vocab_main
transforms = make_transforms(opts, transforms_cls, fields)
File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/transform.py", line 176, in make_transforms
transform_obj.warm_up(vocabs)
File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/tokenize.py", line 110, in warm_up
load_src_model.Load(self.src_subword_model)
File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
Below is how my script is written. I'm not sure what the not a string is coming from.
## Where the samples will be written
save_data: en-sp/run/example
## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt
## Where the model will be saved
save_model: drive/MyDrive/Europarl/model/model
# Prevent overwriting existing files in the folder
overwrite: False
# Corpus opts:
data:
europarl:
path_src: train_europarl-v7.es-en.es
path_tgt: train_europarl-v7.es-en.en
transforms: [sentencepiece, filtertoolong]
weight: 1
valid:
path_src: dev_europarl-v7.es-en.es
path_tgt: dev_europarl-v7.es-en.en
transforms: [sentencepiece]
skip_empty_level: silent
world_size: 1
gpu_ranks: [0]
...
EDIT: So I went ahead and Googled the issue more and found a google colab project that built sentencepiece using cmake here https://colab.research.google.com/github/mymusise/gpt2-quickly/blob/main/examples/gpt2_quickly.ipynb#scrollTo=dDAup5dxDXZW. However, even after building using cmake, I'm still getting this issue.

To fix this issue, I had to filter and tokenize my dataset and then train with sentencepiece. I used the scripts from this helpful source: https://github.com/ymoslem/MT-Preparation to do everything and now my model is training!

Converting google-cloud-ml github Reddit example from regression to classification and adding keys?

I've been trying to adapt the reddit_tft example from the cloud-ml github samples repo to my needs.
I've been able to get it running as per the tutorial readme.
However what i want to use it for is a binary classification problem and also output keys in batch prediction.
So i have made copy of the tutorial code here and have changed it in a few places to be able to have a model type of deep_classifier that would use a DNNClasifier instead of a DNNRegressor.
I've changed the score variable to be
if(score>0,1,0) as score
It's training fine, deploys to cloud ml but i'm not sure how to now get keys back from my predictions. `
I've updated the sql pulling from BigQuery to include id as example_id here
It seems the code from the tutorial had some sort of placeholder for example_id so i'm trying to leverage that.
It all seems to work but when i get batch predictions all i get is json like this:
{"classes": ["0", "1"], "scores": [0.20427155494689941, 0.7957285046577454]}
{"classes": ["0", "1"], "scores": [0.14911963045597076, 0.8508803248405457]}
...
So example_id does not seem to be making it into the serving functions like i need.
I've tried to follow the approach here which is based on adapting the census example for keys.
I just cant figure out how to finish adapting this reddit example to also output keys in the predictions as they look a bit different to me in terms of design and functions being used.
Update 1
My latest attempt is here Trying to use the approach outlined here.
However this is giving errors:
NotFoundError (see above for traceback): /tmp/tmp2jllvb/model.ckpt-1_temp_9530d2c5823d4462be53fa5415e429fd; No such file or directory
[[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:ps/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, dnn/hiddenlayer_0/kernel/part_2/read, dnn/dnn/hiddenlayer_0/kernel/part_2/Adagrad/read, dnn/hiddenlayer_1/kernel/part_2/read, dnn/dnn/hiddenlayer_1/kernel/part_2/Adagrad/read, dnn/input_from_feature_columns/input_layer/subreddit_id_embedding/weights/part_0/read, dnn/dnn/input_from_feature_columns/input_layer/subreddit_id_embedding/weights/part_0/Adagrad/read, dnn/logits/bias/part_0/read, dnn/dnn/logits/bias/part_0/Adagrad/read, global_step)]]
Update 2
My latest attempt and details are here.
I'm now getting a error from tensorflow-fransform (run_preprocess.sh works fine in tft 0.1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 282, in __setstate__
self._dtype = tf.as_dtype(state['dtype'])
TypeError: string indices must be integers, not str
Update 3
I have changed things to just use beam + csv and avoid tft. Also i'm now using the approach as outlined here for extending the canned estimator to get the key back with the predictions.
However when following this post to try get the comments in as features i'm now running into a new error.
The replica worker 3 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): [...] File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/estimator/python/estimator/extenders.py", line 87, in new_model_fn spec = estimator.model_fn(features, labels, mode, config) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 203, in public_model_fn return self._call_model_fn(features, labels, mode, config) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 694, in _call_model_fn model_fn_results = self._model_fn(features=features, **kwargs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/canned/dnn_linear_combined.py", line 520, in _model_fn config=config) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/canned/dnn_linear_combined.py", line 158, in _dnn_linear_combined_model_fn dnn_logits = dnn_logit_fn(features=features, mode=mode) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/canned/dnn.py", line 89, in dnn_logit_fn features=features, feature_columns=feature_columns) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/feature_column/feature_column.py", line 226, in input_layer with variable_scope.variable_scope(None, default_name=column.name): File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1826, in __enter__ current_name_scope_name = self._current_name_scope.__enter__() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 4932, in __enter__ return self._name_scope.__enter__() File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__ return self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3514, in name_scope raise ValueError("'%s' is not a valid scope name" % name) ValueError: 'Tensor("Slice:0", shape=(?, 20), dtype=int64)_embedding' is not a valid scope name
My repo for this attempt/approach is here. This all runs fine if i just use subreddit as a feature, it's adding in the comment feature that seems to be causing the problems. Lines 103 to 111 is where i have followed this approach.
Not sure what's triggering the error in my code from reading the trace. Anyone any ideas?
Or can anyone point me towards another approach to go from text to bow to embedding feature in TF?

See:
https://medium.com/#lakshmanok/how-to-extend-a-canned-tensorflow-estimator-to-add-more-evaluation-metrics-and-to-pass-through-ddf66cd3047d
Here's what the code looks like to pass through keys:
def forward_key_to_export(estimator):
estimator = tf.contrib.estimator.forward_features(estimator, KEY_COLUMN)
## This shouldn't be necessary (I've filed CL/187793590 to update extenders.py with this code)
config = estimator.config
def model_fn2(features, labels, mode):
estimatorSpec = estimator._call_model_fn(features, labels, mode, config=config)
if estimatorSpec.export_outputs:
for ekey in ['predict', 'serving_default']:
estimatorSpec.export_outputs[ekey] = \
tf.estimator.export.PredictOutput(estimatorSpec.predictions)
return estimatorSpec
return tf.estimator.Estimator(model_fn=model_fn2, config=config)
##
# Create estimator to train and evaluate
def train_and_evaluate(output_dir):
estimator = tf.estimator.DNNLinearCombinedRegressor(...)
estimator = forward_key_to_export(estimator)
...
tf.estimator.train_and_evaluate(estimator, ...)

We have plans, but haven't moved the changes into Census yet for the output keys. In the mean time can you please see if this gist helps https://gist.github.com/andrewm4894/ebd3ac3c87e2ab4af8a10740e85073bb#file-with_keys_model-py
Please feel free to send a PR if you get to it sooner and we will merge your contribution.

Tensorflow : Android demo accuracy

What related GitHub issues or Stack Overflow threads have you found by searching the web for your problem?
I searched #1269 #504
Environment info
Mac OS for build and Android version 5 to run .apk demo.
If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)
I followed the steps mentioned in #1269 and could able to run the example successfully, but the accuracy of the result is very low and often wrong. I have trained my systems on 25 different daily used products like soap, soup, noodles, etc.
Where as when i run the same example using following script it give me very high accuracy (approx. 90-95%)
import sys
import tensorflow as tf
// change this as you see fit
image_path = sys.argv[1]
// Read in the image_data
image_data = tf.gfile.FastGFile(image_path, 'rb').read()
// Loads label file, strips off carriage return
label_lines = [line.rstrip() for line
in tf.gfile.GFile("/tf_files/retrained_labels.txt")]
// Unpersists graph from file
with tf.gfile.FastGFile("/tf_files/retrained_graph.pb", 'rb') as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
_ = tf.import_graph_def(graph_def, name='')
with tf.Session() as sess:
// Feed the image_data as input to the graph and get first prediction
softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
predictions = sess.run(softmax_tensor, \
{'DecodeJpeg/contents:0': image_data})
// Sort to show labels of first prediction in order of confidence
top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]
for node_id in top_k:
human_string = label_lines[node_id]
score = predictions[0][node_id]
print('%s (score = %.5f)' % (human_string, score))
The only difference I see here is that the model file used in the Android demo is stripped because it does not support DecodeJpeg, whereas in the above code its the actually generated unstripped model. Is there any specific reason or somewhere I am wrong here?
I also tried using optimize_for_inference
but unfortunately, it fails with following error:
[milinddeore#P028: ~/tf/tensorflow ] bazel-bin/tensorflow/python/tools/optimize_for_inference --input=/Users/milinddeore/tf_files_nm/retrained_graph.pb --output=/Users/milinddeore/tf/tensorflow/tensorflow/examples/android/assets/tf_ul_stripped_graph.pb --input_names=DecodeJpeg/content —-output_names=final_result
Traceback (most recent call last):
File "/Users/milinddeore/tf/tensorflow/bazel-bin/tensorflow/python/tools/optimize_for_inference.runfiles/org_tensorflow/tensorflow/python/tools/optimize_for_inference.py", line 141, in <module>
app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/Users/milinddeore/tf/tensorflow/bazel-bin/tensorflow/python/tools/optimize_for_inference.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/Users/milinddeore/tf/tensorflow/bazel-bin/tensorflow/python/tools/optimize_for_inference.runfiles/org_tensorflow/tensorflow/python/tools/optimize_for_inference.py", line 90, in main
FLAGS.output_names.split(","), FLAGS.placeholder_type_enum)
File "/Users/milinddeore/tf/tensorflow/bazel-bin/tensorflow/python/tools/optimize_for_inference.runfiles/org_tensorflow/tensorflow/python/tools/optimize_for_inference_lib.py", line 91, in optimize_for_inference
placeholder_type_enum)
File "/Users/milinddeore/tf/tensorflow/bazel-bin/tensorflow/python/tools/optimize_for_inference.runfiles/org_tensorflow/tensorflow/python/tools/strip_unused_lib.py", line 71, in strip_unused
output_node_names)
File "/Users/milinddeore/tf/tensorflow/bazel-bin/tensorflow/python/tools/optimize_for_inference.runfiles/org_tensorflow/tensorflow/python/framework/graph_util_impl.py", line 141, in extract_sub_graph
assert d in name_to_node_map, "%s is not in graph" % d
AssertionError: is not in graph
I suspect that this problem is due to the android not being parse DecodeJpeg, but please correct me if i am wrong.
What other attempted solutions have you tried?
Yes, I the above script and it gives me quite high accuracy result.

Well, the reason for bad accuracy is following:
I ran this example code on Lenovo Vibe K5 mobile (This has SanpDragon 415), this wasn't compiled for hexagon DSP, even through the DSP on 415 is very old as compared to 835 (Hexagon DSP 682), in fact i am not quite sure if the Hexagon SDK will work with 415 or not, i haven't tried this though. This means the example was running on CPU to first detect motion and later classify them and hence the poor performance.
Slow FPS, will capture the images very slowly and hence moving objects will be really difficult.
So if you have bad image, there are very strong chances that the prediction will also be bad.
Camera capture and classification is taking long time, due to latency its not quite real-time.

TensorFlow distributed master worker save fails silently; the checkpoint file isn't created but no exception is raised

In distribution tensorflow environment. the master worker saves checkpoint fail.
saver.save has return ok*(not raise exception and return the store checkpoint file path) but, the return checkpoint file is not exist.
this is not same as the description of the tensorflow api
Why? How to Fix it?
=============
the related code is below:
def def_ps(self):
self.saver = tf.train.Saver(max_to_keep=100,keep_checkpoint_every_n_hours=3)
def save(self,idx):
ret = self.saver.save(self.sess,self.save_model_path,global_step=None,write_meta_graph=False)
if not os.path.exists(ret):
msg = "save model for %u path %s not exists."%(idx,ret)
lg.error(msg)
raise Exception(msg);
=============
the log is below:
2016-06-02 21:33:52,323 root ERROR save model for 2 path model_path/rl_model_2 not exists.
2016-06-02 21:33:52,323 root ERROR has error:save model for 2 path model_path/rl_model_2 not exists.
Traceback (most recent call last):
File "d_rl_main_model_dist_0.py", line 755, in run_worker
model_a.save(next_model_idx)
File "d_rl_main_model_dist_0.py", line 360, in save
Trainer.save(self,save_idx)
File "d_rl_main_model_dist_0.py", line 289, in save
raise Exception(msg);
Exception: save model for 2 path model_path/rl_model_2 not exists.
===========
not meets the tensorflow api which define Saver.save as below:
https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#Saver
tf.train.Saver.save(sess, save_path, global_step=None, latest_filename=None, meta_graph_suffix='meta', write_meta_graph=True)
Returns:
A string: path at which the variables were saved. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created.
Raises:
TypeError: If sess is not a Session.
ValueError: If latest_filename contains path components.

The tf.train.Saver.save() method is a little... surprising when you run in distributed mode. The actual file is written by the process that holds the tf.Variable op, which is typically a process in "/job:ps" if you've used the example code to set things up. This means that you need to look in save_path on each of the remote machines that have variables to find the checkpoint files.
Why is this the case? The Saver API implicitly assumes that all processes have the same view of a shared file system, like an NFS mount, because that is the typical setup we use at Google. We've added support for Google Cloud Storage in the latest nightly versions of TensorFlow, and are investigating HDFS support as well.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

TF record corrupted after several successful training epochs - tensorflow

You should make a clean copy of your tfrecord files. Whenever your working copy get corrupted, replace from the clean copy. The dataLoss error seems to be as a result of several reading of the same record, and its also dependent on the disk.

If someone is facing this problem, the above answer by #nwoye-cid worked for me plus the link below to install everything properly. Also, restart your kernel from scratch if nothing works then only go for other solutions. Link!

Related

OSError: Unable to open file (can't retrieve stat info for file)

SentencePiece in Google Colab

Converting google-cloud-ml github Reddit example from regression to classification and adding keys?

Tensorflow : Android demo accuracy

TensorFlow distributed master worker save fails silently; the checkpoint file isn't created but no exception is raised

Categories

Resources