TF corrupted record while training - tensorflow

I was training a neural network overnight and it crashed. I have 2 questions:
What causes this error?
How can I prevent it from happening again?
The 2 main errors are:
ERROR:tensorflow:Exception in QueueRunner: corrupted record at 52284962154
DataLossError (see above for traceback): corrupted record at 52284962154
EDIT
The same code was used on another machine and it crashed with the same error after about 6 hours. The number 52284962154 was identical.

The problem was a write error. After converting the data to TFRecords again, the error disappeared, and training can now get past step 30747.
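If it helps anyone hitting the same DataLossError: one quick way to confirm a TFRecords file was written cleanly is to iterate over it once, since a damaged record raises the same error at the offending byte offset. A minimal sketch (TF 1.x; the filename is illustrative):

import tensorflow as tf

count = 0
# Iterating forces a full read; a corrupted record raises DataLossError
# at the byte offset where the file is damaged.
for _ in tf.python_io.tf_record_iterator('train.tfrecords'):
    count += 1
print('read %d records without error' % count)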


Another follow-up to numpy.linalg.eig and loading

Anyway, I'm still struggling with the task of 1) calculating the eigenvalues and eigenvectors of a matrix, 2) saving them to a file, and 3) loading the data back. I can do steps 1 and 2, but no matter what I try, step 3 always throws an error. See np.savetxt triggers ValueError. Why? and Writing and Reading Eigenvalues & Eigenvectors, follow up
This time I tried saving the eigenvalues and eigenvectors separately, so they're both arrays. Unfortunately I still get the pickle error, even when loading just the eigenvalues.
eigs = np.linalg.eig(P @ K @ P)
eigvals = np.real(eigs[0])
eigvecs = np.real(eigs[1])
np.savetxt('eigvals.txt', eigvals)
np.savetxt('eigvecs.txt', eigvecs)
Sure enough, eigvals and eigvecs show up as arrays, sizes 10000 and 10000x10000 respectively, in the Variable Explorer. And when I manually open eigvals.txt, I see a long list of floats as expected. But when I then try np.load('eigvals.txt','r'), I still get the pickle error (ValueError: Cannot load file containing pickled data when allow_pickle=False). What's wrong now?
Thanks
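Side note on the load step: np.savetxt writes plain text, and its matching reader is np.loadtxt; np.load expects the binary .npy/.npz format, which is why it falls back to the pickle path and raises that ValueError. A minimal round trip with an illustrative array:

import numpy as np

vals = np.arange(5, dtype=np.float64)   # illustrative data
np.savetxt('eigvals.txt', vals)         # plain-text writer
loaded = np.loadtxt('eigvals.txt')      # plain-text reader to match
assert np.allclose(vals, loaded)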

MXNet model build error in R

When I try to use mxnet to build a feedforward model, the following error appears:
Error in mx.io.internal.arrayiter(as.array(data), as.array(label), unif.rnds, :
basic_string::_M_replace_aux
I followed the R regression example on the mxnet website but swapped in my own data, which contains 109 examples and 1876 variables. The first several steps run without error until the model-building step. I just can't understand what the error message means. I wonder whether it is caused by my dataset or by the way I handled the data.
Can you provide the code snippet you are using? That would give more detail on the issue. Also, any stack trace would be useful.
This error message is mainly caused by invalid column/row access or a shape (dimension) mismatch. Can you verify that you are using the correct "index" values when creating the matrix? Let me know if this fixes the issue.
That said, MXNet could do a better job of printing error details in the stack trace. I have created an issue to follow up on this: https://github.com/dmlc/mxnet/issues/4206

Why does tensorflow sometimes run slower and slower as training progresses?

I am training an RNN; the first epoch took 7.5 hours, but as training goes on, tensorflow runs slower and slower, and the second epoch took 55 hours. I checked the code, and the APIs that become slower over time are mostly these:
session.run([var1, var2, ...], feed_dict=feed),
tensor.eval(feed_dict=feed).
For example, one line of code is session.run([var1, var2, ...], feed_dict=feed). When the program starts, it takes 0.1 seconds, but as the process runs, the time spent on this line grows and grows; after 10 hours it reaches 10 seconds.
I have run into this several times. What triggers it, and how can I avoid it?
Does this line of code: self.shapes = [numpy.zeros(g[1].get_shape(), numpy.float32) for g in self.compute_gradients] add nodes to the tensorflow graph? I suspect this may be the reason. This line is called many times periodically, and self is not an instance of tf.train.Optimizer.
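One way to test that suspicion (a rough diagnostic sketch, TF 1.x): print the op count of the default graph every so often; if the number keeps climbing between iterations, something inside the loop is creating new nodes.

import tensorflow as tf

# If this count grows from one training iteration to the next, ops are being
# added inside the loop (for example by repeated compute_gradients calls).
num_ops = len(tf.get_default_graph().get_operations())
print('ops in default graph:', num_ops)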
Try finalizing your graph after you create it (graph.finalize()). This will prevent operations from being added to the graph. I also think self.compute_gradients is adding operations to the graph. Try defining the operation outside your loop and only running it inside the loop.
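A minimal sketch of that suggestion (TF 1.x; the tiny model here is only a placeholder): build everything first, then finalize, so any op accidentally created inside the training loop raises an error instead of silently growing the graph.

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 10])
w = tf.Variable(tf.zeros([10, 1]))
y = tf.matmul(x, w)
init = tf.global_variables_initializer()

tf.get_default_graph().finalize()   # graph is now read-only

with tf.Session() as sess:
    sess.run(init)
    # Training loop: sess.run(...) on existing ops only; creating a new op
    # here would raise "Graph is finalized and cannot be modified."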
I had a similar issue. My solution was putting
tf.reset_default_graph()
after each epoch or sample. This resets the graph and frees up all the resources used in a way closing the session does not.
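Roughly what that looks like in practice (a sketch; build_model below is a stand-in for whatever rebuilds your ops):

import tensorflow as tf

def build_model():
    # Stand-in for a real model: one variable and a trivial "training" op.
    w = tf.Variable(0.0)
    train_op = tf.assign_add(w, 1.0)
    return train_op, tf.global_variables_initializer()

for epoch in range(3):
    tf.reset_default_graph()        # drop every op from the previous epoch
    train_op, init = build_model()  # rebuild the graph from scratch
    with tf.Session() as sess:
        sess.run(init)
        sess.run(train_op)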

Tensorflow FIFOQueue error: FIFOQueue is closed and has insufficient elements

I am using tensorflow to write a program to validate models. I use a FIFOQueue to queue the input data. For example, I have 50,000 images and enqueue 100 images at a time. The program works beautifully except for the final iteration, where it shows the error
"E tensorflow/core/client/tensor_c_api.cc:485] FIFOQueue '_0_path_queue' is closed and has insufficient elements (requested 1, current size 0)
[[Node: path_queue_Dequeue = QueueDequeue[_class=["loc:@path_queue"], component_types=[DT_INT32, DT_BOOL, DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"]]"
I think that is because it tries to enqueue images 50,001–50,100 but cannot. However, I don't need to enqueue these images and will not use them. How can I avoid this error?
Another question: if I would like to use dequeue_many(100) but the total number of images is not divisible by 100 (say 45,678), tensorflow will throw an error. How can I solve this?
Thanks.
Try dequeue_up_to instead of dequeue_many:
https://www.tensorflow.org/versions/r0.10/api_docs/python/io_ops.html
Hope that helps!
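A small sketch of the difference, assuming a plain string FIFOQueue (TF 1.x):

import tensorflow as tf

queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.string])
# dequeue_many(100) fails once fewer than 100 elements remain;
# dequeue_up_to(100) simply returns whatever is left on the final batch.
final_batch = queue.dequeue_up_to(100)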
You could catch the specific error which will gracefully end training once all examples have been exhausted:
try:
    while True:
        # Run training ops here...
        pass
except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
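For reference, the fuller version of that pattern in TF 1.x pairs the try/except with a Coordinator so the queue-runner threads shut down cleanly once the queue is exhausted (sketch; the loop body is a placeholder):

import tensorflow as tf

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            pass  # run training/validation ops here, e.g. sess.run(...)
    except tf.errors.OutOfRangeError:
        print('Done training -- epoch limit reached')
    finally:
        coord.request_stop()
    coord.join(threads)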
I have faced this issue multiple times, and in my experience it is usually caused by the input files not being found. My input was a list of PNGs from a directory, and I was using this to get the input images:
input = tensorflow.train.string_input_producer(tensorflow.train.match_filenames_once("/input/*.png"))
which was somehow not getting the files correctly. Changing it to
filename_im = tensorflow.train.string_input_producer(glob.glob('/input/*.png'))
solved the issue
I believe this is only a warning that the queue is empty and does not cause errors. I see similar warnings, but my program does not break. Does yours? See this thread.

Why do I get the message 'Line search fails in two-class probability estimates' when using libsvm for binary classification?

I am suddenly facing a problem where I get the message 'Line search fails in two-class probability estimates' when using libsvm for binary classification of test images. The training database is of size 2000 and the test database is of size 2000. The feature vector size is 200. However, for another feature of size 256 the problem does not arise, and it did not occur for other large feature sizes either; it has appeared suddenly. I am using LIBSVM and it is binary classification. What could be the possible reason? Please help, thanks in advance.
I have tried the solution suggested in an earlier, similar question, but it did not help.