Tensorflow: The graph couldn't be sorted in topological order only when running in terminal - tensorflow

I encounter some problem while running the python script on Google cloud computing instance, with python 3.6, tensorflow 1.13.1. I see several people encounter similar problems of loops in computational graph on stackoverflow. But none of it really find the culprit for it. And I observe something interesting so maybe someone experienced can figure it out.
The error message is like this:
2019-05-28 22:28:57.747339: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-05-28 22:28:57.754195: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
My script for train.py will look like this:
import A,B,C
...
def main():
....
if __name__ == '__main__':
main()
So I will show my two ways to run this script:
VERSION1:
In terminal,
python3 train.py
This gives me the error like I state above. When I only use CPU, i notice it throws me something like failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected. So I add GPU to my instance but the loop in computational graph is still there.
VERSION 2(This is where weird thing happens):
I just simply copy, with nothing changed, the code in main to the jupyter notebook and run there. Then suddenly, no error occurs anymore.
I don't really know what's going on under the hood. I just notice the message at the beginning of the running is not the same between two different ways of running the code.
If you encounter the same problem, copy to jupyter notebook might help directly. I would really like share more info if someone has any ideas what might possible cause this. Thank you!

Well it turns out, no matter what, I choose a wrong way to build the graph at the beginning, which will not give the loop in my perspective. The loop error give me an idea I'm doing something wrong. But the interesting question I mentioned above is still not answered! However, I'd like to share my error so anyone see the loop error should think whether you're doing the same thing as me.
In the input_fn, I use tensor.eval() to get corresponding numpy.array in the middle to interact with data outside of that function. I choose not to use tf.data.Dataset because the whole process is complicated and I can't compress the whole thing into Dataset directly. But it turns out this approach sabotage the static computational graph design of Tensorflow. So during training, it trains on the same batch again and again. So my two cents advice is that if you want to achieve something super complex in your input_fn. It's likely you will be better off or even only doing the right thing by using the old fashioned modelling way- tf.placeholder.

Related

How to check the root cause of CUDA out of memory issue in the middle of training?

I'm running roberta on huggingface language_modeling.py. After doing 400 steps I suddenly get a CUDA out of memory issue. Don't know how to deal with it. Can you please help? Thanks
This can have multiple reasons. If you only get it after a few iterations, it might be that you don't free the computational graphs. Do you use loss.backward(retain_graph=True) or something similar?
Also, when you're running inference, be sure to use
with torch.no_grad():
model.forward(...)
Otherwise the computational graphs are saved there as well and potentially never freed since you never call backward() on them.
My problem was that I didn't check the size of my GPU memory with comparison to the sizes of samples. I had a lot of pretty small samples and after many iterations a large one. My bad.
Thank you and remember to check these things if it happens to you to.

How to disable summary for Tensorflow Estimator?

I'm using Tensorflow-GPU 1.8 API on Windows 10. For many projects I use the tf.Estimator's, which really work great. It takes care of a bunch of steps including writting summaries for Tensorboard. But right now the 'events.out.tfevents' file getting way to big and I am running into "out of space" errors. For that reason I want to disable the summary writting or at least reduce the amount of summaries written.
Going along with that mission I found out about the RunConfig you can pass over at construction of tf.Estimator. Apparently the parameter 'save_summary_steps' (which by default is 200) controls the way summaries are wrtitten out. Unfortunately changing this parameter seems to have no effect at all. It won't disable (using None value) the summary or reducing (choosing higher values, e.g. 3000) the file size of 'events.out.tfevents'.
I hope you guys can help me out here. Any help is appreciated.
Cheers,
Tobs.
I've observed the following behavior. It doesn't make sense to me so I hope we get a better answer:
When the input_fn gets data from tf.data.TFRecordDataset then the number of steps between saving events is the minimum of save_summary_steps and (number of training examples divided by batch size). That means it does it a minimum of once per epoch.
When the input_fn gets data from tf.TextLineReader, it follows save_summary_steps as you'd expect and I can give it a large value for infrequent updates.

Does TensorFlow support to save the initial hyper-parameter configuration automatically?

We need to run the designed networks many times for better performance and it would be better to record our experiments we have run. Maybe it could be good to provide to record these hyper-parameter configuration automatically by the tensorflow execution engine. For example, I record by set different directory name for the log directory as:
log_lr_0.001_theta_0.1_alpha_0.1
log_lr_0.01_theta_0.01_alpha_0.02
....
Are there any automatic ways to help this? In addition, it would be better that when we start a new tensorflow training instance, a new port will be allocated and a new tensorboard is started and shows its learning state.
No, tensorflow doesn't support initial hyper parameter configuration automatically.
I've faced the same issue as you, and I'm using a tool called Sacred, I hope you'd find that useful.

Tensorflow not linking operations into single CUDA kernel

I've just started learning how to use Tensorflow and have run into an issue that's making me doubt my understanding of how it should work. I want to get a rough idea of how much performance I should be getting using basic arithmetical operations on a GPU. I create a one dimensional tensor of 100 million elements and then chain 1000 add operations on this tensor. My expectation is that the Tensorflow run-time would be able to link these operations into a single CUDA kernel that's executed on the GPU, however when I run it it seems that each operation is being issued to the GPU separately. It takes around 5 seconds to complete on my gtx 1080 ti, which gives around 20 Gflops. While running, python.exe is using up a full CPU core and Nvidia Nsight shows many kernels being submitted. In comparison, when I try and see what I get with Alea.GPU I get around 3Tflops and a single CUDA kernel issued.
Am I misunderstanding how basic operations should work on a GPU? is the only way to get good GPU efficiency to manually group operations into more complex custom operations or use the higher level ML functions?
Thank you.
import tensorflow as tf
import time
TENSOR_SIZE=100000000
TF_REP=1000
def testSpeed(x):
tf.InteractiveSession();
z=tf.zeros(TENSOR_SIZE)
for i in range(0, TF_REP):
z=tf.add(z,x)
return tf.reduce_sum(z).eval();
x=tf.range(0.0, TENSOR_SIZE)
t0=time.perf_counter()
testSpeed(x)
t1=time.perf_counter()
print("Time taken "+str(t1-t0)+"s gflops= " + str(TENSOR_SIZE * TF_REP / 1000000000.0 / (t1 - t0)))
Firstly, you should separate your code into 2 stages, a build_graph stage, which defines the various tensors. I suggest collecting them in a function called build_graph(). Then create your session and run data through it. You are trying to apply procedural programming techniques to an imperative library.
Next is the issue of swapping data onto and off of the GPU. When you run tf.reduce_sum(z).eval() you are copying the result from GPU back to CPU every time.
Lastly, you are creating many sessions with tf.InteractiveSession(), you should only have 1 session created. Go back to the first issue to resolve this. A best practice is to never create tensorflow OPs after the session has been created. Tensorflow will allow you to, but as a best practice don't, and you shouldn't need to if you coded things correctly. If you feel like you need to, post a question asking why you can't do XYZ without defining it before the session was created and someone will almost certainly offer a correction to the workflow.

How to obtain the module/object code of a Theano numpy program

In my University we have a Cluster having Tesla GPUs. However the resource is shared by several departments and the supercomputing department requires users to provide uniquely the module/code object of the program one needs to run in the cluster. In such a situation, I searched for some information about this. The supercomputer has a queue system (which is usual in supercomputers to be shared). As I understand, the supercomputing department requires one to follow procedures like this. So, how to obtain the object code of a Keras-Theano model compiled for GPU? Just like the produced by gcc model.c --> a.out which is what i need.
Any other idea is very appreciated.
The easiest solution should be pickling theano function, however this only saves optimized graph but not generated code. I'm not sure if this works for your case.
You can use command theano-cache list to find the directories for generated Op code, which is typically under /home/user/.theano/. However hand-compiling this into module may be complicated.
There's also a PR for shared library generation however it's not merged yet.