Does the Saver in TensorFlow have a cross-platform file format?

Suppose I execute the following on machine A after some calculations for some tf.Session sess and some tf.train.Saver saver, assuming I have some tf.Graph G with some variables V:
with tf.Graph().as_default():
    # Define G, V, initialize for sess, then run some computation
    saver.save(sess, '/A/somefolder/somefile')
This creates somefile and somefile.meta, and updates the checkpoint file in somefolder.
Next, suppose on machine B I copy the entire contents of somefolder and run the following:
with tf.Graph().as_default():
    # Define G and V the same way. No initialization or run here.
    saver.restore(sess, '/B/somefolder/somefile')
For both machines A and B, will the variables have the same state at the end of the code blocks? Is this guaranteed to work across all platforms? What about different versions of Linux?

The saver uses a simple file format based on LevelDB to store a key-value table that maps variable names (as strings) to SavedTensorSlice protocol buffers. The format is intended to work across all platforms, although it has mostly been tested on little-endian (i.e. x86-based) architectures. The same file should work across different versions of Linux, and between Linux and Mac OS X. If it doesn't, please raise an issue!
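If you want to sanity-check the copied files on machine B before restoring, you can list the variable names and shapes stored in the checkpoint. A minimal sketch, assuming a TF 1.x install and the path from the question:

import tensorflow as tf

# Open the checkpoint copied over from machine A.
reader = tf.train.NewCheckpointReader('/B/somefolder/somefile')

# Print every saved variable name with its shape.
for name, shape in reader.get_variable_to_shape_map().items():
    print(name, shape)

# Read a single variable back as a NumPy array (the name must match a key above).
# value = reader.get_tensor('some_variable_name')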

Related

Inexplicable behaviour when using numpy.T as init for PyTorch weights

I use numpy to initialize the weights of my PyTorch MLP. It's a really small network: 2 layers, 21 neurons per layer. The network outputs BRDF values that are then rendered by Mitsuba 0.6.0.
The peculiar issue I am experiencing appears when transposing the np-arrays during the initialization phase. Version A gives me a network that renders correctly in Mitsuba (what I would expect). Version B, which should be equivalent, gives me a network that scores the same loss in PyTorch but renders different values in Mitsuba.
# Version A:
w = np.random.uniform(low=-0.05, high=0.05, size=(6, 21)).astype(np.float32)
model.fc1.weight = torch.nn.Parameter(torch.from_numpy(w.T), requires_grad=True)
# Version B:
w = np.random.uniform(low=-0.05, high=0.05, size=(21, 6)).astype(np.float32)
model.fc1.weight = torch.nn.Parameter(torch.from_numpy(w), requires_grad=True)
Note that in Version B, the only changes are the array dimensions and the removal of the transpose. The shapes are therefore identical to Version A, and the contents should be equivalent as well, since both are sampled from the same distribution.
I cannot share a MWE, as this is proprietary research, but I assure you that the ONLY thing I changed between these two runs is the two lines in the above code snippets. I do not think Mitsuba is at fault either, because the first network (version A) renders fine, and the second network is equivalent to that, but for the init. I tried mimicking the numpy-inits with the respective PyTorch-equivalents, and the issue persists.
Any help is greatly appreciated!!
[Rendered output images: Version A vs. Version B]
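One way to see how the two inits relate is to fix the NumPy seed and compare the resulting parameters directly. This is a standalone sketch (not the proprietary model; the layer shape is taken from the snippets above): with identical draws, the transposed Version A arranges the values differently from Version B, so the weights are equal in distribution but not element-wise.

import numpy as np
import torch

np.random.seed(0)
# Version A: sample (6, 21), then transpose to (21, 6)
a = np.random.uniform(low=-0.05, high=0.05, size=(6, 21)).astype(np.float32).T

np.random.seed(0)  # re-seed so Version B consumes the same underlying draws
# Version B: sample (21, 6) directly
b = np.random.uniform(low=-0.05, high=0.05, size=(21, 6)).astype(np.float32)

pa = torch.nn.Parameter(torch.from_numpy(a.copy()), requires_grad=True)
pb = torch.nn.Parameter(torch.from_numpy(b), requires_grad=True)

print(pa.shape == pb.shape)           # True: the shapes match
print(torch.equal(pa.data, pb.data))  # False: the same draws fill the array
                                      # in a different order before the transpose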

Copy variables from one TensorFlow graph to another

I have two tensorflow graphs. One for training and the other for evaluation. They share a lot of variable names. When I evaluate a model I want to copy all variable values from the train graph to the test graph. Obviously, I can do it via tf.train.Saver, but this solution does not seem very appropriate to me, especially because it has to go through the disk.
When you speak about multiple graphs, I assume you mean something like:
g1 = tf.Graph()
with g1.as_default():
    # add your stuff
g2 = tf.Graph()
with g2.as_default():
    # add other stuff
If this is correct, then are you sure you really need two graphs? Can't you have one graph consisting of two connected components?
Using multiple graphs is discouraged (p 47) because:
Multiple graphs require multiple sessions, and each session will try to use all available resources by default.
You can't pass data between them without going through Python/NumPy, which doesn't work in a distributed setting.
It's better to have disconnected subgraphs within one graph.
This also gives you a way to pass variables between graphs in a non-distributed setting (see the sketch below).
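Here is a minimal sketch of copying a variable's value from one graph to another through NumPy, without touching the disk. It assumes TF 1.x; the variable name and shape are made up for illustration:

import tensorflow as tf

# Two graphs with an identically named variable (illustrative shape).
g1 = tf.Graph()
with g1.as_default():
    v_train = tf.get_variable('w', shape=[2, 2])
    init_op = tf.global_variables_initializer()

g2 = tf.Graph()
with g2.as_default():
    v_test = tf.get_variable('w', shape=[2, 2])

# Read the value out of the training graph as a NumPy array...
with tf.Session(graph=g1) as sess1:
    sess1.run(init_op)
    value = sess1.run(v_train)

# ...and load it into the evaluation graph, no Saver or disk involved.
with tf.Session(graph=g2) as sess2:
    v_test.load(value, session=sess2)   # or: sess2.run(v_test.assign(value))
    print(sess2.run(v_test))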

Tensorboard scalars and graphs duplicated

I'm using TensorBoard to visualize network metrics and graph.
I create a session sess = tf.InteractiveSession() and build the graph in Jupyter notebook.
In the graph, I include two summary scalars:
with tf.variable_scope('summary') as scope:
    loss_summary = tf.summary.scalar('Loss', cross_entropy)
    train_accuracy_summary = tf.summary.scalar('Train_accuracy', accuracy)
I then create a summary_writer = tf.summary.FileWriter(logdir, sess.graph) and run:
_, loss_sum, train_accuracy_sum = sess.run([...], feed_dict=feed_dict)
I write the metrics:
summary_writer.add_summary(loss_sum, i)
summary_writer.add_summary(train_accuracy_sum, i)
I run the code three times.
Each time I run, I re-import TF and create a new interactive session.
But, in TensorBoard, a separate scalar window is created for each run.
Also, the graph appears to be duplicated if I check the data for the last run.
How do I prevent duplication of the graph and scalar window each time I run?
I want all data to appear in the same scalar plots (with multiple series / plot).
I want each run to reference a single graph visualization.
I suspect the problem arises because you are running the code three times in the same process (same script, Jupyter notebook, or whatever), and those invocations share the same "default graph" in TensorFlow. TensorFlow needs to give each node in the graph a unique name, so it appends "_1" and "_2" to the names of the summary nodes in the second and third invocations.
How do you avoid this? The easiest way is to create a new graph each time you run the code. There are (at least) three ways to do this:
Wrap the code in a with tf.Graph().as_default(): block, which constructs a new tf.Graph object and sets it as the default graph for the extent of the with block.
If you construct your session before creating the graph, you can construct your session as sess = tf.InteractiveSession(graph=tf.Graph()). The newly constructed tf.Graph object remains as the default graph until you call sess.close().
Call tf.reset_default_graph() between invocations of the code.
The with-block approach is the "most structured" way to do things, and might be best if you are writing a standalone script. However, since you are using tf.InteractiveSession, I assume you are using an interactive REPL of some kind, and the other two approaches are probably more useful (e.g. for splitting the execution across multiple cells).
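As a concrete illustration of the first option, here is a minimal sketch (the loss computation and logdir are placeholders, not your actual model) that gives each run its own graph and FileWriter, so the summary nodes keep their original names:

import tensorflow as tf

with tf.Graph().as_default():                    # fresh graph for this run
    x = tf.placeholder(tf.float32, name='x')
    loss = tf.square(x, name='loss')             # stand-in for your model's loss
    loss_summary = tf.summary.scalar('Loss', loss)

    with tf.Session() as sess:
        writer = tf.summary.FileWriter('/tmp/logdir/run1', sess.graph)
        summ = sess.run(loss_summary, feed_dict={x: 3.0})
        writer.add_summary(summ, 0)
        writer.close()
# Nothing from this run leaks into the next run's default graph, so no
# "_1"/"_2" suffixes appear and the graph is not duplicated in TensorBoard.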
This problem occurs because the default graph keeps holding the nodes from previous runs. It is not a problem in itself, but if you want to avoid it, use:
tf.reset_default_graph()

Tensorflow: dynamically call GPUs with enough free memory

My desktop has two GPUs which can run TensorFlow with the specification /gpu:0 or /gpu:1. However, if I don't specify which GPU to run the code on, TensorFlow will call /gpu:0 by default, as we all know.
Now I would like to set up the system so that it assigns a GPU dynamically according to the free memory of each GPU. For example, if a script doesn't specify which GPU to run the code on, the system first assigns /gpu:0 to it; then, if another script runs now, it checks whether /gpu:0 has enough free memory. If so, it also assigns /gpu:0; otherwise it assigns /gpu:1. How can I achieve this?
Follow-ups:
I believe the question above may be related to GPU virtualization. That is to say, if I could virtualize the multiple GPUs in a desktop into one GPU, I would get what I want. So besides any setup methods for TensorFlow, any ideas about virtualization are also welcome.
TensorFlow generally assumes it's not sharing the GPU with anyone, so I don't see a way of doing it from inside TensorFlow. However, you could do it from the outside as follows: a shell script that calls nvidia-smi, parses out the GPU k with the most free memory, then sets CUDA_VISIBLE_DEVICES=k and calls the TensorFlow script.
Inspired by:
How to set specific gpu in tensorflow?
import os
import subprocess as sp

def leave_gpu_with_most_free_ram():
    try:
        # Query free memory for every GPU via nvidia-smi
        command = "nvidia-smi --query-gpu=memory.free --format=csv"
        output = sp.check_output(command.split()).decode('ascii')
        memory_free_info = output.strip().split('\n')[1:]  # drop the CSV header
        memory_free_values = [int(x.split()[0]) for x in memory_free_info]
        least_busy_idx = memory_free_values.index(max(memory_free_values))
        # update CUDA variable so only the least busy GPU stays visible
        gpus = [least_busy_idx]
        setting = ','.join(map(str, gpus))
        os.environ["CUDA_VISIBLE_DEVICES"] = setting
        print('Left GPU(s) [%s] unmasked (free memory per GPU: %s MiB)'
              % (setting, str(memory_free_values)))
    except FileNotFoundError as e:
        print('"nvidia-smi" is probably not installed. GPUs are not masked')
        print(e)
    except sp.CalledProcessError as e:
        print("Error on GPU masking:\n", e.output)
Add a call to this function before importing TensorFlow.
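For example, at the top of the training script (assuming the function above is defined in the same file or an importable module):

leave_gpu_with_most_free_ram()   # sets CUDA_VISIBLE_DEVICES first
import tensorflow as tf          # TensorFlow now only sees the chosen GPU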

How to use StreamingDataFeeder as contrib.learn.Estimator.fit()'s input_fn?

I have recently started using the tensorflow.contrib.learn (skflow) library and really like it. However, I am facing an issue: Estimator's fit function accepts either
(X, Y, and batch_size) - the problem with this approach is that it does not allow specifying the number of epochs or using an arbitrary source of data.
input_fn - besides setting epochs, it gives me much more flexibility over the source of the training data (which in my case comes directly from a database).
Now I am aware that I could create an input_fn that reads files; however, as I am not interested in dealing with files, the following functions are not useful to me:
tf.contrib.learn.read_batch_examples
tf.contrib.learn.read_batch_features
tf.contrib.learn.read_batch_record_features
Ideally, I would like to use StreamingDataFeeder as input_fn. Any ideas how I can achieve this?
StreamingDataFeeder is used when you provide iterators as x / y to fit/predict/evaluate of Estimator.
Example:
x = (np.array([i]) for i in xrange(10**10)) # use range for python >=3.0
y = (np.array([i + 1]) for i in xrange(10**10))
lr = tf.contrib.learn.LinearRegressor(
    feature_columns=[tf.contrib.layers.real_valued_column('')])
# only consumes 1000*10 values from iterators.
lr.fit(x, y, steps=1000, batch_size=10)
If you want to use input_fn for feeding data, you need to use graph operations to read / process the data. For example, you could create a C++ op that produces your data (e.g. one that listens on a port or reads from a database) and converts it into Tensors. This is mainly geared towards reading data from files, but other readers can be implemented as well. A lighter-weight, Python-only alternative is sketched below.
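If staying in Python is acceptable, one option is to wrap a Python function (e.g. one that pulls the next batch of rows from your database) in tf.py_func inside input_fn. This is a sketch under that assumption; fetch_batch_from_db is a hypothetical helper, not part of the contrib.learn API:

import numpy as np
import tensorflow as tf

def fetch_batch_from_db():
    # Hypothetical helper: fetch the next batch from your database and
    # return (features, targets) as float32 NumPy arrays.
    features = np.random.rand(10, 1).astype(np.float32)
    targets = np.random.rand(10, 1).astype(np.float32)
    return features, targets

def input_fn():
    # tf.py_func calls back into Python each time the graph needs a batch.
    features, targets = tf.py_func(fetch_batch_from_db,
                                   [], [tf.float32, tf.float32])
    features.set_shape([10, 1])
    targets.set_shape([10, 1])
    return {'x': features}, targets

lr = tf.contrib.learn.LinearRegressor(
    feature_columns=[tf.contrib.layers.real_valued_column('x')])
lr.fit(input_fn=input_fn, steps=1000)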