Can Tensorflow read from HDFS on Mac?

I'm trying to coerce Tensorflow on macOS to read from HDFS. The documentation
https://www.tensorflow.org/deploy/hadoop
does not clearly specify whether this is possible, and the code refers only to "posix" operating systems. The error I'm seeing when trying to read from HDFS is the following:
UnimplementedError (see above for traceback): File system scheme hdfs not implemented
[[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer)]]
Here's what I've done up to this point:
brew installed Hadoop 2.7.2
separately compiled Hadoop 2.7.2 for the native libraries. Hadoop is installed at /usr/local/Cellar/hadoop/2.7.2/libexec on my system, and the native libraries (libhdfs.dylib) are in ~/Source/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-2.7.2/lib/native.
Edited the code at https://github.com/tensorflow/tensorflow/blob/v1.0.0/tensorflow/core/platform/hadoop/hadoop_file_system.cc#L113-L119 to read from libhdfs.dylib rather than libhdfs.so, recompiled, and reinstalled Tensorflow. (I have to admit this is pretty boneheaded, and I have no idea if it's all that's required to make this code work on Mac.)
Here is the code to reproduce.
test.sh:
set -x
export JAVA_HOME=$($(dirname $(which java | xargs readlink))/java_home)
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.2/libexec
. $HADOOP_HOME/libexec/hadoop-config.sh
export HADOOP_HDFS_HOME=$(echo ~/Source/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-2.7.2)
export CLASSPATH=$($HADOOP_HDFS_HOME/bin/hdfs classpath --glob)
# Virtual environment with Tensorflow and necessary dependencies
. venv/bin/activate
python ./test.py
test.py:
import tensorflow as tf

_, example_bytes = tf.TFRecordReader().read(
    tf.train.string_input_producer(
        [
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00000",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00001",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00002",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00003",
        ]
    )
)

with tf.Session().as_default() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    print(len(sess.run(example_bytes)))
The code path I'm seeing in the Tensorflow source seems to indicate that I'd receive a different error than the one above if the issue were really Mac-specific, since some kind of handler is registered for the "hdfs" scheme regardless: https://github.com/tensorflow/tensorflow/blob/v1.0.0/tensorflow/core/platform/hadoop/hadoop_file_system.cc#L474. Has anyone else succeeded in getting Tensorflow to read from HDFS on a Mac? If it isn't supported, is there an easy place to patch it?
I'm also open to suggestions as to what might be a better approach. The high-level goal is to efficiently train a model in parallel, using shared parameter servers, considering that each worker will only read a subset of the data. This is readily accomplished using the local filesystem, but it's less clear how to scale beyond that. Even if I do succeed in making the code above work, the result could suffer from problems with data locality.
This thread https://github.com/tensorflow/tensorflow/issues/2218 suggests using pyspark.RDD.toLocalIterator to iterate over the data set with a placeholder in the graph. Aside from my concern about forcing each worker to iterate through the full dataset, I don't see a way to coerce Tensorflow's builtin Estimator class to accept a custom feed function along with a specified input_fn, and a custom input_fn appears necessary in order to take advantage of models like LinearClassifier (https://www.tensorflow.org/tutorials/linear) that are capable of learning from sparse, weighted features.
Any thoughts?

Did you enable HDFS support in ./configure when building? That's the error you would get if HDFS is disabled.
I think you made the correct change to make it work. Feel free to send a pull request to look for .dylib on macOS.
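A quick way to probe whether the scheme made it into your build (a minimal sketch; the NameNode address and path are illustrative, and it assumes the same environment variables as test.sh above):

import tensorflow as tf

# If HDFS support was compiled out, this should raise UnimplementedError
# ("File system scheme hdfs not implemented") rather than return a boolean.
print(tf.gfile.Exists("hdfs://localhost:9000/user/foo"))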

Related

Using custom StyleGAN2-ada network in GANSpace (.pkl to .pt conversion)

I trained a network using Nvidia's StyleGAN2-ada PyTorch implementation. I now have a .pkl file. I would like to use the GANSpace code on my network. However, to use GANSpace with a custom model, you need to be able to give it a checkpoint of your model that has been uploaded somewhere (they suggest Google Drive; the checkpoint is required in the code here). I am not entirely sure how this works or why it works like this, but either way it seems I need a .pt file of my network, not the .pkl file I currently have.
I tried following this tutorial. It seems the GANSpace code actually provides a file (models/stylegan2/convert_weight.py) that can do this conversion. However, the convert_weight.py file that was supposed to be there has been replaced by a link to a whole other repo. If I try to run the convert_weight.py file as below, it gives me the following error:
python content/stylegan2-pytorch/convert_weight.py --repo="content/stylegan2-pytorch/" "content/fruits2_output/00000-fruits2-auto1/network-snapshot-025000.pkl"
ModuleNotFoundError: No module named 'dnnlib'
This makes sense because there is no such dnnlib module. If I instead change it to look for the dnnlib module somewhere that does have it (here), like this:
python content/stylegan2-pytorch/convert_weight.py --repo="content/stylegan2/" "content/fruits2_output/00000-fruits2-auto1/network-snapshot-025000.pkl"
it previously gave me an error saying TensorFlow had not been installed (which in all fairness it hadn't, because I am using PyTorch), much like this error reported here. I then installed TensorFlow, but then it gives me this error.
ModuleNotFoundError: No module named 'torch_utils'
again the same as in the previous issue reported on GitHub. After installing torch_utils I get the same error as SamTransformer (ModuleNotFoundError: No module named 'torch_utils.persistence'). The response was "convert_weight.py does not supports stylegan2-ada-pytorch".
There is a lot I am not sure about, like why I need to convert a .pkl file to .pt in the first place. A lot of the stuff seems to talk about converting Tensorflow models to Pytorch ones, but mine was done in Pytorch originally, so why do I need to convert it? I just need a way to upload my own network to use in GANSpace - I don't really mind how, so any suggestions would be much appreciated.
Long story short, the conversion script provided was to convert weights from the official Tensorflow implementation of StyleGAN2 into Pytorch. As you mentioned, you already have a model in Pytorch, so it's reasonable for the conversion script to not work.
Instead of StyleGAN2 you used StyleGAN2-Ada, which isn't mentioned in the GANspace repo; most probably it didn't exist by the time the GANspace repo was created. As far as I know, StyleGAN2-Ada uses the same architecture as StyleGAN2, so as long as you manually modify your pkl file into the required pt format, you should be able to continue setup.
Looking at the source code for converting to Pytorch, GANspace requires the pt file to be a dict with keys: ['g', 'g_ema', 'd', 'latent_avg']. StyleGAN2-Ada saves a pkl containing a dict with the following keys: ['G', 'G_ema', 'D', 'augment_pipe']. You might be able to get things to work by loading the contents of your pkl file and resaving them in pt using these keys.
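For example, here is a minimal sketch of that remapping. It assumes the stylegan2-ada-pytorch repo is on sys.path (its dnnlib and torch_utils modules are needed to unpickle the file), that the average latent lives at mapping.w_avg, and that GANspace wants state_dicts; the parameter names inside the state dicts may still need remapping to match GANspace's loader.

import pickle
import torch

# Load the stylegan2-ada-pytorch snapshot (keys 'G', 'G_ema', 'D', ...).
with open("network-snapshot-025000.pkl", "rb") as f:
    data = pickle.load(f)

# Resave under the keys GANspace's converter expects: g, g_ema, d, latent_avg.
converted = {
    "g": data["G"].state_dict(),
    "g_ema": data["G_ema"].state_dict(),
    "d": data["D"].state_dict(),
    "latent_avg": data["G_ema"].mapping.w_avg,  # attribute name is an assumption
}
torch.save(converted, "network-snapshot-025000.pt")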

Distributing DCGAN with horovod on Sagemaker

I am trying to distribute my workload to multiple GPUs with AWS Sagemaker. I am using a custom algorithm for a DCGAN with tensorflow 2.0. The code thus far works perfectly on a single GPU. I decided to implement the same code with horovod distribution across multiple GPUs to reduce run time. The code, when changed from the original to horovod, seems to work the same, and the training time is roughly the same. However, when I print out hvd.size() I only get a size of 1, regardless of the multiple GPUs present. Tensorflow recognizes all the present GPUs; Horovod does not.
I've tried running my code on both Sagemaker and on an EC2 instance in a docker container, and in both environments the same issue persists.
Here is a link to my GitHub repo:
Here
I've also tried using an entirely different neural network from the horovod repository, updated to tf2.0:
hvdmnist
At this point I am only trying to get the GPUs within one instance to be utilized; I am not trying to utilize multiple instances.
I think I might be missing a dependency of some sort in the docker image; either that, or there is some prerequisite command for me to run. I don't really know.
Thanks.
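One sanity check worth running: Horovod only reports a size greater than 1 when the script is started through a launcher such as horovodrun (or the equivalent mpirun command); starting it with plain python always yields a size of 1. A minimal sketch (script name and process count are illustrative):

import horovod.tensorflow as hvd

# Run as: horovodrun -np 4 python check_hvd.py
# With plain `python check_hvd.py` this prints "rank 0 of 1".
hvd.init()
print("rank %d of %d" % (hvd.rank(), hvd.size()))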

keras + scikit-learn wrapper appears to hang when GridSearchCV runs with n_jobs > 1

UPDATE: I have to re-write this question, as after some investigation I realised that this is a different problem.
Context: running keras in a grid search setting using the KerasClassifier wrapper with scikit-learn. Sys: Ubuntu 16.04; libraries: Anaconda distribution 5.1, keras 2.0.9, scikit-learn 0.19.1, tensorflow 1.3.0 or theano 0.9.0, using CPUs only.
Code:
I simply used the code here for testing: https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/, the second example 'Grid Search Deep Learning Model Parameters'. Pay attention to line 35, which reads:
grid = GridSearchCV(estimator=model, param_grid=param_grid)
Symptoms: When grid search uses more than 1 job (meaning CPUs?), e.g. setting n_jobs on the line above to 2, as below:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=2)
will cause the code to hang indefinitely, with either tensorflow or theano, and there is no CPU usage (see attached screenshot, where 5 python processes were created but none is using CPU).
By debugging, it appears to be the following line in sklearn.model_selection._search (line 648) that causes problems:

for parameters, (train, test) in product(candidate_params, cv.split(X, y, groups))

The program hangs on this line and cannot continue.
I would really appreciate some insights as to what this means and why this could happen.
Thanks in advance
Are you using a GPU? If so, you can't have multiple threads running each variation of the params because they won't be able to share the GPU.
Here's a full example on how to use keras, sklearn wrappers in a Pipeline with GridsearchCV: Pipeline with a Keras Model
If you really want to have multiple jobs in the GridSearchCV, you can try to limit the GPU fraction used by each job (e.g. if each job only allocates 0.5 of the available GPU memory, you can run 2 jobs simultaneously; a sketch follows the links below).
See these issues:
Limit the resource usage for tensorflow backend
GPU memory fraction does not work in keras 2.0.9 but it works in 2.0.8
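A minimal sketch of that memory cap (TF 1.x with the standalone Keras API; the 0.5 fraction is illustrative):

import tensorflow as tf
from keras import backend as K

# Cap this process at half the GPU memory so two GridSearchCV jobs
# can share one GPU.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
K.set_session(tf.Session(config=config))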
I dealt with this problem too and it really slowed me down not being able to run what is essentially trivially-parallelizable code. The issue is indeed with the tensorflow session. If a session is created in the parent process before GridSearchCV.fit(), it will hang!
The solution for me was to keep all session/graph creation code restricted to the KerasClassifier class and the model creation function I passed to it, as in the sketch below.
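A minimal sketch of that restriction (hypothetical model and parameters; the point is that nothing TensorFlow-related runs at module level, so each joblib worker builds its own session inside build_fn):

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def create_model():
    # All graph/session creation happens inside this function.
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

model = KerasClassifier(build_fn=create_model, epochs=10, batch_size=16, verbose=0)
grid = GridSearchCV(estimator=model, param_grid={"batch_size": [16, 32]}, n_jobs=2)
# grid.fit(X, y)  # X, y: your training data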
Also, what Felipe said about the memory is true: you will want to restrict the memory usage of TF in either the model creation function or a subclass of KerasClassifier.
Related info:
Session hang issue with python multiprocessing
Keras + Tensorflow and Multiprocessing in Python
TL;DR Answer: You can't, because your Keras model can't be serialized, and serialization is needed for parallelizing in Python with joblib.
This problem is much detailed here: https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-you-can-t-parallelize-nor-save-pipelines-using-steps-that-can-t-be-serialized-as-is-by-joblib
The solution to parallelize your code is to make your Keras estimator serializable. This can be done using savers as described at the link above.
If you're lucky enough to be using TensorFlow v2's prebuilt Keras module, the following practical code sample will prove useful, as you'd practically just need to take the code and modify it with yours:
https://github.com/guillaume-chevalier/seq2seq-signal-prediction
In this example, all the saving and loading code is pre-written for you using Neuraxle-TensorFlow, and this makes it parallelizable if you use Neuraxle's AutoML methods (e.g., Neuraxle's grid search and Neuraxle's own parallelism features).

How to check that Tensorflow graph rewrites that use MKL occur?

From looking at Tensorflow code, some MKL optimizations are done by a graph rewrite that replaces sets of nodes with fused functions that use MKL. I tried to look for the rewrites with tf.logging.set_verbosity(1) but never see the log messages I expect.
I have built Tensorflow from source on CPU with MKL and XLA enabled. I think the build is using MKL because I can use the 'NCHW' data format for tf.nn.conv2d and tf.nn.bias_add in the forward pass if they occur together. It also runs faster and fully utilises the CPU. The backward pass, though, errors saying that "CPU BiasGradOp only supports NHWC", even though it looks like MKL functions exist to fuse Conv2D and BiasAdd both forwards and backwards with 'NCHW'. So I want to look directly for the rewrites.
How can I see if the graph rewrites are happening?
One way is to use the timeline/trace feature; you can follow this StackOverflow answer. If MKL is used, you will see nodes with names like _MklReshape or _MklConv2D.
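A minimal sketch of collecting such a trace (TF 1.x API; the toy conv layer is illustrative):

import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1, 28, 28, 3])
y = tf.layers.conv2d(x, filters=8, kernel_size=3)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Run one step with full tracing and dump a Chrome trace; if the MKL
    # rewrite fired, the JSON contains op names like _MklConv2D.
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    metadata = tf.RunMetadata()
    sess.run(y, options=options, run_metadata=metadata)
    trace = timeline.Timeline(metadata.step_stats)
    with open("trace.json", "w") as f:
        f.write(trace.generate_chrome_trace_format())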
This is not specifically testing for graph rewrites, but you can check whether MKL is enabled in tensorflow by using:
tf.python.pywrap_tensorflow.IsMklEnabled()
From: https://github.com/tensorflow/tensorflow/issues/17176#issuecomment-371364155
Tensorflow has a debugger (tfdbg) with a tutorial here. The debugger prints a list of all graph nodes that will be visited by a session.run() before running it.
You can also explore the input tensors, output tensors, and the attributes of each node.
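A minimal sketch of wrapping a session with the tfdbg CLI (TF 1.x; the toy matmul is illustrative):

import tensorflow as tf
from tensorflow.python import debug as tf_debug

a = tf.constant([[1.0, 2.0]])
b = tf.constant([[3.0], [4.0]])
y = tf.matmul(a, b)

# Before each run the CLI lists every node that will execute, so any
# MKL-rewritten _Mkl* ops are visible by name.
sess = tf_debug.LocalCLIDebugWrapperSession(tf.Session())
print(sess.run(y))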
Ariel's answer also works to see the op types if you don't want to take the time to compile with tfdbg.
For v2.0.0+ the command is:
python -c "from tensorflow.python import pywrap_tensorflow; print(pywrap_tensorflow.IsMklEnabled())"
source: https://software.intel.com/en-us/forums/intel-optimized-ai-frameworks/topic/837000

How to load a tensorflow checkpoint by myself without the C++ API?

I am using tensorflow 1.0.
My production environment cannot build tensorflow-cpp because of its low gcc & glibc versions.
Is there any doc about how to load a checkpoint or frozen graph in C++ without the API?
1. How to save the network parameters? (embeddings...)
2. How to save the graph structure? (layers, weights...)
There is no documentation on doing this that I know of. Loading a checkpoint without the C++ runtime won't be very useful to you because you won't be able to run it.
The checkpoint by default does not include the graph structure, but if you export a metagraph you will get it in a serialized protocol buffer format. Implementing a parser for this (and the weights checkpoint) yourself sounds difficult to get right and likely to break in the future.
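If a Python TensorFlow install is available somewhere (e.g. on a dev machine), here is a minimal sketch for inspecting what a checkpoint actually contains before writing any parser (TF 1.x; the path is illustrative):

from tensorflow.python import pywrap_tensorflow

# Read raw variables from a checkpoint without rebuilding or running the
# graph; the graph structure itself lives in the separate .meta
# (MetaGraphDef) protocol buffer written by tf.train.Saver().save().
reader = pywrap_tensorflow.NewCheckpointReader("/tmp/model.ckpt")
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)
    value = reader.get_tensor(name)  # a numpy array of the stored weights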