Google Cloud Dataflow: Different behavior for DirectRunner versus DataFlowRunner when using argparse

I am building a Google Cloud Dataflow pipeline to process videos. I am having a very hard time debugging the pipeline because the environment seems to behave differently on DirectRunner versus DataflowRunner.
My video processing tool (called DeepMeerkat below) takes in arguments from argparse. I call the pipeline:
python \
--runner DataFlowRunner \
--project $PROJECT \
--staging_location $BUCKET/staging \
--temp_location $BUCKET/temp \
--job_name $PROJECT-deepmeerkat \
--setup_file ./ \
--maxNumWorkers 3 \
--tensorflow \
The last two arguments, tensorflow and training, are both for my pipeline; the rest are needed for Cloud Dataflow.
I parse the args and pass the argv to the pipeline
and then within DeepMeerkat's argparse, parse just the known args.
This works perfectly locally: tensorflow is turned off (default is on) and training is turned on (default is off). Printing args confirms the behavior. But on Cloud Dataflow the arguments are not parsed: tensorflow stays on and training stays off.
DeepMeerkat args: Namespace(tensorflow=False, training=True)
From the logging of DataFlowRunner:
DeepMeerkat args: Namespace(tensorflow=True, training=False)
Any ideas of what's going on here? Identical commands, identical code, just changing DirectRunner to DataFlowRunner.
I'd rather not go down the road of passing custom arguments to pipeline options, since I would then need to assign them somehow downstream. If one already has a tool that parses arguments, this seems like a much more straightforward solution, provided there isn't something special about the Dataflow worker.

I had the wrong conceptual model for this. Locally, each "worker" still has access to sys args, so it was not that the runner behavior was different, but rather the "worker" was circumventing the cloud pipeline and grabbing new args to parse. The way to do this in DataFlowRunner is to explicitly pass pipeline args to your DoFN function using an
. Then parse those args internally within the beam pipeline as if they came from a string.
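A minimal sketch of the idea in the answer above, not the author's actual code. The class name ParseArgsFn, the --training flag name, and the store_false/store_true actions are assumptions made for illustration. In a real pipeline the class would subclass apache_beam.DoFn and be applied with beam.ParDo; it is a plain class here so the snippet runs without Beam installed. The point is that the argument list is captured when the pipeline is built and travels with the function object, so the worker never has to read its own sys.argv.

```python
import argparse


def parse_tool_args(argv):
    """Parse only the tool's own flags and ignore the runner's flags."""
    parser = argparse.ArgumentParser()
    # Assumed flag semantics: tensorflow defaults to on, training defaults to off.
    parser.add_argument("--tensorflow", action="store_false")
    parser.add_argument("--training", action="store_true")
    args, _unknown = parser.parse_known_args(argv)
    return args


class ParseArgsFn(object):
    """Stand-in for a beam.DoFn that receives the argument list explicitly."""

    def __init__(self, argv):
        # Captured at pipeline construction time and shipped with the object.
        self.argv = list(argv)

    def process(self, element):
        # Parse the stored list instead of whatever sys.argv holds on the worker.
        args = parse_tool_args(self.argv)
        yield element, args.tensorflow, args.training
```

With Beam available, the same object would be handed to the pipeline as something like beam.ParDo(ParseArgsFn(argv)), where argv is the list received by the launching script.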


Incorrect Broadcast input array shape error when trying to use Pretraining

I am trying to use spaCy's 'pre-train' feature for an NER task. Here is what I have tried so far (I am still trying to get it working):
Step 1: I started by initializing the model with 'en_core_web_lg'. Next I saved this model to disk and tested its NER capability on a few lines to see whether it recognizes the tags in those test lines. (I made a note of the ignored tags.)
Step 2: Next I created a .jsonl file with new data to train on (about 20 new lines). I wanted to see whether, given new data around an entity (the ignored tags found earlier), the model would be able to identify the tags correctly after transfer learning. Using this .jsonl file and the model I saved earlier, I ran the 'spacy pre-train' command, which created a token2vec .bin file for me (model999.bin).
Step 3: Next I created a function that takes the location of the model saved in step 1 and the location of the token2vec file (model999.bin, obtained in step 2). Inside the function it loads the model, creates or gets the pipe, disables the rest of the pipes, and uses (pipe_name).model.tok2vec.from_bytes to read model999.bin and load the learned weights into the base model.
But when I run this function, I get this error:
ValueError: could not broadcast input array from shape (96,3,384) into shape (96,3,480)
(I have uploaded the entire notebook here: [ ]).
In order to pre-train I used this command:
python -m spacy pretrain ub.jsonl model_saves w2s
Here are the 20 lines I tried training on top of the base model
[ ]
What exactly am I doing wrong here? Could you also point out the fix? I am sure many others would benefit from insight on this.
Operating System: CentOS
Python Version Used: 3.7.3
spaCy Version Used: 2.1.3
Environment Information: Anaconda Jupyter Lab
I was able to fix this; the developer (on GitHub) answered my question.
Here is the answer:

How to report algorithm running time?

I am running a variational auto-encoder in TensorFlow, which could take a long time. Thus I want to report the time the algorithm has been running for as a scalar on TensorBoard.
One dirty way is to hard-code the start time of the compilation into a global variable, or pass it as an argument to the model function and compute the difference with current time.
Does Tensorflow have a native way to do it?
There is the tf.train.ProfilerHook. Comes with release 1.14.
Example usage:
estimator = tf.estimator.LinearClassifier(...)
hooks = [tf.train.ProfilerHook(output_dir=model_dir, save_secs=600, show_memory=False)]
estimator.train(input_fn=train_input_fn, hooks=hooks)
Executing the hook will generate files timeline-xx.json in output_dir.
Then open chrome://tracing/ in the Chrome browser and load the file. You will get a timeline of time usage.
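The manual approach mentioned in the question (remember the start time and subtract it from the current time) can be sketched as follows. This is only an illustration: the RunClock class and its injectable now parameter are hypothetical names, and writing the value to TensorBoard as a scalar is left out.

```python
import time


class RunClock(object):
    """Remembers when it was created and reports the seconds elapsed since then."""

    def __init__(self, now=time.time):
        # `now` is injectable so the clock can be replaced in tests.
        self._now = now
        self._start = now()

    def elapsed_seconds(self):
        return self._now() - self._start
```

Creating one RunClock before training starts and passing it to the model function avoids the global variable while keeping the same arithmetic.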

Predict value of single image after training model on TPU

I still want to know how I can predict the value of an image after training the network, but it seems like it is not supported yet. Any idea for a workaround (taken from the
if mode == tf.estimator.ModeKeys.PREDICT:
    raise RuntimeError("mode {} is not supported yet".format(mode))
Besides Stack Overflow, is there anywhere else I can get support for implementing my models on TPU?
Here is a Python program that sends an image to a TPU-trained model (ResNet in this case) and gets back a classification:
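import base64
import tensorflow as tf
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
with tf.gfile.FastGFile('/some/path.jpg', 'rb') as ifp:
    credentials = GoogleCredentials.get_application_default()
    # Assumed reconstruction: the client constructor and the read call were lost from the post.
    api = discovery.build('ml', 'v1', credentials=credentials)
    request_data = {'instances': [
        {"image_bytes": {"b64": base64.b64encode(ifp.read()).decode('utf-8')}}]}
    parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, MODEL, VERSION)
    response = api.projects().predict(body=request_data, name=parent).execute()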
Full code is here:
This article documents the process of writing a model for the Cloud TPU:
It is supported now. Changes have been done to to make it working.
Besides stackoverflow, you can add your issues on github
According to the documentation, you can choose online or batch modes for prediction, but you can't select the target device. As stated, "the prediction service allocates resources to run your job."
The documentation says that prediction is performed by nodes. I thought I'd read somewhere that prediction nodes are always CPUs in the Google Compute Engine, but I can't find a clear reference.

Tensorflow Estimator - Periodic Evaluation on Eval Dataset

The tensorflow documentation does not provide any example of how to perform a periodic evaluation of the model on an evaluation set.
Some people suggested the use of an Experiment, which sounds great but unfortunately does not work (it is deprecated and triggers an error).
Others suggested the use of SummarySaverHook, but I don't see how you can use that with an evaluation set (as opposed to the training set).
A solution would be to do the following
for i in range(number_of_epoch):
    estimator.train(...)  # on training set
    estimator.evaluate(...)  # on evaluation set
This architecture is explicitly discouraged in this paper (page 4 top right).
Any other idea/implementation?
The error message when running the experiment is the following:
File ".../anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/", line 253, in train if (config.environment != run_config.Environment.LOCAL and
AttributeError: 'RunConfig' object has no attribute 'environment'
Tensorflow version 1.3
Only a few parameters/options of Experiment are deprecated (what specific errors are you seeing?). If you create an Estimator that will do periodic checkpoints (using options in RunConfig) and an Experiment using it, you will get evaluation for each checkpoint by default when using train_and_evaluate method.
EDIT: As Maxime pointed out in the comments, he needed to add the following lines to get rid of his error:
os.environ['TF_CONFIG'] = json.dumps({'environment': 'local'})
config = tf.contrib.learn.RunConfig()

Training custom dataset with translate model

Running the model out of the box generates these files in the data dir :
dev-v2.tgz newstest2013.en
giga-fren.release2.fixed.en newstest2013.en.ids40000
giga-fren.release2.fixed.en.ids40000 training-giga-fren.tar vocab40000.from
Reading the src of :
"from_train_data", None, "Training data.")
"to_train_data", None, "Training data.")
To use my own training data I created the dirs my-from-train-data and my-to-train-data and added my own training data to each of them. The training data is contained in the files mydata.from &
my-to-train-data contains mydata.from
my-from-train-data contains
I could not find documentation as to using own training data or what format it should take so I inferred this from the src and contents of data dir created when executing translate model out of the box.
Contents of mydata.from :
Is this a question
Contents of :
I then attempt to train the model using :
python --from_train_data my-from-train-data --to_train_data my-to-train-data
This returns with an error :
tensorflow.python.framework.errors_impl.NotFoundError: my-from-train-data.ids40000
It appears I need to create the file my-from-train-data.ids40000. What should its contents be? Is there an example of how to train this model using custom data?
Great question, training a model on your own data is way more fun than using the standard data. An example of what you could put in the terminal is:
python --from_train_data mydatadir/ --to_train_data mydatadir/to_translate.out --from_dev_data mydatadir/ --to_dev_data mydatadir/test_to_translate.out --train_dir train_dir_model --data_dir mydatadir
What goes wrong in your example is that you are not pointing to a file, but to a folder. from_train_data should always point to a plaintext file, whose rows should be aligned with those in the to_train_data file.
Also: as soon as you run this script with sensible data (more than one line ;) ), it will generate your ids files (40,000 if from_vocab_size and to_vocab_size are not set). It is important to know that these files are created in the folder specified by data_dir. If you do not specify one, they are generated in /tmp (I prefer them in the same place as my data).
Hope this helps!
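The alignment requirement in the answer above (each row of the from file matches the same row of the to file) can be sanity-checked before training. A minimal sketch, assuming both files are UTF-8 plaintext; the function names are made up for illustration and are not part of the translate model's code.

```python
import io


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with io.open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel_files(from_path, to_path):
    """Return the shared line count, or raise ValueError if the lengths differ."""
    n_from = count_lines(from_path)
    n_to = count_lines(to_path)
    if n_from != n_to:
        raise ValueError("%s has %d lines but %s has %d lines"
                         % (from_path, n_from, to_path, n_to))
    return n_from
```

Equal line counts do not prove the sentences are correctly paired, but a mismatch is a sure sign the files are not aligned.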
Quick answer to :
It appears I need to create the file my-from-train-data.ids40000. What should its contents be? Is there an example of how to train this model using custom data?
Yes, that's the vocab/ word-id file missing, which is generated when preparing to create the data.
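To give a rough idea of what such a word-id file holds, here is a simplified sketch of building a vocabulary and turning sentences into integer ids. It is not the tutorial's actual code: the helper names are invented, tokenization is plain whitespace splitting, and the four special tokens at the start of the vocabulary are an assumption about the tutorial's convention. Each line of the ids file would then be the space-separated ids of one sentence.

```python
from collections import Counter

# Special tokens reserved at the start of the vocabulary (assumed convention).
START_VOCAB = ["_PAD", "_GO", "_EOS", "_UNK"]
UNK_ID = START_VOCAB.index("_UNK")


def build_vocab(lines, max_size):
    """Map the most frequent words to ids, keeping at most max_size entries."""
    counts = Counter(word for line in lines for word in line.split())
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    words = sorted(counts, key=lambda w: (-counts[w], w))
    vocab_list = (START_VOCAB + words)[:max_size]
    return {word: idx for idx, word in enumerate(vocab_list)}


def sentence_to_ids(sentence, vocab):
    """Replace each word by its id, falling back to the _UNK id."""
    return [vocab.get(word, UNK_ID) for word in sentence.split()]
```

A smaller max_size pushes more words onto the _UNK id, which is why the vocabulary size appears in the generated file names.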
Here is a tutorial from the TensorFlow documentation.
Quick overview of the files, and why you might be confused by the files that are output versus the ones to use:
python/ops/ >> Library for building sequence-to-sequence models.
models/rnn/translate/ >> Neural translation sequence-to-sequence model.
models/rnn/translate/ >> Helper functions for preparing translation data.
models/rnn/translate/ >> Binary that trains and runs the translation model.
The Tensorflow file requires several files to be generated when using your own corpus to translate.
The corpus needs to be aligned, meaning that line 1 of the file in one language corresponds to line 1 of the file in the other language. This allows the model to do encoding and decoding.
Make sure the vocabulary has been generated from the dataset using this file:
Check these steps:
--data_dir [your_data_directory] --train_dir [checkpoints_directory]
--en_vocab_size=40000 --fr_vocab_size=40000
Note! If the Vocab-size is lower, then change that value.
There is a longer discussion here tensorflow/issues/600
If all else fails, check out this ByteNet implementation in Tensorflow which does translation task as well.