Code order influences the final result - tensorflow

I encountered a problem that the code order influences the final result. At first, the code works. After I move one line, tensorflow generates an error.
For example,
working version:
probs = net.get_output()
label_node = tf.placeholder(tf.int32, name='label_node')
top_1_op = tf.nn.in_top_k(probs, label_node, 1)
top_5_op = tf.nn.in_top_k(probs, label_node, 5)
threads = image_producer.start(session=sess, coordinator=coordinator)
for (labels, images) in image_producer.batches(sess):
top_1_result, top_5_result = sess.run([top_1_op, top_5_op],
feed_dict={input_node: images, label_node: labels})
Non-working version:
threads = image_producer.start(session=sess, coordinator=coordinator) # move here
probs = net.get_output()
label_node = tf.placeholder(tf.int32, name='label_node')
top_1_op = tf.nn.in_top_k(probs, label_node, 1)
top_5_op = tf.nn.in_top_k(probs, label_node, 5)
for (labels, images) in image_producer.batches(sess):
top_1_result, top_5_result = sess.run([top_1_op, top_5_op],
feed_dict={input_node: images, label_node: labels})
Tensorflow generates an error
"tensorflow.python.framework.errors.NotFoundError: FeedInputs: unable to find feed output label_node:0".
As you can see, tensorflow should be able to find "label_node:0". Actually, tensorflow cannot find top_1_op and top_5_op either.
The content of image_producer.start is something similar to:
op_A = ...
queue_runner = tf.train.QueueRunner(queue_B, [op_B] * num_concurrent)
session.run(op_A)
t = queue_runner.create_threads(session, coord=coordinator, start=True)
A more strange thing is that in the non-workable version, after I add two lines in image_producer.start, the code works again. For example, image_producer.start becomes
op_C = ... # new
session.run(op_C) # new
op_A = ...
queue_runner = tf.train.QueueRunner(queue_B, [op_B] * num_concurrent)
session.run(op_A)
t = queue_runner.create_threads(session, coord=coordinator, start=True)
Does anyone have an idea about possible causes of this problem? Or any idea about how to debug this?

It sounds like you are suffering from a bug that was fixed after TensorFlow 0.9.0 was released. In that version (and earlier) TensorFlow suffered from a race condition that could lead to unrecoverable errors if you modified the graph after queue runners (or other threads calling sess.run()) had started. The only workaround in version 0.9.0 is to start
the queue runners (i.e. the image_producer in your code) after the graph has been completely constructed.

Related

Loading dataset/dataloader object onto GPU

I am running code from another repository, but my issue is general so I am posting it here. Running their code, I get the error along the lines of Expected all tensors to be on the same device, found two: cpu and cuda:0. I have already verified that the model is on cuda:0; the issue is that the dataloader object used is not set to the device. Also, the dataset/models I use here are huggingface-transformers models and huggingface datasets.
Here is the relevant block of code where the issue arises:
eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset
eval_dataloader = self.get_eval_dataloader(eval_dataset)
eval_examples = self.eval_examples if eval_examples is None else eval_examples
compute_metrics = self.compute_metrics
self.compute_metrics = None
eval_loop = (self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop)
try:
#this is where the error occurs
output = eval_loop(
eval_dataloader,
description="Evaluation",
prediction_loss_only=True if compute_metrics is None else None,
ignore_keys=ignore_keys,
)
For context, this occurs inside an evaluate() method of a class inheriting from Seq2SeqTrainer from huggingface. I have tried using something like
for i, (inputs, labels) in eval_dataloader:
inputs, labels = inputs.to(device), labels.to(device)
But that doesn't work (it gives an error of Too many values to unpack (expected 2). Is there any other way I can send this dataloader to the GPU? In particular, is there any way I can edit the evaluation_loop method of Transformers Trainer to move the batches to the GPU or something?

Tf-agent Actor/Learner: TFUniform ReplayBuffer dimensionality issue - invalid shape of Replay Buffer vs. Actor update

I try to adapt the this tf-agents actor<->learner DQN Atari Pong example to my windows machine using a TFUniformReplayBuffer instead of the ReverbReplayBuffer which only works on linux machine but I face a dimensional issue.
[...]
---> 67 init_buffer_actor.run()
[...]
InvalidArgumentError: {{function_node __wrapped__ResourceScatterUpdate_device_/job:localhost/replica:0/task:0/device:CPU:0}} Must have updates.shape = indices.shape + params.shape[1:] or updates.shape = [], got updates.shape [84,84,4], indices.shape [1], params.shape [1000,84,84,4] [Op:ResourceScatterUpdate]
The problem is as follows: The tf actor tries to access the replay buffer and initialize the it with a certain number random samples of shape (84,84,4) according to this deepmind paper but the replay buffer requires samples of shape (1,84,84,4).
My code is as follows:
def train_pong(
env_name='ALE/Pong-v5',
initial_collect_steps=50000,
max_episode_frames_collect=50000,
batch_size=32,
learning_rate=0.00025,
replay_capacity=1000):
# load atari environment
collect_env = suite_atari.load(
env_name,
max_episode_steps=max_episode_frames_collect,
gym_env_wrappers=suite_atari.DEFAULT_ATARI_GYM_WRAPPERS_WITH_STACKING)
# create tensor specs
observation_tensor_spec, action_tensor_spec, time_step_tensor_spec = (
spec_utils.get_tensor_specs(collect_env))
# create training util
train_step = train_utils.create_train_step()
# calculate no. of actions
num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1
# create agent
agent = dqn_agent.DqnAgent(
time_step_tensor_spec,
action_tensor_spec,
q_network=create_DL_q_network(num_actions),
optimizer=tf.compat.v1.train.RMSPropOptimizer(learning_rate=learning_rate))
# create uniform replay buffer
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
data_spec=agent.collect_data_spec,
batch_size=1,
max_length=replay_capacity)
# observer of replay buffer
rb_observer = replay_buffer.add_batch
# create batch dataset
dataset = replay_buffer.as_dataset(
sample_batch_size=batch_size,
num_steps = 2,
single_deterministic_pass=False).prefetch(3)
# create callable function for actor
experience_dataset_fn = lambda: dataset
# create random policy for buffer init
random_policy = random_py_policy.RandomPyPolicy(collect_env.time_step_spec(),
collect_env.action_spec())
# create initalizer
init_buffer_actor = actor.Actor(
collect_env,
random_policy,
train_step,
steps_per_run=initial_collect_steps,
observers=[replay_buffer.add_batch])
# initialize buffer with random samples
init_buffer_actor.run()
(The approach is using the OpenAI Gym Env as well as the corresponding wrapper functions)
I worked with keras-rl2 and tf-agents without actor<->learner for other atari games to create the DQN and both worked quite well afer a some adaptions. I guess my current code will also work after a few adaptions in the tf-agent libary functions, but that would obviate the purpose of the libary.
My current assumption: The actor<->learner methods are not able to work with the TFUniformReplayBuffer (as I expect them to), due to the missing support of the TFPyEnvironment - or I still have some knowledge shortcomings regarding this tf-agents approach
Previous (successful) attempt:
from tf_agents.environments.tf_py_environment import TFPyEnvironment
tf_collect_env = TFPyEnvironment(collect_env)
init_driver = DynamicStepDriver(
tf_collect_env,
random_policy,
observers=[replay_buffer.add_batch],
num_steps=200)
init_driver.run()
I would be very grateful if someone could explain me what I'm overseeing here.
I fixed it...partly, but the next error is (in my opinion) an architectural problem.
The problem is that the Actor/Learner setup is build on a PyEnvironment whereas the
TFUniformReplayBuffer is using the TFPyEnvironment which ends up in the failure above...
Using the PyUniformReplayBuffer with a converted py-spec solved this problem.
from tf_agents.specs import tensor_spec
# convert agent spec to py-data-spec
py_collect_data_spec = tensor_spec.to_array_spec(agent.collect_data_spec)
# create replay buffer based on the py-data-spec
replay_buffer = py_uniform_replay_buffer.PyUniformReplayBuffer(
data_spec= py_collect_data_spec,
capacity=replay_capacity*batch_size
)
This snippet solved the issue of having an incompatible buffer in the background but ends up in another issue
--> The add_batch function does not work
I found this approach which advises to use either a batched environment or to make the following adaptions for the replay observer (add_batch method).
from tf_agents.utils.nest_utils import batch_nested_array
#********* Adpations add_batch method - START *********#
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
#********* Adpations add_batch method - END *********#
# create batch dataset
dataset = replay_buffer.as_dataset(
sample_batch_size=32,
single_deterministic_pass=False)
experience_dataset_fn = lambda: dataset
This helped me to solve the issue regarding this post but now I run into another problem where I need to ask someone of the tf-agents-team...
--> It seems that the Learner/Actor structure is no able to work with another buffer than the ReverbBuffer, because the data-spec which is processed by the PyUniformReplayBuffer sets up a wrong buffer structure...
For anyone who has the same problem: I just created this Github-Issue report to get further answers and/or fix my lack of knowledge.
the full fix is shown below...
--> The dimensionality issue was valid and should indicate the the (uploaded) batched samples are not in the correct shape
--> This issue happens due to the fact that the "add_batch" method loads values with the wrong shape
rb_observer = replay_buffer.add_batch
Long story short, this line should be replaced by
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
--> Afterwards the (replay buffer) inputs are of correct shape and the Learner Actor Setup starts training.
The full replay buffer is shown below:
# create buffer for storing experience
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
agent.collect_data_spec,
1,
max_length=1000000)
# create batch dataset
dataset = replay_buffer.as_dataset(
sample_batch_size=32,
num_steps = 2,
single_deterministic_pass=False).prefetch(4)
# create batched nested array input for rb_observer
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
# create batched readout of dataset
experience_dataset_fn = lambda: dataset

Memory leak when running universal-sentence-encoder-large itterating on dataframe

I have 140K sentences I want to get embeddings for. I am using TF_HUB Universal Sentence Encoder and am iterating over the sentences(I know it's not the best way but when I try to feed over 500 sentences into the model it crashes).
My Environment is:
Ubuntu 18.04
Python 3.7.4
TF 1.14
Ram: 16gb
processor: i-5
my code is:
version 1
I iterate inside the tf.session context manager
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')
with tf.compat.v1.Session() as session:
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
sentence_embedding = None
for i, row in df.iterrows():
sentence = row['content']
embeddings = embed([sentence])
sentence_embedding = session.run(embeddings)
df.at[i, 'embedding'] = sentence_embedding
print('processed index:', i)
version 2
I open and close a session within each iteration
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')
for i, row in df.iterrows():
sentence = row['content']
embeddings = embed([sentence])
sentence_embedding = None
with tf.compat.v1.Session() as session:
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
sentence_embedding = session.run(embeddings)
df.at[i, 'embedding'] = sentence_embedding
print('processed index:', i)
While version 2 does seem to have some sort of GC and memory is cleared a bit. It still goes over 50 items and explodes.
version 1 just goes on gobbling memory.
The correct solution as given by arnoegw
def calculate_embeddings(dataframe, table_name):
sql_get_sentences = "SELECT * FROM semantic_similarity.sentences WHERE embedding IS NULL LIMIT 1500"
sql_update = 'UPDATE {} SET embedding = data.embedding FROM (VALUES %s) AS data(id, embedding) WHERE {}.id = data.id'.format(table_name, table_name)
df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
with hub.eval_function_for_module("https://tfhub.dev/google/universal-sentence-encoder-large/3") as embed:
while len(df) >= 0:
sentence_array = df['content'].values
sentence_embeddings = embed(sentence_array)
df['embedding'] = sentence_embeddings.tolist()
values = [tuple(x) for x in df[['id', 'embedding']].values]
pandas_repository.update_db_from_df('semantic_similarity.sentences', sql_update, values)
df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
I am a newbee to TF and can use any help I can get.
Your code uses tf.Session, so it falls under the TF1.x programming model of first building a dataflow graph and then running it repeatedly with inputs being fed and outputs being fetched from the graph.
But your code does not align well with that programming model. Both versions keep adding new applications of (calls to) the hub.Module to the default TensorFlow graph instead of applying it once and running the same graph repeatedly for the various inputs. Version 2 keeps going into and out of tf.Sessions, which frees some memory but is very inefficient.
Please see my answer to "Strongly increasing memory consumption when using ELMo from Tensorflow-Hub" for guidance how to do it right in the graph-based programming model of TensorFlow 1.x.
TensorFlow 2.0, which is going to be released soon, defaults to the programming model of "eager execution", which does away with graphs and sessions and would have avoided this confusion. TensorFlow Hub will be updated in due course for TF2.0. For a preview close to your use-case, see https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_text_classification.ipynb

TensorFlow example, MemoryError while run text_classification_character_cnn.py

I'm trying to run https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification_character_cnn.py for learning, but I get an error message:
File "C:\Users\natlun\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py", line 72, in load_csv_without_header
data = np.array(data)
MemoryError
I use CPU installation of TensorFlow and Python 3.5. Any ideas how to solve the problem?? Other scripts using a csv-file for input work fine.
I was having the same issue. And after many hours of reading and googling (and seeing your unanswered question), and just comparing the example with other examples that do run, I noticed that
dbpedia = tf.contrib.learn.datasets.load_dataset(
'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data, size='large')
should just be
dbpedia = tf.contrib.learn.datasets.load_dataset(
'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
Based off of what I've read about numpy, I'd bet the "size='large'" parameter causes an over allocation to a numpy array (which throws the memory error).
Or, when you don't set that parameter perhaps the input data is truncated.
Or some other thing. Anyway, I hope this helps others attempting to run this useful example!
--- Update ---
Without "size='large'" the load_dataset functions appears to create smaller training and test data sets (like 1/1000 the size).
After playing around with the example I realized I could manually load and use the whole data set without getting the memory error (assume it is saving the whole data set as it appears).
# Prepare training and testing data
##This was the provided method for setting up the data.
# dbpedia = tf.contrib.learn.datasets.load_dataset(
# 'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
# x_trainz = pandas.DataFrame(dbpedia.train.data)[1]
# y_trainz = pandas.Series(dbpedia.train.target)
# x_testz = pandas.DataFrame(dbpedia.test.data)[1]
# y_testz = pandas.Series(dbpedia.test.target)
##And this is my replacement.
x_train = []
y_train = []
x_test = []
y_test = []
with open("dbpedia_data/dbpedia_csv/train.csv", encoding='utf-8') as filex:
reader = csv.reader(filex)
for row in reader:
x_train.append(row[2])
y_train.append(int(row[0]))
with open("dbpedia_data/dbpedia_csv/test.csv", encoding='utf-8') as filex:
reader = csv.reader(filex)
for row in reader:
x_test.append(row[2])
y_test.append(int(row[0]))
x_train = pandas.Series(x_train)
y_train = pandas.Series(y_train)
x_test = pandas.Series(x_test)
y_test = pandas.Series(y_test)
The example seems to now be evaluating the whole training data set. But, the original code will probably need to be run once to get/put the data in the correct sub-folders. Also, even while evaluating the whole data set little memory is used (just a few hundred MB). Which, makes me think that the load_dataset function is broken in some way.

Restricting number of cores used

I'm trying to restrict the number of cores that a tf session uses but it's not working. This is how I'm initializing the session:
sess = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
intra_op_parallelism_threads=1,
use_per_session_threads=True))
The system has 12 cores / 24 threads, and I can see that 40-60% of them are being used at any given point in time. The system also has 8 GPUs, but I construct the whole graph with tf.device('/cpu:0').
UPDATE: To clarify, the graph itself is a simple LSTM-RNN, that hews very closely to the examples in the tf source code. For completeness here's the full graph:
node_input = tf.placeholder(tf.float32, [n_steps, batch_size, input_size], name = 'input')
list_input = [tf.reshape(i, (batch_size, input_size)) for i in tf.split(0, n_steps, node_input)]
node_target = tf.placeholder(tf.float32, [n_steps, batch_size, output_size], name = 'target')
node_target_flattened = tf.reshape(tf.transpose(node_target, perm = [1, 0, 2]), [-1, output_size])
node_max_length = tf.placeholder(tf.int32, name = 'batch_max_length')
node_cell_initializer = tf.random_uniform_initializer(-0.1, 0.1)
node_cell = LSTMCell(state_size, input_size, initializer = node_cell_initializer)
node_initial_state = node_cell.zero_state(batch_size, tf.float32)
nodes_output, nodes_state = rnn(node_cell,
list_input,
initial_state = node_initial_state,
sequence_length = node_max_length)
node_output_flattened = tf.reshape(tf.concat(1, nodes_output), [-1, state_size])
node_softmax_w = tf.Variable(tf.random_uniform([state_size, output_size]), name = 'softmax_w')
node_softmax_b = tf.Variable(tf.zeros([output_size]), name = 'softmax_b')
node_logit = tf.matmul(node_output_flattened, node_softmax_w) + node_softmax_b
node_cross_entropy = tf.nn.softmax_cross_entropy_with_logits(node_logit, node_target_flattened, name = 'cross_entropy')
node_loss = tf.reduce_mean(node_cross_entropy, name = 'loss')
node_optimizer = tf.train.AdamOptimizer().minimize(node_loss)
node_op_initializer = tf.initialize_all_variables()
One important thing to note is that if the first time I call tf.Session, I pass in the appropriate parameters, then the session does only run on a single core. The problem is that in subsequent runs, I am unable to change the behavior, even though I use use_per_session_threads which is supposed to specifically allow for session-specific settings. I.e. even after I close the session using sess.close() and start a new one with new options, the original behavior remains unchanged unless I restart the python kernel (which is very costly because it takes it nearly an hour to load my data).
use_per_session_threads will only affect the inter_op_parallelism_threads but not the intra_op_parallelism_threads. The intra_op_parallelism_threads will be used for the Eigen thread pool (see here) which is always global, thus subsequent sessions will not influence this anymore.
Note that there are other TF functions which can also trigger the initialization of the Eigen thread pool, so it can happen that it's already initialized before you create the first tf.Session. One example is tensorflow.python.client.device_lib.list_local_devices().
I solve this in a way that very early in my Python script, I create a dummy session with the appropriate values.
TensorFlow does an optimization where the first time a DirectSession is created it will create static thread pools which then will be reused. If you want to change this, specify multiple different thread pools in the session_inter_op_thread_pool flag and specify which one you want to use.
In tensorflow 2.3.2 I managed to limit cpus by using psutils lib.
I provided this the beginning of the function
pid = psutil.Process(os.getpid())
pid.cpu_affinity([0, 1])
The later call of model.fit utilized exactly 2 cpus