TensorFlow example: MemoryError while running text_classification_character_cnn.py - tensorflow

I'm trying to run https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification_character_cnn.py for learning, but I get an error message:
File "C:\Users\natlun\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py", line 72, in load_csv_without_header
data = np.array(data)
MemoryError
I use the CPU installation of TensorFlow with Python 3.5. Any ideas how to solve the problem? Other scripts that use a CSV file for input work fine.

I was having the same issue. After many hours of reading and googling (and seeing your unanswered question), and comparing the example with other examples that do run, I noticed that
dbpedia = tf.contrib.learn.datasets.load_dataset(
'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data, size='large')
should just be
dbpedia = tf.contrib.learn.datasets.load_dataset(
'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
Based on what I've read about NumPy, I'd bet the size='large' parameter causes an over-allocation for a NumPy array (which throws the MemoryError).
Or perhaps the input data is truncated when you don't set that parameter.
Or something else. Anyway, I hope this helps others attempting to run this useful example!
--- Update ---
Without size='large', the load_dataset function appears to create smaller training and test data sets (about 1/1000 the size).
After playing around with the example, I realized I could manually load and use the whole data set without getting the memory error (assuming the whole data set is being saved, as it appears to be).
import csv
import pandas
# Prepare training and testing data
## This was the provided method for setting up the data.
# dbpedia = tf.contrib.learn.datasets.load_dataset(
#     'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
# x_trainz = pandas.DataFrame(dbpedia.train.data)[1]
# y_trainz = pandas.Series(dbpedia.train.target)
# x_testz = pandas.DataFrame(dbpedia.test.data)[1]
# y_testz = pandas.Series(dbpedia.test.target)
## And this is my replacement.
x_train = []
y_train = []
x_test = []
y_test = []
# Column 0 holds the class label, column 2 the text.
with open("dbpedia_data/dbpedia_csv/train.csv", encoding='utf-8') as filex:
    reader = csv.reader(filex)
    for row in reader:
        x_train.append(row[2])
        y_train.append(int(row[0]))
with open("dbpedia_data/dbpedia_csv/test.csv", encoding='utf-8') as filex:
    reader = csv.reader(filex)
    for row in reader:
        x_test.append(row[2])
        y_test.append(int(row[0]))
x_train = pandas.Series(x_train)
y_train = pandas.Series(y_train)
x_test = pandas.Series(x_test)
y_test = pandas.Series(y_test)
The example now seems to evaluate the whole training data set. However, the original code will probably need to be run once to download the data into the correct sub-folders. Also, even while evaluating the whole data set, little memory is used (just a few hundred MB), which makes me think that the load_dataset function is broken in some way.
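A minimal sketch of the same manual loading written as a generator, assuming the column layout used above (class label in column 0, text in column 2). The function name iter_labeled_rows and its limit parameter are illustrative and not part of TensorFlow.

```python
import csv
from itertools import islice


def iter_labeled_rows(path, limit=None, label_col=0, text_col=2):
    """Yield (label, text) pairs one row at a time instead of building one big array."""
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        # islice stops after `limit` rows; limit=None reads the whole file.
        for row in islice(reader, limit):
            yield int(row[label_col]), row[text_col]
```

Passing a small limit gives a quick way to check the pipeline on a subset before loading everything.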

Related

Loading dataset/dataloader object onto GPU

I am running code from another repository, but my issue is general so I am posting it here. Running their code, I get the error along the lines of Expected all tensors to be on the same device, found two: cpu and cuda:0. I have already verified that the model is on cuda:0; the issue is that the dataloader object used is not set to the device. Also, the dataset/models I use here are huggingface-transformers models and huggingface datasets.
Here is the relevant block of code where the issue arises:
eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset
eval_dataloader = self.get_eval_dataloader(eval_dataset)
eval_examples = self.eval_examples if eval_examples is None else eval_examples
compute_metrics = self.compute_metrics
self.compute_metrics = None
eval_loop = (self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop)
try:
    # this is where the error occurs
    output = eval_loop(
        eval_dataloader,
        description="Evaluation",
        prediction_loss_only=True if compute_metrics is None else None,
        ignore_keys=ignore_keys,
    )
For context, this occurs inside an evaluate() method of a class inheriting from Seq2SeqTrainer from huggingface. I have tried using something like
for i, (inputs, labels) in eval_dataloader:
    inputs, labels = inputs.to(device), labels.to(device)
But that doesn't work (it gives the error Too many values to unpack (expected 2)). Is there any other way I can send this dataloader to the GPU? In particular, is there any way I can edit the evaluation_loop method of the Transformers Trainer to move the batches to the GPU?
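The unpacking error suggests that each batch is not an (inputs, labels) pair. With Hugging Face datasets a batch is typically a dict of tensors (for example input_ids, attention_mask and labels), although that is an assumption about this particular repository. A minimal, framework-agnostic sketch of moving such a batch to a device follows; move_to_device is an illustrative helper, not a Transformers API.

```python
def move_to_device(batch, device):
    """Recursively move every tensor-like object in `batch` to `device`."""
    if isinstance(batch, dict):
        return {key: move_to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(move_to_device(value, device) for value in batch)
    if hasattr(batch, "to"):
        # Objects such as torch.Tensor expose .to(device) and return the moved copy.
        return batch.to(device)
    return batch  # plain Python values (ints, strings, None) are left untouched
```

It would be applied per batch, for example `for batch in eval_dataloader: batch = move_to_device(batch, device)`, since the dataloader object itself has no device.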

Tf-agent Actor/Learner: TFUniform ReplayBuffer dimensionality issue - invalid shape of Replay Buffer vs. Actor update

I am trying to adapt this tf-agents actor<->learner DQN Atari Pong example to my Windows machine, using a TFUniformReplayBuffer instead of the ReverbReplayBuffer (which only works on Linux), but I face a dimensionality issue.
[...]
---> 67 init_buffer_actor.run()
[...]
InvalidArgumentError: {{function_node __wrapped__ResourceScatterUpdate_device_/job:localhost/replica:0/task:0/device:CPU:0}} Must have updates.shape = indices.shape + params.shape[1:] or updates.shape = [], got updates.shape [84,84,4], indices.shape [1], params.shape [1000,84,84,4] [Op:ResourceScatterUpdate]
The problem is as follows: the tf-agents actor tries to access the replay buffer and initialize it with a certain number of random samples of shape (84,84,4), according to this DeepMind paper, but the replay buffer requires samples of shape (1,84,84,4).
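The rule quoted in the error message can be checked on plain shape tuples. The sketch below only mirrors that rule for illustration; the function names are made up and nothing here calls TensorFlow.

```python
def scatter_update_shapes_ok(updates_shape, indices_shape, params_shape):
    """Check updates.shape == indices.shape + params.shape[1:], or a scalar update."""
    updates_shape = tuple(updates_shape)
    if updates_shape == ():
        return True
    return updates_shape == tuple(indices_shape) + tuple(params_shape)[1:]


def add_batch_dim(shape, batch_size=1):
    """Shape of a sample after an outer batch dimension has been prepended."""
    return (batch_size,) + tuple(shape)


# Shapes taken from the error message above.
print(scatter_update_shapes_ok((84, 84, 4), (1,), (1000, 84, 84, 4)))
print(scatter_update_shapes_ok(add_batch_dim((84, 84, 4)), (1,), (1000, 84, 84, 4)))
```

The unbatched sample fails the check, while the same sample with a leading dimension of 1 passes it.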
My code is as follows:
def train_pong(
        env_name='ALE/Pong-v5',
        initial_collect_steps=50000,
        max_episode_frames_collect=50000,
        batch_size=32,
        learning_rate=0.00025,
        replay_capacity=1000):
    # load atari environment
    collect_env = suite_atari.load(
        env_name,
        max_episode_steps=max_episode_frames_collect,
        gym_env_wrappers=suite_atari.DEFAULT_ATARI_GYM_WRAPPERS_WITH_STACKING)
    # create tensor specs
    observation_tensor_spec, action_tensor_spec, time_step_tensor_spec = (
        spec_utils.get_tensor_specs(collect_env))
    # create training util
    train_step = train_utils.create_train_step()
    # calculate no. of actions
    num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1
    # create agent
    agent = dqn_agent.DqnAgent(
        time_step_tensor_spec,
        action_tensor_spec,
        q_network=create_DL_q_network(num_actions),
        optimizer=tf.compat.v1.train.RMSPropOptimizer(learning_rate=learning_rate))
    # create uniform replay buffer
    replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
        data_spec=agent.collect_data_spec,
        batch_size=1,
        max_length=replay_capacity)
    # observer of replay buffer
    rb_observer = replay_buffer.add_batch
    # create batch dataset
    dataset = replay_buffer.as_dataset(
        sample_batch_size=batch_size,
        num_steps=2,
        single_deterministic_pass=False).prefetch(3)
    # create callable function for actor
    experience_dataset_fn = lambda: dataset
    # create random policy for buffer init
    random_policy = random_py_policy.RandomPyPolicy(collect_env.time_step_spec(),
                                                    collect_env.action_spec())
    # create initializer
    init_buffer_actor = actor.Actor(
        collect_env,
        random_policy,
        train_step,
        steps_per_run=initial_collect_steps,
        observers=[replay_buffer.add_batch])
    # initialize buffer with random samples
    init_buffer_actor.run()
(The approach uses the OpenAI Gym environment as well as the corresponding wrapper functions.)
I worked with keras-rl2 and tf-agents without actor<->learner for other Atari games to create the DQN, and both worked quite well after some adaptations. I guess my current code would also work after a few adaptations to the tf-agents library functions, but that would defeat the purpose of the library.
My current assumption: the actor<->learner methods are not able to work with the TFUniformReplayBuffer (as I expected them to) due to the missing support for the TFPyEnvironment, or I still have some gaps in my knowledge of this tf-agents approach.
Previous (successful) attempt:
from tf_agents.environments.tf_py_environment import TFPyEnvironment
tf_collect_env = TFPyEnvironment(collect_env)
init_driver = DynamicStepDriver(
    tf_collect_env,
    random_policy,
    observers=[replay_buffer.add_batch],
    num_steps=200)
init_driver.run()
I would be very grateful if someone could explain to me what I'm overlooking here.
I fixed it... partly, but the next error is (in my opinion) an architectural problem.
The problem is that the Actor/Learner setup is built on a PyEnvironment, whereas the TFUniformReplayBuffer uses the TFPyEnvironment, which ends in the failure above.
Using the PyUniformReplayBuffer with a converted py-spec solved this problem.
from tf_agents.specs import tensor_spec
# convert agent spec to py-data-spec
py_collect_data_spec = tensor_spec.to_array_spec(agent.collect_data_spec)
# create replay buffer based on the py-data-spec
replay_buffer = py_uniform_replay_buffer.PyUniformReplayBuffer(
    data_spec=py_collect_data_spec,
    capacity=replay_capacity*batch_size
)
This snippet solved the issue of having an incompatible buffer in the background, but it leads to another issue:
--> The add_batch function does not work.
I found this approach, which advises either using a batched environment or making the following adaptations to the replay observer (add_batch method).
from tf_agents.utils.nest_utils import batch_nested_array
#********* Adaptations add_batch method - START *********#
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
#********* Adaptations add_batch method - END *********#
# create batch dataset
dataset = replay_buffer.as_dataset(
    sample_batch_size=32,
    single_deterministic_pass=False)
experience_dataset_fn = lambda: dataset
This helped me solve the issue described in this post, but now I run into another problem for which I need to ask someone on the tf-agents team...
--> It seems that the Learner/Actor structure is not able to work with a buffer other than the ReverbBuffer, because the data spec processed by the PyUniformReplayBuffer sets up a wrong buffer structure...
For anyone who has the same problem: I just created this GitHub issue report to get further answers and/or fill the gaps in my knowledge.
The full fix is shown below...
--> The dimensionality issue was valid and indicates that the (uploaded) batched samples are not in the correct shape.
--> This issue happens because the add_batch method loads values with the wrong shape:
rb_observer = replay_buffer.add_batch
Long story short, this line should be replaced by
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
--> Afterwards the (replay buffer) inputs are of correct shape and the Learner Actor Setup starts training.
The full replay buffer is shown below:
# create buffer for storing experience
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    agent.collect_data_spec,
    1,
    max_length=1000000)
# create batch dataset
dataset = replay_buffer.as_dataset(
    sample_batch_size=32,
    num_steps=2,
    single_deterministic_pass=False).prefetch(4)
# create batched nested array input for rb_observer
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
# create batched readout of dataset
experience_dataset_fn = lambda: dataset

tf.io.decode_raw return tensor how to make it bytes or string

I have been struggling with this for a while. I searched Stack Overflow and checked the TF2
docs a number of times. There is one solution indicated, but
I don't understand why my solution doesn't work.
In my case, I store a binary string (i.e., bytes) in TFRecords.
If I iterate over the dataset via as_numpy_iterator or call numpy() directly
on each item, I get the binary string back,
so iterating over the dataset does work.
I'm not sure what exactly map() passes to test_callback.
I see that it has neither a method nor a property called numpy, and the same goes for the type
that tf.io.decode_raw returns (it is a Tensor, but it has no numpy either).
Essentially, I need to take a binary string, parse it via
x = decoder.FromString(y), and then pass it to my encoder,
which will transform x into a tensor.
def test_callback(example_proto):
    # I tried to figure out. can I use bytes?decode
    # directly and what is the most optimal solution.
    parsed_features = tf.io.decode_raw(example_proto, out_type=tf.uint8)
    # tf.io.decoder returns tensor with N bytes.
    x = creator.FromString(parsed_features.numpy)
    encoded_seq = midi_encoder.encode(x)
    return encoded_seq
raw_dataset = tf.data.TFRecordDataset(filenames=["main.tfrecord"])
raw_dataset = raw_dataset.map(test_callback)
Thank you, folks.
I found one solution but I would love to see more suggestions.
def test_callback(example_proto):
    from_string = creator.FromString(example_proto.numpy())
    encoded_seq = encoder.encoder(from_string)
    return encoded_seq
raw_dataset = tf.data.TFRecordDataset(filenames=["main.tfrecord"])
raw_dataset = raw_dataset.map(lambda x: tf.py_function(test_callback, [x], [tf.int64]))
My understanding is that tf.py_function has a performance penalty.
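For intuition only, here is a plain-Python sketch (no TensorFlow involved) of what a uint8 decoding of a byte string amounts to: one integer per byte, which converts back to the original bytes. It is an analogy for tf.io.decode_raw with out_type=tf.uint8, not its implementation, and the helper names are illustrative.

```python
def bytes_to_uint8(raw):
    """Return one integer in the range 0..255 per byte of `raw`."""
    return list(raw)


def uint8_to_bytes(values):
    """Rebuild the original byte string from its per-byte integer values."""
    return bytes(values)


payload = b"\x08\x96\x01"           # any serialized record
decoded = bytes_to_uint8(payload)   # per-byte integers, the analogue of a uint8 tensor
restored = uint8_to_bytes(decoded)  # the byte string a FromString-style parser expects
```

A parser that expects bytes needs the byte string itself rather than the per-byte integers, which is what example_proto.numpy() provides inside tf.py_function in the solution above.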
Thank you

Dataset API 'flat_map' method producing error for same code which works with 'map' method

I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the flat_map method produces errors, whereas with the map method I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not a bug, and they asked me to post here.
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)
def _get_data_for_dataset(file_name, rows=100):
    print(file_name.decode())
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    X_data.astype('float32', copy=False)
    return X_data
dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
                                                        [file_name], tf.float64))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas reads N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it gives me (B, N, 2), where B is the batch size: map adds another axis instead of concatenating along the existing axis. From what I understood in the documentation, flat_map is supposed to give a flattened output, and both map and flat_map return a Dataset. So why does my code work with map and not with flat_map?
It would also be great if you could point me towards code where the Dataset API has been used with the Pandas module.
As mikkola points out in the comments, the Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
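The difference between the two signatures can be illustrated with plain Python lists standing in for datasets. This is only an analogy under that assumption; the helper names below are made up and are not tf.data functions.

```python
from itertools import chain


def map_elements(elements, fn):
    """One output element per input element (the result stays nested)."""
    return [fn(element) for element in elements]


def flat_map_elements(elements, fn):
    """`fn` returns a collection per input element; the collections are concatenated."""
    return list(chain.from_iterable(fn(element) for element in elements))


def rows_for(name):
    # Stand-in for reading two rows from one file.
    return [(name, 0), (name, 1)]


nested = map_elements(["a.csv", "b.csv"], rows_for)     # 2 elements, each holding 2 rows
flat = flat_map_elements(["a.csv", "b.csv"], rows_for)  # 4 elements, one per row
```

With map the per-file arrays stay separate elements, which is where the extra axis after batching comes from; with flat_map the rows of all files form one flat sequence.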
If you want each row of the array returned by _get_data_for_dataset() to become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset, using Dataset.from_tensor_slices():
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)
def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    return X_data.astype('float32', copy=False)
dataset = tf.data.Dataset.from_tensor_slices(file_names)
# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()

Feeding .npy (numpy files) into tensorflow data pipeline

TensorFlow seems to lack a reader for ".npy" files.
How can I read my data files into the new tensorflow.data.Dataset pipeline?
My data doesn't fit in memory.
Each object is saved in a separate ".npy" file. Each file contains 2 different ndarrays as features and a scalar as their label.
It is actually possible to read NPY files directly with TensorFlow instead of TFRecords. The key pieces are tf.data.FixedLengthRecordDataset and tf.io.decode_raw, along with a look at the documentation of the NPY format. For simplicity, let's suppose that you are given a float32 NPY file containing an array with shape (N, K), and that you know the number of features K beforehand, as well as the fact that it is a float32 array. An NPY file is just a binary file with a small header followed by the raw array data (object arrays are different, but we're considering numbers now). In short, you can find the size of this header with a function like this:
def npy_header_offset(npy_path):
    with open(str(npy_path), 'rb') as f:
        if f.read(6) != b'\x93NUMPY':
            raise ValueError('Invalid NPY file.')
        version_major, version_minor = f.read(2)
        if version_major == 1:
            header_len_size = 2
        elif version_major == 2:
            header_len_size = 4
        else:
            raise ValueError('Unknown NPY file version {}.{}.'.format(version_major, version_minor))
        header_len = sum(b << (8 * i) for i, b in enumerate(f.read(header_len_size)))
        header = f.read(header_len)
        if not header.endswith(b'\n'):
            raise ValueError('Invalid NPY file.')
        return f.tell()
With this you can create a dataset like this:
import tensorflow as tf
npy_file = 'my_file.npy'
num_features = ...
dtype = tf.float32
header_offset = npy_header_offset(npy_file)
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size, header_bytes=header_offset)
Each element of this dataset contains a long string of bytes representing a single example. You can now decode it to obtain an actual array:
dataset = dataset.map(lambda s: tf.io.decode_raw(s, dtype))
The elements will have indeterminate shape, though, because TensorFlow does not keep track of the length of the strings. You can just enforce the shape since you know the number of features:
dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, dtype), (num_features,)))
Similarly, you can choose to perform this step after batching, or combine it in whatever way you feel like.
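To make the idea of fixed-length records concrete, here is a standard-library sketch that reads the same layout without TensorFlow: it skips a header and then unpacks num_features little-endian float32 values per record. It is an illustration of the file layout, not a description of how FixedLengthRecordDataset is implemented, and iter_float32_records is a made-up name.

```python
import struct


def iter_float32_records(path, num_features, header_bytes=0):
    """Yield one tuple of `num_features` float32 values per fixed-length record."""
    record = struct.Struct("<%df" % num_features)  # little-endian float32 values
    with open(path, "rb") as f:
        f.seek(header_bytes)  # skip the header, like header_bytes above
        while True:
            chunk = f.read(record.size)
            if len(chunk) < record.size:
                break  # end of file; this sketch drops a trailing partial record
            yield record.unpack(chunk)
```

Each record is num_features * 4 bytes long, which corresponds to the num_features * dtype.size record length passed to the dataset above.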
The limitation is that you have to know the number of features in advance. It is possible to extract it from the NumPy header, though it is a bit of a pain, and in any case it is hardly feasible from within TensorFlow, so the file names would need to be known in advance. Another limitation is that, as it stands, the solution requires you to use either only one file per dataset or files that all have the same header size, although if you know that all the arrays have the same size, that should actually be the case.
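As a rough sketch of extracting the shape from the header outside TensorFlow: in NPY versions 1.x and 2.x the header is a Python dict literal with the keys descr, fortran_order and shape, so it can be parsed with ast.literal_eval. The function name read_npy_header is illustrative, and other format versions are not handled here.

```python
import ast
import struct


def read_npy_header(path):
    """Return (header_dict, data_offset) for an NPY version 1.x or 2.x file."""
    with open(path, "rb") as f:
        if f.read(6) != b"\x93NUMPY":
            raise ValueError("Invalid NPY file.")
        major, _minor = f.read(2)
        if major == 1:
            (header_len,) = struct.unpack("<H", f.read(2))  # 2-byte little-endian length
        elif major == 2:
            (header_len,) = struct.unpack("<I", f.read(4))  # 4-byte little-endian length
        else:
            raise ValueError("Unsupported NPY version %d." % major)
        # The header text is a dict literal padded with spaces and ending in a newline.
        header = ast.literal_eval(f.read(header_len).decode("latin1"))
        return header, f.tell()
```

For a 2-D array, header["shape"][1] would give the number of features, and the returned offset plays the same role as the header size computed earlier.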
Admittedly, if one considers this kind of approach it may just be better to have a pure binary file without headers, and either hard code the number of features or read them from a different source...
You can do it with tf.py_func, see the example here.
The parse function would simply decode the filename from bytes to string and call np.load.
Update: something like this:
def read_npy_file(item):
    data = np.load(item.decode())
    return data.astype(np.float32)
file_list = ['/foo/bar.npy', '/foo/baz.npy']
dataset = tf.data.Dataset.from_tensor_slices(file_list)
dataset = dataset.map(
    lambda item: tuple(tf.py_func(read_npy_file, [item], [tf.float32,])))
Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:
Consuming NumPy arrays
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().
# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
    features = data["features"]
    labels = data["labels"]
# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
In the case that the file doesn't fit into memory, it seems like the only recommended approach is to first convert the npy data into a TFRecord format, and then use the TFRecord data set format, which can be streamed without fully loading into memory.
Here is a post with some instructions.
FWIW, it seems crazy to me that TFRecord cannot be instantiated with a directory name or file name(s) of npy files directly, but it appears to be a limitation of plain Tensorflow.
If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that would yield only the data needed for the current batch.
In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with, and preferably you should reformat the data first, either as TFRecord or as multiple npy files, and then use other methods.
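A minimal sketch of such a generator in plain Python, assuming the data has already been split into many small files. The name iter_batches and the load_fn parameter are illustrative; in practice load_fn would be something like np.load.

```python
def iter_batches(paths, batch_size, load_fn):
    """Load files lazily and yield them in lists of at most `batch_size` items."""
    batch = []
    for path in paths:
        batch.append(load_fn(path))  # only the current batch is held in memory
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Because the generator loads each file only when it is requested, memory use is bounded by the batch size rather than by the size of the whole data set.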
Problem setup
I had a folder with images that were being fed into an InceptionV3 model for extraction of features. This seemed to be a huge bottleneck for the entire process. As a workaround, I extracted features from each image and then stored them on disk in a .npy format.
Now I had two folders, one for the images and one for the corresponding .npy files. There was an evident problem with the loading of .npy files in the tf.data.Dataset pipeline.
Workaround
I came across TensorFlow's official "Show, Attend and Tell" tutorial, which had a great workaround for the problem this thread (and I) were having.
Load numpy files
First off we need to create a mapping function that accepts the .npy file name and returns the numpy array.
# Load the numpy files
def map_func(feature_path):
    feature = np.load(feature_path)
    return feature
Use the tf.numpy_function
With tf.numpy_function we can wrap any Python function and use it as a TensorFlow op. The function must accept NumPy objects (which is exactly what we want).
We create a tf.data.Dataset with the list of all the .npy filenames.
dataset = tf.data.Dataset.from_tensor_slices(feature_paths)
We then use the map function of the tf.data.Dataset API to do the rest of our task.
# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item: tf.numpy_function(
    map_func, [item], tf.float16),
    num_parallel_calls=tf.data.AUTOTUNE)