Skip Dataset entries in TFRecordDataset.map()

How do I skip entries in a TFRecord file when generating a TFRecordDataset?
Given a TFRecord file and a tf.contrib.data.TFRecordDataset object, I create a new dataset by mapping over the protobuf definition. For example,
features = {'some_data': tf.FixedLenFeature([], tf.string)}

def parser(example_proto):
    e = tf.parse_single_example(example_proto, features)
    data = e['some_data']
    # ...do a bunch of stuff to data...
    return data

x = TFRecordDataset(filename)
x = x.map(parser)
x = x.cache(cache_filename)
x = x.repeat()
x = x.batch(batch_size)
This lets me read in the data and do some preprocessing, then cache the results and batch it up for my model.
My question is, what if I want to skip one of the TFRecord entries (e.g., if the data is invalid/bad)? For example, in parser(), maybe I could return None, or some sort of tf.cond to indicate an invalid entry, or trip some assertion.

(Summarizing the comment as an answer)
The filter() method of Dataset keeps only the entries that satisfy a predicate, so it can be applied after map() to drop invalid records.
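As a minimal sketch of that idea, assuming TensorFlow 2.x with eager execution: the in-memory byte strings below stand in for TFRecord entries, and treating an empty string as "invalid" is a made-up rule for illustration only. The parser returns a validity flag next to the data, filter() drops the flagged entries, and a second map() strips the flag.

```python
import tensorflow as tf

# Hypothetical records: an empty string stands in for an invalid entry.
raw = tf.data.Dataset.from_tensor_slices([b"good-1", b"", b"good-2"])

def parser(record):
    # Return the parsed value together with a scalar boolean validity flag.
    is_valid = tf.strings.length(record) > 0
    return record, is_valid

dataset = (
    raw.map(parser)
    .filter(lambda data, is_valid: is_valid)  # drop invalid entries
    .map(lambda data, is_valid: data)         # strip the flag again
)

kept = [x.numpy() for x in dataset]
```

Calls such as cache(), repeat() and batch() can follow the filter, so only valid entries reach the model.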

Related

TensorFlow Federated - Loading and preprocessing data on a remote client

Part of the simulation program that I am working on allows clients to load local data from their device without the server being able to access that data.
Following the idea from this post, I have the following code configured to assign the client a path to load the data from. Although the data is in svmlight format, loading it line-by-line can still allow it to be preprocessed afterwards.
client_paths = {
    'client_0': '<path_here>',
    'client_1': '<path_here>',
}

def create_tf_dataset_for_client_fn(id):
    path = client_paths.get(id)
    data = tf.data.TextLineDataset(path)
    return data

path_source = tff.simulation.datasets.ClientData.from_clients_and_fn(client_paths.keys(), create_tf_dataset_for_client_fn)
The code above allows the data at a given path to be loaded at runtime on the remote client's side with the following line of code.
data = path_source.create_tf_dataset_for_client('client_0')
Here, the data variable can be iterated through and can be used to display the contents on the client on the remote device when calling tf.print(). But, I need to preprocess this data into an appropriate format before continuing. I am presently attempting to convert this from a string Tensor in svmlight format into a SparseTensor of the appropriate format.
The issue is that, although the defined preprocessing method works in a standalone scenario (i.e. when defined as a function and tested on a manually defined Tensor of the same format), it fails when the code is executed during the client update @tf.function in the tff algorithm. Below is the error raised when executing the notebook cell that contains a @tff.tf_computation function, which calls a @tf.function that does the preprocessing and retrieves the data.
ValueError: Shape must be rank 1 but is rank 0 for '{{node Reshape_2}} = Reshape[T=DT_INT64, Tshape=DT_INT32](StringToNumber_1, Reshape_2/shape)' with input shapes: [?,?], [].
Since the issue occurs when executing the client's @tff.tf_computation update function, which calls the @tf.function with the preprocessing code, I am wondering how I can allow the function to perform the preprocessing on the data without errors. I assume that if I can get the functions to run properly when defined, they will also work when called remotely.
Any ideas on how to address this issue? Thank you for your help!
For reference, the preprocessing function uses tf computations to manipulate the data. Although not optimal yet, below is the code presently being used. This is inspired by this link on string_split examples. I have also tried putting the code directly into the client's @tf.function after loading the TextLineDataset, but this fails as well.
def decode_libsvm(line):
    # Split the line into columns, delimiting by a blank space
    cols = tf.strings.split([line], ' ')
    # Retrieve the labels from the first column as an integer
    labels = tf.strings.to_number(cols.values[0], out_type=tf.int32)
    # Split all column pairs
    splits = tf.strings.split(cols.values[1:], ':')
    # Convert splits into a sparse matrix to retrieve all needed properties
    splits = splits.to_sparse()
    # Reshape the tensor for further processing
    id_vals = tf.reshape(splits.values, splits.dense_shape)
    # Retrieve the indices and values within two separate tensors
    feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
    # Convert the indices into int64 numbers
    feat_ids = tf.strings.to_number(feat_ids, out_type=tf.int64)
    # To reload within a SparseTensor, add a dimension to feat_ids with a default value of 0
    feat_ids = tf.reshape(feat_ids, -1)
    feat_ids = tf.expand_dims(feat_ids, 1)
    feat_ids = tf.pad(feat_ids, [[0,0], [0,1]], constant_values=0)
    # Extract and flatten the values
    feat_vals = tf.strings.to_number(feat_vals, out_type=tf.float32)
    feat_vals = tf.reshape(feat_vals, -1)
    # Configure a SparseTensor to contain the indices and values
    sparse_output = tf.SparseTensor(indices=feat_ids, values=feat_vals, dense_shape=[1, <shape>])
    return {"x": sparse_output, "y": labels}
Update (Fix)
Following the advice from Jakub's comment, the issue was fixed by enclosing the shape and axis arguments of the reshape and expand_dims calls in [] where needed. The code now runs within tff without issue.
def decode_libsvm(line):
    # Split the line into columns, delimiting by a blank space
    cols = tf.strings.split([line], ' ')
    # Retrieve the labels from the first column as an integer
    labels = tf.strings.to_number(cols.values[0], out_type=tf.int32)
    # Split all column pairs
    splits = tf.strings.split(cols.values[1:], ':')
    # Convert splits into a sparse matrix to retrieve all needed properties
    splits = splits.to_sparse()
    # Reshape the tensor for further processing
    id_vals = tf.reshape(splits.values, splits.dense_shape)
    # Retrieve the indices and values within two separate tensors
    feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
    # Convert the indices into int64 numbers
    feat_ids = tf.strings.to_number(feat_ids, out_type=tf.int64)
    # To reload within a SparseTensor, add a dimension to feat_ids with a default value of 0
    feat_ids = tf.reshape(feat_ids, [-1])
    feat_ids = tf.expand_dims(feat_ids, [1])
    feat_ids = tf.pad(feat_ids, [[0,0], [0,1]], constant_values=0)
    # Extract and flatten the values
    feat_vals = tf.strings.to_number(feat_vals, out_type=tf.float32)
    feat_vals = tf.reshape(feat_vals, [-1])
    # Configure a SparseTensor to contain the indices and values
    sparse_output = tf.SparseTensor(indices=feat_ids, values=feat_vals, dense_shape=[1, <shape>])
    return {"x": sparse_output, "y": labels}
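When debugging a graph-mode parser like this, it can help to have a plain-Python reference to compare against. The sketch below is not part of the original post: parse_libsvm_line and the sample line are hypothetical, and it only covers the simple "label id:value id:value ..." layout described in the question.

```python
def parse_libsvm_line(line):
    """Parse one svmlight/libsvm line into (label, feature_ids, feature_values)."""
    columns = line.strip().split(" ")
    # The first column is the integer label.
    label = int(columns[0])
    feature_ids = []
    feature_values = []
    # Every remaining column is an "id:value" pair.
    for pair in columns[1:]:
        index, value = pair.split(":")
        feature_ids.append(int(index))
        feature_values.append(float(value))
    return label, feature_ids, feature_values

# Hypothetical sample line, for illustration only.
label, ids, vals = parse_libsvm_line("1 3:0.5 7:1.25 10:2")
```

The label, ids and values returned here should match the "y" entry and the flattened feat_ids and feat_vals tensors of the TensorFlow version for the same input line.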

Dataset API 'flat_map' method producing error for same code which works with 'map' method

I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. Using the flat_map method produces errors, whereas with the map method I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not a bug and they asked me to post here.
import os
import pandas as pd
import tensorflow as tf

folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    print(file_name.decode())
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    X_data.astype('float32', copy=False)
    return X_data

dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
                                                        [file_name], tf.float64))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas reads N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it gives me (B, N, 2), where B is the batch size: map adds another axis instead of concatenating along the existing axis. From what I understood in the documentation, flat_map is supposed to give a flattened output. In the documentation, both map and flat_map return type Dataset. So why does my code work with map and not with flat_map?
It would also be great if you could point me towards code where the Dataset API has been used with the Pandas module.

As mikkola points out in the comments, the Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
If you want each row of the array returned by _get_data_for_dataset() to become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset, using Dataset.from_tensor_slices():
import os
import pandas as pd
import tensorflow as tf

folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    return X_data.astype('float32', copy=False)

dataset = tf.data.Dataset.from_tensor_slices(file_names)
# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
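The shape difference described in the question can be illustrated without TensorFlow. This is only a NumPy analogy, not the Dataset API itself, and N, B and the filled-in arrays are hypothetical stand-ins for the CSV contents: batching whole per-file arrays behaves like stacking, while flattening to rows first behaves like concatenation.

```python
import numpy as np

N, B = 3, 2  # hypothetical: rows per file and number of files

# Stand-ins for the (N, 2) arrays that would be read from each CSV file.
per_file = [np.full((N, 2), fill_value=i, dtype=np.float32) for i in range(B)]

# map() + batch(B): each file is one element, so batching stacks whole arrays.
stacked = np.stack(per_file)                  # shape (B, N, 2)

# flat_map() over from_tensor_slices(): each row is one element, so rows
# from consecutive files form one flat sequence.
flattened = np.concatenate(per_file, axis=0)  # shape (N * B, 2)
```

Because each element is a single row after flat_map(), batch(2) groups two rows rather than two files; a batch of N*B rows needs batch(N*B).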

Word2Vec + LSTM on API Sequence

I am trying to apply word2Vec and LSTM on a dataset that contains files' API trace log including API function calls and their parameters for a binary classification.
The data looks like:
File_ID, Label, API Trace log
1, M, kernel32 LoadLibraryA kernel32.dll
kernel32 GetProcAddress MZ\x90 ExitProcess
...
2, V, kernel32 GetModuleHandleA RPCRT4.dll
kernel32 GetCurrentThreadId d\x8B\x0D0 POINTER POINTER
...
Each API trace includes the module name, the API function name, and the parameters, separated by blank spaces.
Take the first API trace of file 1 as an example: kernel32 is the module name, LoadLibraryA is the function name, and kernel32.dll is the parameter. API traces are separated by \n, so each line represents one API call, in sequence.
First, I trained a word2vec model using each line of the API trace logs as a sentence. There are about 5k API function calls, e.g. LoadLibraryA, GetProcAddress. However, because parameter values can vary, the model becomes quite big (a vocabulary of 300,000) once those parameters are included.
After that, I trained an LSTM using word2vec's embedding_weights. The model structure looks like:
model = Sequential()
model.add(Embedding(output_dim=vocab_dim, input_dim=n_symbols,
                    mask_zero=False, weights=[embedding_weights],
                    trainable=False))
model.add(LSTM(dense_dim, kernel_initializer='he_normal', dropout=0.15,
               recurrent_dropout=0.15, implementation=2))
model.add(Dropout(0.3))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs,
          batch_size=batch_size, callbacks=[early_stopping, parallel_check_cb])
The way I get embedding_weights is to create a matrix that, for each word in the word2vec vocabulary, maps the word's index in the model to its vector:
def create_embedding_weights(model, max_index=0):
    # dimensionality of your word vectors
    num_features = len(model[model.vocab.keys()[0]])
    n_symbols = len(model.vocab) + 1  # adding 1 to account for 0th index (for masking)
    # Only word2vec feature set
    embedding_weights = np.zeros((max(n_symbols + 1, max_index + 1), num_features))
    for word, value in model.vocab.items():
        embedding_weights[value.index, :] = model[word]
    return embedding_weights
For the training data, I convert each word in an API call to its index in the word2vec model, so that it is consistent with the indices in embedding_weights above, e.g. kernel32 -> 0, LoadLibraryA -> 1, kernel32.dll -> 2, GetProcAddress -> 4, MZ\x90 -> 5, ExitProcess -> 6.
So the training data for file 1 looks like [0, 1, 2, 3, 4, 5, 6]. Note that I didn't split the API traces by line, so the model may not know where each API trace starts and ends. The training accuracy of the model is pretty bad: 50% :(
My question is: when preparing the training and validation datasets, should I also split by line when mapping the words to their indices? The training data above would then become the following, with each API trace on its own line and missing values padded with -1, which doesn't exist in word2vec's indices.
[[0, 1, 2, -1],
 [3, 4, 5, 6]]
Also, I am using a very simple structure for training while the word2vec model is quite big, so any suggestions on the structure would be appreciated.

I would at least split the trace lines in three:
Module (make a dictionary and an embedding)
Function (make a dictionary and an embedding)
Parameters (make a dictionary and an embedding - see details later)
Since this is a very specific application, I believe it would be best to keep the embeddings trainable (the whole point of the embeddings is to create meaningful vectors, and the meanings depend a lot on the model that is going to use them. Question: how did you create the word2vec model? From what data does it learn?).
This model would have more inputs. All of them as integers from zero to max dictionary index. Consider using mask_zero=True and padding all files to maxFileLines.
moduleInput = Input((maxFileLines,))
functionInput = Input((maxFileLines,))
For the parameters, I'd probably make a subsequence as if the list of parameters were a sentence. (Again, mask_zero=True, and pad up to maxNumberOfParameters)
parametersInput = Input((maxFileLines, maxNumberOfParameters))
Function and module embeddings:
moduleEmb = Embedding(.....mask_zero=True,)(moduleInput)
functionEmb = Embedding(.....mask_zero=True)(functionInput)
Now, for the parameters, I thought of creating a sequence of sequences (maybe this is too much). For that, I first transfer the lines dimension to the batch dimension and work with only length = maxNumberOfParameters:
paramEmb = Lambda(lambda x: K.reshape(x,(-1,maxNumberOfParameters)))(parametersInput)
paramEmb = Embedding(....,mask_zero=True)(paramEmb)
paramEmb = Lambda(lambda x: K.reshape(x,(-1,maxFileLines,embeddingSize)))(paramEmb)
Now we concatenate all of them in the last dimension and we're ready to get into the LSTMs:
joinedEmbeddings = Concatenate()([moduleEmb,functionEmb,paramEmb])
out = LSTM(...)(joinedEmbeddings)
out = ......
model = Model([moduleInput,functionInput,parametersInput], out)
How to prepare the inputs
With this model, you need three separate inputs. One for the module, one for the function and one for the parameters.
These inputs will contain only indices (no vectors), and they don't need a previously trained word2vec model: each Embedding layer learns its own word vectors during training, which is the role word2vec would otherwise play.
So, get the file lines and split. First we split by commas, then we split the API calls by spaces:
import numpy as np

#read the file
loadedFile = open(fileName,'r')
allLines = [l.strip() for l in loadedFile.readlines()]
loadedFile.close()

#split by commas
splitLines = []
for l in allLines[1:]: #use 1 here only if you have headers in the file
    splitLines.append(l.split(','))
splitLines = np.array(splitLines)

#get the split values and separate ids, targets and calls
ids = splitLines[:,0]
targets = splitLines[:,1]
calls = splitLines[:,2]

#split the calls by space, adding dummy parameters (spaces) to the max length
splitCalls = []
for c in calls:
    splitC = c.strip().split(' ')
    #pad the parameters (space for dummy params)
    for i in range(len(splitC),maxParams+2):
        splitC.append(' ')
    splitCalls.append(splitC)
splitCalls = np.array(splitCalls)

modules = splitCalls[:,0]
functions = splitCalls[:,1]
parameters = splitCalls[:,2:] #notice the parameters have an extra dimension
Now let's make the indices:
modIndices, modCounts = np.unique(modules,return_counts=True)
funcIndices, funcCounts = np.unique(functions,return_counts=True)
#for the parameters, let's flatten the array first (because we have 2 dimensions)
flatParams = parameters.reshape((parameters.shape[0]*parameters.shape[1],))
paramIndices, paramCounts = np.unique(flatParams,return_counts=True)
These calls return the unique words and their counts. Here you can customize which words you're going to group into an "another word" class (for example, if a word's count is too low, map it to "another word").
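One way to implement the "another word" grouping is sketched below in plain Python. The names build_vocab, encode, min_count and the "<other>" token are hypothetical, as is the choice of reserving index 1 for the shared class (index 0 stays free for padding). It is an alternative to the np.unique approach, not the author's code.

```python
from collections import Counter

def build_vocab(words, min_count=2, other_token="<other>"):
    """Map frequent words to their own index and rare words to one shared index."""
    counts = Counter(words)
    frequent = sorted(w for w, c in counts.items() if c >= min_count)
    # Index 0 is reserved for padding, index 1 for the shared "another word" class.
    vocab = {other_token: 1}
    for i, word in enumerate(frequent):
        vocab[word] = i + 2
    return vocab

def encode(words, vocab, other_token="<other>"):
    # Words missing from the vocabulary fall back to the shared class.
    return [vocab.get(w, vocab[other_token]) for w in words]

# Hypothetical parameter tokens, for illustration only.
params = ["kernel32.dll", "RPCRT4.dll", "kernel32.dll", "POINTER", "POINTER", "ExitProcess"]
vocab = build_vocab(params)
encoded = encode(params, vocab)
```

Raising min_count shrinks the parameter vocabulary, which is the part the question reports as growing to 300,000 entries.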
Let's then make the dictionaries:
def createDic(uniqueWords):
    dic = {}
    for i,word in enumerate(uniqueWords):
        dic[word] = i + 1 # +1 because we want to reserve the zeros for padding
    return dic
Just take care with the parameters, because we used a dummy space there:
moduleDic = createDic(modIndices)
funcDic = createDic(funcIndices)
paramDic = createDic(paramIndices[1:]) #make sure the space got the first position here
paramDic[' '] = 0
Well, now we just replace the original values:
moduleData = [moduleDic[word] for word in modules]
funcData = [funcDic[word] for word in functions]
paramData = [[paramDic[word] for word in paramLine] for paramLine in parameters]
Pad them:
for i in range(len(moduleData),maxFileLines):
    moduleData.append(0)
    funcData.append(0)
    paramData.append([0] * maxParams)
Do this for every file, and store in a list of files:
moduleTrainData = []
functionTrainData = []
paramTrainData = []

# For each file, repeat the steps above, then:
moduleTrainData.append(moduleData)
functionTrainData.append(funcData)
paramTrainData.append(paramData)

# After all files have been processed:
moduleTrainData = np.asarray(moduleTrainData)
functionTrainData = np.asarray(functionTrainData)
paramTrainData = np.asarray(paramTrainData)
That's all for the inputs.
model.fit([moduleTrainData,functionTrainData,paramTrainData],outputLabels,...)

tensorflow, decode_csv with variable data length

I want to write a sequence-to-sequence model using TensorFlow.
My input data has the form
[input_length, target_length, input, target]
and the rows all have different lengths.
How can I use tf.decode_csv?
I tried to build record_defaults with the maximum input length, but all shapes in record_defaults must be fully defined.
I can't figure this out.
csv_file = tf.train.string_input_producer([file_name], name='file_name')
reader = tf.TextLineReader()
_, line = reader.read(csv_file)
record_defaults = [[0] for row in range(20)]
data = tf.decode_csv(line,record_defaults=record_defaults,field_delim=',')
len_error = tf.slice(data,[0],[1])
len_target = tf.slice(data, [1], [1])
error = tf.slice(data,[2],len_error)
target = tf.slice(data, 2+len_error , len_target)

Yes, tf.decode_csv does require all rows to be the same size. If this does not work for you, consider filing a feature request on Github.
You could also preprocess your CSV file to pad all of the entries out to the same number of columns; you can use the record_defaults argument to tf.decode_csv to leave the fields empty but supply default values.
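The padding step could look roughly like the following, using only the Python standard library. The function name pad_csv_rows, the fill value "0" and the sample rows are assumptions for illustration; the in-memory buffers stand in for the real input and output files.

```python
import csv
import io

def pad_csv_rows(src, dst, fill_value="0"):
    """Pad every CSV row in `src` to the width of the longest row and write it to `dst`."""
    rows = list(csv.reader(src))
    width = max((len(row) for row in rows), default=0)
    writer = csv.writer(dst, lineterminator="\n")
    for row in rows:
        writer.writerow(row + [fill_value] * (width - len(row)))
    return width

# Hypothetical rows of the form [input_length, target_length, input..., target...].
raw = io.StringIO("2,1,5,6,9\n3,2,1,2,3,7,8\n")
padded = io.StringIO()
num_columns = pad_csv_rows(raw, padded)
```

The returned column count is the number of entries record_defaults then needs, and the length fields at the start of each row still say how many of the padded columns are real data.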

How to read data from numpy files in TensorFlow? [duplicate]

I have read the CNN Tutorial on the TensorFlow and I am trying to use the same model for my project.
The problem is now in data reading. I have around 25000 images for training and around 5000 for testing and validation each. The files are in png format and I can read them and convert them into the numpy.ndarray.
The CNN example in the tutorials use a queue to fetch the records from the file list provided. I tried to create my own such binary file by reshaping my images into 1-D array and attaching a label value in the front of it. So my data looks like this
[[1,12,34,24,53,...,105,234,102],
[12,112,43,24,52,...,115,244,98],
....
]
A single row of the above array has length 22501, where the first element is the label.
I dumped the array to a file using pickle and then tried to read from the file using tf.FixedLengthRecordReader, as demonstrated in the example.
I am doing the same thing as cifar10_input.py to read the binary file and put the records into the record object.
Now when I read from the file, the labels and the image values are different. I suspect the reason is that pickle also dumps extra information (braces and brackets) into the binary file, which changes the fixed-length record size.
The above example takes the filenames and passes them to a queue to fetch the files, and then uses the queue to read a single record from the file.
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.

Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:
Write out a binary file containing the contents of your numpy array.
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
images_and_labels_array.tofile("/tmp/images.bin")
This file is similar to the format used in CIFAR10 datafiles. You might want to generate multiple files in order to get read parallelism. Note that ndarray.tofile() writes binary data in row-major order with no other metadata; pickling the array will add Python-specific metadata that TensorFlow's parsing routines do not understand.
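A quick way to check the fixed-length layout before involving TensorFlow is to read the file back with NumPy. The sketch below uses a tiny hypothetical record size (1 label byte plus 4 image bytes instead of 22501) and a temporary file, and also compares the raw size with the pickled size.

```python
import os
import pickle
import tempfile

import numpy as np

record_bytes = 5  # hypothetical: 1 label byte + 4 image bytes
records = np.array([[1, 12, 34, 24, 53],
                    [0, 112, 43, 24, 52]], dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "images.bin")
    records.tofile(path)                # raw bytes, row-major, no header
    raw_size = os.path.getsize(path)
    # Reading back with the same dtype and record length recovers the rows.
    restored = np.fromfile(path, dtype=np.uint8).reshape(-1, record_bytes)

# Pickling the same array adds metadata, so records no longer line up.
pickled_size = len(pickle.dumps(records))
```

Here raw_size is exactly the number of records times record_bytes, which is what a fixed-length record reader relies on, while pickled_size is larger.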
Write a modified version of read_cifar10() that handles your record format.
def read_my_data(filename_queue):

    class ImageRecord(object):
        pass
    result = ImageRecord()

    # Dimensions of the images in the dataset.
    label_bytes = 1
    # Set the following constants as appropriate.
    result.height = IMAGE_HEIGHT
    result.width = IMAGE_WIDTH
    result.depth = IMAGE_DEPTH
    image_bytes = result.height * result.width * result.depth
    # Every record consists of a label followed by the image, with a
    # fixed number of bytes for each.
    record_bytes = label_bytes + image_bytes

    assert record_bytes == 22501  # Based on your question.

    # Read a record, getting filenames from the filename_queue. No
    # header or footer in the binary, so we leave header_bytes
    # and footer_bytes at their default of 0.
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    result.key, value = reader.read(filename_queue)

    # Convert from a string to a vector of uint8 that is record_bytes long.
    record_bytes = tf.decode_raw(value, tf.uint8)

    # The first bytes represent the label, which we convert from uint8->int32.
    result.label = tf.cast(
        tf.slice(record_bytes, [0], [label_bytes]), tf.int32)

    # The remaining bytes after the label represent the image, which we reshape
    # from [depth * height * width] to [depth, height, width].
    depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]),
                             [result.depth, result.height, result.width])
    # Convert from [depth, height, width] to [height, width, depth].
    result.uint8image = tf.transpose(depth_major, [1, 2, 0])

    return result
Modify distorted_inputs() to use your new dataset:
def distorted_inputs(data_dir, batch_size):
    """[...]"""
    filenames = ["/tmp/images.bin"]  # Or a list of filenames if you
                                     # generated multiple files in step 1.
    for f in filenames:
        if not gfile.Exists(f):
            raise ValueError('Failed to find file: ' + f)

    # Create a queue that produces the filenames to read.
    filename_queue = tf.train.string_input_producer(filenames)

    # Read examples from files in the filename queue.
    read_input = read_my_data(filename_queue)
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)

    # [...] (Maybe modify other parameters in here depending on your problem.)
This is intended to be a minimal set of steps, given your starting point. It may be more efficient to do the PNG decoding using TensorFlow ops, but that would be a larger change.

In your question, you specifically asked:
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
You can feed the numpy array to a queue directly, but it will be a more invasive change to the cifar10_input.py code than my other answer suggests.
As before, let's assume you have the following array from your question:
import numpy as np
images_and_labels_array = np.array([[...], ...], # [[1,12,34,24,53,...,102],
# [12,112,43,24,52,...,98],
# ...]
dtype=np.uint8)
You can then define a queue that contains the entire data as follows:
q = tf.FIFOQueue(capacity=len(images_and_labels_array), dtypes=[tf.uint8, tf.uint8], shapes=[[], [22500]])
enqueue_op = q.enqueue_many([images_and_labels_array[:, 0], images_and_labels_array[:, 1:]])
...then call sess.run(enqueue_op) to populate the queue.
Another, more efficient approach would be to feed records to the queue, which you could do from a parallel thread (see this answer for more details on how this would work):
# [With q as defined above.]
label_input = tf.placeholder(tf.uint8, shape=[])
image_input = tf.placeholder(tf.uint8, shape=[22500])
enqueue_single_from_feed_op = q.enqueue([label_input, image_input])
# Then, to enqueue a single example `i` from the array.
sess.run(enqueue_single_from_feed_op,
         feed_dict={label_input: images_and_labels_array[i, 0],
                    image_input: images_and_labels_array[i, 1:]})
Alternatively, to enqueue a batch at a time, which will be more efficient:
label_batch_input = tf.placeholder(tf.uint8, shape=[None])
image_batch_input = tf.placeholder(tf.uint8, shape=[None, 22500])
enqueue_batch_from_feed_op = q.enqueue_many([label_batch_input, image_batch_input])
# Then, to enqueue a batch of examples `i` through `j-1` from the array.
sess.run(enqueue_batch_from_feed_op,
         feed_dict={label_batch_input: images_and_labels_array[i:j, 0],
                    image_batch_input: images_and_labels_array[i:j, 1:]})

I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
tf.py_func, that wraps a python function and uses it as a TensorFlow operator, might help. Here's an example.
However, since you've mentioned that your images are stored in png files, I think the simplest solution would be to replace this:
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
result.key, value = reader.read(filename_queue)
with this:
result.key, value = tf.WholeFileReader().read(filename_queue)
value = tf.image.decode_png(value)
