What is an effective way to pad a variable length dataset for batching in Tensorflow that does not have have exact - tensorflow

I am trying to integrate the Dataset API into my input pipeline. Before this integration, the program used tf.train.batch_join(), which had dynamic padding enabled. Hence, this would batch elements and pad them according to the largest one in the mini-batch.
image, width, label, length, text, filename = tf.train.batch_join(
data_tuples,
batch_size=batch_size,
capacity=queue_capacity,
allow_smaller_final_batch=final_batch,
dynamic_pad=True)
For dataset, however, I was unable to find the exact alternative to this. I cannot use padded batch, since the dimensions of the images does not have a set threshold. The image width could be anything. My partner and I were able to come up with a work around for this using tf.contrib.data.bucket_by_sequence(). Here is an excerpt:
dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length
(element_length_func=_element_length_fn,
bucket_batch_sizes=np.full(len([0]) + 1, batch_size),
bucket_boundaries=[0]))
What this does is basically dumps all the elements into the overflow bucket since the boundary is set to 0. Then, it batches it from that bucket since bucketing pads the elements according to the largest one.
Is there a better way to achieve this functionality?

I meet exactly the same problem. Now I know how to solve this. If your input_data only has one dimension that is of variable length, try to use tf.contrib.data.bucket_by_sequence_length to dataset.apply() function, make bucket_batch_sizes = [batch_size] * (len(buckets) + 1). And there is another way to do so just as #mrry has said in comments.
iterator = dataset.make_one_shot_iterator()
item = iterator.get_next()
padded_shapes = []
for i in item:
padded_shapes.append(i.get_shape())
padded_shapes = tf.contrib.framework.nest.pack_sequence_as(item, padded_shapes)
dataset = dataset.padded_batch(batch_size, padded_shapes)
If one dimension in the shapes of a tensor is None or -1, then padded_batch will pad the tensor on that dimension to max length of the batch.
My training data has two features of varibale length, And this method works fine.

Related

How to batch an object detection dataset?

I am working on implementing a face detection model on the wider face dataset. I learned it was built into Tensorflow datasets and I am using it.
However, I am facing an issue while batching the data. Since, an Image can have multiple faces, therefore the number of bounding boxes output are different for each Image. For example, an Image with 2 faces will have 2 bounding box, whereas one with 4 will have 4 and so on.
But the problem is, these unequal number of bounding boxes is causing each of the Dataset object tensors to be of different shapes. And in TensorFlow afaik we cannot batch tensors of unequal shapes ( source - Tensorflow Datasets: Make batches with different shaped data). So I am unable to batch the dataset.
So after loading the following code and batching -
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
ds1 = ds.batch(12)
for step, (x,y,z) in enumerate(ds1) :
print(step)
break
I am getting this kind of error on run Link to Error Image
In general any help on how can I batch the Tensorflow object detection datasets will be very helpfull.
It might be a bit late but I thought I should post this anyways. The padded_batch feature ought to do the trick here. It kind of goes around the issue by matching dimension via padding zeros
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
ds1 = ds.padded_batch(12)
for step, (x,y,z) in enumerate(ds1) :
print(step)
break
Another solution would be to process not use batch and process with custom buffers with for loops but that kind of defeats the purpose. Just for posterity I'll add the sample code here as an example of a simple workaround.
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
batch_size = 12
image_annotations_pair = [x['image'], x['faces']['bbox'] for n, x in enumerate(ds) if n < batch_size]
Then use a train_step modified for this.
For details one may refer to - https://www.kite.com/python/docs/tensorflow.contrib.autograph.operators.control_flow.dataset_ops.DatasetV2.padded_batch

Removing zero images from a tensor in TensorFlow efficiently

I have a set of 50 images of shape (50,128,128,1) which is represented as a tensor in TensorFlow. Let's say that the 25th and 30th image are just zero images but I do not know which images are all zero beforehand (25th and 30th in this example are just to make the case clearer). I wish to remove such images and have a tensor which is of size (48,128,128,1). How can this be achieved in Tensorflow without looping through the 0th dimension of the tensor 50 times and checking for each image if tf.reduce_sum(tf.abs(image_i))>0.
You can use Dataset.map(some_fn). Here you can define some_fn that will check each tensor's value with your logic tf.reduce_sum(). SO, if the value of total is zero then you can neglect it otherwise you can keep it.
def some_fn():
image = tf.fill([8,8], 0)# dummy tensor values
image_row = tf.slice(image, [1,0], [1, -1])
total = tf.reduce_sum(tf.abs(image_row)) # total = 0
return total
You can read here more. This is not a loop, it works in parallel on each element (each image in your case.) So it is fast.

OneHotEncoder Multiple Columns

I am trying to encode a data table with multiple columns to a given set of categories
ohe1 = OneHotEncoder(categories = [list_names_data_rest.values],dtype = 'int8')
data_rest1 = ohe1.fit_transform(data_rest.values).toarray()
Here, list_names_data_rest.values is an array of shape (664,). I have 664 unique features and i am trying to encode data_rest which is (5050,6). After encoding, I am expecting a shape (5050,664)
I am one hot encoding to a pre-defined features set because, I am downloading data sets in chunks (due to ram limitations) and I would like the input shape to my neural network to be consistent
If i use pd.get_dummies, depending on my data set, I could get different categories and different input shape for my NN
ohe1.fit_transform does require a shape (n_values, n_features) but, I do not know how to handle this.
HashingVectorizer maybe a good solution for your case.It is independent from number of input features , just set initial size big enough.
If you wish to use pd.get_dummies there is an option to iteratively include your encodings for every batch.
For your first batch:
ohe = pd.get_dummies(data_rest, columns=['label_col'])
For every subsequent batch:
for b in batches:
batch_ohe = pd.get_dummies(b, columns=['label_col'])
ohe = pd.concat([ohe, batch_ohe], axis=0)
ohe = ohe.fillna(0)

Variable size multi-label candidate sampling in tensorflow?

nce_loss() asks for a static int value for num_true. That works well for problems where we have the same amount of labels per training example and we know it in advance.
When labels have a variable shape [None], and being batched and/or bucketed by bucket size with .padded_batch() + .group_by_window() it is necessary to provide a variable size num_true in order to accustom for all training examples. This is currently unsupported to my knowledge (correct me if I'm wrong).
In other words suppose we have either a dataset of images with an arbitrary amount of labels per each image (dog, cat, duck, etc.) or a text dataset with numerous multiple classes per sentence (class_1, class_2, ..., class_n). Classes are NOT mutually exclusive, and can vary in size between examples.
But as the amount of possible labels can be huge 10k-100k is there a way to do a sampling loss to improve performance (in comparison with a sigmoid_cross_entropy)?
Is there a proper way to do this or any other workarounds?
nce_loss = tf.nn.nce_loss(
weights=nce_weights,
biases=nce_biases,
labels=labels,
inputs=inputs,
num_sampled=num_sampled,
# Something like this:
# `num_true=(tf.shape(labels)[-1])` instead of `num_true=const_int`
# , would be preferable here
num_classes=self.num_classes)
I see two issues:
1) Work with NCE with different numbers of true values;
2) Classes that are NOT mutually exclusive.
To the first issue, as #michal said, there is an expectative of including this functionality in the future. I have tried almost the same thing: to use labels with shape=(None, None), i.e., true_values dimension None. The sampled_values parameter has the same problem: true_values number must be a fixed integer number. The recomended work around is to use a class (0 is the best one) representing <PAD> and complete the number of true_values. In my case, 0 is an special token that represents <PAD>. Part of code is here:
assert len(labels) <= (window_size * 2)
zeros = ((window_size * 2) - len(labels)) * [0]
labels = labels + zeros
labels.sort()
I sorted the label because considering another recommendation:
Note: By default this uses a log-uniform (Zipfian) distribution for
sampling, so your labels must be sorted in order of decreasing
frequency to achieve good results.
In my case, the special tokens and more frequent words have lower indexes, otherwise, less frequent words have higher indexes. I included all label classes associated to the input at same time and completed with zero till the true_values number. Of course, you must ignore the 0 class at the end.

Setting up the input on an RNN in Keras

So I had a specific question with setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model with the rest being padded by 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28. So there are 550k rows each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k- sequence length) different arrays and feed all of those to the network?
Assuming that I want to the first layer to be equivalent to the number of features per row, and looking at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50,27), and again do I have to manually split the dataset up or would Keras automatically do that for me?
RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array. This is also called "batch size", number of examples, samples, etc.
Time steps are the amount of steps for each sequence
Elements per step is how much info you have in each step of a sequence
I'm assuming the 27 features are inputs and relate to ElementsPerStep, while the 1 target is the expected output having 1 output per step.
So I'm also assuming that your output is a sequence with also 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, then you must reshape your array like this:
(1, 550000, 28)
#1 sequence
#550000 steps per sequence
#28 data elements per step
#PS: this sequence is too long, if it creates memory problems to you, maybe it will be a good idea to use a `stateful=True` RNN, but I'm explaining the non stateful method first.
Now you must split this array for inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27] #targets
Shaping the keras layers:
Keras layers will ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000,27).
Since your desired result is a sequence with same length, we will use return_sequences=True. (Else, you'd get only one result).
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells)
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output. (If you're using only recurrent layers, of course)
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train in smaller length sequences*. The system will understand that every new batch is a sequel of the previous batches.
Then you will need to define batch_input_shape=(BatchSize,ReducedTimeSteps,Elements). In this case, the batch size should not be ignored like in the other case.
* Unfortunately I have no experience with stateful=True. I'm not sure about whether you must manually divide your array (less likely, I guess), or if the system automatically divides it internally (more likely).
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps:
X = []
for i in range(550000-49):
X.append(originalX[i:i+50]) #then take care of the 28th element
Y = #it seems you just exclude the first 49 ones from the original