OneHotEncoder Multiple Columns - pandas

I am trying to encode a data table with multiple columns to a given set of categories
ohe1 = OneHotEncoder(categories = [list_names_data_rest.values],dtype = 'int8')
data_rest1 = ohe1.fit_transform(data_rest.values).toarray()
Here, list_names_data_rest.values is an array of shape (664,). I have 664 unique features and i am trying to encode data_rest which is (5050,6). After encoding, I am expecting a shape (5050,664)
I am one hot encoding to a pre-defined features set because, I am downloading data sets in chunks (due to ram limitations) and I would like the input shape to my neural network to be consistent
If i use pd.get_dummies, depending on my data set, I could get different categories and different input shape for my NN
ohe1.fit_transform does require a shape (n_values, n_features) but, I do not know how to handle this.

HashingVectorizer maybe a good solution for your case.It is independent from number of input features , just set initial size big enough.

If you wish to use pd.get_dummies there is an option to iteratively include your encodings for every batch.
For your first batch:
ohe = pd.get_dummies(data_rest, columns=['label_col'])
For every subsequent batch:
for b in batches:
batch_ohe = pd.get_dummies(b, columns=['label_col'])
ohe = pd.concat([ohe, batch_ohe], axis=0)
ohe = ohe.fillna(0)

Related

Mixing datasets in set ratio

In tensorlfow dataset, how do I mix 2 datasets, taking 75% of the set from my original data and 25% from the augmented data?
d = tf.data.Dataset.list_files("raw_data/")\
.flat_map(tf.data.TFRecordDataset)
ad = tf.data.Dataset.list_files("augmented_data/")\
.flat_map(tf.data.TFRecordDataset)
The problem is you can't use len() on a dataset object, so it's sometimes hard to know exact number of examples until you iterate a full epoch. But you can approximate this with take and skip methods.
train_dataset = dataset.take(number_examples_for_train)
test_dataset = dataset.skip(number_examples_for_train)
Those methods are a direct alternative to each other.
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#take

What is an effective way to pad a variable length dataset for batching in Tensorflow that does not have have exact

I am trying to integrate the Dataset API into my input pipeline. Before this integration, the program used tf.train.batch_join(), which had dynamic padding enabled. Hence, this would batch elements and pad them according to the largest one in the mini-batch.
image, width, label, length, text, filename = tf.train.batch_join(
data_tuples,
batch_size=batch_size,
capacity=queue_capacity,
allow_smaller_final_batch=final_batch,
dynamic_pad=True)
For dataset, however, I was unable to find the exact alternative to this. I cannot use padded batch, since the dimensions of the images does not have a set threshold. The image width could be anything. My partner and I were able to come up with a work around for this using tf.contrib.data.bucket_by_sequence(). Here is an excerpt:
dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length
(element_length_func=_element_length_fn,
bucket_batch_sizes=np.full(len([0]) + 1, batch_size),
bucket_boundaries=[0]))
What this does is basically dumps all the elements into the overflow bucket since the boundary is set to 0. Then, it batches it from that bucket since bucketing pads the elements according to the largest one.
Is there a better way to achieve this functionality?
I meet exactly the same problem. Now I know how to solve this. If your input_data only has one dimension that is of variable length, try to use tf.contrib.data.bucket_by_sequence_length to dataset.apply() function, make bucket_batch_sizes = [batch_size] * (len(buckets) + 1). And there is another way to do so just as #mrry has said in comments.
iterator = dataset.make_one_shot_iterator()
item = iterator.get_next()
padded_shapes = []
for i in item:
padded_shapes.append(i.get_shape())
padded_shapes = tf.contrib.framework.nest.pack_sequence_as(item, padded_shapes)
dataset = dataset.padded_batch(batch_size, padded_shapes)
If one dimension in the shapes of a tensor is None or -1, then padded_batch will pad the tensor on that dimension to max length of the batch.
My training data has two features of varibale length, And this method works fine.

Variable size multi-label candidate sampling in tensorflow?

nce_loss() asks for a static int value for num_true. That works well for problems where we have the same amount of labels per training example and we know it in advance.
When labels have a variable shape [None], and being batched and/or bucketed by bucket size with .padded_batch() + .group_by_window() it is necessary to provide a variable size num_true in order to accustom for all training examples. This is currently unsupported to my knowledge (correct me if I'm wrong).
In other words suppose we have either a dataset of images with an arbitrary amount of labels per each image (dog, cat, duck, etc.) or a text dataset with numerous multiple classes per sentence (class_1, class_2, ..., class_n). Classes are NOT mutually exclusive, and can vary in size between examples.
But as the amount of possible labels can be huge 10k-100k is there a way to do a sampling loss to improve performance (in comparison with a sigmoid_cross_entropy)?
Is there a proper way to do this or any other workarounds?
nce_loss = tf.nn.nce_loss(
weights=nce_weights,
biases=nce_biases,
labels=labels,
inputs=inputs,
num_sampled=num_sampled,
# Something like this:
# `num_true=(tf.shape(labels)[-1])` instead of `num_true=const_int`
# , would be preferable here
num_classes=self.num_classes)
I see two issues:
1) Work with NCE with different numbers of true values;
2) Classes that are NOT mutually exclusive.
To the first issue, as #michal said, there is an expectative of including this functionality in the future. I have tried almost the same thing: to use labels with shape=(None, None), i.e., true_values dimension None. The sampled_values parameter has the same problem: true_values number must be a fixed integer number. The recomended work around is to use a class (0 is the best one) representing <PAD> and complete the number of true_values. In my case, 0 is an special token that represents <PAD>. Part of code is here:
assert len(labels) <= (window_size * 2)
zeros = ((window_size * 2) - len(labels)) * [0]
labels = labels + zeros
labels.sort()
I sorted the label because considering another recommendation:
Note: By default this uses a log-uniform (Zipfian) distribution for
sampling, so your labels must be sorted in order of decreasing
frequency to achieve good results.
In my case, the special tokens and more frequent words have lower indexes, otherwise, less frequent words have higher indexes. I included all label classes associated to the input at same time and completed with zero till the true_values number. Of course, you must ignore the 0 class at the end.

Setting up the input on an RNN in Keras

So I had a specific question with setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model with the rest being padded by 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28. So there are 550k rows each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k- sequence length) different arrays and feed all of those to the network?
Assuming that I want to the first layer to be equivalent to the number of features per row, and looking at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50,27), and again do I have to manually split the dataset up or would Keras automatically do that for me?
RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array. This is also called "batch size", number of examples, samples, etc.
Time steps are the amount of steps for each sequence
Elements per step is how much info you have in each step of a sequence
I'm assuming the 27 features are inputs and relate to ElementsPerStep, while the 1 target is the expected output having 1 output per step.
So I'm also assuming that your output is a sequence with also 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, then you must reshape your array like this:
(1, 550000, 28)
#1 sequence
#550000 steps per sequence
#28 data elements per step
#PS: this sequence is too long, if it creates memory problems to you, maybe it will be a good idea to use a `stateful=True` RNN, but I'm explaining the non stateful method first.
Now you must split this array for inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27] #targets
Shaping the keras layers:
Keras layers will ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000,27).
Since your desired result is a sequence with same length, we will use return_sequences=True. (Else, you'd get only one result).
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells)
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output. (If you're using only recurrent layers, of course)
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train in smaller length sequences*. The system will understand that every new batch is a sequel of the previous batches.
Then you will need to define batch_input_shape=(BatchSize,ReducedTimeSteps,Elements). In this case, the batch size should not be ignored like in the other case.
* Unfortunately I have no experience with stateful=True. I'm not sure about whether you must manually divide your array (less likely, I guess), or if the system automatically divides it internally (more likely).
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps:
X = []
for i in range(550000-49):
X.append(originalX[i:i+50]) #then take care of the 28th element
Y = #it seems you just exclude the first 49 ones from the original

Incorporating very large constants in Tensorflow

For example, the comments for the Tensorflow image captioning example model state:
NOTE: This script will consume around 100GB of disk space because each image
in the MSCOCO dataset is replicated ~5 times (once per caption) in the output.
This is done for two reasons:
1. In order to better shuffle the training data.
2. It makes it easier to perform asynchronous preprocessing of each image in
TensorFlow.
The primary goal of this question is to see if there is an alternative to this type of duplication. In my use case, storing the data in this way would require each image to be duplicated in the TFRecord files many more times, on the order of 20 - 50 times.
I should note first that I have already fed the images through VGGnet to extract 4096 dim features, and I have these stored as a mapping between filename and the vectors.
Before switching over to Tensorflow, I had been feeding batches containing filename strings and then looking up the corresponding vector on a per-batch basis. This allows me to store all of the image data in ~15GB without needing to duplicate the data on disk.
My first attempt to do this in in Tensorflow involved storing indices in the TFExample buffers and then doing a "preprocessing" step to slice into the corresponding matrix:
img_feat = pd.read_pickle("img_feats.pkl")
img_matrix = np.stack(img_feat)
preloaded_images = tf.Variable(img_matrix)
first_image = tf.slice(preloaded_images, [0,0], [1,4096])
However, in this case, Tensorflow disallows a variable larger than 2GB. So my next thought was to partition this across several variables:
img_tensors = []
for i in range(NUM_SPLITS):
with tf.Graph().as_default():
img_tensors.append(tf.Variable(img_matrices[i], name="preloaded_images_%i"%i))
first_image = tf.concat(1, [tf.slice(t, [0,0], [1,4096//NUM_SPLITS]) for t in img_tensors])
In this case, I'm forced to store each partition on a separate graph, because it seems any one graph cannot be this large either. However, now the concat fails because each tensor I am concatenating is on a separate graph.
Any advice on incorporating a large amount (~15GB) of preloaded into the Tensorflow graph.
Potentially related is this question; however in this case I'd like to override the decoding of the actual JPEG file with the preprocessed value in a tensor op.