How to gather the last X indices embeddings before every grouping of False in a boolean mask for every row in a batch - tensorflow

I am given a mask for the input with shape [batch size, no of timesteps]. Using it, I need to collect X embeddings from a tensor of shape [batch size, timestep index, embedding size], taking the X timesteps that come just before each grouping of False.
For example, say the mask for a batch size of 1 is T,T,T,F,F,F,F,|T,T,F,F,F,F,F and X=2 (the '|' marks a break, i.e. a row split of length 7). I should then get a list of concatenated embeddings given by the indices (1,2) (8, 9).
This should work for a variable batch size without processing each batch element individually, as my batch size is quite large. The output should be [ [ (1,2) , (.,.),.. for other batches at first row split (0:7) ] , [ (8,9) , (.,.) for other batches at second row split (7:14)], ..]

Related

How to create a boolean mask in tensorflow with middle range only set to True using indexing specified in another tensor

I have a tensor of shape [None, 1] consisting of a value for each batch. Using this tensor, I have to create a boolean mask with values set to true starting from the index value (from the tensor) of the corresponding batch to a fixed length and rest all set to false.
Example, consider the index tensor to be
[[1],[2],[2]]
Suppose the number of timesteps for each batch is 5 and the fixed length is 2. Then for the first batch, the values at the indices starting from 1 and ending at 2 (as fixed length=2) are to be set to True, and likewise for the other batches. That is, I want the boolean mask to be
[[False,True,True,False,False],
[False,False,True,True,False],
[False,False,True,True,False]]
How to achieve the above without having to do individually for each batch? And preferably without using ragged feature in tensorflow?
index < tf.range(number_of_timesteps)
The above can be used for setting True in the extremes, but I could not find a way to set True in the middle.
Solving your example can be done with a combination of tf.one_hot and tf.roll.
input = [[1],[2],[2]]
intermediate_output = tf.one_hot(input, 5)
output = intermediate_output + tf.roll(intermediate_output, shift=1,axis=2)
If you need to convert to boolean from zeros and ones
output = tf.where(tf.equal(output, 1), True, False)
Explanation:
tf.one_hot is used to convert your indices to an intermediate one hot representation. Next tf.roll shifts the intermediate representation by 1. Combining this with the intermediate representation and converting to boolean returns your desired output.
EDIT:
I don't see a good way other than a for loop to extend this to multiple timesteps. The below code will generate your desired output
input = [[1],[2],[2]]
intermediate_output = tf.one_hot(input, 5)
outputs = []
timesteps = 3
for i in range(timesteps):
    outputs.append(tf.roll(intermediate_output, shift=i, axis=2))
# Stack the shifted copies and sum across them (axis 0), not across the size-1 axis
output = tf.reduce_sum(tf.stack(outputs), axis=0)
output = tf.where(tf.equal(output, 1), True, False)
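A loop-free alternative is to compare a range of positions against each row's start index, which extends the index < tf.range(...) idea from the question with a second comparison. The sketch below uses NumPy so that it runs without TensorFlow; the names window_mask, start, and length are made up for illustration, and the same broadcasting should carry over to tf.range. Note that tf.roll wraps around, whereas here a window that runs past the last timestep is simply cut off.

```python
import numpy as np

def window_mask(start, num_timesteps, length):
    """Boolean mask that is True on [start, start + length) in every row."""
    start = np.asarray(start)               # shape [batch, 1]
    positions = np.arange(num_timesteps)    # shape [num_timesteps]
    # Broadcasting compares every position against every row's start index.
    return (positions >= start) & (positions < start + length)

mask = window_mask([[1], [2], [2]], num_timesteps=5, length=2)
print(mask)
```

The result has shape [batch, num_timesteps], so no extra singleton axis needs to be squeezed out afterwards.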

In Tensorflow Classification, how are the labels ordered when using "predict"?

I'm using the MNIST handwritten numerals dataset to train a CNN.
After training the model, I use predict like this:
predictions = cnn_model.predict(test_images)
predictions[0]
and I get this output:
array([2.1273775e-06, 2.9292005e-05, 1.2424786e-06, 7.6307842e-05,
7.4305902e-08, 7.2301691e-07, 2.5368356e-08, 9.9952960e-01,
1.2401938e-06, 1.2787555e-06], dtype=float32)
In the output, there are 10 probabilities, one for each numeral from 0 to 9. But how do I know which probability refers to which numeral?
In this particular case, the probabilities are arranged sequentially for numerals 0 to 9. But why is that? I didn't define that anywhere.
I tried going over documentation and example implementations found elsewhere on the internet, but no one seems to have addressed this particular behaviour.
Edit:
For context, I've defined my train/test data by:
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = (np.expand_dims(train_images, axis=-1)/255.).astype(np.float32)
train_labels = (train_labels).astype(np.int64)
test_images = (np.expand_dims(test_images, axis=-1)/255.).astype(np.float32)
test_labels = (test_labels).astype(np.int64)
And my model consists of a few convolution and pooling layers, then a Flatten layer, then a Dense layer with 128 neurons and an output Dense layer with 10 neurons.
After that I simply fit my model and use predict like this:
cnn_model.fit(train_images, train_labels, batch_size=BATCH_SIZE, epochs=EPOCHS)
predictions = cnn_model.predict(test_images)
I don't see where I've instructed my code to output the first neuron as digit 0, the second neuron as digit 1, etc.
And if I wanted to change the sequence in which the resulting digits are output, where would I do that?
This is really confusing me a lot.
Models work with numbers. Your classes/labels should be represented as numbers (e.g., 0, 1, ...., n). The prediction is always indexed to show probabilities for class 0 at index 0, class 1 at index 1. Now in the MNIST case, you are lucky the labels are integers 0 to 9. Suppose you had to classify images into three classes: cars, bicycles, trucks. You must represent those classes as numerical values. You can arrange it as you wish. If you choose this: {cars: 0, bicycles: 1, trucks: 2}, in other words, if you label your cars as 0, bicycles as 1, and trucks as 2, then your prediction would show probability for cars at index 0, bicycles at index 1 and trucks at index 2.
You could have also decided to choose this setting: {cars: 2, bicycles: 0, trucks: 1}, then your prediction would show probability for cars at index 2, bicycles at index 0 and trucks at index 1, and so on.
The point is, you have to represent your classes (however many you have) as integers from 0 to n, where n is num_classes - 1. The probabilities in the prediction are indexed the same way. You don't have to tell the model anything about it.
Hope this is now clear.
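To make the index-to-class mapping concrete, here is a minimal sketch in plain Python. The class_names list and the prediction values are made up for illustration; the only assumption is that the list is ordered by the integer label used during training.

```python
# Hypothetical class names, ordered by the integer label used in training.
class_names = ["cars", "bicycles", "trucks"]   # labels 0, 1, 2

# A made-up prediction row: one probability per output neuron.
prediction = [0.10, 0.75, 0.15]

# The index of the largest probability is the predicted integer label.
predicted_label = max(range(len(prediction)), key=lambda i: prediction[i])
print(predicted_label, class_names[predicted_label])
```

The model itself only ever sees the integers; the list is where you remember what they stand for.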
It depends on how you prepare your labels during training. With MNIST classification, usually, there are two different ways:
One-hot Labels: There are 10 classes in the MNIST data, so for each example (image) you create a label array (vector) of length 10 where all the elements are zero except at the index corresponding to the digit that your input image is showing. For example, if your input image is showing the digit 8, your label contains zeros everywhere except at index 8 (e.g. [0,0,0,0,0,0,0,0,1,0]). If your image is showing the digit 2, your label would be [0,0,1,0,0,0,0,0,0,0], and so on.
Sparse Labels: you just label each image directly by what digit it is showing, for example if your image is showing the digit 8, your label is a single number with value 8.
In both cases, you could choose the labels however you want, in the MNIST classification it is just intuitive to use the labels 0-9 to show digits 0-9.
Thus, in the prediction, the probability at index 0 is for digit 0, index 1 for digit 1, and so on.
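The two label formats carry the same information, which a short sketch in plain Python can show. The helper to_one_hot is a hypothetical name written for this example, not a library function.

```python
def to_one_hot(label, num_classes=10):
    """Turn a sparse integer label into a one-hot list."""
    return [1 if i == label else 0 for i in range(num_classes)]

sparse_label = 8
one_hot_label = to_one_hot(sparse_label)
print(one_hot_label)

# The position of the single 1 recovers the sparse label.
recovered = one_hot_label.index(1)
print(recovered)
```

Either way, the position in the vector is the integer label, and that is the position the prediction uses too.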
You could choose to prepare your labels differently. For example you could decide to show your labels as follows:
label for digit 0: 9
label for digit 1: 8
label for digit 2: 7
label for digit 3: 6
label for digit 4: 5
label for digit 5: 4
label for digit 6: 3
label for digit 7: 2
label for digit 8: 1
label for digit 9: 0
You could train your model the same way but in this case, the probabilities in the prediction would be inverted. Probability at index 0 would be for digit 9, index 1 for digit 8, and so on.
In short, you have to define your labels using integer indices, but it is up to you to decide and remember what index you chose to refer to which label/class.
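As a rough sketch of the reversed labelling above, the snippet below encodes digits with the label 9 - digit and then decodes a prediction back to a digit. The prediction values are invented for illustration, and the dictionary names are hypothetical.

```python
# Label scheme from the list above: digit d is trained with label 9 - d.
label_for_digit = {digit: 9 - digit for digit in range(10)}

# Invert the mapping so that a prediction index can be decoded.
digit_for_label = {label: digit for digit, label in label_for_digit.items()}

# Made-up prediction with the highest probability at index 2.
prediction = [0.0, 0.01, 0.9, 0.02, 0.0, 0.03, 0.0, 0.04, 0.0, 0.0]
predicted_index = prediction.index(max(prediction))
predicted_digit = digit_for_label[predicted_index]
print(predicted_index, predicted_digit)
```

Keeping the mapping in one place and inverting it programmatically avoids having to remember the scheme by hand.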

What is the meaning of slim.metrics.streaming_sparse_average_precision_at_k?

This function refers to tf.contrib.metrics.streaming_sparse_average_precision_at_k, and its explanation in the source code is as follows. Can anyone explain it with a simple example? I also wonder whether this metric is the same as the average precision calculation used in the PASCAL VOC 2012 challenge. Thanks a lot.
def sparse_average_precision_at_k(labels,
predictions,
k,
weights=None,
metrics_collections=None,
updates_collections=None,
name=None):
"""Computes average precision@k of predictions with respect to sparse labels.
`sparse_average_precision_at_k` creates two local variables,
`average_precision_at_<k>/total` and `average_precision_at_<k>/max`, that
are used to compute the frequency. This frequency is ultimately returned as
`average_precision_at_<k>`: an idempotent operation that simply divides
`average_precision_at_<k>/total` by `average_precision_at_<k>/max`.
For estimation of the metric over a stream of data, the function creates an
`update_op` operation that updates these variables and returns the
`precision_at_<k>`. Internally, a `top_k` operation computes a `Tensor`
indicating the top `k` `predictions`. Set operations applied to `top_k` and
`labels` calculate the true positives and false positives weighted by
`weights`. Then `update_op` increments `true_positive_at_<k>` and
`false_positive_at_<k>` using these values.
If `weights` is `None`, weights default to 1. Use weights of 0 to mask values.
Args:
labels: `int64` `Tensor` or `SparseTensor` with shape
[D1, ... DN, num_labels] or [D1, ... DN], where the latter implies
num_labels=1. N >= 1 and num_labels is the number of target classes for
the associated prediction. Commonly, N=1 and `labels` has shape
[batch_size, num_labels]. [D1, ... DN] must match `predictions`. Values
should be in range [0, num_classes), where num_classes is the last
dimension of `predictions`. Values outside this range are ignored.
predictions: Float `Tensor` with shape [D1, ... DN, num_classes] where
N >= 1. Commonly, N=1 and `predictions` has shape
[batch size, num_classes]. The final dimension contains the logit values
for each class. [D1, ... DN] must match `labels`.
k: Integer, k for @k metric. This will calculate an average precision for
range `[1,k]`, as documented above.
weights: `Tensor` whose rank is either 0, or n-1, where n is the rank of
`labels`. If the latter, it must be broadcastable to `labels` (i.e., all
dimensions must be either `1`, or the same as the corresponding `labels`
dimension).
metrics_collections: An optional list of collections that values should
be added to.
updates_collections: An optional list of collections that updates should
be added to.
name: Name of new update operation, and namespace for other dependent ops.
Returns:
mean_average_precision: Scalar `float64` `Tensor` with the mean average
precision values.
update: `Operation` that increments variables appropriately, and whose
value matches `metric`.
Raises:
ValueError: if k is invalid.
"""

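To make the docstring above a bit more concrete, here is a small sketch in plain Python of average precision@k for a single example. It follows the usual information-retrieval definition and is not TensorFlow's implementation. The function name is made up, and the denominator min(k, number of true labels) is an assumption about how the metric normalises, so check the TensorFlow source before relying on it. The streaming version would then average such per-example values over all batches seen so far.

```python
def average_precision_at_k(scores, true_labels, k):
    """Average precision@k for one example.

    scores: one score per class; true_labels: set of relevant class ids.
    """
    # Class ids ordered by descending score, truncated to the top k.
    top_k = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
    hits = 0
    precision_sum = 0.0
    for rank, class_id in enumerate(top_k, start=1):
        if class_id in true_labels:
            hits += 1
            precision_sum += hits / rank   # precision@rank, counted only at hits
    # Assumed normaliser: the most hits that could appear in the top k.
    denominator = min(k, len(true_labels))
    return precision_sum / denominator if denominator else 0.0

scores = [0.1, 0.5, 0.3, 0.9]                 # made-up scores for classes 0..3
ap_two_labels = average_precision_at_k(scores, {1, 3}, k=3)
ap_one_label = average_precision_at_k(scores, {2}, k=3)
print(ap_two_labels, ap_one_label)
```

With these scores the top 3 classes are 3, 1, 2. Both true labels of the first example sit at ranks 1 and 2, so its value is 1.0; the single true label of the second example only shows up at rank 3, which gives 1/3.
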
PCA sklearn - Which dimension does it take

Does sklearn PCA consider the columns of the dataframe as the vectors to reduce or the rows as vectors to reduce ?
Because when doing this:
df=pd.DataFrame([[1,-21,45,3,4],[4,5,89,-5,6],[7,-4,58,1,19],[10,11,74,20,12],[13,14,15,45,78]]) #5 rows 5 columns
pca=PCA(n_components=3)
pca.fit(df)
df_pcs=pd.DataFrame(data=pca.components_, index = df.index)
I get the following error:
ValueError: Shape of passed values is (5, 3), indices imply (5, 5)
Rows represent samples and columns represent features. PCA reduces the dimensionality of the data, i.e. the number of features, so it reduces the columns.
In terms of vectors, it treats each row as a single feature vector and reduces its size.
For example, if you have a dataframe of shape [100, 6] and n_components is set to 3, the transformed output will have shape [100, 3].
# You need this: project the data onto the principal components
df_pcs=pca.transform(df)
# This raises the error because the shapes don't match
df_pcs=pd.DataFrame(data=pca.components_, index = df.index)
pca.components_ is an array of [3,5] and your index parameter is using the df.index which is of shape [5,]. Hence the error. pca.components_ represents a completely different thing.
According to the documentation:
components_ : array, [n_components, n_features]
Principal axes in feature space, representing the
directions of maximum variance in the data.
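The shapes involved can be checked with a small sketch. It uses a plain NumPy SVD as a stand-in for sklearn's PCA (an assumption made so the example needs nothing beyond NumPy), with the data from the question. The variable names components and X_reduced are illustrative counterparts of pca.components_ and pca.transform(df).

```python
import numpy as np

X = np.array([[1, -21, 45, 3, 4],
              [4, 5, 89, -5, 6],
              [7, -4, 58, 1, 19],
              [10, 11, 74, 20, 12],
              [13, 14, 15, 45, 78]], dtype=float)   # 5 samples x 5 features

n_components = 3
X_centered = X - X.mean(axis=0)            # centre each feature (column)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)

components = vt[:n_components]             # [n_components, n_features]
X_reduced = X_centered @ components.T      # [n_samples, n_components]
print(components.shape, X_reduced.shape)
```

Only X_reduced has one row per sample, so it is the array that can take df.index as its index, for example pd.DataFrame(pca.transform(df), index=df.index) in the sklearn version.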

Why use tf.mul in the word2vec training process?

The Word2vec model uses noise-contrastive estimation (NCE) loss to train the model.
Why does it use tf.mul in the true sample logit calculation, but uses tf.matmul in the negative calculation?
See the source code.
One way you can think of the NCE loss calculation is as a batch of independent, binary logistic regression classification problems. In both cases we are performing the same calculation, even though it does not look like it at first glance.
To show you that we are actually calculating the same thing, assume the following for the true input part:
emb_dim = 3 # dimensions of your embedding vector
batch_size = 2 # number of examples in your training batch
vocab_size = 6 # number of total words in your text
# (so your word ids range from 0 - 5)
Furthermore, assume the following training example in your batch:
1 => 0 # given word with word_id=1, I expect word with word_id=0
1 => 2 # given word with word_id=1, I expect word with word_id=2
Then your embedding matrix example_emb has the dimensions [2,3] and your true weight matrix true_w also has the dimensions [2,3], and should look like this:
example_emb = [ [e1,e2,e3], [e1,e2,e3] ] # [2,3] input word
true_w = [ [w1,w2,w3], [w4,w5,w6] ] # [2,3] target word
The example_emb is a subset of the total word embeddings (emb) that you are trying to learn, and true_w is a subset of the weights (sm_w_t). Each row in example_emb represents an input vector, and each row in the weights represents a target vector.
So [e1,e2,e3] is the word vector of the input word with word_id = 1 taken from emb, and [w1,w2,w3] is the word vector of the expected target word with word_id = 0.
Now intuitively stated, the classification task you are trying to solve is: given that I see this input word together with this target word, is the observation correct?
The two classification tasks then are (without the bias, and tensorflow has this handy 'sigmoid_cross_entropy_with_logits' function, which applies the sigmoid later):
logit( 1=>0 ) = dot( [e1,e2,e3], transpose( [w1,w2,w3] ) ) =>
logit( 1=>0 ) = e1*w1 + e2*w2 + e3*w3
and
logit( 1=>2 ) = dot( [e1,e2,e3], transpose( [w4,w5,w6] ) ) =>
logit( 1=>2 ) = e1*w4 + e2*w5 + e3*w6
The easiest way to calculate [[logit(1=>0)],[logit(1=>2)]] is to perform an element-wise multiplication with tf.mul() (renamed tf.multiply in TensorFlow 1.0) and then sum up each row.
The output of this calculation will be a [batch_size, 1] matrix containing the logits for the correct words. We do know the ground truth/label (y') for these examples, which is 1 because these are the correct examples.
true_logits = [
[logit(1=>0)], # first input word of the batch
[logit(1=>2)] # second input word of the batch
]
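The true-logit step can be checked numerically. The following is a small sketch in NumPy rather than TensorFlow, with made-up numbers standing in for e1..e3 and w1..w6. It also shows why a matrix multiplication would be wasteful here: it computes every input/target pairing, and only the diagonal is needed.

```python
import numpy as np

example_emb = np.array([[1.0, 2.0, 3.0],
                        [1.0, 2.0, 3.0]])    # [batch_size, emb_dim]
true_w = np.array([[0.5, 0.1, 0.2],
                   [0.3, 0.4, 0.6]])         # [batch_size, emb_dim]

# Element-wise product, then sum over the embedding dimension.
true_logits = np.sum(example_emb * true_w, axis=1, keepdims=True)   # [batch_size, 1]

# A full matrix product pairs every input with every target;
# the logits we want are only its diagonal.
all_pairs = example_emb @ true_w.T           # [batch_size, batch_size]
diagonal = np.diag(all_pairs).reshape(-1, 1)
print(true_logits)
```

Row i of example_emb is only ever paired with row i of true_w, which is exactly what the element-wise product followed by a row sum expresses.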
Now for the second part of your question, why we use tf.matmul() in the negative sampling: let's assume that we draw 3 negative samples (num_sampled=3), so sampled_ids = [3,4,5].
Intuitively, this means that you add six more training examples to your batch, namely:
1 => 3 # given word_id=1, do I expect word_id=3? No, because these are negative examples.
1 => 4
1 => 5
1 => 3 # second input word is also word_id=1
1 => 4
1 => 5
So you look up your sampled_w, which turns out to be a [3, 3] matrix. Your parameters now look like this:
example_emb = [ [e1,e2,e3], [e1,e2,e3] ] # [2,3] input word
sampled_w = [ [w7,w8,w9], [w10,w11,w12], [w13,w14,w15] ] # [3,3] sampled target words
Similar to the true case, what we want is the logits for all negative training examples. E.g., for the first example:
logit(1 => 3) = dot( [e1,e2,e3], transpose( [w7,w8,w9] ) ) =>
logit(1 => 3) = e1*w7 + e2*w8 + e3*w9
Now in this case, we can use the matrix multiplication after we transpose the sampled_w matrix. This is achieved using the transpose_b=True parameter in the tf.matmul() call. The transposed weight matrix looks like this:
sampled_w_trans = [ [w7,w10,w13], [w8,w11,w14], [w9,w12,w15] ] # [3,3]
So now the tf.matmul() operation will return a [batch_size, 3] matrix, where each row holds the logits for one example of the input batch. Each element represents a logit for one classification task.
The whole result matrix of the negative sampling contains this:
sampled_logits = [
[logit(1=>3), logit(1=>4), logit(1=>5)], # first input word of the batch
[logit(1=>3), logit(1=>4), logit(1=>5)] # second input word of the batch
]
The labels / ground truth for the sampled_logits are all zeros, because these are the negative examples.
In both cases we perform the same calculation, that is the calculation for a binary classification logistic regression (without the sigmoid, which is applied later).
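For the negative samples, every input word is paired with every sampled word, so a matrix product is the natural fit. The sketch below uses NumPy with made-up numbers; the expression example_emb @ sampled_w.T is meant as the counterpart of tf.matmul(example_emb, sampled_w, transpose_b=True).

```python
import numpy as np

example_emb = np.array([[1.0, 2.0, 3.0],
                        [1.0, 2.0, 3.0]])    # [batch_size, emb_dim]
sampled_w = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])      # [num_sampled, emb_dim]

# One logit per (input word, sampled word) pair.
sampled_logits = example_emb @ sampled_w.T   # [batch_size, num_sampled]

# The same values as explicit multiply-and-sum dot products.
pairwise = np.array([[np.sum(e * w) for w in sampled_w] for e in example_emb])
print(sampled_logits)
```

Each entry is the same multiply-and-sum as in the true case; the matrix product just computes all batch_size * num_sampled of them at once.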