What is the difference between global_max_pool/global_avg_pool and avg_pool_2d/1d/3d? - tflearn

I've tried to compare the tutorial code for text classification from tflearn : https://github.com/tflearn/tflearn/blob/master/examples/nlp/cnn_sentence_classification.py
And the one from dennybritz :
https://github.com/dennybritz/cnn-text-classification-tf
These 2 codes shows different result, i understand that it can be because the tflearn tutorial use 1d convolution, but there is one line of code that i don't understand:
network = global_max_pool(network)
What is the difference between global_max_pool and max_pool_2d?

Looking at the code, they make different calls to the tensor flow library:
2d_max_pool
Does broadly what you would expect and returns (as well as doing some other things):
tf.nn.max_pool(incoming, kernel, strides, padding)
With the specified arguments. This is a 4d tensor, similar to the input one
global_max_pool
Actually performs a pretty drastic reduction in the input tensor. The input tensor is of dimension:
[batch, height, width, in_channels]
The function global_max_pool then returns (as well as doing some other things)
tf.reduce_max(incoming, [1, 2])
I think gives the maximum value of each tensor along all of each of the in_channels

Related

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix tensorflow and keras in you customer loss function. Once you can access to all Tensorflow function, things become very easy. I just give you a example of how this function could be imlement.
import tensorflow as tf
def custom_loss(target, softmax):
max_indices = tf.argmax(softmax, -1)
# Get the embedding matrix. In Tensorflow, this can be directly done
# with tf.nn.embedding_lookup
embedding_vectors = tf.nn.embedding_lookup(you_embedding_matrix, max_indices)
# Do anything you want with normal keras loss function
loss = some_keras_loss_function(target, embedding_vectors)
loss = tf.reduce_mean(loss)
return loss
Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-derivable operations. Note such operations are acceptable for the real value (a loss function takes a real value and a predicted value, non-derivable operations are only fine for the real value).
To be fair, that was what I was asking in the first place. It is not possible to do what I wanted, but we can get a similar and derivable behaviour:
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4 [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2400]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.24. The higher my_power is, the more similar to a one-hot the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but not our "soft_extreme". First, find the sum of the array:
norm = tf.reduce_sum(soft_extreme, 1)
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally we want the vector from the dictionary. Lookup is forbidden, but we can take the average vector using matrix multiplication. Because our soft_normalized is similar to one-hot encoding this average will be similar to the vector associated to the highest argument (original intended behaviour). The higher my_power is in (1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" (from [batch, dictionary_length] to [batch, 1, dictionary_length] using tf.reshape. Then tiled my embedding_matrix batch times and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.

Using TensorFlow hessians for second partial derivative test

Second partial derivative test is a simple way to tell whether a critical point is a minimum, a maximum, or a saddle. I am currently toying with the idea of implementing such test for a simple neural network in tensorflow. The following set of weights is used for modeling an XOR neural network with 2 inputs, 1 hidden layer with 2 hidden units, and 1 output unit:
weights = {
'h1': tf.Variable(np.empty([2, 2]), name="h1", dtype=tf.float64),
'b1': tf.Variable(np.empty([2]), name="b1", dtype=tf.float64),
'h2': tf.Variable(np.empty([2, 1]), name="h2", dtype=tf.float64),
'b2': tf.Variable(np.empty([1]), name="b2", dtype=tf.float64)
}
Both the gradients and the hessians can now be obtained as follows:
gradients = tf.gradients(mse_op, [weights['h1'], weights['b1'], weights['h2'], weights['b2']])
hessians = tf.hessians(mse_op, [weights['h1'], weights['b1'], weights['h2'], weights['b2']])
Where mse_op is the MSE error of the network.
Both gradients and hessians compute just fine. The dimensionality of the gradients is equal to the dimensionality of the original inputs. The dimensionality of the hessians obviously differs.
The question: is it a good idea, and is it even possible to conveniently compute the eigenvalues of the hessians generated by tf.hessian applied to the given set of weights? Will the eigenvalues be representative of what I think they represent - i.e., will I be able to say that if overall, both positive and negative values are present, then we can conclude that the point is a saddle point?
So far, I have tried the following out-of-the-box approach to calculate the eigenvalues of each of the hessians:
eigenvals1 = tf.self_adjoint_eigvals(hessians[0])
eigenvals2 = tf.self_adjoint_eigvals(hessians[1])
eigenvals3 = tf.self_adjoint_eigvals(hessians[2])
eigenvals4 = tf.self_adjoint_eigvals(hessians[3])
1,2, and 4 work, but the 3rd one bombs out, complaining that Dimensions must be equal, but are 2 and 1 for 'SelfAdjointEigV2_2' (op: 'SelfAdjointEigV2') with input shapes: [2,1,2,1]. Should I just reshape the hessian somehow and carry on, or am I on the wrong track entirely?
After some fiddling, I have figured out that, given n*m matrix of input variables, TensorFlow's tf.hessians produces [n,m,n,m] tensor, which can be reshaped into square [n*m, n*m] Hessian matrix as follows:
sq_hess = tf.reshape(hessians[0], [tf.size(weights['h1']), tf.size(weights['h1'])])
Further, one can calculate the eigenvalues of the resulting square hessian:
eigenvals = tf.self_adjoint_eigvals(sq_hess)
This might be trivial, but it took me some time to wrap my head around this. I believe the behaviour of tf.hessians is not very well documented. Once you put together the dimensionalities, though, everything makes sense!

LSTM Followed by Mean Pooling (TensorFlow)

I am aware that there is a similar topic at LSTM Followed by Mean Pooling, but that is about Keras and I work in pure TensorFlow.
I have an LSTM network where the recurrence is handled by:
outputs, final_state = tf.nn.dynamic_rnn(cell,
embed,
sequence_length=seq_lengths,
initial_state=initial_state)
where I pass the correct sequence lengths for each sample (padding by zeros). In any case, outputs contains irrelevant outputs since some samples produce longer outputs than others, based on sequence lengths.
Right now I'm extracting the last relevant output by means of the following method:
def extract_axis_1(data, ind):
"""
Get specified elements along the first axis of tensor.
:param data: Tensorflow tensor that will be subsetted.
:param ind: Indices to take (one for each element along axis 0 of data).
:return: Subsetted tensor.
"""
batch_range = tf.range(tf.shape(data)[0])
indices = tf.stack([batch_range, ind], axis=1)
res = tf.reduce_mean(tf.gather_nd(data, indices), axis=0)
where I pass sequence_length - 1 as indices. In reference to the last topic, I would like to select all relevant outputs followed by average pooling, instead of just the last one.
Now, I tried passing nested lists as indeces to extract_axis_1 but tf.stack does not accept this.
Any solution directions for this?
You can exploit the weight parameter of the tf.contrib.seq2seq.sequence_loss function.
From the documentation:
weights: A Tensor of shape [batch_size, sequence_length] and dtype float. weights constitutes the weighting of each prediction in the sequence. When using weights as masking, set all valid timesteps to 1 and all padded timesteps to 0, e.g. a mask returned by tf.sequence_mask.
You need to compute a binary mask that distinguish between your valid outputs and invalid ones. Then you can just provide this mask to the weights parameter of the loss function (probably, you will want to use a loss like this one); the function will not consider the outputs with a 0 weight in the computation of the loss.
If you can't/don't need to use a sequence loss you can do exactly the same thing manually. You compute a binarymask and then multiply your outputs by this mask and provide these as inputs to your fully connected layer.

Shape of tensor for 2D image in Keras

I am a newbie to Keras (and somehow to TF) but I have found shape definition for the input layer very confusing.
So in the examples, when we have a 1D vector of length 20 for input, shape gets defined as
...Input(shape=(20,)...)
And when a 2D tensor for greyscale images needs to be defined for MNIST, it is defined as:
...Input(shape=(28, 28, 1)...)
So my question is why the tensor is not defined as (20) and (28, 28)? Why in the first case a second dimension is added and left empty? Also in second, number of channels have to be defined?
I understand that it depends on the layer so Conv1D, Dense or Conv2D take different shapes but it seems the first parameter is implicit?
According to docs, Dense needs be (batch_size, ..., input_dim) but how is this related the example:
Dense(32, input_shape=(784,))
Thanks
Tuples vs numbers
input_shape must be a tuple, so only (20,) can satisfy it. The number 20 is not a tuple. -- There is the parameter input_dim, to make your life easier if you have only one dimension. This parameter can take 20. (But really, I find it just confusing, I always work with input_shape and use tuples, to keep a consistent understanding).
Dense(32, input_shape=(784,)) is the same as Dense(32, input_dim=784).
Images
Images don't have only pixels, they also have channels (red, green, blue).
A black/white image has only one channel.
So, (28pixels, 28pixels, 1channel)
But notice that there isn't any obligation to follow this shape for images everywhere. You can shape them the way you like. But some kinds of layers do demand a certain shape, otherwise they couldn't work.
Some layers demand specific shapes
It's the case of the 2D convolutional layers, which need (size1,size2,channels). They need this shape because they must apply the convolutional filters accordingly.
It's also the case of recurrent layers, which need (timeSteps,featuresPerStep) to perform their recurrent calculations.
MNIST models
Again, there isn't any obligation to shape your image in a specific way. You must do it according to which first layer you choose and what you intend to achieve. It's a free thing.
Many examples simply don't care about an image being a 2d structured thing, and they just use models that take 784 pixels. That's enough. They probably start with Dense layers, which demand shapes like (size,)
Other examples may care, and use a shape (28,28), but then these models will have to reshape the input to fit the needs of the next layer.
Convolutional layers 2D will demand (28,28,1).
The main idea is: input arrays must match input_shape or input_dim.
Tensor shapes
Be careful, though, when reading Keras error messages or working with custom / lambda layers.
All these shapes we defined before omit an important dimension: the batch size or the number of samples.
Internally all tensors will have this additional dimension as the first dimension. Keras will report it as None (a dimension that will adapt to any batch size you have).
So, input_shape=(784,) will be reported as (None,784).
And input_shape=(28,28,1) will be reported as (None,28,28,1)
And your actual input data must have a shape that matches that reported shape.

understanding tensorflow sequence_loss parameters

The sequence_Loss module's source_code has three parameters that are required they list them as outputs, targets, and weights.
Outputs and targets are self explanatory, but I'm looking to better understand is what is the weight parameter?
The other thing I find confusing is that it states that the targets should be the same length as the outputs, what exactly do they mean by the length of a tensor? Especially if its a 3 dimensional tensor.
Think of the weights as a mask applied to the input tensor. In some NLP applications, we often have different sentence length for each sentence. In order to parallel/batch multiple instance sentences into a minibatch to feed into a neural net, people use a mask matrix to denotes which element in the the input tensor is actually a valid input. For instance, the weight can be a np.ones([batch, max_length]) that means all of the input elements are legit.
We can also use a matrix of the same shape as the labels such as np.asarray([[1,1,1,0],[1,1,0,0],[1,1,1,1]]) (we assume the labels shape is 3x4), then the crossEntropy of the first row last column will be masked out as 0.
You can also use weight to calculate weighted accumulation of cross entropy.
We used this in a class and our professor said we could just pass it ones of the right shape (the comment says "list of 1D batch-sized float-Tensors of the same length as logits"). That doesn't help with what they mean, but maybe it will help you get your code to run. Worked for me.
This code should do the trick: [tf.ones(batch_size, tf.float32) for _ in logits].
Edit: from TF code:
for logit, target, weight in zip(logits, targets, weights):
if softmax_loss_function is None:
# TODO(irving,ebrevdo): This reshape is needed because
# sequence_loss_by_example is called with scalars sometimes, which
# violates our general scalar strictness policy.
target = array_ops.reshape(target, [-1])
crossent = nn_ops.sparse_softmax_cross_entropy_with_logits(
logit, target)
else:
crossent = softmax_loss_function(logit, target)
log_perp_list.append(crossent * weight)
The weights that are passed are multiplied by the loss for that particular logit. So I guess if you want to take a particular prediction extra-seriously you can increase the weight above 1.