Why does an embedding vector need to be renormalized when its norm exceeds the limit? - intel-tensorflow

I saw max_norm=1 in a piece of code and looked it up; the documentation says it is the maximum norm. What does this mean? Why would the norm of an embedding vector exceed the limit, so that it needs to be renormalized?
Randomly initialize the user's vector,
self.users = nn.Embedding( n_users, dim, max_norm=1 )
max_norm (python:float, optional) – The maximum norm, if the norm of the embedding vector exceeds this limit, renormalization will be performed.
Also, n_users is specified and dim=10 means each user is a 10-dimensional vector. Why is the concept of dimension needed here, and is there any difference between the different dimensions?

embedding_dim (int) – the size of each embedding vector
max_norm (float, optional) – If given, each embedding vector with norm larger than max_norm is renormalized to have norm max_norm.
An embedding maps an integer index to a vector. embedding_dim is the length of that vector, and max_norm is the largest norm a vector is allowed to have: any looked-up vector whose norm is larger is rescaled so that its norm equals max_norm.
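To make the renormalization concrete, here is a minimal plain-Python sketch of what max_norm does to a single vector. It is not the PyTorch implementation: the renorm helper, the eps value and the toy table are assumptions made up for illustration.

```python
import math

def renorm(vector, max_norm, eps=1e-7):
    """Return `vector` rescaled so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= max_norm:
        return list(vector)           # already within the limit, left unchanged
    scale = max_norm / (norm + eps)   # eps is an assumed small safety term
    return [x * scale for x in vector]

# Hypothetical embedding table: 3 ids, embedding_dim = 2.
table = {0: [3.0, 4.0], 1: [0.3, 0.4], 2: [0.0, 2.0]}

looked_up = {i: renorm(v, max_norm=1.0) for i, v in table.items()}
norms = {i: math.sqrt(sum(x * x for x in v)) for i, v in looked_up.items()}
```

Vectors 0 and 2 start with norms 5 and 2, so they are scaled down to norm 1; vector 1 has norm 0.5 and is returned as it is. With randomly initialized embeddings, and after gradient updates, nothing else keeps the norms below the limit, which is why the option exists.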


Meaning of sparse in "sparse cross entropy loss"?

I read from the documentation:
from_logits=False, reduction="auto", name="sparse_categorical_crossentropy"
Computes the crossentropy loss between the labels and predictions.
Use this crossentropy loss function when there are two or more label
classes. We expect labels to be provided as integers. If you want to
provide labels using one-hot representation, please use
CategoricalCrossentropy loss. There should be # classes floating point
values per feature for y_pred and a single floating point value per
feature for y_true.
Why is this called sparse categorical cross entropy? If anything, we are providing a more compact encoding of class labels (integers vs one-hot vectors).
I think "sparse" refers to the labels rather than to the loss: a one-hot label vector is sparse (exactly one non-zero entry), and an integer index is the compact way to store such a vector. In other words, integer encoding = a better encoding for sparse one-hot labels.
This can be handy when you have many possible labels (and samples), in which case a one-hot encoding can be significantly more wasteful than a simple integer per example.
Why exactly it is called like that is probably best answered by Keras devs. However, note that this sparse cross-entropy is only suitable for "sparse labels", where exactly one value is 1 and all others are 0 (if the labels were represented as a vector and not just an index).
On the other hand, the general CategoricalCrossentropy also works with targets that are not one-hot, i.e. any probability distribution. The values just need to be between 0 and 1 and sum to 1. This tends to be forgotten because the use case of one-hot targets is so common in current ML applications.
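The relationship between the two losses can be checked with a small plain-Python sketch. The two functions below are simplified stand-ins written for this example (single sample, probabilities rather than logits), not the Keras implementations.

```python
import math

def categorical_crossentropy(target_probs, predicted_probs):
    """Cross-entropy against a full target distribution (one-hot or soft)."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)

def sparse_categorical_crossentropy(label, predicted_probs):
    """Cross-entropy against an integer class index."""
    return -math.log(predicted_probs[label])

predicted = [0.1, 0.7, 0.2]

# Integer label 1 and its one-hot form describe the same target.
sparse_loss = sparse_categorical_crossentropy(1, predicted)
one_hot_loss = categorical_crossentropy([0.0, 1.0, 0.0], predicted)

# A soft target can only be expressed in the non-sparse form.
soft_loss = categorical_crossentropy([0.0, 0.9, 0.1], predicted)
```

Here sparse_loss and one_hot_loss are the same number; the sparse variant just skips building the vector of zeros. The soft target has no integer-index equivalent.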

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix TensorFlow and Keras in your custom loss function. Once you have access to all TensorFlow functions, things become very easy. Here is an example of how this function could be implemented.
import tensorflow as tf

def custom_loss(target, softmax):
    max_indices = tf.argmax(softmax, -1)
    # Get the embedding matrix. In Tensorflow, this can be directly done
    # with tf.nn.embedding_lookup
    embedding_vectors = tf.nn.embedding_lookup(you_embedding_matrix, max_indices)
    # Do anything you want with normal keras loss function
    loss = some_keras_loss_function(target, embedding_vectors)
    loss = tf.reduce_mean(loss)
    return loss
Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-differentiable operations (tf.argmax has no gradient). Note that such operations are acceptable on the true value (a loss function takes a true value and a predicted value; non-differentiable operations are only fine on the true value).
To be fair, that was what I was asking for in the first place. It is not possible to do what I wanted, but we can get a similar and differentiable behaviour:
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4, [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2401]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.2401. The higher my_power is, the more similar to a one-hot the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but not our "soft_extreme". First, find the sum of the array:
norm = tf.reduce_sum(soft_extreme, axis=1, keepdims=True)  # keep the axis so the division in step 3 broadcasts per row
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally we want the vector from the dictionary. Lookup is forbidden, but we can take the weighted average vector using matrix multiplication. Because our almost_one_hot is similar to a one-hot encoding, this average will be similar to the vector associated with the highest argument (the originally intended behaviour). The higher my_power is in (1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" from [batch, dictionary_length] to [batch, 1, dictionary_length] using tf.reshape, then tiled my embedding_matrix batch times and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
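For anyone who wants to see steps 1 to 4 with numbers, here is a minimal plain-Python sketch. It uses lists instead of tensors, and the sharpen and soft_lookup helpers, the toy embedding_matrix and the power of 8 are all assumptions chosen for illustration, not part of the Keras code above.

```python
def sharpen(probs, power):
    """Steps 1-3: raise to a power, then renormalize so the row sums to 1."""
    raised = [p ** power for p in probs]
    total = sum(raised)
    return [r / total for r in raised]

def soft_lookup(weights, embedding_matrix):
    """Step 4: weighted average of the embedding rows (a vector-matrix product)."""
    dim = len(embedding_matrix[0])
    return [sum(w * row[d] for w, row in zip(weights, embedding_matrix))
            for d in range(dim)]

# Toy dictionary of 3 words with 2-dimensional vectors (made-up values).
embedding_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A batch of two softmax outputs; the argmax is word 1, then word 0.
batch = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]

predicted_vectors = [soft_lookup(sharpen(row, power=8), embedding_matrix)
                     for row in batch]
```

The two results land very close to the rows for word 1 and word 0, without any argmax or lookup. Each batch row is handled on its own here, which is the same thing a [batch, dictionary_length] by [dictionary_length, dim] matrix product computes.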

What is MobileNetv1 depth_multiplier?

Referring to the TensorFlow MobileNetV1 model: https://github.com/tensorflow/models/blob/9f7a5fa353df0ee2010f8e7a5494ca6b188af8bc/research/slim/nets/mobilenet_v1.py#L171
The param depth_multiplier is documented as:
depth_multiplier: Float multiplier for the depth (number of channels)
for all convolution ops. The value must be greater than zero. Typical
usage will be to set this value in (0, 1) to reduce the number of
parameters or computation cost of the model
But in the (paper), they mention 2 types of multipliers: width multiplier and resolution multiplier. Which one corresponds to the depth multiplier?
On Keras, they say that:
depth_multiplier: depth multiplier for depthwise convolution (also
called the resolution multiplier)
I'm so confused!
As described in the paper:
The role of the width multiplier α is to thin a network uniformly at each layer. For a given layer and width multiplier α, the number of input channels M becomes αM and the number of output channels N becomes αN.
The resolution multiplier ρ is applied to the input image and the internal representation of every layer is subsequently reduced by the same multiplier. In practice we implicitly set ρ by setting the input resolution.
In the code:
The depth_multiplier is used to reduce the number of channels at each layer. So the depth_multiplier corresponds to the width multiplier α.
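As a rough sketch of what that channel scaling looks like, assuming the slim code computes each layer's depth as max(int(d * depth_multiplier), min_depth) with a default min_depth of 8 (my reading of the linked file, so please check it against the source). The scaled_depth helper and the short channel list are illustrative, not copied from the model definition.

```python
def scaled_depth(channels, depth_multiplier, min_depth=8):
    """Scale a layer's channel count by the width multiplier, with a floor."""
    if depth_multiplier <= 0:
        raise ValueError("depth_multiplier must be greater than zero")
    return max(int(channels * depth_multiplier), min_depth)

# A few representative base channel counts (illustrative subset).
base_channels = [32, 64, 128, 256, 512, 1024]

half_width = [scaled_depth(c, 0.5) for c in base_channels]
quarter_width = [scaled_depth(c, 0.25) for c in base_channels]
tiny_first_layer = scaled_depth(32, 0.1)   # int(3.2) = 3, raised to min_depth
```

Every layer is thinned by the same factor, which is the paper's α. The resolution multiplier ρ never appears here because it is set through the input image size instead.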

Tensorflow num_classes parameter of nce_loss()

My understanding of noise contrastive estimation is that we sample some vectors from our word embeddings (the negative sample), and then calculate the log-likelihood of each. Then we want to maximize the difference between the probability of the target word and the log-likelihood of each of the negative sample words (So if I'm correct about this, we want to optimize the loss function so that it gets as close to 1 as possible).
My question is this:
What is the purpose of the num_classes parameter of the nce_loss function? My best guess is that the number of classes is passed in so that TensorFlow knows the size of the distribution from which the negative samples are drawn, but this might not make sense, since we could just infer the size of the distribution from the variable itself. Otherwise, I can't think of a reason why we would need to know the total possible number of classes, especially if the language model is only outputting k + 1 predictions (negative sample size + 1 for the target word).
Your guess is correct. The num_classes argument is used to sample negative labels from the log-uniform (Zipfian) distribution.
Here's the relevant part of the source code:
# Sample the negative labels.
# sampled shape: [num_sampled] tensor
# true_expected_count shape = [batch_size, 1] tensor
# sampled_expected_count shape = [num_sampled] tensor
if sampled_values is None:
    sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
The range_max=num_classes argument basically defines the shape of this distribution and also the range of the sampled values - [0, range_max). Note that this range can't be accurately inferred from the labels, because a particular mini-batch can have only small word ids, which would skew the distribution significantly.
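To see how range_max shapes the distribution, here is a small plain-Python sketch of the log-uniform probability as given in the TensorFlow documentation for the sampler, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). The log_uniform_prob helper and the two range sizes are assumptions picked for the example.

```python
import math

def log_uniform_prob(class_id, range_max):
    """Probability of drawing `class_id` from the log-uniform distribution."""
    if not 0 <= class_id < range_max:
        raise ValueError("class_id must lie in [0, range_max)")
    return ((math.log(class_id + 2) - math.log(class_id + 1))
            / math.log(range_max + 1))

# Full distribution for a tiny vocabulary of 10 classes.
small_vocab = [log_uniform_prob(c, 10) for c in range(10)]

# The same class id is far less likely when the vocabulary is large.
p0_small = log_uniform_prob(0, 10)
p0_large = log_uniform_prob(0, 50000)
```

The probabilities over [0, range_max) sum to 1 and decrease with the class id, so low ids (frequent words, if the vocabulary is sorted by frequency) are sampled more often. Changing range_max changes every probability, which is why it has to be passed in rather than guessed from a mini-batch.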

variable-length rnn padding and mask out padding gradients

I'm building an RNN and using the sequence_length parameter to supply a list of lengths for the sequences in a batch, and all of the sequences in a batch are padded to the same length.
However, when doing backprop, is it possible to mask out the gradients corresponding to the padded steps, so that these steps have 0 contribution to the weight updates? I'm already masking out their corresponding costs like this (where batch_weights is a vector of 0's and 1's, and the elements corresponding to the padding steps are 0's):
loss = tf.mul(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, tf.reshape(self._targets, [-1])), batch_weights)
self._cost = cost = tf.reduce_sum(loss) / tf.to_float(tf.reduce_sum(batch_weights))
The problem is that I'm not sure whether, by doing the above, the gradients from the padding steps are zeroed out or not.
For all framewise / feed-forward (non-recurrent) operations, masking the loss/cost is enough.
For all sequence / recurrent operations (e.g. dynamic_rnn), there is always a sequence_length parameter which you need to set to the corresponding sequence lengths. Then there won't be a gradient for the zero-padded steps; in other words, they will have 0 contribution.
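The framewise part can be checked by hand. Below is a minimal plain-Python sketch that uses the analytic gradient of softmax cross-entropy with respect to the logits (softmax minus one-hot) and applies the same weighting and normalization as the cost in the question. The function names and the toy numbers are made up for the example, and it does not model the recurrent part at all.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_loss_and_grads(logits_per_step, targets, weights):
    """Weighted mean cross-entropy and its gradient w.r.t. each step's logits."""
    total_weight = sum(weights)
    loss = 0.0
    grads = []
    for logits, target, w in zip(logits_per_step, targets, weights):
        probs = softmax(logits)
        loss += w * -math.log(probs[target])
        # d(loss)/d(logits) for this step: w * (softmax - one_hot) / sum(weights)
        grads.append([w * (p - (1.0 if k == target else 0.0)) / total_weight
                      for k, p in enumerate(probs)])
    return loss / total_weight, grads

# Three time steps, two classes; the last step is padding (weight 0).
logits_per_step = [[2.0, 0.5], [0.1, 1.2], [0.3, 0.3]]
targets = [0, 1, 0]
batch_weights = [1.0, 1.0, 0.0]

loss, grads = masked_loss_and_grads(logits_per_step, targets, batch_weights)
```

The gradient for the padded step is exactly zero because its weight multiplies every term, while the real steps keep non-zero gradients. In a recurrent network the padded steps could still influence later states, which is what sequence_length takes care of.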