Normalized Mutual Information in TensorFlow

Is it possible to implement normalized mutual information in TensorFlow? I was wondering whether I can do that and whether I will be able to differentiate it. Let's say that I have predictions P and labels Y in two different tensors. Is there an easy way to use normalized mutual information?
I want to do something similar to this:

Assume your clustering method gives probability predictions/membership functions p(c|x), e.g., p(c=1|x) is the probability of x in the first cluster. Assume y is the ground truth class label for x.
The normalized mutual information is $\mathrm{NMI}(Y, C) = \frac{2\, I(Y; C)}{H(Y) + H(C)}$.
The entropy H(Y) can be estimated following this thread:
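By definition, the entropy H(C) is $H(C) = -\sum_c p(c) \log p(c)$, where $p(c) = \int p(c \mid x)\, p(x)\, dx$.
The mutual information is $I(Y; C) = H(C) - H(C \mid Y)$, where $H(C \mid Y) = -\sum_y p(y) \sum_c p(c \mid y) \log p(c \mid y)$, and $p(c \mid y) = \int p(c \mid x)\, p(x \mid y)\, dx$.
All terms involving integrals can be estimated by sampling, i.e., by averaging over the training samples. The overall NMI is differentiable.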
I did not misunderstand your question. I was assuming you used a neural network model which outputs logits as you did not provide any info. Then you need to normalise the logits to get p(c|x).
There may be other ways to estimate NMI, but if you discretize the output of whatever model you use, you cannot differentiate them.
TensorFlow code
Assume we have label matrix p_y_on_x and cluster predictions p_c_on_x. Each row of them corresponds to an observation x; each column corresponds to the probability of x in each class and cluster (so each row sums up to one). Further assume uniform probability for p(x) and p(x|y).
NMI can then be estimated as below:
import tensorflow as tf
num_x = tf.cast(tf.shape(p_y_on_x)[0], p_y_on_x.dtype) # number of observations (rows)
p_y = tf.reduce_sum(p_y_on_x, axis=0, keepdims=True) / num_x # 1-by-num_y
h_y = -tf.reduce_sum(p_y * tf.math.log(p_y))
p_c = tf.reduce_sum(p_c_on_x, axis=0) / num_x # vector of length num_c
h_c = -tf.reduce_sum(p_c * tf.math.log(p_c))
p_x_on_y = p_y_on_x / num_x / p_y # num_x-by-num_y
p_c_on_y = tf.matmul(p_c_on_x, p_x_on_y, transpose_a=True) # num_c-by-num_y
h_c_on_y = -tf.reduce_sum(tf.reduce_sum(p_c_on_y * tf.math.log(p_c_on_y), axis=0) * p_y)
i_y_c = h_c - h_c_on_y
nmi = 2 * i_y_c / (h_y + h_c)
In practice, please be very careful with the probabilities: they must be strictly positive, otherwise tf.math.log returns -inf for a zero entry and the result becomes NaN.
Please comment if you find any mistakes.
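As a quick sanity check of the estimator above, here is a minimal NumPy sketch of the same computation. NumPy is used only so the check runs without TensorFlow; the function name soft_nmi and the eps clipping value are assumptions for illustration and not part of the original answer.

```python
import numpy as np

def soft_nmi(p_y_on_x, p_c_on_x, eps=1e-12):
    """Estimate NMI from soft memberships; each row is one observation."""
    # Clip so that log never sees an exact zero (eps is an assumed value).
    p_y_on_x = np.clip(p_y_on_x, eps, 1.0)
    p_c_on_x = np.clip(p_c_on_x, eps, 1.0)
    num_x = p_y_on_x.shape[0]

    p_y = p_y_on_x.sum(axis=0, keepdims=True) / num_x   # 1-by-num_y
    h_y = -np.sum(p_y * np.log(p_y))
    p_c = p_c_on_x.sum(axis=0) / num_x                  # length num_c
    h_c = -np.sum(p_c * np.log(p_c))

    p_x_on_y = p_y_on_x / num_x / p_y                   # num_x-by-num_y
    p_c_on_y = np.clip(p_c_on_x.T @ p_x_on_y, eps, 1.0) # num_c-by-num_y
    h_c_on_y = -np.sum(np.sum(p_c_on_y * np.log(p_c_on_y), axis=0) * p_y)
    return 2.0 * (h_c - h_c_on_y) / (h_y + h_c)

# Four observations, two balanced classes, given as one-hot labels.
labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
nmi_perfect = soft_nmi(labels, labels)                      # clusters match labels
nmi_uninformative = soft_nmi(labels, np.full((4, 2), 0.5))  # clusters ignore labels
```

With clusters identical to the labels the estimate should be close to 1, and with memberships that ignore the labels it should be close to 0.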


Does variational autoencoder make distribution based on only latent representation?

If the latent representation of my variational autoencoder (VAE) is r and my dataset is x, does the VAE's latent representation follow a normalization based on r or on x?
If r = 10, does that mean it has 10 means and variances (multivariate Gaussian) and the distribution comes from the whole dataset x?
Or does r = 10 construct one distribution based on r, which every sample tries to follow?
I'm confused about which one is correct.
A VAE constructs a mapping e(x) -> Z (encoder) and d(z) -> X (decoder). This means that every element of your input space x is mapped through the encoder e(x) to a single r-dimensional Gaussian. It is not a "mixture"; it is just a single Gaussian with a diagonal covariance matrix.
I'll add my 2 cents to @lejlot's answer.
Your encoder in a VAE maps each sample to a distribution, which in your case has 10 dimensions. That distribution is used to say "OK, my best estimate of this property of this sample is mu, but I'm not too sure, so consider that it might have variance sigma".
Therefore, you have a distribution for each sample.
However, to make sampling easier in a VAE, we ask the VAE to keep these distributions as close as possible to a known one, the standard normal distribution. That way we know "where the distributions are located"; if you check the latent space of a normal AE, you will see groups that lie far from each other.
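To make "one distribution per sample" concrete, here is a small NumPy sketch under stated assumptions: the arrays mu and log_var stand in for the outputs of a hypothetical encoder (random numbers here, not a trained network), r = 10 as in the question, and the helper name kl_to_standard_normal is made up for illustration. It draws one latent vector per sample and computes the usual KL term that pulls each per-sample Gaussian towards the standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, r = 4, 10

# Stand-ins for encoder outputs: one mean and one log-variance per latent
# dimension, for every sample in the batch (random values, not a real model).
mu = rng.normal(size=(batch_size, r))
log_var = rng.normal(scale=0.1, size=(batch_size, r))

# Reparameterisation: z = mu + sigma * eps with eps ~ N(0, I).
eps = rng.standard_normal(size=(batch_size, r))
z = mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the latent dimensions.
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=1)

kl = kl_to_standard_normal(mu, log_var)  # one KL value per sample
```

Each row of mu and log_var describes a different diagonal Gaussian, so there are batch_size distributions, each of dimension r; the KL term is zero only when a sample's distribution is exactly the standard normal.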

After quantisation in neural network, will the output need to be scaled with the inverse of the weight scaling

I'm currently writing a script to quantise a Keras model down to 8 bits. I'm doing a fairly basic linear scaling on the weights, by assuming a normal distribution of weights and biases, and then interpolating all the values within 2 standard deviations of the mean, to the range [-128, 127].
This all works, and I run the model through inference, but my image out is crazy bad. I know there will be a small performance hit, but I'm seeing roughly 10x performance degradation.
My question is, after this scaling of the weights, do I need to do the inverse scaling operation to my output? None of the papers I've been reading seem to mention this, but I'm unsure why else my results would be so bad.
The network is for image demosaicing. It takes in a RAW image, and is meant to output an image with very low noise, and no demosaicing artefacts. My full precision model is very good, with image PSNRs of around 40-43dB, but after quantisation, I'm getting 4-8dB, and incredibly bad looking images.
Code for anyone who's bothered to read it
for i in layer_index:
    count = count+1
    layer = model.get_layer(index = i)
    weights = layer.get_weights()
    weights_act = weights[0]
    bias_act = weights[1]
    std = np.std(weights_act)
    if (std > max_std):
        max_std = std
    mean = np.mean(weights_act)
    mean_of_mean = mean_of_mean + mean
mean_of_mean = mean_of_mean / count
max_bound = mean_of_mean + 2*max_std
min_bound = mean_of_mean - 2*max_std
print(max_bound, min_bound)
for i in layer_index:
    layer = model.get_layer(index = i)
    weights = layer.get_weights()
    weights_act = weights[0]
    bias_act = weights[1]
    weights_shape = weights_act.shape
    bias_shape = bias_act.shape
    new_weights = np.empty(weights_shape, dtype = np.int8)
    new_biass = np.empty(bias_shape, dtype = np.int8)
    for a in range(weights_shape[0]):
        for b in range(weights_shape[1]):
            for c in range(weights_shape[2]):
                for d in range(weights_shape[3]):
                    new_weight = (((weights_act[a,b,c,d] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
                    new_weights[a,b,c,d] = np.int8(new_weight)
                    #print(new_weights[a,b,c,d], weights_act[a,b,c,d])
    for e in range(bias_shape[0]):
        new_bias = (((bias_act[e] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
        new_biass[e] = np.int8(new_bias)
    new_weight_layer = (new_weights, new_biass)
You are not doing what you think you are doing; I'll explain.
If you wish to take a pre-trained model and quantize it, you have to add scales after each operation that involves weights. Let's take the convolution operation as an example.
As we know, the convolution operation is linear. In my explanation I will ignore the bias for the sake of simplicity (adding it is relatively easy). Let's assume X is our input, Y is our output and W is the weights; the convolution can be written as: Y = W*X
where '*' represents the convolution operation. What you are basically doing is taking the weights, multiplying them by some scalar (let's call it 'a') and shifting them by some other scalar (let's call it 'b'), so in your model you use W', where: W' = Wa+b
So if we return to the convolution operation, in your quantized network you basically perform the following operation: Y' = W'*X = (Wa+b)*X
Because convolution is linear we get: Y' = a(W*X) + b*X, where b*X denotes the convolution of X with a kernel whose entries all equal b.
Don't forget that in your network you want to receive Y, not Y', at the output of the convolution; therefore you must shift and rescale to get the correct answer.
I hope that explanation was clear enough to show the problem in your network: you apply this scale and shift to all of the weights and you never compensate for it. I think your confusion arises because you read papers that trained models in quantized mode from the beginning, rather than taking a pre-trained model and quantizing it.
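The scale-and-shift argument can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a 1-D convolution from NumPy instead of a Keras layer, the values of a and b are arbitrary, and it leaves out the rounding to int8, so the recovery is exact here, whereas a real quantized network would also carry rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=32)   # input signal X
w = rng.normal(size=5)    # float kernel W
a, b = 40.0, -3.0         # hypothetical scale and shift applied to the weights

y = np.convolve(x, w, mode="valid")              # Y  = W * X
w_prime = a * w + b                              # W' = Wa + b
y_prime = np.convolve(x, w_prime, mode="valid")  # Y' = W' * X

# b * X: convolution of X with a kernel whose taps all equal b.
shift_term = np.convolve(x, np.full_like(w, b), mode="valid")

# Undo the shift, then undo the scale, to get back the original output.
y_recovered = (y_prime - shift_term) / a
```

Without the last step the layer output is Y' rather than Y, and the error compounds through every following layer, which is consistent with the large PSNR drop described in the question.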
For your problem I think the TensorFlow Graph Transform Tool might help; take a look at:
If you wish to read more about quantizing pre trained model you can find more information in (for more academic info just go to

How to compute dot product with negative samples

CNTK currently provides a function to compute cosine distance with negative samples. I am wondering how one could do a simple dot product with negative sampling in CNTK.
Assuming query is the thing you want to match up against candidates and normalize across all candidates in the batch, you can use something like this:
import cntk as C

def all_pairs_loss(query, candidates):
    # Unpack the batch axis so that all pairwise products can be formed
    qry_matrix = C.unpack_batch(query)
    cnd_matrix = C.unpack_batch(candidates)
    all_inner_products = C.to_batch(C.times_transpose(cnd_matrix, qry_matrix))
    positive_inner_products = C.reduce_sum(query * candidates, axis=0)
    loss = C.reduce_log_sum_exp(all_inner_products) - positive_inner_products
    return loss
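For readers without CNTK installed, the following NumPy sketch shows the arithmetic that, on my reading, the function above performs on a batch: a log-sum-exp over all inner products in the batch minus the inner product of each matching pair. The name all_pairs_loss_np and the row layout (one query or candidate per row) are assumptions for illustration; this is not a drop-in replacement for the CNTK graph.

```python
import numpy as np

def all_pairs_loss_np(query, candidates):
    # scores[j, i] = <candidates[j], query[i]>, the same layout as
    # times_transpose(cnd_matrix, qry_matrix) in the snippet above.
    scores = candidates @ query.T
    # Inner product of each matching (query, candidate) pair.
    positive = np.sum(query * candidates, axis=1)
    # Numerically stable log-sum-exp over each row of scores.
    row_max = scores.max(axis=1, keepdims=True)
    log_sum_exp = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    return log_sum_exp - positive

loss_uniform = all_pairs_loss_np(np.zeros((3, 4)), np.zeros((3, 4)))
loss_separated = all_pairs_loss_np(10.0 * np.eye(3), 10.0 * np.eye(3))
```

When every inner product is equal, the loss per row is log(batch size); when the matching pair dominates all other pairs, the loss approaches zero.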

Tensorflow: What exact formula is applied in `tf.nn.sparse_softmax_cross_entropy_with_logits`?

I tried to manually recompute the outputs of this function so I created a minimal example:
logits = tf.pack(np.array([[[[0,1,2]]]],dtype=np.float32)) # img of shape (1, 1, 1, 3)
labels = tf.pack(np.array([[[1]]],dtype=np.int32)) # gt of shape (1, 1, 1)
softmaxCrossEntropie = tf.nn.sparse_softmax_cross_entropy_with_logits(logits,labels)
softmaxCrossEntropie.eval() # --> output is [1.41]
Now according to my own calculation I only get [1.23]
When manually calculating, I'm simply applying softmax
$\sigma(x_j) = \frac{e^{x_j}}{\sum_k e^{x_k}}$
and cross-entropy:
$H(p, q) = -\sum_x p(x) \log q(x)$
where q(x) = sigma(x_j) or (1-sigma(x_j)) depending on whether j is the correct ground truth class or not, and p(x) = labels, which are then one-hot encoded.
I'm not sure where the difference might originate from. I cannot really imagine that some epsilon causes such a big difference. Does someone know where I can look up which exact formula is used by TensorFlow?
Is the source code of that exact part available?
I could only find, but it only uses another function called gen_nn_ops._sparse_softmax_cross_entropy_with_logits which I couldn't find on github...
Well, usually p(x) in the cross-entropy equation is the true distribution, while q(x) is the distribution obtained from softmax. So, if p(x) is one-hot (and this is so, otherwise sparse cross-entropy could not be applied), the cross-entropy is just the negative log of the probability of the true category.
In your example, softmax(logits) is a vector with values [0.09003057, 0.24472847, 0.66524096], so the loss is -log(0.24472847) = 1.4076059 which is exactly what you got as output.
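The numbers above can be reproduced with a few lines of NumPy. This is a sketch of the formula only (softmax followed by the negative log of the true-class probability), not TensorFlow's actual kernel; the variable names are mine.

```python
import numpy as np

logits = np.array([0.0, 1.0, 2.0])
label = 1  # index of the true class

# Softmax, shifted by the maximum logit for numerical stability.
shifted = logits - logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

# Sparse cross-entropy: negative log-probability of the true class.
loss = -np.log(probs[label])

# Equivalent closed form: log-sum-exp of the logits minus the true logit.
loss_alt = np.log(np.exp(logits).sum()) - logits[label]
```

Note that the wrong classes contribute no (1 - sigma(x_j)) terms: with a one-hot p(x), only the true class survives in the sum.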

R-Squared of alternative model

In order to reduce the influence of outliers and obtain a more robust regression, I've applied a winsorization technique to modify the values of a series ('x'). I then regress these values against series 'y'.
The R-squared of this model is naturally much higher, but I'm not making the right comparison.
How do I use scipy or statsmodels to obtain the R-squared of the original data using the beta estimates from the winsorized model?
You need to calculate it yourself, essentially by replicating the formula for rsquared.
For example
>>> import numpy as np
>>> from statsmodels.api import OLS
>>> res_tmp = OLS(np.random.randn(100), np.column_stack((np.ones(100),np.random.randn(100, 2)))).fit()
>>> y_orig = res_tmp.model.endog
>>> res_tmp.rsquared
>>> (1 - ((y_orig - res_tmp.fittedvalues)**2).sum() / ((y_orig - y_orig.mean())**2).sum())
The last expression would apply to your case if res_tmp.fittedvalues are the predicted or fitted values of your winsorized model, and y_orig is your original unchanged response variable. This definition of R squared applies if there is a constant in the model.
Note: The most frequent naming for the linear model corresponds to y = X b, where y is the response variable and X are the explanatory variables. IIUC, you reversed the labeling in your question.
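A minimal sketch of the whole procedure, under explicit assumptions: the data are synthetic, winsorization is approximated by clipping the explanatory variable at its 5th and 95th percentiles, the fit uses np.polyfit instead of statsmodels, and the names r_squared, x_wins, r2_winsorized and r2_original are made up for illustration. It follows the conventional labeling (x explanatory, y response).

```python
import numpy as np

def r_squared(y, y_pred):
    # 1 - SS_res / SS_tot; assumes the model contains a constant.
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x_clean = rng.normal(size=200)
y = 1.0 + 2.0 * x_clean + rng.normal(scale=0.5, size=200)
x = x_clean.copy()
x[:5] += 15.0  # contaminate a few observations with outliers

# Winsorize by clipping at the 5th and 95th percentiles (an assumed choice).
lo, hi = np.percentile(x, [5, 95])
x_wins = np.clip(x, lo, hi)

# Beta estimates from the winsorized model.
slope, intercept = np.polyfit(x_wins, y, 1)

r2_winsorized = r_squared(y, intercept + slope * x_wins)  # in-sample R squared
r2_original = r_squared(y, intercept + slope * x)         # same betas, original x
```

Because the betas were not fitted to the original data, r2_original is not bounded below by zero; a negative value means the winsorized coefficients predict the original data worse than the mean of y does.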