Distance between words in tensorflow embedding - tensorflow

I'd like to use one of the models on TensorFlow Hub to look at the distances between words (specifically this one https://tfhub.dev/google/nnlm-en-dim128/1). But I can't find a good example of how to find the distance between two words or two groups of words... is this something that is possible with an embedding like this?
I'm 100% not a Data Scientist and so this might be a complete lack of understanding so apologies if it's a dumb question.
Ideally I'd like to look at the distance of a single word compared to two different sets of words.

I think the most common measure of distance between two embedded vectors is the cosine similarity.
We can calculate the cosine similarity using the formula:
which we can translate into tensorflow code as follows:
def cosine_similarity(a, b):
mag_a = tf.sqrt(tf.reduce_sum(tf.multiply(a, a)))
mag_b = tf.sqrt(tf.reduce_sum(tf.multiply(b, b)))
return tf.reduce_sum(tf.multiply(a, b)) / (mag_a * mag_b)
so we have a complete example as follows:
import tensorflow as tf
import tensorflow_hub as hub
embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")
embeddings = embed(["cat is on the mat", "tiger sat on the mat"])
def cosine_similarity(a, b):
mag_a = tf.sqrt(tf.reduce_sum(tf.multiply(a, a)))
mag_b = tf.sqrt(tf.reduce_sum(tf.multiply(b, b)))
return tf.reduce_sum(tf.multiply(a, b)) / (mag_a * mag_b)
a = embeddings[0]
b = embeddings[1]
cos_similarity = cosine_similarity(a, b)
with tf.Session() as sess:
sess.run(tf.initialize_all_tables())
sess.run(tf.global_variables_initializer())
print (sess.run(cos_similarity))
which outputs 0.78157.
Note that some folks advocate using a rearrangement to the formula which gives the same results (+/- minuscule 'rounding errors') and may or may not be slightly better optimised.
This alternative formula is calculated as:
def cosine_similarity(a, b):
norm_a = tf.nn.l2_normalize(a,0)
norm_b = tf.nn.l2_normalize(b,0)
return tf.reduce_sum(tf.multiply(norm_a,norm_b))
Personally, I can't see how the difference could be anything other than negligible and I happen know the first formulation so I tend to stick with it but I certainly make no claim that its best and don't claim to know which is fastest! :-)

Related

Automatic Differentiation with respect to rank-based computations

I'm new to automatic differentiation programming, so this maybe a naive question. Below is a simplified version of what I'm trying to solve.
I have two input arrays - a vector A of size N and a matrix B of shape (N, M), as well a parameter vector theta of size M. I define a new array C(theta) = B * theta to get a new vector of size N. I then obtain the indices of elements that fall in the upper and lower quartile of C, and use them to create a new array A_low(theta) = A[lower quartile indices of C] and A_high(theta) = A[upper quartile indices of C]. Clearly these two do depend on theta, but is it possible to differentiate A_low and A_high w.r.t theta?
My attempts so far seem to suggest no - I have using the python libraries of autograd, JAX and tensorflow, but they all return a gradient of zero. (The approaches I have tried so far involve using argsort or extracting the relevant sub-arrays using tf.top_k.)
What I'm seeking help with is either a proof that the derivative is not defined (or cannot be analytically computed) or if it does exist, a suggestion on how to estimate it. My eventual goal is to minimize some function f(A_low, A_high) wrt theta.
This is the JAX computation that I wrote based on your description:
import numpy as np
import jax.numpy as jnp
import jax
N = 10
M = 20
rng = np.random.default_rng(0)
A = jnp.array(rng.random((N,)))
B = jnp.array(rng.random((N, M)))
theta = jnp.array(rng.random(M))
def f(A, B, theta, k=3):
C = B # theta
_, i_upper = lax.top_k(C, k)
_, i_lower = lax.top_k(-C, k)
return A[i_lower], A[i_upper]
x, y = f(A, B, theta)
dx_dtheta, dy_dtheta = jax.jacobian(f, argnums=2)(A, B, theta)
The derivatives are all zero, and I believe this is correct, because the change in value of the outputs does not depend on the change in value of theta.
But, you might ask, how can this be? After all, theta enters into the computation, and if you put in a different value for theta, you get different outputs. How could the gradient be zero?
What you must keep in mind, though, is that differentiation doesn't measure whether an input affects an output. It measures the change in output given an infinitesimal change in input.
Let's use a slightly simpler function as an example:
import jax
import jax.numpy as jnp
A = jnp.array([1.0, 2.0, 3.0])
theta = jnp.array([5.0, 1.0, 3.0])
def f(A, theta):
return A[jnp.argmax(theta)]
x = f(A, theta)
dx_dtheta = jax.grad(f, argnums=1)(A, theta)
Here the result of differentiating f with respect to theta is all zero, for the same reasons as above. Why? If you make an infinitesimal change to theta, it will in general not affect the sort order of theta. Thus, the entries you choose from A do not change given an infinitesimal change in theta, and thus the derivative with respect to theta is zero.
Now, you might argue that there are circumstances where this is not the case: for example, if two values in theta are very close together, then certainly perturbing one even infinitesimally could change their respective rank. This is true, but the gradient resulting from this procedure is undefined (the change in output is not smooth with respect to the change in input). The good news is this discontinuity is one-sided: if you perturb in the other direction, there is no change in rank and the gradient is well-defined. In order to avoid undefined gradients, most autodiff systems will implicitly use this safer definition of a derivative for rank-based computations.
The result is that the value of the output does not change when you infinitesimally perturb the input, which is another way of saying the gradient is zero. And this is not a failure of autodiff – it is the correct gradient given the definition of differentiation that autodiff is built on. Moreover, were you to try changing to a different definition of the derivative at these discontinuities, the best you could hope for would be undefined outputs, so the definition that results in zeros is arguably more useful and correct.

Confused by random.randn()

I am a bit confused by the numpy function random.randn() which returns random values from the standard normal distribution in an array in the size of your choosing.
My question is that I have no idea when this would ever be useful in applied practices.
For reference about me I am a complete programming noob but studied math (mostly stats related courses) as an undergraduate.
The Python function randn is incredibly useful for adding in a random noise element into a dataset that you create for initial testing of a machine learning model. Say for example that you want to create a million point dataset that is roughly linear for testing a regression algorithm. You create a million data points using
x_data = np.linspace(0.0,10.0,1000000)
You generate a million random noise values using randn
noise = np.random.randn(len(x_data))
To create your linear data set you follow the formula
y = mx + b + noise_levels with the following code (setting b = 5, m = 0.5 in this example)
y_data = (0.5 * x_data ) + 5 + noise
Finally the dataset is created with
my_data = pd.concat([pd.DataFrame(data=x_data,columns=['X Data']),pd.DataFrame(data=y_data,columns=['Y'])],axis=1)
This could be used in 3D programming to generate non-overlapping random values. This would be useful for optimization of graphical effects.
Another possible use for statistical applications would be applying a formula in order to test against spacial factors affecting a given constant. Such as if you were measuring a span of time with some formula doing something but then needing to know what the effectiveness would be given various spans of time. This would return a statistic measuring for example that your formula is more effective in the shorter intervals or longer intervals, etc.
np.random.randn(d0, d1, ..., dn) Return a sample (or samples) from the “standard normal” distribution(mu=0, stdev=1).
For random samples from , use:
sigma * np.random.randn(...) + mu
This is because if Z is a standard normal deviate, then will have a normal distribution with expected value and standard deviation .
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html
https://en.wikipedia.org/wiki/Normal_distribution

How to assign values to a subset of a tensor in tensorflow?

Two parts to this question:
(1) What is the best way to update a subset of a tensor in tensorflow? I've seen several related questions:
Adjust Single Value within Tensor -- TensorFlow
and
How to update a subset of 2D tensor in Tensorflow?
and I'm aware that Variable objects can be assigned using Variable.assign() (and/or scatter_update, etc.), but it seems very strange to me that tensorflow does not have a more intuitive way to update a part of a Tensor object. I have searched through the tensorflow api docs and stackoverflow for quite some time now and can't seem to find a simpler solution than what is presented in the links above. This seems particularly odd, especially given that Theano has an equivalent version with Tensor.set_subtensor(). Am I missing something or is there no simple way to do this through the tensorflow api at this point?
(2) If there is a simpler way, is it differentiable?
Thanks!
I suppose the immutability of Tensors is required for the construction of a computation graph; you can't have a Tensor update some of its values without becoming another Tensor or there will be nothing to put in the graph before it. The same issue comes up in Autograd.
It's possible to do this (but ugly) using boolean masks (make them variables and use assign, or even define them prior in numpy). That would be differentiable, but in practice I'd avoid having to update subtensors.
If you really have to, and I really hope there is a better way to do this, but here is a way to do it in 1D using tf.dynamic_stitch and tf.setdiff1d:
def set_subtensor1d(a, b, slice_a, slice_b):
# a[slice_a] = b[slice_b]
a_range = tf.range(a.shape[0])
_, a_from = tf.setdiff1d(a_range, a_range[slice_a])
a_to = a_from
b_from, b_to = tf.range(b.shape[0])[slice_b], a_range[slice_a]
return tf.dynamic_stitch([a_to, b_to],
[tf.gather(a, a_from),tf.gather(b, b_from)])
For higher dimensions this could be generalised by abusing reshape (where nd_slice could be implemented like this but there is probably a better way):
def set_subtensornd(a, b, slice_tuple_a, slice_tuple_b):
# a[*slice_tuple_a] = b[*slice_tuple_b]
a_range = tf.range(tf.reduce_prod(tf.shape(a)))
a_idxed = tf.reshape(a_range, tf.shape(a))
a_dropped = tf.reshape(nd_slice(a_idxed, slice_tuple_a), [-1])
_, a_from = tf.setdiff1d(a_range, a_dropped)
a_to = a_from
b_range = tf.range(tf.reduce_prod(tf.shape(b)))
b_idxed = tf.reshape(b_range, tf.shape(b))
b_from = tf.reshape(nd_slice(b_idxed, slice_tuple_b), [-1])
b_to = a_dropped
a_flat, b_flat = tf.reshape(a, [-1]), tf.reshape(b, [-1])
stitched = tf.dynamic_stitch([a_to, b_to],
[tf.gather(a_flat, a_from),tf.gather(b_flat, b_from)])
return tf.reshape(stitched, tf.shape(a))
I have no idea how slow this will be. I'd guess quite slow. And, I haven't tested it much beyond running it on a couple of tensors.

Perplexing behavior numpy.linalg.eig (A possibly serious issue)

Consider the following simple piece of code:
import numpy as np
A = np.array([[0,1,1,0],[1,0,0,1],[1,0,0,1],[0,1,1,0]], dtype=float)
eye4 = np.eye(4, dtype=float) # 4x4 identity
H1 = np.kron(A,eye4)
w1,v1 = np.linalg.eig(H1)
H1copy = np.dot(np.dot(v1,np.diag(w1)),np.transpose(v1)) # reconstructing from eigvals and eigvecs
H2 = np.kron(eye4,A)
w2,v2 = np.linalg.eig(H2)
H2copy = np.dot(np.dot(v2,np.diag(w2)),np.transpose(v2))
print np.sum((H1-H1copy)**2) # sum of squares of elements
print np.sum((H2-H2copy)**2)
It produces the output
1.06656622138
8.7514256673e-30
This is very perplexing. These two matrices differ only in the order of the kronecker product. And yet, the accuracy is so low in just one of them. Further, an norm-square error > 1.066 is highly unacceptable according to me. What is going wrong here?
Further, what is the best way to work around this issue, given that the eigenvalue decomposition is a small part of a code that has to be run several (>100) times.
Your matrices are symmetric. Use eigh instead of eig.
If you use eig, the transpose of v1 is not necessarily equal to the inverse of v1.

Multinomial Naive Bayes with scikit-learn for continuous and categorical data

I'm new to scikit-learn, I'm trying to create a Multinomial Bayes model to predict movies box office. Below is just a toy example, I'm not sure if it is logically correct (suggestions are welcome!). The Y's corresponds to the estimate gross I'm trying to predict (1: < $20mi, 2: > $20mi). I also discretized the number of screens the movie was shown.
The question is, is this a good approach to the problem? Or would it be better to assign numbers to all categories? Also, is it correct to embed the labels (e.g. "movie: Life of Pie") in the DictVectorizer object?
def get_data():
measurements = [ \
{'movie': 'Life of Pi', 'screens': "some", 'distributor': "fox"},\
{'movie': 'The Croods', 'screens': "some", 'distributor': "fox"},\
{'movie': 'San Fransisco', 'screens': "few", 'distributor': "TriStar"},\
]
vec = DictVectorizer()
arr = vec.fit_transform(measurements).toarray()
return arr
def predict(X):
Y = np.array([1, 1, 2])
clf = MultinomialNB()
clf.fit(X, Y)
print(clf.predict(X[2]))
if __name__ == "__main__":
vector = get_data()
predict(vector)
In principle this is correct, I think.
Maybe it would be more natural to formulate the problem as a regression on the box-office sales.
The movie feature is useless. The DictVectorizer encodes each possible value as a different feature. As each movie will have a different title, they will all have completely independent features, and no generalization is possible there.
It might also be better to encode screens as a number, not as a one-hot-encoding of different ranges.
Needless to say, you need much better features that what you have here to get any reasonable prediction.