I want to be able to calculate the cosine distance between row vectors using MXNet. Additionally I am working with batches of samples, and would like to calculate the cosine distance for each pair of samples (i.e. cosine distance of 1st row vector of batch #1 with 1st row vector of batch #2).
Cosine distance between two vectors u and v is defined as in scipy.spatial.distance.cosine:
1 - (u . v) / (||u|| * ||v||)
You can use mx.nd.batch_dot to perform this batch-wise cosine distance:
import mxnet as mx
def batch_cosine_dist(a, b):
    a1 = mx.nd.expand_dims(a, axis=1)
    b1 = mx.nd.expand_dims(b, axis=2)
    d = mx.nd.batch_dot(a1, b1)[:,0,0]
    a_norm = mx.nd.sqrt(mx.nd.sum((a*a), axis=1))
    b_norm = mx.nd.sqrt(mx.nd.sum((b*b), axis=1))
    dist = 1.0 - d / (a_norm * b_norm)
    return dist
It returns an array of batch_size distances, one per pair of rows.
batch_size = 3
dim = 2
a = mx.random.uniform(shape=(batch_size, dim))
b = mx.random.uniform(shape=(batch_size, dim))
dist = batch_cosine_dist(a, b)
print(dist.asnumpy())
# [ 0.04385382 0.25792354 0.10448891]
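As a sanity check, the same per-row cosine distance can be written in plain NumPy. This is only a minimal sketch and not part of the answer above: the helper name batch_cosine_dist_np is made up for illustration, and it assumes that no row is all zeros (otherwise the division is undefined).

```python
import numpy as np

def batch_cosine_dist_np(a, b):
    """Row-wise cosine distance between two (batch_size, dim) arrays.

    Hypothetical NumPy counterpart of batch_cosine_dist; assumes no zero rows.
    """
    # Dot product of matching rows: shape (batch_size,)
    dots = np.sum(a * b, axis=1)
    a_norm = np.linalg.norm(a, axis=1)
    b_norm = np.linalg.norm(b, axis=1)
    return 1.0 - dots / (a_norm * b_norm)

# Orthogonal, parallel and anti-parallel row pairs
a = np.array([[1.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
b = np.array([[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
dist = batch_cosine_dist_np(a, b)
```

Orthogonal rows give a distance of 1, parallel rows a distance of (approximately) 0, and anti-parallel rows a distance of 2, which is a quick way to confirm that the MXNet version behaves the same on known inputs.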
I want to create a Pearson correlation coefficient metric using TensorFlow tensors. There is a TensorFlow Probability package, https://www.tensorflow.org/probability/api_docs/python/tfp/stats/correlation, but it has dependency issues with the current version of TensorFlow, and I am afraid it will break CUDA. Any standalone implementation of a Pearson correlation coefficient metric in TensorFlow would help...
So I want something like this:
def p_corr(y_true, y_pred):
    # calculate the pearson correlation coefficient here
    return pearson_correlation_coefficient
Here y_true and y_pred will be lists of numbers of the same length.
This works fine:
from keras import backend as K
def pearson_r(y_true, y_pred):
    # use smoothing for not resulting in NaN values
    # pearson correlation coefficient
    # https://github.com/WenYanger/Keras_Metrics
    epsilon = 10e-5
    x = y_true
    y = y_pred
    mx = K.mean(x)
    my = K.mean(y)
    xm, ym = x - mx, y - my
    r_num = K.sum(xm * ym)
    x_square_sum = K.sum(xm * xm)
    y_square_sum = K.sum(ym * ym)
    r_den = K.sqrt(x_square_sum * y_square_sum)
    r = r_num / (r_den + epsilon)
    return K.mean(r)
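To check a metric like this on known inputs, the same arithmetic can be reproduced in NumPy and compared with np.corrcoef. This is a sketch under assumptions: the name pearson_r_np and the sample values are made up for illustration, and the epsilon mirrors the 10e-5 (that is, 1e-4) smoothing term used above.

```python
import numpy as np

def pearson_r_np(y_true, y_pred, epsilon=1e-4):
    """NumPy mirror of the Keras metric above (hypothetical helper)."""
    x = np.asarray(y_true, dtype=float)
    y = np.asarray(y_pred, dtype=float)
    xm = x - x.mean()
    ym = y - y.mean()
    r_num = np.sum(xm * ym)
    r_den = np.sqrt(np.sum(xm * xm) * np.sum(ym * ym))
    # epsilon avoids division by zero when one input is constant
    return r_num / (r_den + epsilon)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.3])
r = pearson_r_np(y_true, y_pred)
r_ref = np.corrcoef(y_true, y_pred)[0, 1]  # reference value without smoothing
```

Because epsilon is added to the denominator, the result is pulled very slightly towards zero compared with the exact coefficient; the effect is negligible unless the variances are tiny.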
I have a 3D numpy array with the probabilities of each category in the last dimension. Something like:
import numpy as np
from scipy.special import softmax
array = np.random.normal(size=(10, 100, 5))
probabilities = softmax(array, axis=2)
How can I sample from a categorical distribution with those probabilities?
EDIT:
Right now I'm doing it like this:
def categorical(x):
    return np.random.multinomial(1, pvals=x)
samples = np.apply_along_axis(categorical, axis=2, arr=probabilities)
But it's very slow so I want to know if there's a way to vectorize this operation.
Drawing samples from a given probability distribution is done by evaluating the inverse cumulative distribution function at a random number drawn uniformly from the range 0 to 1. For a small number of discrete categories, as in the question, you can find the inverse using a linear search:
## Alternative test dataset
probabilities[:, :, :] = np.array([0.1, 0.5, 0.15, 0.15, 0.1])
n1, n2, m = probabilities.shape
cum_prob = np.cumsum(probabilities, axis=-1) # shape (n1, n2, m)
r = np.random.uniform(size=(n1, n2, 1))
# argmax finds the index of the first True value in the last axis.
samples = np.argmax(cum_prob > r, axis=-1)
print('Statistics:')
print(np.histogram(samples, bins=np.arange(m+1)-0.5)[0]/(n1*n2))
For the test dataset, a typical test output was:
Statistics:
[0.0998 0.4967 0.1513 0.1498 0.1024]
which looks OK.
If you have many, many categories (thousands), it's probably better to do a bisection search using a numba compiled function.
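The linear-search approach can be wrapped in a small function and, if needed, converted to the one-hot layout that np.random.multinomial(1, ...) produces in the question. The sketch below is an illustration under assumptions: the function name sample_categorical is made up, and forcing the last cumulative value to 1 is an added safeguard against rounding in the cumulative sum, not something the answer above does.

```python
import numpy as np

def sample_categorical(probabilities, rng=None):
    """Draw one category index per distribution along the last axis."""
    rng = np.random.default_rng() if rng is None else rng
    cum_prob = np.cumsum(probabilities, axis=-1)
    # Rounding can leave the final cumulative value slightly below 1;
    # set it to exactly 1 so every row has at least one True entry below.
    cum_prob[..., -1] = 1.0
    r = rng.uniform(size=probabilities.shape[:-1] + (1,))
    # argmax returns the index of the first True value along the last axis
    return np.argmax(cum_prob > r, axis=-1)

# Degenerate test case: all probability mass on category 3
probabilities = np.zeros((10, 100, 5))
probabilities[..., 3] = 1.0

rng = np.random.default_rng(0)
samples = sample_categorical(probabilities, rng)          # shape (10, 100)
one_hot = np.eye(probabilities.shape[-1], dtype=int)[samples]  # shape (10, 100, 5)
```

Indexing an identity matrix with the sampled indices gives one row with a single 1 per draw, which matches the shape of the output from the apply_along_axis version.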
I have 2 lists of points as numpy.ndarray, each row is the coordinate of a point, like:
a = np.array([[1,0,0],[0,1,0],[0,0,1]])
b = np.array([[1,1,0],[0,1,1],[1,0,1]])
I want to calculate the Euclidean distance between all pairs of points in the two lists: for each point p_a in a, I want the distance between it and every point p_b in b. So the result is
d = np.array([[1,np.sqrt(3),1],[1,1,np.sqrt(3)],[np.sqrt(3),1,1]])
How to use matrix multiplication in numpy to compute the distance matrix?
Using direct numpy broadcasting, you can do this:
dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
Alternatively, scipy has a routine that will compute this slightly more efficiently (particularly for large matrices)
from scipy.spatial.distance import cdist
dist = cdist(a, b)
I would avoid solutions that depend on factoring-out matrix products (of the form A^2 + B^2 - 2AB), because they can be numerically unstable due to floating point roundoff errors.
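The roundoff concern can be seen with two points that are far from the origin but close to each other. The following is a small illustrative sketch (the specific values are chosen for this example and are not from the answer): subtracting first keeps the small difference intact, while squaring first makes the large terms cancel and lose it.

```python
import numpy as np

# Two points far from the origin whose true distance is exactly 1
a = np.array([[1e8, 0.0]])
b = np.array([[1e8 + 1.0, 0.0]])

# Direct form: subtract first, then square
direct = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

# Expanded form: square first, then subtract (large terms cancel)
sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1) - 2 * a.dot(b.T)
expanded = np.sqrt(np.maximum(sq, 0.0))  # clip in case roundoff goes negative

print(direct[0, 0], expanded[0, 0])
```

In double precision the direct form recovers the distance of 1, while the expanded form does not, because the squared coordinates are of the order of 1e16 and the unit difference falls below their precision.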
To compute the squared Euclidean distance between every pair of rows of X and Y, we need, for each coordinate k:
(Xik-Yjk)**2 = Xik**2 + Yjk**2 - 2*Xik*Yjk
and then sum over k to get the squared distance between rows Xi and Yj.
Because summation is linear, this becomes:
sq_dist(Xi,Yj) = sum_k(Xik**2) + sum_k(Yjk**2) - 2*sum_k(Xik*Yjk)
The last term is a matrix multiplication, so all the squared distances can be computed at once:
sq_dist = sum_rows(X**2)[:, None] + sum_rows(Y**2)[None, :] - 2*matrix_multiplication(X, Y.T)
Hence, putting into NumPy terms, we would end up with the euclidean distances for our case with a and b as the inputs, like so -
np.sqrt((a**2).sum(1)[:,None] + (b**2).sum(1) - 2*a.dot(b.T))
Leveraging np.einsum, we could replace the first two summation-reductions with -
np.einsum('ij,ij->i',a,a)[:,None] + np.einsum('ij,ij->i',b,b)
More info could be found on eucl_dist package's wiki page (disclaimer: I am its author).
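Putting the pieces together, a complete function might look like the sketch below. The name pairwise_euclidean is made up for illustration, and the clipping step is an added assumption: roundoff can make the expanded expression slightly negative for (nearly) identical points, which would otherwise produce NaN under the square root.

```python
import numpy as np

def pairwise_euclidean(a, b):
    """Euclidean distances between all rows of a (n, k) and b (m, k)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a2 = np.einsum('ij,ij->i', a, a)[:, None]   # squared row norms, shape (n, 1)
    b2 = np.einsum('ij,ij->i', b, b)[None, :]   # squared row norms, shape (1, m)
    sq = a2 + b2 - 2.0 * a.dot(b.T)             # squared distances, shape (n, m)
    np.maximum(sq, 0.0, out=sq)                 # clip tiny negatives from roundoff
    return np.sqrt(sq)

a = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
b = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
d = pairwise_euclidean(a, b)
```

For the example arrays from the question this reproduces the expected matrix with entries 1 and sqrt(3), and it also works when a and b have different numbers of rows.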
If you have two 1-dimensional arrays, x and y, you can convert them into matrices with repeating columns, transpose, and apply the distance formula. This assumes that x and y are coordinate pairs, i.e. (x[i], y[i]) is the i-th point. The result is a symmetric distance matrix.
x = [1, 2, 3]
y = [4, 5, 6]
xx = np.repeat(x,3,axis = 0).reshape(3,3)
yy = np.repeat(y,3,axis = 0).reshape(3,3)
dist = np.sqrt((xx-xx.T)**2 + (yy-yy.T)**2)
dist
Out[135]:
array([[0. , 1.41421356, 2.82842712],
[1.41421356, 0. , 1.41421356],
[2.82842712, 1.41421356, 0. ]])
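The same matrix can be obtained without building the repeated arrays explicitly, by letting NumPy broadcast a column view against a row view. This is a minimal sketch of that alternative, using the same x and y as above:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# dx[i, j] = x[i] - x[j] and dy[i, j] = y[i] - y[j], via broadcasting
dx = x[:, None] - x[None, :]
dy = y[:, None] - y[None, :]
dist = np.sqrt(dx ** 2 + dy ** 2)
```

This avoids hard-coding the array length (the 3 in the repeat and reshape calls), so it works unchanged for any number of points.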
L2 distance = (a^2 + b^2 - 2ab)^0.5
a = np.random.randn(5, 3)
b = np.random.randn(2, 3)
a2 = np.sum(np.square(a), axis = 1)[..., None]
b2 = np.sum(np.square(b), axis = 1)[None, ...]
ab = -2*np.dot(a, b.T)
dist = np.sqrt(a2 + b2 + ab)
I'm using the Kobe Bryant Dataset.
I wish to predict the shot_made_flag with KnnRegressor.
I've used game_date to extract year and month features:
# convert season to years (raw strings keep '\d' from being read as a string escape)
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.compile(r'(\d+)-').findall(x)[0]))
# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile(r'(\d{4})').findall(x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile(r'-(\d+)-').findall(x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])
I want to give the season, year and month features more weight in the distance function, so that events closer in date to the current event are closer neighbors while still keeping reasonable distances to other data points. For example, I don't want an event on the same day to be the closest neighbor just because of the date features; the other features, such as shot_range, should still be taken into account.
To give them more weight I tried the metric argument with a custom distance function, but the function only receives NumPy arrays without the pandas column information, so I'm not sure how to implement this.
EDIT:
Using larger weights for the date features, I search for the optimal k over range(1, 100) with 10-fold cross-validation:
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# scaling
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features
not_classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].isnull()]
classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].notnull()]
X = classified_df.drop(columns=['shot_made_flag'])
y = classified_df['shot_made_flag']
cv = StratifiedKFold(n_splits=10, shuffle=True)
neighbors = [x for x in range(1, 100)]
cv_scores = []
weight = np.ones((X.shape[1],))
weight[[X.columns.get_loc("season"),
X.columns.get_loc("year"),
X.columns.get_loc("month")
]] = 5
weight = weight/weight.sum() #Normalize weights
def my_distance(x, y):
    dist = ((x-y)**2)
    return np.dot(dist, weight)
for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric=my_distance)
    cv_scores.append(np.mean(cross_val_score(knn, X, y, cv=cv, scoring='roc_auc')))
# optimal k: ROC AUC is a score, so higher is better
optimal_k_index = cv_scores.index(max(cv_scores))
optimal_k = neighbors[optimal_k_index]
print('best k: ', optimal_k)
plt.plot(neighbors, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC')
plt.show()
It runs really slowly. Any idea how to make it faster?
The idea of the weighted features is to find neighbors closer in date to the data point, to avoid data leakage, and to use cross-validation to find the optimal k.
First, you have to prepare a numpy 1D weight array, specifying weight for each feature. You could do something like:
weight = np.ones((M,)) # M is no of features
weight[[1,7,10]] = 2 # Increase weight of 1st,7th and 10th features
weight = weight/weight.sum() #Normalize weights
You can use kobe_data_encoded.columns to find the indexes of the season, year and month features in your dataframe and use them in the second line above.
Now define a distance function, which has to take two 1D numpy arrays and return a scalar.
def my_dist(x,y):
    global weight #1D array, same shape as x or y
    dist = ((x-y)**2) #1D array, same shape as x or y
    return np.dot(dist,weight) # a scalar float
And initialize KNeighborsRegressor as:
knn = KNeighborsRegressor(metric=my_dist)
EDIT:
To make things efficient, you can precompute the distance matrix and reuse it in KNN. This should give a significant speedup by reducing the number of calls to my_dist, since a non-vectorized custom Python distance function is quite slow. So now:
X_arr = X.to_numpy() # my_dist needs positional row access, so work on a NumPy array
n = len(X_arr)
dist = np.zeros((n, n)) # Computing NxN distance matrix
for i in range(n): # You can halve this by using the fact that dist[i,j] = dist[j,i]
    for j in range(n):
        dist[i,j] = my_dist(X_arr[i], X_arr[j])
for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed') #Note: metric='precomputed'
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc'))) #Note: passing dist instead of X
I couldn't test it, so let me know if something isn't alright.
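The double loop can itself be vectorized. The sketch below is an illustration under assumptions (the function name weighted_sq_dist_matrix and the random test data are made up): it relies on the identity sum_k w[k]*(x[k]-y[k])**2 = ||sqrt(w)*x - sqrt(w)*y||**2, so scaling each feature by the square root of its weight turns the weighted distance into an ordinary squared Euclidean distance.

```python
import numpy as np

def weighted_sq_dist_matrix(X, weight):
    """All-pairs sum_k weight[k] * (X[i, k] - X[j, k])**2 via broadcasting."""
    X = np.asarray(X, dtype=float)
    Xs = X * np.sqrt(weight)                  # scale each feature by sqrt(weight)
    diff = Xs[:, None, :] - Xs[None, :, :]    # shape (N, N, M)
    return np.einsum('ijk,ijk->ij', diff, diff)

# Small made-up example: 6 samples, 4 features, first feature weighted 5x
rng = np.random.default_rng(0)
X = rng.random((6, 4))
weight = np.array([5.0, 1.0, 1.0, 1.0])
weight = weight / weight.sum()                # normalize weights
D = weighted_sq_dist_matrix(X, weight)
```

Note that the intermediate diff array needs N*N*M floats, so for a large dataset it would have to be computed in chunks. Because of the same identity, fitting KNN with its default Euclidean metric on the scaled features Xs should select the same neighbors as the custom metric, without a precomputed matrix at all.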
Just to add to Shihab's answer regarding the distance computation: you can use scipy's pdist, as suggested in this post, which is faster and more efficient.
from scipy.spatial.distance import pdist, minkowski, squareform
# create the custom weight array
weight = ...
# calculate pairwise distances, using the Minkowski norm (p=2) with custom weights
# (recent SciPy versions expect p and w as keyword arguments)
distances = pdist(X, 'minkowski', p=2, w=weight)
# reformat the result as a square matrix
distances_as_2d_matrix = squareform(distances)
I have a vector as input for a layer.
For this vector I would like to calculate the cosine similarity to several other vectors (which can be arranged in a matrix).
Example (other vectors: c1,c2,c3 ...):
Input:
v
(len(v) = len(c1) = len(c2) ...)
Output:
[cosineSimilarity(v,c1),cosineSimilarity(v,c2),cosineSimilarity(v,c3),cosineSimilarity(v,...)]
I think the problem could be solved by an approach like the following:
cosineSimilarity (v, matrix (c1, c2, c3, ...))
but unfortunately I have no idea how to implement that in a Keras layer with input_shape (1,len(v)) and output_shape (1,columns(matrix)).
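OK, it turned out to be easy. I simply inserted the Lambda layer below. It works because the element-wise product broadcasts the vector against every row of the matrix, and the mean over the last axis, rescaled by the vector length, is the dot product.
def cosine_similarity(x):
    # shape x: (10,)
    y = tf.constant([c1,c2])
    # shape c1,c2: (10,)
    # shape y: (2,10)
    x = K.l2_normalize(x, -1)
    y = K.l2_normalize(y, -1)
    # x * y broadcasts to (2,10); mean over the last axis times the
    # vector length (10) equals the dot product of the normalized vectors
    s = K.mean(x * y, axis=-1, keepdims=False) * 10
    return s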
In my case the input is a vector with shape (10,). The output is a vector with shape (2,) holding the cosine similarity of the input vector to c1 and to c2.
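The same computation can be checked outside Keras with NumPy, using a matrix-vector product instead of the mean trick so that the vector length does not need to be hard-coded. This is only a sketch: the function name cosine_similarity_to_rows and the example vectors are made up, and it assumes neither v nor any row of the matrix is all zeros.

```python
import numpy as np

def cosine_similarity_to_rows(v, C):
    """Cosine similarity of vector v (k,) to each row of matrix C (n, k)."""
    v = np.asarray(v, dtype=float)
    C = np.asarray(C, dtype=float)
    v_unit = v / np.linalg.norm(v)
    C_unit = C / np.linalg.norm(C, axis=1, keepdims=True)
    # Dot product of each normalized row with the normalized vector: shape (n,)
    return C_unit @ v_unit

v = np.array([1.0, 0.0, 0.0])
C = np.array([[2.0, 0.0, 0.0],    # same direction as v
              [0.0, 3.0, 0.0],    # orthogonal to v
              [-1.0, 0.0, 0.0],   # opposite direction
              [1.0, 1.0, 0.0]])   # 45 degrees from v
sims = cosine_similarity_to_rows(v, C)
```

The four rows cover the parallel, orthogonal, opposite and 45-degree cases, so the result can be compared against 1, 0, -1 and 1/sqrt(2).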