How to Calculate Cosine Similarity Using TensorFlow

import pandas as pd
import numpy as np
df = pd.read_csv('shops.csv', sep='|')
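df.columns = ['name', # business name
    'cate_1', # mid-level category name
    'cate_2', # sub-category name
    'cate_3', # standard industrial classification name
    'dong', # administrative district (dong) name
    'lon', # latitude (sic; the column name suggests longitude)
    'lat' # longitude (sic; the column name suggests latitude)
]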
df['cate_mix'] = df['cate_1'] + df['cate_2'] + df['cate_3']
df['cate_mix'] = df['cate_mix'].str.replace("/", " ")
from sklearn.feature_extraction.text import CountVectorizer # feature vectorization
from sklearn.metrics.pairwise import cosine_similarity # cosine similarity
count_vect_category = CountVectorizer(min_df=0, ngram_range=(1,2))
place_category = count_vect_category.fit_transform(df['cate_mix'])
place_simi_cate = cosine_similarity(place_category, place_category)
place_simi_cate_sorted_ind = place_simi_cate.argsort()[:, ::-1]
I want to compute the same cosine similarity as above, but using TensorFlow.
Is there a way to do that?

Example:
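import tensorflow as tf
y_true = [[0., 1.], [1., 1.]]
y_pred = [[1., 0.], [1., 1.]]
# The Keras loss is the negative of the mean cosine similarity over the rows
cosine_loss = tf.keras.losses.CosineSimilarity(axis=1)
cosine_loss(y_true, y_pred).numpy()  # -0.5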
Source: TensorFlow docs
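The Keras loss above reduces everything to one averaged, negated number, whereas the scikit-learn call in the question returns a full row-by-row similarity matrix. A minimal sketch of one way to get that matrix in TensorFlow follows: normalize each row to unit length, then multiply by the transpose. The function name `pairwise_cosine_similarity` and the small `counts` array are hypothetical stand-ins; it assumes the sparse count matrix fits in memory once converted with `place_category.toarray()`.

```python
import numpy as np
import tensorflow as tf

def pairwise_cosine_similarity(a, b):
    """Return the matrix of cosine similarities between rows of a and rows of b."""
    a = tf.convert_to_tensor(a, dtype=tf.float32)
    b = tf.convert_to_tensor(b, dtype=tf.float32)
    # Scale every row to unit L2 norm.
    a_unit = tf.math.l2_normalize(a, axis=1)
    b_unit = tf.math.l2_normalize(b, axis=1)
    # The dot product of two unit vectors is their cosine similarity.
    return tf.matmul(a_unit, b_unit, transpose_b=True)

# Hypothetical stand-in for the dense count matrix (place_category.toarray())
counts = np.array([[1., 0., 2.],
                   [0., 3., 1.],
                   [2., 0., 4.]])
simi = pairwise_cosine_similarity(counts, counts)

# Most similar rows first, like argsort()[:, ::-1] in the question
sorted_ind = tf.argsort(simi, axis=1, direction="DESCENDING")
```

Rows 0 and 2 of `counts` point in the same direction, so their similarity should come out as 1 up to float32 rounding.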

Related

Jacobian of a vector in Tensorflow

I think this question has never been properly answered (see How to calculate the Jacobian of a vector function with tensorflow or Computing Jacobian in TensorFlow 2.0), so I will try again:
I want to compute the jacobian of the vector valued function z = [x**2 + 2*y, y**2], that is, I want to obtain the matrix of the partial derivatives
[[2x, 0],
[2, 2y]]
(since this is automatic differentiation, the matrix will be evaluated at a specific point).
import tensorflow as tf
with tf.GradientTape() as g:
    x = tf.Variable(1.0)
    y = tf.Variable(4.0)
    z = tf.convert_to_tensor([x**2 + 2*y, y**2])
jacobian = g.jacobian(z, [x, y])
print(jacobian)
Obtaining
[<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2., 0.], dtype=float32)>, <tf.Tensor: shape=(2,), dtype=float32, numpy=array([2., 8.], dtype=float32)>]
What I actually want is a single tensor:
[[2., 0.],
[2., 8.]]
not that intermediate result. Can it be done?
Try something like this:
import numpy as np
import tensorflow as tf
with tf.GradientTape() as g:
    x = tf.Variable(1.0)
    y = tf.Variable(4.0)
    z = tf.convert_to_tensor([x**2 + 2*y, y**2])
jacobian = g.jacobian(z, [x, y])
# Each list entry holds the derivatives with respect to one variable
print(np.array([jacob.numpy() for jacob in jacobian]))
Result
[[2. 0.]
[2. 8.]]
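The answer above leaves TensorFlow and builds a NumPy array. If the result should stay a tensor, a minimal sketch is to stack the per-variable results with `tf.stack`. The names `by_variable` and `by_component` are illustrative only. Note that the layout requested in the question has one row per variable; the more common Jacobian convention (entry i, j is dz_i/dx_j) is its transpose, which stacking along axis 1 produces.

```python
import tensorflow as tf

x = tf.Variable(1.0)
y = tf.Variable(4.0)
with tf.GradientTape() as g:
    z = tf.stack([x**2 + 2*y, y**2])

# One tensor per source variable: [dz/dx, dz/dy], each of shape (2,)
per_variable = g.jacobian(z, [x, y])

# Rows indexed by variable, the layout asked for in the question
by_variable = tf.stack(per_variable, axis=0)

# Rows indexed by component of z, i.e. entry (i, j) is dz_i/dx_j
by_component = tf.stack(per_variable, axis=1)
```

Both results are ordinary `tf.Tensor` objects of shape (2, 2), so they can be used in further TensorFlow computation without a round trip through NumPy.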

tf.keras.losses.CategoricalCrossentropy gives different values than plain implementation

Does anyone know why a raw implementation of the categorical cross-entropy function gives such different results from the tf.keras API function?
import tensorflow as tf
import numpy as np
tf.enable_eager_execution()  # TF 1.x only; eager execution is the default in TF 2.x
y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
ce = tf.keras.losses.CategoricalCrossentropy()
res = ce(y_true, y_pred).numpy()
print("use api:")
print(res)
print()
print("implementation:")
step1 = -y_true * np.log(y_pred )
step2 = np.sum(step1, axis=1)
print("step1.shape:", step1.shape)
print(step1)
print("sum step1:", np.sum(step1, ))
print("mean step1", np.mean(step1))
print()
print("step2.shape:", step2.shape)
print(step2)
print("sum step2:", np.sum(step2, ))
print("mean step2", np.mean(step2))
The above gives:
use api:
0.3239681124687195
implementation:
step1.shape: (3, 3)
[[0.10536052 0. 0. ]
[0. 0.11653382 0. ]
[0. 0. 0.0618754 ]]
sum step1: 0.2837697356318653
mean step1 0.031529970625762814
step2.shape: (3,)
[0.10536052 0.11653382 0.0618754 ]
sum step2: 0.2837697356318653
mean step2 0.09458991187728844
Now with a different y_true and y_pred:
y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])
It gives:
use api:
16.11809539794922
implementation:
step1.shape: (1, 2)
[[-0. 25.32843602]]
sum step1: 25.328436022934504
mean step1 12.664218011467252
step2.shape: (1,)
[25.32843602]
sum step2: 25.328436022934504
mean step2 25.328436022934504
The difference comes from these values: [.5, .89, .6], since their sum is not equal to 1. I think you made a mistake and meant this instead: [.05, .89, .06].
If you provide values that sum to 1, both formulas give the same result:
import tensorflow as tf
import numpy as np
y_true = np.array( [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.05, .89, .06], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
#output
#[0.10536052 0.11653382 0.0618754 ]
#[0.10536052 0.11653382 0.0618754 ]
However, let's explore how the loss is calculated when the y_pred tensor is not scaled (its values do not sum to 1). If you look at the source code of categorical cross-entropy, you will see that it scales y_pred so that the class probabilities of each sample sum to 1:
if not from_logits:
    # scale preds so that the class probas of each sample sum to 1
    output /= tf.reduce_sum(output,
                            reduction_indices=len(output.get_shape()) - 1,
                            keep_dims=True)
Since we passed a prediction whose probabilities do not sum to 1, let's see how this operation changes our tensor [.5, .89, .6]:
output = tf.constant([.5, .89, .6])
output /= tf.reduce_sum(output,
axis=len(output.get_shape()) - 1,
keepdims=True)
print(output.numpy())
# array([0.2512563 , 0.44723618, 0.30150756], dtype=float32)
So the two results should match if we pass the scaled y_pred (the output of the operation above) to your own categorical cross-entropy implementation, and the unscaled y_pred to the TensorFlow implementation:
y_true =np.array( [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
#unscaled y_pred
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
#scaled y_pred (categorical_crossentropy scales above tensor to this internally)
y_pred = np.array([[.9, .05, .05], [0.2512563 , 0.44723618, 0.30150756], [.05, .01, .94]])
print(np.sum(-y_true * np.log(y_pred), axis=1))
Output:
[0.10536052 0.80466845 0.0618754 ]
[0.10536052 0.80466846 0.0618754 ]
Now, let's look at your second example. Why does it produce a different output?
If you check the source code again, you will see this line:
output = tf.clip_by_value(output, epsilon, 1. - epsilon)
which clips values to the range [epsilon, 1 - epsilon]. Your input [0.99999999999, 0.00000000001] is converted to [0.9999999, 0.0000001] by this line, so it gives you a different result:
y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
#now let's first clip the values less than epsilon, then compare loss
epsilon=1e-7
y_pred = tf.clip_by_value(y_pred, epsilon, 1. - epsilon)
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
Output:
#results without clipping values
[16.11809565]
[25.32843602]
#results after clipping values if there is a value less than epsilon (1e-7)
[16.11809565]
[16.11809565]
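The two preprocessing steps described above (row normalization, then clipping) can be folded into a plain NumPy function. This is a minimal sketch, not the library's code: the function name `categorical_crossentropy_np` is hypothetical, and the default `epsilon=1e-7` is taken from the value used in the answer.

```python
import numpy as np

def categorical_crossentropy_np(y_true, y_pred, epsilon=1e-7):
    """Per-sample cross-entropy with the scaling and clipping steps applied first."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    # Step 1: rescale each row so that its values sum to 1.
    y_pred = y_pred / np.sum(y_pred, axis=-1, keepdims=True)
    # Step 2: clip into [epsilon, 1 - epsilon] so that log() never sees 0.
    y_pred = np.clip(y_pred, epsilon, 1.0 - epsilon)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# First example from the question, with the unscaled middle row
loss_unscaled = categorical_crossentropy_np(
    [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]],
    [[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]],
)

# Second example from the question, where clipping matters
loss_clipped = categorical_crossentropy_np([[0, 1]], [[0.99999999999, 0.00000000001]])

print(loss_unscaled)
print(loss_clipped)
```

With both steps in place, the printed values should agree with the Keras figures quoted above up to rounding.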

From softmax output to class prediction

Is there an easy way to go from a Softmax output to a class prediction?
For instance,
from this:
[0.83128697, 0.06161868, 0.10709436]
to this:
[1, 0, 0]
You can use np.argmax to retrieve the index of max value:
import numpy as np
a = [0.83128697, 0.06161868, 0.10709436]
r = np.zeros(len(a)) # a.size if a is a numpy array
r[np.argmax(a)]=1
r
array([1., 0., 0.])
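For a whole batch of softmax outputs, the same argmax idea can be applied row by row. A minimal sketch, assuming a 2-D array with one row per sample; the helper name `softmax_to_one_hot` and the second row of `batch` are made up for illustration.

```python
import numpy as np

def softmax_to_one_hot(probs):
    """Turn each row of probabilities into a one-hot row marking its largest entry."""
    probs = np.atleast_2d(np.asarray(probs, dtype=float))
    one_hot = np.zeros_like(probs, dtype=int)
    # For every row, set the column of the maximum probability to 1.
    one_hot[np.arange(probs.shape[0]), np.argmax(probs, axis=1)] = 1
    return one_hot

batch = [[0.83128697, 0.06161868, 0.10709436],
         [0.2, 0.5, 0.3]]
classes = softmax_to_one_hot(batch)
print(classes)
```

If there is a tie, `np.argmax` picks the first maximum, so each row still contains exactly one 1.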

How to print Recall and Accuracy along with Parameters used in a GridSearch in Sklearn?

I want to print the accuracy and recall along with each parameter combination used in the grid search. How can that be done?
My grid search code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_score, recall_score
rf1 = RandomForestClassifier(n_jobs=-1, max_features='sqrt')
#fit_rf1=rf.fit(X_train_res,y_train_res)
# Use a grid over parameters of interest
param_grid = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [2, 5, 10],
    "min_samples_leaf": [10, 20, 30]}
scoring = {'precision': make_scorer(precision_score), 'Recall': make_scorer(recall_score)}
# With several scorers, refit must be False or the name of one scorer
CV_rfc = GridSearchCV(estimator=rf1, param_grid=param_grid, cv=10, scoring=scoring, refit=False)
CV_rfc.fit(X_train_res, y_train_res)
My Expected Output
{'max_depth': 10, 'min_samples_leaf': 2, 'n_estimators': 50,'accuracy':.97,'recall':.89}
{'max_depth': 5, 'min_samples_leaf':10 , 'n_estimators': 100,'accuracy':.98,'recall':.92}
If you pass scoring as a list of scorer names, you can get the mean score for each scorer from CV_rfc.cv_results_.
For example:
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification()
base_clf = RandomForestClassifier()
param_grid = {
"n_estimators" : [50, 100, 150, 200],}
CV_rf = GridSearchCV(base_clf, param_grid, scoring=['accuracy', 'roc_auc'], refit=False)
CV_rf.fit(X, y)
print(CV_rf.cv_results_)
and you get output like:
{'mean_fit_time': array([ 0.05867839, 0.10268728, 0.15536443, 0.19937317]),
'mean_score_time': array([ 0.00600123, 0.01033529, 0.0146695 , 0.02000403]),
'mean_test_accuracy': array([ 0.9 , 0.91, 0.89, 0.91]),
'mean_test_roc_auc': array([ 0.91889706, 0.94610294, 0.94253676, 0.94308824]),
'mean_train_accuracy': array([ 1., 1., 1., 1.]),
'mean_train_roc_auc': array([ 1., 1., 1., 1.]),
[...]
}
So mean_test_[scoring] is what you are after. Note that you can load cv_results_ into a pandas DataFrame, which helps readability a lot!
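To get close to the one-dict-per-combination output asked for in the question, the DataFrame can be reduced to the `params` column plus the mean test scores. A minimal sketch under stated assumptions: the synthetic data, the small grid, `cv=3`, and the names `search`, `summary`, and `rows` are all illustrative choices, not part of the original code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for X_train_res, y_train_res
X, y = make_classification(n_samples=200, random_state=0)
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=["accuracy", "recall"],
    refit=False,  # required to be False or a scorer name with several metrics
    cv=3,
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
summary = results[["params", "mean_test_accuracy", "mean_test_recall"]]

# One dict per parameter combination, with the scores merged in
rows = [
    {**row.params, "accuracy": row.mean_test_accuracy, "recall": row.mean_test_recall}
    for row in summary.itertuples()
]
for row in rows:
    print(row)
```

Each printed dict holds the parameter values of one grid point together with its mean cross-validated accuracy and recall.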

tflearn to_categorical type error

I keep getting a TypeError when I try to use to_categorical from tflearn. The error output is:
trainY = to_categorical(y = trainY, nb_classes=2)
File "C:\Users\saleh\Anaconda3\lib\site-packages\tflearn\data_utils.py", line 46, in to_categorical
return (y[:, None] == np.unique(y)).astype(np.float32)
TypeError: list indices must be integers or slices, not tuple
This is the reproducible code that I am trying to run:
import tflearn
from tflearn.data_utils import to_categorical
from tflearn.datasets import imdb
#IMDB dataset loading
train, test, _ = imdb.load_data(path = 'imdb.pkl', n_words = 10000, valid_portion = 0.1)
trainX, trainY = train
testX, testY = test
#converting labels to binary vectors
trainY = to_categorical(y = trainY, nb_classes=2) # **This is where I get the error**
testY = to_categorical(y = testY, nb_classes=2)
I cannot reproduce your error:
import tflearn
from tflearn.data_utils import to_categorical
from tflearn.datasets import imdb
train, test, _ = imdb.load_data(path = 'imdb.pkl', n_words = 10000, valid_portion = 0.1)
trainX, trainY = train
testX, testY = test
trainY[0:5]
# [0, 0, 0, 1, 0]
trainY = to_categorical(y = trainY, nb_classes=2)
trainY[0:5]
# array([[ 1., 0.],
# [ 1., 0.],
# [ 1., 0.],
# [ 0., 1.],
# [ 1., 0.]])
System configuration:
Python 2.7.12
Tensorflow 1.3.0
TFLearn 0.3.2
Ubuntu 16.04
UPDATE: It seems that a recent TFLearn commit has broken to_categorical - see here and here. I suggest uninstalling your current version and installing the latest stable one with pip install tflearn (this is what I did for the run above).
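The traceback points at the expression `y[:, None]`, which only works with NumPy-style indexing and fails on a plain Python list. The sketch below reproduces that one-hot expression in plain NumPy after converting the labels to an array. The helper `one_hot_labels` is hypothetical and is not part of TFLearn; whether passing `np.asarray(trainY)` to the affected TFLearn release is enough to avoid the error is an assumption that has not been checked here.

```python
import numpy as np

def one_hot_labels(y):
    """Mirror the expression from the traceback, after converting y to an array."""
    y = np.asarray(y)
    # y[:, None] needs NumPy indexing; on a plain list it raises the TypeError above.
    return (y[:, None] == np.unique(y)).astype(np.float32)

labels = [0, 0, 0, 1, 0]  # the first five labels shown in the answer
encoded = one_hot_labels(labels)
print(encoded)
```

Because the columns come from `np.unique(y)`, this only yields two columns when both classes actually occur in the labels passed in.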