I have trained a Siamese neural network that uses triplet loss. It was a pain, but I think I managed to do it. However, I am struggling to understand how to make evaluations with this model.
The SNN:
def triplet_loss(y_true, y_pred):
    margin = K.constant(1)
    return K.mean(K.maximum(K.constant(0), K.square(y_pred[:,0]) - 0.5*(K.square(y_pred[:,1])+K.square(y_pred[:,2])) + margin))
def euclidean_distance(vects):
    x, y = vects
    return K.sqrt(K.maximum(K.sum(K.square(x - y), axis=1, keepdims=True), K.epsilon()))
anchor_input = Input((max_len, ), name='anchor_input')
positive_input = Input((max_len, ), name='positive_input')
negative_input = Input((max_len, ), name='negative_input')
Shared_DNN = create_base_network(embedding_dim = EMBEDDING_DIM, max_len=MAX_LEN, embed_matrix=embed_matrix)
encoded_anchor = Shared_DNN(anchor_input)
encoded_positive = Shared_DNN(positive_input)
encoded_negative = Shared_DNN(negative_input)
positive_dist = Lambda(euclidean_distance, name='pos_dist')([encoded_anchor, encoded_positive])
negative_dist = Lambda(euclidean_distance, name='neg_dist')([encoded_anchor, encoded_negative])
tertiary_dist = Lambda(euclidean_distance, name='ter_dist')([encoded_positive, encoded_negative])
stacked_dists = Lambda(lambda vects: K.stack(vects, axis=1), name='stacked_dists')([positive_dist, negative_dist, tertiary_dist])
model = Model([anchor_input, positive_input, negative_input], stacked_dists, name='triple_siamese')
model.compile(loss=triplet_loss, optimizer=adam_optim, metrics=[accuracy])
history = model.fit([Anchor,Positive,Negative],y=Y_dummy,validation_data=([Anchor_test,Positive_test,Negative_test],Y_dummy2), batch_size=128, epochs=25)
I understand that once a model is trained with triplets, the evaluation shouldn't actually require that triplets be used. However, how do I finagle this reshaping?
Because this is a SNN, I would want to feed two inputs into model.evaluate, along with a categorical variable denoting if the two inputs are similar or not (1 = similar, 0 = not similar).
So basically, I want model.evaluate(input1, input2, y_label). But I am not sure how to get this with the model that I trained. As shown above, I trained with three inputs: model.fit([Anchor,Positive,Negative],y=Y_dummy ... ) .
I know I should save the weights of my trained model, but I just don't know what model to load the weights onto.
Your help is greatly appreciated!
EDIT:
I am aware of the below approach for prediction, but I am not looking for prediction; I want to use model.evaluate to get a final measure of loss/accuracy for the model. Also, this approach only feeds the anchor into the model (whereas I'm interested in text similarity, so I would want to feed in two inputs).
eval_model = Model(inputs=anchor_input, outputs=encoded_anchor)
eval_model.load_weights('weights.hdf5')
Considering that eval_model is trained to produce embeddings, I think that should be good to evaluate the similarity between two embeddings using cosine similarity.
Per the TF documentation, tf.keras.losses.cosine_similarity returns the negative of the cosine similarity, a number between -1 and 1, so that it can be minimized as a loss. A value closer to -1 indicates greater similarity; a value closer to 1 indicates greater dissimilarity.
We can simply calculate this value between the two inputs of each pair, for all the samples at our disposal. When it is < 0, we can say that the two inputs are similar (1 = similar, 0 = not similar). Finally, it is possible to calculate the binary accuracy as a final metric.
We can make all the calculations using TF and without the need of using model.evaluate.
eval_model = Model(inputs=anchor_input, outputs=encoded_anchor)
eval_model.load_weights('weights.hdf5')
cos_sim = tf.keras.losses.cosine_similarity(
eval_model(X1), eval_model(X2)
).numpy().reshape(-1,1)
accuracy = tf.reduce_mean(tf.keras.metrics.binary_accuracy(Y, -cos_sim, threshold=0))
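The same pair-wise evaluation can be sketched in plain NumPy, which makes the thresholding step explicit. This is only an illustration under assumptions: `pair_accuracy` is a hypothetical helper, the toy arrays stand in for `eval_model(X1)` and `eval_model(X2)`, and it uses the ordinary (non-negated) cosine similarity, so "similar" means a value above the threshold.

```python
import numpy as np

def pair_accuracy(emb_a, emb_b, labels, threshold=0.0):
    """Fraction of pairs whose thresholded cosine similarity matches the label."""
    # Normalise each row so the row-wise dot product is the cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)
    predicted = (cos > threshold).astype(int)  # 1 = similar, 0 = not similar
    return float(np.mean(predicted == np.asarray(labels)))

# Toy embeddings standing in for eval_model(X1) and eval_model(X2).
emb1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
emb2 = np.array([[2.0, 0.1], [0.0, -1.0], [-1.0, -1.0]])
labels = np.array([1, 0, 1])

acc = pair_accuracy(emb1, emb2, labels)  # two of the three pairs are classified correctly
```

A threshold of 0 is only a starting point; in practice it would likely be tuned on a validation set.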
Another approach consists in computing the cosine similarity between the anchor and positive samples and comparing it with the similarity between the anchor and the negative samples.
eval_model = Model(inputs=anchor_input, outputs=encoded_anchor)
eval_model.load_weights('weights.hdf5')
positive_similarity = tf.keras.losses.cosine_similarity(
eval_model(X_anchor), eval_model(X_positive)
).numpy().mean()
negative_similarity = tf.keras.losses.cosine_similarity(
eval_model(X_anchor), eval_model(X_negative)
).numpy().mean()
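Because tf.keras.losses.cosine_similarity returns the negated similarity, we should expect positive_similarity to be lower (closer to -1) than negative_similarity; in other words, the anchor should be more similar to the positive samples than to the negative samples.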
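Since the network in the question was trained on Euclidean distances, a per-triplet check may be more informative than comparing two averages. The sketch below is an assumption-laden illustration: `triplet_accuracy` is a hypothetical helper and the arrays stand in for the embeddings of the anchor, positive, and negative inputs.

```python
import numpy as np

def triplet_accuracy(anchor, positive, negative):
    """Share of triplets where the anchor is closer to the positive than to the negative."""
    pos_dist = np.linalg.norm(anchor - positive, axis=1)
    neg_dist = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(pos_dist < neg_dist))

# Toy embeddings: the first triplet is ranked correctly, the second is not.
anchor = np.array([[0.0, 0.0], [1.0, 1.0]])
positive = np.array([[0.1, 0.0], [3.0, 3.0]])
negative = np.array([[1.0, 1.0], [1.0, 1.5]])

triplet_acc = triplet_accuracy(anchor, positive, negative)
print(triplet_acc)  # 0.5
```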
I'm using the MNIST handwritten numerals dataset to train a CNN.
After training the model, I use predict like this:
predictions = cnn_model.predict(test_images)
predictions[0]
and I get this output:
array([2.1273775e-06, 2.9292005e-05, 1.2424786e-06, 7.6307842e-05,
7.4305902e-08, 7.2301691e-07, 2.5368356e-08, 9.9952960e-01,
1.2401938e-06, 1.2787555e-06], dtype=float32)
In the output, there are 10 probabilities, one for each numeral from 0 to 9. But how do I know which probability refers to which numeral?
In this particular case, the probabilities are arranged sequentially for numerals 0 to 9. But why is that? I didn't define that anywhere.
I tried going over documentation and example implementations found elsewhere on the internet, but no one seems to have addressed this particular behaviour.
Edit:
For context, I've defined my train/test data by:
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = (np.expand_dims(train_images, axis=-1)/255.).astype(np.float32)
train_labels = (train_labels).astype(np.int64)
test_images = (np.expand_dims(test_images, axis=-1)/255.).astype(np.float32)
test_labels = (test_labels).astype(np.int64)
And my model consists of a few convolution and pooling layers, then a Flatten layer, then a Dense layer with 128 neurons and an output Dense layer with 10 neurons.
After that I simply fit my model and use predict like this:
cnn_model.fit(train_images, train_labels, batch_size=BATCH_SIZE, epochs=EPOCHS)
predictions = cnn_model.predict(test_images)
I don't see where I've instructed my code to output the first neuron as digit 0, the second neuron as digit 1, etc.
And if I wanted to change the sequence in which the resulting digits are output, where would I do that?
This is really confusing me a lot.
Models work with numbers. Your classes/labels should be represented as numbers (e.g., 0, 1, ...., n). The prediction is always indexed to show probabilities for class 0 at index 0, class 1 at index 1. Now in the MNIST case, you are lucky the labels are integers 0 to 9. Suppose you had to classify images into three classes: cars, bicycles, trucks. You must represent those classes as numerical values. You can arrange it as you wish. If you choose this: {cars: 0, bicycles: 1, trucks: 2}, in other words, if you label your cars as 0, bicycles as 1, and trucks as 2, then your prediction would show probability for cars at index 0, bicycles at index 1 and trucks at index 2.
You could have also decided to choose this setting: {cars: 2, bicycles: 0, trucks: 1}, then your prediction would show probability for cars at index 2, bicycles at index 0 and trucks at index 1, and so on.
The point is, you have to represent your classes (as many as you have) as integers from 0 to n, where n is num_classes - 1. Your probabilities at prediction will be indexed accordingly. You don't have to tell the model.
Hope this is now clear.
It depends on how you prepare your labels during training. With MNIST classification, usually, there are two different ways:
One-hot Labels: There are 10 labels in the MNIST data, so for each example (image) you create a label array (vector) of length 10 where all the elements are zero except at the index corresponding to the digit that your input image is showing. For example, if your input image is showing the digit 8, your label contains zeros everywhere except at index 8 (e.g. [0,0,0,0,0,0,0,0,1,0]). If your image is showing the digit 2, your label would be [0,0,1,0,0,0,0,0,0,0], and so on.
Sparse Labels: you just label each image directly by what digit it is showing, for example if your image is showing the digit 8, your label is a single number with value 8.
In both cases, you could choose the labels however you want, in the MNIST classification it is just intuitive to use the labels 0-9 to show digits 0-9.
Thus, in the prediction, the probability at index 0 is for digit 0, index 1 for digit 1, and so on.
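The two label formats can be compared side by side with a small NumPy sketch. The variable names are illustrative only; the point is that both encodings carry the same index information.

```python
import numpy as np

num_classes = 10
sparse_labels = np.array([8, 2])  # sparse labels: one integer per image

# One-hot labels: row i of the identity matrix has its single 1 at index i.
one_hot_labels = np.eye(num_classes, dtype=int)[sparse_labels]

print(one_hot_labels[0])               # [0 0 0 0 0 0 0 0 1 0]
print(one_hot_labels.argmax(axis=1))   # [8 2]
```

In Keras, one-hot labels are normally paired with the categorical_crossentropy loss and sparse labels with sparse_categorical_crossentropy.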
You could choose to prepare your labels differently. For example you could decide to show your labels as follows:
label for digit 0: 9
label for digit 1: 8
label for digit 2: 7
label for digit 3: 6
label for digit 4: 5
label for digit 5: 4
label for digit 6: 3
label for digit 7: 2
label for digit 8: 1
label for digit 9: 0
You could train your model the same way but in this case, the probabilities in the prediction would be inverted. Probability at index 0 would be for digit 9, index 1 for digit 8, and so on.
In short, you have to define your labels using integer indices, but it is up to you to decide and remember what index you chose to refer to which label/class.
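As a small sketch of "decide and remember", the reversed encoding above could be handled as follows. The prediction vector is made up for illustration, and `index_to_digit` is a hypothetical lookup table that you would have to keep alongside the model yourself.

```python
import numpy as np

digits = np.array([0, 3, 9])   # what the images actually show
labels = 9 - digits            # reversed encoding: digit 0 -> label 9, ..., digit 9 -> label 0

# Lookup table from output index back to the digit it stands for.
index_to_digit = {index: 9 - index for index in range(10)}

# Hypothetical prediction vector; its largest probability sits at index 2.
probabilities = np.array([0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
predicted_index = int(np.argmax(probabilities))
predicted_digit = index_to_digit[predicted_index]
print(predicted_index, predicted_digit)  # 2 7
```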
I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with 2 columns: The first with vectors representing words stored as a 1x10000 sparse csr matrix (so a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code
for index, row in data.iterrows():
    print(row)
    print(row[0].shape)
I get the correct output for all the rows
Name: 0, dtype: object
(1, 10000)
Vector (0, 0)\t1.0\n (0, 1)\t1.0\n (0, 2)\t1.0\n ...
Rating 5
Now when I try passing my data in any SKlearn classifier like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size and I've tried reshaping my data in various ways, but with no luck, and the Sklearn classifiers are supposed to be able to deal with csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D matrix did the trick, but for completeness' sake, the following is the code I used to generate my dataframe, in case anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
import pandas as pd
from scipy import sparse
def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table return a binary vector representation of
    each feature"""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            vector[0, int(number) - 1] = 1
        except ValueError:
            pass
    return vector
def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the appropriate "num num num..." format, a specific vectorization format, and a vector size,
    returns the dataset in vectorized form"""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # All the mixing up of decodings and encoding has made it so that Pandas incorrectly parses EOF chars
        if isinstance(row[0], str):
            result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
            result_data.iat[index, 1] = data.loc[index][1]
    return result_data
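The fix mentioned in the update (turning the column of 1xN sparse rows into one 2-D matrix) can be sketched with scipy.sparse.vstack. This is a minimal illustration with an assumed toy size of 4 columns; the list `rows` stands in for something like `data["Vector"].tolist()`, and any rows left empty by vectorize_dataset would presumably need to be dropped first.

```python
import numpy as np
from scipy import sparse

# Stand-in for the "Vector" column: a list of 1x4 CSR matrices.
rows = [
    sparse.csr_matrix(np.array([[1.0, 0.0, 1.0, 0.0]])),
    sparse.csr_matrix(np.array([[0.0, 1.0, 0.0, 0.0]])),
]
ratings = np.array([5, 3])

# Stack the rows into a single (n_samples, n_features) sparse matrix.
X = sparse.vstack(rows).tocsr()
print(X.shape)  # (2, 4)

# X and ratings now have the shapes scikit-learn expects, e.g.:
# uniform_random_classifier.fit(X, ratings)
```

The original call fails because a pandas column of matrix objects is converted to an object array with one matrix per cell, rather than to a single 2-D feature matrix.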
The Word2vec model uses noise-contrastive estimation (NCE) loss to train the model.
Why does it use tf.mul in the true sample logit calculation, but uses tf.matmul in the negative calculation?
See the source code.
One way you can think of the NCE loss calculation is as a batch of independent, binary logistic regression classifications problems. In both cases we are performing the same calculations, even though it does not look like it at the first place.
To show you that we are actually calculating the same thing, assume the following for the true input part:
emb_dim = 3 # dimensions of your embedding vector
batch_size = 2 # number of examples in your training batch
vocab_size = 6 # number of total words in your text
# (so your word ids range from 0 - 5)
Furthermore, assume the following training example in your batch:
1 => 0 # given word with word_id=1, I expect word with word_id=0
1 => 2 # given word with word_id=1, I expect word with word_id=2
Then your embedding matrix example_emb has the dimensions [2,3] and your true weight matrix true_w also has the dimensions [2,3], and should look like this:
example_emb = [ [e1,e2,e3], [e1,e2,e3] ] # [2,3] input word
true_w = [ [w1,w2,w3], [w4,w5,w6] ] # [2,3] target word
The example_emb is a subset of the total word embeddings (emb) that you are trying to learn, and true_w is a subset of the weights (sm_w_t). Each row in example_emb represents an input vector, and each row in the weights represents a target vector.
So [e1,e2,e3] is the word vector of the input word with word_id = 1 taken from emb, and [w1,w2,w3] is the word vector of the expected target word with word_id = 0.
Now, intuitively stated, the classification task you are trying to solve is: given that I see this input word and this target word, is this observation correct?
The two classification tasks then are (without the bias, and tensorflow has this handy 'sigmoid_cross_entropy_with_logits' function, which applies the sigmoid later):
logit( 1=>0 ) = dot( [e1,e2,e3], transpose( [w1,w2,w3] ) ) =>
logit( 1=>0 ) = e1*w1 + e2*w2 + e3*w3
and
logit( 1=>2 ) = dot( [e1,e2,e3], transpose( [w4,w5,w6] ) ) =>
logit( 1=>2 ) = e1*w4 + e2*w5 + e3*w6
We can calculate [[logit(1=>0)],[logit(1=>2)]] most easily by performing an element-wise multiplication with tf.mul() and then summing up each row.
The output of this calculation will be a [batch_size, 1] matrix containing the logits for the correct words. We do know the ground truth/label (y') for these examples, which is 1 because these are the correct examples.
true_logits = [
[logit(1=>0)], # first input word of the batch
[logit(1=>2)] # second input word of the batch
]
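To make the "multiply element-wise, then sum each row" step concrete, here is a NumPy sketch with made-up numbers in place of e1..e3 and w1..w6. It only illustrates the arithmetic and is not the TensorFlow code itself.

```python
import numpy as np

# Made-up values: both batch entries use the same input word, as in the example.
example_emb = np.array([[1.0, 2.0, 3.0],
                        [1.0, 2.0, 3.0]])   # [batch_size, emb_dim]
true_w = np.array([[0.5, -1.0, 2.0],
                   [1.5, 0.0, -0.5]])       # [batch_size, emb_dim]

# Element-wise product followed by a row sum gives one dot product per row.
true_logits = np.sum(example_emb * true_w, axis=1, keepdims=True)  # [batch_size, 1]

# The same values computed as explicit per-row dot products.
row_dots = np.array([[example_emb[i] @ true_w[i]] for i in range(2)])

print(true_logits.ravel())  # [4.5 0. ]
```

Row i of the input is only ever paired with row i of the weights, which is why a full matrix product is not needed here.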
Now for the second part of your question, why we use tf.matmul() in the negative sampling: let's assume that we draw 3 negative samples (num_sampled=3). So sampled_ids = [3,4,5].
Intuitively, this means that you add six more training examples to your batch, namely:
1 => 3 # given word_id=1, do i expect word_id=3? No, because these are negative examples.
1 => 4
1 => 5
1 => 3 # second input word is also word_id=1
1 => 4
1 => 5
So you look up your sampled_w, which turns out to be a [3, 3] matrix. Your parameters now look like this:
example_emb = [ [e1,e2,e3], [e1,e2,e3] ] # [2,3] input word
sampled_w = [ [w7,w8,w9], [w10,w11,w12], [w13,w14,w15] ] # [3,3] sampled target words
Similar to the true case, what we want is the logits for all negative training examples. E.g., for the first example:
logit(1 => 3) = dot( [e1,e2,e3], transpose( [w7,w8,w9] ) ) =>
logit(1 => 3) = e1*w7 + e2*w8 + e3*w9
Now in this case, we can use the matrix multiplication after we transpose the sampled_w matrix. This is achieved using the transpose_b=True parameter in the tf.matmul() call. The transposed weight matrix looks like this:
sampled_w_trans = [ [w7,w10,w13], [w8,w11,w14], [w9,w12,w15] ] # [3,3]
So now the tf.matmul() operation will return a [batch_size, 3] matrix, where each row contains the logits for one example of the input batch. Each element represents a logit for one classification task.
The whole result matrix of the negative sampling contains this:
sampled_logits = [
[logit(1=>3), logit(1=>4), logit(1=>5)], # first input word of the batch
[logit(1=>3), logit(1=>4), logit(1=>5)] # second input word of the batch
]
The labels / ground truth for the sampled_logits are all zeros, because these are the negative examples.
In both cases we perform the same calculation, that is the calculation for a binary classification logistic regression (without the sigmoid, which is applied later).
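The negative-sample side can be checked the same way in NumPy. Again the numbers are made up; the sketch only shows that each entry of the matrix product is one of the per-pair dot products described above.

```python
import numpy as np

example_emb = np.array([[1.0, 2.0, 3.0],
                        [1.0, 2.0, 3.0]])   # [batch_size, emb_dim]
sampled_w = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [1.0, 1.0, 1.0]])     # [num_sampled, emb_dim]

# Equivalent in spirit to matmul(example_emb, sampled_w, transpose_b=True).
sampled_logits = example_emb @ sampled_w.T  # [batch_size, num_sampled]

# Entry (0, 2) is the dot product of input row 0 with sampled weight row 2.
single_logit = np.sum(example_emb[0] * sampled_w[2])

print(sampled_logits)
# [[1. 2. 6.]
#  [1. 2. 6.]]
```

Every input row is paired with every sampled row, so a matrix product is the natural way to get all batch_size x num_sampled logits at once.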
I am looking for algorithm to solve the following problem :
I have two sets of vectors, and I want to find the matrix that best approximates the transformation from the input vectors to the output vectors.
vectors are 3x1, so matrix is 3x3.
This is the general problem. My particular problem is I have a set of RGB colors, and another set that contains the desired color. I am trying to find an RGB to RGB transformation that would give me colors closer to the desired ones.
There is correspondence between the input and output vectors, so computing an error function that should be minimized is the easy part. But how can I minimize this function ?
This is a classic linear algebra problem, the key phrase to search on is "multiple linear regression".
I've had to code some variation of this many times over the years. For example, code to calibrate a digitizer tablet or stylus touch-screen uses the same math.
Here's the math:
Let p be an input vector and q the corresponding output vector.
The transformation you want is a 3x3 matrix; call it A.
For a single input and output vector p and q, there is an error vector e
e = q - A x p
The square of the magnitude of the error is a scalar value:
eT x e = (q - A x p)T x (q - A x p)
(where the T operator is transpose).
What you really want to minimize is the sum of these squared error magnitudes over the sets:
E = sum (eT x e)
This minimum satisfies the matrix equation D = 0 where
D(i,j) = the partial derivative of E with respect to A(i,j)
Say you have N input and output vectors.
Your set of input 3-vectors is a 3xN matrix; call this matrix P.
The ith column of P is the ith input vector.
So is the set of output 3-vectors; call this matrix Q.
When you grind thru all of the algebra, the solution is
A = Q x PT x (P x PT) ^-1
(where ^-1 is the inverse operator -- sorry about no superscripts or subscripts)
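For readers who want the skipped algebra, one way to reach that result is sketched below. It takes E to be the sum of the squared error magnitudes over all N pairs and assumes that P x PT is invertible, which requires the input vectors to span all three dimensions.

```latex
\begin{aligned}
E(A) &= \sum_{i=1}^{N} (q_i - A p_i)^T (q_i - A p_i) = \lVert Q - A P \rVert_F^2 \\
\frac{\partial E}{\partial A} &= -2\,(Q - A P)\,P^T = 0 \\
A\,P P^T &= Q\,P^T \\
A &= Q\,P^T \left(P P^T\right)^{-1}
\end{aligned}
```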
Here's the algorithm:
Create the 3xN matrix P from the set of input vectors.
Create the 3xN matrix Q from the set of output vectors.
Matrix Multiply R = P x transpose (P)
Compute the inverse of R
Matrix Multiply A = Q x transpose(P) x inverse (R)
using the matrix multiplication and matrix inversion routines of your linear algebra library of choice.
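As one possible rendering of these steps, here is a NumPy sketch. The function name `fit_linear_map` and the test matrix are made up for illustration; the data is noise-free so the recovered matrix should match the one used to generate it.

```python
import numpy as np

def fit_linear_map(P, Q):
    """Least-squares 3x3 matrix A with A @ P ≈ Q; P and Q are 3xN."""
    R = P @ P.T                          # 3x3
    return Q @ P.T @ np.linalg.inv(R)    # A = Q x PT x inverse(R)

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1, 0.0],
                   [0.0, 1.1, -0.1],
                   [0.05, 0.0, 1.0]])
P = rng.random((3, 20))   # 20 input colors as columns
Q = A_true @ P            # corresponding outputs

A_est = fit_linear_map(P, Q)
```

With noisy or nearly degenerate inputs, a solver such as np.linalg.lstsq is usually a safer choice than forming the inverse explicitly.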
However, a 3x3 linear transform matrix is capable of scaling and rotating the input vectors, but not doing any translation! This might not be general enough for your problem. It's usually a good idea to append a "1" to the end of each input 3-vector to make it a 4-vector, and look for the best 3x4 transform matrix that minimizes the error. This can't hurt; it can only lead to a better fit of the data.
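The 3x4 variant could look like the following NumPy sketch. `fit_affine_map` is a hypothetical helper, and np.linalg.lstsq is used here in place of the explicit inverse; the synthetic data includes a known offset so the fit can be checked.

```python
import numpy as np

def fit_affine_map(P, Q):
    """Least-squares 3x4 matrix M with M @ [P; 1] ≈ Q; P and Q are 3xN."""
    P_h = np.vstack([P, np.ones((1, P.shape[1]))])    # append a row of ones: 4xN
    # lstsq solves P_h.T @ X ≈ Q.T for X (4x3); the transform is M = X.T.
    X, *_ = np.linalg.lstsq(P_h.T, Q.T, rcond=None)
    return X.T

rng = np.random.default_rng(1)
A_true = np.array([[0.9, 0.1, 0.0],
                   [0.0, 1.1, -0.1],
                   [0.05, 0.0, 1.0]])
t_true = np.array([[0.1], [-0.2], [0.05]])   # translation (offset) per channel
P = rng.random((3, 20))
Q = A_true @ P + t_true

M = fit_affine_map(P, Q)   # first three columns ≈ A_true, last column ≈ t_true
```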
You don't specify a language, but here's how I would approach the problem in Matlab.
v1 is a 3xn matrix, containing your input colors in vertical vectors
v2 is also a 3xn matrix containing your output colors
You want to solve the system
M*v1 = v2
M = v2*inv(v1)
However, v1 is not directly invertible, since it's not a square matrix. Matlab will solve this automatically with the mrdivide operation (M = v2/v1), where M is the best fit solution.
eg:
>> v1 = rand(3,10);
>> M = rand(3,3);
>> v2 = M * v1;
>> v2/v1 - M
ans =
1.0e-15 *
0.4510 0.4441 -0.5551
0.2220 0.1388 -0.3331
0.4441 0.2220 -0.4441
>> (v2 + randn(size(v2))*0.1)/v1 - M
ans =
0.0598 -0.1961 0.0931
-0.1684 0.0509 0.1465
-0.0931 -0.0009 0.0213
This gives a more language-agnostic solution on how to solve the problem.
Some linear algebra should be enough :
Write the average squared difference between inputs and outputs ( the sum of the squares of each difference between each input and output value ). I assume this as definition of "best approximate"
This is a quadratic function of your 9 unknown matrix coefficients.
To minimize it, differentiate it with respect to each of them.
You will get a linear system of 9 equations that you have to solve to get the solution (unique, or a whole subspace of solutions, depending on the input set)
When the difference function is not quadratic, you can do the same but you have to use an iterative method to solve the equation system.
This answer is better for beginners in my opinion:
Have the following scenario:
We don't know the matrix M, but we know the input vectors In and the corresponding output vectors On. n can range from 3 and up.
If we had 3 input vectors and 3 output vectors (for a 3x3 matrix), we could precisely compute the coefficients α(r,c), where r is the row and c the column of M. This way we would have a fully specified system.
But we have more than 3 vectors and thus we have an overdetermined system of equations.
Let's write down these equations. Say that we have these vectors:
We know that to get the vector On, we must multiply the matrix by the vector In. In other words: M · I̅n = O̅n
If we expand this operation, we get (normal equations):
We do not know the alphas, but we know all the rest. In fact, there are 9 unknowns, but 12 equations. This is why the system is overdetermined. There are more equations than unknowns. We will approximate the unknowns using all the equations, and we will use the sum of squares to aggregate more equations into less unknowns.
So we will combine the above equations into a matrix form:
And with some least squares algebra magic (regression), we can solve for b̅:
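Spelled out for the 3x3 case, and consistent with how the code further down builds X_np and Y_np, the expanded equations, their stacked matrix form, and the least-squares solution are presumably the following (the index notation is an assumption made here for illustration):

```latex
% Expanded equations for one input/output pair n
\begin{aligned}
\alpha_{11} I_{n,1} + \alpha_{12} I_{n,2} + \alpha_{13} I_{n,3} &= O_{n,1} \\
\alpha_{21} I_{n,1} + \alpha_{22} I_{n,2} + \alpha_{23} I_{n,3} &= O_{n,2} \\
\alpha_{31} I_{n,1} + \alpha_{32} I_{n,2} + \alpha_{33} I_{n,3} &= O_{n,3}
\end{aligned}

% Stacked over all N pairs: three rows of X per input vector
X \bar{b} = \bar{y}, \qquad
X = \begin{bmatrix}
\bar{I}_1^T & 0 & 0 \\
0 & \bar{I}_1^T & 0 \\
0 & 0 & \bar{I}_1^T \\
\vdots & \vdots & \vdots \\
0 & 0 & \bar{I}_N^T
\end{bmatrix}, \qquad
\bar{b} = (\alpha_{11}, \alpha_{12}, \ldots, \alpha_{33})^T, \qquad
\bar{y} = (O_{1,1}, O_{1,2}, O_{1,3}, \ldots, O_{N,3})^T

% Least-squares solution
\bar{b} = \left(X^T X\right)^{-1} X^T \bar{y}
```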
This is what is happening behind that formula:
Multiplying the transposed matrix by the original matrix creates a square matrix, reduced to a lower dimension ([9x12] · [12x9] = [9x9]).
The inverse of this result allows us to solve for b̅.
Multiplying the vector y̅ by the transposed X reduces the y̅ vector to a lower [9x1] dimension. Then, by multiplying the [9x9] inverse by the [9x1] vector, we solve the system for b̅.
Now, we take the [9x1] result vector and create a matrix from it. This is our approximated transformation matrix.
A python code:
import numpy as np
import numpy.linalg
INPUTS = [[5,6,2],[1,7,3],[2,6,5],[1,7,5]]
OUTPUTS = [[3,7,1],[3,7,1],[3,7,2],[3,7,2]]
def get_mat(inputs, outputs, entry_len):
    n_of_vectors = len(inputs)
    noe = n_of_vectors*entry_len  # Number of equations
    # We need to construct the input matrix.
    # We need to linearize the matrix. So we will flatten the matrix array such as [a11, a12, a21, a22]
    # So for each row we combine the row's variables with each input vector.
    X_mat = []
    for in_n in range(0, n_of_vectors):  # For each input vector
        # Populate all matrix flattened variables. For 2x2 matrix - 4 variables, for 3x3 - 9 variables and so on.
        base = 0
        for col_n in range(0, entry_len):  # Each original unknown matrix's row must be matched to all entries in the input vector
            row = [0 for i in range(0, entry_len ** 2)]
            for entry in inputs[in_n]:
                row[base] = entry
                base += 1
            X_mat.append(row)
    Y_mat = [item for sublist in outputs for item in sublist]
    X_np = np.array(X_mat)
    Y_np = np.array([Y_mat]).T
    solution = np.dot(np.dot(numpy.linalg.inv(np.dot(X_np.T,X_np)),X_np.T),Y_np)
    var_mat = solution.reshape(entry_len, entry_len)  # Create square matrix
    return var_mat
transf_mat = get_mat(INPUTS, OUTPUTS, 3)  # 3 means 3x3 matrix, and in/out vector size 3
print(transf_mat)
for i in range(0, len(INPUTS)):
    o = np.dot(transf_mat, np.array([INPUTS[i]]).T)
    print(f"{INPUTS[i]} x [M] = {o.T} ({OUTPUTS[i]})")
The output is as such:
[[ 0.13654096 0.35890767 0.09530002]
[ 0.31859558 0.83745124 0.22236671]
[ 0.08322497 -0.0526658 0.4417611 ]]
[5, 6, 2] x [M] = [[3.02675088 7.06241873 0.98365224]] ([3, 7, 1])
[1, 7, 3] x [M] = [[2.93479472 6.84785436 1.03984767]] ([3, 7, 1])
[2, 6, 5] x [M] = [[2.90302805 6.77373212 2.05926064]] ([3, 7, 2])
[1, 7, 5] x [M] = [[3.12539476 7.29258778 1.92336987]] ([3, 7, 2])
You can see that it took all the specified inputs, computed the transformed outputs, and compared them to the reference vectors. The results are not exact, since we have an approximation from the overdetermined system. If we used INPUTS and OUTPUTS with only 3 (linearly independent) vectors, the result would be exact.
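As a cross-check, the same fit can be obtained without building the 12x9 matrix by hand, because each row of M is an independent least-squares problem. The sketch below uses np.linalg.lstsq on the same INPUTS and OUTPUTS; it is an alternative route to the same matrix, not the code that produced the output above.

```python
import numpy as np

INPUTS = np.array([[5, 6, 2], [1, 7, 3], [2, 6, 5], [1, 7, 5]], dtype=float)
OUTPUTS = np.array([[3, 7, 1], [3, 7, 1], [3, 7, 2], [3, 7, 2]], dtype=float)

# Each row of INPUTS is one input vector, so we solve INPUTS @ M.T ≈ OUTPUTS.
M_T, residuals, rank, singular_values = np.linalg.lstsq(INPUTS, OUTPUTS, rcond=None)
M = M_T.T

# The same solution via the normal equations, for comparison.
M_normal = (np.linalg.inv(INPUTS.T @ INPUTS) @ INPUTS.T @ OUTPUTS).T

fitted = INPUTS @ M.T   # approximations of the rows of OUTPUTS
print(M)
```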