Seq2seq multiple input features (Passing multiple word/word tokens as input) - tensorflow

Is there a way to pass extra feature tokens along with the existing word token (the training features/source vocabulary) and feed them to the encoder RNN of seq2seq? Currently it accepts only one word token from the sentence at a time.
Let me put this more concretely. Consider the example of machine translation/NMT: say I have 2 more feature columns for the corresponding source vocabulary set (Feature1 here). For example, consider the table below:
+----------+----------+----------+
| Feature1 | Feature2 | Feature3 |
+----------+----------+----------+
| word1    | x        | a        |
| word2    | y        | b        |
| word3    | y        | c        |
| ...      |          |          |
+----------+----------+----------+
To summarise: currently the seq2seq dataset is a parallel corpus with a one-to-one mapping between the source feature (the vocabulary, i.e. Feature1 alone) and the target (label/vocabulary). I'm looking for a way to map more than one feature (i.e. Feature1, Feature2, Feature3) to the target (label/vocabulary).
Moreover, I believe this is glossed over in the seq2seq PyTorch tutorial (https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb), quoted below:
When using a single RNN, there is a one-to-one relationship between inputs and outputs. We would quickly run into problems with different sequence orders and lengths that are common during translation… With the seq2seq model, by encoding many inputs into one vector, and decoding from one vector into many outputs, we are freed from the constraints of sequence order and length. The encoded sequence is represented by a single vector, a single point in some N dimensional space of sequences. In an ideal case, this point can be considered the "meaning" of the sequence.
Furthermore, I tried TensorFlow, and it took me a lot of time to debug and make appropriate changes, and I got nowhere. I have heard from my colleagues that PyTorch would have the flexibility to do this and would be worth checking out.
Please share your thoughts on how to achieve this in TensorFlow or PyTorch. It would be great if anyone could explain how to practically implement/get this done. Thanks in advance.
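A common way to handle this, in either framework, is to give each feature column its own embedding table and concatenate the per-timestep embeddings before feeding them to the encoder RNN, so the encoder input size becomes the sum of the embedding sizes. A minimal NumPy sketch of the idea, with made-up vocabulary and embedding sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes: word vocabulary plus two small feature vocabularies
vocab_size, f2_size, f3_size = 1000, 5, 10
d_word, d_feat = 64, 8

# one embedding table per input column (Feature1 = word, Feature2, Feature3)
E_word = rng.normal(size=(vocab_size, d_word))
E_f2 = rng.normal(size=(f2_size, d_feat))
E_f3 = rng.normal(size=(f3_size, d_feat))

# a 3-step sentence: one (word_id, feature2_id, feature3_id) triple per step
words = np.array([4, 17, 256])
f2 = np.array([0, 1, 1])
f3 = np.array([0, 1, 2])

# concatenate the per-timestep embeddings; each encoder input now has
# size d_word + 2 * d_feat = 80 instead of d_word alone
enc_input = np.concatenate([E_word[words], E_f2[f2], E_f3[f3]], axis=1)
print(enc_input.shape)  # (3, 80)
```

In TensorFlow or PyTorch the same idea uses trainable embedding layers (e.g. tf.nn.embedding_lookup or torch.nn.Embedding), with the concatenated tensor passed to the encoder cell in place of the single word embedding.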

Related

Person recognition using ML.NET/TensorFlow

I am a noob in ML. I have a Person table like this:
-----------------------------------
User
-----------------------------------
UserId | UserName | UserPicturePath
1      | MyName   | MyName.jpeg
Now I have tens of millions of persons in my database. I want to train my model to predict the UserId from an image (png/jpeg/tiff) given as bytes. So the input will be images, and the output I am looking for is the UserId.
Well, this is nothing but a mapping problem, specifically an id-to-face mapping problem, and neural nets excel at exactly this.
As you have understood by now, you can do this using tensorflow, pytorch or any library of the same purpose.
But if you want to use tensorflow, read on for ready code at the end. The easiest way to achieve your task is transfer learning: load a pretrained model, freeze all but the last layer, and then fine-tune the network to produce a latent vector for a given face image. You can then save this vector in a database and map it to an id.
Then, whenever there is a new image and you want to predict an id for it, you run the image through the network, get its vector, and compute the cosine similarity with the vectors in your database. If the highest similarity is above some threshold, you have found your id.
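The lookup step might look like this; the database, vectors, and threshold below are made-up placeholders:

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical database mapping UserId -> stored face embedding
db = {1: np.array([0.9, 0.1, 0.0]),
      2: np.array([0.0, 1.0, 0.2])}

query = np.array([0.8, 0.2, 0.0])  # embedding produced for a new image
threshold = 0.8                    # tune on held-out data

best_id, best_sim = max(((uid, cosine_similarity(query, v)) for uid, v in db.items()),
                        key=lambda t: t[1])
predicted = best_id if best_sim > threshold else None  # None = unknown person
```

In production, with tens of millions of vectors, you would use an approximate nearest-neighbour index rather than a linear scan, but the matching logic is the same.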
There are many ways to go about this. You will certainly have to preprocess your data, and augment it as well, but if you want some ready code to play with, have a look at this famous Happy House tutorial from Andrew Ng and his team:
https://github.com/gemaatienza/Deep-Learning-Coursera/blob/master/4.%20Convolutional%20Neural%20Networks/Keras%20-%20Tutorial%20-%20Happy%20House%20v2.ipynb
This should cover your needs.
Hope it helps!

Stan vs PYMC3 for Discrete Mixture Models

I am studying zero-inflated count temporal data. I have built a Stan model that deals with the zero inflation via an if statement in the model block, as advised in the Stan Reference Guide, e.g.:
model {
  for (n in 1:N) {
    if (y[n] == 0)
      target += log_sum_exp(bernoulli_lpmf(1 | theta),
                            bernoulli_lpmf(0 | theta) + poisson_lpmf(y[n] | lambda));
    else
      target += bernoulli_lpmf(0 | theta) + poisson_lpmf(y[n] | lambda);
  }
}
This if statement is necessary because Stan uses NUTS as its sampler, which does not handle discrete variables (so we marginalise over the discrete random variable instead of sampling it). I have not had much experience with pymc3, but my understanding is that it can use a Gibbs update step (to sample from the discrete Bernoulli likelihood) and then, conditioned on the zero-inflation indicator, perform a Metropolis or NUTS update for the parameters that depend on the Poisson likelihood.
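For reference, the marginalisation that the Stan model block performs can be written out in plain Python as a per-observation log-likelihood (theta and lambda below are just placeholder values):

```python
import math

def zip_log_lik(y, theta, lam):
    """Zero-inflated Poisson log-likelihood for one observation,
    marginalising over the latent Bernoulli indicator."""
    # log Poisson pmf: y*log(lam) - lam - log(y!)
    log_pois = y * math.log(lam) - lam - math.lgamma(y + 1)
    if y == 0:
        # log_sum_exp over the structural-zero and Poisson-zero branches
        a = math.log(theta)                # bernoulli_lpmf(1 | theta)
        b = math.log1p(-theta) + log_pois  # bernoulli_lpmf(0 | theta) + poisson_lpmf
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))
    return math.log1p(-theta) + log_pois

# e.g. theta = 0.5, lambda = 1.0
print(zip_log_lik(0, 0.5, 1.0))  # equals log(0.5 + 0.5 * exp(-1))
```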
My question is: can pymc3 be used (and if so, how) to sample the discrete zero-inflated variable while the updates to the continuous variables are performed with NUTS? If it can, is the performance significantly improved over the above Stan implementation (which marginalises out the discrete random variable)? Further, if pymc3 can only support a Gibbs + Metropolis update, is the move away from NUTS worth considering?

Training with tensorflow seq2seq model

I am currently working with LSTMs. I have a dataset of sentences containing transactional info, and I want to extract information such as the amount, date, and transactionWith. I have already tried a basic LSTM where my system tried to label each word of a given sequence as amount, date, transactionWith, or irrelevant.
I have made my training data like this:
Input:
You gave 100.00 to John on 13-08-2018
Target:(labelled every word)
ir ir amount ir transactionWith ir date
You can see that the dataset has a lot of "ir" (irrelevant) tags, and I think that will bias my system towards predicting "ir" on test data.
Now I want to try the seq2seq model of tensorflow, where the input is a transactional sentence and the target is a sequence of the extracted information. An example would look like this:
Input:
You gave 100.00 to John on 13-08-2018.
Target:
100.00 13-08-2018 John
Here all my target sequences will maintain a fixed format: the first token is the amount, the second is the date, the third is the transactionWith, etc.
Can I do this like a language translation model, with an encoder for the input sequence and a decoder for the target sequence? And how can I make sure that the predicted sequence for test data comes from within the vocabulary of the given input sentence, rather than from the entire target vocabulary?
Thanks to all the awesome people in advance. :)

How do I generate Tensorflow Confusion Matrix from "output_graph.pb" and "output_labels.txt"?

So I have followed this tutorial and retrained using my own images.
https://www.tensorflow.org/tutorials/image_retraining
So I now have an "output_graph.pb" and a "output_labels.txt" (which I can use with other code to classify images).
But how do I actually generate a confusion matrix using a folder of test images (or at least the images it was trained on)?
There is https://www.tensorflow.org/api_docs/python/tf/confusion_matrix, but that doesn't seem very helpful.
This thread seems to just be using numbers to represent labels rather than actual files, but I'm not really sure: how to create confusion matrix for classification in tensorflow
And I'm not really sure how to use the code in this thread either:
How do i create Confusion matrix of predicted and ground truth labels with Tensorflow?
I would try to create your confusion matrix manually, using steps like these:
Modify the label_image example to print out just the top label.
Write a script to call the modified label_image repeatedly for all images in a folder.
Have the script print out the ground truth label, and then call label_image to print the predicted one.
You should now have a text list of all your labels in the console, something like this:
apple,apple
apple,pear
pear,pear
pear,orange
...
Now, create a spreadsheet with both row and column names for all the labels:
        | apple | pear | orange
--------+-------+------+--------
apple   |
pear    |
orange  |
The value of each cell is the number of (row, column) pairs that show up in your console list. For a small set of images you can count these manually, or you can write a script to calculate them if there are too many.
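Assuming you have parsed the console output into (ground truth, predicted) pairs, a short script can do the tallying; the pairs below are just the toy example from above:

```python
from collections import Counter

# (ground truth, predicted) pairs parsed from the console output
pairs = [("apple", "apple"), ("apple", "pear"),
         ("pear", "pear"), ("pear", "orange")]
labels = ["apple", "pear", "orange"]

counts = Counter(pairs)
# rows = ground truth, columns = prediction
matrix = [[counts[(truth, pred)] for pred in labels] for truth in labels]

for lab, row in zip(labels, matrix):
    print(lab, row)
```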

Understanding this application of a Naive Bayes Classifier

I'm a little confused with this example I've been following online. Please correct me if anything is wrong before I get to my question! I know Bayes theorem is this:
P(A|B) = P(B|A) * P(A) / P(B)
In the example I'm looking at, classifying is being done on text documents. The text documents are all either "terrorism" or "entertainment", so:
Prior probability for either, i.e. P(A) = 0.5
There are six documents with word frequencies like so:
The example goes on to break down the frequency of these words in relation to each class, applying Laplace estimation:
So to my understanding each of these numbers represents the P(B|A), i.e. the probability of that word appearing given a particular class (either terrorism or entertainment).
Now a new document arrives, with this breakdown:
The example calculates the probability of this new text document relating to terrorism by doing this:
P(Terrorism | W) = P(Terrorism) x P(kill | Terrorism) x P(bomb | Terrorism) x P(kidnap | Terrorism) x P(music | Terrorism) x P(movie | Terrorism) x P(TV | Terrorism)
which works out as:
0.5 x 0.2380 x 0.1904 x 0.3333 x 0.0476 x 0.0952 x 0.0952
Again, up to now I think I'm following. P(Terrorism | W) is P(A|B), P(Terrorism) = P(A) = 0.5, and P(B|A) is all the results for "terrorism" in the above table multiplied together.
But to apply it to the new document, the example raises each of the P(B|A) values above to the power of the word's new frequency. So the above calculation becomes:
0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1
From there they do a few sums which I get and find the answer. My question is:
Where in the formula does it say to apply the new frequency as a power to the current P(B|A)?
Is this just something statistical I don't know about? Is it universal, or just particular to this example? I'm asking because all the examples I find are slightly different, using slightly different keywords and terms, and I'm finding it just a tad confusing!
First of all, the formula
P(Terrorism | W) = P(Terrorism) x P(kill | Terrorism) x P(bomb | Terrorism) x P(kidnap | Terrorism) x P(music | Terrorism) x P(movie | Terrorism) x P(TV | Terrorism)
isn't quite right. You need to divide that by P(W). But you hint that this is taken care of later when it says that "they do a few sums", so we can move on to your main question.
Traditionally, when doing Naive Bayes on text classification, you only look at the existence of words, not their counts. Of course you need the counts to estimate P(word | class) at train time, but at test time P("music" | Terrorism) typically means the probability that the word "music" is present at least once in a Terrorism document.
It looks like the implementation you are dealing with is trying to take into account P("occurrences of kill" = 2 | Terrorism), which is different from P("at least 1 occurrence of kill" | Terrorism). So why do they end up raising probabilities to powers? Their reasoning seems to be that P("kill" | Terrorism) (estimated at train time) represents the probability of an arbitrary word in a Terrorism document being "kill". By a simplifying independence assumption, the probability of a second arbitrary word in a Terrorism document being "kill" is also P("kill" | Terrorism).
This leaves a slight problem for the case where a word does not occur in a document. Under this scheme, the corresponding probability is raised to the 0th power; in other words, it goes away. That amounts to approximating P("occurrences of music" = 0 | Terrorism) = 1. Strictly speaking this is false in general, since it would imply P("occurrences of music" > 0 | Terrorism) = 0.
But in real-world examples, where you have long documents and thousands or tens of thousands of words, most words don't occur in most documents. So instead of bothering to accurately calculate all those probabilities (which would be computationally expensive), they are basically swept under the rug, because for the vast majority of cases it wouldn't change the classification outcome anyway.
Also note that, on top of being computationally intensive, the exact calculation is numerically unstable: if you multiply thousands or tens of thousands of numbers less than 1 together, you will underflow and get 0, and even in log space you are adding tens of thousands of numbers together, which has to be handled delicately for numerical stability. So the "raising to a power" scheme inherently removes unnecessary fluff, decreasing computational cost, increasing numerical stability, and still yielding nearly identical results.
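For what it's worth, the "counts as exponents" scheme from the question is easy to write out directly; the per-word probabilities below are copied from the example's Laplace-smoothed table and the prior is 0.5:

```python
# P(word | Terrorism) from the example's table
p_word = {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
          "music": 0.0476, "movie": 0.0952, "TV": 0.0952}
# word counts in the new document
counts = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "TV": 1}

score = 0.5  # prior P(Terrorism)
for word, c in counts.items():
    score *= p_word[word] ** c  # absent words contribute p**0 = 1

print(score)  # unnormalised P(Terrorism | W), roughly 5.7e-05
```

Dividing this score by the corresponding score for the Entertainment class plus itself (i.e. normalising by P(W)) would give the posterior probability.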
I hope the NSA doesn't think I'm a terrorist for having used the word Terrorism so much in this answer :S