Keras/TensorFlow: How do I transform text to use as input? - file-io

I've been reading tutorials for the last few days, but they all seem to start at the step of "I have my data from this pre-prepared data set, let's go".
What I'm trying to do is take a set of emails I've tokenized, and figure out how to get them into a model as the training and evaluation data.
Example email:
0 0 0 0 0 0 0 0 0 0 0 0 32192 6675 16943 1380 433 8767 2254 8869 8155
I have a folder of emails (one file per email) for each spam and not spam:
/spam/
93451.txt
...
/not-spam/
112.txt
...
How can I get Keras to read this data?
Alternatively, how can I generate a CSV or some other format that it wants to use to input it?

There are many ways to do this, but I'll list them in this order:
1. Create a dictionary of all the words in the dataset and assign a token to each of them. When feeding the network, you can convert each token into a one-hot encoded form.
2. Convert the input text by feeding it to a pretrained word-embeddings model such as GloVe or word2vec and obtain an embedding vector.
3. Use the one-hot vectors from 1 and train your own embeddings.
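Approach 1 can be sketched in plain Python (Keras also ships a Tokenizer class for this); build_vocab and one_hot are hypothetical helper names for illustration:

```python
# Sketch of approach 1: build a word -> token dictionary over the whole
# dataset, then turn each token into a one-hot vector.

def build_vocab(texts):
    vocab = {}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # 0 is reserved for padding
    return vocab

def one_hot(token, vocab_size):
    vec = [0] * (vocab_size + 1)  # +1 because index 0 is the pad token
    vec[token] = 1
    return vec
```

The same dictionary is then reused at prediction time, so unseen words must be handled (e.g. mapped to the pad/unknown index).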

As I understood from your task description (please correct me if I'm wrong), you need to classify texts into either the spam or the not-spam category.
Basically, if you want to create a universal text-classification input solution, your data input stage should contain 3 steps:
1. Read the list of folders ("spam", "not-spam" in your case) and iterate over each folder to get the list of files.
At the end you should have:
a) a dictionary mapping (label_id -> label_name),
so in your case (0 -> spam, 1 -> not_spam);
b) a list of (file_content, label) pairs.
As you can see, this is out of scope of both Keras and TensorFlow; it is plain Python code.
2. For each (file_content, label) pair, process the first element; that is usually the most interesting part.
In your example I can see 0 0 0 0 0 0 0 0 0 0 0 0 32192 6675 16943 1380 433 8767 2254 8869 8155. So you already have the word indexes, but they are in text form. All you need is to transform the string into an array of 300 items (the words in your message).
For future text machine-learning projects, I suggest using the raw text as the source and transforming it to word indexes with tf.contrib.learn.preprocessing.VocabularyProcessor.
3. Transform the labels (categories) to one-hot vectors.
At the end of these steps you have a pair of (word_indexes_as_array, label_as_one_hot) for every file.
Then you can use these data as the input for training.
Naturally, you would split these pairs into two sets, treating the first 80% of the data as the training set and the remaining 20% as the test set (please do not focus on 80/20; it is just a sample split).
You may look at the text classification with Keras examples. They are rather straightforward and may be helpful for you, as they start from the data input step.
Also, please look at the load_data_and_labels() method in the data input step example. It is a very similar case to yours (positive/negative).
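The three steps above can be sketched like this (the folder layout matches the question; load_dataset and the 300-item padding length are illustrative assumptions):

```python
import os

def load_dataset(root="data", labels=("spam", "not-spam"), seq_len=300):
    # Step 1a: dictionary mapping label_id -> label_name
    label_names = dict(enumerate(labels))
    samples = []
    for label_id, name in label_names.items():
        folder = os.path.join(root, name)
        for fname in os.listdir(folder):
            # Step 2: the file already holds space-separated word indexes,
            # so parse the string into ints and pad/truncate to seq_len.
            with open(os.path.join(folder, fname)) as f:
                indexes = [int(tok) for tok in f.read().split()]
            indexes = indexes[:seq_len] + [0] * max(0, seq_len - len(indexes))
            # Step 3: one-hot encode the label.
            one_hot = [0] * len(labels)
            one_hot[label_id] = 1
            # Step 1b: collect (file_content, label) pairs.
            samples.append((indexes, one_hot))
    return label_names, samples
```

After shuffling, the first 80% of samples can serve as the training set and the rest as the test set.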

Related

Tensorflow-Deeplearning - Correlation between input and output

I'm experimenting with tensorflow for speech recognition.
I have inputs as waveforms and words as output.
The waveform would look like this:
[0,0,0,-2,3,-4,-1,7,0,0,0...0,0,0,20,-11,4,0,0,1,...]
The words would be an array of numbers, where each number represents a word:
[12,4,2,3]
After training, I also want to find out the correlation between input and output for each output label.
For example, I want to know which input neurons/samples are responsible for the first label (here 12).
[0,0.01,0.10,0.99,0.77,0.89,0.99,0.79,0.22,0.11,0...0,0,0,0,0,0,0,0,0,...]
The original values of the input would be replaced with the correlation, where 0 means no correlation and 1 means total correlation.
The goal is to find the position where a word starts.
Is there a function in tensorflow to get this correlation?
Question
I have a sequence of data (X) that I want to translate into another sequence of data (Y) as well as report what part of (X) contributed to (Y).
Answer
This is a well-known problem, and tensorflow.org actually has a fantastic example: neural machine translation with attention.
The example code shows how to translate X (Spanish) into Y (English) and report which part of X contributes to the decision for each part of Y (the attention).
The exact same principle and code can be used to translate X (wave data) into Y (words) and report which part of the wave data contributes to each word via the attention readout.
The attention layer in the example is called attention_layer.
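For intuition, the additive (Bahdanau-style) attention used in that tutorial can be sketched with plain NumPy; the weight matrices here are hypothetical stand-ins for trained parameters, not the tutorial's actual layer:

```python
import numpy as np

def additive_attention(query, values, W1, W2, v):
    """query: (units,) decoder state; values: (T, units) encoder outputs.
    Returns a context vector and the per-timestep attention weights --
    the weights are exactly the 'which input contributed to this output'
    readout described above."""
    score = np.tanh(values @ W1 + query @ W2) @ v   # (T,) one score per input step
    weights = np.exp(score) / np.exp(score).sum()   # softmax over time
    context = weights @ values                      # (units,) weighted sum of inputs
    return context, weights
```

For the speech case, plotting `weights` for each predicted word gives the position in the waveform that word attends to.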

How to embed discrete IDs in Tensorflow?

There are many discrete IDs and I want to embed them to feed into a neural network. tf.nn.embedding_lookup only supports a fixed range of IDs, i.e., IDs from 0 to N. How can I embed discrete IDs in the range 0 to 2^62?
Just to clarify how I understand your question: you want to do something like word embeddings, but instead of words you want to use discrete IDs (not indices). Your IDs can be very large (up to 2^62), but the number of distinct IDs is much smaller.
If we were to process words, we would build a dictionary of the words and feed the indices within the dictionary to the neural network (into the embedding layer). That is basically what you need to do with your discrete IDs too. Usually you'd also reserve one number (such as 0) for previously unseen values. You could also later trim the dictionary to only include the most frequent values and put all others into the same unknown bucket (exactly the same options you would have when doing word embeddings or other NLP).
e.g.:
unknown -> 0
84588271 -> 1
92238356 -> 2
78723958 -> 3
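A minimal sketch of that dictionary step (index_for is a hypothetical helper; the resulting small indices are what you would pass to tf.nn.embedding_lookup):

```python
# Map huge, sparse raw IDs (up to 2^62) onto small dense indices.
# Index 0 is reserved for previously unseen / unknown IDs.
id_to_index = {}

def index_for(raw_id, freeze=False):
    if raw_id in id_to_index:
        return id_to_index[raw_id]
    if freeze:                  # at inference time: unknown bucket
        return 0
    id_to_index[raw_id] = len(id_to_index) + 1
    return id_to_index[raw_id]
```

Set freeze=True once the dictionary is fixed after training, so unseen IDs fall into the unknown bucket instead of growing the table.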

Input images in tensorflow graph as a couple

I want to input the images to my Siamese model as couples. But if I read them from a list of image names, using
file_name_q = tf.train.string_input_producer(string_tensor=name, shuffle=False)
the images are read sequentially and their integrity as a couple is destroyed. Here, name is the list of all the image names, such that indexes 0 and 1 are a pair, 2 and 3 are a pair, and so on.
Any insights?
Consider using tf.train.batch_join, as it maintains the grouping between two tensors. See the documentation at: https://www.tensorflow.org/api_docs/python/io_ops/input_pipeline
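Another option is to make the pairing explicit before any queue sees the names: split the flat list into two aligned lists and feed both (e.g. to tf.train.slice_input_producer, which dequeues matching rows together). A sketch of just the pairing step, with a hypothetical name list:

```python
# Even indexes hold the first image of each couple, odd indexes the second,
# matching the layout described in the question.
names = ["pair0_a.png", "pair0_b.png", "pair1_a.png", "pair1_b.png"]

lefts = names[0::2]    # first element of each couple
rights = names[1::2]   # second element of each couple
pairs = list(zip(lefts, rights))
```

With shuffle=False on both producers (or a single slice producer over the two lists), each dequeue yields both halves of the same couple.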

Neural network non-linear input

I have a question about the choice of input for my neural network. I have a geographical area that is split into 40 smaller parts, which I wish to give as input to my network. I have labeled them 0-40 and passed them as ints to the network, together with some other parameters, to find a relation. However, the desired results for these area inputs are completely unrelated, so the inputs for areas 1 and 2 are just as different as those for 1 and 25.
Often when I read examples the input value is quite logical: 0 or 1 if the input is a simple true/false alternative, or, if the image is a 32*32 grayscale picture, 1024 input neurons accepting values from 0-255.
In my case, where the 'area' parameter is not linear, what is the proper method to pass it to my network? Or is the whole setup faulty?
I would recommend 40 input variables, each corresponding to exactly one of your 40 areas. Set only the input variable corresponding to the correct area to 1, and all others to 0 (a one-hot encoding).
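A sketch of that encoding (one_hot_area is a hypothetical helper; area IDs are assumed 0-based):

```python
def one_hot_area(area_id, n_areas=40):
    # Exactly one of the 40 input variables fires; the network then
    # learns an independent weight per area, with no false ordering
    # between area 1 and area 2.
    vec = [0.0] * n_areas
    vec[area_id] = 1.0
    return vec
```

The 40-element vector is concatenated with the other input parameters to form the full input row.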

Strict class labels in SVM

I'm using one-vs-all to do a 21-class svm categorization.
I want the label -1 to mean "not in this class" and the label 1 to mean "indeed in this class" for each of the 21 kernels.
I've generated my pre-computed kernels and my test vectors using this standard.
Using easy.py everything went well for 20 of the classes, but for one of them the labels were switched so that all the inputs that should have been labelled with 1 for being in the class were instead labelled -1 and vice-versa.
The difference in that class was that the first vector in the pre-computed kernel was labelled 1, while in all the other kernels the first vector was labelled -1. This suggests that LibSVM relabels all of my vectors.
Is there a way to prevent this or a simple way to work around it?
You already discovered that libsvm treats whatever label it encounters first as the positive class.
The reason is that libsvm allows arbitrary labels and maps them to +1 and -1 according to the order in which they appear in the label vector.
So you can either check this directly, or you can look at the model returned by libsvm.
It contains an entry called Label, which is a vector containing the order in which libsvm encountered the labels. You can also use this information to flip the sign of your scores.
If during training libsvm encounters label A first, then during prediction libsvm will use positive values for assigning objects the label A and negative values for the other label.
So if you use label 1 for the positive class and 0 for the negative, then to obtain the right output values you should do the following trick (MATLAB):
% test_data.y contains 0s and 1s
[labels, ~, values] = svmpredict(test_data.y, test_data.X, model, ' ');
if (model.Label(1) == 0)  % check which label libsvm encountered first
    values = -values;
end