TensorFlow Deep Learning - Correlation between input and output

I'm experimenting with tensorflow for speech recognition.
I have inputs as waveforms and words as output.
The waveform would look like this:
[0,0,0,-2,3,-4,-1,7,0,0,0...0,0,0,20,-11,4,0,0,1,...]
The words would be an array of numbers, where each number represents a word:
[12,4,2,3]
After training I also want to find out the correlation between input and output for each output label.
For example, I want to know which input neurons/samples are responsible for the first label (here 12).
[0,0.01,0.10,0.99,0.77,0.89,0.99,0.79,0.22,0.11,0...0,0,0,0,0,0,0,0,0,...]
The original input values would be replaced by correlation scores, where 0 means no correlation and 1 means total correlation.
The goal is to find the position where a word starts.
Is there a function in tensorflow to get this correlation?

Question
I have a sequence of data (X) that I want to translate into another sequence of data (Y) as well as report what part of (X) contributed to (Y).
Answer
This is a well-known problem, and Tensorflow.org actually has a fantastic example: Neural Machine Translation with Attention.
The example code shows how to translate X (Spanish) into Y (English) and report what part of X contributes to the decision for each part of Y (attention).
The exact same principle and code can be used to translate X (wave data) into Y (words) and report what part of the wave data contributes to each word via the attention readout.
The attention layer in the example is called attention_layer.
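For reference, here is a minimal sketch of such a layer, following the Bahdanau-style pattern that tutorial uses (a simplified sketch, not the tutorial's verbatim code; the unit sizes are up to you). The attention_weights tensor is the per-output-word distribution over input positions, which is exactly the "correlation" readout asked about:

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder state, shape (batch, hidden)
        # values: encoder outputs over the input waveform, shape (batch, time, hidden)
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))
        # attention_weights sums to 1 over the input time axis; it tells you
        # which input positions drive the word currently being decoded.
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights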

Related

How to train data of different lengths in machine learning?

I am analyzing the text of some literary works and I want to look at the distance between certain words in the text. Specifically, I am looking for parallelism.
Since I can’t know the specific number of tokens in a text in advance, I can’t simply use every word in the text as training data, because the length would not be uniform across all training examples.
For example, the text:
“I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today."
Is not the same text length as
"My fellow Americans, ask not what your country can do for you, ask what you can do for your country."
So I could not make a column out of each word and then assign the distance in a row, because the lengths would be different.
How could I go about representing this in training data? I was under the assumption that training data had to be the same type and length.
To solve this problem you can use something called pad_sequences. The process is: first transform the data with a word-embedding or vectorization technique such as TF-IDF (or any other algorithm); after converting the textual data into vectors, use the shape attribute to find the maximum length in your data, and then pass that maximum to the pad_sequences method. Here is how you implement it:
from keras.preprocessing.sequence import pad_sequences
padded_data = pad_sequences(your_data, maxlen=your_max_length, padding='post', truncating='post')
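For example, a minimal sketch (with made-up token ids) of what pad_sequences does to two sentences of different lengths:

from keras.preprocessing.sequence import pad_sequences

# Two tokenized sentences of different lengths (hypothetical ids).
sequences = [[4, 10, 2], [7, 1, 9, 3, 12]]
padded = pad_sequences(sequences, maxlen=6, padding='post', truncating='post')
# padded is now a uniform 2x6 array:
# [[ 4 10  2  0  0  0]
#  [ 7  1  9  3 12  0]]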

LIBSVM Data Preparation: Excel data to LIBSVM format

I want to study how to perform LIBSVM for regression and I'm currently stuck in preparing my data. Currently I have this form of data in .csv and .xlsx format, and I want to convert it into the LIBSVM data format.
So far, I understand that the data should be in this format so that it can be used in LIBSVM:
Based on what I read, for regression, "label" is the target value which can be any real number.
I am doing an electric load prediction study. Can anyone tell me what the label would be in my case? And finally, how should I organize my columns and rows?
The LIBSVM data format is given by:
<label> <index1>:<value1> <index2>:<value2> .........
As you can see, this forms a matrix of [(IndexCount + 1) columns, LineCount rows], more precisely a sparse matrix. If you specify a value for each index, you have a dense matrix; but if you only specify a few indices like <label> <5:value> <8:value>, only indices 5 and 8 (and of course the label) have a custom value, and all other values are set to 0. This is just for notational simplicity and to save space, since datasets can be huge.
For the meaning of the tags, I cite the README file:
<label> is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). For regression, it's any real number. For one-class SVM, it's not used so can be any number.
<index> is an integer starting from 1, <value> is a real number. The indices must be in an ascending order.
As you can see, the label is the data you want to predict. The index marks a feature of your data and its value. A feature is simply an indicator to associate or correlate your target value with, so a better prediction can be made.
Totally Fictional story time: Gabriel Luna (a totally fictional character) wants to predict his energy consumption for the next few days. He found out, that the outside temperature from the day before is a good indicator for that, so he selects Temperature with index 1 as feature. Important: Indices always start at one, zero can sometimes cause strange LIBSVM behaviour. Then, he surprisingly notices, that the day of the week (Monday to Sunday or 0 to 6) also affects his load, so he selects it as a second feature with index 2. A matrix row for LIBSVM now has the following format:
<myLoad_Value> <1:outsideTemperatureFromYesterday_Value> <2:dayOfTheWeek_Value>
Gabriel Luna (he is Batman at night) now captures these data over a few weeks, which could look something like this (load in kWh, temperature in °C, day as mentioned above):
0.72 1:25 2:0
0.65 1:21 2:1
0.68 1:29 2:2
...
Notice that we could leave out 2:0 in the first line because of the sparse matrix format. This would be your training data to train a LIBSVM model. Then we predict the load of tomorrow as follows: you know the temperature of today, let us say 23°C, and today is Tuesday, which is 1, so tomorrow is 2. So this is the line or vector to use with the model:
0 1:23 2:2
Here, you can set the <label> value arbitrarily. It will be overwritten with the predicted value. I hope this helps.
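If your spreadsheet already has the target in one column and the features in the others, the conversion is mechanical. Here is a minimal sketch (assuming the first CSV column is the target load and the remaining columns are the features in index order; the file names are made up):

import csv

def csv_to_libsvm(csv_path, out_path):
    with open(csv_path, newline='') as src, open(out_path, 'w') as dst:
        for row in csv.reader(src):
            label, features = row[0], row[1:]
            # index:value pairs, with indices starting at 1 as LIBSVM requires
            pairs = ' '.join(f'{i}:{v}' for i, v in enumerate(features, start=1))
            dst.write(f'{label} {pairs}\n')

csv_to_libsvm('load_data.csv', 'load_data.libsvm')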

Using tensorflow for sequence tagging : Synced sequence input and output

I would like to use Tensorflow for sequence tagging namely Part of Speech tagging. I tried to use the same model outlined here: http://tensorflow.org/tutorials/seq2seq/index.md (which outlines a model to translate English to French).
Since in tagging the input sequence and output sequence have exactly the same length, I configured the buckets so that input and output sequences have the same length, and tried to learn a POS tagger with this model on ConLL 2000.
However, it seems that the decoder sometimes outputs a tagged sequence shorter than the input sequence (it seems to emit the EOS tag prematurely).
For example:
He reckons the current account deficit will narrow to only # 1.8 billion in September .
The above sentence is tokenized into 18 tokens, which get padded to 20 (due to bucketing).
When asked to decode the above, the decoder spits out the following:
PRP VBD DT JJ JJ NN MD VB TO VB DT NN IN NN . _EOS . _EOS CD CD
So here it ends the sequence (EOS) after 15 tokens, not 18.
How can I force the model to learn that the decoded sequence should be the same length as the encoded one in my scenario?
If your input and output sequences are the same length, you probably want something simpler than a seq2seq model (since handling different sequence lengths is one of its strengths).
Have you tried just training (word -> tag)? See the sketch below.
Note that for something like POS tagging, where there is a clear signal from tokens on either side, you'll definitely get a benefit from a bidirectional net.
If you want to go all crazy, there would be some fun character-level variants too, where you only emit the tag at the token boundary (the rationale being that POS tagging benefits from character-level features, e.g. things like out-of-vocab names). So many variants to try! :D
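To make the (word -> tag) suggestion concrete, here is a minimal sketch of a synced-length tagger with a bidirectional layer (the vocabulary size and tag count are made-up illustrative values, not taken from ConLL 2000):

import tensorflow as tf

vocab_size, num_tags = 10000, 45  # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    # The bidirectional LSTM sees tokens on both sides of each position.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    # One softmax over tags per input token: output length == input length.
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_tags, activation='softmax')),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')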
There are various ways of specifying an end-of-sequence parameter. The translate demo uses a flag <EOS> to determine the end of sequence. However, you can also specify the end of sequence by counting the number of expected words in the output. In lines 225-227 of translate.py:
# If there is an EOS symbol in outputs, cut them at that point.
if data_utils.EOS_ID in outputs:
    outputs = outputs[:outputs.index(data_utils.EOS_ID)]
You can see that outputs are cut off whenever <EOS> is encountered. You can easily tweak this to constrain the number of output words. You might also consider getting rid of the <EOS> flag altogether during training, depending on your application.
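A minimal sketch of that tweak (assuming a variable input_length holds the encoder sequence length inside the decode loop; the variable name is made up):

# Keep exactly as many output tokens as there were input tokens,
# regardless of where (or whether) EOS appears.
outputs = outputs[:input_length]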
I ran into the same problem. In the end I found that the ptb_word_lm.py example in TensorFlow's examples is exactly what we need for tokenization, NER and POS tagging.
If you look into the details of the language model example, you can see that it treats the input token sequence as X and X shifted by one position as Y. This is exactly what fixed-length sequence labeling needs.
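A tiny illustration of that shift-by-one labeling scheme (hypothetical token ids):

tokens = [12, 4, 7, 9, 3]
X = tokens[:-1]  # [12, 4, 7, 9]
Y = tokens[1:]   # [4, 7, 9, 3]  (the target at each step is the next token)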

Strict class labels in SVM

I'm using one-vs-all to do a 21-class svm categorization.
I want the label -1 to mean "not in this class" and the label 1 to mean "indeed in this class" for each of the 21 kernels.
I've generated my pre-computed kernels and my test vectors using this standard.
Using easy.py everything went well for 20 of the classes, but for one of them the labels were switched so that all the inputs that should have been labelled with 1 for being in the class were instead labelled -1 and vice-versa.
The difference in that class was that the first vector in the pre-computed kernel was labelled 1, while in all the other kernels the first vector was labelled -1. This suggests that LibSVM relabels all of my vectors.
Is there a way to prevent this or a simple way to work around it?
You already discovered that libsvm treats whatever label it encounters first as the positive class.
The reason is that it allows arbitrary labels and maps them to +1 and -1 according to the order in which they appear in the label vector.
So you can either check this directly, or you can look at the model returned by libsvm.
It contains an entry called Label, which is a vector containing the order in which libsvm encountered the labels. You can also use this information to switch the sign of your scores.
If during training libsvm encounters label A first, then during prediction
libsvm will use positive values for assigning an object the label A and negative values for the other label.
So if you use label 1 for the positive class and 0 for the negative class, then to obtain the right output values you should do the following trick (Matlab):
% test_data.y contains 0s and 1s
[labels, ~, values] = svmpredict(test_data.y, test_data.X, model, ' ');
if (model.Label(1) == 0)  % check which label libsvm encountered first
    values = -values;
end

Continuous prediction in Google Prediction API?

Is there any announcement about when Google will launch continuous prediction? Currently, is there any trick to predict stock prices using Google's Prediction API?
They announced continuous output for v1.1 today, along with the much-requested multiple category output:
training data submitted to v1.1 with only numbers in the leftmost column will be treated as a continuous output problem (unlike v1)
...
Numerical values in the leftmost column of all rows will automatically return regression values. If you intend to do classification, we recommend encasing those values within double quotes. For example, 5 indicates a regression value of 5, while "5" indicates a category labeled "5".
Yes, it can be done.
The most important factor affecting the accuracy of your predictions is the input parameters that you give.
So, try varying the input training data between different moving averages or other statistical features and see what comes closest to predicting the action to be taken (Buy/Sell).
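For instance, a minimal sketch (with made-up prices) of deriving a moving-average feature column to feed into the Prediction API:

# Hypothetical closing prices and a simple moving-average feature.
prices = [10.2, 10.5, 10.3, 10.8, 11.0, 10.9]

def moving_average(series, window):
    # Average of each trailing window of the given size.
    return [sum(series[i - window:i]) / window
            for i in range(window, len(series) + 1)]

ma3 = moving_average(prices, 3)  # [10.33, 10.53, 10.7, 10.9] (rounded)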