Neural network non-linear input

I have a question about the input choice for my neural network. I have a geographical area that is split into 40 smaller parts which I wish to give as input to my network. I have labeled those from 0-40 and passed them as ints to the network together with some other parameters to find a relation. However, the desired results for these area inputs are completely unrelated, so areas 1 and 2 are just as different as areas 1 and 25.
Often when I read examples the input value is quite logical: 0 or 1 if the input is a simple true/false alternative, or, if the image is a 32*32 grayscale picture, 1024 input neurons accepting values from 0-255.
In my case, when the 'area' parameter is not linear, what is the proper method to pass it to my network? Or is the whole setup faulty?

I would recommend 40 input variables. Every input variable would correspond to exactly one of your 40 areas. You would set only the input variable corresponding to the correct area to 1, and all others to 0 (one-hot encoding).
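For example, a minimal sketch of this one-hot encoding in Python/NumPy (the array names and the two extra parameters are placeholders for illustration):
'''
import numpy as np

NUM_AREAS = 40  # one input variable per area

def encode_area(area_index):
    # Return a length-40 one-hot vector for the given 0-based area index.
    one_hot = np.zeros(NUM_AREAS, dtype=np.float32)
    one_hot[area_index] = 1.0
    return one_hot

# Example: area 25 together with two other (hypothetical) numeric parameters
other_params = np.array([0.7, 3.2], dtype=np.float32)
network_input = np.concatenate([encode_area(25), other_params])
print(network_input.shape)  # (42,) -> 40 one-hot inputs + 2 extra parameters
'''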

Related

How to train data of different lengths in machine learning?

I am analyzing the text of some literary works and I want to look at the distance between certain words in the text. Specifically, I am looking for parallelism.
Since I can't know the specific number of tokens in a text, I can't simply put all the words of the text in the training data, because it would not be uniform across all training samples.
For example, the text:
“I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today."
Is not the same text length as
"My fellow Americans, ask not what your country can do for you, ask what you can do for your country."
So I could not make a column out of each word and then assign the distance in a row, because the lengths would be different.
How could I go about representing this in training data? I was under the assumption that training data had to be the same type and length.
To solve this problem you can use something called pad_sequences. The process is as follows: first transform the data with some text-vectorization technique such as TF-IDF or any other algorithm; after converting the textual data into vectors, use the shape attribute to figure out the maximum length you have, and then use that maximum in the pad_sequences method. Here is how you implement it:
'''
from keras.preprocessing.sequence import pad_sequences
# your_data: list of variable-length sequences; max_len: the maximum length you found
padded_data = pad_sequences(your_data, maxlen=max_len, padding='post', truncating='post')
'''
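For instance, a minimal usage sketch (the token sequences are made up purely for illustration):
'''
from keras.preprocessing.sequence import pad_sequences

# Two tokenized "texts" of different lengths (hypothetical token ids)
sequences = [
    [12, 7, 95, 3, 41, 8],
    [5, 22, 19],
]

max_len = max(len(s) for s in sequences)  # 6 in this example
padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
print(padded)
# [[12  7 95  3 41  8]
#  [ 5 22 19  0  0  0]]
'''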

Number of units in the last dense layer in case of binary classification

My question is related to this one here. I am using the cats and dogs dataset. So there are only these two outcomes. I found two implementations. First one uses:
tf.keras.layers.Dense(1)
as the last layer in the model.
The second implementation uses:
layers.Dense(2)
Now, I don't understand this. Which one is correct? Are they the same, and if so, why (I cannot see why they should be the same)? Or what is the difference? Is the first solution modelling cat or dog, and the second modelling cat, dog, or anything else? Why would that be done if we only have cats and dogs? Which solution should I take?
Both are correct. One is using binary classification and the other is using categorical classification. Let's look at the differences.
Binary Classification: In this case, the output layer has only one neuron. From this single neuron's output, you have to decide whether it's a cat or a dog. You can set any threshold level to classify the output. Let's say cats are labeled as 0 and dogs are labeled as 1, and your threshold value is 0.5. So, if the output is greater than 0.5, then it's a dog, because it's closer to 1; otherwise it's a cat. binary_crossentropy is the loss used in most of these cases.
Categorical Classification: The number of output neurons is exactly the same as the number of classes. This time you're not allowed to label your data as 0 or 1; the label shape should match the output layer. In your case, your output layer has two neurons (one per class), so you will have to label your data in the same way. To achieve this, you have to encode your label data; we call this "one-hot encoding". The cats would be encoded as (1,0) and the dogs as (0,1), for example. Now your prediction will consist of two floating-point numbers. If the first number is greater than the second, it's a cat; otherwise it's a dog. We call these numbers confidence scores. Let's say, for a test image, your model predicted (0.70, 0.30); this means your model is 70% confident that it's a cat and 30% confident that it's a dog. Please note that the values of the output layer depend entirely on the activation of your final layer. To learn more, please read about activation functions.
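A minimal sketch of the two variants in tf.keras (the input shape and hidden layers are placeholders, not taken from the original tutorials):
'''
import tensorflow as tf

# Variant 1: one output neuron, sigmoid activation, binary_crossentropy loss;
# labels are single integers, 0 (cat) or 1 (dog)
binary_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(150, 150, 3)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
binary_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Variant 2: two output neurons, softmax activation, categorical_crossentropy loss;
# labels are one-hot vectors, cat = (1, 0) and dog = (0, 1)
categorical_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(150, 150, 3)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'),
])
categorical_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
'''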

Should my seq2seq RNN idea work?

I want to predict stock price.
Normally, people would feed the input as a sequence of stock prices.
Then they would feed the output as the same sequence but shifted to the left.
When testing, they would feed the output of the prediction back in as the input for the next timestep.
I have another idea, which is to fix the sequence length, for example 50 timesteps.
The input and output are exactly the same sequence.
When training, I replace the last 3 elements of the input with zeros to let the model know that I have no input for those timesteps.
When testing, I would feed the model a sequence of 50 elements where the last 3 are zeros. The predictions I care about are the last 3 elements of the output.
Would this work or is there a flaw in this idea?
The main flaw of this idea is that it does not add anything to the model's learning, and it reduces its capacity, since you force your model to learn an identity mapping for the first 47 steps (50-3). Note that providing 0 as input is equivalent to not providing input to an RNN: a zero input, after multiplication by the weight matrix, is still zero, so the only sources of information are the bias and the output from the previous timestep, both of which are already there in the original formulation. Now for the second add-on, where we have outputs for the first 47 steps: there is nothing to be gained by learning the identity mapping, yet the network will have to "pay the price" for it, as it will need to use weights to encode this mapping in order not to be penalised.
So in short: yes, your idea will work, but it is nearly impossible to get better results this way compared to the original approach (you do not provide any new information, you do not really modify the learning dynamics, yet you limit capacity by requiring an identity mapping to be learned per step; especially since this is an extremely easy thing to learn, gradient descent will discover this relation first, before even trying to "model the future").
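To make the comparison concrete, here is a small NumPy sketch of the two ways of building a training pair (the price series and window length are arbitrary placeholders):
'''
import numpy as np

prices = np.arange(100, dtype=np.float32)  # placeholder "stock price" series
WINDOW = 50

# Original seq2seq setup: the target is the input shifted one step to the left
x_shifted = prices[:WINDOW]           # timesteps 0 .. 49
y_shifted = prices[1:WINDOW + 1]      # timesteps 1 .. 50

# Proposed setup: input and target are the same window,
# but the last 3 inputs are zeroed out, so the model must "fill in" only those steps
x_masked = prices[:WINDOW].copy()
x_masked[-3:] = 0.0
y_masked = prices[:WINDOW]            # the first 47 targets just repeat the inputs
'''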

Keras/TensorFlow: How do I transform text to use as input?

I've been reading tutorials for the last few days, but they all seem to start at the step of "I have my data from this pre-prepared data set, let's go".
What I'm trying to do is take a set of emails I've tokenized, and figure out how to get them into a model as the training and evaluation data.
Example email:
0 0 0 0 0 0 0 0 0 0 0 0 32192 6675 16943 1380 433 8767 2254 8869 8155
I have a folder of emails (one file per email) for each of spam and not-spam:
/spam/
93451.txt
...
/not-spam/
112.txt
...
How can I get Keras to read this data?
Alternatively, how can I generate a CSV or some other format that it wants to use to input it?
There are many ways to do this, but I'll list them in this order (see the sketch after this list):
1. Create a dictionary of all the words in the dataset and assign a token (index) to each of them. When inputting to the network, you can convert these tokens into a one-hot encoded form.
2. Convert the input text by feeding it to a pretrained word embedding model such as GloVe or word2vec and obtain an embedding vector.
3. Use the one-hot vectors from 1 and train your own embeddings.
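For example, a minimal sketch of option 1 using the Keras Tokenizer (the example emails are made-up placeholders):
'''
from keras.preprocessing.text import Tokenizer

# Placeholder emails; in practice you would read these from your spam / not-spam folders
emails = [
    "win a free prize now",
    "meeting moved to friday",
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(emails)                      # builds the word -> index dictionary
sequences = tokenizer.texts_to_sequences(emails)    # each email becomes a list of word indexes
one_hot = tokenizer.texts_to_matrix(emails, mode='binary')  # one row per email, one column per word index
'''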
As I understood from your task description (please correct me if I'm wrong), you need to classify texts into either the spam or the not-spam category.
Basically, if you want to create a universal text data classification input solution, your data input stage code should contain 3 steps:
1. Read the list of folders ("spam", "not spam" in your case) and iterate over each folder to get the list of files.
At the end you should have:
a) a dictionary containing (label_id -> label_name).
So in your case, you would end up with (0 -> spam, 1 -> not_spam).
b) a list of (file_content, label) pairs.
As you understand, this is out of scope of both Keras and TensorFlow; it is typical Python code.
2. For each pair (file_content, label) you should process the first element, and that is usually the most interesting part.
In your example I can see 0 0 0 0 0 0 0 0 0 0 0 0 32192 6675 16943 1380 433 8767 2254 8869 8155. So you already have the indexes of the words, but they are in text form. All you need is to transform the string into an array of 300 items (the words in your message).
For further text machine learning projects, I suggest using raw text data as a source and transforming it to word indexes using tf.contrib.learn.preprocessing.VocabularyProcessor.
3. Transform the labels (categories) to one-hot vectors.
So at the end of these steps you have pairs of (word_indexes_as_array, label_as_one_hot).
Then you can use this data as input for training.
Naturally, you would divide these pairs into two parts, treating the first 80% of the data as the training set and the remaining 20% as the test set (please do not focus on 80/20, it is just a sample split).
You may look at the text classification with Keras examples. They are rather straightforward and may be helpful for you, as they start from the data input step.
Also, please look at the load_data_and_labels() method in the data input step example. It is a very similar case to yours (positive/negative).
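To make those three steps concrete, here is a rough sketch in plain Python/NumPy (the folder layout follows the question; the fixed length of 300 and all helper names are assumptions made for illustration):
'''
import os
import numpy as np

LABELS = {0: "spam", 1: "not-spam"}   # step 1a: label_id -> label_name
MAX_LEN = 300                         # assumed fixed length of a message

def load_pairs(base_dir="."):
    # Step 1b: read every file in ./spam and ./not-spam and pair its content with a label.
    pairs = []
    for label_id, folder in LABELS.items():
        folder_path = os.path.join(base_dir, folder)
        for name in os.listdir(folder_path):
            with open(os.path.join(folder_path, name)) as f:
                pairs.append((f.read(), label_id))
    return pairs

def to_index_array(text):
    # Step 2: "0 0 ... 32192 6675" -> fixed-length integer array, left-padded with zeros.
    indexes = [int(tok) for tok in text.split()][:MAX_LEN]
    padded = np.zeros(MAX_LEN, dtype=np.int64)
    padded[MAX_LEN - len(indexes):] = indexes
    return padded

def to_one_hot(label_id, num_classes=2):
    # Step 3: label id -> one-hot vector.
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label_id] = 1.0
    return vec

pairs = load_pairs()
x = np.stack([to_index_array(content) for content, _ in pairs])
y = np.stack([to_one_hot(label) for _, label in pairs])
# x and y can now be split 80/20 into training and test sets and fed to a Keras model.
'''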

How should the seasons of a year be defined as an input variable for an ANN in MATLAB?

I wanted to define the season of the year (4 seasons) as one of the input variables for a neural network in MATLAB. Can I just use numbers from 1 to 4? Thanks for any suggestion.
Since it is a categorical variable, it's better to use 1-hot coding:
0001: summer
0010: fall
0100: winter
1000: spring
So, your season input will become 4 binary inputs.
Generally there are two ways to do this. The first is to use a single input for the category and scale the integer values, e.g. (1,...,4) for the seasons, to continuous values in the range of the other input variables. However, this approach assumes that there is some hierarchy among the categories, say that spring is 'better' or 'higher' than summer. Since this is not the case, you should instead create one input node for each possible realization of the category, i.e. 4 input variables for the season, all set to '0' except the one for the active category, which is set to '1'.
I would also not advise encoding the integer categorical variable into binary values to reduce the number of required input nodes: you would end up with a correlation bias among categories whose encodings share a '1' in the same position, e.g. for (hot, mild, cold) = ([0,1], [1,0], [1,1]), the encoding for 'cold' shares a bit with both 'hot' and 'mild', which introduces an artificial similarity between them.
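The question is about MATLAB, but the encoding itself is language-agnostic; here is a minimal sketch in Python/NumPy (the season order is an arbitrary choice):
'''
import numpy as np

SEASONS = ["spring", "summer", "fall", "winter"]  # arbitrary fixed order

def season_to_one_hot(season):
    # Map a season name to a 4-element one-hot vector.
    vec = np.zeros(len(SEASONS), dtype=np.float32)
    vec[SEASONS.index(season)] = 1.0
    return vec

print(season_to_one_hot("fall"))  # [0. 0. 1. 0.]
'''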