PCA for Recurrent Neural Networks (LSTM) - Shall I use PCA for target variables too?

I have a seasonal timeseries dataset containing 3 target variables and n feature variables. I am trying to apply a PCA algorithm before feeding the data to a simple LSTM.
The operations I do are the following:
1. Split train - validation - test
2. Standard-scale (force mean=0 & std=1) the train dataset (including target and features)
3. Apply PCA to only the features of the train dataset
4. Transform, through the PCA matrix from step 3, the feature variables of validation and test
Where I get lost: what to do with the validation and test target variables?
... more neural networks pre-processing and building the architecture of the LSTM
My question is: how do I scale / normalize the target variables? Through a PCA too? Through an independent scaler (standard, min-max, etc.)? If I leave the target values unscaled, I get overfitting in my LSTM.
The most disappointing part is that without the PCA, the LSTM I've built shows no overfitting.
Thanks a lot for your help!

I know this comes late...
As far as I know, you should not apply PCA to the target variables; PCA is used to reduce the dimensionality of the feature variables.
Just as you fitted the PCA on the train dataset and then applied it to validation and test, you can do the same with the scaler: fit it on the train targets and reuse it on the validation and test targets.
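For illustration, a minimal sketch with scikit-learn; names like X_train and y_train are placeholders, and n_components is arbitrary here:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit both scalers on the TRAIN split only, then reuse them on the other splits.
feature_scaler = StandardScaler().fit(X_train)   # features: mean=0, std=1
target_scaler = StandardScaler().fit(y_train)    # targets get their own scaler, no PCA

X_train_s = feature_scaler.transform(X_train)
X_val_s = feature_scaler.transform(X_val)
X_test_s = feature_scaler.transform(X_test)

# PCA is fitted on the scaled TRAIN features only, then applied everywhere.
pca = PCA(n_components=10).fit(X_train_s)
X_train_p = pca.transform(X_train_s)
X_val_p = pca.transform(X_val_s)
X_test_p = pca.transform(X_test_s)

# Scale targets with the train-fitted scaler; invert the scaling after predicting.
y_train_s = target_scaler.transform(y_train)
y_val_s = target_scaler.transform(y_val)
# y_pred = target_scaler.inverse_transform(model.predict(X_test_p))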

Related

What's the relationship between Tensorflow's dataflow graph and DNN?

As we know, a DNN is comprised of many layers, each consisting of many neurons that apply the same function to different parts of the input. Meanwhile, if we use Tensorflow to execute a DNN task, we get a dataflow graph generated automatically by Tensorflow, and we can use Tensorboard to visualize it as below. But there is no neuron in the layer. So I wonder: what is the relationship between the Tensorflow dataflow graph and a DNN? When a neuron of a DNN layer is mapped into the dataflow graph, how is it represented? What is the relationship between a neuron in a DNN and a node in Tensorflow (representing an operation)? I just started to learn DNN and Tensorflow, please help me arrange my thoughts. Thanks :)
You have to differentiate between the metaphoric representation of a DNN and its mathematical description. The math behind a classic neuron is the sum of the weighted inputs plus a bias (usually with an activation function applied to this result).
So in this case you have an input vector multiplied by a weight vector (containing trainable variables) and then summed up with a bias scalar (also trainable).
If you now consider a layer of neurons instead of a single one, the weights become a matrix and the bias a vector. So calculating a feed-forward layer is nothing more than a matrix multiplication followed by a vector addition.
This is the operation you can see in your tensorflow graph.
You can actually build your neural network this way without any use of the so-called High Level API, which uses the Layer abstraction (many did this in the early days of tensorflow). For example:
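A minimal sketch of one dense layer from low-level ops (TF 1.x style; the shapes here are made up):
import tensorflow as tf

# One fully connected layer: y = activation(x W + b)
x = tf.placeholder(tf.float32, shape=[None, 4])   # batch of 4-feature inputs
W = tf.Variable(tf.random_normal([4, 3]))         # trainable weight matrix
b = tf.Variable(tf.zeros([3]))                    # trainable bias vector
y = tf.nn.relu(tf.matmul(x, W) + b)               # the MatMul/Add/Relu nodes you see in the graph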
The actual "magic" that tensorflow does for you is calculating and executing the derivatives of this forward pass in order to compute the updates for the weights.

tensorflow - how to use variational recurrent dropout correctly

Tensorflow's dropout wrapper has three different dropout probabilities that can be set: input_keep_prob, output_keep_prob, state_keep_prob.
I want to use variational dropout for my LSTM units, by setting the variational_recurrent argument to true. However, I don't know which of the three dropout probabilities I have to use for variational dropout to function correctly.
Can someone provide help?
According to the paper the variational_recurrent implementation is based on (https://arxiv.org/abs/1512.05287), you can think of the three parameters as follows (note that these are keep probabilities, not drop probabilities):
input_keep_prob - probability of keeping the input connections.
output_keep_prob - probability of keeping the output connections.
state_keep_prob - probability of keeping the recurrent connections.
See the comparison figure in the paper: if you set variational_recurrent to true, you get an RNN similar to the model on the right; otherwise, the one on the left.
The basic differences between the two models are:
- The variational RNN repeats the same dropout mask at each time step for inputs, outputs, and recurrent layers (dropping the same network units at each time step).
- The naive RNN uses different dropout masks at each time step for the inputs and outputs alone (no dropout is used on the recurrent connections, since using different masks there leads to deteriorated performance).
In that figure, coloured connections represent dropped-out connections, with different colours corresponding to different dropout masks. Dashed lines correspond to standard connections with no dropout.
Therefore, if you use a variational RNN you can set all three probability parameters according to your requirement.
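For example, a minimal TF 1.x sketch (the keep probabilities and sizes are illustrative):
import tensorflow as tf

cell = tf.nn.rnn_cell.LSTMCell(num_units=128)
dropout_cell = tf.nn.rnn_cell.DropoutWrapper(
    cell,
    input_keep_prob=0.8,    # keep 80% of input connections
    output_keep_prob=0.8,   # keep 80% of output connections
    state_keep_prob=0.8,    # keep 80% of recurrent connections
    variational_recurrent=True,
    # variational_recurrent=True needs dtype, and input_size when input_keep_prob < 1
    input_size=tf.TensorShape([64]),
    dtype=tf.float32)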
Hope this helps.

Optimizer and Estimator in Neural Networks

When I started with neural networks, it seemed I understood optimizers and estimators well.
Estimators: a Classifier classifies values based on a sample set, and a Regressor predicts values based on a sample set.
Optimizer: Using different optimizers (Adam, GradientDescentOptimizer) to minimise the loss function, which could be complex.
I understand that every estimator comes with a default optimizer internally to minimise the loss.
Now my question is how do they fit in together and optimize the machine training?
Short answer: the loss function links them together.
For example, if you are doing classification, your classifier takes an input and outputs a prediction. You can then calculate your loss from the predicted class and the ground-truth class. The task of your optimizer is to minimize that loss by modifying the parameters of your classifier.
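A rough TF 1.x sketch of how the three pieces connect (the sizes and placeholders are made up):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 10])   # input features
y = tf.placeholder(tf.int64, [None])         # ground-truth class ids

logits = tf.layers.dense(x, 3)               # the classifier producing predictions

# The loss compares the prediction with the ground truth...
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
# ...and the optimizer modifies the classifier's parameters to minimize it.
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)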

DeepLearning Anomaly Detection for images

I am still relatively new to the world of deep learning. I wanted to create a deep learning model (preferably using Tensorflow/Keras) for image anomaly detection. By anomaly detection I mean, essentially, a OneClassSVM.
I have already tried sklearn's OneClassSVM using HOG features from the images. I was wondering if there is some example of how I can do this with deep learning. I looked but couldn't find a single piece of code that handles this case.
The way to do this in Keras is with the KerasRegressor wrapper module (it wraps scikit-learn's regressor interface). Useful information can also be found in the source code of that module. Basically, you first have to define your network model, for example:
from keras.layers import Input, Dense
from keras.models import Model

def simple_model():
    # Input layer
    data_in = Input(shape=(13,))
    # First layer, fully connected, ReLU activation
    layer_1 = Dense(13, activation='relu', kernel_initializer='normal')(data_in)
    # Second layer...etc
    layer_2 = Dense(6, activation='relu', kernel_initializer='normal')(layer_1)
    # Output, single node without activation
    data_out = Dense(1, kernel_initializer='normal')(layer_2)
    # Build and compile model
    model = Model(inputs=data_in, outputs=data_out)
    # You may choose any loss or optimizer function, be careful which you choose
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
Then, pass it to the KerasRegressor builder and fit it with your data:
from keras.wrappers.scikit_learn import KerasRegressor
# Choose your epochs and batch size
regressor = KerasRegressor(build_fn=simple_model, epochs=100, batch_size=64)
# Fit with your data
regressor.fit(data, labels)
You can then make predictions or obtain the model's score:
p = regressor.predict(data_test) #obtain predicted value
score = regressor.score(data_test, labels_test) #obtain test score
In your case, as you need to detect anomalous images among ones that are ok, one approach you can take is to train your regressor by passing anomalous images labeled 1 and normal images labeled 0.
This will make your model return a value closer to 1 when the input is an anomalous image, enabling you to threshold for the desired results. You can think of this output as its R^2 coefficient with respect to the "anomalous model" you trained, with 1 being a perfect match.
Also, as you mentioned, autoencoders are another way to do anomaly detection. For this I suggest you take a look at the Keras blog post Building Autoencoders in Keras, where they explain in detail how to implement them with the Keras library.
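A rough sketch of that idea, assuming flattened 28x28 images and an arbitrary error threshold (x_normal and x_test are placeholder arrays): train on normal images only and flag inputs the model reconstructs poorly.
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

# Tiny dense autoencoder over flattened images (28*28 = 784 inputs).
inp = Input(shape=(784,))
encoded = Dense(32, activation='relu')(inp)
decoded = Dense(784, activation='sigmoid')(encoded)
autoencoder = Model(inp, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# Train on normal images only, so the model learns to reconstruct "normal".
autoencoder.fit(x_normal, x_normal, epochs=50, batch_size=128)

# Score new images by reconstruction error; a large error suggests an anomaly.
recon = autoencoder.predict(x_test)
errors = np.mean(np.square(x_test - recon), axis=1)
is_anomaly = errors > np.percentile(errors, 95)   # illustrative threshold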
It is worth noting that single-class classification is another way of saying regression.
Classification tries to find a probability distribution over the N possible classes, and you usually pick the most probable class as the output (that is why most classification networks use a sigmoid activation on their outputs, as it has range [0, 1]). Its output is discrete/categorical.
Similarly, regression tries to find the model that best represents your data by minimizing the error or some other metric (like the well-known R^2 metric, or coefficient of determination). Its output is a real number/continuous value (which is the reason most regression networks don't use activations on their outputs). I hope this helps, good luck with your coding.

Tensorflow: jointly training CNN + LSTM

There are quite a few examples of how to use LSTMs alone in TF, but I couldn't find any good examples of how to train a CNN + LSTM jointly.
From what I see, it is not quite straightforward how to do such training, and I can think of a few options here:
First, I believe the simplest solution (or the most primitive one) would be to train the CNN independently to learn features and then to train the LSTM on the CNN features without updating the CNN part, since one would probably have to extract and save these features in numpy and then feed them to the LSTM in TF. But in that scenario, one would probably have to use a differently labeled dataset to pretrain the CNN, which eliminates the advantage of end-to-end training, i.e. learning features for the final objective targeted by the LSTM (besides the fact that one has to have these additional labels in the first place).
The second option would be to concatenate all time slices in the batch dimension (4-d tensor), feed it to the CNN, then somehow repack those features into the 5-d tensor needed for training the LSTM, and then apply a cost function. My main concern is whether it is possible to do such a thing. Also, handling variable-length sequences becomes a little bit tricky. For example, in a prediction scenario you would only feed a single frame at a time. Thus, I would be really happy to see some examples if that is the right way of doing joint training. Besides that, this solution looks more like a hack, so if there is a better way to do it, it would be great if someone could share it.
Thank you in advance !
For joint training, you can consider using tf.map_fn, as described in the documentation: https://www.tensorflow.org/api_docs/python/tf/map_fn.
Let's assume that the CNN is built along similar lines as described here: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10.py.
def joint_inference(sequence):
    # Apply the CNN to every frame of the sequence.
    inference_fn = lambda image: inference(image)
    logit_sequence = tf.map_fn(inference_fn, sequence, dtype=tf.float32, swap_memory=True)
    # Run the LSTM over the per-frame CNN features.
    lstm_cell = tf.contrib.rnn.LSTMCell(128)
    outputs, final_state = tf.nn.dynamic_rnn(cell=lstm_cell, inputs=logit_sequence, dtype=tf.float32)
    # Project each LSTM output onto the class scores.
    projection_function = lambda state: tf.contrib.layers.linear(state, num_outputs=num_classes, activation_fn=tf.nn.sigmoid)
    projection_logits = tf.map_fn(projection_function, outputs)
    return projection_logits
Warning: you might have to look into device placement, as described at https://www.tensorflow.org/tutorials/using_gpu, if your model is larger than the memory the GPU can allocate.
An alternative would be to flatten the video batch to create an image batch, do a forward pass through the CNN, and reshape the features for the LSTM.
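A minimal sketch of that alternative (cnn_features is a hypothetical CNN returning a [batch*time, feat_dim] tensor; all dimensions are illustrative):
import tensorflow as tf

# video: [batch, time, height, width, channels]
batch, time = 8, 16
video = tf.placeholder(tf.float32, [batch, time, 64, 64, 3])

# Flatten batch and time so the CNN sees an ordinary image batch.
frames = tf.reshape(video, [batch * time, 64, 64, 3])
features = cnn_features(frames)   # hypothetical CNN, output shape: [batch*time, feat_dim]

# Repack into a sequence for the LSTM: [batch, time, feat_dim].
feat_dim = features.get_shape().as_list()[-1]
sequence = tf.reshape(features, [batch, time, feat_dim])

lstm_cell = tf.contrib.rnn.LSTMCell(128)
outputs, final_state = tf.nn.dynamic_rnn(lstm_cell, sequence, dtype=tf.float32)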