How to optimize a simulation metric with deep learning without target values? - optimization

I am trying to use an RNN model that takes a demand matrix as input and outputs bus routes. The bus routes are then used in a simulation, which spits out a metric of how well the routes performed. The question is: since there is no target value for the bus routes, how do I backpropagate the simulation result?
To explain the question with simple python code:
"""
The model is an RNN that takes 400,24,24 matrix as input
dimension 0 represents time, dimension 1 represents departure bus stop and dimension 2 represents the arrival bus stop. Each value is a count of the number of passengers who departed at a bus stop with an arrival bus stop in mind in a specific time
output is 64,24 matrix which will be reshaped to 8,8,24
dimension 0 is the sequence index, dimension 1 is the index of bus (there are 8 buses), dimension 2 is the softmaxed classifier dimension of 24 different bus stops. From the output, 8 bus stops are picked per bus with a sequence
These sequences are then used for path generations of buses and they are evaluated from a simulation
"""
model.train()
optimizer.zero_grad()
out = model(demand)  # out is 64x24, demand is 400x24x24
demand, performance = simulation(out)  # assume performance is a float
# here `out` has a grad_fn but `performance` does not
loss = SOME_NUMBER - performance
loss = torch.tensor(loss)
# here I need to backpropagate, and this is the confusing part:
# simply calling loss.backward() does nothing because there is no grad_fn
# out.backward() requires a 64x24 gradient computed somehow from one scalar
# metric, and causes complete divergence within a few steps
optimizer.step()

How does the model output represent the bus routes? Maybe you could try a reinforcement learning approach. Take a look at Deep Q-Learning: it basically takes an input vector (the state of the system) and outputs an action (usually represented by an index in your output layer), then it computes the reward of that action and uses it to train the model (without the need for target values).
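For intuition, here is a minimal REINFORCE-style (policy gradient) sketch of how a scalar simulation score can drive a gradient update. It reuses the names model, demand, simulation and optimizer from the question, assumes the simulator can score sampled stop indices, and uses a hypothetical constant baseline:
import torch

out = model(demand)                          # (64, 24) softmax probabilities
probs = out.view(8, 8, 24)                   # sequence x bus x stop distribution
dist = torch.distributions.Categorical(probs=probs)
stops = dist.sample()                        # (8, 8) sampled stop indices per bus
_, performance = simulation(stops)           # black-box simulator, returns a float

# Score-function (REINFORCE) trick: no gradient flows through the simulator;
# the reward only scales the log-probability of the sampled routes.
baseline = 0.0                               # e.g. a running mean of past rewards
advantage = performance - baseline
loss = -dist.log_prob(stops).sum() * advantage

optimizer.zero_grad()
loss.backward()
optimizer.step()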
Here are some resources that might help you get started:
https://towardsdatascience.com/double-deep-q-networks-905dd8325412
https://arxiv.org/pdf/1802.09477.pdf
https://arxiv.org/pdf/1509.06461.pdf
Hope this was useful.
UPDATE
There is a second option: you could define a custom loss function. Generally these functions only take two arguments, the predicted_y and the target_y; in your case there is no target_y, so you could pass a dummy target_y and not use it inside the function (I assume that you could call your simulation process inside that function and return the metric as the "loss"). Here are examples in PyTorch and Keras.
Keras: Make a custom loss function in keras
PyTorch: PyTorch custom loss function
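As a rough PyTorch illustration of that "dummy target" pattern (SimulationLoss is a made-up name; simulation and SOME_NUMBER come from the question), with the caveat that the metric only backpropagates if the simulation itself is built from differentiable torch operations:
import torch
import torch.nn as nn

class SimulationLoss(nn.Module):
    """Custom loss that wraps the simulation metric; target_y is a dummy."""
    def forward(self, predicted_y, target_y=None):
        _, performance = simulation(predicted_y)   # external simulator
        # Caveat: if `simulation` is a black box (non-torch code), the result
        # has no grad_fn and .backward() will not reach the model parameters.
        return SOME_NUMBER - performance

criterion = SimulationLoss()
loss = criterion(out, None)   # dummy target is ignored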

Related

How to add knowledge of previous time steps to RNNs?

I am setting up a single-layer Gated Recurrent Unit (GRU) using Keras for TensorFlow to predict time steps y_t given time steps X_t for a time series t = 1,...,N. As I have knowledge of y at time t-1, how can I feed this to the network? Initially I thought of doing this through the hidden states; however, these do not represent actual values of y, and manually setting them will not improve the network except when the value of y at t-1 is 0 (which corresponds to the default value for uninitialized hidden states).
It is already happening, and you don't have to go out of your way to do it. The hidden states are already carrying this information; yes, the actual values of y are not used directly, but their pattern is being learnt. That is a good thing, because it means your model generalizes well.
If you are having problems with time-series data, consider increasing or decreasing the window size, changing the number of layers and the number of units in them (first judge whether overfitting or underfitting is happening), and employing dropout.

How does TensorFlow calculate gradients *efficiently* from input to loss?

To calculate the derivative of an output layer of size N w.r.t. an input of size M, we need a Jacobian matrix of size M x N. To calculate the complete gradient from the loss to the inputs using the chain rule, we would need a large number of such Jacobians stored in memory.
I assume that TensorFlow does not calculate a complete Jacobian matrix for each step of the graph, but does something more efficient. How does it do it?
Thanks
TensorFlow uses Automatic Differentiation to compute gradients efficiently. Concretely, it defines a computation graph in which nodes are operations and each directed edge represents the partial derivative of a child with respect to its parent. The total derivative of an operation f with respect to x is then given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the operations on the edges.
More specifically, TensorFlow uses reverse differentiation, which involves a forward pass to compute the value of each node in the computation graph, and a backward pass to compute the partial derivative of the function f that we are differentiating with respect to every node in the graph. We need to repeat the backward pass for each dimension of function f, so the computational complexity is O(dim(f))*O(f), where dim(f) is the output dimensionality of function f.
Although this approach is memory intensive (it requires storing the values of all the nodes before running the backward pass), it is very efficient for machine learning, where we typically have a scalar function f (i.e. dim(f)=1).
You might find this resource useful.
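As a small, concrete illustration of reverse-mode differentiation in practice (this sketch assumes the eager tf.GradientTape API from TensorFlow 2.x; in graph-mode TF 1.x the equivalent entry point is tf.gradients):
import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])    # input of "size M"
w = tf.Variable(tf.ones([2, 1]))

with tf.GradientTape() as tape:
    tape.watch(x)                            # also track the non-variable input
    y = tf.matmul(x, w)                      # intermediate nodes are recorded
    loss = tf.reduce_sum(tf.square(y))       # scalar f, i.e. dim(f) = 1

# A single backward pass yields d(loss)/dx and d(loss)/dw without ever
# materializing the full Jacobian of y with respect to x.
dx, dw = tape.gradient(loss, [x, w])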

Multilabel / Multitask / Multiclass Regression in machine learning

My challenge is to train a neural network to recognize certain actions and events for different classes of task (or however you want to call it), given the input.
I see that most inputs/outputs when training neural networks are either 0/1 or in [0,1]. In my scenario, however, I want my input to be integers which can be arbitrarily big, and the same form is expected for the output.
Let me give you an example:
Input:  X = [23, 4, 0, 1233423, 1, 0, 0]
Output: Y = [2, 1, 1]
Now, each element X[i] represents a different property of the same entity.
Let's say X describes a human being:
23 -> maps to a place he/she was born
4 -> maps to a school they graduated
etc.
Each entry Y[i], on the other hand, indicates what the human is most likely to do in 3 different categories (as len(Y) is 3 in this case):
Y[0] = 2 -> maps to eating icecream ( from a variety of other choices )
Y[1] = 1 -> maps to a time of day moment ( morning, noon, afternoon, evening, etc...)
Y[2] = 1 -> maps to a day of the week for example
Now of course, if the task were just a multi-label problem, I would apply a sigmoid on the output layer and use binary_crossentropy as the loss function, but that of course does not work here, because my output is obviously not in [0,1].
Also, I am not really sure what loss function to apply, since I want all classes/subclasses in Y to be correctly predicted. What I am basically saying is that each Y[i] is itself a class of its own.
It would be more accurate if my output had the shape (3, labels_per_class) and the loss function calculated a loss for each of the 3 different classes, optimizing the result so that each of the 3 classes gets the correct label. I am not sure if that is possible, or at least how to do it.
I am really still at the beginning of my neural network knowledge and learning, so clearly I am struggling with this problem.
To put it more simply, it is more or less like an autoencoder, but with integer inputs and outputs. The difference is that in my case the output has a different size from the input, whereas in an autoencoder they are the same.
My solution was to apply a relu at the output layer (and of course relu-like activations on all other layers as well) and binary_crossentropy as the loss function, but the accuracy of the network is very low, around 15%.
For a standard classification you would probably use a dense layer with a number of nodes equal to the number of classes and then apply softmax. The loss would be tf.losses.softmax_cross_entropy. You would use a sigmoid if you want to allow multiple classes, not just one.
Now you have multiple classification tasks. One way to do it is to take the last hidden layer (the one before the one where you do softmax). For each task, add a dense layer with a number of nodes equal to the number of classes for that task and apply softmax. To compute the total loss, just add the per-task losses together.
If the tasks are too different, you may want to have more than one layer for each prediction.
You can also put weights on the different losses if, say, eating ice-cream is a lot more important than getting the time of day right.
Only use relu if the prediction space is continuous. Say, time of day is continuous, but the choice between eating ice-cream, going to work, and watching TV is not. If you use relu, use a loss like L1 (tf.losses.absolute_difference) or L2 (tf.losses.mean_squared_error).
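A rough tf.keras sketch of the shared-trunk, one-softmax-head-per-task idea described above (the layer sizes and the three class counts are hypothetical placeholders):
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(7,))                  # the 7 integer features
h = layers.Dense(64, activation='relu')(inputs)      # shared hidden trunk
h = layers.Dense(64, activation='relu')(h)

# one softmax head per task; the class counts here are made up for illustration
activity = layers.Dense(10, activation='softmax', name='activity')(h)
time_of_day = layers.Dense(4, activation='softmax', name='time_of_day')(h)
weekday = layers.Dense(7, activation='softmax', name='weekday')(h)

model = Model(inputs, [activity, time_of_day, weekday])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',          # one loss per head, summed
    loss_weights=[2.0, 1.0, 1.0])                    # e.g. weight ice-cream higher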

How to compute a per-class parameter from a mini-batch in TensorFlow?

I am starting to learn TensorFlow and I have a seemingly simple modeling question. Suppose I have a C-class problem and data arrives into TensorFlow in mini-batches containing B samples each. Each sample x is a D-dimensional vector that comes with its label y (non-negative integer between 0 and C-1). I want to estimate a class-specific parameter (for example the sample mean) for each class. The estimation takes place after each sample independently undergoes a TensorFlow-defined transformation pipeline. The per-class parameter/sample-mean is then utilized in the computation of other tensors.
Intuitively, I would group the samples in each mini-batch by label, sum-combine them, and add the total of each label group to the corresponding class parameter, with appropriate normalization.
How can I implement such a simple procedure (group by label, perform a per-group operation, then use the labels as indices for writing into a tensor) or an equivalent one, using TensorFlow? What TensorFlow operations do I need to learn about to achieve this? Is it advisable to do it outside TensorFlow?
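One candidate for the group-by-label summation described above is tf.math.unsorted_segment_sum; a minimal sketch of the per-class mean (B, C and D are placeholder sizes, and this uses the TF 2.x-style API):
import tensorflow as tf

B, C, D = 32, 5, 16                      # batch size, number of classes, feature dim
x = tf.random.normal([B, D])             # samples after the transformation pipeline
y = tf.random.uniform([B], maxval=C, dtype=tf.int32)   # labels in [0, C)

# Group-by-label sum, then normalize by the per-class count to get class means.
class_sums = tf.math.unsorted_segment_sum(x, y, num_segments=C)        # (C, D)
class_counts = tf.math.unsorted_segment_sum(tf.ones([B, 1]), y, C)     # (C, 1)
class_means = class_sums / tf.maximum(class_counts, 1.0)               # (C, D)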

TensorFlow RNN sequence training

I'm taking my first steps learning TF and have some trouble training RNNs.
My toy problem goes like this: a network of two LSTM layers plus a dense layer is fed raw audio data and should test whether a certain frequency is present in the sound.
So the network should map, one to one, float (audio data sequence) to float (pre-chosen frequency volume).
I've got this to work in Keras and have seen a similar TFLearn solution, but I would like to implement this in bare TensorFlow in a relatively efficient way.
What I've done:
import tensorflow as tf
# module paths for the older TF releases these RNN APIs come from
from tensorflow.python.ops import rnn, rnn_cell

lstm = rnn_cell.BasicLSTMCell(LSTM_SIZE, state_is_tuple=True, forget_bias=1.0)
lstm = rnn_cell.DropoutWrapper(lstm)
stacked_lstm = rnn_cell.MultiRNNCell([lstm] * 2, state_is_tuple=True)
# `in` is a Python keyword, so the input tensor is called `inputs` here
outputs, states = rnn.dynamic_rnn(stacked_lstm, inputs, dtype=tf.float32)
outputs = tf.transpose(outputs, [1, 0, 2])                    # to time-major
last = tf.gather(outputs, int(outputs.get_shape()[0]) - 1)    # last time step
network = tf.matmul(last, W) + b
# cost function, optimizer etc...
During training I fed this with (BATCH_SIZE, SEQUENCE_LEN, 1) batches and it seems like the loss converged correctly, but I can't figure out how to predict with the trained network.
My (awful lot of) questions:
How do I make this network return a sequence straight from TensorFlow without going back to Python for each sample (feed a sequence and predict a sequence of the same size)?
If I do want to predict one sample at a time and iterate in Python, what is the correct way to do it?
During testing, is dynamic_rnn needed, or is it just used for unrolling for BPTT during training? Why does dynamic_rnn return all the back-propagation step tensors? These are the outputs of each layer of the unrolled network, right?
After some research:
How do I make this network return a sequence straight from TensorFlow without going back to Python for each sample (feed a sequence and predict a sequence of the same size)?
You can use state_saving_rnn:
class Saver():
    def __init__(self):
        self.d = {}
    def state(self, name):
        if not name in self.d:
            return tf.zeros([1, LSTM_SIZE], tf.float32)
        return self.d[name]
    def save_state(self, name, val):
        self.d[name] = val
        return tf.identity('save_state_name')  # <- important for control_dependencies

outputs, states = rnn.state_saving_rnn(stacked_lstm, inx, Saver(),
    ('lstmstate', 'lstmstate2', 'lstmstate3', 'lstmstate4'), sequence_length=[EVAL_SEQ_LEN])
# the 4 states are for the two LSTM layers; each has hidden and CEC variables to restore
network = [tf.matmul(outputs[-1], W) for i in xrange(EVAL_SEQ_LEN)]
One problem is that state_saving_rnn uses rnn() and not dynamic_rnn(), and therefore unrolls EVAL_SEQ_LEN steps at graph-construction time. You might want to re-implement state_saving_rnn with dynamic_rnn if you want to feed in long sequences.
If I do want to predict one sample at a time and iterate in Python, what is the correct way to do it?
You can use dynamic_rnn and supply initial_state. This is probably just as efficient as state_saving_rnn. Look at the state_saving_rnn implementation for reference.
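A rough sketch of that idea, assuming the TF 1.x-era graph API used above and the stacked_lstm from the earlier snippet (the placeholder names are made up):
# one-step graph: a single time step plus the previous state, fed explicitly
c0 = tf.placeholder(tf.float32, [1, LSTM_SIZE])
h0 = tf.placeholder(tf.float32, [1, LSTM_SIZE])
c1 = tf.placeholder(tf.float32, [1, LSTM_SIZE])
h1 = tf.placeholder(tf.float32, [1, LSTM_SIZE])
init_state = (tf.nn.rnn_cell.LSTMStateTuple(c0, h0),
              tf.nn.rnn_cell.LSTMStateTuple(c1, h1))

step_in = tf.placeholder(tf.float32, [1, 1, 1])   # batch=1, time=1, feature=1
step_out, next_state = tf.nn.dynamic_rnn(
    stacked_lstm, step_in, initial_state=init_state, dtype=tf.float32)

# In the Python prediction loop, run (step_out, next_state) in a session and
# feed the returned state arrays back into c0/h0/c1/h1 for the next sample.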
During testing, is dynamic_rnn needed, or is it just used for unrolling for BPTT during training? Why does dynamic_rnn return all the back-propagation step tensors? These are the outputs of each layer of the unrolled network, right?
dynamic_rnn does the unrolling at run time, similarly to how rnn() does it at graph-construction time. I guess it returns all the steps so that you can branch the graph elsewhere, after fewer time steps. In a network that uses [one time-step input * current state -> one output, new state], like the one described above, it's not needed at test time, but it could be used for training with truncated backpropagation through time.