Dynamically find the right amount of timesteps for an LSTM RNN predicting timeseries - tensorflow

I'm fairly new to the world of NNs so it could be that my question is extremely stupid, so sorry in advance:
I'm working on a predictive autoscaler, trying to predict the workload of different, unknown, applications using only historic workload data about this application.
One of my predictors is a LSTM RNN. To make the RNN predict the values in question, I have to define the timestep - that is the amount of Lags I feed in to the RNN to predict the future value (I hope I used the right terms here). Alot of tutorials and literature seems to set the timestep to a value that seems pretty random to me. My Question can be divided in two subquestions:
1. Given I don't know the Timeseries during implementation: Is there any way to compute this value other than trying different values and comparing the confidence of the prediction?
2. How does the Value influence the assumptions the RNN learns about that time series?
I sadly lack of any intuition on what this value influences. To make an example of my confusion:
Given I have a Time Series with a yearly seasonality, but I decide that I will only feed in the data of a week to make the next prediction: Is the Network able to learn this yearly seasonality? Part of me says no because it can't learn that the partial correlation between the timestamp in question and the lag 365 days ago is very high, because it does not have that data, right? Or does it because it has seen the data from a year ago while training and learned that fairly similar pattern and simply applies it now (which is more likely to be right I guess)?
Is my assumption right, that taking too many timestamps into the equation overfits the network?
Can you please help me get a vague understanding of what this parameter influences in the great scheme of things and what properties of a time series should influence my choice of that value?
Thank you so much and stay healthy :)

Related

Issues when modelling LSTM for multi series

I'm a beginner on time series analysis with deep learning, and I have been searching for examples with LSTM in which more than one series (for example one for each city or place) is trained to avoid fitting a model for each one. The main benefit of course is that you have more training data and less computational costs. I have found an interesting code to help modeling this problem with conditional/temporally-static variables (it's called cond-rnn). But wherever I search, it's not clear to me some issues regarding sorting the inputs appropriately.
The context is that I have a target and a set of autoregressive inputs (features, lags, timesteps, wherever you call it), in which data from different series are stack together. RF and GB are outperforming LSTM on this task (with overfitting, even when I use 100k+ samples, dropout or regularization), and I'm not sure if I'm using it appropriately.
It is wrong to stack series together and have the inputs-targets randomly sorted (as in the figure)? Does the LSTM need to receive the inputs temporally sorted?
If they need so, do you have any advice on how to deal with the problem of providing new series (that start from the first time period) to the LSTM training? This answer to a similar problem (but another perspective) suggest to pick "places" as an input column, but I don't think this answer help the questions here I posed.

Why machine learning algorithms focus on speed and not accuracy?

I study ML and I see that most of the time the focus of the algorithms is run time and not accuracy. Reducing features, taking sample from the data set, using approximation and so on.
Im not sure why its the focus since once I trained my model I dont need to train it anymore if my accuracy is high enough and for that if it will take me 1 hours or 10 days to train my model it does not really matter because I do it only 1 time and my goal is to predict as better as I can my outcomes (minimum loss).
If I train a model to differ between cats and dogs I want it to be the most accurate it can be and not the fasted since once I trained this model I dont need to train any more models.
I can understand why models that depends on fasting changing data need this focus of speed but for general training models I dont understand why the focus is on speed.
Speed is relative term. Accuracy is also relative depending on the difficulty of the task. Currently the goal is to achieve human-like performance for application at reasonable costs because this will replace human labor and cut costs.
From what I have seen in reading papers, people usually focus on accuracy first to produce something that works. Then do ablation studies - studies where pieces of the models are removed or modified - to achieve the same performance in less time or memory requirements.
The field is very experimentally validated. There really isn't much of a theory that states why CNN work so well other than that it can model any function given non-linear activations functions. (https://en.wikipedia.org/wiki/Universal_approximation_theorem) There have been some recent efforts to explain why it works well. One I recall is MobileNetV2: Inverted Residuals and Linear Bottlenecks. The explaination of embedding data into a low dimensional space without losing information might be worth reading.

Deep learning basic thoughts

I try to understand the basics of deep learning, lastly reading a bit through deeplearning4j. However, I don't really find an answer for: How does the training performance scale with the amount of training data?
Apparently, the cost function always depends on all the training data, since it just sums the squared error per input. Thus, I guess at each optimization step, all datapoints have to be taken into account. I mean deeplearning4j has the dataset iterator and the INDArray, where the data can live anywhere and thus (I think) doesn't limit the amount of training data. Still, doesn't that mean, that the amount of training data is directly related to the calculation time per step within the gradient descend?
DL4J uses iterator. Keras uses generator. Still the same idea - your data comes in batches, and used for SGD. So, minibatches matter, not the the whole amount of data you have.
Fundamentally speaking it doesn't (though your mileage may vary). You must research right architecture for your problem. Adding new data records may introduce some new features, which may be hard to capture with your current architecture. I'd safely always question my net's capacity. Retrain your model and check if metrics drop.

Cells detection using deep learning techniques

I have to analyse some images of drops, taken using a microscope, which may contain some cell. What would be the best thing to do in order to do it?
Every acquisition of images returns around a thousand pictures: every picture contains a drop and I have to determine whether the drop has a cell inside or not. Every acquisition dataset presents with a very different contrast and brightness, and the shape of the cells is slightly different on every setup due to micro variations on the focus of the microscope.
I have tried to create a classification model following the guide "TensorFlow for poets", defining two classes: empty drops and drops containing a cell. Unfortunately the result wasn't successful.
I have also tried to label the cells and giving to an object detection algorithm using DIGITS 5, but it does not detect anything.
I was wondering if these algorithms are designed to recognise more complex object or if I have done something wrong during the setup. Any solution or hint would be helpful!
Thank you!
This is a collage of drops from different samples: the cells are a bit different from every acquisition, due to the different setup and ambient lights
This kind of problem should definitely be possible. I would suggest starting with a cifar 10 convolutional neural network tutorial and customizing it for your problem.
In future posts you should tell us how your training is progressing. Make sure you're outputting the following information every few steps (maybe every 10-100 steps):
Loss/cost function output, you should see your loss decreasing over time.
Classification accuracy on the current batch of your training data
Classification accuracy on a held out test set (if you've implemented test set evaluation, you might implement this second)
There are many, many, many things that can go wrong, from bad learning rates, to preprocessing steps that go awry. Neural networks are very hard to debug, they are very resilient to bugs, making it hard to even know if you have a bug in your code. For that reason make sure you're visualizing everything.
Another very important step to follow is to save the images exactly as you are passing them to tensorflow. You will have them in a matrix form, you can save that matrix form as an image. Do that immediately before you pass the data to tensorflow. Make sure you are giving the network what you expect it to receive. I can't tell you how many times I and others I know have passed garbage into the network unknowingly, assume the worst and prove yourself wrong!
Your next post should look something like this:
I'm training a convolutional neural network in tensorflow
My loss function (sigmoid cross entropy) is decreasing consistently (show us a picture!)
My input images look like this (show us a picture of what you ACTUALLY FEED to the network)
My learning rate and other parameters are A, B, and C
I preprocessed the data by doing M and N
The accuracy the network achieves on training data (and/or test data) is Y
In answering those questions you're likely to solve 10 problems along the way, and we'll help you find the 11th and, with some luck, last one. :)
Good luck!

Must each tensorflow batch contain a uniform distribution of the inputs for all expected classifications?

This is probably a newbie question but I'm trying to get my head around how training on small batches works.
Scenario -
For the mnist classification problem, let's say that we have a model with appropriate hyerparameters that allow training on 0-9 digits. If we feed it with a small batches of uniform distribution of inputs (that have more or less same numbers of all digits in each batch), it'll learn to classify as expected.
Now, imagine that instead of a uniform distribution, we trained the model on images containing only 1s so that the weights are adjusted until it works perfectly for 1s. And then we start training on images that contain only 2s. Note that only the inputs have changed, the model and everything else has stayed the same.
Question -
What does the training exclusively on 2s after the model was already trained exclusively on 1s do? Will it keep adjusting the weights till it has forgotten (so to say) all about 1s and is now classifying on 2s? Or will it still adjust the weights in a way that it remembers both 1s and 2s?
In other words, must each batch contain a uniform distribution of different classifications? Does retraining a trained model in Tensorflow overwrite previous trainings? If yes, if it is not possible to create small (< 256) batches that are sufficiently uniform, does it make sense to train on very large (>= 500-2000) batch sizes?
That is a good question without a clear answer. In general, the order and selection of training samples has a large impact on the performance of the trained net, in particular in respect to the generalization properties it shows.
The impact is so strong, actually, that selecting specific examples, and ordering them in a particular way to maximize performance of the net even constitutes a genuine research area called `curriculum learning'. See this research paper.
So back to your specific question: You should try different possibilities and evaluate each of them (which might actually be an interesting learning exercise anyways). I would expect uniformly distributed samples to generalize well over different categories; samples drawn from the original distribution to achieve the highest overall score (since, if you have 90% samples from one category A, getting 70% over all categories will perform worse than having 99% from category A and 0% everywhere else, in terms of total accuracy); other sample selection mechanisms will show different behavior.
An interesting reading about such questions is Bengio's 2012 paper Practical Recommendations for Gradient-Based Training of Deep
Architectures
There is a section about online learning where the distribution of training data is unknown. I quote from the original paper
It
means that online learners, when given a stream of
non-repetitive training data, really optimize (maybe
not in the optimal way, i.e., using a first-order gradient
technique) what we really care about: generalization
error.
The best practice though to figure out how your dataset behaves under different testing scenarios would be to try them both and get experimental results of how the distribution of the training data affects your generalization error.