Is it possible to estimate the time needed to train a machine learning model given the data size and hardware specification?

I am planning a small TensorFlow image classification project that is expected to run on machines with low processing power, and one of the concerns I was asked about was the time needed to train the model.
The project is still in the conception stage and no clear boundaries have been set.
But assuming that we use TensorFlow for Python with a simple neural network on a data set of, say, n images, is there a way to estimate or predict the time required to train the model on the given hardware before performing the training?
I asked one of my colleagues who works on neural networks, and he said that maybe we could calculate the time needed by measuring the time for the first epoch and estimating how many epochs are needed afterwards. Is this a valid approach? If yes, is it even possible to estimate the number of epochs needed? And in either case, is there a way to calculate it before performing any training?

There is no definitive way of finding the number of epochs after which the model converges. It is one of the hyperparameters.
Apart from the type of model you are training, convergence also depends on the distribution of the data and the optimizer you are using.
You can make a rough estimate by looking at the number of parameters in your model, measuring the time for one epoch, and drawing on experience for a rough idea of the number of epochs. However, you always have to look at the training and validation loss curves to check for convergence.
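The rough estimate described above can be sketched in a few lines of Python. This is only a minimal sketch: `time_one_epoch`, `estimate_training_time`, and the epoch range are hypothetical names and numbers chosen for illustration, and `run_epoch` stands for whatever callable trains your model for one epoch. Note that the first epoch may include one-off setup cost, so timing a later epoch may be more representative.

```python
import time


def time_one_epoch(run_epoch):
    """Measure the wall-clock seconds taken by one call to run_epoch()."""
    start = time.perf_counter()
    run_epoch()
    return time.perf_counter() - start


def estimate_training_time(seconds_per_epoch, min_epochs, max_epochs):
    """Return a (low, high) estimate in seconds for a guessed epoch range.

    The epoch range is a guess from experience, not something computed.
    """
    if min_epochs > max_epochs:
        raise ValueError("min_epochs must not exceed max_epochs")
    return seconds_per_epoch * min_epochs, seconds_per_epoch * max_epochs


# Example: one epoch was measured at 90 seconds and we guess 20 to 50 epochs.
low, high = estimate_training_time(90.0, 20, 50)
print(f"Estimated training time: {low / 60:.0f} to {high / 60:.0f} minutes")
```

The result is a range, not a prediction; the loss curves still decide when training has actually converged.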


Does it make sense to maximize both training and validation accuracy?

While training my CNNs I usually aim to maximize the validation accuracy to 1.0 (i.e. 100%). I know that on the other hand it would not make much sense to aim for a training accuracy of 1.0, because we don't want our model to memorize the training data itself.
However, what about a "mixed" approach: wouldn't it make sense to maximize both training and validation accuracy?

Let's first address what the purpose of validation is:
When we're training a neural net, we are trying to teach it to perform well at a given task for the entire population of input/output pairs in the task. However, it is unrealistic to collect that entire population, especially for high-dimensional inputs such as images. Therefore, we create a training dataset that contains a (hopefully) large amount of that data. We hope that by maximizing performance on the training dataset, we also maximize performance on the entire population. This is called generalization.
How do we know that the neural net is generalizing well? As you mentioned, we don't want it to simply memorize the training data. That is where validation accuracy comes in. We feed data that the neural net did not train on through the network to evaluate its performance. Therefore, the purpose of the validation set is to measure generalization.
You should watch both the training and validation accuracy. The difference between the validation and training accuracy is called the generalization gap, which will tell you how well your neural net is generalizing to new inputs. You want both the training and validation accuracy to be high, and the difference between them to be minimal.
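As a small illustration of the generalization gap, here is a sketch that computes it from two accuracy values and gives a rough label. The function names and the thresholds `gap_tol` and `low_acc` are hypothetical and picked only for the example; what counts as a large gap or a low accuracy depends on the task.

```python
def generalization_gap(train_acc, val_acc):
    """Difference between training and validation accuracy."""
    return train_acc - val_acc


def describe_fit(train_acc, val_acc, gap_tol=0.05, low_acc=0.7):
    """Rough label for a model; gap_tol and low_acc are arbitrary thresholds."""
    if generalization_gap(train_acc, val_acc) > gap_tol:
        return "overfitting"    # training accuracy is well above validation
    if train_acc < low_acc:
        return "underfitting"   # both accuracies are close, but low
    return "generalizing"       # both accuracies are close and high


print(describe_fit(0.99, 0.80))
print(describe_fit(0.92, 0.90))
print(describe_fit(0.55, 0.54))
```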
Technically, if you could do so, that would be great. You wouldn't say a model is overfitting unless there is a gap between validation accuracy and training accuracy; if their values are close, whether both high or both low, the model is not overfitting. Ideally you want high accuracy on all samples (training, validation, and test), but that is only the ideal. In practice you just don't care as much about the training samples.

Why do machine learning algorithms focus on speed and not accuracy?

I study ML and I see that most of the time the focus of the algorithms is run time and not accuracy: reducing features, sampling from the data set, using approximations, and so on.
I'm not sure why that is the focus. Once I have trained my model, I don't need to train it anymore if my accuracy is high enough, so whether training takes 1 hour or 10 days does not really matter, because I do it only once and my goal is to predict my outcomes as well as I can (minimum loss).
If I train a model to distinguish between cats and dogs, I want it to be as accurate as it can be and not the fastest, since once I have trained this model I don't need to train any more models.
I can understand why models that depend on fast-changing data need this focus on speed, but for training models in general I don't understand why the focus is on speed.

Speed is a relative term. Accuracy is also relative, depending on the difficulty of the task. Currently the goal is to achieve human-like performance for applications at reasonable cost, because this will replace human labor and cut costs.
From what I have seen in reading papers, people usually focus on accuracy first to produce something that works. They then do ablation studies (studies where pieces of the model are removed or modified) to achieve the same performance with less time or memory.
The field is validated largely by experiment. There really isn't much of a theory that states why CNNs work so well, other than that they can model any function given non-linear activation functions. (There have been some recent efforts to explain why they work well. One I recall is MobileNetV2: Inverted Residuals and Linear Bottlenecks. Its explanation of embedding data into a low-dimensional space without losing information might be worth reading.)

Training SSD-MOBILENET V1 and the loss does not decrease

I'm new to everything about CNNs and TensorFlow. I have been training a pretrained model with ssd-mobilenev1-pets.config to detect columns of buildings for about one day, but the loss stays between 1 and 2 and has not decreased for the last 10 hours.
I realized that my input images are 128x128 and SSD resizes the images to 300x300.
Does the size of the input images affect the training?
If that is the case, should I retrain the network with larger input images? Or what would be another option to decrease the loss? My training dataset has 660 images and my test dataset has 166. I don't know whether that is enough images.
I really appreciate your help.

Loss values of ssd_mobilenet can be different from those of faster_rcnn. From EdjeElectronics' TensorFlow Object Detection Tutorial:
For my training on the Faster-RCNN-Inception-V2 model, it started at
about 3.0 and quickly dropped below 0.8. I recommend allowing your
model to train until the loss consistently drops below 0.05, which
will take about 40,000 steps, or about 2 hours (depending on how
powerful your CPU and GPU are). Note: The loss numbers will be
different if a different model is used. MobileNet-SSD starts with a
loss of about 20, and should be trained until the loss is consistently
under 2.
Fleury and Fleury (2018) add:
The SSD Mobilnet architecture demands additional training to suffice
the loss accuracy values of the R-CNN model, however, offers
practicality, scalability, and easy accessibility on smaller devices
which reveals the SSD model as a promising candidate for further
assessment (Fleury and Fleury, 2018).
For more information: Fleury, D. & Fleury, A. (2018). Implementation of Regional-CNN and SSD machine learning object detection architectures for the real time analysis of blood borne pathogens in dark field microscopy. MDPI AG.
I would recommend holding out 15%-20% of the images for testing, chosen so that they cover all the variety present in the training data. You said you have 650+ images for training and 150+ for testing, so the test set is roughly 20% of the total (about 25% of the size of the training set). It looks like you have enough images to start with. The more the merrier, but make sure your model also has sufficient data to learn from!
Resizing the images does not contribute to the loss. It makes sure there is consistency across all images for the model to recognize them without bias. The loss has nothing to do with image resizing as long as every image is resized identically.
You may have to stop and resume from checkpoints repeatedly if you want your model to fit well. Usually, you can get good accuracy by continuing to train the SSD MobileNet until the loss is consistently under 1. Ideally we want the loss to be as low as possible, but we also want to make sure the model is not overfitting. It is all about trial and error. (A loss between 0.5 and 1 seems to do the job well, but again it depends on your use case.)
I think your model is underperforming because your test data contains variety that is not sufficiently represented in your training data.
The model has not seen enough examples in training to learn the new variations that appear in the test data (for example, your test data may have images of buildings from new angles that are not sufficiently present in the training data). In that case, I recommend putting the full variety of images into the training data and then picking the test images while making sure you still have sufficient training data for those new angles. That is why I recommend taking 15%-20% as test data.
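For reference, a 20% holdout on a dataset of this size could be sketched as follows. This is a minimal sketch with assumptions: `holdout_split` and the file names are hypothetical, and a plain random shuffle does not by itself guarantee that every kind of image (for example every viewing angle) ends up in both sets, so you may still need to check the split by hand or split per category.

```python
import random


def holdout_split(items, test_fraction=0.2, seed=0):
    """Shuffle items reproducibly and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for the 660 + 166 = 826 images.
image_names = [f"img_{i:04d}.jpg" for i in range(826)]
train_names, test_names = holdout_split(image_names, test_fraction=0.2)
print(len(train_names), len(test_names))
```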

Why is the mAP graph not ascending as the number of training steps increases?

I trained my own SSD COCO model with 1000 training pictures and 100 test pictures. I was curious why the mAP is not directly proportional to the number of training steps, and why the mAP is lower at certain training steps, as shown in my plot of mAP against training steps.

Neural network optimizer functions such as gradient descent and its variations attempt to update the weights of your model at each time step in such a way as to get closer to the smallest possible loss. Sometimes the optimizer steps in the wrong direction; sometimes it steps in the right direction, but the step is so big that it goes right past the minimum.
Sophisticated optimizer functions such as Adam seek to minimize this problem by making the steps taken more consistent and also progressively smaller over time.
What you are seeing above is therefore completely normal - i.e. the mAP jumps up and down but over time it increases.
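The overshooting behaviour can be seen in a toy example. The sketch below runs plain gradient descent on the one-dimensional function f(w) = w^2, which is my own choice for illustration and much simpler than a real detection loss. With a small learning rate the weight moves steadily toward the minimum at 0; with a larger one each step jumps past the minimum to the other side, yet the weight still gets closer over time.

```python
def gradient_descent(w0, lr, steps):
    """Minimise f(w) = w**2 with plain gradient descent; return every w visited."""
    w = w0
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w          # derivative of w**2
        w = w - lr * grad
        path.append(w)
    return path


small_steps = gradient_descent(1.0, lr=0.1, steps=5)   # approaches 0 from one side
big_steps = gradient_descent(1.0, lr=0.9, steps=5)     # jumps across 0 at every step
print([round(w, 3) for w in small_steps])
print([round(w, 3) for w in big_steps])
```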

Memory requirements for back propagation - why not use the mean activation?

I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).
For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.
My question is about the back-propagation process. It seems that the process requires an additional x*N units of memory(*) for the specific gradient of every weight for every sample. As I understand it, this means that the algorithm calculates the specific gradients of each sample and then sums them up for back-propagation to the previous layer.
Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.
(*) This is based on my own small experiment with the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirements.

The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins, and at the difference between standard SGD and mini-batch SGD, makes it clearer why.
The standard stochastic gradient descent algorithm passes a single sample through the model, then back-propagates its gradients and updates the model weights before repeating the process with the next sample. The main downside is that it is a serial process (samples can't be run simultaneously, because each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition, using just a single sample for each update results in a very noisy gradient.
Mini-batch SGD uses the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smoother gradient during training and enables passing multiple samples through the model in parallel. This is the algorithm that is used when training with keras/tensorflow in mini-batches (commonly called batches, but that term actually refers to batch gradient descent, which is a slightly different algorithm).
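The difference between the two update schedules can be sketched on a tiny one-weight regression problem. This is an illustrative sketch only: the model (y = w*x with a squared-error loss), the data, and the function names are made up, and I average the per-sample gradients within a batch, which differs from summing them only by a factor that can be folded into the learning rate.

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w for one sample."""
    return (w * x - y) * x


def sgd_epoch(w, data, lr):
    """Standard SGD: update the weight after every single sample."""
    for x, y in data:
        w -= lr * grad(w, x, y)
    return w


def minibatch_epoch(w, data, lr, batch_size):
    """Mini-batch SGD: average the gradients of a batch, then update once."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Every sample in the batch is evaluated on the same weight w,
        # which is what allows the batch to be processed in parallel.
        mean_grad = sum(grad(w, x, y) for x, y in batch) / len(batch)
        w -= lr * mean_grad
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by w = 2
w_sgd = w_mb = 0.0
for _ in range(50):
    w_sgd = sgd_epoch(w_sgd, data, lr=0.05)
    w_mb = minibatch_epoch(w_mb, data, lr=0.05, batch_size=2)
print(w_sgd, w_mb)
```

With a batch size of 1 the mini-batch version reduces to standard SGD, and with a batch size equal to the dataset size it becomes batch gradient descent.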
I haven't found any work regarding using the mean of the gradients in each layer for the update. It would be interesting to check the results of such an algorithm. It would be more memory efficient; however, it is likely that it would also be less capable of reaching good minima.
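One way to see why averaging before the backward pass may lose information is a single linear weight, where the gradient contributed by each sample is the product of the incoming activation and the error signal coming from the next layer. The sketch below reflects my reading of the "mean activation" idea from the question, with made-up numbers: it compares the exact averaged gradient with the value obtained from the mean activation and mean error signal alone.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Made-up per-sample values for one weight in a batch of four samples.
activations = [1.0, -1.0, 2.0, -2.0]   # input to the weight, one per sample
deltas = [0.5, -0.5, 1.0, -1.0]        # back-propagated error, one per sample

# Exact mini-batch gradient: average of the per-sample products.
exact = mean([a * d for a, d in zip(activations, deltas)])

# Shortcut: use only the batch means, as the mean-activation idea would.
shortcut = mean(activations) * mean(deltas)

print(exact)     # 1.25
print(shortcut)  # 0.0
```

Here the shortcut gives no gradient at all although every sample agrees on the direction of the update, because the mean of products is not the product of means.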