I'm building an implementation of NeuroEvolution of Augmenting Topologies (NEAT) and I'm looking for a way to optimize my algorithm. The network is an irregular set of connections between neurons.
I'm not very familiar with TensorFlow, but I suppose there is a way to use it here.
I need to evaluate the network many times over a fairly long period, so it gets very slow when the net is very big.
The network can be of any structure: a genetic algorithm evolves the network. Every neuron can have different activation functions.
Any suggestions?
I followed all the steps mentioned in the article:
https://stackabuse.com/tensorflow-2-0-solving-classification-and-regression-problems/
Then I compared the results with linear regression and found that its error (68) is lower than that of the TensorFlow model (84).
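import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit an ordinary least-squares baseline and report its RMSE on the test set.
linreg = LinearRegression()
linreg.fit(X_train, y_train)
pred = linreg.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred)))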
Does this mean that if I have a large dataset, the TensorFlow model will give better results than linear regression?
In what situations should I be using TensorFlow?
To answer your first question: neural networks are notorious for overfitting on small datasets. Here you are comparing a simple linear regression model against a neural network with two hidden layers on the test set, so it is not very surprising that the MLP falls behind (assuming you are working with a relatively small dataset). Larger datasets generally help neural networks learn more accurate parameters and generalize better.
As for your second question: TensorFlow is essentially a library for building deep learning models. Deep learning problems such as image recognition or natural language processing need massive computational power and large amounts of training data, and this is where TensorFlow comes in handy: its GPU support can significantly speed up training that would otherwise be impractical. Moreover, if you are building a product that has to be deployed to production, you can use TensorFlow Serving, which helps you bring your models much closer to your customers.
I know CNNs have a lot of good properties, like weight sharing, memory savings and feature extraction. However, this question confuses me: is there any situation in which a fully connected network is better than a CNN? Why?
Thanks a lot guys!
Is there any situation in which a fully connected network is better than a CNN?
Well, I think we should first define what we mean by "better". Accuracy and precision are not the only things to consider: computational time, degrees of freedom and difficulty of the optimization should also be taken into account.
First, consider an input of size h*w*c. Feeding this input to a convolutional layer with F feature maps and kernel size s results in about F*s*s*c learnable parameters (assuming there are no constraints on the ranks of the convolutions; otherwise we have even fewer parameters). Feeding the same input into a fully connected layer with the same number of feature maps results in F*d_1*d_2*w*h*c parameters (where d_1, d_2 are the dimensions of each feature map), which is clearly on the order of billions of learnable parameters for any input image of decent resolution.
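To make the two counts concrete, here is a minimal sketch that just evaluates both formulas. The sizes (a 224x224 RGB input, 64 feature maps, 3x3 kernels, output maps the same size as the input) are assumptions chosen for illustration, and biases are ignored.

```python
def conv_params(num_filters, kernel_size, in_channels):
    """Weights in a convolutional layer: F * s * s * c (biases ignored)."""
    return num_filters * kernel_size * kernel_size * in_channels


def dense_params(num_filters, map_h, map_w, in_h, in_w, in_channels):
    """Weights in a fully connected layer that produces F feature maps of
    size d_1 x d_2 from an h x w x c input: F * d_1 * d_2 * w * h * c."""
    return num_filters * map_h * map_w * in_w * in_h * in_channels


# Assumed example sizes, not taken from the answer above.
h, w, c = 224, 224, 3   # input height, width, channels
F, s = 64, 3            # number of feature maps, kernel size

conv = conv_params(F, s, c)
dense = dense_params(F, h, w, h, w, c)  # d_1 = h, d_2 = w

print("convolutional layer:  ", conv)
print("fully connected layer:", dense)
print("ratio:", dense // conv)
```

With these sizes the convolutional layer has 1,728 weights, while the fully connected layer ends up in the hundreds of billions.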
While it can be tempting to think that we can get away with shallower networks (we already have lots of parameters, right?), fully connected layers are just linear layers after all, so we still need to insert many non-linearities for the network to gain reasonable representational power. This means you would still need a deep network, but with so many parameters that it would be intractable. In addition, a larger network has more degrees of freedom and will therefore model much more than we want: it will model noise unless we feed it more data or constrain it.
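The point about linearity can be checked on a toy case. The sketch below uses small made-up integer matrices (purely illustrative) to show that two stacked linear layers collapse into a single one, and that inserting a ReLU between them breaks that equivalence.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(vector):
    """Element-wise max(0, v)."""
    return [max(0, v) for v in vector]


# Made-up weights and input, chosen only for illustration.
W1 = [[1, -2], [3, 1]]
W2 = [[2, 1], [-1, 1]]
x = [1, 2]

stacked = matvec(W2, matvec(W1, x))          # two linear layers in sequence
collapsed = matvec(matmul(W2, W1), x)        # one layer with weights W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # non-linearity in between

print(stacked, collapsed, with_relu)
```

The first two results are identical, so without non-linearities extra depth adds parameters but no representational power.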
So yes, there might be a fully connected network that in theory could give us better performance, but we don't know how to train it yet. Finally, and this is purely based on intuition and therefore might be wrong, but it seems unlikely to me that such a fully connected network would converge to a dense solution. Since many convolutional networks achieve very high levels of accuracy (99% and up) on many tasks, I think that the optimal solution the fully connected network would converge to would be close to the convolutional network. So, we don't really need to train the fully connected one, but just a subset of its architecture.
I plotted all the weights of my neural network in TensorBoard and found that the weights of some layers are normally distributed, but others are not.
What does this imply? Should I increase or decrease the capacity of this layer?
Update:
My network is an LSTM-based network. The non-normally distributed weights are the ones multiplied with the input features; the normally distributed weights are the ones multiplied with the states.
One explanation, based on convolutional networks (I don't know whether it holds for other kinds of artificial neural models), might be this: since the first layer tries to find distinct small features, its weights are distributed very widely as the network tries to find any useful feature it can. The next layers then use combinations of these distinct features, so it makes sense for their weights to be normally distributed, since each of the previous features becomes part of a single bigger or more representative feature in the next layers.
But this is only my intuition; I have no proof that this is the reason.
Context: I am going to start training a CNN to classify a data set. This CNN will have to be deployed in a real-world application, so a forward pass through it has to be fast. Most of the CNN architectures I have read about cannot run without a GPU and need a lot of costly resources to be deployed.
Question:
I know one technique that is quite useful for reducing the size of a CNN architecture: downsizing the image using cubic interpolation (cubic interpolation helps preserve certain image features, like edges). This reduces the number of convolution layers as well as the filter size, and so cuts the overall number of parameters in the CNN by quite a lot. I would like to know whether there are other techniques that can make a CNN smaller so that it can realistically be deployed.
Binarization techniques are effective algorithms that constrain both the parameters and the activations of a network to binary values. The loss of precision may degrade the final performance somewhat, but the binary representation greatly reduces the resource requirements of the network.
For instance, you can have a look at these works:
Binarized Neural Networks
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
which released their code.
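As a rough illustration of why binary values are cheap, here is a minimal sketch (not the code released with those papers) of sign binarization and of a dot product computed with XNOR and a bit count instead of multiplications. The helper names and the example numbers are made up, and the per-filter scaling factors used by XNOR-Net are left out.

```python
def binarize(values):
    """Map each real value to +1 or -1 by its sign (0 is sent to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Encode a +1/-1 vector as an integer: bit i is set where signs[i] == +1."""
    bits = 0
    for i, sign in enumerate(signs):
        if sign == 1:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks the positions where the signs agree, so the dot product is
    (agreements) - (disagreements) = 2 * agreements - n.
    """
    mask = (1 << n) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agreements - n


# Made-up real-valued weights and activations.
weights = [0.7, -1.2, 0.1, -0.4]
activations = [0.3, -0.9, 2.0, 0.5]

wb = binarize(weights)
ab = binarize(activations)

reference = sum(w * a for w, a in zip(wb, ab))           # ordinary dot product
fast = xnor_dot(pack_bits(wb), pack_bits(ab), len(wb))   # bitwise version

print(wb, ab, reference, fast)
```

Both computations give the same result, but the bitwise version needs only one XNOR and one bit count per machine word, which is where the savings in memory and compute come from.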
To elaborate: under what circumstances would fine-tuning all layers of a small network (say, SqueezeNet) perform better than feature extraction, or than fine-tuning only the last one or two convolution layers of a big network (e.g. InceptionV4)?
My understanding is that the computing resources required for both are roughly comparable. I also remember reading in a paper that the extreme options, i.e. fine-tuning 90% or 10% of the network, work far better than more moderate ones like 50%. So what should be the default choice when experimenting extensively is not an option?
Any past experiments with an intuitive description of their results, research papers or blog posts would be especially helpful. Thanks.
I don't have much experience in training models like SqueezeNet, but I think it is much easier to finetune only the last 1 or 2 layers of a big network: you don't have to extensively search for many optimal hyperparameters. Transfer learning works amazingly well out of the box with the LR finder and the cyclical learning rate from fast.ai.
If you want fast inference after training, then it is preferable to train SqueezeNet. Training SqueezeNet might also be the better choice if the new task is very different from ImageNet.
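For readers who have not met cyclical learning rates, here is a minimal sketch of the basic triangular schedule from the cyclical learning rate literature. It is not fast.ai's own scheduler, and the values of base_lr, max_lr and step_size are assumptions picked for illustration.

```python
import math


def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate.

    The rate climbs linearly from base_lr to max_lr over step_size
    iterations, falls back to base_lr over the next step_size iterations,
    and then repeats.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)


# Assumed values, for illustration only.
base_lr, max_lr, step_size = 0.001, 0.01, 100

for it in (0, 50, 100, 150, 200):
    print(it, triangular_lr(it, base_lr, max_lr, step_size))
```

The LR finder is typically used to choose the bounds: the learning rate is swept upward over a short run, and the range where the loss decreases fastest suggests values for base_lr and max_lr.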
Some intuition from http://cs231n.github.io/transfer-learning/
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train the SVM classifier from activations somewhere earlier in the network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
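The four cases above can be condensed into a small lookup. This is only a sketch of the rule of thumb: the function name and the two boolean inputs are hypothetical, and in practice "large" and "similar" are judgment calls rather than hard thresholds.

```python
def transfer_strategy(dataset_is_large, similar_to_original):
    """Return the rule-of-thumb strategy for the four cases listed above."""
    if not dataset_is_large and similar_to_original:
        return "train a linear classifier on the CNN codes"
    if dataset_is_large and similar_to_original:
        return "fine-tune through the full network"
    if not dataset_is_large and not similar_to_original:
        return "train a linear classifier on activations from an earlier layer"
    return "fine-tune the entire network, initialized from pretrained weights"


for large in (False, True):
    for similar in (True, False):
        print(f"large={large}, similar={similar}: "
              f"{transfer_strategy(large, similar)}")
```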