Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I trained a doc2vec model in TensorFlow, so I now have embedding vectors for the words in the dictionary and vectors for the documents.
In the paper
"Distributed Representations of Sentences and Documents"
Quoc Le, Tomas Mikolov
authors write
“the inference stage” to get paragraph vectors D for new paragraphs
(never seen before) by adding more columns in D and gradient
descending on D while holding W,U,b fixed.
I have a pretrained model, so W, U and b are available as graph variables. The question is how to implement inference of D (for a new document) efficiently in TensorFlow.
For most neural networks, the output of the network (a class for classification problems, a number for regression, ...) is the value you are interested in. In those cases, inference means running the frozen network on new data (forward propagation) to compute the desired output.
For those cases, several strategies can be used to deliver the desired output quickly for many new data points: scaling horizontally, reducing the cost of the computation by quantising the weights, optimising the frozen graph computation (see https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/), ...
The doc2vec (and word2vec) use case is different, however: the neural net is used to compute an output (a prediction of the next word), but the meaningful and useful data are the weights of the network after training. The inference stage is therefore different: to get a vector representation of a new document you do not read the output of the neural net; you train the part of the neural net that holds the vector representation of your document, while the rest of the neural net (W, U, b) stays frozen.
How can you efficiently compute D (the document vector) in TensorFlow?
Run experiments to find the optimal learning rate (a smaller value might be a better fit for shorter documents), as it determines how quickly the network converges to a representation of a document.
As the other parts of the neural net are frozen, you can scale the inference across multiple processes or machines.
Identify the bottlenecks: what is currently slow? The model computation? Text retrieval from disk or from an external data source? Storage of the results?
Knowing more about your current issues and the context might help.
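To make the inference step concrete, here is a minimal NumPy sketch of the idea from the paper: a new paragraph vector d is trained by gradient descent while W, U and b are only read, never updated. It assumes the PV-DM variant with averaging, a full softmax, and a fixed context window; the function names, shapes and hyperparameters are illustrative and not taken from the question's model. In TensorFlow the same effect would come from making d the only trainable variable.

```python
import numpy as np

def pvdm_loss(d, word_ids, W, U, b, window=2):
    """Mean next-word cross-entropy of one document under frozen W, U, b."""
    losses = []
    for t in range(window, len(word_ids)):
        # Hidden vector: average of the paragraph vector and the context words.
        h = (d + W[word_ids[t - window:t]].sum(axis=0)) / (window + 1)
        logits = U @ h + b
        logits = logits - logits.max()
        log_probs = logits - np.log(np.exp(logits).sum())
        losses.append(-log_probs[word_ids[t]])
    return float(np.mean(losses))

def infer_doc_vector(word_ids, W, U, b, window=2, lr=0.1, epochs=50, seed=0):
    """Gradient-descend on a new paragraph vector d; W, U, b stay fixed."""
    rng = np.random.default_rng(seed)
    d = rng.normal(scale=0.01, size=W.shape[1])
    for _ in range(epochs):
        for t in range(window, len(word_ids)):
            h = (d + W[word_ids[t - window:t]].sum(axis=0)) / (window + 1)
            logits = U @ h + b
            p = np.exp(logits - logits.max())
            p /= p.sum()
            p[word_ids[t]] -= 1.0               # gradient of the loss w.r.t. the logits
            d -= lr * (U.T @ p) / (window + 1)  # only d is updated
    return d

# Toy stand-ins for the pretrained parameters (random here, for illustration only).
rng = np.random.default_rng(42)
vocab, dim = 20, 8
W = rng.normal(size=(vocab, dim))   # word embeddings
U = rng.normal(size=(vocab, dim))   # softmax weights
b = np.zeros(vocab)                 # softmax bias
doc = rng.integers(0, vocab, size=12)
d_new = infer_doc_vector(doc, W, U, b)
```

Because each document's d is optimised independently of every other document, calls like `infer_doc_vector` can be distributed across processes or machines without any synchronisation.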
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 11 months ago.
I am doing transfer learning with the Google AudioSet embeddings. According to the documentation,
the embedding layer does not include a final non-linear activation, so
the embedding value is pre-activation
I want to train and test a new model on top of this embedding layer, using the embedding data. I plan to do the following:
Create new dense layers.
Convert the embeddings from byte strings to tensors. Split these embeddings into train, validation and test datasets.
Input these tensors to the new model.
Validate and test the model using the validation and test datasets.
I have two points of confusion with this implementation:
Is using the embeddings as input to the new layers enough for transfer learning? I have seen some transfer learning implementations that load pre-trained weights into the new model and freeze the layers holding those weights. But those implementations use new data for training, not the embeddings from the pre-trained model. I am confused about how that works.
Is it okay to split the embeddings into train, validation and test datasets? I am not sure whether all the embeddings were used for training the pre-trained model. If they all were, does it make sense to use part of them as validation and test datasets?
Is using the embeddings as input of the new layers enough for the transfer learning?
This should work as expected. Keep in mind, though, that your generalization capability on unseen data points might be lower than your evaluation suggests. Usually, when using a pre-trained model, every data point is unseen by the original network, but in your case some of the data points might have been used to train it, so performance on them might be unrealistically high compared with data that the pre-trained model has never seen.
Is it okay to split the embeddings to train, test and validate dataset?
This is a good approach to the problem from the previous point. If you don't know which data points were used for training, you could benefit from cross-validation: creating multiple splits reduces the impact of this issue.
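The cross-validation suggestion can be sketched as follows. This is a minimal NumPy version that only produces index splits; the array `embeddings` is a random placeholder standing in for the decoded AudioSet embeddings, and the fold count and function name are arbitrary choices for illustration.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# Placeholder data: 100 embedding vectors of length 128.
embeddings = np.random.default_rng(1).normal(size=(100, 128))
splits = list(kfold_indices(len(embeddings), n_folds=5))
```

Training the new dense layers once per split and averaging the validation scores gives an estimate that depends less on which particular embeddings happened to land in a single validation set.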
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I read that ResNet solves the vanishing gradient problem by using skip connections. But isn't that problem already solved by ReLU? Is there some other important thing I'm missing about ResNet, or does the vanishing gradient problem occur even with ReLU?
The ReLU activation solves the problem of vanishing gradient that is due to sigmoid-like non-linearities (the gradient vanishes because of the flat regions of the sigmoid).
The other kind of "vanishing" gradient seems to be related to the depth of the network (see this, for example). Basically, when backpropagating the gradient from layer N to layer N-k, the gradient vanishes as a function of depth (in vanilla architectures). The idea of ResNets is to help with gradient backpropagation (see for example "Identity Mappings in Deep Residual Networks", where the authors present ResNet v2 and argue that identity skip connections are better at this).
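A small numerical illustration of the depth effect, under simplifying assumptions: layers are purely linear (no activation, so ReLU is out of the picture), weights are small random matrices, and a residual layer computes y = x + Wx. None of these choices come from the papers; they are just the simplest setting in which the two gradient paths can be compared.

```python
import numpy as np

def backprop_norm(weights, residual):
    """Norm of a unit gradient after backpropagating through all layers."""
    g = np.ones(weights[0].shape[0])
    g /= np.linalg.norm(g)
    for W in reversed(weights):
        # Plain layer y = W x       -> grad_in = W^T grad_out
        # Residual layer y = x + W x -> grad_in = grad_out + W^T grad_out
        g = W.T @ g + (g if residual else 0.0)
    return float(np.linalg.norm(g))

rng = np.random.default_rng(0)
depth, dim = 50, 16
weights = [rng.normal(scale=0.05, size=(dim, dim)) for _ in range(depth)]

plain_norm = backprop_norm(weights, residual=False)
residual_norm = backprop_norm(weights, residual=True)
print(f"plain: {plain_norm:.3e}  residual: {residual_norm:.3e}")
```

In the plain stack the gradient is multiplied by a contracting matrix at every layer and shrinks geometrically with depth, while the identity term in the residual stack keeps a direct path open for the gradient.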
A very interesting and relatively recent paper that sheds light on how ResNets work is "Residual Networks Behave Like Ensembles of Relatively Shallow Networks". The tl;dr of this paper could be (very roughly) summarized as follows. Residual networks behave as an ensemble: removing a single layer (i.e. a single residual branch, not its skip connection) doesn't really affect performance, but performance decreases smoothly as a function of the number of layers removed, which is how ensembles behave. Most of the gradient during training comes from short paths. The authors show that training only these short paths doesn't affect performance in a statistically significant way compared with training all paths. This means that the effect of residual networks doesn't really come from depth, as the effect of long paths is almost non-existent.
The main purpose of ResNet is to create deeper models. In theory, deeper models (think of the VGG family) should show better accuracy, but in practice they usually do not. If we add shortcut connections to the model, however, we can increase the number of layers and the accuracy as well.
While the ReLU activation function does address vanishing gradients, it does not provide the deeper layers with extra information the way ResNets do. Propagating the original input as deep as possible through the network, and so helping the network learn much more complex features, is why the ResNet architecture was introduced and why it achieves such high accuracy on a variety of tasks.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I'm working on a TensorFlow project in which a neural network in a reinforcement learning system is used to predict the Q values. I have 50 inputs and 10 outputs. Some of the inputs are in the range 30-70, and the rest are between 0 and 1, so I normalize only the first group, using this formula:
x_new = (x - x_min)/(x_max - x_min)
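For reference, the formula applied to only a subset of the input columns could look like the sketch below. The assumption that the 30-70 inputs are the first five columns is made up for the example, as are the function name and the random data.

```python
import numpy as np

def minmax_columns(x, cols):
    """Apply x_new = (x - x_min) / (x_max - x_min) to the given columns only."""
    x = np.asarray(x, dtype=float).copy()
    x_min = x[:, cols].min(axis=0)
    x_max = x[:, cols].max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid division by zero
    x[:, cols] = (x[:, cols] - x_min) / span
    return x

rng = np.random.default_rng(0)
states = np.hstack([
    rng.uniform(30, 70, size=(200, 5)),   # inputs in the 30-70 range
    rng.uniform(0, 1, size=(200, 45)),    # inputs already in 0-1
])
scaled = minmax_columns(states, cols=np.arange(5))
```

If the bounds are known in advance (30 and 70 here), using them as fixed x_min and x_max instead of per-batch statistics keeps the scaling identical across batches.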
Although I know the mathematical base of neural networks, I do not have experience applying them in real cases, so I do not really know if the hyperparameters I am using are correctly chosen. The ones I have currently are:
2 hidden layers with 10 and 20 neurons each
Learning rate of 0.5
Batch size of 10 (I have tried different values up to 256, obtaining the same result)
The problem I'm not able to solve is that the weights of this neural network only change in the first two or three iterations, and stay fixed afterwards.
From what I have read in other posts, the algorithm may be stuck in a local optimum, and normalizing the inputs is a good way to address that. However, after normalizing the inputs, I am still in the same situation. So my question is whether anyone knows where the problem may be, and whether there is any other technique (like normalization) that I should add to my pipeline.
I haven't added any code to the question because I think my problem is rather conceptual. However, if more details are needed, I will add it.
Some pointers you can check:
50 input data points with 10 classes? If so, the dataset is too small for the network to learn anything at all.
Which activation function are you using? Try ReLU instead of sigmoid or tanh (see: activation functions).
How deep is your network? Maybe your gradients are either vanishing or exploding (see: vanishing or exploding gradients).
Check whether your network can overfit the training data. If it cannot, it is not learning anything.
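One way to see why the activation function matters for weights that stop changing: the sigmoid's derivative collapses once the pre-activation is large in magnitude, while ReLU's derivative stays at 1 for any positive input. The sketch below only evaluates the two derivatives at a few hand-picked points; it is not a diagnosis of the network in the question.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid, s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.5, 10.0])
sig = sigmoid_grad(z)
rel = relu_grad(z)
print("sigmoid'(z):", sig)
print("relu'(z):   ", rel)
```

A few large updates (for instance from a high learning rate) can push sigmoid units into the flat regions, after which almost no gradient flows and the weights appear frozen.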
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Thanks to Google for providing a few pre-trained models with the TensorFlow API.
I would like to know how to retrain a pre-trained model available from the above repository by adding new classes to the model.
For example, the model trained on the COCO dataset has 90 classes. I would like to add 1 or 2 classes to the existing ones and get a 92-class object detection model as a result.
The repository provides a "Running Locally" guide, but it completely replaces the pre-trained classes with the newly trained classes. Only train and eval are mentioned there.
So, is there any other way to retrain the model and get 92 classes as a result?
Question: How do we add a few more classes to an already trained network?
Specifically, we want to keep the whole network as-is apart from the outputs for the new classes. For something like ResNet, this means keeping everything other than the last layer frozen and somehow expanding the last layer to cover the new classes.
Answer: Combine the existing last layer with a new one that you train.
Specifically, replace the last layer with a fully connected layer that is large enough for both the new classes and the old ones. Initialize it with random weights, then train it on your new classes and just a few of the old ones. After training, copy the weights of the original last fully connected layer into your newly trained fully connected layer.
If, for example, the previous last layer was a 1024x90 matrix and your new last layer is a 1024x92 matrix, copy the 1024x90 block into the corresponding space in your new 1024x92 matrix. This destructively replaces everything you trained for the old classes with the pre-trained values but leaves your training of the new classes intact. That is good, because you probably didn't train on the full set of old classes. Do the same thing with the bias, if any.
Your final network will have only 1024x2 new weight values (plus any bias), corresponding to your new classes.
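The copy step can be sketched with plain arrays. This assumes the old classes occupy the first 90 output columns of the new layer and that the weights have already been extracted from the model as NumPy arrays; the function name and the random stand-in weights are illustrative only.

```python
import numpy as np

def merge_last_layer(W_old, b_old, W_new, b_new):
    """Overwrite the old-class columns of the new layer with pretrained values."""
    n_old = W_old.shape[1]
    W = W_new.copy()
    b = b_new.copy()
    W[:, :n_old] = W_old   # assumption: old classes come first in the new layer
    b[:n_old] = b_old
    return W, b

rng = np.random.default_rng(0)
W_old, b_old = rng.normal(size=(1024, 90)), rng.normal(size=90)   # pretrained layer
W_new, b_new = rng.normal(size=(1024, 92)), rng.normal(size=92)   # newly trained layer
W_merged, b_merged = merge_last_layer(W_old, b_old, W_new, b_new)
```

After the merge, only the last two columns (and the last two bias entries) still carry values learned during the new training run.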
A word of caution: although this trains fast and gives quick results, it will not perform as well as retraining on a full and comprehensive data set.
That said, it'll still work well ;)
For how to replace the last layer, see the answers to "How to remove the last layer from trained model in Tensorflow".
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I want to design a CNN for a binary image classification task, which is to detect whether a small object is present or absent in the images. The images are greyscale (unsigned short) with size 512x512 (already downsampled from 2048x2048), and I have thousands of those images for training and testing.
It's my first time using CNN for this kind of task, and I hope to achieve ~80% accuracy to start, so I'd like to know, IN GENERAL, how to design the CNN such that I have the best chance to achieve my goal.
My specific questions are:
How many convolution layers and fully-connected layers should I use?
How many feature maps are in each convolution layer and how many nodes in each fully-connected layer?
What's the filter size in each convolution layer?
I'm trying to implement the CNN using Keras with the TensorFlow backend, and my computer's specs are: 8 Intel Xeon CPUs @ 3.5 GHz; 32 GB memory; 2 Nvidia GPUs: GeForce GTX 980 and Quadro K4200
With this hardware and software, I'd also like to know the computational time of the training. Specifically,
How long will it take to train the CNN (with the above structure) on 1000 of the images mentioned above per epoch, and (in general) how many epochs are needed to achieve ~80% accuracy?
The reason I want to know the typical computational time is to make sure I set up everything properly.
Hope I didn't ask too many questions in my first post.
You'd probably do very well by taking one of the existing models that Keras makes available for this kind of task, such as VGG16, VGG19, InceptionV3 and others: https://keras.io/applications/.
You can experiment with them, try different parameters, make little tweaks here and there, and so on. Since you've got only one class, you can probably try smaller versions of them.
All the code can be found at https://github.com/fchollet/keras/tree/master/keras/applications
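One practical detail when feeding the images from the question into such models: the ImageNet-pretrained weights expect three-channel input, whereas the images here are single-channel unsigned shorts. A minimal sketch of one possible conversion is below; scaling by 65535 and repeating the channel are assumptions made for illustration, not the preprocessing any specific Keras model prescribes, and the small 64x64 batch is only there to keep the example light.

```python
import numpy as np

def to_three_channel_float(images_u16):
    """Scale uint16 greyscale images to [0, 1] and repeat them over 3 channels."""
    x = images_u16.astype(np.float32) / np.float32(65535.0)
    return np.repeat(x[..., np.newaxis], 3, axis=-1)

rng = np.random.default_rng(0)
# Small random stand-in for a batch of greyscale images (N, height, width).
batch = rng.integers(0, 65536, size=(4, 64, 64)).astype(np.uint16)
x = to_three_channel_float(batch)
print(x.shape, x.dtype)
```

Each model in the applications module also has its own input preprocessing, so the scaling above may need to be replaced by whatever the chosen model expects.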
Speed is very relative. It's impossible to predict, because each installation method, driver, version and operating system may or may not use your hardware capabilities properly or fully.
But with your specifications, it should be pretty fast, if everything is set up well.