Classification of a sequence of images (fixed number) - tensorflow

I successfully trained a CNN for single-image classification, using a pre-trained ResNet50 from tensorflow_hub.
Now my goal is to give as input to my network a chronological sequence of images (not a video), to classify the behavior of the subject.
Each sequence consists of 20 images taken every 100ms.
What is the best kind of NN? Where can I find documentation/examples for problems similar to mine?

Any time there is sequential data, some type of Recurrent Neural Network (usually an LSTM) is a great candidate.
Because your pictures have a sequential relationship, your model may look like a CNN-LSTM combination: a CNN extracts features from each image, and an LSTM models how those features change across the sequence.
Here is a link to some examples and tutorials. The author sets up a CNN in his example, but you could probably adapt the architecture to use the ResNet you have already trained. Though you are not dealing with video, your problem shares the same domain.
Here is a paper that uses a NN architecture like the one described above, which you might find useful.
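As a rough illustration of the CNN-LSTM idea above, here is a minimal Keras sketch. It is only a sketch under assumptions: the small frame encoder is a stand-in for the ResNet50 feature extractor from the question, and the image size and NUM_CLASSES are hypothetical values chosen to keep the example small.

```python
import numpy as np
from tensorflow.keras import layers, models

NUM_FRAMES = 20                      # 20 images per sequence, as in the question
HEIGHT, WIDTH, CHANNELS = 32, 32, 3  # hypothetical (small) image size
NUM_CLASSES = 4                      # hypothetical number of behaviors


def build_frame_encoder():
    """Small stand-in CNN that maps one frame to a 32-dim feature vector."""
    return models.Sequential([
        layers.Input(shape=(HEIGHT, WIDTH, CHANNELS)),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
    ])


def build_cnn_lstm():
    frames = layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, CHANNELS))
    # Apply the same CNN to every frame -> (batch, NUM_FRAMES, 32)
    features = layers.TimeDistributed(build_frame_encoder())(frames)
    # The LSTM reads the per-frame features in chronological order
    summary = layers.LSTM(64)(features)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(summary)
    return models.Model(frames, outputs)


model = build_cnn_lstm()
batch = np.random.rand(2, NUM_FRAMES, HEIGHT, WIDTH, CHANNELS).astype("float32")
probs = model.predict(batch, verbose=0)  # one class distribution per sequence
```

To reuse an already trained image model, the frame encoder would be replaced by that model with its classification head removed, optionally frozen while the LSTM is trained.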

Related

How to use google inception model to classify DNA or protein sequences data sets?

I am trying to classify proteins into their families using their sequences. Can I use deep convolutional models for this purpose, even though they expect 3-channel RGB image matrices as input? Is there a specific way to convert a non-image dataset so that it can be classified with these models? I'm new to artificial neural networks, so your suggestions are highly appreciated.
First, you need to understand that the models you have in mind are built for a very difficult problem, object recognition in color images, and are therefore very large.
Second, the purpose of the CNN in those models is to extract as many features as possible from color images in order to perform detection.
With that in mind, I think classifying proteins from their sequences is achievable with a much smaller convolutional model. You may need at most 10 convolutional layers. In short, you should not need a CNN as complex as the Google Inception model.
About your data: there is no rule saying CNNs can only be used on RGB pictures. Those pictures are just arrays. If you have any kind of numeric data that can be used in arithmetic operations, you can definitely use CNNs for feature extraction. I recommend you take a look at this example.
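To make the "pictures are only arrays" point concrete, here is a minimal sketch of one common way to turn a protein sequence into a numeric array: one-hot encoding over the 20 standard amino acids. The padding length and the choice to leave unknown letters as all-zero rows are assumptions made for illustration, not something prescribed above.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Encode a sequence as a (max_len, 20) array, truncating or zero-padding."""
    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence[:max_len]):
        idx = INDEX.get(aa)
        if idx is not None:  # unknown letters stay as all-zero rows
            encoded[pos, idx] = 1.0
    return encoded


# Two toy sequences of different lengths, padded to the same shape
batch = np.stack([one_hot_encode(s, max_len=8) for s in ["MKTAYIAK", "ACD"]])
```

The resulting batch has shape (2, 8, 20), which is the kind of (length, channels) input a 1D convolutional layer can consume, much like an image is a (height, width, channels) array.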
I also recommend you take a look at the following libraries: scikit-learn, Keras and PyTorch. These libraries are very beginner-friendly and have excellent documentation.
Best of luck.

How to know what Tensorflow actually "see"?

I'm using a CNN built with Keras (TensorFlow) to do visual recognition.
I wonder if there is a way to know what my own TensorFlow model "sees".
There was a news story about Google showing the cat face inside the AI's "brain":
https://www.smithsonianmag.com/innovation/one-step-closer-to-a-brain-79159265/
Can anybody tell me how to extract such an image from my own CNN?
For example, what does my own CNN model see when it recognizes a car?
We have to distinguish between what TensorFlow actually sees:
As we go deeper into the network, the feature maps look less like the
original image and more like an abstract representation of it. As you
can see in block3_conv1 the cat is somewhat visible, but after that it
becomes unrecognizable. The reason is that deeper feature maps encode
high level concepts like “cat nose” or “dog ear” while lower level
feature maps detect simple edges and shapes. That’s why deeper feature
maps contain less information about the image and more about the class
of the image. They still encode useful features, but they are less
visually interpretable by us.
and what we can reconstruct from it through some kind of reverse "deconvolution" process (which is not actually a deconvolution in the mathematical sense).
To answer your actual question: there are many good example solutions out there. One you can study is: Visualizing output of convolutional layer in tensorflow.
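As a rough sketch of how the intermediate feature maps can be pulled out of a Keras model, the snippet below builds a second model that shares the classifier's layers but returns the outputs of the convolutional layers. The tiny classifier and the layer names conv1 and conv2 are hypothetical; with your own network you would use its real layer names.

```python
import numpy as np
from tensorflow.keras import layers, models

# Hypothetical small classifier standing in for your own trained CNN
inputs = layers.Input(shape=(32, 32, 3))
x = layers.Conv2D(8, 3, activation="relu", name="conv1")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(16, 3, activation="relu", name="conv2")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(2, activation="softmax")(x)
classifier = models.Model(inputs, outputs)

# Second model: same input, but it returns the conv layers' activations
conv_names = ["conv1", "conv2"]
activation_model = models.Model(
    inputs=classifier.input,
    outputs=[classifier.get_layer(name).output for name in conv_names],
)

image = np.random.rand(1, 32, 32, 3).astype("float32")  # placeholder image
feature_maps = activation_model.predict(image, verbose=0)
shapes = {name: fm.shape for name, fm in zip(conv_names, feature_maps)}
```

Each entry of feature_maps has shape (1, height, width, filters); plotting feature_maps[i][0, :, :, k] as a grayscale image (for example with matplotlib's imshow) shows what filter k of that layer responds to for the given input.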
When you build a model to perform visual recognition, you give it labelled data (pictures, in this case) so that it can adjust its weights according to the training data. If you wish to build a model that can recognize a car, you have to train it on a large dataset of labelled pictures. This type of recognition is basically categorical recognition.
You can experiment with the MNIST dataset, which provides pictures of handwritten digits for image recognition.

how to use tensorflow object detection API for face detection

OpenCV provides a simple API to detect and extract faces from given images. (I do not think it works perfectly, though, because in my experience it crops regions from the input pictures that have nothing to do with faces.)
I wonder if the TensorFlow API can be used for face detection. I failed to find relevant information, but I am hoping that someone experienced in the field can guide me on this subject. Can TensorFlow's object detection API be used for face detection in the same way as OpenCV? (I mean, you just call the API function and it gives you the face image from the given input image.)
You can, but some work is needed.
First, take a look at the object detection README. There are some useful articles you should follow. Specifically: (1) Configuring an object detection pipeline, (2) Preparing inputs and (3) Running locally. You should start with an existing architecture with a pre-trained model. Pre-trained models can be found in Model Zoo, and their corresponding configuration files can be found here.
The most common pre-trained models in Model Zoo are trained on the COCO dataset. Unfortunately, this dataset doesn't contain face as a class (but does contain person).
Instead, you can start with a model pre-trained on Open Images, such as faster_rcnn_inception_resnet_v2_atrous_oid, which does contain face as a class.
Note that this model is larger and slower than common architectures used on the COCO dataset, such as SSDLite over MobileNetV1/V2. This is because Open Images has many more classes than COCO, so a well-performing model needs to be much more expressive in order to distinguish between the large number of classes and localize them correctly.
Since you only want face detection, you can try the following two options:
If you're okay with a slower model, which will probably result in better performance, start with faster_rcnn_inception_resnet_v2_atrous_oid; you should only need to fine-tune it slightly on the single class of face.
If you want a faster model, you should probably start with something like SSDLite-MobileNetV2 pre-trained on COCO, but then fine-tune it on the class of face from a different dataset, such as your own or the face subset of Open Images.
Note that the fact that the pre-trained model wasn't trained on faces doesn't mean you can't fine-tune it to detect them; it just might take more fine-tuning than a model that was pre-trained on faces as well.
Just increase the shape of the input. I tried it and it works much better.

Fixing error output from seq2seq model

I want to ask how we can effectively re-train a trained seq2seq model to remove or mitigate a specific observed error in its output. I'm going to give an example from Speech Synthesis, but any idea from other domains that use seq2seq models, such as Machine Translation and Speech Recognition, will be appreciated.
I learned the basics of the seq2seq-with-attention model, especially for Speech Synthesis, such as Tacotron-2.
Using a distributed well-trained model showed me how naturally our computer could speak with the seq2seq (end-to-end) model (you can listen to some audio samples here). But still, the model fails to read some words properly, e.g., it fails to read "obey [əˈbā]" in multiple ways like [əˈbī] and [əˈbē].
The reason is obvious: the word "obey" appears too rarely in our dataset (LJ Speech), only three times out of 225,715 words, so the model had no chance to learn it.
So, how can we re-train the model to overcome the error? Adding extra audio clips containing the "obey" pronunciation sounds impractical, but reusing the three audio clips carries the danger of overfitting. Also, since we are starting from a well-trained model, I suppose "simply training more" is not an effective solution.
Now, this is one of the drawbacks of seq2seq models that is not talked about much. The model successfully simplified the pipelines of traditional models; e.g., for Speech Synthesis, it replaced an acoustic model, a text analysis frontend, etc. with a single neural network. But we have completely lost controllability of the model. It's impossible to make the system read in a specific way.
Again, if you use a seq2seq model in any field and get an undesirable output, how do you fix that? Is there a data-scientific workaround to this problem, or maybe a cutting-edge neural network mechanism to gain more controllability in seq2seq models?
Thanks.
I found an answer to my own question in Section 3.2 of the paper (Deep Voice 3).
They trained both a phoneme-based model and a character-based model. Phoneme inputs are used mainly, and the character-based model is used only for words that cannot be converted to their phoneme representations.
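The input-side logic described above (phonemes where available, characters as a fallback) could be sketched roughly as follows. This is only an illustration of the fallback idea, not the paper's implementation: the tiny lexicon, its ARPAbet-style symbols, and the function name to_model_input are all hypothetical.

```python
# Hypothetical mini pronunciation lexicon (ARPAbet-style symbols);
# a real system would load a full pronunciation dictionary instead.
LEXICON = {
    "obey": ["OW0", "B", "EY1"],
    "the": ["DH", "AH0"],
}


def to_model_input(text, lexicon):
    """Map each word to phonemes if known, otherwise fall back to characters."""
    tokens = []
    for word in text.lower().split():
        phonemes = lexicon.get(word)
        if phonemes is not None:
            tokens.append(("phoneme", phonemes))
        else:
            tokens.append(("char", list(word)))
    return tokens


encoded = to_model_input("Obey the xyzzy", LEXICON)
```

With such a setup, a mispronounced word like "obey" can be corrected by fixing or adding its lexicon entry, which gives back some of the controllability that a purely character-based model lacks.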

How can I enrich a Convolutional Neural Network with meta information?

I would very much like to understand how I can enrich a CNN with the meta information I have. As I understand it, a CNN 'just' looks at the images and classifies them into objects without looking at any available meta-parameters such as time, weather conditions, etc.
To be more precise, I am using a Keras CNN with TensorFlow as the backend. I have the typical Conv2D and MaxPooling layers and a fully connected model at the end of the pipeline. It works nicely and gives me good accuracy. However, I do have additional meta information for each image (the manufacturer of the camera with which the image was taken) that is unused so far.
What is a recommended way to incorporate this meta information into the model? I have not yet been able to come up with a good solution myself.
Thanks for any help!
Usually this is done by adding the information to one of the fully connected layers before the prediction. The fully connected layer gives you K features representing your image; you just concatenate them with the additional information you have.
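A minimal sketch of that concatenation with the Keras functional API might look like the following. The layer sizes, the number of manufacturers, and the one-hot encoding of the manufacturer are assumptions made for illustration; K is 32 here.

```python
import numpy as np
from tensorflow.keras import layers, models

NUM_MANUFACTURERS = 3  # hypothetical number of camera manufacturers
NUM_CLASSES = 5        # hypothetical number of object classes

# Image branch: the usual Conv2D / MaxPooling stack ending in K = 32 features
image_in = layers.Input(shape=(32, 32, 3), name="image")
x = layers.Conv2D(16, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
image_features = layers.Dense(32, activation="relu")(x)

# Metadata branch: one-hot encoded manufacturer
meta_in = layers.Input(shape=(NUM_MANUFACTURERS,), name="manufacturer")

# Concatenate the K image features with the metadata before the prediction
merged = layers.Concatenate()([image_features, meta_in])
hidden = layers.Dense(16, activation="relu")(merged)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(hidden)
model = models.Model([image_in, meta_in], outputs)

images = np.random.rand(4, 32, 32, 3).astype("float32")
manufacturer_ids = np.array([0, 2, 1, 2])
meta = np.eye(NUM_MANUFACTURERS, dtype="float32")[manufacturer_ids]  # one-hot rows
probs = model.predict([images, meta], verbose=0)
```

Putting at least one more Dense layer after the concatenation, as done here, lets the network learn interactions between the image features and the metadata rather than only adding a per-manufacturer bias to the output.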