How to train CNN models with additional categorical / numerical features? - tensorflow

In addition to the image itself in RGB, I also have a list of metadata / categorical / numerical features attached to each image,
e.g. the local time of day and day of week when the photo was taken, the GPS coordinates / city name of the photo, and a brief description of the photo (written by a human).
How do you train a CNN model using tensorflow with additional features?

Generally, it is a hard problem to flexibly include all of these different features (images, time, description, etc.) in one model. A CNN is designed for, and only for, extracting information from images: operations like convolution and pooling only apply to image-like data, and generalising them to other kinds of input takes a lot of design effort.
However, a CNN does help you summarize the information contained in an image. You could use the output of your CNN as features and feed them, together with your other features, into another model.
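To make that idea concrete, here is a minimal sketch (not a definitive recipe: the image size, the 10-feature metadata vector, the 5 classes, and the frozen ImageNet ResNet50 are all assumptions) of using a CNN to summarize the image and feeding that summary, concatenated with the extra features, into a small second classifier:

import tensorflow as tf

# Hypothetical setup: 224x224 RGB images plus 10 already-encoded metadata features.
image_input = tf.keras.Input(shape=(224, 224, 3), name="image")
meta_input = tf.keras.Input(shape=(10,), name="metadata")

# A pretrained CNN used purely to summarize the image (weights frozen).
backbone = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                          weights="imagenet")
backbone.trainable = False
image_features = backbone(image_input)               # shape (None, 2048)

# Feed the image summary together with the metadata into a small second model.
x = tf.keras.layers.Concatenate()([image_features, meta_input])
x = tf.keras.layers.Dense(128, activation="relu")(x)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)   # 5 classes, hypothetical

model = tf.keras.Model(inputs=[image_input, meta_input], outputs=outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit({"image": images, "metadata": metadata}, labels, epochs=10)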

Related

How to control the amount of augmented training data when the augmentation layers are integrated into the model

As you may know, recent versions of tensorflow/keras allow data augmentation layers to be integrated into the model. This part of the API is an excellent option, especially when you want to apply image augmentation to only some of the inputs (the images) of a model with multimodal inputs and different sub-networks for different inputs. In my case, test accuracy with this augmentation increased by 3-5% compared with no augmentation.
But I can't figure out how many training samples are actually used in training with this augmentation method. For simplicity, let's assume I am passing a list of numpy arrays as the inputs of the model when fitting it. For example, if I have 1000 training cases for a model with the augmentation layers, will 1000 training cases with transformed images be used in training? If not, how many?
I searched all the related sites (tutorials and documentation) for an answer to this simple question, in vain.
I think I found the answer. Based on the model's training log, the augmentation layers do not produce additional images; they randomly transform the original images in place. To increase the amount of training data, a user has to provide multiple copies of the original training data as input to the model.
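For reference, a minimal sketch of such in-model augmentation (the layer choices and image size here are purely illustrative). Note that model.fit still reports the same number of samples per epoch; each image is simply transformed differently every time it is drawn:

import tensorflow as tf

augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

inputs = tf.keras.Input(shape=(64, 64, 3))            # hypothetical image size
x = augmentation(inputs)                              # active only during training
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

# With 1000 training images, each epoch still iterates over exactly 1000 samples;
# the augmentation layers transform them on the fly instead of adding new ones.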

Classification of a sequence of images (fixed number)

I successfully trained a CNN for a single image classification, using pre-trained resnet50 from tensorflow_hub.
Now my goal is to give as input to my network a chronological sequence of images (not a video), to classify the behavior of the subject.
Each sequence consists of 20 images taken every 100ms.
What is the best kind of NN? Where can I find documentation/examples for problems similar to mine?
Any time there is sequential data, some type of Recurrent Neural Network is a great candidate (usually in the form of an LSTM).
Your model may look like a CNN-LSTM combination, because your pictures have a sequential relationship.
Here is a link to some examples and tutorials. The author sets up a CNN in his example, but you could probably rig your architecture to use the ResNet you have already made. Though you are not dealing with a video, your problem shares the same domain.
Here is a paper that uses an NN architecture like the one described above, which you might find useful.
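As a rough illustration of such a CNN-LSTM (a sketch only: the 20-frame input comes from the question, while the 4 behaviour classes, the frozen ImageNet ResNet50, and the layer sizes are assumptions), each frame is passed through a CNN wrapped in TimeDistributed and the resulting sequence of embeddings is summarised by an LSTM:

import tensorflow as tf

NUM_FRAMES, HEIGHT, WIDTH, NUM_CLASSES = 20, 224, 224, 4   # class count is hypothetical

# Frame-wise feature extractor; you could swap in the resnet50 you already trained.
cnn = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                     weights="imagenet")
cnn.trainable = False

sequence_input = tf.keras.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, 3))
x = tf.keras.layers.TimeDistributed(cnn)(sequence_input)    # (batch, 20, 2048)
x = tf.keras.layers.LSTM(128)(x)                            # summarise the 20 frames
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(sequence_input, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])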

Where are the filter image data in this TensorFlow example?

I'm trying to work through this tutorial by Google on using a TensorFlow Estimator to train on and recognise images: https://www.tensorflow.org/tutorials/estimators/cnn
The data I can see in the tutorial are: train_data, train_labels, eval_data, eval_labels:
import tensorflow as tf

(train_data, train_labels), (eval_data, eval_labels) = tf.keras.datasets.mnist.load_data()
In the convolutional layers, there should be feature filter image data that is multiplied with the input image data, but I don't see it in the code.
According to this guide, the input image data is matrix-multiplied with filter image data to check for low-level features (curves, edges, etc.), so there should be filter image data too (the right-hand matrix in the guide's illustration): https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks
The filters are the weight matrices of the Conv2D layers used in the model; they are not pre-loaded images like the "butt curve" you gave in the example. If they were, we would need to provide the CNN with all possible types of shapes, curves, and colours, and hope that any unseen data we feed the model contains this finite set of images somewhere in it which the model can recognise.
Instead, we allow the CNN to learn the filters it requires to successfully classify from the data itself, and hope it can generalise to new data. Through many iterations and a lot of data (which these models require), the model iteratively crafts the best set of filters for successfully classifying the images. The random initialisation at the start of training ensures that the filters in each layer learn to identify different features in the input image.
The fact that earlier layers usually correspond to colours and edges (as above) is not predefined; the network has realised that looking for edges in the input is the only way to build up context for the rest of the image, and thereby classify it (humans do the same initially).
The network uses these primitive filters in the earlier layers to generate more complex interpretations in deeper layers. This is the power of distributed learning: representing complex functions through multiple applications of much simpler functions.
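To see this concretely, the "filter images" are just the kernel weights of each Conv2D layer, and you can read them back from any Keras model (a small illustrative sketch, not the code from the tutorial):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# The filters are the learned kernel weights of the convolutional layer.
kernels, biases = model.layers[0].get_weights()
print(kernels.shape)   # (5, 5, 1, 32): 32 filters of size 5x5 over 1 input channel
# Before training these values are random; training shapes them into edge/curve detectors.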

How to know what Tensorflow actually "sees"?

I'm using a CNN built with Keras (TensorFlow) to do visual recognition.
I wonder if there is a way to know what my own tensorflow model "sees".
Google published a news story showing the cat face that emerged in the "AI brain":
https://www.smithsonianmag.com/innovation/one-step-closer-to-a-brain-79159265/
Can anybody tell me how to extract such images from my own CNN network?
For example, what does my own CNN model see when it recognizes a car?
We have to distinguish between what TensorFlow actually "sees":
As we go deeper into the network, the feature maps look less like the original image and more like an abstract representation of it. As you can see in block3_conv1 the cat is somewhat visible, but after that it becomes unrecognizable. The reason is that deeper feature maps encode high-level concepts like "cat nose" or "dog ear" while lower-level feature maps detect simple edges and shapes. That's why deeper feature maps contain less information about the image and more about the class of the image. They still encode useful features, but they are less visually interpretable by us.
and what we can reconstruct from them via some kind of reverse "deconvolution" process (which is not a true mathematical deconvolution).
To answer your real question: there are a lot of good example solutions out there; one you can study with success is Visualizing output of convolutional layer in tensorflow.
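In practice, one common way to look at what the network "sees" is to build a second model that returns an intermediate layer's activations and plot those feature maps. A minimal sketch using a pretrained VGG16 (substitute your own trained CNN and a real preprocessed image):

import tensorflow as tf
import numpy as np

# A pretrained network just for illustration; substitute your own trained CNN.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False)

# Build a model that outputs the activations of one convolutional layer
# (block3_conv1 is the layer mentioned in the quote above).
feature_extractor = tf.keras.Model(
    inputs=base.input,
    outputs=base.get_layer("block3_conv1").output,
)

# A random array stands in for a real preprocessed photo of shape (1, 224, 224, 3).
image = np.random.rand(1, 224, 224, 3).astype("float32")
feature_maps = feature_extractor.predict(image)     # (1, 56, 56, 256)

# Each channel is one feature map; plot a few of them, e.g.:
# import matplotlib.pyplot as plt
# plt.imshow(feature_maps[0, :, :, 0], cmap="viridis"); plt.show()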
When you build a model to perform visual recognition, you give it labelled data (pictures, in this case) to recognize, so that it can adjust its weights according to the training data. If you wish to build a model that can recognize a car, you have to train it on a large training set of labelled pictures. This type of recognition is basically categorical recognition.
You can experiment with the MNIST dataset, which provides pictures of digits for image recognition.

training image datasets for object detection

Which version of YOLO-tensorflow (a customised CNN like GoogLeNet) is preferred for traffic science?
If the training dataset images are blurred or noisy, is that okay for training, or what steps should be taken when preparing the training dataset images?
You may need to curate your own dataset using frames from a traffic camera and manually tagging images with cars where the passengers' seatbelts are or are not buckled, as this is a very specialized task. From there, you can do data augmentation (perhaps using the Keras ImageDataGenerator class). If a human can identify a seatbelt in an image that is blurred or noisy, a model can learn from it. From there, you can use transfer learning from a pre-trained CNN model like Inception (this is a helpful tutorial for how to do that), or train your own binary classifier with your tagged images, where your inputs are frames of traffic camera video.
I'd suggest that only after learning the basics of CNNs with these models should you dive into a more complicated model like YOLO.
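Here is a rough sketch of the augmentation plus transfer-learning setup described above (the "frames/" directory layout, image size, and hyperparameters are assumptions, not a definitive recipe):

import tensorflow as tf

# Augment the tagged frames on the fly with Keras' ImageDataGenerator.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=tf.keras.applications.inception_v3.preprocess_input,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    validation_split=0.2,
)
train_gen = datagen.flow_from_directory(
    "frames/",                 # hypothetical folder with buckled/ and unbuckled/ subfolders
    target_size=(299, 299),
    class_mode="binary",
    subset="training",
)

# Transfer learning: reuse Inception's convolutional base and train a binary head.
base = tf.keras.applications.InceptionV3(include_top=False, pooling="avg",
                                         weights="imagenet")
base.trainable = False
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_gen, epochs=5)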