Convolutional network returning a matrix with an image as input - tensorflow

I have been trying to code a model that looks at an image with a grid and returns a matrix with the contents of that grid.
Here is an example of the input image:
[input image]
And this should be the output:
[30202133333,
12022320321,
23103100322,
13103110301,
22221301212,
33100210001,
11012010320,
21230233011,
00330223230,
02121221220,
23133103321,
23110110330]
With 0: Blue, 1: Pink, 2: Lavender, 3: Green
I have a hard time finding resources on how to do this. What would be the simplest way?
Thanks in advance!

There could be multiple design choices to generate this type of output. I suggest using Autoencoders.
Here is some information about Autoencoders, taken from Wikipedia:
An autoencoder is a type of artificial neural network used to learn
efficient codings of unlabeled data (unsupervised learning). The
encoding is validated and refined by attempting to regenerate the
input from the encoding. The autoencoder learns a representation
(encoding) for a set of data, typically for dimensionality reduction,
by training the network to ignore insignificant data (“noise”).
While autoencoders are typically used to reconstruct the input, you have a slightly different problem of mapping the input to a specific matrix.
You'd want to set up the architecture by providing images as input and the corresponding matrices as your "labels." The architecture can be further optimized by using Convolutional layers instead of MLP layers.
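A rough sketch of such a model (assuming, say, 120x110 RGB inputs and the 12x11 grid of 4 colours from the question; adjust the pooling to your real image size) could downsample the image until the spatial dimensions match the grid, then predict a per-cell softmax:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch: downsample a 120x110 image to a 12x11 grid, then classify each
# cell into one of the 4 colours with a per-cell softmax.
inputs = tf.keras.Input(shape=(120, 110, 3))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=(10, 10))(x)   # 120x110 -> 12x11
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
outputs = layers.Conv2D(4, 1, activation="softmax")(x)  # shape (12, 11, 4)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Labels would then be integer matrices of shape (12, 11) with values in {0, 1, 2, 3}.
```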

Related

How to generate different samples using PixelCNN?

I am trying PixelCNN, which is an auto-regressive generative model. After training, the model receives an all-zero tensor and generates the next pixel starting from the top-left corner. Now that the model parameters are fixed, can the model only produce the same output starting from the same zero tensor? How can it produce different samples?
Yes, you always provide an all-zero tensor. However, for PixelCNN each pixel location is represented by a distribution, so during the forward pass you sample from the predicted distribution for each pixel. That is how the pixel values come out different on each run.
This is because PixelCNN is a probabilistic neural network: each pixel, as mentioned before, is represented by a conditional probability distribution over intensity values (conditioned on the pixels generated before it), not just a point estimate.
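For illustration, that sampling loop could be sketched like this (assuming `model` outputs per-pixel logits over 256 intensity values for a single-channel h x w image):

```python
import numpy as np
import tensorflow as tf

def sample(model, h=28, w=28):
    img = np.zeros((1, h, w, 1), dtype="float32")   # the all-zero start tensor
    for i in range(h):
        for j in range(w):
            logits = model(img)[0, i, j]            # distribution for pixel (i, j)
            # Sampling (instead of taking the argmax) is what makes each
            # generated image different from run to run.
            pixel = tf.random.categorical(logits[None, :], num_samples=1)
            img[0, i, j, 0] = float(pixel[0, 0]) / 255.0
    return img
```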

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates. But I don't want only the most probable future path; I want all the most probable paths, so I can visualize them in a grid map.
For this I have training data consisting of 40000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step to predict all the probable paths? To get all probable paths, I would have to apply a softmax at the end, where each cell in the grid is one class, right? But how do I process the data to reflect this grid-like structure? Any ideas?
A softmax activation won't do the trick, I'm afraid; if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you'll lose generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X (10,2,) will definitely be able to predict a sequence Y (6,2,). But since you want not just one but rather all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences, which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that; it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
Unfortunately, VAEs can't really be explained within a single paragraph on Stack Overflow, and even if they could I wouldn't be the most qualified person to attempt it. Try searching the web for LSTM-VAE and arm yourself with patience; you'll probably need to do some studying, but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic programming libraries for Python, better suited to the task at hand than Keras.
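For orientation, a bare-bones LSTM-VAE in Keras could look roughly like this (shapes follow the question: input (10, 2), output (6, 2); the latent size of 16 and LSTM width of 64 are arbitrary assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    # Reparameterisation trick; also registers the KL term as a loss.
    def call(self, inputs):
        mean, log_var = inputs
        kl = -0.5 * tf.reduce_mean(1 + log_var - tf.square(mean) - tf.exp(log_var))
        self.add_loss(kl)
        eps = tf.random.normal(tf.shape(mean))
        return mean + tf.exp(0.5 * log_var) * eps

enc_in = tf.keras.Input(shape=(10, 2))                # observed trajectory
h = layers.LSTM(64)(enc_in)
z = Sampling()([layers.Dense(16)(h), layers.Dense(16)(h)])  # mean, log-var

d = layers.RepeatVector(6)(z)                         # seed the decoder with z
d = layers.LSTM(64, return_sequences=True)(d)
dec_out = layers.TimeDistributed(layers.Dense(2))(d)  # predicted (6, 2) path

vae = tf.keras.Model(enc_in, dec_out)
vae.compile(optimizer="adam", loss="mse")
# After vae.fit(X, Y), repeated predictions on the same input sample
# different z's and therefore yield different plausible trajectories.
```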

How to input scalar (non-image) values to CNNs?

The classification task is based on an image and a scalar value.
If I encoded the scalar value as image pixels with that value (or a normalized version of it) and appended it as another channel of the input image, I would be wasting convolutional computation cycles over the encoding just to get this information into the network.
On the other hand, I could feed this as another neuron into the layer where the flattening of the convolved feature maps occurs. Another option would be adding it just before the output layer. (But how do I implement such a network in Keras or TensorFlow?)
Which is the best method to send in scalar values?
PS: Although this question is not specific to any framework, Keras examples would be great, as they are simple enough for most people to understand. Links to blogs addressing the same are welcome too.
See this question and answer on the Cross Validated site: Combining image and scalar inputs into a neural network
In addition to the "bias" method suggested in the paper mentioned there (where the scalar is used as a bias to some convolution layer), and the option suggested in the answer of appending the scalar to some flattened layer, you can also use an inner product (fully connected, "Dense" in Keras) layer to learn the connectivity pattern between the N-D input and the scalar.
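A minimal Keras sketch of the "append at the flattened layer" option (the 64x64x3 input size, layer widths, and 10-class output are arbitrary assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

img_in = tf.keras.Input(shape=(64, 64, 3))       # image branch
x = layers.Conv2D(32, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

scalar_in = tf.keras.Input(shape=(1,))           # the extra scalar value
x = layers.Concatenate()([x, scalar_in])         # inject it after flattening
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model([img_in, scalar_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit([images, scalars], labels, ...)
```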

How can I evaluate FaceNet embeddings for face verification on LFW?

I am trying to create a script that is able to evaluate a model on the LFW dataset. As a process, I am reading pairs of images (using the LFW annotation list), tracking and cropping the face, aligning it, passing it through a pre-trained FaceNet model (a .pb file, using TensorFlow) and extracting the features. The feature vector size is (1, 128) and the input image is (160, 160).
To evaluate for the verification task, I am using a Siamese architecture. That is, I am passing a pair of images (same or different person) through two identical models ([2 x FaceNet]; this is equivalent to passing a batch of two images through a single network) and calculating the euclidean distance of the embeddings. Finally, I am training a linear SVM classifier to output 0 when the embedding distance is small and 1 otherwise, using the pair labels. This way I am trying to learn a threshold to be used while testing.
Using this architecture I am getting a maximum score of 60%. On the other hand, using the same architecture on other models (e.g. VGG-Face), where the features are 4096-dimensional [fc7:0] (not embeddings), I am getting 90%. I definitely cannot replicate the scores that I see online (99.x%), but using the embeddings the score is very low. Is there something wrong with the pipeline in general? How can I evaluate the embeddings for verification?
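In code, that distance-plus-SVM step looks roughly like this (a sketch with placeholder embeddings and labels):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Euclidean distance per embedding pair, then a linear SVM on top,
# which effectively learns the decision threshold.
emb_a = np.random.rand(500, 128)        # embeddings of the first images
emb_b = np.random.rand(500, 128)        # embeddings of the second images
labels = np.random.randint(0, 2, 500)   # 0 = same person, 1 = different

dist = np.linalg.norm(emb_a - emb_b, axis=1).reshape(-1, 1)
clf = LinearSVC().fit(dist, labels)
print(clf.predict(dist[:5]))
```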
Never mind, the approach is correct; the FaceNet model that is available online is poorly trained, and that is the reason for the poor score. Since this model is trained on another dataset and not the original one described in the paper (obviously), the verification score will be less than expected. However, if you set a constant threshold to the desired value you can probably increase true positives, at the cost of F1 score.
You can use a similarity search engine: either an approximate kNN search library such as Faiss or Nmslib, a cloud-ready open-source similarity search tool such as Milvus, or a production-ready managed service such as Pinecone.io.
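For example, with Faiss the lookup could be sketched as follows (placeholder embeddings; IndexFlatL2 is the exact brute-force variant, so swap in an approximate index for large galleries):

```python
import numpy as np
import faiss

# Index 128-d face embeddings and query nearest neighbours by L2 distance.
d = 128
index = faiss.IndexFlatL2(d)
gallery = np.random.rand(1000, d).astype("float32")  # placeholder embeddings
index.add(gallery)

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)              # 5 nearest gallery faces
print(ids[0], distances[0])
```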

Vector representation in multidimensional time-series prediction in Tensorflow

I have a large data set (~30 million data-points with 5 features) that I have reduced using K-means down to 200,000 clusters. The data is a time-series with ~150,000 time-steps. The data on which I would like to train the model is the presence of particular clusters at each time-step. The purpose of the predictive model is to generate a generalized sequence, similar to generating syntactically correct sentences from a model trained on word sequences. The easiest way to think about this data is that I'm trying to predict the pixels in the next video frame from the pixels in the current video frame, in order to generate a new sequence of frames that approximates the original sequence.
The raw and sparse representation at each time-step would be 200,000 binary values representing which clusters are present or not at that time step. Note, no more than 200 clusters may be present in any one time-step and thus this representation is extremely sparse.
What is the best representation to convert this sparse vector to a dense vector that would be more suitable to time-series prediction using Tensorflow?
I initially had an RNN/LSTM in mind, trained on the vectors at each time-step, but due to the size of the training vector I'm now wondering if a convolutional approach would be more suitable.
Note, I have not actually used TensorFlow beyond some simple tutorials, but I have previously used OpenCV ML functions. Please consider me a novice in your responses.
Thank you.
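One common way to convert such a multi-hot vector into a dense one is to average learned embeddings over the active cluster ids; a minimal sketch (the 64-dimensional embedding size is an arbitrary assumption, and the ids below are placeholders):

```python
import tensorflow as tf

# Densify a time-step: instead of a 200,000-wide binary vector, look up a
# learned embedding for each active cluster id and average them.
num_clusters = 200_000
embeddings = tf.Variable(tf.random.normal([num_clusters, 64]))

active_ids = tf.constant([[17, 523, 199998]])   # clusters present at this step
dense = tf.reduce_mean(tf.nn.embedding_lookup(embeddings, active_ids), axis=1)
print(dense.shape)  # (1, 64): a dense per-time-step vector for an RNN/LSTM
```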