Vector representation in multidimentional time-series prediction in Tensorflow - tensorflow

I have a large data set (~30 million data-points with 5 features) that I have reduced using K-means down to 200,000 clusters. The data is a time-series with ~150,000 time-steps. The data on which I would like to train the model is the presence of particular clusters at each time-step. The purpose of the predictive model is generate a generalized sequence similar to generating syntactically correct sentences from a model trained on word sequences. The easiest way to think about this data is that I'm trying to predict the pixels in the next video frame from pixels in the current video frame in order to generate a new sequence of frames that approximate the original sequence.
The raw and sparse representation at each time-step would be 200,000 binary values representing which clusters are present or not at that time step. Note, no more than 200 clusters may be present in any one time-step and thus this representation is extremely sparse.
What is the best representation to convert this sparse vector to a dense vector that would be more suitable to time-series prediction using Tensorflow?
I initially had in mind a RNN / LSTM trained on the vectors at each time-step, but due to the size of the training vector I'm now wondering if a convolution approach would be more suitable.
Note, I have not actually used tensorflow beyond some simple tutorials, but have have previously used OpenCV ML functions. Please consider me a novice in your responses.
Thank you.

Related

Convolutional network returning a matrix with a image as input

I have been trying to code a model that looks at an image with a grid and returns a matrix with the contents of that grid.
Here is an example of the input image:
Input
And this should be the output:
[30202133333,
12022320321,
23103100322,
13103110301,
22221301212,
33100210001,
11012010320,
21230233011,
00330223230,
02121221220,
23133103321,
23110110330]
With 0: Blue, 1: Pink, 2: Lavender, 3: Green
I have a hard time finding resources on how to do this. What would be the simpelst way?
Thanks in advance!
There could be multiple design choices to generate this type of output. I suggest using Autoencoders.
Here is some information about Autoencoders taken from Wikipedia -
An autoencoder is a type of artificial neural network used to learn
efficient codings of unlabeled data (unsupervised learning).1 The
encoding is validated and refined by attempting to regenerate the
input from the encoding. The autoencoder learns a representation
(encoding) for a set of data, typically for dimensionality reduction,
by training the network to ignore insignificant data (“noise”).
While autoencoders are typically used to reconstruct the input, you have a slightly different problem of mapping the input to a specific matrix.
You'd want to set up the architecture by providing images as input and the corresponding matrices as your "labels." The architecture can be further optimized by using Convolutional layers instead of MLP layers.

What is the significance of normalization of data before feeding it to a ML/DL model?

I just started learning Deep Learning and was working with the Fashion MNIST data-set.
As a part of pre-processing the X-labels, the training and test images, dividing the pixel values by 255 is included as a part of normalization of the input data.
training_images = training_images/255.0
test_images = test_images/255.0
I understand that this is to scale down the values to [0,1] because neural networks are more efficient while handling such values. However, if I try to skip these two lines, my model predicts something entire different for a particular test_image.
Why does this happen?
Let's see both the scenarios with the below details.
1. With Unnormaized data:
Since your network is tasked with learning how to combine inputs through a series of linear combinations and nonlinear activations, the parameters associated with each input will exist on different scales.
Unfortunately, this can lead toward an awkward loss function topology which places more emphasis on certain parameter gradients.
Or in a simple definition as Shubham Panchal mentioned in comment.
If the images are not normalized, the input pixels will range from [ 0 , 255 ]. These will produce huge activation values ( if you're using ReLU ). After the forward pass, you'll end up with a huge loss value and gradients.
2. With Normalized data:
By normalizing our inputs to a standard scale, we're allowing the network to more quickly learn the optimal parameters for each input node.
Additionally, it's useful to ensure that our inputs are roughly in the range of -1 to 1 to avoid weird mathematical artifacts associated with floating-point number precision. In short, computers lose accuracy when performing math operations on really large or really small numbers. Moreover, if your inputs and target outputs are on a completely different scale than the typical -1 to 1 range, the default parameters for your neural network (ie. learning rates) will likely be ill-suited for your data. In the case of image the pixel intensity range is bound by 0 and 1(mean =0 and variance =1).

how to generate different samples using PixelCNN?

I am trying pixelcnn, which is auto-regressive generative model. After training, the model receive an all-zero tensor and generate the next pixel form the left top coner. Now that the model parameters are fixed, does the model only can produce the same outputs starting from the same zero tensor? How to produce different samples?
Yes, you always provide an all-zero tensor. However, for PixelCNN each pixel location is represented by a distribution. So when you do the forward pass you then sample from a random distribution at the end. That is how the pixel values are different each run.
This is of course because PixelCNN is a probabilistic neural network. So the pixels, as mentioned before, are all represented by conditional probability distributions of all the layers below, not just point estimates.

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates. But I don't want only the most probable future path but all the most probable paths to visualize it in a grid map.
For this I have traning data consisting of 40000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step to predict all the probable paths? To get all probable paths I have to apply a softmax in the end, where each cell in the grid is one class right? But how to process the data to reflect this grid like structure? Any ideas?
A softmax activation won't do the trick I'm afraid; if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you'll have loss of generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X (10,2,) will definitely be able to predict a sequence Y (6,2,). But since you want not just one but rather all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences, which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that; it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
Unfortunately, VAEs can't really be explained within a single paragraph over stackoverflow, and even if they could I wouldn't be the most qualified person to attempt it. Try searching the web for LSTM-VAE and arm yourself with patience; you'll probably need to do some studying but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic network libraries for python, better suited to the task at hand than Keras.

How can I evaluate FaceNet embeddings for face verification on LFW?

I am trying to create a script that is able to evaluate a model on lfw dataset. As a process, I am reading pair of images (using the LFW annotation list), track and crop the face, align it and pass it through a pre-trained facenet model (.pb using tensorflow) and extract the features. The feature vector size = (1,128) and the input image is (160,160).
To evaluate for the verification task, I am using a Siamese architecture. That is, I am passing a pair of images (same or different person) from two identical models ([2 x facenet] , this is equivalent like passing a batch of images with size 2 from a single network) and calculating the euclidean distance of the embeddings. Finally, I am training a linear SVM classifier to extract 0 when the embedding distance is small and 1 otherwise using pair labels. This way I am trying to learn a threshold to be used while testing.
Using this architecture I am getting a score of 60% maximum. On the other hand, using the same architecture on other models (e.g vgg-face), where the features are 4096 [fc7:0] (not embeddings) I am getting 90%. I definitely cannot replicate the scores that I see online (99.x%), but using the embeddings the score is very low. Is there something wrong with the pipeline in general ?? How can I evaluate the embeddings for verification?
Nevermind, the approach is correct, facenet model that is available online is poorly trained and that is the reason for the poor score. Since this model is trained on another dataset and not the original one that is described in the paper (obviously), verification score will be less than expected. However, if you set a constant threshold to the desired value you can probably increase true positives but by sacrificing f1 score.
You can use a similarity search engine. Either using approximated kNN search libraries such as Faiss or Nmslib, cloud-ready similarity search open-source tools such as Milvus, or production-ready managed service such as Pinecone.io.