Pattern recognition on sphere (HEALPY based) - tensorflow

I am using Tensorflow and Keras. Is there a possibility to achieve a proper pattern recognition for images on the surface of a sphere? I am using the (Healpy framework) to create my skymaps on which the pattern recognition should work. The problem is that these Healpy skymaps are one dimensional numpy arrays, thus, a compact sub-pattern may be distributed scattered over this 1d array. This is actually pretty hard to learn for a basic machine learning algorithm (i am thinking about a convolutional deep network).
A specific task in this context would be counting blobbs on the surface of a sphere (see attached image). For this particular task the correct number would be 8. So I created 10000 skymaps (Healpy settings: nside=16 correpsonding to npix=3072) each with a random number of blobbs between 0 and 9 (thus 10 possibilities). I tried to solve this with the 1d Healpy array and a simple Feed Forward network:
model = Sequential()
model.add(Dense(npix, input_dim=npix, init='uniform', activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(10, init='uniform', activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(skymaps, number_of_correct_sources, batch=100, epochs=10, validation_split=1.-train)
, however, after training with 10,000 skymaps the test set yielded an accuracy of only 38%. I guess that this will significantly increase when providing the real arrangement of the Healpy cells (as it appears on the sphere) instead of the 1d array only. In this case one may use a Convolutional network (Convolution2d) and operate as for the usual image recognition. Any ideas how to map the healpy cells properly in a 2d array or using a convolutional network directly on the sphere?
Thanks!

This is a hard way of tackling a relatively simple problem that is unashamedly 2-D!
If the objects you are looking for are as prominent as those in your figure, create the 2_d map for the data and then threshold it for a series of threshold levels: the highest thresholds pick out the brightest objects. Any continuous projection like Aitoff or Hammmer will do, and to eliminate the edge problems, use rotations of the projection. Segmented projections, like Healpix, are good for data storage, but not necessarily ideal for data analysis.
If the map has poor signal to noise so that you are looking for objects in the murk of the noise, then some sophistication is required, maybe even some neural net algorithm. However, you might take a look at the Planck data analysis on Sunyaev-Zeldovich galaxy clusters, the earliest of which is perhaps https://arxiv.org/abs/1101.2024 (Paper VIII). The subsequent papers refine and add to this.
(This should have been a comment but I lack the rep.)

Related

ML/DL Prediction on whole input rather than row by row

I have tabular data from a sensor measuring various features. When the sensor is "off" it will report zero as values. I am training some machine learning models kNN, XGBoost, and NN for the purpose of classification. Here's the issue I am facing: I can train and predict on a row by row basis; however it would be better to classify a range as whole rather than a row by row basis. Another issue to this is that the range can vary in size. For a very basic example, please see this diagram illustrating the range.
I have a basic Keras model:
model = Sequential()
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
print(model.summary())
model.compile(loss='categorical_crossentropy',
optimizer='adam', metrics=['accuracy'])
And the training data is shaped with 20 features and 4 classes. How would I:
1.) Format my training data
2.) Shape input data to classify as a "whole" rather than row by row
3.) While this has been talking about using keras. Can the same input shaping/training be applied to XGBoost or a kNN?
I assume that the blue line in that graph represents your targets. Here is a fundamental issue I see with something like predicting the range as a whole instead of sample by sample.
Assuming that there is some reasonable logic that could collapse the range of samples into one (taking mean per each feature, or concatenation, or whatever...), obviously you would first need to identify the range itself. This range identification step is however dependent on the knowledge of target (at least it seems like that based on the presented graph).
If the preprocessing step is dependent on the knowledge of the target, you would need to know the target for the test set as well before you could preprocess the data and make the predictions. In other words, you would need to know the outcome before you could make the prediction which would then be rather pointless.
You have stated that you are trying to perform classification but your target seems to be continuous. I don't know what your classes are or what patterns they are associated with but you would need to bin the target before you could start solving this as a classification problem. You would most likely lose a lot of information by doing this.
Therefore, I would start by solving it as a regression problem. Trying to predict that continuous target for each sample. Once you have that, you can apply some patter matching logic to identify the class for a given sample/range (for example, you could slice the sequence of targets/predictions from the previous step, associate each slice with the desired class and use this data as a new dataset for some classification algorithm).
As for the variable length inputs. Some deep learning architectures allow you to work with input of variable length, such as RNNs or adaptive pooling. You may try to do this one you know how to predict the continuous target as mentioned before. Non-deep-learning algorithms usually expect all samples to have the same shape so there is no general/automatic way of reusing the same input between them and deep learning algorithms that work with input of variable length.

How do you decide on the dimensions for a the activation layer in tensorflow

The tensorflow hub docs have this example code for text classification:
hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1", output_shape=[50],
input_shape=[], dtype=tf.string)
model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
I don't understand how we decide if 16 is the right magic number for the relu layer. Can someone explain this please.
The choice of 16 units in the hidden layer is not a uniquely determined magic value. Like Shubham commented, it's all about experimenting and finding values that work well for your problem. Here is some folklore to guide your experimentation:
The usual range for the number of units in hidden layers is tens to thousands.
Powers of two may utilize specific hardware (like GPUs) more effectively.
Simple feed-forward networks like the one above often decrease the number of units between successive layers. A commonly cited intuition is to progress from many basic features to fewer, more abstract ones. (Hidden layers tend to produce dense representations like embeddings, not discrete features, but the reasoning applies analogously to the dimension of the feature space.)
The code snippet above does not show regularization. When trying whether more hidden units help, watch out for the gap between training and validation quality. A widening gap may indicate the need to regularize more.

Neural network immediately overfitting

I have a FFNN with 2 hidden layers for a regression task that overfits almost immediately (epoch 2-5, depending on # hidden units). (ReLU, Adam, MSE, same # hidden units per layer, tf.keras)
32 neurons:
128 neurons:
I will be tuning the number of hidden units, but to limit the search space I would like to know what the upper and lower bounds should be.
Afaik it is better to have a too large network and try to regularize via L2-reg or dropout than to lower the network's capacity -- because a larger network will have more local minima, but the actual loss value will be better.
Is there any point in trying to regularize (via e.g. dropout) a network that overfits from the get-go?
If so I suppose I could increase both bounds. If not I would lower them.
model = Sequential()
model.add(Dense(n_neurons, 'relu'))
model.add(Dense(n_neurons, 'relu'))
model.add(Dense(1, 'linear'))
model.compile('adam', 'mse')
Hyperparameter tuning is generally the hardest step in ML, In general we try different values randomly and evalute the model and choose those set of values which give the best performance.
Getting back to your question, You have a high varience problem (Good in training, bad in testing).
There are eight things you can do in order
Make sure your test and training distribution are same.
Make sure you shuffle and then split the data into two sets (test and train)
A good train:test split will be 105:15K
Use a deeper network with Dropout/L2 regularization.
Increase your training set size.
Try Early Stopping
Change your loss function
Change the network architecture (Switch to ConvNets, LSTM etc).
Depending on your computation power and time you can set a bound to the number of hidden units and hidden layers you can have.
because a larger network will have more local minima.
Nope, this is not quite true, in reality as the number of input dimension increases the chance of getting stuck into a local minima decreases. So We usually ignore the problem of local minima. It is very rare. The derivatives across all the dimensions in the working space must be zero for a local/global minima. Hence, it is highly unlikely in a typical model.
One more thing, I noticed you are using linear unit for last layer. I suggest you to go for ReLu instead. In general we do not need negative values in regression. It will reduce test/train error
Take this :
In MSE 1/2 * (y_true - y_prediction)^2
because y_prediction can be nagative value. The whole MSE term may blow up to large values as y_prediction gets highly negative or highly positive.
Using a ReLu for last layer makes sure that y_prediction is positive. Hence low error will be expected.
Let me try to substantiate some of the ideas here, referenced from Ian Goodfellow et. al. Deep Learning book which is available for free online:
Chapter 7: Regularization The most important point is data, one can and should avoid regularization if they have large amounts of data that best approximate the distribution. In you case, it looks like there might be a significant discrepancy between training and test data. You need to ensure the data is consistent.
Section 7.4: Data-augmentation With regards to data, Goodfellow talks about data-augmentation and inducing regularization by injecting noise (most likely Gaussian) which mathematically has the same effect. This noise works well with regression tasks as you limit the model from latching onto a single feature to overfit.
Section 7.8: Early Stopping is useful if you just want a model with the best test error. But again this only works if your data allows the training to infer the test data. If there is an immediate increase in test error the training would stop immediately.
Section 7.12: Dropout Just applying dropout to a regression model doesn't necessarily help. In fact "when extremely few labeled training examples are available, dropout is less effective". For classification, dropout forces the model to not rely on single features, but in regression all inputs might be required to compute a value rather than classify.
Chapter 11: Practicals emphasises the use of base models to ensure that the training task is not trivial. If a simple linear regression can achieve similar behaviour than you don't even have a training problem to begin with.
Bottom line is you can't just play with the model and hope for the best. Check the data, understand what is required and then apply the corresponding techniques. For more details read the book, it's very good. Your starting point should be a simple regression model, 1 layer, very few neurons and see what happens. Then incrementally experiment.

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates. But I don't want only the most probable future path but all the most probable paths to visualize it in a grid map.
For this I have traning data consisting of 40000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step to predict all the probable paths? To get all probable paths I have to apply a softmax in the end, where each cell in the grid is one class right? But how to process the data to reflect this grid like structure? Any ideas?
A softmax activation won't do the trick I'm afraid; if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you'll have loss of generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X (10,2,) will definitely be able to predict a sequence Y (6,2,). But since you want not just one but rather all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences, which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that; it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
Unfortunately, VAEs can't really be explained within a single paragraph over stackoverflow, and even if they could I wouldn't be the most qualified person to attempt it. Try searching the web for LSTM-VAE and arm yourself with patience; you'll probably need to do some studying but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic network libraries for python, better suited to the task at hand than Keras.

Detection Text from natural images

I write a code in tensorflow by using convolution neural network to detect the text from images. I used TFRecords file to read the street view text dataset, then, I resized the images to 128 for height and width.
I used 9-conv layer with zero padding and three max_pool layer with window size of (2×2) and stride of 2. Since I use just three pooling layer, the last layer shape will be (16×16). the last conv layer has '256' filters.
I used too, two regression fully connected layers (tf.nn.sigmoid) and tf.losses.mean_squared_error as a loss function.
My question is
is this architecture enough for detection process?? I know there is something call NMS for detection. Also what is the label in this case??
In general and this not a rule , it's just based on my experience, you should start with a smaller net 2 or 3 conv layer, and say what happens, if you get some good result focus more on the winning topology and adapt the hyperparameters ( learnrat, batchsize and so one ) , if you don't get good result at all go deep meaning add conv layer. and evaluate again. 12 conv is really huge , your problem complexity should be huge too ! otherwise you wil reach a good accuracy but waste a lot computer power and time for nothing ! and by the way use pyramid form meaning start wider and finish tiny