Sorting a list of arbitrary size using attention / transformers? - tensorflow

Seq2seq neural network architectures can work with sequences of arbitrary size, either via iteration, as in RNNs, or via parallelism, as in Transformers and other attention (query/key/value) mechanisms. It is relatively easy to create a model that can be trained to find the maximum of a list. For instance, with LSTMs this 77-parameter model does the trick well:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

model = Sequential()
model.add(Dense(1, input_shape=(None, 1), activation='relu'))
model.add(LSTM(2, return_sequences=True, activation='relu'))
model.add(LSTM(2, return_sequences=False, activation='relu'))
model.add(Dense(1, activation='gelu'))
and it is surely possible to do it with an even smaller RNN. For attention, a 93-parameter model also does the job:
import tensorflow as tf

number = tf.keras.Input(shape=(None, 1))
tinput = tf.keras.layers.Dense(4)(number)
toutput = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=2)(tinput, tinput, tinput)
reduction = tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1))(toutput)
result = tf.keras.layers.Dense(1)(reduction)
model = tf.keras.Model(inputs=number, outputs=result)
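For context, a minimal sketch of how such a model might be fitted (the lengths, ranges, loss and number of steps are my own arbitrary choices, not part of the snippet above):

import numpy as np

model.compile(optimizer='adam', loss='mse')

for _ in range(200):
    length = np.random.randint(5, 50)                  # variable sequence length per batch
    x = np.random.uniform(0, 1, size=(256, length, 1))
    y = x.max(axis=1)                                  # target: the maximum of each list
    model.train_on_batch(x, y)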
Now, while LSTMs obviously do not have a mechanism to see the entire series and then produce an exact quartile function, a median, or a sorting of the whole list, the situation is different with attention. Could one in principle expect to see the median of a dataset and, perhaps, even the production of the full ordered series?
How should it be done? Do I need a complete transformer, using the decoder to produce the series? Or could we just assign a "position" to each element as the output of an encoder?
A problem I find when experimenting with transformers here is that they seem to learn, on one side, to recognise the input sequence and, on the other, to produce a "translated" output sequence, so the output always differs from the input at some decimal level. It is noticeable when you scale the input sequence, say from
tiempo=np.random.uniform(1,10000,size=(rows,cols))
to
tiempo=np.random.uniform(1,100,size=(rows,cols))
as it then needs to relearn, while a purely decision-based network would work with both inputs.
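For reference, a minimal sketch of the "assign a position to each element as output of an encoder" idea might look like the following; the layer sizes, the normalised-rank target and the training setup are my own assumptions, not a tested solution:

import numpy as np
import tensorflow as tf

seq = tf.keras.Input(shape=(None, 1))
h = tf.keras.layers.Dense(16)(seq)
h = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=8)(h, h, h)
rank = tf.keras.layers.Dense(1)(h)            # predicted position of each element
sorter = tf.keras.Model(seq, rank)
sorter.compile(optimizer='adam', loss='mse')

length = 20
x = np.random.uniform(0, 1, size=(1024, length, 1))
# target: normalised rank of each element within its own list
y = np.argsort(np.argsort(x, axis=1), axis=1) / (length - 1)
sorter.fit(x, y, epochs=5, batch_size=64)

Sorting would then amount to ordering the elements by their predicted positions; whether this generalises across input scales is exactly the open question above.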

Related

ML/DL Prediction on whole input rather than row by row

I have tabular data from a sensor measuring various features. When the sensor is "off" it reports zeros as values. I am training some machine learning models (kNN, XGBoost, and NN) for the purpose of classification. Here's the issue I am facing: I can train and predict on a row-by-row basis; however, it would be better to classify a range as a whole rather than row by row. A further issue is that the range can vary in size. For a very basic example, please see this diagram illustrating the range.
I have a basic Keras model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

num_classes = 4  # 4 classes, as described below

model = Sequential()
model.add(Dense(100, input_shape=(20,), activation='relu'))  # 20 features per row
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
print(model.summary())
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
And the training data is shaped with 20 features and 4 classes. How would I:
1.) Format my training data
2.) Shape input data to classify as a "whole" rather than row by row
3.) While this has been about using Keras, can the same input shaping/training be applied to XGBoost or a kNN?
I assume that the blue line in that graph represents your targets. Here is a fundamental issue I see with something like predicting the range as a whole instead of sample by sample.
Assuming that there is some reasonable logic that could collapse the range of samples into one (taking the mean of each feature, concatenation, or whatever...), you would obviously first need to identify the range itself. This range-identification step is, however, dependent on knowledge of the target (at least it seems that way based on the presented graph).
If the preprocessing step is dependent on the knowledge of the target, you would need to know the target for the test set as well before you could preprocess the data and make the predictions. In other words, you would need to know the outcome before you could make the prediction which would then be rather pointless.
You have stated that you are trying to perform classification, but your target seems to be continuous. I don't know what your classes are or what patterns they are associated with, but you would need to bin the target before you could start solving this as a classification problem. You would most likely lose a lot of information by doing this.
Therefore, I would start by solving it as a regression problem, trying to predict that continuous target for each sample. Once you have that, you can apply some pattern-matching logic to identify the class for a given sample/range (for example, you could slice the sequence of targets/predictions from the previous step, associate each slice with the desired class, and use this data as a new dataset for some classification algorithm).
As for the variable-length inputs: some deep learning architectures allow you to work with inputs of variable length, such as RNNs or adaptive pooling (see the sketch below). You may try this once you know how to predict the continuous target as mentioned before. Non-deep-learning algorithms usually expect all samples to have the same shape, so there is no general/automatic way of reusing the same input between them and deep learning models that work with variable-length input.
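To make the adaptive-pooling point concrete, here is a minimal sketch (my own, using the 20 features and 4 classes mentioned in the question; layer sizes are illustrative only) of a Keras model that pools over a variable-length range of rows and classifies the range as a whole:

import tensorflow as tf

num_features, num_classes = 20, 4

inputs = tf.keras.Input(shape=(None, num_features))       # (batch, variable range length, features)
x = tf.keras.layers.Dense(64, activation='relu')(inputs)  # per-row transformation
x = tf.keras.layers.GlobalAveragePooling1D()(x)           # collapse the variable-length axis
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])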

Keras variable input

I'm working through a Keras example at https://www.tensorflow.org/tutorials/text/text_generation
The model is built here:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
During training, they always pass in a length-100 array of ints.
But during prediction, they are able to pass in input of any length, and the output is the same length as the input. I was always under the impression that the number of time steps had to be the same. Is that not the case, and can the number of time steps of the RNN somehow change?
RNNs are sequence models, i.e. they take in a sequence of inputs and give out a sequence of outputs. The sequence length, also called the number of time steps, is the number of times the RNN cell is unrolled; for each unrolling an input is passed in and the RNN cell, using its gates, gives out an output (one per unrolling). So in theory you can have as long a sequence as you want. Now let's assume you have inputs of different sizes. Since you cannot have variable-size inputs within a single batch, you have to collect inputs of the same size to make a batch if you want to train in batches. You could also use a batch size of 1 and not worry about all this, but training becomes painfully slow.
In practical situations, while training we divide the inputs into same-size groups so that training becomes faster. There are situations, like language translation models, where this is not feasible.
So in theory RNNs do not have any limitation on the sequence length; however, long sequences will start to lose the context from the beginning as the sequence length increases.
At prediction time you can use any sequence length you want.
In your case the output length is the same as the input length because of return_sequences=True. You can also get a single output by using return_sequences=False, in which case only the output of the last unrolling is returned by Keras.
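As a quick illustration of the prediction-time point, here is a sketch with placeholder sizes (not the tutorial's values) showing that the same weights run on any sequence length:

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim, rnn_units = 100, 16, 32   # placeholder values
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.GRU(rnn_units, return_sequences=True),
    tf.keras.layers.Dense(vocab_size),
])

short = np.random.randint(0, vocab_size, size=(1, 10))
longer = np.random.randint(0, vocab_size, size=(1, 250))
print(model(short).shape)    # (1, 10, 100) -- output length follows input length
print(model(longer).shape)   # (1, 250, 100)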
The length of the training sequences does not have to equal the prediction length.
An RNN deals with two vectors: the new word and the hidden state (accumulated from the previous words). It does not store the length of the sequence.
But to get good predictions for long sequences, you have to train the RNN with long sequences, because the RNN has to learn a long context.

How do you decide on the dimensions for the activation layer in tensorflow

The tensorflow hub docs have this example code for text classification:
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras

hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                           output_shape=[50], input_shape=[], dtype=tf.string)
model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
I don't understand how we decide whether 16 is the right magic number for the relu layer. Can someone explain this, please?
The choice of 16 units in the hidden layer is not a uniquely determined magic value. Like Shubham commented, it's all about experimenting and finding values that work well for your problem. Here is some folklore to guide your experimentation:
The usual range for the number of units in hidden layers is tens to thousands.
Powers of two may utilize specific hardware (like GPUs) more effectively.
Simple feed-forward networks like the one above often decrease the number of units between successive layers. A commonly cited intuition is to progress from many basic features to fewer, more abstract ones. (Hidden layers tend to produce dense representations like embeddings, not discrete features, but the reasoning applies analogously to the dimension of the feature space.)
The code snippet above does not show regularization. When testing whether more hidden units help, watch out for the gap between training and validation quality. A widening gap may indicate the need to regularize more.
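If it helps, one simple way to run that experiment is a small sweep over candidate widths; this is only a sketch, and train_data/train_labels stand in for whatever dataset you are actually using:

from tensorflow import keras

# Sketch of a width sweep; train_data / train_labels are placeholders for your data.
for units in [8, 16, 32, 64]:
    model = keras.Sequential([
        hub_layer,                                     # the TF-Hub text embedding from above
        keras.layers.Dense(units, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(train_data, train_labels,
                        epochs=10, validation_split=0.2, verbose=0)
    print(units, max(history.history['val_accuracy']))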

BatchNormalization Implementation in Keras (TF backend) - Before or After Activation?

Consider the following code snippet
from keras import models, layers
from keras.layers import BatchNormalization

model = models.Sequential()
model.add(layers.Dense(256, activation='relu')) # Layer 1
model.add(BatchNormalization())
model.add(layers.Dense(128, activation='relu')) # Layer 2
I am using Keras with Tensorflow backend.
My question is - Is BN performed before or after activation function in Keras's implementation?
To add more clarity,
Whether BN SHOULD be applied before or after the activation is subject to debate; the original paper (Ioffe and Szegedy 2015) suggests "BEFORE", but comments in the thread below show diverse opinions.
Ordering of batch normalization and dropout?
In Keras documentation (https://keras.io/layers/normalization/), it says
"Normalize the activations of the previous layer at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1."
Keras's doc seems to suggest that BN is applied AFTER the activation (i.e. in the example code above, BN is applied after 'relu' in layer 1). I would like to confirm whether this is the case.
In addition, is it possible to configure whether BN is applied before or after activation function?
Thanks!
Whether to add BatchNorm before or after the activation is still an open debate. The original version suggested by the authors works well and has been used in many implementations. But many people have found that BN after the activation also works well and helps with faster convergence. For example, check the discussion in this thread.
In short, it depends on the task! Which one is going to perform better? You have to check that for yourself. And yes, you can control the order. For example:
x = Conv2D(64, (3,3), activation=None)(inputs)
x = BatchNormalization()(x)
x = Activation("relu")(x)
or
x = Conv2D(64, (3,3), activation="relu")(inputs)
x = BatchNormalization()(x)
In addition to the original paper using batch normalization before the activation, Bengio's book Deep Learning, section 8.7.1 gives some reasoning for why applying batch normalization after the activation (or directly before the input to the next layer) may cause some issues:
It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW+b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW+b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparameterization. The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.
In other words, if we use a relu activation, all negative values are mapped to zero. This will likely result in a mean value that is already very close to zero, but the distribution of the remaining data will be heavily skewed to the right. Trying to normalize that data to a nice bell-shaped curve probably won't give the best results. For activations outside of the relu family this may not be as big of an issue.
Some report better results when placing batch normalization after activation, while others get better results with batch normalization before activation. It is probably best to test your model using both configurations, and if batch normalization after activation gives a significant decrease in validation loss, use that configuration instead.
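A quick numerical sketch of that skew argument, using a standard normal as a stand-in for the pre-activation distribution:

import numpy as np

pre = np.random.normal(0.0, 1.0, size=1_000_000)   # stand-in pre-activations
post = np.maximum(pre, 0.0)                        # relu

print("fraction exactly zero:", (post == 0.0).mean())                 # about 0.5
print("skewness:", ((post - post.mean())**3).mean() / post.std()**3)  # clearly positive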

Pattern recognition on sphere (HEALPY based)

I am using TensorFlow and Keras. Is there a way to achieve proper pattern recognition for images on the surface of a sphere? I am using the Healpy framework to create the skymaps on which the pattern recognition should work. The problem is that these Healpy skymaps are one-dimensional numpy arrays, so a compact sub-pattern may end up scattered over this 1D array. This is actually pretty hard to learn for a basic machine learning algorithm (I am thinking about a convolutional deep network).
A specific task in this context would be counting blobs on the surface of a sphere (see attached image). For this particular task the correct number would be 8. So I created 10,000 skymaps (Healpy settings: nside=16, corresponding to npix=3072), each with a random number of blobs between 0 and 9 (thus 10 possibilities). I tried to solve this with the 1D Healpy array and a simple feed-forward network:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(npix, input_dim=npix, kernel_initializer='uniform', activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(10, kernel_initializer='uniform', activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(skymaps, number_of_correct_sources, batch_size=100, epochs=10, validation_split=1.-train)
However, after training with 10,000 skymaps, the test set yielded an accuracy of only 38%. I guess this will increase significantly when the real arrangement of the Healpy cells (as it appears on the sphere) is provided instead of the 1D array only. In that case one could use a convolutional network (Convolution2D) and proceed as for usual image recognition. Any ideas on how to map the Healpy cells properly onto a 2D array, or on using a convolutional network directly on the sphere?
Thanks!
This is a hard way of tackling a relatively simple problem that is unashamedly 2-D!
If the objects you are looking for are as prominent as those in your figure, create the 2-D map for the data and then threshold it at a series of threshold levels: the highest thresholds pick out the brightest objects. Any continuous projection like Aitoff or Hammer will do, and to eliminate the edge problems, use rotations of the projection. Segmented projections, like Healpix, are good for data storage, but not necessarily ideal for data analysis.
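As a rough sketch of that projection-plus-threshold idea (with a random map, a Cartesian projection and an arbitrary threshold standing in for the real data and levels), the counting itself can then be done with connected-component labelling:

import numpy as np
import healpy as hp
from scipy import ndimage

nside = 16
skymap = np.random.rand(hp.nside2npix(nside))   # stand-in for the real skymap

# project the 1-D HEALPix array onto a 2-D grid (any continuous projection works)
proj = hp.projector.CartesianProj(xsize=400)
image = proj.projmap(skymap, lambda x, y, z: hp.vec2pix(nside, x, y, z))

threshold = 0.9                                  # placeholder level
labels, n_blobs = ndimage.label(image > threshold)
print("regions above threshold:", n_blobs)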
If the map has poor signal to noise so that you are looking for objects in the murk of the noise, then some sophistication is required, maybe even some neural net algorithm. However, you might take a look at the Planck data analysis on Sunyaev-Zeldovich galaxy clusters, the earliest of which is perhaps https://arxiv.org/abs/1101.2024 (Paper VIII). The subsequent papers refine and add to this.
(This should have been a comment but I lack the rep.)