ML/DL Prediction on whole input rather than row by row - tensorflow

I have tabular data from a sensor measuring various features. When the sensor is "off" it reports zero as its values. I am training several machine learning models (kNN, XGBoost, and a NN) for classification. Here's the issue I am facing: I can train and predict on a row-by-row basis; however, it would be better to classify a range as a whole rather than row by row. A further complication is that the range can vary in size. For a very basic example, please see this diagram illustrating the range.
I have a basic Keras model:
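from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

num_classes = 4  # the question states 4 classes
model = Sequential()
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
# The start of the compile call was cut off in the post; the loss below is an assumed choice for softmax outputs.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])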
And the training data is shaped with 20 features and 4 classes. How would I:
1.) Format my training data
2.) Shape input data to classify as a "whole" rather than row by row
3.) This question is framed around Keras, but can the same input shaping/training be applied to XGBoost or kNN?

I assume that the blue line in that graph represents your targets. Here is a fundamental issue I see with something like predicting the range as a whole instead of sample by sample.
Assuming that there is some reasonable logic that could collapse the range of samples into one (taking the mean of each feature, or concatenation, or whatever...), you would obviously first need to identify the range itself. This range identification step, however, depends on knowledge of the target (at least it seems so based on the presented graph).
If the preprocessing step depends on knowledge of the target, you would need to know the target for the test set as well before you could preprocess the data and make the predictions. In other words, you would need to know the outcome before you could make the prediction, which would be rather pointless.
You have stated that you are trying to perform classification but your target seems to be continuous. I don't know what your classes are or what patterns they are associated with but you would need to bin the target before you could start solving this as a classification problem. You would most likely lose a lot of information by doing this.
Therefore, I would start by solving it as a regression problem, trying to predict that continuous target for each sample. Once you have that, you can apply some pattern matching logic to identify the class for a given sample/range (for example, you could slice the sequence of targets/predictions from the previous step, associate each slice with the desired class, and use this data as a new dataset for some classification algorithm).
As for the variable-length inputs: some deep learning architectures, such as RNNs or networks with adaptive pooling, allow you to work with input of variable length. You may try this once you know how to predict the continuous target as mentioned before. Non-deep-learning algorithms usually expect all samples to have the same shape, so there is no general/automatic way of reusing the same input between them and deep learning algorithms that work with input of variable length.
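The "mean of each feature" collapse mentioned above can be sketched as follows. This is a minimal illustration that assumes the range boundaries are already known (which, as noted, is the hard part); the function name `collapse_range` and the toy data are hypothetical. It shows how ranges of different lengths end up as rows of the same shape, which is what kNN or XGBoost would need.

```python
from statistics import mean


def collapse_range(rows):
    """Collapse a variable-length list of feature rows into one fixed-length row.

    Each row is a list of per-feature values; the result holds the
    per-feature mean, so its length equals the number of features
    regardless of how many rows the range contains.
    """
    if not rows:
        raise ValueError("a range must contain at least one row")
    # zip(*rows) regroups the values column by column (one tuple per feature).
    return [mean(column) for column in zip(*rows)]


# Two hypothetical ranges of different length, 3 features each.
short_range = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
long_range = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0], [6.0, 7.0, 8.0]]

collapsed_short = collapse_range(short_range)
collapsed_long = collapse_range(long_range)
print(collapsed_short, collapsed_long)
```

Averaging discards the ordering inside a range, so it only makes sense if the class does not depend on the shape of the signal over time.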


Sorting a list of arbitrary size using attention / transformers?

Seq2Seq neural network architectures can work with sequences of arbitrary size either via iteration, as in RNNs, or parallelism, as in Transformers or other Attention (Query/Key/Value) mechanisms. It is relatively easy to create a model that can be trained to find the maximum of a list. For instance, with LSTM this 77-parameter model does the trick well:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

model = Sequential()
model.add(Dense(1, input_shape=(None,1), activation='relu'))
model.add(LSTM(2, return_sequences=True, activation='relu'))
model.add(LSTM(2, return_sequences=False, activation='relu'))
model.add(Dense(1, activation='gelu'))
and surely it is possible to do it with an even smaller RNN. For Attention, a 93-parameter model also does the job:
reduction=tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1))(toutput)
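As a sanity check on the 77-parameter figure quoted for the LSTM model above, the count can be reproduced by hand. The sketch assumes the standard formulas for a biased dense layer and a standard LSTM layer with four gates; the helper names are hypothetical.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


def lstm_params(inputs, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (inputs + units) + units)


layer_counts = [
    dense_params(1, 1),  # Dense(1) on 1 input feature
    lstm_params(1, 2),   # LSTM(2) fed by the Dense(1) output
    lstm_params(2, 2),   # LSTM(2) fed by the first LSTM
    dense_params(2, 1),  # Dense(1) output
]
total = sum(layer_counts)
print(layer_counts, total)
```

Under these assumptions the four layers contribute 2, 32, 40 and 3 parameters.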
Now, while LSTMs obviously do not have mechanisms to see the entire series and then produce an exact quartile function, a median, or a sorting of the whole list, the situation is different with Attention. One could in principle see the median of a dataset and, perhaps, even produce the full ordered series?
How should it be done? Do I need a complete transformer, using the decoder to produce the series? Or could we just assign a "position" to each element, as the output of an encoder?
A problem I find when experimenting with transformers here is that they seem to learn, on the one hand, to recognise the input sequence and, on the other, to produce a "translated" output sequence, so the output always differs from the input at some decimal level. It is noticeable when you scale the input sequence, say from
as then it needs to relearn, while a pure decision based network would work with both inputs.
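To make the "assign a position to each element" option concrete, here is a minimal sketch of the target encoding only, not of a trained model. The helpers `ranks` and `sort_with_ranks` are hypothetical names. If an encoder were trained to emit one rank per input element, the sorted list could be rebuilt by copying the inputs into place, so the output values would match the inputs exactly and the targets would not change when the input is rescaled.

```python
def ranks(values):
    """Return, for each element, its position in the sorted list."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result


def sort_with_ranks(values, predicted_ranks):
    """Rebuild the sorted list by placing each input at its rank."""
    output = [None] * len(values)
    for value, position in zip(values, predicted_ranks):
        output[position] = value
    return output


sequence = [0.3, 0.1, 0.9, 0.5]
targets = ranks(sequence)                      # per-element training targets
rebuilt = sort_with_ranks(sequence, targets)   # what a perfect prediction yields
scaled_targets = ranks([30, 10, 90, 50])       # same ordering, different scale
print(targets, rebuilt, scaled_targets)
```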

Multiple BERT binary classifications on a single graph to save on inference time

I have five classes and I want to compare four of them against one and the same class. This isn't a One vs Rest classifier, as for each output I want to score them against one base class.
The four outputs should be: base class vs classA, base class vs classB, etc.
I could do this by having multiple binary classification tasks, but that's wasting computation time if the first layers are BERT preprocessing + pretrained BERT layers, and the only differences between the four classifiers are the last few layers of BERT (finetuned ones) and the Dense layer.
So why not merge the graphs for more performance?
My inputs are four different datasets, each annotated with true/false for each class.
As I understand it, I can re-use most of the pipeline (BERT preprocessing and the first layers of BERT), as those have shared weights. I should then be able to train the last few layers of BERT and the Dense layer on top differently depending on the branch of the classifier (maybe using something like keras.switch?).
I have tried many alternative options, including multi-class and multi-label classifiers, with actual and generated (e.g., machine-annotated) labels in the case of multiple input labels, and different activation and loss functions, but none of the results were acceptable to me (none were as good as the four separate models).
Is there a solution for merging the four different models for more performance, or am I stuck with using 4x binary classifiers?
When you train a DNN for a specific task, it will (in the vast majority of cases) be better than a more general model that handles several tasks simultaneously. That said, in my experience a properly trained general model produces results very similar to the original binary ones. Anyway, here are a couple of suggestions for training strategies (assuming your training datasets for each task are completely different):
Weak supervision approach
Train your binary classifiers and label your datasets with them (e.g. use the binary classifier trained on dataset 2 to label datasets [1,3,4]). Then train your joint model as a multilabel task using all the newly labeled datasets (don't forget to shuffle the samples before feeding them to the trainer ;) ). Here you will need to experiment with whether to apply a threshold and set each label to 0/1, or to use the scores of the binary classifiers directly.
Create a custom loss function that does not penalize the model when no information is provided for a certain class. So when you introduce a sample from (say) dataset 2, the loss is calculated only for the 2nd class.
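A minimal sketch of such a loss, written in plain Python rather than as a Keras loss so the arithmetic is easy to follow. The `MISSING` marker and the function name are assumptions made for illustration, and the inputs are taken to be probabilities (after the sigmoid).

```python
import math

MISSING = -1  # hypothetical marker for "no label available for this class"


def masked_bce(labels, probabilities, eps=1e-7):
    """Binary cross-entropy averaged over the labelled classes only."""
    total, known = 0.0, 0
    for label, p in zip(labels, probabilities):
        if label == MISSING:
            continue  # unlabelled class: contributes nothing to the loss
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
        known += 1
    return total / known if known else 0.0


# A sample from dataset 2: only the second of four outputs carries a label.
labels = [MISSING, 1, MISSING, MISSING]
loss_a = masked_bce(labels, [0.9, 0.8, 0.1, 0.5])
loss_b = masked_bce(labels, [0.1, 0.8, 0.9, 0.0])
print(loss_a, loss_b)
```

Both calls give the same value because the predictions for the three unlabelled classes are ignored; only the 0.8 assigned to the labelled class matters.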
Of course you can apply both simultaneously. For example, if you know that the binary classifier produces polarized scores (most results are near 0 or 1), you can use weak labels and automatically label your data with scores. Then, during the second stage, scale the loss for a sample with score x by x' = 4(x-0.5)^2 (note that you get logits from the model, so you will need to apply the sigmoid function first). This way you increase the contribution of the samples the binary classifier is confident about and reduce that of less certain ones.
As for releasing the last layers of BERT, unfreezing the upper 3-6 layers is usually enough. Releasing more layers improves results very little and increases time and memory requirements.
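The weighting x' = 4(x-0.5)^2 can be tabulated quickly. This sketch assumes x is the binary classifier's score after the sigmoid, i.e. a value in [0, 1]; the function name is hypothetical.

```python
def confidence_weight(score):
    """Map a score in [0, 1] to a weight: 0 at 0.5, 1 at 0 and at 1."""
    return 4.0 * (score - 0.5) ** 2


weights = {score: confidence_weight(score) for score in (0.0, 0.25, 0.5, 0.75, 1.0)}
print(weights)
```

A score of 0.5 (the classifier has no opinion) gets weight 0, while scores of 0 or 1 get weight 1, so uncertain weak labels barely influence the joint model.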

How do you decide on the dimensions for the activation layer in tensorflow

The tensorflow hub docs have this example code for text classification:
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras

hub_layer = hub.KerasLayer("", output_shape=[50],  # model URL is missing in the post
                           input_shape=[], dtype=tf.string)
model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
I don't understand how we decide whether 16 is the right magic number for the relu layer. Can someone explain this, please?
The choice of 16 units in the hidden layer is not a uniquely determined magic value. As Shubham commented, it's all about experimenting and finding values that work well for your problem. Here is some folklore to guide your experimentation:
The usual range for the number of units in hidden layers is tens to thousands.
Powers of two may utilize specific hardware (like GPUs) more effectively.
Simple feed-forward networks like the one above often decrease the number of units between successive layers. A commonly cited intuition is to progress from many basic features to fewer, more abstract ones. (Hidden layers tend to produce dense representations like embeddings, not discrete features, but the reasoning applies analogously to the dimension of the feature space.)
The code snippet above does not show regularization. When testing whether more hidden units help, watch out for the gap between training and validation quality. A widening gap may indicate the need to regularize more.
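One concrete thing the hidden-layer width controls is the number of trainable parameters in the classifier head. The sketch below counts them for a few candidate widths, taking the embedding size of 50 from output_shape=[50] in the snippet above. The function name is hypothetical, and the embedding layer's own parameters are not included.

```python
def mlp_params(embedding_dim, hidden_units):
    """Trainable parameters of Dense(hidden_units) followed by Dense(1)."""
    hidden = embedding_dim * hidden_units + hidden_units  # weights + biases
    output = hidden_units * 1 + 1
    return hidden + output


for units in (8, 16, 32, 64, 128):
    print(units, mlp_params(50, units))
```

The count grows linearly with the width, which is one reason the training/validation gap deserves attention as the layer gets wider.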

Binary classification of every time series step based on past and future values

I'm currently facing a Machine Learning problem and I've reached a point where I need some help to proceed.
I have various time series of positional (x, y, z) data tracked by sensors. I've developed some more features. For example, I rasterized the whole 3D space and calculated a cell_x, cell_y and cell_z for every time step. The time series themselves have variable lengths.
My goal is to build a model which classifies every time step with the labels 0 or 1 (binary classification based on past and future values). Therefore I have a lot of training time series where the labels are already set.
One thing which could be very problematic is that very few samples in the data are labeled 1 (for example, only 3 of 800 samples).
It would be great if someone can help me in the right direction because there are too many possible problems:
Wrong hyperparameters
Incorrect model
Too few labels of 1, but I think that's not a big problem because I only need the model to suggest the right time steps. So I would only use the peaks of the output.
Bad or too little training data
Bad features
I appreciate any help and tips.
Your model seems very strange. Why use only 2 units in the LSTM layer? Also, your problem is a binary classification. In this case you should use only one neuron in your output layer (try inserting one additional dense layer between the LSTM layer and the output layer, and try dropout layers between them).
Binary crossentropy does not make much sense with 2 output neurons unless you have a multi-label problem. But if you switch to one output neuron, it's the right loss. You also need sigmoid as the activation function then.
As a last piece of advice: try class weights.
This can make a huge difference if your labels are imbalanced.
You can create the model using the TensorFlow BasicLSTMCell; the shape of your data fits it. You can find the documentation for BasicLSTMCell here, and this documentation contains code that will help you build a BasicLSTMCell model. Hope this helps, cheers.
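As a rough illustration of what class weights could look like for the imbalance quoted in the question (3 positive steps out of 800), here is a minimal sketch using the common "balanced" heuristic, n_samples / (n_classes * class_count). The function name is hypothetical and the heuristic is one choice among several.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# The imbalance quoted in the question: 3 positive steps out of 800.
labels = [1] * 3 + [0] * 797
class_weight = balanced_class_weights(labels)
print(class_weight)
```

A dictionary of this form is what the `class_weight` argument of Keras's `fit` expects. For a model that emits one label per time step, per-step sample weights may be needed instead, so treat this as a starting point.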

Pattern recognition on sphere (HEALPY based)

I am using Tensorflow and Keras. Is it possible to achieve proper pattern recognition for images on the surface of a sphere? I am using the Healpy framework to create the skymaps on which the pattern recognition should work. The problem is that these Healpy skymaps are one-dimensional numpy arrays, so a compact sub-pattern may be scattered across this 1d array. This is actually pretty hard to learn for a basic machine learning algorithm (I am thinking about a deep convolutional network).
A specific task in this context would be counting blobs on the surface of a sphere (see attached image). For this particular task the correct number would be 8. So I created 10000 skymaps (Healpy settings: nside=16, corresponding to npix=3072), each with a random number of blobs between 0 and 9 (thus 10 possibilities). I tried to solve this with the 1d Healpy array and a simple feed-forward network:
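from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential()
model.add(Dense(npix, input_dim=npix, kernel_initializer='random_uniform', activation='relu'))
model.add(Dense(10, kernel_initializer='random_uniform', activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# 'skymaps' is a placeholder name: the first argument of fit() was lost in the post
model.fit(skymaps, number_of_correct_sources, batch_size=100, epochs=10, validation_split=1.-train)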
However, after training with 10,000 skymaps, the test set yielded an accuracy of only 38%. I guess that this will increase significantly when providing the real arrangement of the Healpy cells (as it appears on the sphere) instead of the 1d array only. In that case one could use a convolutional network (Convolution2D) and proceed as for ordinary image recognition. Any ideas on how to map the Healpy cells properly onto a 2d array, or on how to use a convolutional network directly on the sphere?
This is a hard way of tackling a relatively simple problem that is unashamedly 2-D!
If the objects you are looking for are as prominent as those in your figure, create the 2-D map for the data and then threshold it at a series of threshold levels: the highest thresholds pick out the brightest objects. Any continuous projection like Aitoff or Hammer will do, and to eliminate the edge problems, use rotations of the projection. Segmented projections, like Healpix, are good for data storage, but not necessarily ideal for data analysis.
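The thresholding idea can be sketched on a toy 2-D grid as follows. This is only an illustration under simplifying assumptions: the map is already projected onto a flat array, blobs are groups of 4-connected pixels above the threshold, and the edge handling via rotated projections mentioned above is not implemented. The function name and the toy map are hypothetical.

```python
def count_blobs(image, threshold):
    """Count 4-connected groups of pixels whose value exceeds threshold."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or seen[r][c]:
                continue
            blobs += 1
            stack = [(r, c)]
            seen[r][c] = True
            while stack:  # flood fill the whole blob
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return blobs


toy_map = [
    [0, 9, 9, 0, 0, 0],
    [0, 9, 0, 0, 5, 0],
    [0, 0, 0, 0, 5, 0],
    [3, 0, 0, 0, 0, 0],
]
counts = {t: count_blobs(toy_map, t) for t in (1, 4, 6)}
print(counts)
```

Raising the threshold leaves only the brightest blob, which mirrors the point that the highest thresholds pick out the brightest objects.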
If the map has poor signal to noise so that you are looking for objects in the murk of the noise, then some sophistication is required, maybe even some neural net algorithm. However, you might take a look at the Planck data analysis on Sunyaev-Zeldovich galaxy clusters, the earliest of which is perhaps (Paper VIII). The subsequent papers refine and add to this.
(This should have been a comment but I lack the rep.)