H2O asfactor() for new data

I am using hf[[x1, x2]] = hf[[x1, x2]].asfactor() to transform X1 and X2 into categorical variables and then train a classification model with automl(). Now, for new and unseen data, how should I convert the data? If I simply apply the same method, is there any guarantee that it will be transformed in the same way as during the training phase?
In scikit-learn you would save the fitted transformer and use it to transform both the training set and the new data, but here I have no idea what to do.

It is safe to convert the new data to categorical with .asfactor(). Levels that were already seen during training are encoded consistently, so the new transformed data is treated exactly as it would have been if it had appeared during training.
If a level that was never seen during training shows up at prediction time, it is treated as unseen data and follows the majority direction.
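A minimal sketch of that workflow with the h2o Python API (the file names, the column names "x1"/"x2", and the target column "target" are placeholders for your own data):

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# placeholder file names
train = h2o.import_file("train.csv")
new_data = h2o.import_file("new_data.csv")

# apply the same conversion to both frames; levels seen during training
# are encoded consistently at prediction time
train[["x1", "x2"]] = train[["x1", "x2"]].asfactor()
new_data[["x1", "x2"]] = new_data[["x1", "x2"]].asfactor()

aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="target", training_frame=train)

preds = aml.leader.predict(new_data)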

Related

How to feed normalized new data to saved trained neural network model and then inverse the result?

I am working on research into population by country, based on this data set:
https://www.kaggle.com/tanuprabhu/population-by-country-2020
I learned that it's best practice to normalize the dataset before training, so I normalized the data using sklearn.preprocessing.MinMaxScaler. I proceeded to train the model on the normalized dataset before saving the model.
Next, I wanted to perform predictions on new data. So I created an input file with a similar format to the training dataset. The new input data has only 2 rows (versus the training dataset which has 200 rows).
The problem I encounter is that, due to the small number of rows in the new dataset, the MinMaxScaler returned only 1 and 0: 1 for the bigger number and 0 for the smaller number. When I fed this input into the model, it gave me a prediction that was far off from the expected value.
I have also tried applying the MinMaxScaler to the new data, feeding it into the model, and then inverting the result. Still, I got a value far from the expected one.
I have also tried training the model without applying the MinMaxScaler. I got a better result with this model, but the prediction only responds well when I change certain columns with bigger values. The columns with smaller values have little effect, while in the real world I know these factors are quite significant to the predicted result.
Where did I go wrong?
Any sample code on handling the input for the trained model is much appreciated.
To test what is going on, I suggest you take a row of your training data prior to scaling. Apply the scaler and then use the result as the input for a prediction. You should get the same predicted result as for that row of the training data. When you apply the scaler, check whether it generates the same values as are present in the training data for that row. Make sure you are using the scaler that was fit to the training set. Do not fit the scaler to the new data; only use it to transform the data.
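A minimal sketch of that workflow with scikit-learn. The arrays are toy stand-ins for the 200-row training set and the 2-row new input, and a linear model stands in for the saved neural network; the point is that the scaler is fit once on training data and only ever used to transform afterwards:

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# toy stand-ins for the real training set and the 2-row new input
X_train = np.array([[1000.0, 5.0], [50000.0, 80.0], [200000.0, 300.0]])
y_train = np.array([10.0, 400.0, 1500.0])
X_new = np.array([[1200.0, 6.0], [180000.0, 250.0]])

# fit the scaler on the training data only, then persist it alongside the model
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
model = LinearRegression().fit(X_train_scaled, y_train)
joblib.dump(scaler, "scaler.pkl")

# later, for the new input: load the fitted scaler and transform only, never fit again
scaler = joblib.load("scaler.pkl")
X_new_scaled = scaler.transform(X_new)
y_pred = model.predict(X_new_scaled)

# if the target was also scaled, invert it with the same fitted target scaler
# (e.g. y_scaler.inverse_transform(...)), not one fit on the new data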

How does Keras predict() work with datasets

I am new to using tf datasets with Keras. Since you just hand over one object, I don't understand what actually happens. If I hand a dataset to model.predict(), how does it know how and which elements to use from this object? A dataset can be a complex structure containing many kinds of structures and levels, so what happens if I pass a dataset which has more "columns" than the dataset the model was trained on? Are the structure, names, or levels somehow saved during training so they are remembered when making predictions?
If tf.keras.Model.fit() receives a tf.data.Dataset as input, it assumes that the dataset returns tuples of either (inputs, targets) or (inputs, targets, sample_weights). The inputs part itself may be a complex structure of sub-inputs (like a tuple of image and label for a conditional VAE, for instance).
If the dataset does not match your model's inputs, fit() will simply fail.
See the comment on the fit() function in the TF source code.
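A small, self-contained illustration of that contract, using toy data and an arbitrary two-layer model just to show how the (inputs, targets) structure is consumed:

import numpy as np
import tensorflow as tf

# toy dataset yielding batched (inputs, targets) tuples
x = np.random.rand(100, 4).astype("float32")
y = np.random.randint(0, 2, size=(100, 1))
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# fit() unpacks each batch into (inputs, targets)
model.fit(ds, epochs=1)

# predict() only uses the inputs part of the structure
preds = model.predict(tf.data.Dataset.from_tensor_slices(x).batch(32))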

Best way to evaluate performance with tf.data.Dataset

I trained a model and now want to evaluate its performance on a test set. The test set is loaded as a tf.data.TFRecordDataset object (from multiple TFRecords with multiple examples in each of them), which consists of roughly a million examples in the form of (image, label) tuples; the data are batched. The raw labels are then mapped to the target integers (one-hot encoded) that the model needs to predict.
I understand that I can pass the Dataset object as input to model.predict(), which will output predictions for each example in the dataset. However, to compute metrics I need to compare the true target values to the predicted ones, and to obtain the former I need to iterate through the Dataset, because all the true labels are stored in there.
This seems like a common task, but I couldn't find a straightforward solution that works for a huge dataset in TFRecord format. What would be the best way to compute, for instance, AUC per class in this case? Should I use callbacks with model.predict(test_dataset)? Or should I process each example one by one in a loop, save the true and predicted values into arrays, and then use, for example, sklearn.metrics.roc_auc_score() to compute AUC scores for the two arrays? Or maybe I'm missing some obvious way to do it?
Thanks in advance!
If you need all labels, why not just:
model.evaluate(test_dataset.take(-1))
or, if your dataset is too large for this, just iterate over the dataset, calculate your metric per batch, and take the mean at the end.
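A sketch of the iterate-and-collect approach for the per-class AUC case. It assumes test_dataset and model are the objects from the question, i.e. the dataset yields batched (image, one-hot label) tuples and the model outputs per-class probabilities:

import numpy as np
from sklearn.metrics import roc_auc_score

# iterate the batched dataset once, collecting true labels and predictions
y_true, y_score = [], []
for images, labels in test_dataset:
    y_true.append(labels.numpy())
    y_score.append(model.predict_on_batch(images))

y_true = np.concatenate(y_true)
y_score = np.concatenate(y_score)

# AUC per class for one-hot encoded labels
auc_per_class = [roc_auc_score(y_true[:, c], y_score[:, c]) for c in range(y_true.shape[1])]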

Subsection of grid as input to cnn

I have two huge grids (input and output) representing some spatial data of the same area. I want to be able to generate the output pixel-by-pixel by feeding a neural network a small part of the input grid, around the pixel of interest.
The naive way of training and evaluating the CNN would be to extract the sections separately and give those to the fit() function. But if the sub-grid the CNN operates on is, e.g., a 256×256 area of the input, then I would copy each data point 65536 (!!!) times per epoch.
So is there any way to have Keras just use subsections of a bigger data structure for training?
To me, this sounds a bit like training RNNs on sequential sections of a data series, instead of copying each section separately.
The performance consideration mainly concerns evaluating the model: I want to use this model to generate the output grid of a huge geographical area (Denmark) at a resolution of 12.5 cm.
It seems to me that you are looking for a fully convolutional network (FCN).
By using only layers that scale in size with their inputs (specifically banishing dense layers), an FCN is able to produce an output whose spatial extent grows proportionally with that of the input; typically the output has the same resolution as the input, as in your case.
If your inputs are very large, you can still train an FCN on subimages. Then for inference, you can
run the network on your entire image: indeed, inputs are sometimes too big to be batched together during training, but can be fed on their own for inference;
or split your input into subimages and tile the results back together. In that case, I would probably use overlapping tiles to avoid potential border effects.
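A minimal sketch of what such an FCN could look like in Keras (layer count and widths are arbitrary; the point is that only size-preserving convolutions are used, so the spatial dimensions can stay None):

import tensorflow as tf

def build_fcn(channels_in, channels_out):
    # no Dense/Flatten layers, so the model accepts any spatial size
    inputs = tf.keras.Input(shape=(None, None, channels_in))
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    outputs = tf.keras.layers.Conv2D(channels_out, 1, padding="same")(x)
    return tf.keras.Model(inputs, outputs)

model = build_fcn(channels_in=1, channels_out=1)
model.compile(optimizer="adam", loss="mse")
# train on 256x256 crops, then call predict on larger tiles or on the full grid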
You can probably do well with a Sequence generator.
You will still have to create slices for each batch, but taking slices isn't slow at all compared with the CNN operations.
And by using a keras.utils.Sequence, the generation of the batches runs in parallel with the model's execution, so there is no penalty:
import numpy as np
from tensorflow import keras

class GridGenerator(keras.utils.Sequence):
    def __init__(self, originalGrid_maybeFileName, outputGrid, subGridSize):
        self.originalGrid = originalGrid_maybeFileName
        self.outputGrid = outputGrid
        self.subGridSize = subGridSize
        # naive implementation: assumes square grids whose side is a multiple of subGridSize
        self.divs = self.originalGrid.shape[0] // self.subGridSize

    def __len__(self):
        return self.divs * self.divs

    def __getitem__(self, i):
        row, column = divmod(i, self.divs)
        size = self.subGridSize
        # using channels_last: grids have shape (height, width, channels)
        x = self.originalGrid[row*size:(row+1)*size, column*size:(column+1)*size]
        y = self.outputGrid[row*size:(row+1)*size, column*size:(column+1)*size]
        # add the batch dimension expected by Keras
        return x[np.newaxis], y[np.newaxis]
If the full grid doesn't fit in your PC's memory, then you should find a way of loading parts of the grid at a time (use the generator to load those parts).
Create the generator and train with fit_generator (in recent versions of Keras you can pass the Sequence directly to model.fit()):
generator = GridGenerator(xGrid, yGrid, subSize)
#you can create additional generators to take a part of that as training and another part as validation
model.fit_generator(generator, len(generator), ...., workers = 4)
The workers argument determines how many batches will be loaded in parallel before being sent to the model.
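For the inference side (generating the output grid for the whole area), a rough sketch of the non-overlapping tile-and-stitch idea mentioned earlier; predict_full_grid is a hypothetical helper, and it assumes the grid dimensions are multiples of the tile size and that the model outputs one channel at input resolution:

import numpy as np

def predict_full_grid(model, grid, tile):
    # grid: (height, width, channels); predict tile by tile and stitch the result
    H, W = grid.shape[:2]
    out = np.zeros((H, W, 1), dtype=np.float32)
    for r in range(0, H, tile):
        for c in range(0, W, tile):
            patch = grid[r:r + tile, c:c + tile][np.newaxis]  # add batch dimension
            out[r:r + tile, c:c + tile] = model.predict(patch)[0]
    return out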

Vector representation in multidimensional time-series prediction in TensorFlow

I have a large data set (~30 million data points with 5 features) that I have reduced using K-means down to 200,000 clusters. The data is a time series with ~150,000 time-steps. The data on which I would like to train the model is the presence of particular clusters at each time-step. The purpose of the predictive model is to generate a generalized sequence, similar to generating syntactically correct sentences from a model trained on word sequences. The easiest way to think about this data is that I'm trying to predict the pixels in the next video frame from the pixels in the current video frame, in order to generate a new sequence of frames that approximates the original sequence.
The raw and sparse representation at each time-step would be 200,000 binary values representing which clusters are present or not at that time step. Note, no more than 200 clusters may be present in any one time-step and thus this representation is extremely sparse.
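For concreteness, one time-step of the sparse representation described above could be held as a tf.sparse.SparseTensor rather than a dense 200,000-wide vector (the cluster ids below are made up for illustration):

import tensorflow as tf

NUM_CLUSTERS = 200_000

# hypothetical cluster ids present at one time-step (at most ~200 of them)
active_clusters = [3, 17, 42, 199_998]

step = tf.sparse.SparseTensor(
    indices=[[i] for i in active_clusters],
    values=tf.ones(len(active_clusters), dtype=tf.float32),
    dense_shape=[NUM_CLUSTERS],
)
dense_step = tf.sparse.to_dense(step)  # the raw 200,000-wide multi-hot vector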
What is the best representation to convert this sparse vector to a dense vector that would be more suitable to time-series prediction using Tensorflow?
I initially had in mind an RNN / LSTM trained on the vectors at each time-step, but due to the size of the training vector I'm now wondering if a convolutional approach would be more suitable.
Note, I have not actually used TensorFlow beyond some simple tutorials, but I have previously used OpenCV ML functions. Please consider me a novice in your responses.
Thank you.