multi-label supervised classification of text data - pandas

I am solving a machine learning problem using Python. My knowledge of machine learning is limited. The problem comes with a training dataset, which includes text samples and labels for those text samples. All possible label values are given, so this is a supervised problem. Some text samples have an empty set of labels. Now I have to build a model that finds the labels for given text data.
What I have done so far: I created a pandas dataframe from the training data. The dataframe has the columns [text_data, label1, label2, label3, ..., labeln], where the values of the label columns are either 0 or 1. Then I cleaned and tokenized text_data, removed stop words from the tokens, and stemmed the tokens using PorterStemmer. I split the dataframe into training and validation data with an 80:20 ratio, and now I am trying to build a model that predicts the validation data's labels from the training data. But I am very confused about how to build the model. I tried a few things like a Naive Bayes classifier, but it didn't work, or maybe I made some mistake. Any idea how I should proceed now?
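For concreteness, here is a minimal sketch of one common baseline for this setup: TF-IDF features with a one-vs-rest logistic regression, one binary classifier per label. It assumes the dataframe df described above; the column names and parameters are illustrative, not a definitive recipe.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

# All columns except the text column are assumed to be 0/1 label columns.
label_cols = [c for c in df.columns if c != "text_data"]

X_train, X_val, y_train, y_val = train_test_split(
    df["text_data"], df[label_cols], test_size=0.2, random_state=42)

# Fit the vectorizer on the training text only, then reuse it on the validation text.
vectorizer = TfidfVectorizer(max_features=20000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# One-vs-rest wraps a binary classifier so it accepts a 0/1 label matrix.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train_vec, y_train)

y_pred = clf.predict(X_val_vec)
print("micro F1:", f1_score(y_val, y_pred, average="micro"))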

Related

How to feed normalized new data to saved trained neural network model and then inverse the result?

I am working on research into population by country, based on this data set:
https://www.kaggle.com/tanuprabhu/population-by-country-2020
I learned that it's best practice to normalize the dataset before training, so I normalized the data using sklearn.preprocessing.MinMaxScaler. I then trained the model on the normalized dataset before saving it.
Next, I wanted to perform predictions on new data. So I created an input file with a similar format to the training dataset. The new input data has only 2 rows (versus the training dataset which has 200 rows).
The problem I encountered is that, due to the small number of rows in the new dataset, MinMaxScaler returned 1 and 0: 1 for the bigger number and 0 for the smaller number. When I fed this input into the model, it gave me a prediction that was far off from the expected value.
I have also tried applying MinMaxScaler to the new data, feeding it into the model, and then inverting the result. Still, I got a value that was too far from the expected value.
I have also tried training the model without applying MinMaxScaler. I got a better result with this model, but the predictions only respond well when I change certain columns with bigger values. The columns with smaller values don't get a very good response, while in the real world I know these factors are quite significant to the predicted result.
Where did I go wrong?
Any sample code on handling the input for the trained model is much appreciated.
To test what is going on, I suggest that you take a row of your training data prior to scaling it. Apply the scaler and then use the result as the data for a prediction. You should get the same predicted result as the training data's result value. When you apply the scaler, check whether it generates the same values as are present in the training data for that row. Make sure you are using the scaler that was fit to the training set. Do not fit the scaler to the new data; just use it to transform the data.
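A minimal sketch of that workflow, assuming a generic trained model and feature arrays X_train and X_new; the key point is that the scaler is fit once on the training data and only transform() is ever called on new data:

import joblib
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
# ... train the model on X_train_scaled and save it here ...
joblib.dump(scaler, "scaler.joblib")            # persist the fitted scaler

# Later, at prediction time:
scaler = joblib.load("scaler.joblib")
X_new_scaled = scaler.transform(X_new)          # transform only, never re-fit
predictions = model.predict(X_new_scaled)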

How to correctly split MNIST dataset into training and validation set?

I have really simple code that takes the training data from MNIST, chooses the last 10,000 examples as the validation set, and then deletes the last 10,000 examples from the training set.
import tensorflow as tf

# load_data() returns two (images, labels) tuples: train and test
(X_train, Y_train), (X_test, Y_test) = tf.keras.datasets.mnist.load_data()
X_valid = X_train[-10000:]
Y_valid = Y_train[-10000:]
X_train = X_train[:-10000]  # keep the remaining 50,000 training examples
Y_train = Y_train[:-10000]
However, this is very dumb in my opinion, and I would like to make the data-splitting procedure more sophisticated in the following ways:
I should be able to specify what percentage of the data I want as the validation set, instead of just taking the last however many samples.
I need a way to make sure that the data is balanced after I partition it into training and validation sets. Grabbing a contiguous portion could leave very few training examples for some digits.
Surprisingly, I went through almost every TensorFlow tutorial and none of them does any validation (except for https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch, which uses the same dumb data-splitting methodology as above). Most examples just directly split the data into train and test, which we almost never do in real life.
Could someone please advise?
The keras.datasets.mnist dataset loads the dataset by Yann LeCun (refer to the docs).
The dataset is set up in such a way that it contains 60,000 training examples and 10,000 testing examples.
Since load_data() just returns NumPy arrays, you can easily concatenate the train and test arrays into a single array, after which you can play with the new array as you like.
If you want a validation set out of the training set, then you can shuffle the training set first and then extract the validation set.
All these operations are simple NumPy array operations and don't even require any TensorFlow functionality.
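For illustration, a minimal sketch using scikit-learn's train_test_split, which covers both requirements at once: test_size sets the validation fraction and stratify keeps the digit distribution balanced (sklearn is an extra dependency the pure-NumPy approach above avoids):

import tensorflow as tf
from sklearn.model_selection import train_test_split

(X_train, Y_train), (X_test, Y_test) = tf.keras.datasets.mnist.load_data()

X_train, X_valid, Y_train, Y_valid = train_test_split(
    X_train, Y_train,
    test_size=0.2,     # fraction of the data reserved for validation
    stratify=Y_train,  # keep the same digit proportions in both sets
    shuffle=True,
    random_state=42)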

Best way to evaluate performance with tf.data.Dataset

I trained a model and now want to evaluate its performance on a test set. The test set is loaded as a tf.data.TFRecordDataset object (from multiple TFRecords with multiple examples in each of them), which consists of around a million examples in the form of (image, label) tuples; the data are batched. The raw labels are then mapped to the target integers (one-hot encoded) that the model needs to predict.
I understand that I can pass the Dataset object as input to model.predict(), which will output predictions for each example in the dataset. However, to compute some metrics I need to compare the true target values to the predicted ones, and to obtain the former I need to iterate through the Dataset, because all the true labels are stored there.
This seems like a common task, but I couldn't find a straightforward solution that works for a huge dataset in TFRecord format. What would be the best way to compute, for instance, AUC per class in this case? Should I use Callbacks with model.predict(test_dataset)? Or should I process each example one by one in a loop, save the true and predicted values into arrays, and then use, for example, sklearn.metrics.roc_auc_score() to compute AUC scores for the two arrays? Or maybe I'm missing some obvious way to do it?
Thanks in advance!
If you need all labels, why not just:
model.evaluate(test_dataset.take(-1))
Or, if your dataset is too large for this, just iterate over your dataset, calculate your metric per batch, and take the mean at the end.
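For the second option, a minimal sketch that walks the batched dataset once, collects true and predicted labels, and hands both arrays to scikit-learn (test_dataset and the trained model are assumed from the question):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true, y_score = [], []
for images, labels in test_dataset:  # iterates batch by batch
    y_true.append(labels.numpy())
    y_score.append(model.predict_on_batch(images))

y_true = np.concatenate(y_true)    # shape: (n_examples, n_classes)
y_score = np.concatenate(y_score)

# One AUC per class on the one-hot encoded targets.
print(roc_auc_score(y_true, y_score, average=None))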

Binary classification of every time series step based on past and future values

I'm currently facing a Machine Learning problem and I've reached a point where I need some help to proceed.
I have various time series of positional (x, y, z) data tracked by sensors. I've also derived some additional features; for example, I rasterized the whole 3D space and calculated cell_x, cell_y, and cell_z for every time step. The time series themselves have variable lengths.
My goal is to build a model which classifies every time step with the label 0 or 1 (binary classification based on past and future values). For this I have a lot of training time series where the labels are already set.
One thing which could be very problematic is that there are very few 1 labels in the data (for example, only 3 of 800 samples are labeled with 1).
It would be great if someone could point me in the right direction, because there are too many possible problems:
Wrong hyperparameters
Incorrect model
Too few 1 labels, but I think that's not a big problem, because I only need the model to suggest the right time steps. So I would only use the peaks of the output.
Bad or too little training data
Bad features
I appreciate any help and tips.
Your model seems very strange. Why use only 2 units in the LSTM layer? Also, your problem is binary classification. In this case you should use only one neuron in your output layer (try inserting an additional dense layer between the LSTM layer and the output layer, and try dropout layers between them).
Binary crossentropy does not make much sense with 2 output neurons if you don't have a multi-label problem. But if you switch to one output neuron, it is the right choice. You then also need sigmoid as the activation function.
As a last piece of advice: try class weights.
http://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
This can make a huge difference if your labels are unbalanced.
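Putting the advice together, a minimal sketch with a wider LSTM, dropout, a single sigmoid output neuron, and balanced class weights; all layer sizes are illustrative, and X_train is assumed to be shaped (samples, timesteps, features) with binary 0/1 labels in y_train:

import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # one neuron for binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Weight the rare positive class more heavily (e.g. 3 positives in 800 samples).
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
model.fit(X_train, y_train, epochs=20,
          class_weight=dict(zip(classes, weights)))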
You can create the model using TensorFlow's BasicLSTMCell; the shape of your data fits BasicLSTMCell. You can find the documentation for BasicLSTMCell here, and for creating the model this documentation contains code that will help you build a BasicLSTMCell model. I hope this helps, cheers.

TensorFlow: Convolution Neural Network with non-image input

I am interested in using TensorFlow to train my data for binary classification based on a CNN.
Now I am wondering how to set the filter values and the number of output nodes in the convolution process.
I have read many tutorials and examples. However, most of them use image data, and I cannot compare them with my data, which is customer data, not pixels.
So could you advise me on this issue?
If your data varies in time or space, then you can use a CNN. I am currently working with an EEG data set, which varies in time. You can also refer to this paper:
http://www.nlpr.ia.ac.cn/english/irds/People/lwang/M-MCG_EN/Publications/2015/YD2015ACPR.pdf
where the input data (which is not an image) is presented as an image to the CNN.
You have to reshape the data to be 4D. In this example, I have only 4 columns.
import numpy as np

# Reshape each row of 4 features into a 2x2 "image" with a single channel.
x_train = np.reshape(x_train, (x_train.shape[0], 2, 2, 1))
x_test = np.reshape(x_test, (x_test.shape[0], 2, 2, 1))
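To illustrate, a minimal sketch of a small CNN consuming data reshaped this way; the layer sizes are illustrative and untuned, and x_train/y_train and x_test/y_test are assumed to carry binary 0/1 labels:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, kernel_size=2, activation="relu",
                           input_shape=(2, 2, 1)),   # matches the reshape above
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))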
This is a good example of using non-image data:
https://github.com/fengjiqiang/LSTM-Wind-Speed-Forecasting
You just need to change the following:
prediction_cols
feature_cols
features
dataload
This tutorial is for text: Here!
You might use one of the following classes:
class Dataset: Represents a potentially large set of elements.
class FixedLengthRecordDataset: A Dataset of fixed-length records from one or more binary files.
class Iterator: Represents the state of iterating through a Dataset.
class TFRecordDataset: A Dataset comprising records from one or more TFRecord files.
class TextLineDataset: A Dataset comprising lines from one or more text files.
Tutorial
official documentation
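For illustration, a minimal sketch using two of the classes listed above; the file names are placeholders:

import tensorflow as tf

# A Dataset of lines from one or more text files, one string tensor per line.
text_ds = tf.data.TextLineDataset(["data.txt"])

# A Dataset of records from one or more TFRecord files.
record_ds = tf.data.TFRecordDataset(["data.tfrecord"])

for line in text_ds.take(3):  # a Dataset iterates like any Python iterable
    print(line.numpy())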