If, in TensorFlow code, the number of input samples is 5,000,000, does that mean all of these samples are used for training? How can I know the number of samples used for training and for testing separately?
You will have to choose how many samples are used for training and how many for testing. A common approach is to set aside a random 70% of the samples for training and the remaining 30% for testing. This can be done fairly simply as follows:
Let's assume you have a DataFrame of 5,000,000 samples named df. The sample() function from pandas selects a specified fraction of random rows, which can be set aside for training. The remaining 30% are then selected by index and used for testing.
import pandas as pd
train_set = df.sample(frac=0.7)
test_set = df.loc[~df.index.isin(train_set.index)]
Now you have two DataFrames, one for training (3,500,000 samples) and one for testing (1,500,000 samples).
I have two datasets, one with clean data and one with dirty data. I train a Roberta model on the clean dataset and then get predictions for the dirty dataset. Those predictions with a probability greater than 0.9 go to the clean dataset. I then retrain the Roberta model with this new dataset (clean + dirty moving to clean).
For the retraining I am using the MAE loss function (more robust to noisy labels) and I use weights to give less value to the data that passes from the dirty to the clean dataset, as follows:
loss = torch.mean(torch.abs(y_true - y_pred) * weights)
Initially I am using an arbitrary weight of 0.5 for all the dirty data that gets passed into the clean dataset. However, I would like to assign these weights in a more principled, less arbitrary way.
How can I do that?
One way to choose the weight is to base it on your confidence in the dirty data and assign the weight accordingly. For example, if you think that 90% of the dirty data is labeled correctly, then choosing 0.9 as the weight for the noisy data is a reasonable option.
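For instance, a common refinement is to use the model's own predicted probability for each pseudo-label as that sample's weight, instead of one constant for all of them. A minimal sketch (the is_pseudo_labeled mask and pseudo_label_probs tensor are illustrative names I am assuming, not from your code):
import torch

# is_pseudo_labeled: boolean mask marking samples moved from dirty to clean (assumed to exist)
# pseudo_label_probs: the model's predicted probability for each assigned label (assumed to exist)
weights = torch.ones_like(y_true, dtype=torch.float32)
weights[is_pseudo_labeled] = pseudo_label_probs[is_pseudo_labeled]

# Same weighted MAE as before, but with per-sample, confidence-based weights.
loss = torch.mean(torch.abs(y_true - y_pred) * weights)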
Additionally, there is a whole literature on learning from noisy labels; you can check this survey for more information: https://arxiv.org/abs/2007.08199
Out of curiosity, why not use cleanlab to find the label errors and other data issues in your dataset for you directly? https://github.com/cleanlab/cleanlab
It handles most data issues for ML in a few lines of code. Some examples:
Find label issues in 1 line of code
from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues
# Option 1 - works with sklearn-compatible models - just input the data and labels ツ
label_issues_info = CleanLearning(clf=sklearn_compatible_model).find_label_issues(data, labels)
# Option 2 - works with ANY ML model - just input the model's predicted probabilities
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # out-of-sample predicted probabilities from any model
    return_indices_ranked_by='self_confidence',
)
Train a model as if the dataset did not have errors -- 3 lines of code
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning
cl = CleanLearning(clf=LogisticRegression()) # any sklearn-compatible classifier
cl.fit(train_data, labels)
# Estimate the predictions you would have gotten if you trained without mislabeled data.
predictions = cl.predict(test_data)
Journal of AI Research publication (with theory to prove it works): https://arxiv.org/abs/1911.00068
Label errors found using cleanlab: https://labelerrors.com/
Documentation and runnable tutorials for cleanlab: https://docs.cleanlab.ai/
I am training a Gaussian-Process model iteratively. In each iteration, a new sample is added to the training dataset (Pandas DataFrame), and the model is re-trained and evaluated. Each row of the dataset comprises 5 independent variables + the dependent variable. The training ends after 150 iterations (150 samples), but I want to extend this behaviour so the training can automatically stop after a number of iterations for which no meaningful information is added to the model.
My first approach is to compare the distribution of the last 10 samples to that of the previous 10. If the distributions are very similar, I assume that no meaningful knowledge has been added in the last 10 iterations, so I abort the training.
I thought of using Kullback-Leibler divergence, but I am not sure if this can be used for multivariate distributions. Should I use it? If so, how?
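For example, if each window of samples were modelled as a multivariate Gaussian, I believe the comparison would come down to the closed-form KL divergence between two Gaussians, roughly as in this sketch (the jitter term is only there because covariance estimates from 10 samples in 5-6 dimensions can be near-singular):
import numpy as np

def gaussian_kl(window_old, window_new, jitter=1e-6):
    # KL( N(mu0, S0) || N(mu1, S1) ) for Gaussians fitted to two sample windows.
    mu0, mu1 = window_old.mean(axis=0), window_new.mean(axis=0)
    k = window_old.shape[1]
    s0 = np.cov(window_old, rowvar=False) + jitter * np.eye(k)
    s1 = np.cov(window_new, rowvar=False) + jitter * np.eye(k)
    s1_inv = np.linalg.inv(s1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(s1_inv @ s0) + diff @ s1_inv @ diff - k
                  + np.log(np.linalg.det(s1) / np.linalg.det(s0)))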
Additionally, is there any other better/smarter way to proceed?
Thanks
I have some really simple code that takes the MNIST training data, chooses the last 10,000 examples as a validation set, and then deletes those 10,000 examples from the training set.
import tensorflow as tf

(X_train, Y_train), (X_test, Y_test) = tf.keras.datasets.mnist.load_data()

# Hold out the last 10,000 training examples as a validation set.
X_valid = X_train[-10000:]
Y_valid = Y_train[-10000:]
X_train = X_train[:-10000]
Y_train = Y_train[:-10000]
However, this is very dumb in my opinion and I would like to make the data splitting procedure more sophisticated in the following ways:
I should be able to specify what percentage of the data I want as the validation set, instead of just taking the last however-many samples.
I need a way to make sure that the data is balanced after I partition it into training and validation. Grabbing a contiguous slice could leave very few training examples for some digits.
Surprisingly, I went through almost every TensorFlow tutorial and none of them does any validation (except for https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch, which uses the same dumb data-splitting methodology as above). Most examples just directly split the data into train and test, which we almost never do in real life.
Could someone please advise?
keras.datasets.mnist loads the MNIST dataset by Yann LeCun (refer to the documentation).
The dataset is set up so that it contains 60,000 training samples and 10,000 testing samples.
Since load_data() just returns NumPy arrays, you can easily concatenate the train and test arrays into a single array, after which you can play with the new array as you like.
If you want a validation set taken out of the training set, you can shuffle the training set first and then extract the validation set.
All of these are simple NumPy array operations and don't even require any TensorFlow functionality.
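A minimal sketch of what that could look like (the 20% validation fraction and the seed are illustrative choices; sklearn's train_test_split with stratify=Y_train would be an alternative that guarantees per-class balance):
import numpy as np
import tensorflow as tf

(X_train, Y_train), (X_test, Y_test) = tf.keras.datasets.mnist.load_data()

val_frac = 0.2  # fraction of the training data to hold out for validation

# Shuffle the training set, then split off the requested fraction.
rng = np.random.default_rng(seed=0)
perm = rng.permutation(len(X_train))
X_train, Y_train = X_train[perm], Y_train[perm]

n_val = int(len(X_train) * val_frac)
X_valid, Y_valid = X_train[:n_val], Y_train[:n_val]
X_train, Y_train = X_train[n_val:], Y_train[n_val:]

# Quick balance check: digit counts should be roughly proportional in both splits.
print(np.bincount(Y_train), np.bincount(Y_valid))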
Hi, I don't understand the Keras fit_generator docs.
I hope my confusion is rational.
There is a batch_size and also the concept of training in batches. Using model.fit(), I specify a batch_size of 128.
To me this means that my dataset will be fed in 128 samples at a time, thereby greatly reducing memory usage. It should allow a 100-million-sample dataset to be trained as long as I have the time to wait. After all, Keras is only "working with" 128 samples at a time. Right?
But I highly suspect that specifying batch_size alone doesn't do what I want at all. Tons of memory is still being used. For my goals I need to train in batches of 128 examples each.
So I am guessing this is what fit_generator does. I really want to ask: why doesn't batch_size actually work as its name suggests?
More importantly, if fit_generator is needed, where do I specify the batch_size? The docs say to loop indefinitely.
A generator loops over every row once. How do I loop over 128 samples at a time, remember where I last stopped, and recall that position the next time Keras asks for a batch (which would be row 129 after the first batch is done)?
You will need to handle the batch size somehow inside the generator. Here is an example to generate random batches:
import numpy as np
data = np.arange(100)
data_lab = data%2
wholeData = np.array([data, data_lab])
wholeData = wholeData.T
def data_generator(all_data, batch_size=20):
    while True:
        # Draw a random batch of row indices at every step.
        idx = np.random.randint(len(all_data), size=batch_size)
        # Assuming the last column contains labels
        batch_x = all_data[idx, :-1]
        batch_y = all_data[idx, -1]
        # Yield a tuple of (Xs, Ys) to feed the model
        yield (batch_x, batch_y)

# Peek at a single batch; the generator is infinite, so don't exhaust it into a list.
print(next(data_generator(wholeData)))
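If you specifically want sequential batches that pick up where the previous one stopped, as described in the question, a variant could look like this (just a sketch; the wrap-around policy is an arbitrary choice):
def sequential_data_generator(all_data, batch_size=128):
    start = 0
    while True:
        # Take the next slice and remember the position for the following call.
        batch = all_data[start:start + batch_size]
        start += batch_size
        if start >= len(all_data):
            start = 0  # start over once the whole dataset has been seen
        yield (batch[:, :-1], batch[:, -1])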
First, Keras's batch_size does work very well. If you are working on a GPU, you should know that the model can be very heavy with Keras, especially if you are using recurrent cells. If you are working on a CPU, the whole program is loaded into memory, so the batch size won't have much of an impact on memory. If you are using fit(), the whole dataset is probably loaded into memory, and Keras produces batches at every step. It's very difficult to predict the amount of memory that will be used.
As for the fit_generator() method, you should build a Python generator function (using yield instead of return) that yields one batch at every step. The yield should be inside an infinite loop (we often use while True: ...).
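To make the wiring explicit, here is a rough sketch of how such a generator might be plugged in: the batch size lives inside the generator, and steps_per_epoch tells Keras how many batches make up one epoch (model, train_data and the numbers are placeholders):
batch_size = 128
train_gen = data_generator(train_data, batch_size=batch_size)  # generator from above

model.fit_generator(
    train_gen,
    steps_per_epoch=len(train_data) // batch_size,  # batches that make up one epoch
    epochs=10,
)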
Do you have some code to illustrate your problem?
I am building a simple linear regressor for data from a CSV file. The data includes the weight and height values of some people. The overall learning process is very simple:
MAX_STEPS = 2000
# ...
features = [tf.contrib.layers.real_valued_column(feature_name) for feature_name in FEATURES_COL]
# ...
linear_regressor = tf.contrib.learn.LinearRegressor(feature_columns=features)
linear_regressor.fit(input_fn=prepare_input, max_steps=MAX_STEPS)
However, the model built by the regressor is, unexpectedly, bad. The result can be illustrated with the following picture:
Visualization code (just in case):
plt.plot(height_and_weight_df_filtered[WEIGHT_COL],
         linear_regressor.predict(input_fn=prepare_full_input),
         color='blue',
         linewidth=3)
Here is the same data being given to the LinearRegression class from scikit-learn:
lr_updated = linear_model.LinearRegression()
lr_updated.fit(weight_filtered_reshaped, height_filtered)
And the visualization:
Increasing the number of steps has no effect. I assume I'm using the TensorFlow regressor in the wrong way.
IPython notebook with the code.
It looks like your TF model does indeed work and will get there with enough steps. You need to jack the step count right up though: 200K steps showed significant improvement, almost as good as the sklearn default.
I think there are two issues:
sklearn looks like it simply solves the equation using ordinary least squares. TF's LinearRegressor uses the FtrlOptimizer. The paper indicates it is a better choice for very large datasets.
The input_fn to the model is injecting the whole training set at once, for every step. This is just a hunch, but I suspect the FtrlOptimizer may do better if it sees the data in smaller batches.
Instead of increasing the number of steps by a couple of orders of magnitude, you can also jack up the learning rate on the optimizer (the default is 0.2) and get similarly good results from only 4K steps:
linear_regressor = tf.contrib.learn.LinearRegressor(
    feature_columns=features,
    optimizer=tf.train.FtrlOptimizer(learning_rate=5.0))
I met a similar problem. The solution is to check whether your input_fn supplies enough epochs. Training may not converge before iterating over the whole training data several times.
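For illustration, in the TF 1.x Estimator API the number of epochs and the batch size can be controlled through the input function, for example with numpy_input_fn; this is only a sketch, and the feature name and arrays are placeholders standing in for the weight/height data from the question:
import numpy as np
import tensorflow as tf

# weight_values / height_values stand in for the arrays read from the CSV (placeholders).
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'weight': np.asarray(weight_values, dtype=np.float32)},
    y=np.asarray(height_values, dtype=np.float32),
    batch_size=128,    # feed the optimizer in batches rather than the whole set at once
    num_epochs=None,   # repeat indefinitely; max_steps decides when training stops
    shuffle=True,
)

linear_regressor.fit(input_fn=input_fn, max_steps=MAX_STEPS)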