How does Stratified Sampling Work in Weka - data-science

I have a dataset. I want to split the dataset using Stratified Sampling. I would like 70% of data in training set and 30% in test set. So I split the dataset 10 equal subset using StratifiedRemoveFold filter in weka. Then I append 7 datasets to make 70% training dataset and append rest of 3 datasets to make 30% training dataset. However, this is not a good option. I found that, for the 1st attribute of test test one value was missing. Like, my 1st attribute has 7 values. But there were only 6 values for 1st attribute in the test set. As a result when I run the classifier on training set there was error Training set and Test set are incompatible.
I went through the link Stratified Sampling in WEKA. I found if I want to generate a 5% subsample, set the folds to 20. If this is the strategy, then for 30% test set do I need to set the numberofFold of StratifiedRemoveFold filter = 120? And also what about the test set? What should I set as numberofFolds in test set where test set is 70% of whole dataset?

You could try using the supervised Resample (weka.filters.supervised.instance.Resample) filter instead, with no replacement and a bias factor of 0 (to use the distribution of the input data). When using the invert flag, you get the remainder of the dataset.

If you really want to use StratifiedRemoveFolds, then use 10 folds, apply the filter 10 times to get all the 10 folds out and then combine 7 to make your 70% and the remainder to get your 30%.

Related

Stratified sampling into 3 sets considering unbalance

I have looked into Stratified sample in pandas, stratified sampling on ranges, among others and they don't assess my issue specifically, as I'm looking to split the data into 3 sets randomly.
I have an unbalanced dataframe of 10k rows, 10% is positive class, 90% negative class. I'm trying to figure out a way to split this dataframe into 3 datasets, as 60%, 20%, 20% of the dataframe considering the unbalance. However, this split has to be random and non-replaceable, which means if I put together the 3 datasets, it has to be equal to the original dataframe.
Usually I would use train_test_split() but it only works if you are looking to split into two, not three datasets.
Any suggestions?
Reproducible example:
df = pd.DataFrame({"target" : np.random.choice([0,0,0,0,0,0,0,0,0,1], size=10000)}, index=range(0,10000,1))
How about using train_test_split() twice?
1st time, using train_size=0.6, obtaining a 60% training set and 40% (test + valid) set.
2nd time, using train_size=0.5, obtaining a 50%*40%=20% validation and 20% test.
Is this workaround valid for you?

test and train good practice wrt summary feature

When one feature of a dataset is a summary statistic of the entire pool of data, is it good practice to include the train data in your test data in order to calculate the feature for validation?
For instance, let's say I have 1000 data points split into 800 entries of training and 200 entries for validation. I create a feature with the 800 entries for training of say rank quartile (or could be anything), which numbers 0-3 the quartile some other feature falls in. So in the training set, there will be 200 data points in each quartile.
Once you train the model and need to calculate the feature again for the validation set, a) do you use the already set quartiles barriers, ie the 200 validation entries could have a different than 50-50-50-50 quartile split, or b) do you recalculate the quartiles using all 1000 entries so there is a new feature of quartile rank, each of 250 entries each?
Thanks very much
The ideal practice would be to calculate the quartiles on the training dataset, and using those barriers on your holdout / validation dataset. To ensure that you correctly generate model diagnostics to evaluate its predictive performance, you do not want the distribution of the testing dataset to influence your model training. This is because that data will not be available in real life when you apply the model on unseen data.
I also thought that you will find this article extremely useful when thinking about train-test splitting - https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50

How datasets are structured in TensorFlow?

In my first TensorFlow project, I have a big dataset (1M elements) which contains 8 categories of elements, with each category, has a different number of elements of course. I want to split the big dataset into 10 exclusive small datasets, with each of them having approximately 1/10 of each category. (This is for 10-fold cross-validation purposes.)
Here is how I do.
I wind up having 80 datasets, with each category having 10 small datasets, then I randomly sample data from 80 of them by using sample_from_datasets. However, after some steps, I met a lot of warning saying "DirectedInterleave selected an exhausted input:36" where 36 can be some other integer numbers.
The reason I want to do sample_from_datasets is that I tried to do shuffle the original dataset. Even though shuffle only 0.4 x total elements, it still takes a long long time to finish (about 20mins).
My questions are
1. based on my case, any good advice on how to structure the datasets?
2. is it normal to have a long shuffling time? any better solution for shuffling?
3. why do I get this DirectIngerleave selected an exhausted input: warning? and what does it mean?
thank you.
Split your whole datasets into Training, Testing and Validation categories. As you have 1M data, you can split like this: 60% training, 20% testing and 20% validation. Splitting of datasets is completely up to you and your requirements. But normally maximum data is used for training the model. Next, the rest of the datasets can be used for testing and validation.
As you have ten class datasets, split each category into Training, Testing and Validation categories.
Let, you have A, B, C and D categories data. Split your data "A", "B", "C", and "D" like below:
'A'- 60 % for training 20% testing and 20% validation
'B'- 60 % for training 20% testing and 20% validation
'C'- 60 % for training 20% testing and 20% validation
'D'- 60 % for training 20% testing and 20% validation
Finally merge all the A, B, C and D training, testing and validation datasets.

xgboost rank pairwise what is the prediction input and output

I'm trying to use XGBoost to predict the rank for a set of features for a given query. I managed to train a model it but I'm confused around the input data when I ask for a prediction.
I'm trying to understand if I'm doing something wrong or this is not the right approach.
What I'm doing:
I parse the training data (see here a sample) and feed it in a DMatrix such that the first column represents the quality-of-the-match and the following columns are the scores on different properties and also send the docIds as labels
I configure the group sizes
The training seems to work fine, I get not errors, and I use the rank:pairwise objective
For prediction, I use a fake entry with fake scores (1 row, 2 columns see here) and I get back a single float value.
I'm trying to understand:
1. Do I need to feed in a label for the prediction ?
My understanding is that labels are similar to "doc ids" so at prediction time I don't see why I need them
2. Do I need to set the group size when doing predictions ? And if so, what does it represent ?
My understanding is that groups are for training data to assist ranking "per query". How does that correlate with predictions? Do I set a group size anyway?
3. How do I correlate the "group" from the training with the prediction?
How do I figure out the pair (score, group) from the result of the prediction, given I only get back a single float value - what group is that prediction for?

How should I test on a small dataset?

I use Weka to test machine learning algorithms on my dataset. I have 3800 rows and around 25 features. I am testing the combination of different features for prediction models and seem to predict lower than just the oneR algorithm does with the use of Cross-validation. Even C4.5 does not predict better, sometimes it does and sometimes it does not on basis of the features that are still able to classify.
But, on a certain moment I splitted my dataset in a testset and dataset(20/80), and testing it on the testset, the C4.5 algorithm had a far higher accuracy than my OneR algorithm had. I thought, with the small size of the dataset, it probably is just a coincidence that it predicted very well(the target was still splitted up relatively as target attributes). And therefore, its more useful to use Cross-validation on small datasets like these.
However, testing it on another testset, did give the high accuracy towards the testset using C4.5. So, my question actually is, what is the best way to test datasets when the datasets are actually pretty small?
I saw some posts where it is discussed, but I am still not sure what is the right way to do it.
It's almost always a good approach to test your model via Cross-Validation.
A rule of thumb is to use 10 fold cross validation.
In your case, 10 fold cross validation will do the following in Weka:
split your 3800 training instances into 10 sets of 380 instances
for each set (s = 1 .. 10) :
use the instances from s for testing and the other 9 sets for training a model (3420 training instances)
the result will be an average of the results obtained with the 10 models used.
Try to avoid testing your dataset using the training set option, because that could result in creating a model that works very well for you existing data but could have big problems with other new instances (overfitting).