In my first TensorFlow project, I have a big dataset (1M elements) that contains 8 categories of elements, with each category of course having a different number of elements. I want to split the big dataset into 10 exclusive small datasets, each of them having approximately 1/10 of each category. (This is for 10-fold cross-validation purposes.)
Here is what I do.
I wind up with 80 datasets: each category is split into 10 small datasets, and I then randomly sample from all 80 of them using sample_from_datasets (a simplified sketch is below). However, after some steps I get a lot of warnings saying "DirectedInterleave selected an exhausted input: 36", where 36 can be some other integer.
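Roughly, the pipeline looks like this (a simplified sketch, not my exact code; full_ds stands in for my real dataset of (features, label) pairs with integer labels 0..7):

    import tensorflow as tf

    NUM_CATEGORIES = 8
    NUM_FOLDS = 10

    shards = []
    for c in range(NUM_CATEGORIES):
        # keep only the elements of category c
        cat_ds = full_ds.filter(lambda x, y, c=c: tf.equal(y, c))
        for fold in range(NUM_FOLDS):
            # deal that category's elements round-robin into 10 exclusive shards
            shards.append(cat_ds.shard(NUM_FOLDS, fold))

    # 8 categories x 10 folds = 80 small datasets; sample from all of them
    mixed = tf.data.experimental.sample_from_datasets(shards)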
The reason I want to use sample_from_datasets is that I tried shuffling the original dataset instead. Even with a shuffle buffer of only 0.4 x the total number of elements, it still takes a very long time to finish (about 20 minutes).
My questions are:
1. Based on my case, any good advice on how to structure the datasets?
2. Is it normal to have such a long shuffling time? Is there a better way to shuffle?
3. Why do I get this "DirectedInterleave selected an exhausted input" warning, and what does it mean?
Thank you.
Split your whole dataset into Training, Testing and Validation sets. As you have 1M elements, you can split like this: 60% training, 20% testing and 20% validation. How you split is completely up to you and your requirements, but normally most of the data is used for training the model, and the rest is used for testing and validation.
As your dataset has several categories (eight in your case), split each category into Training, Testing and Validation sets.
Say you have data in categories A, B, C and D. Split each of them like below:
'A' - 60% training, 20% testing, 20% validation
'B' - 60% training, 20% testing, 20% validation
'C' - 60% training, 20% testing, 20% validation
'D' - 60% training, 20% testing, 20% validation
Finally, merge the A, B, C and D pieces back together into one training set, one testing set and one validation set, as in the sketch below.
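A minimal sketch of that idea with tf.data (names like full_ds and counts are assumptions, not your actual variables; counts[c] is the number of elements in category c):

    import functools
    import tensorflow as tf

    train_parts, test_parts, val_parts = [], [], []
    for c, n in enumerate(counts):
        # elements of category c only
        cat_ds = full_ds.filter(lambda x, y, c=c: tf.equal(y, c))
        n_train, n_test = int(0.6 * n), int(0.2 * n)
        train_parts.append(cat_ds.take(n_train))
        test_parts.append(cat_ds.skip(n_train).take(n_test))
        val_parts.append(cat_ds.skip(n_train + n_test))

    def merge(parts):
        # concatenate the per-category pieces into a single dataset
        return functools.reduce(lambda a, b: a.concatenate(b), parts)

    train_ds, test_ds, val_ds = merge(train_parts), merge(test_parts), merge(val_parts)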
I have looked into Stratified sample in pandas and stratified sampling on ranges, among others, and they don't address my issue specifically, as I'm looking to split the data into 3 sets randomly.
I have an unbalanced dataframe of 10k rows: 10% is the positive class and 90% the negative class. I'm trying to figure out a way to split this dataframe into 3 datasets of 60%, 20% and 20% of the rows while respecting the imbalance. However, the split has to be random and without replacement, meaning that if I put the 3 datasets back together, they have to equal the original dataframe.
Usually I would use train_test_split(), but it only works if you are looking to split into two datasets, not three.
Any suggestions?
Reproducible example:
df = pd.DataFrame({"target" : np.random.choice([0,0,0,0,0,0,0,0,0,1], size=10000)}, index=range(0,10000,1))
How about using train_test_split() twice?
1st time, using train_size=0.6, obtaining a 60% training set and a 40% (test + validation) remainder.
2nd time, using train_size=0.5 on that remainder, obtaining a 50% * 40% = 20% validation set and a 20% test set.
Is this workaround valid for you?
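A minimal sketch, stratifying on the target both times so the 10/90 class ratio is preserved in all three pieces:

    from sklearn.model_selection import train_test_split

    train, rest = train_test_split(df, train_size=0.6, stratify=df["target"], random_state=42)
    valid, test = train_test_split(rest, train_size=0.5, stratify=rest["target"], random_state=42)

    # the three pieces are disjoint and together contain every row of df exactly once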
When one feature of a dataset is a summary statistic computed over the entire pool of data, is it good practice to pool the training data with the validation data in order to calculate that feature for validation?
For instance, let's say I have 1000 data points split into 800 entries for training and 200 entries for validation. From the 800 training entries I create a feature, say quartile rank (it could be anything), which records as 0-3 which quartile some other feature falls in. So in the training set there will be 200 data points in each quartile.
Once you train the model and need to calculate the feature again for the validation set, do you (a) use the quartile boundaries already set on the training data, so the 200 validation entries could have a split other than 50-50-50-50 across the quartiles, or (b) recalculate the quartiles using all 1000 entries, so the quartile-rank feature is rebuilt with 250 entries in each quartile?
Thanks very much
The ideal practice is to calculate the quartiles on the training dataset and use those boundaries on your holdout / validation dataset. To generate diagnostics that correctly evaluate the model's predictive performance, you do not want the distribution of the test dataset to influence your model training, because that data will not be available in real life when you apply the model to unseen data.
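For example (a hedged sketch with made-up dataframe/column names): compute the quartile boundaries on the 800 training rows only, then reuse those fixed boundaries to bin the 200 validation rows, even if that gives an uneven split there:

    import numpy as np
    import pandas as pd

    # inner quartile boundaries estimated from the training data only
    inner_edges = np.quantile(train_df["some_feature"], [0.25, 0.5, 0.75])
    bins = np.concatenate(([-np.inf], inner_edges, [np.inf]))

    # option (a): the same fixed boundaries are applied to both sets
    train_df["quartile_rank"] = pd.cut(train_df["some_feature"], bins=bins, labels=[0, 1, 2, 3])
    valid_df["quartile_rank"] = pd.cut(valid_df["some_feature"], bins=bins, labels=[0, 1, 2, 3])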
I also think you will find this article extremely useful when thinking about train-test splitting: https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50
I'm applying an LSTM to time series forecasting with 20 lags. Suppose we have two cases: the first uses just five lags and the second (like my case) uses 20 lags. Is it correct that the second case needs more units than the first? If yes, how can we support this idea? I have 2000 samples for training the model, so that is the main limitation on increasing the number of units here.
It is very difficult to give an exact answer, as the relationship between the number of timesteps and the number of hidden units is not an exact science. For example, the following factors can affect the number of units required.
Short-term memory problems vs. long-term memory problems
If your problem can be solved with relatively little memory (i.e. it only requires remembering a few timesteps), you won't get much benefit from adding more neurons as you increase the number of steps.
The amount of data
If you don't have enough data for the model to learn from (which I suspect may be the case with 2000 data points, but I could be wrong), then increasing the number of timesteps won't help you much.
The type of model you use
Depending on the type of model you use (e.g. LSTM / GRU), you might get different results (this is not always true, but it can happen for certain problems).
I'm sure there are other factors out there, but these are a few that came to mind.
Proving that more units give better results with more timesteps (if true)
That should be relatively easy, as you can try a few different options:
5 lags with 10 / 20 / 50 hidden units
20 lags with 10 / 20 / 50 hidden units
If you get better performance (e.g. lower MSE) on the 20-lag problem than on the 5-lag problem when you use 50 units, then you have made your point. You can reinforce your claims by showing results with different types of models (e.g. LSTMs vs. GRUs).
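A rough sketch of that experiment grid with Keras (the make_windows helper and the series variable are placeholder assumptions; only the unit/lag grid matters here):

    import tensorflow as tf

    def build_model(n_lags, n_units, n_features=1):
        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(n_units, input_shape=(n_lags, n_features)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
        return model

    for n_lags in (5, 20):
        for n_units in (10, 20, 50):
            X, y = make_windows(series, n_lags)   # hypothetical windowing helper
            model = build_model(n_lags, n_units)
            history = model.fit(X, y, validation_split=0.2, epochs=50, verbose=0)
            print(n_lags, n_units, min(history.history["val_loss"]))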
The Penn Treebank data seems difficult to understand. Below are two links:
https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data
https://www.tensorflow.org/tutorials/recurrent
My concern is as follows. The reader gives me around a million occurrences of 10000 words. I have written code to convert this dataset into one-hot encodings. Thus, I have a million vectors of 10000 dimensions, each vector having a 1 at a single location. Now, I want to train an LSTM (long short-term memory) model on this for prediction.
For simplicity, let us assume that there are 30 occurrences (rather than a million) and that the sequence length is 10 for the LSTM (the number of timesteps it unrolls). Let us denote these occurrences by
X1, X2, ..., X10, X11, ..., X20, X21, ..., X30
Now, my concern is: should I use 3 data samples for training,
X1, ..., X10 and X11, ..., X20 and X21, ..., X30,
or should I use 21 data samples for training,
X1, ..., X10 and X2, ..., X11 and X3, ..., X12, and so on until X21, ..., X30?
If I go with the latter, am I not breaking the i.i.d. assumption of training data sequence generation?
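To make the two options concrete, here is a toy sketch (plain integers stand in for the one-hot vectors):

    import numpy as np

    occurrences = np.arange(1, 31)   # stand-ins for X1..X30
    seq_len = 10

    # option 1: non-overlapping windows -> 3 training samples
    non_overlapping = occurrences.reshape(-1, seq_len)

    # option 2: sliding windows with stride 1 -> 30 - 10 + 1 = 21 samples
    sliding = np.array([occurrences[i:i + seq_len]
                        for i in range(len(occurrences) - seq_len + 1)])

    print(non_overlapping.shape)   # (3, 10)
    print(sliding.shape)           # (21, 10)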
I use Weka to test machine learning algorithms on my dataset. I have 3800 rows and around 25 features. I am testing combinations of different features for prediction models, and with cross-validation they seem to predict worse than the simple OneR algorithm does. Even C4.5 does not always predict better: sometimes it does and sometimes it does not, depending on which features are still able to classify.
But at a certain point I split my dataset into a test set and a training set (20/80), and when testing on the test set, the C4.5 algorithm had a far higher accuracy than my OneR algorithm. I thought that, given the small size of the dataset, it was probably just a coincidence that it predicted so well (the target classes were still split up in roughly the original proportions), and that it is therefore more useful to use cross-validation on small datasets like these.
However, testing on another test set again gave high accuracy for C4.5. So my question actually is: what is the best way to evaluate models when the datasets are this small?
I have seen some posts where this is discussed, but I am still not sure what the right way to do it is.
It's almost always a good approach to test your model via Cross-Validation.
A rule of thumb is to use 10-fold cross-validation.
In your case, 10-fold cross-validation will do the following in Weka:
split your 3800 training instances into 10 sets of 380 instances
for each set (s = 1 .. 10):
use the instances from s for testing and the other 9 sets for training a model (3420 training instances)
the result will be an average of the results obtained with the 10 models used.
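That procedure is the same idea that, say, scikit-learn implements; here is a sketch of it outside Weka (DecisionTreeClassifier standing in for C4.5/J48, and X, y assumed to hold your 3800 instances and their class labels):

    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # 10 stratified folds: each fold keeps the original class proportions
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
    print(scores.mean(), scores.std())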
Try to avoid evaluating your model with the "use training set" option, because that can result in a model that works very well on your existing data but has big problems with new, unseen instances (overfitting).