Stratified sampling into 3 sets considering unbalance - pandas

I have looked into "Stratified sample in pandas" and "stratified sampling on ranges", among others, but they don't address my issue specifically, as I'm looking to split the data into 3 sets randomly.
I have an unbalanced dataframe of 10k rows: 10% positive class, 90% negative class. I'm trying to figure out a way to split this dataframe into 3 datasets holding 60%, 20% and 20% of the rows, while preserving the class imbalance in each. The split has to be random and without replacement, which means that if I put the 3 datasets back together, the result has to equal the original dataframe.
Usually I would use train_test_split(), but it only splits into two datasets, not three.
Any suggestions?
Reproducible example:
df = pd.DataFrame({"target" : np.random.choice([0,0,0,0,0,0,0,0,0,1], size=10000)}, index=range(0,10000,1))

How about using train_test_split() twice?
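1st time, using train_size=0.6, obtaining a 60% training set and a 40% (test + valid) set.
2nd time, on that 40% set, using train_size=0.5, obtaining 50%*40%=20% validation and 20% test. Passing the target column as stratify in both calls keeps the 10%/90% class ratio in each set.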
Is this workaround valid for you?
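A minimal sketch of the two-step split, assuming scikit-learn is installed. The stratify argument keeps the class ratio in every part; the random_state values and the names train, rest, valid and test are illustrative choices, not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame similar to the reproducible example (about 10% positives).
rng = np.random.default_rng(0)
df = pd.DataFrame({"target": rng.choice([0] * 9 + [1], size=10000)})

# Step 1: 60% train, 40% held out, preserving the class ratio.
train, rest = train_test_split(
    df, train_size=0.6, stratify=df["target"], random_state=0
)
# Step 2: split the held-out 40% in half: 20% validation, 20% test.
valid, test = train_test_split(
    rest, train_size=0.5, stratify=rest["target"], random_state=0
)

# The three parts are disjoint, so concatenating them rebuilds the original.
rebuilt = pd.concat([train, valid, test]).sort_index()
print(len(train), len(valid), len(test))
print(train["target"].mean(), valid["target"].mean(), test["target"].mean())
```

Because each call samples without replacement, no row appears in more than one set.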

Related

How does Stratified Sampling Work in Weka

I have a dataset that I want to split using stratified sampling, with 70% of the data in the training set and 30% in the test set. So I split the dataset into 10 equal subsets using the StratifiedRemoveFolds filter in Weka. Then I appended 7 subsets to make the 70% training dataset and appended the remaining 3 subsets to make the 30% test dataset. However, this is not a good option. I found that one value of the 1st attribute was missing from the test set: my 1st attribute has 7 values, but only 6 of them appeared in the test set. As a result, when I ran the classifier on the training set I got the error "Training set and Test set are incompatible".
I went through the link Stratified Sampling in WEKA. I found if I want to generate a 5% subsample, set the folds to 20. If this is the strategy, then for 30% test set do I need to set the numberofFold of StratifiedRemoveFold filter = 120? And also what about the test set? What should I set as numberofFolds in test set where test set is 70% of whole dataset?
You could try using the supervised Resample (weka.filters.supervised.instance.Resample) filter instead, with no replacement and a bias factor of 0 (to use the distribution of the input data). When using the invert flag, you get the remainder of the dataset.
If you really want to use StratifiedRemoveFolds, then use 10 folds, apply the filter 10 times to get all the 10 folds out and then combine 7 to make your 70% and the remainder to get your 30%.

How datasets are structured in TensorFlow?

In my first TensorFlow project, I have a big dataset (1M elements) which contains 8 categories of elements, and each category has a different number of elements, of course. I want to split the big dataset into 10 exclusive small datasets, each of them having approximately 1/10 of each category. (This is for 10-fold cross-validation purposes.)
Here is how I do it.
I wind up with 80 datasets (10 small datasets per category), and then I randomly sample data from all 80 of them using sample_from_datasets. However, after some steps, I get a lot of warnings saying "DirectedInterleave selected an exhausted input:36", where 36 can be some other integer.
The reason I want to use sample_from_datasets is that I tried shuffling the original dataset. Even when shuffling only 0.4 x the total number of elements, it still takes a very long time to finish (about 20 minutes).
My questions are
1. based on my case, any good advice on how to structure the datasets?
2. is it normal to have a long shuffling time? any better solution for shuffling?
3. why do I get this "DirectedInterleave selected an exhausted input:" warning, and what does it mean?
thank you.
Split your whole dataset into training, testing and validation sets. As you have 1M elements, you can split like this: 60% training, 20% testing and 20% validation. How you split the dataset is completely up to you and your requirements, but normally most of the data is used for training the model, and the rest is used for testing and validation.
As your data has multiple categories, split each category into training, testing and validation sets.
Suppose you have data in categories A, B, C and D. Split the "A", "B", "C" and "D" data like below:
'A'- 60 % for training 20% testing and 20% validation
'B'- 60 % for training 20% testing and 20% validation
'C'- 60 % for training 20% testing and 20% validation
'D'- 60 % for training 20% testing and 20% validation
Finally merge all the A, B, C and D training, testing and validation datasets.
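The per-category split and merge described above can be sketched in pandas as follows. This is only an illustration under assumptions: the helper name split_per_category, the column name "category" and the demo sizes are hypothetical, and the data is assumed to fit in a DataFrame before being handed to TensorFlow.

```python
import numpy as np
import pandas as pd

def split_per_category(df, label_col, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle each category separately, cut it by `fractions`, then merge."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in fractions]
    # Cumulative cut points, e.g. 0.6 and 0.8 for fractions (0.6, 0.2, 0.2).
    cuts = np.cumsum(fractions)[:-1]
    for _, group in df.groupby(label_col):
        shuffled = group.iloc[rng.permutation(len(group))]
        inner = np.round(cuts * len(group)).astype(int).tolist()
        bounds = [0, *inner, len(group)]
        for part, lo, hi in zip(parts, bounds[:-1], bounds[1:]):
            part.append(shuffled.iloc[lo:hi])
    # Merge the per-category pieces into one frame per split.
    return [pd.concat(p) for p in parts]

# Demo with four categories of different sizes.
demo = pd.DataFrame({"category": list("A" * 100 + "B" * 50 + "C" * 200 + "D" * 30)})
train, test, valid = split_per_category(demo, "category")
print(train["category"].value_counts().sort_index())
```

Each category contributes the same proportions to every split, so small categories are not crowded out by large ones.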

How to use multiple imputed data for further analysis in SVM and ANN?

My original data contains some missing values and I used multiple imputation to fill them. My next objective is to use these data in SVM and ANN. I originally thought MI would give me a "pooled" completed dataset but it turned out that MI only gives pooled analysis results regarding the imputed datasets. So my questions are:
1) Is there any way, such as an equation, to aggregate the imputed datasets into one dataset that I can use for further analysis?
2) If not, how should I proceed with my study using the multiple datasets?
Thank you!
This is a general misunderstanding about MI.
The general process is supposed to be like this:
1. Multiple imputation
2. Analysis of each imputed dataset
3. Pooling of the results
If you do the imputation and then merge all imputed datasets into one imputed dataset, you lose all the benefit of MI; you could then just as well have used any other imputation method. The idea is to perform your analysis, for example, 5 times, once for each imputed dataset, because you want to account for the different outcomes your analysis could have had with different imputed input datasets. Afterwards you pool / merge the results of your analysis.
The whole process is not so common in ML. But in your case you could for example use SVM on all 5 datasets and then afterwards compare the results / come up with a procedure to merge/combine the results.
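As a rough sketch of that idea, assuming scikit-learn: IterativeImputer with sample_posterior=True stands in for the multiple imputation step, one SVM is fitted per imputed dataset, and the predictions are pooled by majority vote. The synthetic data, m = 5 and the voting rule are illustrative assumptions; majority voting is just one possible way to combine the results.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.svm import SVC

# Synthetic data with roughly 10% of the values knocked out.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

m = 5  # number of imputed datasets
votes = []
for i in range(m):
    # sample_posterior=True draws the imputed values at random,
    # so every seed yields a different completed dataset.
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_imputed = imputer.fit_transform(X_missing)
    model = SVC().fit(X_imputed, y)
    votes.append(model.predict(X_imputed))

# Pool the m analyses: majority vote over the per-dataset predictions.
pooled = (np.mean(votes, axis=0) >= 0.5).astype(int)
print("pooled training accuracy:", (pooled == y).mean())
```

In a real study the pooled predictions would be evaluated on held-out data rather than on the training rows used here.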

Random projection in Python Pandas using a dataframe containing NaN values

I have a dataframe data containing real values and some NaN values. I'm trying to perform locality sensitive hashing using random projections to reduce the dimension to 25 components, specifically with the sklearn.random_projection.GaussianRandomProjection class. However, when I run:
from sklearn import random_projection
tx = random_projection.GaussianRandomProjection(n_components=25)
data25 = tx.fit_transform(data)
I get Input contains NaN, infinity or a value too large for dtype('float64'). Is there a work-around to this? I tried changing all the NaN values to a value that is never present in my dataset, such as -1. How valid would my output be in this case? I'm not an expert behind the theory of locality sensitive hashing/random projections so any insight would be helpful as well. Thanks.
NA / NaN values (not-available / not-a-number) are, I have found, just plain troublesome.
You don't want to just substitute a random value like -1. If you are inclined to do that, use one of the Imputer classes. Otherwise, you are likely to very substantially change the distances between points. You likely want to preserve distances as much as possible if you are using random projection:
The dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.
However, this may or may not result in reasonable values for learning. As far as I know, imputation is an open field of study, which (for instance) this gentleman has specialized in studying.
If you have enough examples, consider dropping rows or columns that contain NaN values. Another possibility is training a generative model like a Restricted Boltzmann Machine and using that to fill in missing values:
import numpy as np
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
from sklearn.neural_network import BernoulliRBM

# all_data: float NumPy array scaled to [0, 1]; data_with_no_nans: its complete rows
rbm = BernoulliRBM().fit(data_with_no_nans)
mean_imputed_data = SimpleImputer().fit_transform(all_data)
rbm_imputation = rbm.gibbs(mean_imputed_data)
nan_mask = np.isnan(all_data)
all_data[nan_mask] = rbm_imputation[nan_mask]
Finally, you might consider imputing using nearest neighbors. For a given column, train a nearest neighbors model on all the variables except that column using all complete rows. Then, for a row missing that column, find the k nearest neighbors and use the average value among them. (This gets very costly, especially if you have rows with more than one missing value, as you will have to train a model for every combination of missing columns).
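A minimal sketch of neighbor-based imputation followed by the projection, assuming a scikit-learn version that ships KNNImputer. Note that KNNImputer is a swapped-in shortcut: it measures distances on the features both rows have observed instead of training one model per missing column as described above. The synthetic data and n_neighbors=5 are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.random_projection import GaussianRandomProjection

# Synthetic stand-in for the dataframe: 100 rows, 50 columns, ~5% NaN.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 50))
data[rng.random(data.shape) < 0.05] = np.nan

# Each missing entry becomes the mean of that column over the
# nearest rows in which the column is observed.
filled = KNNImputer(n_neighbors=5).fit_transform(data)

# With the NaNs gone, the random projection runs without the input error.
tx = GaussianRandomProjection(n_components=25, random_state=0)
data25 = tx.fit_transform(filled)
print(data25.shape)
```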

Pandas, compute many means with bootstrap confidence intervals for plotting

I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:
ATG12 Norm  ATG5 Norm  ATG7 Norm  Cancer Stage
5.55        4.99       8.99       IIA
4.87        5.77       8.88       IIA
5.98        7.88       8.34       IIC
The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:
df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']].mean()
But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
It boils down to:
import scipy
import scikits.bootstrap as bootstrap
CI = bootstrap.ci(data=Series, statfunction=scipy.mean)
# CI[0] and CI[1] are the lower and upper bounds of the confidence interval
I tried to apply this method to each subset of data with a nested-loop script:
for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]:  # PROBLEM!!
        Series = i[p]
        print(p)
        print(Series.mean())
        ci = bootstrap.ci(data=Series, statfunction=scipy.mean)
Which produced an error message
AttributeError: 'tuple' object has no attribute called 'columns'
Not knowing what "tuples" are, I have some reading to do but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to easily plot from. I'm new to Pandas so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.
The way you iterate over the groupby-object is wrong! When you use groupby(), your data frame is sliced along the values in your groupby-column(s), together with these values as group names, forming a so-called "tuple":
(name, dataforgroup). The correct recipe for iterating over groupby-objects is
for name, group in data.groupby('Cancer Stage'):
    print(name)
    for p in group.columns[0:3]:
        ...
Please read more about the groupby-functionality of pandas here and go through the python-reference in order to understand what tuples are!
Grouping data frames and applying a function is essentially done in one statement, using the apply-functionality of pandas:
cols = data.columns[0:3]
for col in cols:
    print(data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x, statfunction=scipy.mean)))
This does everything you need in one statement per column and produces an easily plottable series for each.
EDIT:
I toyed around with a data frame object I created myself:
df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})
for col in ['A', 'C']:
    print(df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values)))
yields two series that look like this:
B
a [6.58333333333, 14.3333333333]
b [8.5, 16.25]
B
a [21.5833333333, 29.3333333333]
b [23.4166666667, 31.25]
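If scikits.bootstrap is not available, a plain percentile bootstrap can be sketched with NumPy alone. The helper bootstrap_ci below is hypothetical and uses the simple percentile method, so its intervals will differ somewhat from the ones bootstrap.ci reports above; n_boot, alpha and seed are illustrative defaults.

```python
import numpy as np
import pandas as pd

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement n_boot times and record each resample's mean.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    low, high = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return pd.Series({"mean": values.mean(), "ci_low": low, "ci_high": high})

df = pd.DataFrame({"A": range(24), "B": list("aabb") * 6, "C": range(15, 39)})

# Iterating over the groupby yields (name, group) tuples, one per value of B.
summary = pd.DataFrame(
    {name: bootstrap_ci(group) for name, group in df.groupby("B")["A"]}
).T
print(summary)
```

The result has one row per group with the columns mean, ci_low and ci_high. For a bar chart, the distances mean - ci_low and ci_high - mean can be passed as asymmetric error bars, for example as a (2, N) array to the yerr argument of matplotlib's bar.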