I have a dataframe of a few hundred rows that can be grouped by ID as follows:
df =
   Val1  Val2  Val3  Id
      2     2     8   b
      1     2     3   a
      5     7     8   z
      5     1     4   a
      0     9     0   c
      3     1     3   b
      2     7     5   z
      7     2     8   c
      6     5     5   d
    ...
      5     1     8   a
      4     9     0   z
      1     8     2   z
I want to use GridSearchCV, but with a custom CV that ensures all the rows from the same ID always end up in the same set.
So either all the rows of a are in the test set, or all of them are in the train set, and the same goes for every other ID.
I want 5 folds, so 80% of the IDs go to the train set and 20% to the test set.
I understand that this can't guarantee all folds have exactly the same number of rows, since one ID might have more rows than another.
What is the best way to do so?
As stated, you can provide cv with an iterator. You can use GroupShuffleSplit(). For example, once you use it to split your dataset, you can pass the result to GridSearchCV() as the cv parameter.
As mentioned in the sklearn documentation, there's a parameter called "cv" where you can provide "An iterable yielding (train, test) splits as arrays of indices."
Do check out the documentation in future first.
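A minimal sketch of that approach (the target column 'y', the estimator and the parameter grid below are placeholders, not part of the question):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit

X = df[['Val1', 'Val2', 'Val3']]
y = df['y']                       # hypothetical target column, not shown in the question
groups = df['Id']

# 5 splits, ~20% of the IDs held out each time; rows sharing an Id never straddle train/test
cv = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),       # placeholder estimator
    param_grid={'n_estimators': [50, 100]},       # placeholder grid
    cv=cv.split(X, y, groups),                    # an iterable of (train, test) index arrays
)
search.fit(X, y)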
As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, the test sets aren't necessarily disjoint (i.e. doing multiple splits, an ID may appear in multiple test sets). If you want each ID to appear in exactly one test fold, you could use GroupKFold(). This is also available in sklearn.model_selection, and directly extends KFold to take group labels into account.
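A minimal sketch with GroupKFold, reusing the hypothetical X, y and groups from the sketch above:
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [50, 100]},
    cv=gkf.split(X, y, groups),   # each Id lands in exactly one test fold
)
search.fit(X, y)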
My data has a column called bookingstatus that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from, depending on the booking status). There are many more no rows than yes rows, so I would like to take a sample with all the yes rows and the same number of no rows.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to take this sample?
If our entire dataset looks like this:
print(df)
c1 c2
0 1 1
1 0 2
2 0 3
3 0 4
4 0 5
5 0 6
6 0 7
7 1 8
8 0 9
9 0 10
We may decide to sample from it using the DataFrame.sample function. By default, this function samples without replacement, meaning you'll receive an error if you specify a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in the c1 column are more likely to be sampled. Specifically, the sample function will never pick rows whose weight is zero. We can fix this error in either of the following ways.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
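Put differently, with these numbers the weighted sample returns exactly the rows where c1 is non-zero (up to row order), which you could also get with a plain filter:
m2_equivalent = df[df['c1'] != 0]   # same rows as the weighted sample, just without shuffling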
I was able to do this in the end; here is how I did it:
import pandas as pd

bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')
# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()
# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_1 = df[df['bookingstatus'] == 1]
# Under-sample class 0 down to the number of class 1 rows, then recombine
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
based on this https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone
I have several time series features (ECG, HRV and breathing) and separate features derived from those time series (e.g. SDNN, RMSSD, ...).
I follow François Chollet's naming: for a 3D time series input tensor they use [samples, timesteps, features].
The time series have 15000 values (samples) per timestep [[15000x1],[15000x1],...], while the separate features have 1 value (sample) per timestep. Those extra features of length [1] differ for each timestep: [[0.3],[0.35],[0.34],...].
ECG, HRV, F1, F2, ...
-------------------------------------------------------------
Sequence 1 |
Step 1 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 2 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 3 | [[15000x1],[1000x1],[1x1],[1x1],...]
Sequence 2 |
Step 1 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 2 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 3 | [[15000x1],[1000x1],[1x1],[1x1],...]
How would you best approach learning from all those inputs with Keras?
Just zero pad the separate features from 1 to 15000 and add them to the time series
Pad the separate features with themselves (using the one value repeatedly)
Since the data is normalized between 0 and 1, pad the extra features with a value far outside that range, e.g. 1000
Have only the time series as 3D tensor input and the separate features as an additional input (as extra layers) and merge them into the learner (multi-input)
Extra question: how does zero padding influence the learner, given the "false" additional information? Especially for the 1-to-15000 padding of the separate features above. Another example: the HRV and breathing signals are shorter than the ECG due to different sample frequencies. Here I would rather use interpolation instead of zero padding. Would you agree, or does zero padding not influence the learner?
Thanks
Assumption 1
Due to the ambiguity, I'm assuming this (please comment if not and I'll change it)
I'm calling ECG, HRV, etc. features that vary per step
Your feature with the highest frequency has 15000 steps, while the other features have fewer steps
You have a separate feature that is not sequential and has no steps. (I'll call it the separate feature in this answer)
Extra question:
Yes! Interpolate the less frequent features and make an input tensor like:
(numberOfSequences_maybePatient, 15000 steps, features_ECG_HRV_etc)
You need to keep the features correlated in time, and this is achieved by synchronizing the steps.
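A minimal sketch of that interpolation, assuming a 1000-step HRV signal stretched onto the 15000-step ECG grid (the signals here are random placeholders):
import numpy as np

hrv = np.random.rand(1000)                        # lower-rate signal (placeholder values)
ecg = np.random.rand(15000)                       # highest-rate signal (placeholder values)
x_old = np.linspace(0.0, 1.0, num=hrv.size)
x_new = np.linspace(0.0, 1.0, num=ecg.size)
hrv_upsampled = np.interp(x_new, x_old, hrv)      # now 15000 steps, synchronized with the ECG

step_features = np.stack([ecg, hrv_upsampled], axis=-1)   # shape (15000, 2), one row per step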
Will zero padding influence the results?
Yes it will, unless you use "masking" (a masking layer). But this only makes sense for handling samples (different sequences or patients) with different lengths, not features with different lengths/sample rates.
Example, the following case would work well with zero padding and masking:
sequence 1: length 100 (all features included, ECG, HRV, etc.)
sequence 2: length 200 (all features included, ECG, HRV, etc.)
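A minimal sketch of zero padding plus a Masking layer for that case (the sequence length, layer sizes and output layer are illustrative choices, not prescribed by the answer):
from keras.layers import Dense, Input, LSTM, Masking
from keras.models import Model

inputs = Input(shape=(200, 3))               # pad every sequence to the longest length (200 here)
x = Masking(mask_value=0.0)(inputs)          # timesteps whose features are all zero are skipped
x = LSTM(32)(x)
outputs = Dense(1, activation='sigmoid')(x)  # placeholder output
model = Model(inputs, outputs)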
How to deal with the separate feature?
There are a number of possible ways. One of the simplest, and probably very effective, is to make it a constant sequence over all the 15000 steps (see the sketch after the layout below). This approach does not require thinking about how the feature relates to the rest of the data, and leaves that task to the model.
Suppose the separate feature value is 2 for the first sequence and 4 for the second sequence, make then this data array:
ECG, HRV, separate
--------------------------------------------------------
| [
sequence 1: | [
step 1 | [ecg1, hrv1, 2],
step 2 | [ecg2, hrv2, 2],
step 3 | [ecg3, hrv3, 2]
| ]
|
sequence 2: | [
step 1 | [ecg4, hrv4, 4],
step 2 | [ecg5, hrv5, 4],
step 3 | [ecg6, hrv6, 4]
| ]
| ]
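A minimal NumPy sketch of building that array by repeating the separate feature along the time axis (shapes and values are illustrative):
import numpy as np

series = np.random.rand(2, 15000, 2)             # (sequences, steps, ECG+HRV), placeholder values
separate = np.array([2.0, 4.0])                  # one value per sequence (2 and 4 as above)
separate_expanded = np.repeat(separate[:, None, None], 15000, axis=1)   # (2, 15000, 1)
data = np.concatenate([series, separate_expanded], axis=-1)             # (2, 15000, 3)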
You can also input it as an additional input in the model:
regularSequences = Input((15000,features))
separateFeature = Input((1,)) #assuming 1 value per sequence
And then you decide if you want to sum it somewhere, multiply it somewhere, etc. This approach might be more effective than the other if you have an idea of what this feature means and how it relates to the rest of the data, so you can pick the best operations and where to apply them.
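A minimal sketch of such a multi-input model, where the separate feature is concatenated after an LSTM summarizes the sequences (the LSTM, the concatenation point and the output layer are illustrative choices, one of many possible merges):
from keras.layers import Concatenate, Dense, Input, LSTM
from keras.models import Model

features = 2                                      # e.g. ECG and HRV channels
regularSequences = Input((15000, features))
separateFeature = Input((1,))                     # assuming 1 value per sequence

x = LSTM(32)(regularSequences)                    # summarize the time series (size is illustrative)
x = Concatenate()([x, separateFeature])           # merge with the separate feature
out = Dense(1, activation='sigmoid')(x)           # placeholder output for a binary task

model = Model([regularSequences, separateFeature], out)
model.compile(optimizer='adam', loss='binary_crossentropy')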
Assumption 2
Taking this description from your updated question:
ECG, HRV, F1, F2, ...
-------------------------------------------------------------
Sequence 1 |
Step 1 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 2 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 3 | [[15000x1],[1000x1],[1x1],[1x1],...]
Sequence 2 |
Step 1 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 2 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 3 | [[15000x1],[1000x1],[1x1],[1x1],...]
Then:
You have 15000 features for a single time step in the ECG. (Are you sure this is not a sequence of 15000 steps?)
You have 1000 features for a single time step in HRV. (Are you sure this is not a sequence of 1000 steps?)
You have several other individual features per time step.
Well, organizing this data is quite easy (but mind the questions I asked above): just pack all features together in each time step, as sketched after the layout below.
The shape of your input data will be: (sequences, steps, 16002)
ECG, HRV, F1, F2, ...
-------------------------------------------------------------
[
Sequence 1 | [
Step 1 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
Step 2 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
Step 3 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
]
Sequence 2 | [
Step 1 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
Step 2 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
Step 3 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
             ]
]
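A minimal NumPy sketch of packing the features this way (shapes are illustrative, assuming two extra scalar features F1 and F2 per step):
import numpy as np

sequences, steps = 2, 3                           # as in the layout above
ecg = np.random.rand(sequences, steps, 15000)     # placeholder values
hrv = np.random.rand(sequences, steps, 1000)
f1 = np.random.rand(sequences, steps, 1)
f2 = np.random.rand(sequences, steps, 1)

data = np.concatenate([ecg, hrv, f1, f2], axis=-1)   # shape (2, 3, 16002)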