Brainscript example with Dynamic Axes - cntk

I am building an LSTM that handles several parallel sequences, and I'm struggling to find any brainscript example that handles dynamic axes.
In my specific case, an example consists of a binary label and N sequences, where each sequence i has a fixed length, though that length may differ between sequences (i.e. for j <> i).
For example, sequence 1 is always length 1024, sequence 2 is length 4096, sequence 3 is length 1024.
I am expressing these sequences by packing them in parallel in the CNTK text format:
0 |Label 1 |S1 0 |S2 1 |S3 0
0 |S1 1 |S2 1 |S3 1
... another 1021 rows
0 |S2 0
0 |S2 1
... another 3070 rows with only S2 defined
1 |Label 0 |S1 0 |S2 1 |S3 0
1 |S1 1 |S2 1 |S3 0
... another 1021 rows
1 |S2 1
1 |S2 0
... another 3070 rows with only S2 defined
2 |Label ...
and so on. I feel as though I've constructed examples like this in the past but I've been unable to track down any sample configs, or even any BS examples that specify dynamic axes. Is this approach doable?

The G2P example (...\Examples\SequenceToSequence\CMUDict\BrainScript\G2P.cntk) uses multiple dynamic axes. This is a snippet from this file:
# inputs and axes must be defined on top-scope level in order to get a clean node name from BrainScript.
inputAxis = DynamicAxis()
rawInput = Input (inputVocabDim, dynamicAxis=inputAxis, tag='feature')
rawLabels = Input (labelVocabDim, tag='label')
However, since in your case each input has the same length in every example, you may also want to consider just putting them into fixed-size tensors. E.g. instead of a sequence of 1024 values, you would just have a single vector of dimension 1024.
The choice depends on what you want to do with the sequences. Are you planning to run a recurrence over them? If so, you want to keep them as dynamic sequences. If they are just vectors that you plan to process with, say, big matrix products, you would rather want to keep them as static axes.

Related

Balancing a multilabel dataset using Julia

I have a dataframe like this:
id  text             feat_1  feat_2  feat_3  feat_n
1   random coments   0       0       1       0
2   random coments2  1       0       1       0
1   random coments3  1       1       1       1
The feat columns go from 1 to 100 and they are the labels of a multilabel dataset. The values are 1 and 0 (boolean).
The dataset has over 50k records and the labels are unbalanced. I am looking for a way to balance it, and I was working on this approach:
Sum the values in each feat column and then use the lowest value of this sum as a threshold to filter the dataset.
I need to keep all the feature columns, so the only thing I can exclude is rows (comments) to achieve this.
The main idea boils down to this: I need a balanced dataset to use in a multilabel classification problem, i.e. roughly the same amount of data for each feat column, since those columns are my labels.

pandas create Cross-Validation based on specific columns

I have a dataframe of a few hundred rows that can be grouped by id as follows:
df = Val1 Val2 Val3 Id
2 2 8 b
1 2 3 a
5 7 8 z
5 1 4 a
0 9 0 c
3 1 3 b
2 7 5 z
7 2 8 c
6 5 5 d
...
5 1 8 a
4 9 0 z
1 8 2 z
I want to use GridSearchCV, but with a custom CV that will ensure that all the rows from the same ID are always in the same set.
So either all the rows of a are in the test set, or all of them are in the train set - and so on for all the different IDs.
I want to have 5 folds - so 80% of the ids will go to the train and 20% to the test.
I understand that it can't be guaranteed that all folds will have exactly the same number of rows, since one ID might have more rows than another.
What is the best way to do so?
As stated, you can provide cv with an iterator. You can use GroupShuffleSplit() for this: once you use it to split your dataset, you can pass the result to GridSearchCV() as the cv parameter.
As mentioned in the sklearn documentation, there's a parameter called "cv" where you can provide "An iterable yielding (train, test) splits as arrays of indices."
It's worth checking the documentation first in future.
As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, the test sets aren't necessarily disjoint (i.e. over multiple splits, an ID may appear in more than one test set). If you want each ID to appear in exactly one test fold, you could use GroupKFold(). This is also available in sklearn.model_selection, and directly extends KFold to take group labels into account.
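Here is a minimal sketch of both options, assuming a hypothetical target column and estimator (the question only shows the feature columns and Id):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold, GroupShuffleSplit

X = df[['Val1', 'Val2', 'Val3']]
y = df['target']               # hypothetical target column
groups = df['Id']
param_grid = {'n_estimators': [50, 100]}

# Option 1: GroupShuffleSplit -> an iterable of (train, test) index arrays for cv
gss = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cv = list(gss.split(X, y, groups=groups))
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=cv)
search.fit(X, y)

# Option 2: GroupKFold -> every Id lands in exactly one test fold
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=GroupKFold(n_splits=5))
search.fit(X, y, groups=groups)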

Can I use pandas to create a biased sample?

My code uses a column called booking status that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from, dependent on the booking status). There are many more no than yes, so I would like to take a sample with all the yes and the same number of no.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to do this sample this way?
If our entire dataset looks like this:
print(df)
c1 c2
0 1 1
1 0 2
2 0 3
3 0 4
4 0 5
5 0 6
6 0 7
7 1 8
8 0 9
9 0 10
We may decide to sample from it using the DataFrame.sample function. By default, this function samples without replacement, meaning you'll receive an error if you specify a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in the c1 column are more likely to be sampled. Specifically, the sample function will never pick rows whose value in this column is zero. We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
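In other words, when n equals the number of non-zero weights, this is effectively the same as filtering:

filtered = df[df['c1'] != 0]
print(filtered)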
I was able to do this in the end; here is how I did it:
import pandas as pd

bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')

# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()

# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_1 = df[df['bookingstatus'] == 1]

# Undersample class 0 down to the size of class 1, then recombine
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
based on this https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone

Dendrograms with SciPy

I have a dataset that I shaped according to my needs, the dataframe is as follows:
Index      A     B  C  D     ...  Z
Date/Time  1     0  0  0,35  ...  1
Date/Time  0,75  1  1  1     ...  1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (example: the whole A column is compared to the whole B column across the whole time range).
I am expecting an output like this:
(expected dendrogram image, source: rsc.org)
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I plot the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with each other and plot, but in that way the dendrogram becomes way too complicated to observe, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
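For reference, a minimal sketch of the column-wise comparison I am describing (assuming the frame is called df and its values have been converted to numeric, since the sample above uses decimal commas):

import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

# transpose so that each column (A, B, ..., Z) becomes one observation of length 8878
X = df.T.values
Z = hierarchy.linkage(X, 'ward')

plt.figure(figsize=(10, 4))
hierarchy.dendrogram(Z, labels=df.columns.tolist())
plt.show()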

Input mix for Keras: Timeseries and Features

I have several time series Features (ECG, HRV and breathing) and separate features made from those time series (e.g. SDNN, RMSSD,...).
I follow François Chollet with the naming: for a 3D time series input tensor he uses [samples, timesteps, features].
The time series have 15000 values (samples) per timestep [[15000x1],[15000x1],...], while the separate features have 1 value (sample) per timestep. Those extra features with length [1] are different for each timestep: [[0.3],[0.35],[0.34],...].
ECG, HRV, F1, F2, ...
-------------------------------------------------------------
Sequence 1 |
Step 1 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 2 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 3 | [[15000x1],[1000x1],[1x1],[1x1],...]
Sequence 2 |
Step 1 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 2 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 3 | [[15000x1],[1000x1],[1x1],[1x1],...]
What would be the best approach to learn on all those inputs with Keras?
Just zero pad the separate features from 1 to 15000 and add them to the time series
Pad the separate features with themselves (using the one value repeatedly)
As the data is normalized between 0 and 1, pad the extra features with a value far outside that range, e.g. 1000
Have only the time series as 3D tensor input and the separate features as an additional input (as extra layers) and merge them into the learner (multi-input)
Extra question: how does zero padding influence the learner due to the "false" additional information? Especially for the 1-to-15000 padding of the separate features from above. Another example: the HRV and breathing signals are shorter than the ECG due to a different sample frequency. Here I would rather use interpolation instead of zero padding. Would you agree, or does zero padding not influence the learner?
Thanks
Assumption 1
Due to the ambiguity, I'm assuming this (please comment if not and I'll change it)
I'm calling ECG, HRV, etc. features that vary per step
Your feature with the highest frequency has 15000 steps, while the other features have fewer steps
You have a separate feature that is not sequential and has no steps. (I'll call it the separate feature in this answer)
Extra question:
Yes! Interpolate the less frequent features and make an input tensor like:
(numberOfSequences_maybePatient, 15000 steps, features_ECG_HRV_etc)
You need to keep the correlation between the features in time, and this is done by synchronizing the steps.
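As a rough sketch of that synchronization (my own illustration, using np.interp to upsample a hypothetical 1000-sample HRV trace onto the 15000-sample ECG time grid):

import numpy as np

hrv = np.random.rand(1000)                   # placeholder HRV trace
t_hrv = np.linspace(0.0, 1.0, num=1000)      # original time axis
t_ecg = np.linspace(0.0, 1.0, num=15000)     # target (ECG) time axis
hrv_on_ecg_grid = np.interp(t_ecg, t_hrv, hrv)   # shape (15000,)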
Will zero padding influence the results?
Yes it will, unless you use "masking" (a masking layer). But this only makes sense for handling samples (different sequences or patients) with different lengths, not features with a different length/sample rate.
For example, the following case would work well with zero padding and masking:
sequence 1: length 100 (all features included, ECG, HRV, etc.)
sequence 2: length 200 (all features included, ECG, HRV, etc.)
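A minimal Keras sketch of that padding-plus-masking setup (my assumptions: tensorflow.keras, 3 per-step features, a binary target):

from tensorflow.keras import layers, models

n_features = 3   # hypothetical: ECG, HRV, breathing per step
model = models.Sequential([
    layers.Masking(mask_value=0.0, input_shape=(None, n_features)),  # zero-padded steps are skipped
    layers.LSTM(32),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')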
How to deal with the separate feature?
There are a number of possible ways. One of the simplest, and probably quite effective, is to make it a constant sequence across all 15000 steps. This approach does not require thinking about how the feature relates to the rest of the data, and leaves that task to the model.
Suppose the separate feature value is 2 for the first sequence and 4 for the second sequence; then build this data array:
ECG, HRV, separate
--------------------------------------------------------
| [
sequence 1: | [
step 1 | [ecg1, hrv1, 2],
step 2 | [ecg2, hrv2, 2],
step 3 | [ecg3, hrv3, 2]
| ]
|
sequence 2: | [
step 1 | [ecg4, hrv4, 4],
step 2 | [ecg5, hrv5, 4],
step 3 | [ecg6, hrv6, 4]
| ]
| ]
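A small numpy sketch of building that array (toy shapes and names, purely illustrative):

import numpy as np

seq_features = np.random.rand(2, 3, 2)   # (sequences, steps, per-step features such as ECG and HRV)
separate = np.array([2.0, 4.0])          # one value per sequence

# broadcast the per-sequence value across all steps and append it as an extra feature
separate_as_seq = np.repeat(separate[:, None, None], seq_features.shape[1], axis=1)
combined = np.concatenate([seq_features, separate_as_seq], axis=-1)   # shape (2, 3, 3)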
You can also feed it as an additional input to the model:
regularSequences = Input((15000,features))
separateFeature = Input((1,)) #assuming 1 value per sequence
And then you decide whether you want to sum it somewhere, multiply it somewhere, etc. This approach might be more effective than the other one if you have an idea of what this feature means and how it relates to the rest of the data, so you can select the best operations and where to apply them.
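A sketch of that multi-input idea (my own choices: tensorflow.keras, an LSTM to summarize the sequence, and a simple concatenation as the merge):

from tensorflow.keras.layers import Input, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

n_features = 3                                 # hypothetical number of per-step features
regularSequences = Input((15000, n_features))
separateFeature = Input((1,))                  # one value per sequence

x = LSTM(32)(regularSequences)                 # summarize the sequence into a vector
x = Concatenate()([x, separateFeature])        # merge the separate feature here
output = Dense(1, activation='sigmoid')(x)

model = Model([regularSequences, separateFeature], output)
model.compile(optimizer='adam', loss='binary_crossentropy')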
Assumption 2
Taking this description from your updated question:
ECG, HRV, F1, F2, ...
-------------------------------------------------------------
Sequence 1 |
Step 1 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 2 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 3 | [[15000x1],[1000x1],[1x1],[1x1],...]
Sequence 2 |
Step 1 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 2 | [[15000x1],[1000x1],[1x1],[1x1],...]
Step 3 | [[15000x1],[1000x1],[1x1],[1x1],...]
Then:
You have 15000 features for a single time step in the ECG. (Are you sure this is not a sequence of 15000 steps?)
You have 1000 features for a single time step in HRV. (Are you sure this is not a sequence of 1000 steps?)
You have several other individual features per time step.
Well, organizing this data is quite easy (but mind the questions I asked above): just pack all features together in each time step.
The shape of your input data will be: (sequences, steps, 16002)
ECG, HRV, F1, F2, ...
-------------------------------------------------------------
[
Sequence 1 | [
Step 1 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
Step 2 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
Step 3 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
]
Sequence 2 | [
Step 1 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
Step 2 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
Step 3 | [ecg1,ecg2,...,ecg15000,hrv1,hrv2,...hrv1000,F1,F2,...]
]
]
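A small numpy sketch of that packing (toy shapes and names, purely illustrative):

import numpy as np

ecg = np.random.rand(3, 15000)       # 3 steps x 15000 ECG features
hrv = np.random.rand(3, 1000)        # 3 steps x 1000 HRV features
extras = np.random.rand(3, 2)        # 3 steps x individual features F1, F2

steps = np.concatenate([ecg, hrv, extras], axis=-1)   # shape (3, 16002)
data = steps[np.newaxis, ...]                         # shape (1 sequence, 3 steps, 16002)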