Variable size multi-label candidate sampling in tensorflow? - tensorflow

nce_loss() asks for a static int value for num_true. That works well for problems where we have the same amount of labels per training example and we know it in advance.
When labels have a variable shape [None], and being batched and/or bucketed by bucket size with .padded_batch() + .group_by_window() it is necessary to provide a variable size num_true in order to accustom for all training examples. This is currently unsupported to my knowledge (correct me if I'm wrong).
In other words suppose we have either a dataset of images with an arbitrary amount of labels per each image (dog, cat, duck, etc.) or a text dataset with numerous multiple classes per sentence (class_1, class_2, ..., class_n). Classes are NOT mutually exclusive, and can vary in size between examples.
But as the amount of possible labels can be huge 10k-100k is there a way to do a sampling loss to improve performance (in comparison with a sigmoid_cross_entropy)?
Is there a proper way to do this or any other workarounds?
nce_loss = tf.nn.nce_loss(
weights=nce_weights,
biases=nce_biases,
labels=labels,
inputs=inputs,
num_sampled=num_sampled,
# Something like this:
# `num_true=(tf.shape(labels)[-1])` instead of `num_true=const_int`
# , would be preferable here
num_classes=self.num_classes)

I see two issues:
1) Work with NCE with different numbers of true values;
2) Classes that are NOT mutually exclusive.
To the first issue, as #michal said, there is an expectative of including this functionality in the future. I have tried almost the same thing: to use labels with shape=(None, None), i.e., true_values dimension None. The sampled_values parameter has the same problem: true_values number must be a fixed integer number. The recomended work around is to use a class (0 is the best one) representing <PAD> and complete the number of true_values. In my case, 0 is an special token that represents <PAD>. Part of code is here:
assert len(labels) <= (window_size * 2)
zeros = ((window_size * 2) - len(labels)) * [0]
labels = labels + zeros
labels.sort()
I sorted the label because considering another recommendation:
Note: By default this uses a log-uniform (Zipfian) distribution for
sampling, so your labels must be sorted in order of decreasing
frequency to achieve good results.
In my case, the special tokens and more frequent words have lower indexes, otherwise, less frequent words have higher indexes. I included all label classes associated to the input at same time and completed with zero till the true_values number. Of course, you must ignore the 0 class at the end.

Related

How to split the dataset into mutiple folds while keeping the ratio of an attribute fixed

Let's say that I have a dataset with multiple input features and one single output. For the sake of simplicity, let's say the output is binary. Either zero or one.
I want to split this dataset into k parts and use a k-fold cross-validation model to learn the mapping from the input features to the output one. If the dataset is imbalanced, the ratio between the number of records with output 0 and 1 is not going to be one. To make it concrete, let's say that 90% of the records are 0 and only 10% are 1.
I think it's important that within each part of k-folds we should see the same ratio of 0s and 1s in order for successful training (the same 9 to 1 ratio). I know how to do this in Pandas but my question is how to do it in TFX.
Reading the TFX documentation, I know that I can split a dataset by specifying an output_config to the class loading the examples:
output = tfx.proto.Output(
split_config=tfx.proto.SplitConfig(splits=[
tfx.proto.SplitConfig.Split(name='fold_1', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_2', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_3', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_4', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_5', hash_buckets=1)
]))
example_gen = CsvExampleGen(input_base=input_dir, output_config=output)
But then, the aforementioned ratio of the examples in each fold will be random at best. My question is: Is there any way I can specify what goes into each split? Can I somehow enforce the ratio of a feature?
BTW, I have seen and experimented with the partition_feature_name argument of the SplitConfig class. It's not useful here unless there's a feature with the ID of the fold for each example which I think is not practical since I might want to change the number of folds as part of the experiment without changing the dataset.
I'm going to answer my own question but only as a workaround. I'll be happy to see someone develop a real solution to this question.
What I could come up with at this point was to split the dataset into a number of tfrecord files. I've chosen a "composite" number of files so I can split them into (almost) any number I want. For this, I've settled down on 60 since it can be divided by 2, 3, 4, 5, 6, 10, and 12 (I don't think anyone would want KFold with k higher than 12). Then at the time of loading them, I have to somehow select which files will go into each split. There are two things to consider here.
First, the ImportExampleGen class from TFX supports glob file patterns. This means we can have multiple files loaded for each split:
input = tfx.proto.Input(splits=[
tfx.proto.Input.Split(name="fold_1", pattern="fold_1*"),
tfx.proto.Input.Split(name="fold_2", pattern="fold_2*")
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
input_config=input)
Next, we need some ingenuity to enable splitting the files into any number we like at the time of loading them. And this is my approach to it:
fold_3.0_4.0_5.0_6.0_10.0/part-###.tfrecords.gz
fold_3.0_4.0_5.1_6.0_10.6/part-###.tfrecords.gz
fold_3.0_4.0_5.2_6.0_10.2/part-###.tfrecords.gz
fold_3.0_4.0_5.3_6.0_10.8/part-###.tfrecords.gz
...
The file pattern is like this. Between each two _ I include the divisor, a ., and then the remainder. And I'll have as many of these as I want to have the "split possibility" later, at the time of loading the dataset.
In the example above, I'll have the option to load them into 3, 4, 5, 6, and 10 folds. The first file will be loaded as part of the 0th split if I want to split the dataset into any number of folds while the second file will be in the 1st split of 5-fold and 6th split of 10-fold.
And this is how I'll load them:
NUM_FOLDS = 5
input = tfx.proto.Input(splits=[
tfx.proto.Input.Split(name=f'fold_{index + 1}',
pattern=f"fold_*{str(NUM_FOLDS)+'.'+str(index)}*/*")
for index in range(NUM_FOLDS)
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
input_config=input)
I could change the NUM_FOLDS to any of the options 3, 4, 5, 6, or 10 and the loaded dataset will consist of pre-curated k-fold splits. It is worth mentioning that I have made sure of the ratio of the samples within each file at the time of creating them. So any combination of them will also have the same ratio.
Again, this is only a trick in the absence of an actual solution. The main drawback of this approach is the fact that you have to split the dataset manually yourself. I've done so, in this case, using pandas. That meant that I had to load the whole dataset into memory. Which might not be possible for all the datasets.

Question about input_dim in keras embedding layer

From the documentation on tf.keras.layers.Embedding :
input_dim:
Integer. Size of the vocabulary, i.e. maximum integer index + 1.
mask_zero:
Boolean, whether or not the input value 0 is a special “padding” value that should be masked
out. This is useful when using recurrent layers which may take variable length input. If this
is True, then all subsequent layers in the model need to support masking or an exception will
be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the
vocabulary (input_dim should equal size of vocabulary + 1).
I was reading this answer but I'm still confused. If my vocabulary size is n but they are encoded with index values from 1 to n (0 is left for padding), is input_dim equal to n or n+1?
If the inputs are padded with zeroes, what are the consequences of leaving mask_zero = False?
If mask_zero = True, based on the documentation, I would have to increment the answer from my first question by one? What is the expected behaviour if this was not done?
I am basically just trying to rephrase parts of the linked answer to make it a bit more understandable in the current context, and also address your other subquestions (which technically should be their own questions, according to [ask]).
It does not matter whether you actually use 0 for padding or not, Keras assumes that you will start indexing from zero and will have to "brace itself" for an input value of 0 in your data. Therefore, you need to choose the value as n+1, because you are essentially just adding a specific value to your vocabulary that you previously didn't consider.
I think this is out of scope for this question to discuss in detail, but - depending on the exact model - the loss values on padded positions do not affect the backpropagation. However, if you choose mask_zero = False, your model will essentially have to correctly predict padding on all those positions (where the padding then also affects the training).
This relates to my illustration: Essentially, you are adding a new vocabulary index. If you do not adjust your dimension, there will likely be an indexing error (out of range) for the vocabulary entry with the highest index (n). Otherwise, you would likely not notice any different behavior.

How to remove frequent/infrequent features from Sklearn CountVectorizer?

Is it possible to remove a percentage of features that occur most frequently / infrequently, from the CountVectorizer?
So basically organize the features from a greatest to least occurrence distribution and just remove a percentage from the left or right side?
As far as I know, there is no straight forward way to do that.
Let me propose a way to achieve the result you want.
I will assume that you are only interested in unigrams (one-word features) to make the examples also simpler.
Regarding the top-x per cent of the features, a possible implementation can be based on the max_features parameter of the CountVectorizer (see user guide).
First, you would need to find out the total number of features by using the CountVectorizer with the default values so that it generates the full vocabulary of terms in the corpus.
vect = CountVectorizer()
bow = vect.fit_transform(corpus)
total_features = len(vect.vocabulary_)
Then you use the CountVectorizer with the max_features parameter, limiting the number of features to the top percentage you need, say 20%. When using the max_features the most frequent terms are selected automatically.
top_vect = CountVectorizer(max_features=int(total_features * 0.2))
top_bow = top_vect.fit_transform(corpus)
Now, regarding the bottom-x per cent of the features, even though I cannot think a good reason why you need that, here is an approach. The vocabulary parameter can be used to limit the model to count only the less frequent terms. For that, we use the output of the first run of the CountVectorizer to create a list of the less common terms.
# Create a list of (term, frequency) tuples sorted by their frequency
sum_words = bow.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vect.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1])
# Keep only the terms in a list
vocabulary, _ = zip(*words_freq[:int(total_features * 0.2)])
vocabulary = list(vocabulary)
Finally, we use the vocabulary to limit the model to the less frequent terms.
bottom_vect = CountVectorizer(vocabulary=vocabulary)
bottom_bow = bottom_vect.fit_transform(corpus)

What is "valency", with regards to machine learning?

This term came up a few times in the Tensorflow Dev Summit, and it shows up in the Tensorflow Extended documentation, but without any sort of definition. After a fair amount of googling, I don't see reference to it in any Statistics-related setting. Searching the Tensorflow repositories produces a few hits, but they're similarly unhelpful. The term does seem to be used in Chemistry, Psychology, and Linguistics, but those definitions appear to be unrelated.
Per the 2017 TFX paper http://stevenwhang.com/tfx_paper.pdf, TFX can calculate a number of stats on a dataset, including:
"The expected valency of the feature in each example, i.e., minimum
and maximum number of values."
We can also look at the code that calculates valency in TFX. From what I can tell, it's designed to run on a feature that is an array, and counts the minimum and maximum number of values within that array for that feature:
# Extract the valency information of the feature.
valency = ''
if feature.HasField('value_count'):
if (feature.value_count.min == feature.value_count.max and
feature.value_count.min == 1):
valency = 'single'
else:
min_value_count = ('[%d' % feature.value_count.min
if feature.value_count.HasField('min') else '[0')
max_value_count = ('%d]' % feature.value_count.max
if feature.value_count.HasField('max') else 'inf)')
valency = min_value_count + ',' + max_value_count
from: https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/utils/display_util.py#L68
As discussed in this blog,
Valency indicates the number of values required per training example.
In the case of categorical features, single indicates that each
training example must have exactly one category for the feature.
More broadly speaking, it applies to features with multiple values (IMHO not that common for features in Machine Learning), e.g., lists and arrays. In that case, valency refers to the minimum or maximum number of values in these data types. For lists, one can compute the valency by applying np.min()/np.max() on the list lengths from all available feature examples.
After experimenting with both numerical and categorical features, it turns out that there only appears values (e.g., "single") in the "Valency" column when the value in the corresponding "Presence" column is "optional" (tfdv 1.6.0).

Setting up the input on an RNN in Keras

So I had a specific question with setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model with the rest being padded by 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28. So there are 550k rows each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k- sequence length) different arrays and feed all of those to the network?
Assuming that I want to the first layer to be equivalent to the number of features per row, and looking at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50,27), and again do I have to manually split the dataset up or would Keras automatically do that for me?
RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array. This is also called "batch size", number of examples, samples, etc.
Time steps are the amount of steps for each sequence
Elements per step is how much info you have in each step of a sequence
I'm assuming the 27 features are inputs and relate to ElementsPerStep, while the 1 target is the expected output having 1 output per step.
So I'm also assuming that your output is a sequence with also 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, then you must reshape your array like this:
(1, 550000, 28)
#1 sequence
#550000 steps per sequence
#28 data elements per step
#PS: this sequence is too long, if it creates memory problems to you, maybe it will be a good idea to use a `stateful=True` RNN, but I'm explaining the non stateful method first.
Now you must split this array for inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27] #targets
Shaping the keras layers:
Keras layers will ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000,27).
Since your desired result is a sequence with same length, we will use return_sequences=True. (Else, you'd get only one result).
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells)
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output. (If you're using only recurrent layers, of course)
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train in smaller length sequences*. The system will understand that every new batch is a sequel of the previous batches.
Then you will need to define batch_input_shape=(BatchSize,ReducedTimeSteps,Elements). In this case, the batch size should not be ignored like in the other case.
* Unfortunately I have no experience with stateful=True. I'm not sure about whether you must manually divide your array (less likely, I guess), or if the system automatically divides it internally (more likely).
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps:
X = []
for i in range(550000-49):
X.append(originalX[i:i+50]) #then take care of the 28th element
Y = #it seems you just exclude the first 49 ones from the original