How to get to the original index from a minibatch? - cntk

Assume I have a minibatch as a result of this code:
test_minibatch = reader_test.next_minibatch(10)
How can I get the indices of this minibatch as a reference into the original data? Assume my test dataset has 100 rows. How can I know which 10 of the 100 original rows are in the minibatch?

You can create a column with unique IDs (usually called a GUID/UUID) and read it in with the reader. This is one way to map samples in a minibatch back to the master set, and it scales well to very large datasets spanning multiple disks and distributed computing frameworks.
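A minimal sketch of that idea, assuming CNTK 2.x and a CTF-format input file; the test.ctf file name, the rowid stream name, and the dimensions are hypothetical:

from cntk.io import MinibatchSource, CTFDeserializer, StreamDefs, StreamDef

input_dim, num_classes = 784, 10  # assumed dimensions, for illustration only

# Each line of test.ctf carries an extra |rowid field holding the sample's original row number.
deserializer = CTFDeserializer('test.ctf', StreamDefs(
    features=StreamDef(field='features', shape=input_dim),
    labels=StreamDef(field='labels', shape=num_classes),
    rowid=StreamDef(field='rowid', shape=1)))
reader_test = MinibatchSource(deserializer, randomize=False)

test_minibatch = reader_test.next_minibatch(10)
# Recover which of the original rows ended up in this minibatch.
original_rows = test_minibatch[reader_test.streams.rowid].asarray().flatten()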

Related

How to handle skewed categorical data for multiclass-classification task?

I want to know how to handle skewed data containing a particular column with multiple categorical values, where some of the values have much higher value_counts() than others.
As you can see in this data, the values greater than 7 have value counts much lower than the others. How should I handle this kind of skewed data? (This is not the target variable; I want to know about a skewed independent variable.)
I tried changing these smaller-count values to a single value (-1). That way the count of -1 became comparable to the other values, but training a classification model on this data will affect the accuracy.
Oversampling techniques for minority classes/categories may not work well in many scenarios. You could read more about them here.
One thing you could do is assign different weights to samples from different classes in your model's loss function, inversely proportional to their frequencies. This ensures that even classes with few data points affect the model's loss as much as classes with a large number of data points.
You could share more details about the dataset or the specific model that you are using, to get more specific suggestions/solutions.
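As a concrete illustration of the weighting idea above, here is a minimal sketch assuming scikit-learn and a Keras-style classifier; the y_train array is hypothetical:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label array with heavily imbalanced categories.
y_train = np.array([0] * 500 + [1] * 300 + [2] * 20 + [3] * 5)

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))  # weight of each class is ~ 1 / frequency
# e.g. with Keras: model.fit(X_train, y_train, class_weight=class_weight)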

test and train good practice wrt summary feature

When one feature of a dataset is a summary statistic of the entire pool of data, is it good practice to include the training data when calculating that feature for the validation set?
For instance, let's say I have 1000 data points split into 800 entries for training and 200 entries for validation. Using the 800 training entries, I create a feature such as rank quartile (it could be anything), numbered 0-3 according to which quartile some other feature falls in. So in the training set there will be 200 data points in each quartile.
Once you train the model and need to calculate the feature again for the validation set, (a) do you use the already-set quartile boundaries, i.e. the 200 validation entries could have a quartile split different from 50-50-50-50, or (b) do you recalculate the quartiles using all 1000 entries, so there is a new quartile-rank feature with 250 entries in each quartile?
Thanks very much
The ideal practice is to calculate the quartiles on the training dataset and then apply those boundaries to your holdout/validation dataset. To generate model diagnostics that correctly evaluate predictive performance, you do not want the distribution of the test dataset to influence your model training, because that data will not be available when you apply the model to unseen data in real life.
You may also find this article useful when thinking about train-test splitting: https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50
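A minimal sketch of option (a), assuming pandas/NumPy; the column name x and the random data are placeholders:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train = pd.DataFrame({'x': rng.normal(size=800)})   # 800 training entries
valid = pd.DataFrame({'x': rng.normal(size=200)})   # 200 validation entries

# Fit the quartile boundaries on the training data only.
boundaries = train['x'].quantile([0.25, 0.5, 0.75]).to_numpy()

# Apply the same boundaries to both splits. The validation counts per quartile
# will generally not be an even 50-50-50-50 split, and that is expected.
train['x_quartile'] = np.digitize(train['x'], boundaries)
valid['x_quartile'] = np.digitize(valid['x'], boundaries)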

Limiting the number of items in a tf.data.Dataset

tl;dr: Can I limit the number of elements in a tf.data.Dataset?
I have a training and evaluation loop which processes the entire given dataset. This is not ideal for testing, since it takes forever to go through the whole dataset. I could test this code by creating a mock dataset or by limiting the number of elements of the dataset so the code only goes through, let's say, the first 10 data points. How can I do the latter?
Thanks
The simplest way to take only a fixed number of elements n from a Dataset is to use Dataset.take(n). For example:
large_dataset = ...
small_dataset = large_dataset.take(10)
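A runnable variant of the same idea, using a small stand-in dataset for illustration:

import tensorflow as tf

# Stand-in for a large dataset; in practice this would be your real input pipeline.
large_dataset = tf.data.Dataset.range(1000)
small_dataset = large_dataset.take(10)

print(list(small_dataset.as_numpy_iterator()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]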

Why embedding_lookup_sparse and string_to_hash_bucket in tensorflow slow with large number of rows of embeddings

In TensorFlow, embedding_lookup_sparse looks up rows of the embedding table according to sp_ids. I think this is similar to random access. However, when the embedding table is large, e.g. 10M rows, inference takes more time than when the table has only about 1M rows. To my understanding, the lookup phase is similar to random access and the hash function takes constant time, both of which are fast and not very sensitive to the table size. Is there anything wrong with my reasoning? Is there any way to optimize so that inference can be faster? Thank you!
Are you sure it is caused by the embedding_lookup? In my case I also have millions of rows to look up. It is very fast if I use the GradientDescent optimizer, but very slow if I use Adam or the others. Probably it is not the embedding_lookup op that slows down your training but other ops that depend on the total number of parameters.
It is true that embedding_lookup works slowly when there are many rows in the table.
You may figure out why by reading its source code. Here is the relevant part of embedding_lookup:
[Images of the source code: the variable "np" is the length of the table, and there is a loop over np.]
As you can see, a loop with time complexity O(table length) appears here. In fact, embedding_lookup uses dynamic_partition to separate the input ids into several partitions, and then uses this loop to gather the embedding vectors for each partition of ids. In my opinion, this keeps the time complexity at O(table length) no matter how big the input data is.
So I think the best way for you to increase training speed is to put more samples in each batch.
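To make the partitioned-lookup mechanism described above concrete, here is an illustrative TensorFlow 2.x sketch (not the library's actual implementation): ids are routed to partitions with dynamic_partition, and a Python loop gathers from each partition, so the number of ops grows with the number of partitions rather than with the batch size.

import tensorflow as tf

num_partitions = 3
tables = [tf.random.normal([5, 4]) for _ in range(num_partitions)]  # 3 shards of a 15x4 table
ids = tf.constant([0, 7, 14])

part = ids % num_partitions                 # "mod" partition strategy
row_in_part = ids // num_partitions         # row offset within the chosen shard
buckets = tf.dynamic_partition(row_in_part, part, num_partitions)
gathered = [tf.gather(tables[p], buckets[p]) for p in range(num_partitions)]  # loop over np
# dynamic_stitch puts the gathered rows back into the original order of ids.
order = tf.dynamic_partition(tf.range(tf.size(ids)), part, num_partitions)
result = tf.dynamic_stitch(order, gathered)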

Are there any guidelines on sharding a data set?

Are there any guidelines on choosing the number of shard files for a data set, or the number of records in each shard?
In the examples using tensorflow.contrib.slim:
there are roughly 1024 records in each shard of the ImageNet data set (tensorflow/models/inception);
there are roughly 600 records in each shard of the flowers data set (tensorflow/models/slim).
Do the number of shard files and the number of records in each shard have any impact on training and on the performance of the trained model?
To my knowledge, if we don't split the data set into multiple shards, shuffling will not be very random, as the capacity of the RandomShuffleQueue may be smaller than the size of the data set.
Are there any other advantages of using multiple shards?
Update
The documentation says
You should have more input files than reading threads, to avoid the risk that two threads will read the same example from the same file near each other.
Why can't we use 50 threads to read from 5 files?
Newer versions of TensorFlow (2.x, e.g. 2.5) have a shard feature for datasets.
Below is sample code from the TensorFlow documentation:
A = tf.data.Dataset.range(10)
B = A.shard(num_shards=3, index=0)
list(B.as_numpy_iterator())  # [0, 3, 6, 9]
When reading a single input file, you can shard elements as follows:
d = tf.data.TFRecordDataset(input_file)
d = d.shard(num_workers, worker_index)
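As a complementary sketch (an illustration under assumptions, not part of the original answer), the shard transformation is often applied at the file level before interleaving; the file pattern and worker settings below are hypothetical:

import tensorflow as tf

num_workers, worker_index = 4, 0  # hypothetical multi-worker setup

# Shard the list of files (rather than individual records), then interleave reads.
filenames = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=False)
per_worker_files = filenames.shard(num_shards=num_workers, index=worker_index)
records = per_worker_files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE)
records = records.shuffle(buffer_size=10_000).batch(32)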