TensorFlow TFRecord storage for large datasets

I'm trying to understand the "proper" method of storage for large datasets for tensorflow ingestion. The documentation seems relatively clear that no matter what, tfrecord files are preferred. Large is a subjective measure, but the examples below are randomly generated regression datasets from sklearn.datasets.make_regression() of 10,000 rows and between 1 and 5,000 features, all float64.
I've experimented with two different methods of writing tfrecord files with dramatically different performance.
For numpy arrays X, y (X.shape=(10000, n_features), y.shape=(10000,)):
tf.train.Example with per-feature tf.train.Features
I construct a tf.train.Example in the way that tensorflow developers seem to prefer, at least judging by tensorflow example code at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/convert_to_records.py.
For each observation or row in X, I create a dictionary keyed with feature names (f_0, f_1, ...) whose values are tf.train.Feature objects with the feature's observation data as a single element of its float_list.
def _feature_dict_from_row(row):
    """
    Take row of length n+1 from 2-D ndarray and convert it to a dictionary of
    float features:
    {
        'f_0': row[0],
        'f_1': row[1],
        ...
        'f_n': row[n]
    }
    """
    def _float64_feature(feature):
        return tf.train.Feature(float_list=tf.train.FloatList(value=[feature]))
    features = {"f_{:d}".format(i): _float64_feature(value) for i, value in enumerate(row)}
    return features
def write_regression_data_to_tfrecord(X, y, filename):
    with tf.python_io.TFRecordWriter('{:s}'.format(filename)) as tfwriter:
        for row_index in range(X.shape[0]):
            features = _feature_dict_from_row(X[row_index])
            # The label has to be wrapped in a tf.train.Feature as well
            features['label'] = tf.train.Feature(float_list=tf.train.FloatList(value=[y[row_index]]))
            example = tf.train.Example(features=tf.train.Features(feature=features))
            tfwriter.write(example.SerializeToString())
tf.train.Example with one large tf.train.Feature containing all features
I construct a dictionary with one feature (really two, counting the label) whose value is a tf.train.Feature with the entire feature row as its float_list.
def write_regression_data_to_tfrecord(X, y, filename, store_by_rows=True):
    with tf.python_io.TFRecordWriter('{:s}'.format(filename)) as tfwriter:
        for row_index in range(X.shape[0]):
            # One feature holds the whole row
            features = {'f_0': tf.train.Feature(float_list=tf.train.FloatList(value=X[row_index]))}
            # The label has to be wrapped in a tf.train.Feature as well
            features['label'] = tf.train.Feature(float_list=tf.train.FloatList(value=[y[row_index]]))
            example = tf.train.Example(features=tf.train.Features(feature=features))
            tfwriter.write(example.SerializeToString())
As the number of features in the dataset grows, the second option gets considerably faster than the first.
[Graph not shown: timings for both methods at 10,000 rows, log scale.]
It makes intuitive sense to me that creating 5,000 tf.train.Feature objects is significantly slower than creating one object with a float_list of 5,000 elements, but it's not clear that this is the "intended" method for feeding large numbers of features into a tensorflow model.
Is there something inherently wrong with doing this the faster way?
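One detail worth checking with either layout: tf.train.FloatList holds 32-bit floats, so float64 inputs are rounded when they are written. The sketch below uses only NumPy to show the size of that rounding. The sample values are made up for illustration, and the cast to float32 is a stand-in for what the 32-bit field can hold, not a call into TensorFlow.

```python
import numpy as np

# Hypothetical float64 row, standing in for one row of X.
row = np.array([0.1, 1.0 / 3.0, 12345.678901234], dtype=np.float64)

# Casting to float32 mimics the precision a 32-bit float field can hold.
row32 = row.astype(np.float32)

# Compare the round-tripped values against the originals.
round_trip = row32.astype(np.float64)
max_rel_error = float(np.max(np.abs(round_trip - row) / np.abs(row)))
print(max_rel_error)
```

If full float64 precision matters, one option is to store the raw bytes of the array in a bytes_list instead of a float_list.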

Creating large ndarray from multiple mem-mapped arrays

I have multiple large images stored in binary (FITS) files on disk. Each array has the same shape and dtype.
I need to read in N of these images but want to keep them memory-mapped, since they would otherwise swamp RAM. The easiest way is, of course, to read them in as elements of a list. Ideally, however, I would like to treat them as a single numpy array (of shape [n, ny, nx]), e.g. for easy transposition.
Is this possible without reading them into RAM?
Note: in practice, what I need is more complicated, equivalent to reading in a list of lists (e.g. an M-element list, each element itself an N-element list, each an ndarray image), but an answer to the simple case above should hopefully be sufficient.
Thanks for any help.
You can either create a complex abstraction that creates an array-like interface to multiple files, or you can consolidate your data. The former is going to be fairly complex, and probably not worth your time.
Consolidating the data, e.g. in a temporary file, is a much simpler option, which I've implemented here with the assumption that you are using astropy for your FITS I/O. You can tailor it for other libraries or other use-cases as you see fit.
from tempfile import TemporaryFile
import numpy as np
from astropy.io import fits
n = 0
with TemporaryFile() as output:
    for filename in my_list_of_files:
        with fits.open(filename) as hdus:
            # If you have a single HDU that you know how to reference, get rid of the loop
            for hdu in hdus:
                if isinstance(hdu, fits.ImageHDU):
                    data = hdu.data.T
                    if n == 0:
                        # The first image fixes the expected shape and dtype
                        shape = data.shape
                        dtype = data.dtype
                    elif data.shape != shape or data.dtype != dtype:
                        # Skip images that do not match the first one
                        continue
                    data.tofile(output)
                    n += 1
Now you have a single binary flatfile with all your data in row-major order, and all the metadata you need to use numpy's memmap:
array = np.memmap(output, dtype, shape=(n,) + shape)
Do all your work inside the outer with block, since output is deleted on close in this implementation.
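The same idea can be tried without FITS files. The sketch below uses small in-memory arrays as stand-ins for the images (their sizes and the file name stack.dat are arbitrary assumptions), writes them back to back into one file, and maps that file as a single array of shape (n, ny, nx).

```python
import os
import tempfile

import numpy as np

# Stand-ins for images read from disk; image i is filled with the value i.
ny, nx = 3, 4
images = [np.full((ny, nx), i, dtype=np.float32) for i in range(5)]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "stack.dat")

    # Write every image back to back in row-major order.
    with open(path, "wb") as f:
        for img in images:
            img.tofile(f)

    # Map the whole file as one (n, ny, nx) array without loading it.
    stack = np.memmap(path, dtype=np.float32, mode="r",
                      shape=(len(images), ny, nx))

    # Transposing only changes the view; no data is read here.
    transposed = stack.transpose(0, 2, 1)
    transposed_shape = transposed.shape

    # Indexing reads just the requested elements from disk.
    first_pixel_values = [float(v) for v in stack[:, 0, 0]]

    # Release the mapping before the temporary directory is removed.
    del stack, transposed
```

As in the answer above, everything that touches the mapped array happens before the temporary storage goes away.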

PyTorch alternative for tf.data.experimental.sample_from_datasets

Suppose I have two datasets, dataset one with 100 items and dataset two with 5000 items.
During training, I want my model to see as many items from dataset one as from dataset two.
In Tensorflow I can do:
dataset = tf.data.experimental.sample_from_datasets(
    [dataset_one, dataset_two], weights=[50,1], seed=None
)
Is there an alternative in PyTorch that does the same?
I think this is not too difficult to implement by creating a custom dataset (not working example)
from torch.utils.data import Dataset
class SampleDataset(Dataset):
    def __init__(self, datasets, weights):
        self.datasets = datasets
        self.weights = weights
    def __len__(self):
        return sum([len(dataset) for dataset in self.datasets])
    def __getitem__(self, idx):
        # sample a random number and based on that sample an item
        return self.datasets[dataset_idx][sample_idx]
However, this seems quite common. Is there already something like this available?
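For reference, the missing sampling step in the outline above could be filled in roughly as follows. This is a minimal sketch, not an existing PyTorch class: it deliberately does not subclass torch.utils.data.Dataset so that it runs without PyTorch installed, the seed argument is an addition, and plain Python lists stand in for the two datasets.

```python
import random


class SampleDataset:
    """Pick a source dataset by weight, then a uniformly random item from it."""

    def __init__(self, datasets, weights, seed=None):
        self.datasets = datasets
        self.weights = weights
        self.rng = random.Random(seed)

    def __len__(self):
        return sum(len(dataset) for dataset in self.datasets)

    def __getitem__(self, idx):
        # idx is ignored: every access draws a fresh random sample.
        dataset_idx = self.rng.choices(
            range(len(self.datasets)), weights=self.weights
        )[0]
        sample_idx = self.rng.randrange(len(self.datasets[dataset_idx]))
        return self.datasets[dataset_idx][sample_idx]


# Stand-in datasets: 100 small values and 5000 large values.
dataset_one = list(range(100))
dataset_two = list(range(1000, 6000))

# Equal weights, so each dataset is picked about half the time.
ds = SampleDataset([dataset_one, dataset_two], weights=[1, 1], seed=0)
draws = [ds[i] for i in range(2000)]
frac_from_one = sum(d < 1000 for d in draws) / len(draws)
print(frac_from_one)
```

Because the index is ignored, shuffling in a data loader would have no effect on such a dataset; the randomness comes entirely from the dataset itself.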
I don't think there is a direct equivalent in PyTorch.
However, there's a class called torch.utils.data.WeightedRandomSampler which samples indices based on a list of weights. You can use this in combination with torch.utils.data.ConcatDataset and torch.utils.data.DataLoader's sampler option.
I'll give an example with two datasets: SetA, which has 500 elements, and SetB, which has only 10.
First, you can create a concatenation of all your datasets with ConcatDataset:
ds = ConcatDataset([SetA(), SetB()])
Then, we need to sample it. The problem is, you can't just give WeightedRandomSampler [50, 1], as you did in Tensorflow. As a workaround, you can create a list of probabilities of the same length as the size of the total dataset.
The corresponding probability list for this example would be:
dist = np.array([1/51]*500 + [50/51]*10)
Essentially, the first 500 indices (those pointing to SetA) each get a weight of 1/51, while the following 10 indices (those in SetB) each get a weight of 50/51. Each SetB element is therefore much more likely to be drawn, which compensates for SetB having fewer elements; this is the desired result. WeightedRandomSampler treats these values as relative weights, so they do not need to sum to 1.
We can create a sampler from that distribution:
sampler = WeightedRandomSampler(dist, 10)
Here 10 is the number of samples to draw. I would use the size of the smallest dataset; otherwise you would likely go over the same data points multiple times during the same epoch.
Finally, we just have to instantiate the dataloader with our dataset and sampler:
dl = DataLoader(ds, sampler=sampler)
To summarize:
import numpy as np
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler
ds = ConcatDataset([SetA(), SetB()])
dist = np.array([1/51]*500 + [50/51]*10)
sampler = WeightedRandomSampler(dist, 10)
dl = DataLoader(ds, sampler=sampler)
Edit, for any number of datasets:
sets = [SetA(), SetB(), SetC()]
ds = ConcatDataset(sets)
dist = np.concatenate([[(len(ds) - len(s))/len(ds)]*len(s) for s in sets])
sampler = WeightedRandomSampler(weights=dist, num_samples=min(len(s) for s in sets))
dl = DataLoader(ds, sampler=sampler)
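For two datasets, the (len(ds) - len(s))/len(ds) weighting gives both the same total weight. With three or more datasets of different sizes it generally does not. Weighting every sample by the inverse of its dataset's size is one way to equalize the totals. The sketch below only computes and checks the weights with NumPy; the helper name balanced_sample_weights and the dataset sizes are made up for illustration, and the resulting array would be passed as the weights argument of WeightedRandomSampler.

```python
import numpy as np


def balanced_sample_weights(sizes):
    """Per-sample weights such that every dataset has the same total weight."""
    return np.concatenate([np.full(n, 1.0 / n) for n in sizes])


# Hypothetical sizes of three concatenated datasets.
sizes = [500, 10, 40]
weights = balanced_sample_weights(sizes)

# Sum the weights belonging to each dataset to confirm they match.
offsets = np.cumsum([0] + sizes)
totals = [float(weights[a:b].sum()) for a, b in zip(offsets[:-1], offsets[1:])]
print(totals)
```

The order of the weights has to follow the order in which the datasets were concatenated.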

Numpy- Deep Learning, Training Examples

Silly question: I am going through the third week of Andrew Ng's newest deep learning course and am stuck on a fairly simple NumPy function (I think?).
The exercise is to find how many training examples, m, we have.
Any idea which NumPy function reports the size of a preloaded training set?
Thanks!
shape_X = X.shape
shape_Y = Y.shape
m = ?
print ('The shape of X is: ' + str(shape_X))
print ('The shape of Y is: ' + str(shape_Y))
print ('I have m = %d training examples!' % (m))
It depends on what kind of storage-approach you use.
Most python-based tools use the [n_samples, n_features] approach where the first dimension is the sample-dimension, the second dimension is the feature-dimension (like in scikit-learn and co.). Alternatively expressed: samples are rows and features are columns.
So:
# feature      1 2 3 4
x = np.array([[1,2,3,4], # first sample
              [2,3,4,5], # second sample
              [3,4,5,6]
             ])
is a training-set of 3 samples with 4 features each.
The sizes M,N (again: interpretation might be different for others) you can get with:
M, N = x.shape
because NumPy's first dimension indexes rows and its second dimension indexes columns, as in matrix algebra.
For the above example, the target array has shape (M,), i.e. one entry per sample.
Anytime you want the total number of elements in an array, you can use
m = X.size
This returns the product of all dimensions, so it equals the number of training examples only when the array holds a single value per example. For an array with one value for each of 400 examples, it would be 400.
If X holds several features per example, X.size overcounts, so it is better to read the number of examples from the shape. When the examples are stored as columns, that is
m = X.shape[1]
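The difference between the two expressions is easy to check on dummy arrays. The shapes below, (2, 400) for X and (1, 400) for Y, are assumptions chosen to represent a layout with one example per column; they are not taken from the question.

```python
import numpy as np

# Assumed layout: one column per training example.
X = np.zeros((2, 400))  # 2 features, 400 examples
Y = np.zeros((1, 400))  # 1 label per example

total_elements = X.size   # product of all dimensions
m = X.shape[1]            # number of columns, i.e. examples

print(total_elements, Y.size, m)
```

With the [n_samples, n_features] layout from the first answer, the same count would come from X.shape[0] instead.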

Why shuffling data gives significantly higher accuracy?

In Tensorflow, I've written a large model for a two-class image classification problem. My question concerns the following code snippet:
X, y, X_val, y_val = prepare_data()
probs = calc_probs(model, session, X)
accuracy = float(np.equal(np.argmax(probs, 1), np.argmax(y, 1)).sum()) / probs.shape[0]
loss = log_loss(y, probs)
X is an np.array of shape (25000,244,244,3). That code results in accuracy=0.5834 (close to random) and loss=2.7106. But when I shuffle the data by adding these three lines after the first line:
sample_idx = random.sample(range(0, X.shape[0]), 25000)
X = X[sample_idx]
y = y[sample_idx]
the results become much better: accuracy=0.9933 and loss=0.0208.
Why would shuffling the data give significantly higher accuracy? What could be the reason?
The function calc_probs is mainly a run call:
probs = session.run(model.probs, feed_dict={model.X: X})
Update:
After hours of debugging, I figured out that evaluating a single image gives a different result each time. For example, if you run the following line of code multiple times, you get a different result on every run:
session.run(model.probs, feed_dict={model.X: [X[20]]})
My data is normally sorted: X contains the class 1 samples first, then class 2. In the calc_probs function, I run the data through in sequential batches, so without shuffling each run sees data from a single class.
I've also noticed that with shuffling, if the batch size is very small, I get random-level accuracy.
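The accuracy formula in the question is an average over samples, so for a model whose output for each image depends only on that image, reordering X and y together cannot change it. The sketch below checks this with a made-up stand-in for calc_probs (a softmax over two input features) and random data. That the real model behaves differently suggests its output depends on which other samples share a batch, which matches the observations in the update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 6 samples, 2 features, one-hot labels for 2 classes.
X = rng.normal(size=(6, 2))
y = np.eye(2)[rng.integers(0, 2, size=6)]


def per_sample_probs(X):
    """Stand-in model: the output for a row depends only on that row."""
    shifted = X - X.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def accuracy(probs, y):
    return float(np.mean(np.argmax(probs, 1) == np.argmax(y, 1)))


# Shuffle X and y with the same permutation so rows stay paired with labels.
perm = rng.permutation(len(X))
acc_sorted = accuracy(per_sample_probs(X), y)
acc_shuffled = accuracy(per_sample_probs(X[perm]), y[perm])
print(acc_sorted, acc_shuffled)
```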
There is some mathematical justification for this in the context of the randomized Kaczmarz algorithm. The regular Kaczmarz algorithm is an old algorithm that can be seen as non-shuffling SGD on a least-squares problem, and randomization yields guaranteed faster convergence rates; see the references in http://www.cs.ubc.ca/~nickhar/W15/Lecture21Notes.pdf

Incorporating very large constants in Tensorflow

For example, the comments for the Tensorflow image captioning example model state:
NOTE: This script will consume around 100GB of disk space because each image
in the MSCOCO dataset is replicated ~5 times (once per caption) in the output.
This is done for two reasons:
1. In order to better shuffle the training data.
2. It makes it easier to perform asynchronous preprocessing of each image in
TensorFlow.
The primary goal of this question is to see if there is an alternative to this type of duplication. In my use case, storing the data in this way would require each image to be duplicated in the TFRecord files many more times, on the order of 20 - 50 times.
I should note first that I have already fed the images through VGGnet to extract 4096 dim features, and I have these stored as a mapping between filename and the vectors.
Before switching over to Tensorflow, I had been feeding batches containing filename strings and then looking up the corresponding vector on a per-batch basis. This allows me to store all of the image data in ~15GB without needing to duplicate the data on disk.
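The per-batch lookup described above can be sketched with a memory-mapped feature file, so that only the rows needed for a batch are read from disk. Everything here is a stand-in: the file name, the image names and the tiny 6 x 4 matrix are hypothetical, where the real matrix would have one 4096-dimensional row per image.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "img_feats.npy")

    # Stand-in feature matrix: 6 images, 4 features each.
    feats = np.arange(6 * 4, dtype=np.float32).reshape(6, 4)
    np.save(path, feats)

    # Map the file instead of loading it; rows are read on demand.
    table = np.load(path, mmap_mode="r")

    # Hypothetical mapping from file name to row index.
    name_to_row = {"img_%d.jpg" % i: i for i in range(6)}

    # A batch may reference the same image more than once (one row per
    # caption) without the image being duplicated on disk.
    batch_names = ["img_4.jpg", "img_1.jpg", "img_4.jpg"]
    rows = [name_to_row[name] for name in batch_names]
    batch = np.array(table[rows])  # copies only the requested rows

    # Release the mapping before the temporary directory is removed.
    del table

print(batch.shape)
```

The resulting batch array could then be fed to the model as an ordinary input, for example through a placeholder, which keeps the large matrix out of the graph.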
My first attempt to do this in Tensorflow involved storing indices in the TFExample buffers and then doing a "preprocessing" step to slice into the corresponding matrix:
img_feat = pd.read_pickle("img_feats.pkl")
img_matrix = np.stack(img_feat)
preloaded_images = tf.Variable(img_matrix)
first_image = tf.slice(preloaded_images, [0,0], [1,4096])
However, in this case, Tensorflow disallows a variable larger than 2GB. So my next thought was to partition this across several variables:
img_tensors = []
for i in range(NUM_SPLITS):
    with tf.Graph().as_default():
        img_tensors.append(tf.Variable(img_matrices[i], name="preloaded_images_%i"%i))
first_image = tf.concat(1, [tf.slice(t, [0,0], [1,4096//NUM_SPLITS]) for t in img_tensors])
In this case, I'm forced to store each partition on a separate graph, because it seems any one graph cannot be this large either. However, now the concat fails because each tensor I am concatenating is on a separate graph.
Any advice on incorporating a large amount (~15GB) of preloaded data into the Tensorflow graph?
Potentially related is this question; however in this case I'd like to override the decoding of the actual JPEG file with the preprocessed value in a tensor op.