How to train on a large dataset with TensorFlow 2.x

I have a large dataset with about 2M rows and 6,000 columns. The input NumPy arrays (X, y) hold the training data without problems, but when I call model.fit() I get a GPU out-of-memory error. I am using TensorFlow 2.2. According to its documentation, model.fit_generator has been deprecated and model.fit is preferred.
Can someone outline the steps for training on large datasets with TensorFlow 2.2?

The best solution is to use tf.data.Dataset, which lets you batch your data easily with the .batch() method.
Plenty of tutorials are available; to work directly with NumPy arrays, you may want to use from_tensor_slices().
The two guides below cover both topics:
https://www.tensorflow.org/tutorials/load_data/numpy
https://www.tensorflow.org/guide/data
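
As a rough sketch of the approach described above (not a definitive recipe): the array shapes, the batch size, and the tiny Keras model are placeholders chosen for illustration, not values from the question. It only shows the shape of the pipeline, from NumPy arrays to a batched tf.data.Dataset passed straight to model.fit.

```python
import numpy as np
import tensorflow as tf

# Small stand-in arrays; the data in the question would be about 2M x 6,000.
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,)).astype("float32")

BATCH_SIZE = 64  # assumed value; tune it to what the GPU can hold

# Slice the arrays row by row, shuffle, and group rows into batches.
dataset = (
    tf.data.Dataset.from_tensor_slices((X, y))
    .shuffle(buffer_size=1000)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# Placeholder model, only here so that fit() has something to train.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# fit() accepts the dataset directly; do not pass batch_size as well.
history = model.fit(dataset, epochs=1, verbose=0)
```

If the arrays are too large to convert into a single tensor, tf.data.Dataset.from_generator may be a better fit, since it yields rows or batches lazily from Python.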

Related

How can I use tqdm to visualize the progress of training steps using the tf.data.Dataset API?

I want to use tqdm to visualize the training steps of my CNN.
How can I use tqdm with the tf.data.Dataset API?
Can you show me some sample code? Thanks!

NOTE: although this question was obviously flagged as bad practice, I still think it is a valid question (I had it myself), so I am posting a solution.
One possibility is to compute the cardinality of your dataset beforehand and pass it to tqdm as the total, as @mr.melon suggests:
cardinality = np.sum([1 for i in dataset.batch(batch_size)])
where dataset is a tf.data.Dataset to which you have not yet applied the full preparation pipeline (I am referring to interleaving, shuffling, batching, prefetching and the like).
Then you can write
for input, label in tqdm(dataset, total=cardinality):
    ...
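
For TF 2.x in eager mode, a variant of this idea might look like the sketch below. It assumes the dataset is built from in-memory arrays, so that tf.data.experimental.cardinality can report the number of batches without a counting pass; the toy arrays and the batch size of 3 are made up for the example.

```python
import numpy as np
import tensorflow as tf
from tqdm import tqdm

# Toy data: 10 samples with 2 features each.
features = np.arange(20, dtype="float32").reshape(10, 2)
labels = np.arange(10, dtype="int64")

dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(3)

# cardinality() returns a negative sentinel when the size is unknown or infinite.
n_batches = int(tf.data.experimental.cardinality(dataset).numpy())
total = n_batches if n_batches >= 0 else None

seen = 0
for batch_x, batch_y in tqdm(dataset, total=total):
    seen += int(batch_x.shape[0])  # a real training step would go here
```

When the cardinality is unknown (for example after filter()), total stays None and tqdm shows a counter without a percentage.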

It's pretty easy:
Derive the number of samples in your dataset.
Then turn that number into some iterable Python structure and wrap it in tqdm:
for _ in tqdm(iterable=xxx, total=num_samples):
    batch_data = sess.run(ele_derived_from_tf_dataset)

Using TensorFlow Datasets and Estimators with More Data than RAM

I've recently switched my modeling framework to use custom Tensorflow Estimators and Datasets, and am quite happy overall with this workflow.
However, I've just noticed an issue with how my dataset_input_fn loads data from TFRecords. My input function is modeled after the example in the TensorFlow documentation. The issue arises when I have more examples than I can fit into RAM. If I have 1e6 examples and set my shuffle buffer_size to 1e5, a subset of 1e5 examples is selected once, shuffled, and then iterated on, meaning my model is only trained on 10% of my overall dataset. The code that sets up this behavior is borrowed exactly from the TensorFlow documentation example code:
dataset = dataset.map(parser)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
My question: is it possible to fill the shuffle buffer with new examples outside of the initial 1e5 as I train? Is this type of functionality supported with a one_shot_iterator? Do I need to use an initializable iterator?
Thanks!

I have found what appears to be a tenable workaround for now. Through some experimentation, I learned that when instantiating a TFRecordDataset,
filenames = ["file1.tfrecord", ..., "filen.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
and setting up a shuffle buffer:
dataset = dataset.shuffle(buffer_size=10000)
the buffer is only populated with the first 10000 examples, from however many TFRecord files that requires. For example, in my case, I have ~300 TFRecord files containing 4096 examples each. On examination, my shuffle buffer appears to consist only of examples from the first 3 TFRecord files in my filenames list. Since my filenames list is static, this means that my model is only trained on my first 3 TFRecord files!
My fix for now is pretty simple. In my training loop I already alternate between Estimator.train and Estimator.evaluate, and I noticed that each time I call Estimator.train, the shuffle buffer is repopulated. My solution, then, is to shuffle my filenames each time my input_fn is called. This is not a very elegant solution, but it does achieve the desired effect of allowing me to iterate across all TFRecord files.
# My crappy fix: shuffle file names in input_fn
np.random.shuffle(filenames)
dataset = tf.data.TFRecordDataset(filenames)
What's annoying about this solution (aside from its kludginess) is that my minibatches are not "globally random". Rather, they are selected from a small subset of TFRecord files, and only that subset is used for each training/evaluation cycle. One way to mitigate this is to increase my shuffle buffer size or decrease my TFRecord file size; I'll probably do both. I also think it's worth noting that if
shuffle_buffer_size < (tf_record_size + minibatch_size)
then, as far as I can tell, my TFRecordDataset will pull from a single tfrecord file!
Finally, I don't think the relevant tensorflow documentation conveys these complexities well. The documentation alludes to the ability to train on large datasets that don't fit into memory, but doesn't provide much detail. It seems unlikely that the tf authors had in mind my hacky strategy when writing this, so I remain curious to see if there's a better approach.
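
To make the buffer behaviour discussed above more concrete, here is a pure-Python model of a streaming shuffle buffer. It follows the description in the tf.data documentation for Dataset.shuffle (fill a buffer, draw a random element, replace it with the next input element). It is an illustration only, not TensorFlow's implementation; the function name, the seed, and the sizes are made up for the example.

```python
import random


def streaming_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # input exhausted: drain what is left
    yield from buffer


# Three "files" of four examples each, read in order, with a buffer
# smaller than one file. Each example is a (file_index, example_index) pair.
stream = [(f, i) for f in range(3) for i in range(4)]
out = list(streaming_shuffle(stream, buffer_size=2))

# Files that the first two outputs were drawn from.
first_files = {f for f, _ in out[:2]}
```

In this model the early outputs can only come from the first file, so batches are mixed locally rather than globally, although each element is still emitted once per pass. Shuffling the file list before reading, as in the fix above, adds mixing at the file level.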

Data augmentation in Tensorflow using Estimator API and TFRecords dataset

I'm using the Estimator API in TensorFlow 1.3 to perform some image classification. Since I have a considerable amount of data, I gave TFRecords a go. I saved the file and can read the examples into a Dataset using a parser function inside the input_fn of the estimator model. So far so good.
The issue arises when I want to do some image augmentation (rotating and shearing in this case).
1) I tried using tf.contrib.keras.preprocessing.image.random_shear and the like. It turns out Keras doesn't like the format of TF's shape ('Dimension'), and I can't cast it to a list because its arguments are the axis indexes, not the actual values.
2) Then I tried using tf.contrib.image.rotate and tf.contrib.image.transform with random values in my chosen range. This time I get the error NotFoundError: Op type not registered 'ImageProjectiveTransform' in binary running on MYPC. Make sure the Op and Kernel are registered in the binary running in this process. This is an open issue (https://github.com/tensorflow/tensorflow/issues/9672). At the moment I can't move away from Windows, so I would be very interested in possible alternatives.
3) I searched for a way to read the TFRecords, convert them to NumPy arrays and do the augmentation with other tools, but I can't find a way to do this from within the input_fn, where I can't access the session.
Thanks!

Have you tried using the function from the answer to this question: "tensorflow: how to rotate an image for data augmentation?"

tf.contrib.learn.LinearRegressor builds unexpectedly bad model for a data with one feature

I am building a simple linear regressor for data from a CSV file. The data includes the weight and height values of some people. The overall learning process is very simple:
MAX_STEPS = 2000
# ...
features = [tf.contrib.layers.real_valued_column(feature_name) for feature_name in FEATURES_COL]
# ...
linear_regressor = tf.contrib.learn.LinearRegressor(feature_columns=features)
linear_regressor.fit(input_fn=prepare_input, max_steps=MAX_STEPS)
However, the model built by the regressor is unexpectedly bad. The result is illustrated in the following picture:
Visualization code (just in case):
plt.plot(height_and_weight_df_filtered[WEIGHT_COL],
         linear_regressor.predict(input_fn=prepare_full_input),
         color='blue',
         linewidth=3)
Here is the same data given to the LinearRegression class from scikit-learn:
lr_updated = linear_model.LinearRegression()
lr_updated.fit(weight_filtered_reshaped, height_filtered)
And the visualization:
Increasing the number of steps has no effect. I assume I'm using the TensorFlow regressor incorrectly.
iPython notebook with the code.

It looks like your TF model does indeed work and will get there with enough steps. You need to jack it right up though - 200K showed significant improvement, almost as good as the sklearn default.
I think there are two issues:
1) sklearn looks like it simply solves the equation using ordinary least squares. TF's LinearRegressor uses the FtrlOptimizer. The paper indicates it is a better choice for very large datasets.
2) The input_fn to the model is injecting the whole training set at once, for every step. This is just a hunch, but I suspect that the FtrlOptimizer may do better if it sees batches at a time.
Instead of just changing the number of steps up a couple orders of magnitude, you can also jack the learning rate up on the optimizer (the default is 0.2) and get similarly good results from only 4k steps:
linear_regressor = tf.contrib.learn.LinearRegressor(
    feature_columns=features,
    optimizer=tf.train.FtrlOptimizer(learning_rate=5.0))
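
The gap between a closed-form solve and an iterative optimizer can be illustrated with a pure-Python sketch. It uses plain full-batch gradient descent as a stand-in (not FTRL, and not the TensorFlow or scikit-learn code) on made-up data; the step counts and learning rates are arbitrary values picked to show the effect.

```python
def ols_fit(xs, ys):
    """Closed-form ordinary least squares for one feature: y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, mean_y - w * mean_x


def gd_fit(xs, ys, steps, lr):
    """Full-batch gradient descent on mean squared error, starting from zero."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(params, xs, ys):
    """Mean squared error of the line described by params = (w, b)."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]  # points on the exact line y = 2x + 1

w_ols, b_ols = ols_fit(xs, ys)                                   # one-shot solve
loss_slow = mse(gd_fit(xs, ys, steps=200, lr=0.001), xs, ys)     # few steps, small rate
loss_many = mse(gd_fit(xs, ys, steps=20000, lr=0.001), xs, ys)   # many more steps
loss_fast = mse(gd_fit(xs, ys, steps=200, lr=0.05), xs, ys)      # same steps, larger rate
```

On this toy data the closed-form fit recovers the line immediately, while the iterative fit gets close only after many more steps or with a larger learning rate, which mirrors the two remedies suggested above.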

I ran into a similar problem. The solution is to check whether your input_fn provides enough epochs. Training may not converge until it has iterated over the whole training set several times.

How should I structure my labels for TensorFlow?

I'm trying to use TensorFlow to train a model that outputs servo commands given an input image.
I plan on using a file as @mrry suggested in this question, with the images listed like so:
../some/path/some_img.JPG *some_label*
My question is, what are the label formats I can provide to TensorFlow and what structures are suggested?
My data is basically n servo commands from 0-10 seconds. A vector would work great:
[0,2,4,3]
or similarly:
[0,.25,.4,.3]
I couldn't find much about labels in the docs. Can anyone shed any light on TensorFlow labels?
And a very related question is what is the best way to structure these for TensorFlow to properly learn from them?

In TensorFlow, labels are just generic tensors. You can use any kind of tensor to store your labels. In your case, a 1-D tensor with shape (4,) seems to be what you want.
Labels differ from the rest of the data only in how they are used in the computational graph. Usually, labels are used only inside the loss function, while the other data is propagated through the whole network. For your problem, a regression with a 4-dimensional output should work.
Also, look at my newest comment on the (old) question. Using the slice_input_producer seems to be preferable in your case.
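
A small pure-Python sketch of the label layout described above: each line of the list file is split into an image path and a fixed-length vector of floats, and a mean-squared-error loss compares a predicted vector against it. The comma-separated label encoding and the helper names are assumptions made for illustration; the question only specifies lines of the form "path label".

```python
def parse_line(line, num_outputs=4):
    """Split 'path v1,v2,...' into (path, list of floats); hypothetical format."""
    path, label_text = line.rsplit(maxsplit=1)
    label = [float(value) for value in label_text.split(",")]
    if len(label) != num_outputs:
        raise ValueError(f"expected {num_outputs} values, got {len(label)}")
    return path, label


def mean_squared_error(predicted, target):
    """Regression loss over the whole label vector."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


path, label = parse_line("../some/path/some_img.JPG 0,.25,.4,.3")

# The labels enter only through the loss; the prediction here is a stand-in
# for the network output on the corresponding image.
loss = mean_squared_error([0.0, 0.25, 0.5, 0.3], label)
```

In TensorFlow the same structure would be a float tensor of shape (4,) per example, fed only to the loss.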
