how to load large datasets of numpy arrays in order to train a CNN model in tensorflow2.1.0 - numpy

I'm training a convolutional neural network (CNN) model for a binary classification task in tensorflow2.1.0.
The feature of each instance is a 4-dimensional numpy array with shape of (50, 50, 50, 2), in which the type of each element is float32.
The label of each instance is 1 or 0
My largest training dataset can contain up to ~100 millions of instances.
To efficiently train the model, is it best to serialize my training data and store it in a set of files with TFrecord format, and then load them with tf.data.TFRecordDataset() and parse them with tf.data.map()?
If so, could you show me an example of how to serialize the pairs of feature-label and store them into TFrecord files, then how to load and parse them?
I did not find appropriate example in the website of Tensorflow.
Or is there any better way to store and load the huge datasets? Thanks very much.

There are many ways to efficiently build data pipeline without TFRecord click thislink it was very useful
To extract images from directory efficiently then click this link.
Hope this helped you.

Related

Tensorflow estimator exporter when training data is from tfrecords and inference is from raw data

Background
0) I'm working on an NLP model that I would like to export
1) I have training data in the form of tfRecords
2) I would like to export my model and host it on a flask app, so the data that comes in is raw text
3) I handle all my pre-processing (tokenization and such) as part of my tensorflow graph
Question
1) Given the fact that I do the data loading (tf.Dataset creation, and pre-processing) as part of tensorflow graph, would the raw text that comes in break the process? (specifically in the tf.Dataset creation step)
2) Would it make more sense to just load in raw text instead of tf.Dataset data?
NVM, I completely forgot that you want to have an input_function which you feed datasets through, and a serving_input_function which accepts raw data

How to use TensorFlow to predict large csv files by chunks and glue results together

Now that I've trained a predicting model with TensorFlow, and there's a large test.csv file that's too big to fit into memory, can it be possible to feed it by a smaller chunk at a time and then concat them again within one session?
Using tf.estimator.Estimator for your model and calling the predict method using the numpy_input_fn will give you all the pieces to build what you want.

Neural Network with my own dataset

I have downloaded many face images from web. In order to learn Tensorflow I want to feed those images to a simple fully-connected neural network with a single hidden layer. I have found an example code in here.
Since I am a beginner, I don't know how to train, evaluate, and test the network with the downloaded images. The code owner used a '.mat' file and a .pkl file. I don't understand how he organized training and test set.
In order to run the code with my images;
Do I need to divide my images into training, test, and validation folders and turn each folder into a mat file? How am I going to provide labels for the training?
Besides, I don't understand why he used a '.pkl' file?
All in all, I would like to change this code so that I can find test, training , and validation set classification performance with my image dataset.
It might be an easy question, but it is important for me as it is a starting step. Thanks for your understanding.
First, you don't have to use .mat files nor pickles. Tensorflow expects numpy array.
For instance, let's say you have 70000 images of size 28x28 (=784 dimensions) belonging to 10 classes. Let's also assume that you'd like to train a simple feedforward neural network to classify the images.
The first step would be to split the images between train and test (and validation, but let's put this aside for the sake of simplicity). For the sake of the example, let's imagine that you chose randomly 60000 images for your training set and 10000 for your test set.
The second step would be to ensure that your data has the right format. Here, you'd like your training set to consist in one numpy array of shape (60000, 784) for the images and another one of shape (60000, 10) for the labels (if you use one-hot encoding to represent your classes). As for your test set, you should have an array of shape (10000, 784) for the images and one of shape (10000, 10) for the labels.
Once you have these big numpy arrays, you should define placeholders that will allow you to feed data to you network during training and evaluation.
images = tf.placeholder(tf.float32, shape=[None, 784])
labels = tf.placeholder(tf.int64, shape=[None, 10])
The None here means that you can feed a batch of any size, i.e. as many images as you want, as long as you numpy array is of shape (anything, 784).
The third step consists in defining your model as well as the loss function and the optimizer.
The fourth step consists in training your network by feeding it with random batches of data using the placeholders created above. As your network is training, you can periodically print its performance like the training loss/accuracy as well as the test loss/accuracy.
You can find a complete and very simple example here.

reduce size of pretrained deep learning model for feature generation

I am using an pretrained model in Keras to generate features for a set of images:
model = InceptionV3(weights='imagenet', include_top=False)
train_data = model.predict(data).reshape(data.shape[0],-1)
However, I have a lot of images and the Imagenet model outputs 131072 features (columns) for each image.
With 200k images I would get an array of (200000, 131072) which is too large to fit into memory.
More importantly, I need to save this array to disk and it would take 100 GB of space when saved as .npy or .h5py
I could circumvent the memory problem by feeding only batches of like 1000 images and saving them to disk, but not the disk space problem.
How can I make the model smaller without losing too much information?
update
as the answer suggested I include the next layer in the model as well:
base_model = InceptionV3(weights='imagenet')
model = Model(input=base_model.input, output=base_model.get_layer('avg_pool').output)
this reduced the output to (200000, 2048)
update 2:
another interesting solution may be the bcolz package to reduce size of numpy arrays https://github.com/Blosc/bcolz
I see at least two solutions to your problem:
Apply a model = AveragePooling2D((8, 8), strides=(8, 8))(model) where model is an InceptionV3 object you loaded (without top). This is the next step in InceptionV3 architecture - so one may easily assume - that these features still hold loads of discriminatory clues.
Apply a some kind of dimensionality reduction (e.g. like PCA) on a sample of data and reduce the dimensionality of all data to get the reasonable file size.

TensorFlow: Convolution Neural Network with non-image input

I am interested in using Tensorflow for training my data for binary classification based on CNN.
Now I wonder about how to set the filter value, number of output nodes in the convolution process.
I have read many tutorials and example. However, most of them use image data and I cannot compare it with my data that is customer data, not pixel.
So could you suggest me about this issue?
If you data varies in time or space then you can use CNN,I am currently working with EEG data set which varies in time.Also you can refer to this paper
http://www.nlpr.ia.ac.cn/english/irds/People/lwang/M-MCG_EN/Publications/2015/YD2015ACPR.pdf
were the input data(Which is not an image) is presented as an image to the CNN.
You have to reshape the data to be 4d. In this example, I have only 4 column.
x_train = np.reshape(x_train, (x_train.shape[0],2, 2,1))
x_test = np.reshape(x_test, (x_test.shape[0],2,2, 1))
This is a good example to use none image data
https://github.com/fengjiqiang/LSTM-Wind-Speed-Forecasting
You just need to change the following :
prediction_cols
feature_cols
features
and dataload
This tutorial for text :
Here !
You might use one of following classes:
class Dataset: Represents a potentially large set of elements.
class FixedLengthRecordDataset: A Dataset of fixed-length records
from one or more binary files.
class Iterator: Represents the state of iterating through a Dataset.
class TFRecordDataset: A Dataset comprising records from one or more
TFRecord files.
class TextLineDataset: A Dataset comprising lines from one or more
text files.
Tutorial
official documentation