Can I train a CNN-architecture-based deep-learning model with raw audio data at a high sampling frequency? - tensorflow

The shape of my raw data array is (1299981, 2000).
I am getting the error "Allocation of 415952000 exceeds 10% of free system memory."
My question is how I should train my model on the raw data while keeping the input shape feasible (should I decrease the sampling frequency?).
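A minimal sketch of one way to avoid loading the whole array at once, assuming the (1299981, 2000) float32 data can live on disk and be streamed in mini-batches; the file names, labels, batch size, and model below are my assumptions, not part of the question:
import numpy as np

# Assumption: the raw (1299981, 2000) float32 array is stored on disk.
raw_audio = np.memmap("audio.dat", dtype=np.float32, mode="r",
                      shape=(1299981, 2000))
labels = np.load("labels.npy")               # assumption: matching label array

def batch_generator(batch_size=32):
    # Yield one mini-batch at a time so only a small slice is ever in memory.
    while True:
        idx = np.random.randint(0, raw_audio.shape[0], size=batch_size)
        # Add a channel axis so a Conv1D-based model sees shape (batch, 2000, 1).
        yield raw_audio[idx][..., np.newaxis], labels[idx]

# `model` is the compiled CNN in question; fit() consumes the generator batch by batch.
model.fit(batch_generator(), steps_per_epoch=raw_audio.shape[0] // 32, epochs=10)
Lowering the sampling frequency would mainly shrink the per-example length (the 2000 axis); streaming batches like this is aimed at the memory warning itself.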

Related

What if I predict on data that is in the training dataset?

I'm developing a recommender system using NCF in a somewhat modified way.
My situation is that the data I predict on occasionally includes data that was used in training.
For example, my training set has 100000 rows, and through negative sampling some unobserved data points are added to the training set.
I want to predict all the unobserved items with the trained model, so some of the negative-sampled data points appear in both the training data and the prediction data.
Will this cause any problem?
Should I remove the negative-sampled unobserved data from the prediction data?
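For reference, if you do decide to drop the negative-sampled training pairs from the prediction set, a plain set difference over (user, item) pairs is enough. A minimal sketch, assuming pandas DataFrames with hypothetical user_id/item_id columns (train_df includes the sampled negatives, candidates_df is everything you want to score):
import pandas as pd

train_pairs = set(zip(train_df["user_id"], train_df["item_id"]))
mask = [pair not in train_pairs
        for pair in zip(candidates_df["user_id"], candidates_df["item_id"])]
predict_df = candidates_df[mask]     # only pairs never seen during training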

Large trainable embedding layer slows down training

I am training a network to classify text with an LSTM. I use a randomly initialized and trainable embedding layer for the word inputs. The network is trained with the Adam optimizer, and the words are fed into the network with a one-hot encoding.
I noticed that the number of words represented in the embedding layer heavily influences the training time, but I don't understand why. Increasing the number of words in the network from 200'000 to 2'000'000 almost doubled the time for a training epoch.
Shouldn't training only update the weights that were used during the prediction of the current data point? Thus, if my input sequences always have the same length, the same number of updates should happen regardless of the size of the embedding layer.
The number of updates needed would be reflected in the number of epochs it takes to reach a certain precision.
If your observation is that convergence takes the same number of epochs, but each epoch takes twice as much wall-clock time, then it's an indication that simply performing the embedding lookup (and writing the update of the embedding table) now takes a significant part of your training time.
Which could easily be the case: 2'000'000 words times 4 bytes per float32 times the length of your embedding vector (what is it? let's assume 200) is something like 1.6 gigabytes of data that needs to be touched every minibatch. You're also not saying how you're training this (CPU, GPU, which GPU), which has a meaningful impact on how this turns out because of e.g. cache effects; on a CPU, doing the exact same number of reads/writes in a slightly less cache-friendly manner (more sparsity) can easily double the execution time.
Also, your premise is a bit unusual. How much labeled data do you have with enough examples of the 2'000'000th most frequent word to calculate a meaningful embedding directly? It's probably possible, but it would be unusual: in pretty much all datasets, including very large ones, the 2'000'000th word would be a nonce, and it would thus be harmful to include it in trainable embeddings. The usual scenario would be to calculate large embeddings separately from large unlabeled data and use them as a fixed untrainable layer, possibly concatenated with small trainable embeddings learned from the labeled data to capture things like domain-specific terminology.
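A quick check of that back-of-the-envelope figure (the 200-dimensional embedding is the answer's own assumption):
vocab_size, embedding_dim, bytes_per_float32 = 2000000, 200, 4
table_bytes = vocab_size * embedding_dim * bytes_per_float32
print(table_bytes / 1e9)   # 1.6 -- gigabytes in the embedding table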
If I understand correctly, your network takes one-hot vectors representing words to embeddings of some size embedding_size. Then the embeddings are fed as input to an LSTM. The trainable variables of the network are both those of the embedding layer and the LSTM itself.
You are correct regarding the update of the weights in the embedding layer. However, the number of weights in one LSTM cell depends on the size of the embedding. If you look, for example, at the equation for the forget gate of the t-th cell (in the usual notation),
f_t = σ(W_f · x_t + U_f · h_{t-1} + b_f),
you can see that the weight matrix W_f is multiplied by the input x_t, meaning that one of the dimensions of W_f must be exactly embedding_size. So as embedding_size grows, so does the network size, and it takes longer to train.
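A back-of-the-envelope sketch of that point (the 128 hidden units are my assumption): the standard count of trainable weights in one LSTM layer is 4 * ((embedding_size + units) * units + units), which grows linearly with the embedding size:
def lstm_param_count(embedding_size, units=128):
    # Four gates, each with an input matrix, a recurrent matrix and a bias.
    return 4 * ((embedding_size + units) * units + units)

print(lstm_param_count(100))   # 117248
print(lstm_param_count(300))   # 219648 -- larger embedding, larger LSTM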

reduce size of pretrained deep learning model for feature generation

I am using a pretrained model in Keras to generate features for a set of images:
model = InceptionV3(weights='imagenet', include_top=False)
train_data = model.predict(data).reshape(data.shape[0],-1)
However, I have a lot of images, and the ImageNet model outputs 131072 features (columns) for each image.
With 200k images I would get an array of shape (200000, 131072), which is too large to fit into memory.
More importantly, I need to save this array to disk, and it would take about 100 GB of space when saved as .npy or .h5.
I could circumvent the memory problem by feeding only batches of like 1000 images and saving them to disk, but not the disk space problem.
How can I make the model smaller without losing too much information?
update
As the answer suggested, I included the next layer of the model as well:
base_model = InceptionV3(weights='imagenet')
model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)
This reduced the output to (200000, 2048).
update 2:
Another interesting solution may be the bcolz package, which reduces the size of numpy arrays: https://github.com/Blosc/bcolz
I see at least two solutions to your problem:
Apply model = AveragePooling2D((8, 8), strides=(8, 8))(model), where model is the output of the InceptionV3 you loaded (without the top). This is the next step in the InceptionV3 architecture, so one may reasonably assume that these features still hold plenty of discriminative clues.
Apply some kind of dimensionality reduction (e.g. PCA), fitted on a sample of the data, and reduce the dimensionality of all the data to get a reasonable file size (see the sketch below).
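A minimal sketch of the second suggestion, assuming the features have already been extracted in the 1000-image batches mentioned in the question; scikit-learn's IncrementalPCA is my choice here because it can be fitted batch by batch, and the 256 components are an arbitrary assumption:
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=256)
for feature_batch in feature_batches:      # each batch: a (1000, 131072) array saved earlier
    ipca.partial_fit(feature_batch)
reduced = [ipca.transform(b) for b in feature_batches]   # each becomes (1000, 256)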

Vector representation in multidimensional time-series prediction in Tensorflow

I have a large data set (~30 million data points with 5 features) that I have reduced using K-means down to 200,000 clusters. The data is a time series with ~150,000 time-steps. The data on which I would like to train the model is the presence of particular clusters at each time-step. The purpose of the predictive model is to generate a generalized sequence, similar to generating syntactically correct sentences from a model trained on word sequences. The easiest way to think about this data is that I'm trying to predict the pixels in the next video frame from the pixels in the current video frame in order to generate a new sequence of frames that approximates the original sequence.
The raw and sparse representation at each time-step would be 200,000 binary values representing which clusters are present or not at that time step. Note, no more than 200 clusters may be present in any one time-step and thus this representation is extremely sparse.
What is the best representation to convert this sparse vector to a dense vector that would be more suitable to time-series prediction using Tensorflow?
I initially had in mind an RNN/LSTM trained on the vectors at each time-step, but due to the size of the training vector I'm now wondering whether a convolutional approach would be more suitable.
Note, I have not actually used TensorFlow beyond some simple tutorials, but I have previously used OpenCV ML functions. Please consider me a novice in your responses.
Thank you.
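Not an authoritative answer, but a minimal sketch of one common dense representation: a trainable embedding per cluster, averaged over whichever clusters are active at a time-step (the sizes below are assumptions):
import tensorflow as tf

num_clusters = 200000        # assumption: number of K-means clusters
embedding_dim = 128          # assumption: size of the dense vector

embeddings = tf.get_variable("cluster_embeddings",
                             [num_clusters, embedding_dim])

# Sparse IDs of the (at most ~200) clusters present at each time-step in the batch.
active_clusters = tf.sparse_placeholder(tf.int64)

dense_input = tf.nn.embedding_lookup_sparse(embeddings,
                                            sp_ids=active_clusters,
                                            sp_weights=None,
                                            combiner="mean")
# dense_input has shape (batch_size, embedding_dim) and can be fed to an RNN/LSTM.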

Regarding the use of tf.train.shuffle_batch() to create batches

The TensorFlow tutorial gives the following example regarding tf.train.shuffle_batch():
# Creates batches of 32 images and 32 labels.
image_batch, label_batch = tf.train.shuffle_batch(
    [single_image, single_label],
    batch_size=32,
    num_threads=4,
    capacity=50000,
    min_after_dequeue=10000)
I am not very clear about the meaning of capacity and min_after_dequeue. In this example, they are set to 50000 and 10000 respectively. What is the logic behind this kind of setup, and what do these arguments mean? If the input has 200 images and 200 labels, what will happen?
The tf.train.shuffle_batch() function uses a tf.RandomShuffleQueue internally to accumulate batches of batch_size elements, which are sampled uniformly at random from the elements currently in the queue.
Many training algorithms, such as the stochastic gradient descent–based algorithms that TensorFlow uses to optimize neural networks, rely on sampling records uniformly at random from the entire training set. However, it is not always practical to load the entire training set in memory (in order to sample from it), so tf.train.shuffle_batch() offers a compromise: it fills an internal buffer with between min_after_dequeue and capacity elements, and samples uniformly at random from that buffer. For many training processes, this improves the accuracy of the model and provides adequate randomization.
The min_after_dequeue and capacity arguments have an indirect effect on training performance. Setting a large min_after_dequeue value will delay the start of training, because TensorFlow has to process at least that many elements before training can start. The capacity is an upper bound on the amount of memory that the input pipeline will consume: setting this too large may cause the training process to run out of memory (and possibly start swapping, which will impair the training throughput).
If the dataset has only 200 images, it would easily be possible to load the entire dataset into memory. tf.train.shuffle_batch() would be quite inefficient, because it enqueues each image and label multiple times in the tf.RandomShuffleQueue. In this case, you may find it more efficient to do the following instead, using tf.train.slice_input_producer() and tf.train.batch():
random_image, random_label = tf.train.slice_input_producer(
    [all_images, all_labels], shuffle=True)
image_batch, label_batch = tf.train.batch(
    [random_image, random_label], batch_size=32)