TensorFlow tf.data.Dataset takes too much time to generate the dataset. Better way to optimize it?

I have .stem.mp4 files, each of which is composed of multiple audio sources.
Each file is 2 to 6 minutes long; the lengths vary a lot.
When I try to make a tf.data.Dataset out of them, it seems to take a lot of time to generate an input batch, much more than my model needs to make a prediction on a given batch.
Let me illustrate with an example.
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras

BATCH_SIZE = 32  # example value; not specified in the original snippet

sample_data = tf.random.normal((5, 755200, 2))  # 5 sources of audio, stereo channels
# The first entry along axis 0 is the mixture of the audio, so this is the input.
# The remaining 4 entries are the individual sources (e.g. bass, drums, vocals), so these are the output.
input_mixture = sample_data[0, :, :]
target_mixtures = sample_data[1:, :, :]
target_mixtures = np.column_stack(target_mixtures)  # shape (755200, 8)

length = 44100 * 11  # I want to split these into windows of 11 seconds
strides = 44100      # 1 second stride

ds_inp = tf.data.Dataset.from_tensor_slices(input_mixture)
ds_inp = ds_inp.window(length, shift=strides, drop_remainder=True)
ds_inp = ds_inp.flat_map(lambda windows: windows.batch(length))
ds_inp = ds_inp.map(lambda windows: windows, num_parallel_calls=tf.data.AUTOTUNE)  # identity map

ds_tar = tf.data.Dataset.from_tensor_slices(target_mixtures)
ds_tar = ds_tar.window(length, shift=strides, drop_remainder=True)
ds_tar = ds_tar.flat_map(lambda windows: windows.batch(length))
ds_tar = ds_tar.map(lambda windows: windows, num_parallel_calls=tf.data.AUTOTUNE)  # identity map

total_ds = tf.data.Dataset.zip((ds_inp, ds_tar))
total_ds = total_ds.batch(BATCH_SIZE)
total_ds = total_ds.prefetch(tf.data.AUTOTUNE)
This is how I made a tf.data.Dataset from the given file.
When I measure how long it takes to produce one input batch and output batch:
%%time
for i, j in total_ds.take(1):
    pass
# Wall time: 18.3 s
My model has about 100 million parameters, but it has a fairly simple structure, so it takes about 6 seconds to produce a predicted batch from a given input batch.
So my question is: is there any way to make it generate the input and output batches faster?
(My assumption is that, since this windows the given arrays, there is no better way to improve it.)
Obviously all of the files are too big to be cached.
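One thing worth trying (a sketch, not a verified fix, reusing the variable names from the snippet above): do the windowing once with tf.signal.frame instead of window() + flat_map(), so all windows are produced by a single vectorized op and the dataset only slices ready-made frames. Note that this materializes every overlapping window up front, so it trades memory for speed.
# Sketch: frame the full tensors up front, then build the dataset from the framed results.
framed_inp = tf.signal.frame(input_mixture, length, strides, axis=0)    # (num_windows, length, 2)
framed_tar = tf.signal.frame(target_mixtures, length, strides, axis=0)  # (num_windows, length, 8)
total_ds = (tf.data.Dataset.from_tensor_slices((framed_inp, framed_tar))
            .batch(BATCH_SIZE)
            .prefetch(tf.data.AUTOTUNE))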

Related

lightgbm memory issue on wide dataset (400 columns)

I am new to LightGBM. I have big data (billions of rows, constantly updated). The dataset prepared for training is also wide, with around 400 columns.
I have 2 questions:
First, my kernel keeps dying after a few thousand iterations, even for a subset as small as 10,000 rows. Memory use keeps rising during training until it fails. I have 126 GB of memory.
I have tried training with different parameters; the commented-out values were tried as well:
parameters = {
    'histogram_pool_size': 5000,
    'objective': 'regression',
    'metric': 'l2',
    'boosting': 'dart',       # 'gbdt'
    'num_leaves': 10,         # 100
    'learning_rate': 0.01,
    'verbose': 0,
    'max_bin': 66,
    'force_col_wise': True,   # default
    'max_bin': 6,             # 60  # default (duplicate key: this value overrides the 66 above)
    'max_depth': 10,          # default
    'min_data_in_leaf': 30,   # default
    'min_child_samples': 20,  # default
    'feature_fraction': 0.5,  # default
    'bagging_fraction': 0.8,  # default
    'bagging_freq': 40,       # default
    'bagging_seed': 11,       # default
    'lambda_l1': 2,           # default
    'lambda_l2': 0.1          # default
}
Limiting the number of columns seems to help, but I know that some columns with a low global feature-importance score can still have significant importance in some local scope.
Second, what is the right way to train LightGBM on big data incrementally and to update the model with new data? I previously worked mainly with neural nets, which are trained incrementally by nature. I know that trees do not work this way, and although it is technically possible to update the model, it will not be the same as a model trained on all the data at once. How should I deal with this?
Full code:
import lightgbm
from sklearn.model_selection import train_test_split

# X is a dataframe, y is the target
cat_names = X.select_dtypes(['bool', 'category', object]).columns.tolist()
for c in cat_names:
    X[c] = X[c].astype('category')
cat_cols = [c for c, col in enumerate(cat_names)]
X[cat_names] = X[cat_names].apply(lambda x: x.cat.codes)
x = X.values
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=42)
train_ds = lightgbm.Dataset(x_train, label=y_train)
valid_ds = lightgbm.Dataset(x_valid, label=y_valid)
model = lightgbm.train(parameters,
                       train_ds,
                       valid_sets=valid_ds,
                       categorical_feature=cat_cols,
                       num_boost_round=2000,
                       early_stopping_rounds=50)
Changing the data types to narrower ones fixed the memory problem! If your dataset is a pandas DataFrame, do something like this:
ds[ds.select_dtypes('float64').columns] = ds.select_dtypes('float64').astype('float32')
ds[ds.select_dtypes('int64').columns] = ds.select_dtypes('int64').astype('int32')
Caution: your data may fall outside the range of the chosen dtype, in which case pandas will silently corrupt it. For example, the int8 dtype only covers -128 to 127, so pick dtypes that can actually hold your data.
You can check a dtype's range with:
import numpy as np
np.iinfo('int32').min, np.iinfo('int32').max
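For the second question, one common pattern is continued training: pass the previous booster to lightgbm.train via init_model so new boosting rounds are added on top of it. A minimal sketch (chunk_iterator() is a hypothetical generator over fresh batches of data; parameters is the dict from above):
import lightgbm

booster = None
for x_chunk, y_chunk in chunk_iterator():  # hypothetical generator over new data
    chunk_ds = lightgbm.Dataset(x_chunk, label=y_chunk)
    booster = lightgbm.train(parameters,
                             chunk_ds,
                             num_boost_round=100,
                             init_model=booster,          # continue from the previous model
                             keep_training_booster=True)  # keep the booster updatable in memory
As the question itself notes, a tree model updated this way is not equivalent to one trained on all the data at once.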

Speed up generation of USE(universal sentence encoder) embeddings

I am working on a semantic similarity problem using the Universal Sentence Encoder. The dataset contains abstracts of scholarly articles; the mean length is around 1,500. There are ~300k records in the data, and it would take quite a long time to generate USE embeddings for all of them, so I am looking for ways to optimize this. Currently, generating embeddings for 10k rows of data took ~15 minutes.
import numpy as np
import tensorflow_hub as hub
from tqdm import tqdm

use_module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(use_module_url)
print("module %s loaded" % use_module_url)

def embed(input):
    return model(input)

def get_features(texts):
    if type(texts) is str:
        texts = [texts]
    return embed(texts)

def data_iterator(data):
    # split the data into chunks of 1000 rows
    chunk_list = []
    for x in tqdm(range(0, len(data), 1000)):
        if x + 1000 > len(data):
            chunk_list.append(data[x:len(data)])
        else:
            chunk_list.append(data[x:x + 1000])
    return chunk_list

data = df['text'][:10000].values
data_processed = list(map(process_text, data))  # process_text is defined elsewhere
Here, I want to speed up the generation of USE embeddings for my data. I am experimenting in a Kaggle kernel and have turned on the GPU. The GPU utilization doesn't go beyond 2-3%, while CPU utilization is ~120%.
%%time
BASE_VECTORS = []
chunk_list = data_iterator(data_processed)
for i in tqdm(chunk_list):
    BASE_VECTORS_tmp = get_features(i)
    BASE_VECTORS.extend(BASE_VECTORS_tmp)
BASE_VECTORS = np.asarray(BASE_VECTORS)
Time taken
CPU times: user 16min 48s, sys: 2min 59s, total: 19min 47s
Wall time: 15min 13s
Probably you do not have the GPU build of TensorFlow installed, or there is a CUDA/cuDNN version mismatch. Normally USE uses the GPU heavily.
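A quick way to confirm this (a sketch, assuming TF 2.x as in the question):
import tensorflow as tf

# An empty list here means TensorFlow cannot see the GPU and is silently
# running on the CPU, which would explain the 2-3% GPU utilization above.
print(tf.config.list_physical_devices('GPU'))
print(tf.test.is_built_with_cuda())  # False means a CPU-only TensorFlow build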

Memory leak when running universal-sentence-encoder-large itterating on dataframe

I have 140K sentences I want to get embeddings for. I am using the TF Hub Universal Sentence Encoder and am iterating over the sentences (I know it's not the best way, but when I try to feed more than 500 sentences into the model at once it crashes).
My environment is:
Ubuntu 18.04
Python 3.7.4
TF 1.14
RAM: 16 GB
Processor: i5
My code is:
Version 1
I iterate inside the tf.Session context manager:
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')

with tf.compat.v1.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    sentence_embedding = None
    for i, row in df.iterrows():
        sentence = row['content']
        embeddings = embed([sentence])
        sentence_embedding = session.run(embeddings)
        df.at[i, 'embedding'] = sentence_embedding
        print('processed index:', i)
Version 2
I open and close a session within each iteration:
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')

for i, row in df.iterrows():
    sentence = row['content']
    embeddings = embed([sentence])
    sentence_embedding = None
    with tf.compat.v1.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
        sentence_embedding = session.run(embeddings)
    df.at[i, 'embedding'] = sentence_embedding
    print('processed index:', i)
While version 2 does seem to trigger some sort of garbage collection and memory is cleared a bit, it still gets through about 50 items and then explodes.
Version 1 just goes on gobbling memory.
The correct solution, as given by arnoegw:
def calculate_embeddings(dataframe, table_name):
    sql_get_sentences = "SELECT * FROM semantic_similarity.sentences WHERE embedding IS NULL LIMIT 1500"
    sql_update = 'UPDATE {} SET embedding = data.embedding FROM (VALUES %s) AS data(id, embedding) WHERE {}.id = data.id'.format(table_name, table_name)
    df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
    with hub.eval_function_for_module("https://tfhub.dev/google/universal-sentence-encoder-large/3") as embed:
        while len(df) > 0:  # stop once there are no un-embedded sentences left
            sentence_array = df['content'].values
            sentence_embeddings = embed(sentence_array)
            df['embedding'] = sentence_embeddings.tolist()
            values = [tuple(x) for x in df[['id', 'embedding']].values]
            pandas_repository.update_db_from_df('semantic_similarity.sentences', sql_update, values)
            df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
I am a newbie to TF and can use any help I can get.
Your code uses tf.Session, so it falls under the TF1.x programming model of first building a dataflow graph and then running it repeatedly with inputs being fed and outputs being fetched from the graph.
But your code does not align well with that programming model. Both versions keep adding new applications of (calls to) the hub.Module to the default TensorFlow graph instead of applying it once and running the same graph repeatedly for the various inputs. Version 2 keeps going into and out of tf.Sessions, which frees some memory but is very inefficient.
Please see my answer to "Strongly increasing memory consumption when using ELMo from Tensorflow-Hub" for guidance on how to do it right in the graph-based programming model of TensorFlow 1.x.
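A minimal sketch of that graph-mode pattern, assuming TF 1.x as in the question (batches_of_sentences is a hypothetical iterable of lists of strings):
import tensorflow as tf
import tensorflow_hub as hub

# Build the graph once: apply the module to a placeholder a single time.
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
sentences_ph = tf.placeholder(dtype=tf.string, shape=[None])
embeddings_op = embed(sentences_ph)

# Reuse one session and feed different batches through the same graph.
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    for batch in batches_of_sentences:  # hypothetical iterable of string lists
        vectors = session.run(embeddings_op, feed_dict={sentences_ph: batch})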
TensorFlow 2.0, which is going to be released soon, defaults to the programming model of "eager execution", which does away with graphs and sessions and would have avoided this confusion. TensorFlow Hub will be updated in due course for TF2.0. For a preview close to your use-case, see https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_text_classification.ipynb
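For comparison, the eager-style usage that TF 2.0 enables looks roughly like this (a sketch; assumes a TF2-compatible version of the model such as universal-sentence-encoder-large/5):
import tensorflow as tf        # TF 2.x
import tensorflow_hub as hub

# No graphs or sessions: the loaded model is called like a function.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")
vectors = embed(["first sentence", "second sentence"])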

tf.decode_csv lasts too long

I am using TensorFlow v0.8, and it's strange that it takes around 5 minutes to reach the second print time.time(). I thought tf.decode_csv() would simply add an operation to the graph without doing any computation.
Why does it take so long to call tf.decode_csv()?
import time
import tensorflow as tf

def main(argv=None):
    # deal with arguments
    # train_set_filename, image_size and channels are defined elsewhere
    with tf.device("/cpu:0"):
        filename_queue = tf.train.string_input_producer(tf.train.match_filenames_once(train_set_filename + "*"))
        reader = tf.TextLineReader()
        _, line = reader.read(filename_queue)
        default = [[-1.0] for x in range(image_size * image_size * channels + 1)]
        print time.time()
        line = tf.decode_csv(line, record_defaults=default)
        print time.time()
        label = line[0]
        feature = tf.pack(list(line[1:]))
        ...
The tf.decode_csv(line, record_defaults=default) call takes a lot of time because you use so many columns.
I don't know your image_size, but if it is around 200 you are trying to define 120,001 columns in your CSV, which is huge. You are right that TensorFlow is not doing any computation here, but it has to build the graph properly, and with that many columns it takes a lot of time!
I strongly advise you not to use the CSV format for images. Instead, store your images in JPEG format and use tf.image.decode_jpeg().
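A rough sketch of that alternative, reusing the queue-based pipeline style from the question (train_set_filename and channels are assumed to be defined as above):
# Read whole JPEG files and decode them, instead of parsing one huge CSV row per image.
filename_queue = tf.train.string_input_producer(
    tf.train.match_filenames_once(train_set_filename + "*.jpg"))
reader = tf.WholeFileReader()
_, file_contents = reader.read(filename_queue)
image = tf.image.decode_jpeg(file_contents, channels=channels)
image = tf.cast(image, tf.float32)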

TensorFlow: How to apply the same image distortion to multiple images

Starting from the TensorFlow CNN example, I'm trying to modify the model to take multiple images as input (so that the input has not just 3 channels, but a multiple of 3, by stacking images).
To augment the input, I try to use the random image operations provided in TensorFlow, such as flipping, contrast and brightness.
My current solution for applying the same random distortion to all input images is to use a fixed seed value for these operations:
def distort_image(image):
    flipped_image = tf.image.random_flip_left_right(image, seed=42)
    contrast_image = tf.image.random_contrast(flipped_image, lower=0.2, upper=1.8, seed=43)
    brightness_image = tf.image.random_brightness(contrast_image, max_delta=0.2, seed=44)
    return brightness_image
This method is called multiple times for each image at graph construction time, so I assumed that each image would use the same random number sequence and consequently have the same image operations applied across my input sequence.
# ...
# distort images
distorted_prediction = distort_image(seq_record.prediction)
distorted_input = []
for i in xrange(INPUT_SEQ_LENGTH):
    distorted_input.append(distort_image(seq_record.input[i, :, :, :]))
stacked_distorted_input = tf.concat(2, distorted_input)

# Ensure that the random shuffling has good mixing properties.
min_queue_examples = int(num_examples_per_epoch *
                         MIN_FRACTION_EXAMPLES_IN_QUEUE)

# Generate a batch of sequences and predictions by building up a queue of examples.
return generate_sequence_batch(stacked_distorted_input, distorted_prediction,
                               min_queue_examples, batch_size, shuffle=True)
In theory, this works fine, and after doing some test runs it really seemed to solve my problem. But after a while I found out that I have a race condition, because I use the input pipeline of the CNN example code with multiple threads (which is the method suggested in TensorFlow to improve performance and reduce memory consumption at runtime):
def generate_sequence_batch(sequence_in, prediction, min_queue_examples,
                            batch_size):
    num_preprocess_threads = 8  # <-- !!!
    sequence_batch, prediction_batch = tf.train.shuffle_batch(
        [sequence_in, prediction],
        batch_size=batch_size,
        num_threads=num_preprocess_threads,
        capacity=min_queue_examples + 3 * batch_size,
        min_after_dequeue=min_queue_examples)
    return sequence_batch, prediction_batch
Because multiple threads create my examples, it is no longer guaranteed that the image operations are performed in the right order (in the sense of the right order of random operations).
Here I got completely stuck. Does anyone know how to solve this problem of applying the same image distortion to multiple images?
Some thoughts of mine:
I thought about adding some synchronization around these image distortion methods, but I could not find anything provided by TensorFlow.
I tried to generate a random number for, e.g., the random brightness delta using tf.random_uniform() myself and use this value for tf.image.adjust_contrast(). But the result of the TensorFlow random generator is always a tensor, and I have not found a way to use this tensor as a parameter for tf.image.adjust_contrast(), which expects a plain float32 for its contrast_factor parameter.
A solution that would (partly) work would be to combine all images into one huge image using tf.concat(), apply the random operations to change contrast and brightness, and split the image afterwards. But this would not work for random flipping, because flipping would (at least in my case) change the order of the images, and there is no way to detect whether tf.image.random_flip_left_right() has performed a flip or not, which would be required to undo the wrong order of images if necessary.
Here is what I came up with by looking at the code of random_flip_up_down and random_flip_left_right within TensorFlow:
def image_distortions(image, distortions):
    distort_left_right_random = distortions[0]
    mirror = tf.less(tf.pack([1.0, distort_left_right_random, 1.0]), 0.5)
    image = tf.reverse(image, mirror)
    distort_up_down_random = distortions[1]
    mirror = tf.less(tf.pack([distort_up_down_random, 1.0, 1.0]), 0.5)
    image = tf.reverse(image, mirror)
    return image

distortions = tf.random_uniform([2], 0, 1.0, dtype=tf.float32)
image = image_distortions(image, distortions)
label = image_distortions(label, distortions)
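Applied to the question's setting, this becomes a sketch along these lines (seq_record and INPUT_SEQ_LENGTH come from the question's code above): draw the random numbers once, then reuse them for the prediction and for every image in the input sequence so they all get the same flips.
distortions = tf.random_uniform([2], 0, 1.0, dtype=tf.float32)
distorted_prediction = image_distortions(seq_record.prediction, distortions)
distorted_input = [image_distortions(seq_record.input[i, :, :, :], distortions)
                   for i in xrange(INPUT_SEQ_LENGTH)]
stacked_distorted_input = tf.concat(2, distorted_input)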
I would do something like this using tf.case. It allows you to specify what to return if a certain condition holds: https://www.tensorflow.org/api_docs/python/tf/case
import tensorflow as tf

def distort(image, x):
    # flip vertically, horizontally, both, or do nothing
    image = tf.case({
        tf.equal(x, 0): lambda: tf.reverse(image, [0]),
        tf.equal(x, 1): lambda: tf.reverse(image, [1]),
        tf.equal(x, 2): lambda: tf.reverse(image, [0, 1]),
    }, default=lambda: image, exclusive=True)
    return image

def random_distortion(image):
    x = tf.random_uniform([1], 0, 4, dtype=tf.int32)
    return distort(image, x[0])
To check that it works:
import numpy as np
import matplotlib.pyplot as plt

# create image
image = np.zeros((25, 25))
image[:10, 5:10] = 1.

# create subplots
fig, axes = plt.subplots(2, 2)
for i in axes.flatten():
    i.axis('off')

with tf.Session() as sess:
    for i in range(4):
        distorted_img = sess.run(distort(image, i))
        axes[i % 2][i // 2].imshow(distorted_img, cmap='gray')
plt.show()