Speed up generation of USE (Universal Sentence Encoder) embeddings - TensorFlow

I am working on a semantic similarity problem using the Universal Sentence Encoder. The dataset contains abstracts of scholarly articles with a mean length of around 1,500. There are ~300k records, so generating USE embeddings for all of them will take quite a long time, and I am looking for ways to optimize this. Currently, generating embeddings for 10k rows of data takes ~15 minutes.
import numpy as np
import tensorflow_hub as hub
from tqdm import tqdm

use_module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(use_module_url)
print("module %s loaded" % use_module_url)

def embed(input):
    return model(input)

def get_features(texts):
    if type(texts) is str:
        texts = [texts]
    return embed(texts)

def data_iterator(data):
    # Split the data into chunks of 1000 records each
    chunk_list = []
    for x in tqdm(range(0, len(data), 1000)):
        if x + 1000 > len(data):
            chunk_list.append(data[x:len(data)])
        else:
            chunk_list.append(data[x:x + 1000])
    return chunk_list

data = df['text'][:10000].values  # df is the source DataFrame
data_processed = list(map(process_text, data))  # process_text is a preprocessing helper defined elsewhere
Here, I want to speed up the generation of USE embeddings for my data. I am experimenting in a Kaggle kernel with the GPU turned on, but GPU utilization doesn't go beyond 2-3% while CPU utilization is ~120%.
%%time
BASE_VECTORS = []
chunk_list = data_iterator(data_processed)
for i in tqdm(chunk_list):
    BASE_VECTORS_tmp = get_features(i)
    BASE_VECTORS.extend(BASE_VECTORS_tmp)
BASE_VECTORS = np.asarray(BASE_VECTORS)
Time taken
CPU times: user 16min 48s, sys: 2min 59s, total: 19min 47s
Wall time: 15min 13s

You probably do not have the GPU build of TensorFlow installed, or there is a cuDNN version mismatch. Normally, USE makes heavy use of the GPU.
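As a quick way to confirm that diagnosis (a hedged sketch for TF 2.x, not part of the original answer): check whether TensorFlow sees a GPU at all, and embed in moderately sized batches so each call gives the GPU enough work. The batch size below is only illustrative.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# If this prints an empty list, TensorFlow is running CPU-only and the
# CUDA/cuDNN setup (or the installed tensorflow build) needs fixing first.
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))

model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def embed_in_batches(texts, batch_size=256):
    # One model call per batch keeps the GPU busy instead of many tiny calls.
    chunks = [model(texts[i:i + batch_size]).numpy()
              for i in range(0, len(texts), batch_size)]
    return np.vstack(chunks)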

Related

Tensorflow tf.data.Dataset takes too much time to generate dataset. Better way to optimize it?

I have .stem.mp4 files, each of which is composed of multiple audio sources.
Each file is 2 to 6 minutes long, so the length varies a lot.
When I try to make a tf.data.Dataset out of them, it seems to take far longer to generate an input_batch than it takes my model to make a prediction on that batch.
Let me illustrate with an example.
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras

sample_data = tf.random.normal((5, 755200, 2))  # 5 sources of audio, stereo channel
# The first slice along the first axis is the mixture of the audio, so this is the input.
# The remaining 4 slices are the individual sources (e.g. bass, drums, vocals, etc.), so these are the outputs.
input_mixture = sample_data[0, :, :]
target_mixtures = sample_data[1:, :, :]
target_mixtures = np.column_stack(target_mixtures)

length = 44100 * 11  # I want to split these into segments of 11 seconds
strides = 44100      # 1 second stride

ds_inp = tf.data.Dataset.from_tensor_slices((input_mixture))
ds_inp = ds_inp.window(length, shift=strides, drop_remainder=True)
ds_inp = ds_inp.flat_map(lambda windows: windows.batch(length))
ds_inp = ds_inp.map(lambda windows: windows, num_parallel_calls=tf.data.AUTOTUNE)

ds_tar = tf.data.Dataset.from_tensor_slices((target_mixtures))
ds_tar = ds_tar.window(length, shift=strides, drop_remainder=True)
ds_tar = ds_tar.flat_map(lambda windows: windows.batch(length))
ds_tar = ds_tar.map(lambda windows: windows, num_parallel_calls=tf.data.AUTOTUNE)

ds_total = [ds_inp, ds_tar]
total_ds = tf.data.Dataset.zip(tuple(ds_total))
total_ds = total_ds.batch(BATCH_SIZE)  # BATCH_SIZE is defined elsewhere
total_ds = total_ds.prefetch(tf.data.AUTOTUNE)
This is how I made a tf.data.Dataset from the given file.
When I measure how long it takes to produce an input_batch and output_batch:
%%time
for i, j in total_ds.take(1):
    pass
# Wall time: 18.3 s
My model has about 100 million variables, but it has a fairly simple structure, so it takes only about 6 seconds to generate a predicted_batch from a given input_batch.
So my problem is: is there any way to generate the input_batch and output_batch faster?
(My assumption is that, since this 'windows' the given arrays, there is no better way to improve it.)
Obviously, all of the files are too big to be cached.
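No answer to this question is reproduced above, but one commonly suggested alternative (a hedged sketch under the question's own shapes, not code from the thread) is to replace the window → flat_map pattern with tf.signal.frame, which slices all overlapping segments in a single vectorized op. This materializes the overlapping frames up front, trading memory for much cheaper batch generation.

import numpy as np
import tensorflow as tf

length = 44100 * 11   # 11-second segments, as in the question
strides = 44100       # 1-second hop

sample_data = tf.random.normal((5, 755200, 2))
input_mixture = sample_data[0]                               # (755200, 2)
target_mixtures = np.column_stack(sample_data[1:].numpy())   # (755200, 8)

# tf.signal.frame produces every overlapping window at once:
# the framed axis becomes (num_frames, length, channels).
inp_frames = tf.signal.frame(input_mixture, frame_length=length, frame_step=strides, axis=0)
tar_frames = tf.signal.frame(target_mixtures, frame_length=length, frame_step=strides, axis=0)

total_ds = (tf.data.Dataset.from_tensor_slices((inp_frames, tar_frames))
            .batch(8)                    # illustrative batch size
            .prefetch(tf.data.AUTOTUNE))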

Numpy memmap throttles with PyTorch DataLoader when available RAM is less than file size

I'm working on a dataset that is too big to fit into RAM. The solution I'm currently trying is to use numpy memmap to load one sample/row at a time via a DataLoader. The solution looks something like this:
import numpy as np
import torch
from torch.utils.data import DataLoader

class MMDataset(torch.utils.data.Dataset):
    def __init__(self, path):
        self.file_path = path
        self.dataset_len = 44000000
        self.bytes_per_value = 32 / 8
        self.num_cols = 512
        self.num_rows = 1

    def __getitem__(self, index):
        x = np.memmap(self.file_path, dtype='float32', mode='r',
                      shape=(self.num_rows, self.num_cols),
                      offset=int(index * self.num_cols * self.bytes_per_value))
        return np.array(x)

    def __len__(self):
        return self.dataset_len

dataset = MMDataset('./data/emb.memmap')
data_loader = DataLoader(
    dataset,
    batch_size=4096,
    shuffle=True,
    num_workers=20
)
When the amount of RAM available is greater than the size of the memmap file, the data loading is fast. I get around 60 batches/second. However, when the RAM available is less than the size of the memmap file, I get around 3 batches/second.
I discovered this when trying various sizes for the memmap file.
Why is this the case? If Dataloader + memmap is going to throttle when available RAM < memmap file size, this defeats the point of the solution.
I've observed that disk i/o is at 500MB/s read constantly when available RAM < memmap file size. This is much higher than the theoretical amount of reading required to load a batch of 4096 samples (closer to 8MB/s).
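No answer is reproduced above, but the read amplification is consistent with how memmap behaves once the file no longer fits in the page cache: every shuffled single-row access pulls in at least a whole page (plus readahead) for only ~2 KB of useful data. A common mitigation, sketched below with illustrative class names and sizes (not from the original post), is to read a contiguous block of rows per item and shuffle only the block order, so disk access stays mostly sequential.

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class ChunkedMMDataset(Dataset):
    # Each item is a contiguous block of rows, so reads stay mostly sequential.
    def __init__(self, path, total_rows=44_000_000, num_cols=512, rows_per_chunk=4096):
        self.path = path
        self.num_cols = num_cols
        self.rows_per_chunk = rows_per_chunk
        self.num_chunks = total_rows // rows_per_chunk
        self.bytes_per_row = num_cols * 4  # float32

    def __len__(self):
        return self.num_chunks

    def __getitem__(self, chunk_idx):
        x = np.memmap(self.path, dtype='float32', mode='r',
                      shape=(self.rows_per_chunk, self.num_cols),
                      offset=chunk_idx * self.rows_per_chunk * self.bytes_per_row)
        return torch.from_numpy(np.array(x))  # copy the block out of the memmap

# batch_size=None: each yielded "batch" is one contiguous chunk of rows;
# shuffle=True randomizes only the order of chunks, not rows within a chunk.
loader = DataLoader(ChunkedMMDataset('./data/emb.memmap'),
                    batch_size=None, shuffle=True, num_workers=4)

The trade-off is a coarser shuffle; if that matters, rows can additionally be shuffled within each chunk after it is loaded.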

Memory leak when running universal-sentence-encoder-large iterating over a dataframe

I have 140K sentences I want to get embeddings for. I am using the TF Hub Universal Sentence Encoder and am iterating over the sentences (I know it's not the best way, but when I try to feed more than 500 sentences into the model it crashes).
My environment is:
Ubuntu 18.04
Python 3.7.4
TF 1.14
RAM: 16 GB
Processor: i5
My code is:
Version 1
I iterate inside the tf.Session context manager:
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')  # pandas_repository is my own data-access helper
with tf.compat.v1.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    sentence_embedding = None
    for i, row in df.iterrows():
        sentence = row['content']
        embeddings = embed([sentence])
        sentence_embedding = session.run(embeddings)
        df.at[i, 'embedding'] = sentence_embedding
        print('processed index:', i)
Version 2
I open and close a session within each iteration:
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')
for i, row in df.iterrows():
    sentence = row['content']
    embeddings = embed([sentence])
    sentence_embedding = None
    with tf.compat.v1.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
        sentence_embedding = session.run(embeddings)
    df.at[i, 'embedding'] = sentence_embedding
    print('processed index:', i)
While version 2 does seem to trigger some garbage collection and memory is cleared a bit, it still blows up after around 50 items.
Version 1 just keeps gobbling memory.
The correct solution, as given by arnoegw:
def calculate_embeddings(dataframe, table_name):
    sql_get_sentences = "SELECT * FROM semantic_similarity.sentences WHERE embedding IS NULL LIMIT 1500"
    sql_update = 'UPDATE {} SET embedding = data.embedding FROM (VALUES %s) AS data(id, embedding) WHERE {}.id = data.id'.format(table_name, table_name)
    df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
    with hub.eval_function_for_module("https://tfhub.dev/google/universal-sentence-encoder-large/3") as embed:
        while len(df) > 0:
            sentence_array = df['content'].values
            sentence_embeddings = embed(sentence_array)
            df['embedding'] = sentence_embeddings.tolist()
            values = [tuple(x) for x in df[['id', 'embedding']].values]
            pandas_repository.update_db_from_df('semantic_similarity.sentences', sql_update, values)
            df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
I am a newbie to TF and can use any help I can get.
Your code uses tf.Session, so it falls under the TF1.x programming model of first building a dataflow graph and then running it repeatedly with inputs being fed and outputs being fetched from the graph.
But your code does not align well with that programming model. Both versions keep adding new applications of (calls to) the hub.Module to the default TensorFlow graph instead of applying it once and running the same graph repeatedly for the various inputs. Version 2 keeps going into and out of tf.Sessions, which frees some memory but is very inefficient.
Please see my answer to "Strongly increasing memory consumption when using ELMo from Tensorflow-Hub" for guidance on how to do it right in the graph-based programming model of TensorFlow 1.x.
TensorFlow 2.0, which is going to be released soon, defaults to the programming model of "eager execution", which does away with graphs and sessions and would have avoided this confusion. TensorFlow Hub will be updated in due course for TF2.0. For a preview close to your use-case, see https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_text_classification.ipynb
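For reference, here is a minimal sketch of the graph-once pattern that answer describes for TF 1.x (apply the hub.Module to the graph a single time, then run the same op repeatedly with a feed). The batching loop and the batches_of_sentences iterable are illustrative placeholders, not code from the original answer.

import tensorflow as tf
import tensorflow_hub as hub

graph = tf.Graph()
with graph.as_default():
    text_input = tf.placeholder(dtype=tf.string, shape=[None])
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
    embeddings_op = embed(text_input)  # the module is applied to the graph exactly once
    init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
graph.finalize()  # guarantees nothing gets added to the graph later

with tf.Session(graph=graph) as session:
    session.run(init_op)
    for batch in batches_of_sentences:  # e.g. lists of a few hundred strings each
        vectors = session.run(embeddings_op, feed_dict={text_input: batch})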

Find a GPU with enough memory

I want to programmatically find out the available GPUs and their current memory usage and use one of the GPUs based on their memory availability. I want to do this in PyTorch.
I have seen the following solution in this post:
import torch.cuda as cutorch

for i in range(cutorch.device_count()):
    if cutorch.getMemoryUsage(i) > MEM:
        opts.gpuID = i
        break
but it is not working in PyTorch 0.3.1 (there is no function called getMemoryUsage). I am interested in a PyTorch-based solution (using the library's functions). Any help would be appreciated.
On the webpage you link to, there is an answer:
#!/usr/bin/env python
# encoding: utf-8
import subprocess

def get_gpu_memory_map():
    """Get the current GPU usage.

    Returns
    -------
    usage: dict
        Keys are device ids as integers.
        Values are memory usage as integers in MB.
    """
    result = subprocess.check_output(
        [
            'nvidia-smi', '--query-gpu=memory.used',
            '--format=csv,nounits,noheader'
        ]).decode('utf-8')
    # Convert lines into a dictionary
    gpu_memory = [int(x) for x in result.strip().split('\n')]
    gpu_memory_map = dict(zip(range(len(gpu_memory)), gpu_memory))
    return gpu_memory_map

print(get_gpu_memory_map())
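Since the question asks specifically for a solution using PyTorch library functions: newer PyTorch releases (this is an assumption about version availability; the function is not in 0.3.1) expose free/total device memory directly via torch.cuda.mem_get_info, which avoids shelling out to nvidia-smi. A hedged sketch:

import torch

def pick_gpu_with_most_free_memory():
    # Returns the index of the CUDA device with the most free memory (in bytes).
    best_gpu, best_free = None, -1
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        if free > best_free:
            best_gpu, best_free = i, free
    return best_gpu

device = torch.device('cuda:%d' % pick_gpu_with_most_free_memory())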

Tensorflow on shared GPUs: how to automatically select the one that is unused

I have access through ssh to a cluster of n GPUs. Tensorflow automatically gave them names gpu:0,...,gpu:(n-1).
Others have access too and sometimes they take random gpus.
I did not place any tf.device() explicitly because that is cumbersome, and even if I selected GPU number j, someone might already be on GPU number j, which would be problematic.
I would like to go through the GPUs' usage, find the first one that is unused, and use only that one.
I guess someone could parse the output of nvidia-smi with bash, get a variable i, and feed that variable i to the TensorFlow script as the number of the GPU to use.
I have never seen any example of this. I imagine it is a pretty common problem. What would be the simplest way to do that? Is a pure TensorFlow one available?
I'm not aware of a pure-TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. However, GPU memory is a pool shared by all TensorFlow sessions within a process, so the Session config would be the wrong place for it, and there is no mechanism for process-global config (but there should be, to also be able to configure the process-global Eigen threadpool). So you need to do it at the process level by using the CUDA_VISIBLE_DEVICES environment variable.
Something like this:
import subprocess, re

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""
    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # |    0      8734    C   python    11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory."""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu
You can then put this in utils.py and set the GPU in your TensorFlow script before the first tensorflow import, i.e.:
import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
An implementation along the lines of Yaroslav Bulatov's solution is available on https://github.com/bamos/setGPU.