bert_vocab.bert_vocab_from_dataset taking too long - tensorflow

I'm following this tutorial (https://colab.research.google.com/github/tensorflow/text/blob/master/docs/guide/subwords_tokenizer.ipynb#scrollTo=kh98DvoDz7Jn) to generate a vocabulary from a custom dataset. In the tutorial, it takes around 2 minutes for this code to complete:
bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size = 8000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)
pt_vocab = bert_vocab.bert_vocab_from_dataset(
    train_pt.batch(1000).prefetch(2),
    **bert_vocab_args
)
On my dataset it takes much longer. I tried increasing the batch size as well as decreasing the vocabulary size, all to no avail. Is there any way to make this go faster?

I ran into the same issue. This is how I resolved it:
First I checked the number of elements in the dataset:
examples, metadata = tfds.load('my_dataset', as_supervised=True, with_info=True)
print(metadata)
In my case, the dataset contained more than 5 million elements, which explains why creating the vocabulary seemed to take forever.
The Portuguese vocabulary in the TensorFlow example is built from about 50,000 elements, so I selected 1% of my dataset:
train_tokenize, metadata = tfds.load('my_dataset', split='train[:1%]',as_supervised=True, with_info=True)
I then used this dataset to develop the vocabulary, which took some 2 minutes:
train_en_tokenize = train_tokenize.map(lambda en, ol: en)
train_ol_tokenize = train_tokenize.map(lambda en, ol: ol)
ol_vocab = bert_vocab.bert_vocab_from_dataset(
    train_ol_tokenize.batch(1000).prefetch(2),
    **bert_vocab_args
)
en_vocab = bert_vocab.bert_vocab_from_dataset(
    train_en_tokenize.batch(1000).prefetch(2),
    **bert_vocab_args
)
where ol stands for the 'other language' I am developing the model for.
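If the data does not come from TFDS, the `split='train[:1%]'` slicing is not available. One possible substitute is to draw a uniform random subsample in a single pass before building the vocabulary. The sketch below uses reservoir sampling in plain Python, which is a different technique from the split slicing above. The helper name `reservoir_sample` and the toy corpus are made up for illustration.

```python
import random

def reservoir_sample(lines, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample

# Hypothetical corpus: a generator, so the full data is never held in memory.
corpus = ("sentence %d" % i for i in range(100000))
subset = reservoir_sample(corpus, k=1000)
print(len(subset))
```

The resulting list is small enough to wrap with tf.data.Dataset.from_tensor_slices and pass to bert_vocab_from_dataset in the same way as the datasets above.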


How to load huge time series windows dataset without memory errors?

I want to convert a typical time series dataset of about 1 million lines into 100-item windows with 50% overlap. Note that it's a multivariate one, so for example given 8 features and 1000 windows with 100 items the final shape would be (1000, 100, 8) replacing (n_samples, n_timesteps, n_features). The goal is to use it for training machine learning algorithms including deep neural networks.
So far, I've enjoyed using numpy's sliding_window_view as shown below:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
x = np.arange(100).reshape(20, 5)
v = sliding_window_view(x, (3, 5))
v
Unfortunately, I get crashes because I run out of RAM on large datasets with millions of lines. Do you have any suggestions?
Additionally, one serious restriction is that every timestep has a consecutive integer label by which the dataset needs to be grouped (using pandas), so this limits some options for reading it in portions.
I think you are looking for tf.data.Dataset. I'm working on a dataset with a million rows, and the following code runs well for me:
convert = tf.data.TextLineDataset("path_to_file.txt")
dataset = tf.data.Dataset.zip(convert)
Now you have initialized your dataset. To avoid running into memory issues, batch and prefetch it:
def dataset_batches(ds, batch_size):
    return (
        ds
        .cache()
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
        # you can do more operations here
    )
train_batches = dataset_batches(dataset, 64)
And to run it, you'll have to loop:
for (batch, row) in enumerate(train_batches):
    # do stuff
    # batch = current batch index (0, 1, 2, ...), so if your dataset has 1600 rows and you've used batch_size=16 you'll have 100 batches
    # row is the actual data (tensor)
    pass
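For the windowing step itself (100-item windows with 50% overlap), here is a minimal numpy sketch of one way to avoid holding every window in memory at once. It assumes the rows of a single label group are already in one array. The function name `window_batches` and the batch size are made up for illustration. sliding_window_view returns a view, so memory is only allocated when a batch is copied out.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_batches(data, window=100, step=50, batch_size=256):
    """Yield batches of shape (<= batch_size, window, n_features)."""
    # The view has shape (n_rows - window + 1, n_features, window);
    # taking every `step`-th window gives the 50% overlap without copying.
    view = sliding_window_view(data, window_shape=window, axis=0)[::step]
    for start in range(0, view.shape[0], batch_size):
        chunk = view[start:start + batch_size]
        # Copy only this batch, reordered to (batch, timesteps, features).
        yield np.ascontiguousarray(chunk.transpose(0, 2, 1))

# Toy data: 1000 rows, 8 features.
data = np.arange(1000 * 8, dtype=np.float32).reshape(1000, 8)
batches = list(window_batches(data, batch_size=4))
print(batches[0].shape)
```

A generator like this can be applied to each pandas group in turn, or wrapped with tf.data.Dataset.from_generator if a tf.data pipeline is preferred.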

tensorflow profile explanation

I use the TensorFlow profiler to test the inference of my model, and here are the profile details. There are four rows numbered 0, 1, 2 and 3, of which rows 1 and 2 are blank. What do the numbers 0-3 mean, and why are rows 1 and 2 blank?
The machine has 80 cores. Does this mean that inference only occupies 4 of them?
Thanks.
I suppose that each row corresponds to a worker thread that runs operators.
So your inference only occupies 4 cores, as you say.
TensorFlow uses multiple threads when:
There are independent parts of the graph.
There is an operator that itself uses multiple threads.
So you can use multiple cores effectively if your graph has many independent parts.
In the following code, the graph has many independent parts. Therefore the number of rows in the profiler matches "inter_op_parallelism_threads".
config = tf.ConfigProto(inter_op_parallelism_threads=5, intra_op_parallelism_threads=1)
with tf.device("/cpu:0"):
    list_r = []
    for i in range(80):
        r = tf.random_normal(shape=[100, 100])
        list_r.append(r)
    v = tf.add_n(list_r)
global_step = tf.train.create_global_step()
hook = tf.train.ProfilerHook(save_steps=1)
increment_global = global_step.assign_add(1)
with tf.train.SingularMonitoredSession(hooks=[hook], config=config) as sess:
    sess.run([v, increment_global])
If you want to know the detail of ConfigProto, you can get information from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto

TF DATA API: How to produce tensorflow input to object set recognition

Consider this problem: select a random number of samples from a random subject in an image dataset (like ImageNet) as an input element for a TensorFlow graph that functions as an object set recognizer. Within each batch, every class has the same number of samples to facilitate computation, but different batches can have a different number of images per class, e.g. batch_0: num_imgs_per_cls=2; batch_1000: num_imgs_per_cls=3.
If there is existing functionality in TensorFlow, an explanation of the whole process from scratch (e.g. starting from directories of images) would be really appreciated.
There is a very similar answer by @mrry here.
Sampling balanced batches
In face recognition we often use triplet loss (or similar losses) to train the model. The usual way to sample triplets to compute the loss is to create a balanced batch of images where we have for instance 10 different classes (i.e. 10 different people) with 5 images each. This gives a total batch size of 50 in this example.
More generally the problem is to sample num_classes_per_batch (10 in the example) classes, and then sample num_images_per_class (5 in the example) images for each class. The total batch size is:
batch_size = num_classes_per_batch * num_images_per_class
Have one dataset for each class
The easiest way to deal with a lot of different classes (100,000 in MS-Celeb) is to create one dataset for each class.
For instance you can have one tfrecord for each class and create the datasets like this:
# Build one dataset per class.
filenames = ["class_0.tfrecords", "class_1.tfrecords"...]
per_class_datasets = [tf.data.TFRecordDataset(f).repeat(None) for f in filenames]
Sample from the datasets
Now we would like to be able to sample from these datasets. For instance we want the following labels in our batch:
1 1 1 3 3 3 9 9 9 4 4 4
This corresponds to num_classes_per_batch=4 and num_images_per_class=3.
To do this we will need to use features that will be released in r1.9. The function should be called tf.contrib.data.choose_from_datasets (see here for a discussion on this).
It should look like:
def choose_from_datasets(datasets, selector):
    """Chooses elements with indices from selector among the datasets in `datasets`."""
So we create this selector which will output 1 1 1 3 3 3 9 9 9 4 4 4 and combine it with datasets to obtain our final dataset that will output balanced batches:
def generator(_):
    # Sample `num_classes_per_batch` classes for the batch
    sampled = tf.random_shuffle(tf.range(num_classes))[:num_classes_per_batch]
    # Repeat each element `num_images_per_class` times
    batch_labels = tf.tile(tf.expand_dims(sampled, -1), [1, num_images_per_class])
    return tf.to_int64(tf.reshape(batch_labels, [-1]))
selector = tf.contrib.data.Counter().map(generator)
selector = selector.apply(tf.contrib.data.unbatch())
# `per_class_datasets` is the list of per-class datasets built earlier
dataset = tf.contrib.data.choose_from_datasets(per_class_datasets, selector)
# Batch
batch_size = num_classes_per_batch * num_images_per_class
dataset = dataset.batch(batch_size)
You can test this with the nightly TensorFlow build and by using DirectedInterleaveDataset as a workaround:
# The working option right now is
from tensorflow.contrib.data.python.ops.interleave_ops import DirectedInterleaveDataset
dataset = DirectedInterleaveDataset(selector, per_class_datasets)
I also wrote about this workaround here.
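To make the selector logic concrete, here is a small numpy-only sketch of the label pattern it produces: sample some classes without replacement, then repeat each one num_images_per_class times. It only illustrates the sampling step and is not the tf.contrib.data pipeline itself. The helper name `sample_batch_labels` is made up for illustration.

```python
import numpy as np

def sample_batch_labels(num_classes, num_classes_per_batch, num_images_per_class, rng):
    """Return class indices for one balanced batch, e.g. 1 1 1 3 3 3 9 9 9 4 4 4."""
    # Pick distinct classes for this batch.
    classes = rng.choice(num_classes, size=num_classes_per_batch, replace=False)
    # Repeat each chosen class `num_images_per_class` times, keeping them grouped.
    return np.repeat(classes, num_images_per_class)

rng = np.random.default_rng(0)
labels = sample_batch_labels(num_classes=10, num_classes_per_batch=4,
                             num_images_per_class=3, rng=rng)
print(labels)
```

Each entry in the result says which per-class dataset the next element of the batch should be drawn from, which is the role the selector plays above.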

Why shuffling data gives significantly higher accuracy?

In TensorFlow, I've written a big model for a two-class image classification problem. My question concerns the following code snippet:
X, y, X_val, y_val = prepare_data()
probs = calc_probs(model, session, X)
accuracy = float(np.equal(np.argmax(probs, 1), np.argmax(y, 1)).sum()) / probs.shape[0]
loss = log_loss(y, probs)
X is an np.array of shape (25000,244,244,3). That code results in accuracy=0.5834 (close to random accuracy) and loss=2.7106. But when I shuffle the data by adding these 3 lines after the first line:
sample_idx = random.sample(range(0, X.shape[0]), 25000)
X = X[sample_idx]
y = y[sample_idx]
the results become much better: accuracy=0.9933 and loss=0.0208.
Why can shuffling the data give significantly higher accuracy? What could be the reason for that?
The function calc_probs is mainly a run call:
probs = session.run(model.probs, feed_dict={model.X: X})
Update:
After hours of debugging, I figured out that evaluating a single image gives a different result each time. For example, if you run the following line of code multiple times, you get a different result each time:
session.run(model.probs, feed_dict={model.X: [X[20]]})
My data is normally sorted: X contains the class 1 samples first, then class 2. In the calc_probs function, I run each batch of the data sequentially. So, without shuffling, each run has data from a single class.
I've also noticed that with shuffling, if the batch size is very small, I get random-level accuracy.
There is some mathematical justification for this in the context of the randomized Kaczmarz algorithm. The regular Kaczmarz algorithm is an old algorithm that can be seen as non-shuffling SGD on a least squares problem, and there are guaranteed faster convergence rates if you use randomization. Follow the references in http://www.cs.ubc.ca/~nickhar/W15/Lecture21Notes.pdf
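The observation in the question's update, that sorted data makes every sequential batch contain a single class, can be checked with a small numpy sketch. The class sizes (12,500 per class) and the batch size of 100 are assumptions chosen for illustration, not values from the question.

```python
import numpy as np

def single_class_batches(labels, batch_size):
    """Count sequential batches whose labels all belong to one class."""
    count = 0
    for start in range(0, len(labels), batch_size):
        if len(np.unique(labels[start:start + batch_size])) == 1:
            count += 1
    return count

# Sorted labels: all of class 0 first, then all of class 1 (assumed 50/50 split).
labels = np.array([0] * 12500 + [1] * 12500)
sorted_count = single_class_batches(labels, batch_size=100)

# The same labels after a random permutation.
rng = np.random.default_rng(0)
shuffled = labels[rng.permutation(len(labels))]
shuffled_count = single_class_batches(shuffled, batch_size=100)
print(sorted_count, shuffled_count)
```

With sorted labels every batch is single-class, while after shuffling essentially none are. Very small batches are much more likely to be single-class even after shuffling, which fits the remark about small batch sizes in the update.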

word2vec - get nearest words

Reading the TensorFlow word2vec model output, how can I output the words related to a specific word?
Reading the source (https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/tutorials/word2vec/word2vec_basic.py), I can see how the image is plotted.
But is there a data structure (e.g. a dictionary) created as part of training the model that gives access to the n words nearest to a given word?
For example, if word2vec generated this image:
image src: https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html
In this image the words 'to', 'he' and 'it' are in the same cluster. Is there a function that takes 'to' as input and outputs 'he' and 'it' (in this case n=2)?
This approach applies to word2vec in general. If you can save the word2vec vectors in a text/binary file like the Google/GloVe word vectors, then all you need is gensim.
To install:
Via github
Python code:
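from gensim.models import KeyedVectors
# fname is the path to the saved vectors; pass binary=True for a binary file
gmodel = KeyedVectors.load_word2vec_format(fname)
# topn must be passed by keyword; the second positional argument is `negative`
ms = gmodel.most_similar('good', topn=10)
for x in ms:
    print(x[0], x[1])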
However, this searches all the words to produce the results. Approximate nearest neighbor (ANN) methods give results faster, with a trade-off in accuracy.
In the latest gensim, annoy is used to perform the ANN search; see this notebook for more information.
Flann is another library for Approximate Nearest Neighbors.
I will assume that you don't want to use gensim, and would prefer to stick with tensorflow. In that case, I'll offer two options
Option 1 - Tensorboard:
If you are just trying to do this from an exploratory standpoint, I would suggest using Tensorboard's embedding visualizer to search for the closest embeddings. It provides a cool interface and you can use both cosine and Euclidean distances with a set number of neighbors.
Link to Tensorflow documentation
Option 2 - Direct Calculation
Within the word2vec_basic.py file, there is an example of how they are calculating closest words, and you could go ahead and use that if you mess with the function a little bit. The following is found in the graph itself:
# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
similarity = tf.matmul(
    valid_embeddings, normalized_embeddings, transpose_b=True)
Then, during training (every 10000 steps) they run this next bit of code (while the session is active). When they call similarity.eval() it is getting the literal numpy array evaluation of the similarity tensor in the graph.
# Note that this is expensive (~20% slowdown if computed every 500 steps)
if step % 10000 == 0:
    sim = similarity.eval()
    for i in xrange(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log_str = "Nearest to %s:" % valid_word
        for k in xrange(top_k):
            close_word = reverse_dictionary[nearest[k]]
            log_str = "%s %s," % (log_str, close_word)
        print(log_str)
If you want to adapt this for yourself, you will have to change reverse_dictionary[valid_examples[i]] so that it uses the index or indices of the word(s) for which you want the k closest words.
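As a rough sketch of that adaptation, the same cosine-similarity calculation can be done in numpy on the trained embedding matrix once it has been pulled out of the session. The function name `nearest_words` and the toy embeddings below are made up for illustration. `dictionary` (word to index) and `reverse_dictionary` (index to word) are assumed to have the same meaning as in word2vec_basic.py.

```python
import numpy as np

def nearest_words(word, embeddings, dictionary, reverse_dictionary, top_k=2):
    """Return the top_k words closest to `word` by cosine similarity."""
    # Normalize rows so that a dot product equals cosine similarity.
    norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
    normalized = embeddings / norm
    idx = dictionary[word]
    sim = normalized @ normalized[idx]
    # Sort by descending similarity and drop the query word itself.
    order = (-sim).argsort()
    order = order[order != idx][:top_k]
    return [reverse_dictionary[i] for i in order]

# Toy 2-D embeddings for illustration only.
words = ["to", "he", "it", "cat"]
dictionary = {w: i for i, w in enumerate(words)}
reverse_dictionary = {i: w for w, i in dictionary.items()}
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [-1.0, 0.2]])

print(nearest_words("to", embeddings, dictionary, reverse_dictionary, top_k=2))
```

Like the graph code above, this compares the query against every word, so it is exact but scales linearly with the vocabulary size.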
Get gensim and use the similar_by_word method of a gensim.models.Word2Vec model.
similar_by_word takes 3 parameters:
The input word
topn - the number of most similar words to return (optional, default=10)
restrict_vocab (optional, default=None)
Example
import gensim, nltk
class FileToSent(object):
    """A class to load a text file efficiently """
    def __init__(self, filename):
        self.filename = filename
        # To remove stop words (optional)
        self.stop = set(nltk.corpus.stopwords.words('english'))
    def __iter__(self):
        for line in open(self.filename, 'r'):
            ll = [i for i in unicode(line, 'utf-8').lower().split() if i not in self.stop]
            yield ll
Then depending on your input sentences (sentence_file.txt),
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, min_count=2, hs=1)
print model.similar_by_word('hack', 2) # Get two most similar words to 'hack'
# [(u'debug', 0.967338502407074), (u'patch', 0.952264130115509)] (Output specific to my dataset)