From the source code of Matcher, it seems that pipe is just calling the matching function one by one. Does it use multithreading or buffer content by batch_size at all?
def pipe(self, docs, batch_size=1000, n_threads=2):
    """Match a stream of documents, yielding them in turn.

    docs (iterable): A stream of documents.
    batch_size (int): Number of documents to accumulate into a working set.
    n_threads (int): The number of threads with which to work on the buffer
        in parallel, if the implementation supports multi-threading.
    YIELDS (Doc): Documents, in order.
    """
    for doc in docs:
        yield doc
I have a dataset of images that is too large to store on memory. What I plan to do is loading pairs of the paths to the images and corresponding labels as my dataset, then use a generator function during training to convert only the paths in my batch to images before feeding them to the network.
Is Dataset.map a good way to do this? Does it return a mapping function that can be applied only to the current batch during training, or does it perform the mapping operation on the whole dataset at once, occupying lots of memory? In the second case, what is an alternative?
A few tutorials I went through made me believe the mapping takes place per batch, but this quote from the documentation suggests a whole new dataset is returned: "This transformation applies map_func to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input."
The key thing to understand here is that Dataset objects are generally "lazy" in that elements are only processed as needed (in a batched Dataset, elements == batches). When iterating over a dataset, this usually means that only the next requested element is prepared and then returned. So to answer your question: when using map to load data from disk, and applying this to a dataset of file names, only one batch of the loaded data should be stored in memory at any one time, and you should be able to process the dataset just fine. However, this can significantly slow down training if loading the files is a bottleneck in terms of speed.
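To make the laziness concrete, here is a minimal sketch in plain Python (generators, not tf.data itself) that mimics the behaviour described above. `load_image`, `lazy_map`, `batched` and the file names are hypothetical stand-ins, and the real library may prepare a few extra elements internally.

```python
from itertools import islice

loaded = []  # records which paths were actually "loaded"

def load_image(path):
    """Hypothetical stand-in for reading and decoding an image file."""
    loaded.append(path)
    return f"pixels of {path}"

def lazy_map(func, items):
    # Like a lazy map: nothing runs until an element is requested.
    for item in items:
        yield func(item)

def batched(items, batch_size):
    # Like batching: group consecutive elements into lists.
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

paths = [f"img_{i}.png" for i in range(1000)]
pipeline = batched(lazy_map(load_image, paths), batch_size=4)

first_batch = next(pipeline)
print(len(first_batch), len(loaded))  # 4 4
```

Only the four paths needed for the first batch were touched; the other 996 stay as cheap strings until they are requested.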
There are some exceptions though, for example:
When you use the shuffle method, you need to provide a buffer size, and AFAIK the entire buffer is preprocessed at once. This can lead to issues since you want a large buffer for good shuffling, but this requires more memory. Thus you probably want to use shuffle before applying map.
The prefetch method results in multiple elements being prepared in order to avoid the model having to wait for the next batch to be processed.
Note that this lazy behavior also has some disadvantages, e.g.
You can only iterate over datasets sequentially; there is no random access.
A dataset doesn't even know how many elements it contains (this would require iterating over the entire set).
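As a rough back-of-the-envelope check of the shuffle point above, the following sketch compares the memory held by a shuffle buffer placed after map (buffer holds decoded images) with one placed before map (buffer holds file paths). All numbers are illustrative assumptions, not measurements.

```python
# All sizes below are illustrative assumptions, not measurements.
buffer_size = 10_000               # elements held by the shuffle buffer
image_bytes = 224 * 224 * 3 * 4    # one decoded 224x224 RGB image stored as float32
path_bytes = 100                   # rough size of one file-path string

after_map = buffer_size * image_bytes    # shuffle after map: buffer holds images
before_map = buffer_size * path_bytes    # shuffle before map: buffer holds paths

print(f"shuffle after map:  {after_map / 1e9:.2f} GB")   # 6.02 GB
print(f"shuffle before map: {before_map / 1e6:.2f} MB")  # 1.00 MB
```

Under these assumptions the same buffer size costs several gigabytes in one position and about a megabyte in the other, which is why shuffling the file names first is usually the cheaper option.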
My current understanding is:
Different map_func: Both interleave and flat_map expect "A function mapping a dataset element to a dataset". In contrast, map expects "A function mapping a dataset element to another dataset element".
Arguments: Both interleave and map offer the argument num_parallel_calls, whereas flat_map does not. Moreover, interleave offers these magical arguments block_length and cycle_length. For cycle_length=1, the documentation states that the outputs of interleave and flat_map are equal.
Finally, I have seen data loading pipelines without interleave as well as ones with interleave. Any advice on when to use interleave vs. map or flat_map would be greatly appreciated.
//EDIT: I do see the value of interleave, if we start out with different datasets, such as in the code below
files = tf.data.Dataset.list_files("/path/to/dataset/train-*.tfrecord")
dataset = files.interleave(
However, is there any benefit of using interleave over map in a scenario such as the one below?
files = tf.data.Dataset.list_files("/path/to/dataset/train-*.png")
dataset =,
Can map not also be used to parallelize I/O?
Indeed, you can read images and labels from a directory with the map function. Assume this case:
list_ds =
def process_path(path):
### get label here etc. Images need to be decoded
return, label
new_ds =,
Note that this is now multi-threaded, since num_parallel_calls has been set.
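As a plain-Python analogy (not tf.data), a parallel, order-preserving map over file paths can be sketched with a thread pool. The directory layout `data/<label>/<file>.png`, the `process_path` body and the placeholder "decoding" are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath

def process_path(path):
    """Return (image, label); the label is assumed to be the parent directory name."""
    label = PurePosixPath(path).parent.name
    image = f"decoded<{path}>"  # placeholder for the real read + decode step
    return image, label

paths = ["data/cat/1.png", "data/dog/2.png", "data/cat/3.png"]

# Several paths are processed concurrently, but results come back in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_path, paths))

labels = [label for _, label in results]
print(labels)  # ['cat', 'dog', 'cat']
```

This is roughly the effect of setting num_parallel_calls on map: the per-element work overlaps, while the output order stays the same.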
The advantage of the interleave() function:
Suppose you have a dataset
With cycle_length you control how many elements are taken out of the dataset at a time; e.g. with cycle_length=5, 5 elements are taken out of the dataset and map_func is applied to each of them.
After that, data is fetched from the newly generated dataset objects in turn, block_length pieces of data at a time.
In other words, interleave() function can iterate through your dataset while applying a map_func(). Also, it can work with many datasets or data files at the same time. For example, from the docs:
dataset = dataset.interleave(lambda x:, num_parallel_calls=1),
cycle_length=4, block_length=16)
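The roles of cycle_length and block_length are easier to see in a small model. The sketch below is a simplified, sequential plain-Python imitation of interleave and flat_map (no parallelism, no tf.data); `repeat_six` is a hypothetical map_func. As far as I can tell, it reproduces the ordering of the `Dataset.range(1, 6)` example in the interleave documentation.

```python
def flat_map(items, map_func):
    # Map each element to an iterable and concatenate the results in order.
    for item in items:
        yield from map_func(item)

def interleave(items, map_func, cycle_length, block_length=1):
    # Keep up to `cycle_length` sub-iterators open and take `block_length`
    # elements from each in turn; an exhausted slot is refilled on its next visit.
    source = iter(items)
    slots = [None] * cycle_length
    source_done = False
    while True:
        for i in range(cycle_length):
            if slots[i] is None and not source_done:
                try:
                    slots[i] = iter(map_func(next(source)))
                except StopIteration:
                    source_done = True
            if slots[i] is None:
                continue
            for _ in range(block_length):
                try:
                    yield next(slots[i])
                except StopIteration:
                    slots[i] = None
                    break
        if source_done and all(slot is None for slot in slots):
            return

def repeat_six(x):
    return [x] * 6

interleaved = list(interleave(range(1, 6), repeat_six, cycle_length=2, block_length=4))
print(interleaved)
# [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4,
#  3, 3, 4, 4, 5, 5, 5, 5, 5, 5]

same = list(interleave(range(1, 6), repeat_six, cycle_length=1)) == list(flat_map(range(1, 6), repeat_six))
print(same)  # True
```

With cycle_length=1 only one sub-iterator is ever open, so the output degenerates to plain concatenation, which is the flat_map behaviour.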
However, is there any benefit of using interleave over map in a
scenario such as the one below?
interleave() and map() seem somewhat similar, but their use cases are not the same. If you want to read a dataset while applying some mapping, interleave() is your superhero. Your images may need to be decoded while being read. Reading everything first and then decoding may be inefficient when working with large datasets. In the code snippet you gave, AFAIK, the one with should be faster.
TL;DR interleave() parallelizes the data loading step by interleaving the I/O operation to read the file.
map() will apply the data pre-processing to the contents of the datasets.
So you can do something like:
ds = train_file.interleave(lambda x:,
AUTOTUNE will decide the level of parallelism for buffer size, CPU power, and also for I/O operations. In other words, AUTOTUNE will handle the level dynamically at runtime.
The num_parallel_calls argument spawns multiple threads to utilize multiple cores for parallelizing the tasks. Since interleave also accepts num_parallel_calls, you can use it to load multiple datasets in parallel, reducing the time spent waiting for files to be opened. Image is taken from docs.
In the image, there are 4 overlapping datasets, that is determined by the argument cycle_length, so in this case cycle_length = 4.
FLAT_MAP: Maps a function across the dataset and flattens the result. If you want to make sure the order stays the same, you can use this. It does not take num_parallel_calls as an argument. Please refer to the docs for more.
The map function will execute the selected function on every element of the Dataset separately. Obviously, data transformations on large datasets can be expensive as you apply more and more operations. The key point is, it can be more time consuming if CPU is not fully utilized. But we can use parallelism APIs:
num_of_cores = multiprocessing.cpu_count() # num of available cpu cores
mapped_data =, num_parallel_calls = num_of_cores)
For cycle_length=1, the documentation states that the outputs of
interleave and flat_map are equal
cycle_length --> The number of input elements that will be processed concurrently. When it is set to 1, the input elements are processed one by one.
INTERLEAVE: Transformation operations like map can be parallelized.
With a parallel map, the CPU parallelizes the transformation step, but extracting the data from disk can still cause overhead.
Besides, once the raw bytes are read into memory, it may also be necessary to map a function over the data (decrypting it, for example), which of course requires additional computation. To mitigate the impact of these data extraction overheads, the extraction step itself needs to be parallelized by interleaving the contents of each dataset.
So while reading the datasets, you want to maximize:
I have some sentences for which I am creating an embedding and it works great for similarity searching unless there are some truly unusual words in the sentence.
In that case, the truly unusual words in fact contain the very most similarity information of any words in the sentence BUT all of that information is lost during embedding due to the fact that the word is apparently not in the vocabulary of the model.
I'd like to get a list of all of the words known by the GUSE embedding model so that I can mask those known words out of my sentence, leaving only the "novel" words.
I can then do an exact word search for those novel words in my target corpus and achieve usability for my similar sentence searching.
e.g. "I love to use Xapian!" gets embedded as "I love to use UNK".
If I just do a keyword search for "Xapian" instead of a semantic similarity search, I'll get much more relevant results than I would using GUSE and vector KNN.
Any ideas on how I can extract the vocabulary known/used by GUSE?
I combined the earlier answer from @Roee Shenberg and the solution provided here to come up with a solution that is applicable to USE v4:
import importlib
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model("/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/")
graph = saved_model.meta_graphs[0].graph_def
fns = [f for f in saved_model.meta_graphs[0].graph_def.library.function if "ptb" in str(f).lower()]
print(len(fns)) # should be 1
nodes_with_sp = [n for n in fns[0].node_def if n.name == "Embeddings_words"]
print(len(nodes_with_sp)) # should be 1
words_tensor = nodes_with_sp[0].attr.get("value").tensor
word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # should be 400004
If you are just curious about the words, I uploaded them here.
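Once a word list like the one above is available, the masking step described in the question could look roughly like this. It is only a sketch: the tiny `vocabulary` set is a hypothetical stand-in for `word_list`, and the regex tokenization and lowercasing are assumptions that may not match how the model actually preprocesses text.

```python
import re

def novel_words(sentence, vocabulary):
    """Return the tokens of `sentence` that are not in `vocabulary`.

    Assumes a simple regex tokenizer and a lowercase vocabulary.
    """
    tokens = re.findall(r"[A-Za-z0-9']+", sentence)
    return [token for token in tokens if token.lower() not in vocabulary]

vocabulary = {"i", "love", "to", "use"}  # hypothetical stand-in for set(word_list)

print(novel_words("I love to use Xapian!", vocabulary))  # ['Xapian']
```

The returned words can then be used for an exact keyword search alongside the embedding-based similarity search.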
I'm assuming you have tensorflow & tensorflow_hub installed, and you have already downloaded the model.
IMPORTANT: I'm assuming you're looking at! There's no guarantee the object graph looks the same for different versions, it's likely that modifications will be needed.
Find its location on disk: it's somewhere under /tmp/tfhub_modules unless you set the TFHUB_CACHE_DIR environment variable (Windows/Mac have different locations). The path should contain a file called saved_model.pb, which is the model, serialized using Protocol Buffers.
Unfortunately, the dictionary is serialized inside the model's Protocol Buffers file and not as an external asset, so we'll have to load the model and get the variable from it.
The strategy is to use tensorflow's code to deserialize the file, and then travel down the serialized object tree all the way to the dictionary.
import importlib
MODEL_PATH = 'path/to/model/dir' # e.g. '/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/'
# Use the tensorflow internal Protobuf loader. A regular import statement will fail.
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model(MODEL_PATH)
# reach into the object graph to get the tensor
graph = saved_model.meta_graphs[0].graph_def
function = graph.library.function
node_type, node_value = function[5].node_def
# if you print(node_type) you'll see it's called "text_preprocessor/hash_table"
# as well as get insight into this branch of the object graph we're looking at
words_tensor = node_value.attr.get("value").tensor
word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # -> 400004
Some resources that helped:
A GitHub issue relating to changing the vocabulary
A Tensorflow Google-group thread linked from the issue
Extra Notes
Despite what the GitHub issue may lead you to think, the 400k words here are not the GloVe 400k vocabulary. You can verify this by downloading the GloVe 6B embeddings (file link), extracting glove.6B.50d.txt, and then using the following code to compare the two dictionaries:
with open('/path/to/glove.6B.50d.txt') as f:
    glove_vocabulary = set(line.strip().split(maxsplit=1)[0] for line in f)
USE_vocabulary = set(word_list) # from above
print(len(USE_vocabulary - glove_vocabulary)) # -> 281150
Inspecting the different vocabularies is interesting in and of itself, e.g. why does GloVe have an entry for '287.9'?
I have a dataset of around 1M examples. I wrote each example to a separate .tfrecord file, which resulted in around 500GB sitting in some network location.
Reading multiple small files from this network location is extremely slow, so I'm thinking about grouping around 100 examples into one .tfrecord file.
I am worried though, that examples from the same .tfrecords file will always appear in the same minibatch (or one minibatch after each other), which is bad for the proper mixing of training data I want to have.
My input pipeline is the following:
I have a tf.train.string_input_producer(files, capacity=100000) for the filenames queue, using to read from the filenames queue, and use tf.train.batch that creates an examples queue and returns a batch from it using dequeue_many.
I fear that once the filenames queue dequeues a filename, all examples from it will be read and enqueued into the examples FIFO queue created by tf.train.batch, which will result in the same examples being in the same minibatches over and over.
Is it really going to have the same examples in the same minibatch over and over? If so, should I create a Shuffle queue for examples, instead of using tf.train.batch?
One of the points of TFRecord is to store many examples in the same file to overcome the problem of opening/closing many files. So your approach of one tfrecord file per example does not make sense. You can even put all examples in one file, or have 10k per file. Regarding shuffling: there are two types of shuffling, which serve different purposes and shuffle different things:
tf.train.string_input_producer has a shuffle argument: Boolean. If true, the strings are randomly shuffled within each epoch. So if you have a few files ['file1', 'file2', ..., 'filen'], the files are read in a random order. If false, the files follow one after another.
tf.train.shuffle_batch creates batches by randomly shuffling tensors. So it takes batch_size tensors from your queue (you will need to start the queue runners with tf.train.start_queue_runners) and shuffles them.
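The difference between the two kinds of shuffling can be illustrated with a small plain-Python simulation (not the TensorFlow queues themselves). The file names, the 100 examples per file, the batch size of 20 and the buffer capacity of 500 are all made-up values.

```python
import random

def read_files(filenames, examples_per_file, rng, shuffle_files=True):
    """File-level shuffle: randomizes file order, not the examples inside a file."""
    order = list(filenames)
    if shuffle_files:
        rng.shuffle(order)
    for name in order:
        for index in range(examples_per_file):
            yield (name, index)

def plain_batches(examples, batch_size):
    """FIFO batching: examples keep their arrival order."""
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def shuffled_batches(examples, batch_size, capacity, rng):
    """Example-level shuffle: draw each element at random from a buffer."""
    buffer = []

    def draws():
        for example in examples:
            buffer.append(example)
            if len(buffer) >= capacity:
                yield buffer.pop(rng.randrange(len(buffer)))
        while buffer:  # drain what is left once the input ends
            yield buffer.pop(rng.randrange(len(buffer)))

    return plain_batches(draws(), batch_size)

def files_per_batch(batches):
    return [len({name for name, _ in batch}) for batch in batches]

rng = random.Random(0)
files = [f"part-{i}.tfrecord" for i in range(10)]

fifo = files_per_batch(plain_batches(read_files(files, 100, rng), batch_size=20))
mixed = files_per_batch(shuffled_batches(read_files(files, 100, rng), 20, capacity=500, rng=rng))

print(max(fifo))   # 1: every FIFO batch comes from a single file
print(max(mixed))  # more than 1: the buffer mixes examples across files
```

In this model, shuffling only the file names still produces batches drawn from one file each, which is the situation the question worries about; adding an example-level buffer spanning several files is what mixes them.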
About the Dataset (from TensorFlow 1.2, see here and here) usage:
The way it expects to get data doesn't really fit how I usually get my data. In my case, I have a thread in which I receive data; I don't know in advance when the data will end, but I can see when it ends. Then I wait until I have processed all the buffers, and at that point I have finished one epoch. How can I get this logic with the Dataset?
Note that I prefer the Dataset interface over the QueueBase interface because it gives me the iterator interface which I can reinitialize and even reset to a different Dataset. This is more powerful compared to queues which cannot be reopened currently after they are closed (see here and here).
Maybe a similar question, or the same question: how can I wrap a Dataset around a queue? I have some thread which reads some data from somewhere and which can feed it and queue it somehow. How do I get the data into the Dataset? I could repeat some dummy tensor infinitely many times and then use map to just return my queue.dequeue(), but that really only gets me back to all the original problems with the queue, i.e. how to reopen the queue.
The new Dataset.from_generator() method allows you to define a Dataset that is fed by a Python generator. (To use this feature at present, you must download a nightly build of TensorFlow or build it yourself from source. It will be part of TensorFlow 1.4.)
The easiest way to implement your example would be to replace your receiving thread with a generator, with pseudocode as follows:
def receiver():
    while True:
        next_element = ...  # Receive next element from external source.
                            # Note that this method may block.
        end_of_epoch = ...  # Decide whether or not to stop based on next_element.
        if not end_of_epoch:
            yield next_element  # Note: you may need to convert this to an array.
        else:
            return  # Returning will signal OutOfRangeError on downstream iterators.

dataset = tf.data.Dataset.from_generator(receiver, output_types=...)
# You can chain other `Dataset` methods after the generator. For example:
dataset = dataset.prefetch(...) # This will start a background thread
# to prefetch elements from `receiver()`.
dataset = dataset.repeat(...) # Note that each repetition will call
# `receiver()` again, and start from
# a fresh state.
dataset = dataset.batch(...)
More complicated topologies are possible. For example, you can use Dataset.interleave() to create many receivers in parallel.
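For the case where a separate thread must keep doing the receiving, one possible arrangement is to let the generator read from a thread-safe queue and stop at an end-of-epoch marker. The sketch below is plain Python only; `END_OF_EPOCH`, `producer` and `make_receiver` are hypothetical names, and the resulting `receiver` callable is the kind of object one would hand to Dataset.from_generator.

```python
import queue
import threading

END_OF_EPOCH = object()  # hypothetical sentinel put on the queue after each epoch

def producer(buffer, epochs):
    """Stand-in for the receiving thread: feeds elements, then marks the epoch end."""
    for epoch in epochs:
        for item in epoch:
            buffer.put(item)
        buffer.put(END_OF_EPOCH)

def make_receiver(buffer):
    def receiver():
        while True:
            item = buffer.get()       # blocks until the thread delivers something
            if item is END_OF_EPOCH:
                return                # ends this epoch's iteration
            yield item
    return receiver

buffer = queue.Queue(maxsize=8)
thread = threading.Thread(
    target=producer, args=(buffer, [[1, 2, 3], [4, 5]]), daemon=True)
thread.start()

receiver = make_receiver(buffer)
epoch_1 = list(receiver())  # each call to receiver() consumes exactly one epoch
epoch_2 = list(receiver())
thread.join()

print(epoch_1, epoch_2)  # [1, 2, 3] [4, 5]
```

Because the queue itself is never closed, calling `receiver()` again simply starts reading the next epoch, which sidesteps the problem of reopening a closed queue.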