spaCy: use only some components - spacy

I am using spaCy for my project. It works magnificently, but it is a bit time-consuming, so I am looking for ways to reduce processing time. I have realized that calling nlp on my text performs many operations: tokenization, NER, etc. (docs here: https://spacy.io/usage/spacy-101#pipelines), while in some parts of my code I only need, for example, vectorization. Is it possible to apply only some components of the pipeline to reduce processing time?

It is possible to disable pipeline components and enable them again when necessary. If speed really is an issue, also try the pipe functionality, which speeds things up when processing large numbers of documents:
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    print([(ent.text, ent.label_) for ent in doc.ents])

Related

A checklist for spaCy optimization?

For a long time I have been trying to understand how to systematically make spaCy run as fast as possible, and I would like this post to become a wiki-style public post if possible.
Here is what I currently know, with subsidiary questions on each point:
1. spaCy will run faster on faster hardware. For example, try a computer with more CPU cores, or more RAM/primary memory.
What I do not know:
What specific aspects of the execution of spaCy - especially the main one of instantiating the Doc object - depend more on CPU vs. RAM, and why?
Is the instantiation of a Doc object a sequence of arithmetical calculations (the compiled binary of the neural networks), so the more CPU cores, the more calculations can be done at once, therefore faster? Does that mean increasing RAM would not make this process faster?
Are there any other aspects of CPUs or GPUs to watch out for, other than core count, that would make one chip better than another for spaCy? Someone mentioned "hyper-threading".
Is there any standard mathematical estimate of time per pipeline component, such as the parser, relative to input string length? Something like: parser time in seconds ≈ number of characters in input / number of CPU cores?
2. You can make spaCy run faster by removing components you don't need, for example with nlp = spacy.load("en_core_web_sm", disable=['tagger', 'ner', 'lemmatizer', 'textcat'])
Just loading the spaCy module itself with import spacy is slightly slow. If you haven't even loaded the language model yet, what are the most significant things being loaded here, apart from just adding functions to the namespace? Is it possible to load only the part of the module you need?
3. You can make spaCy faster by using certain options that simply make it run faster.
I have read about multiprocessing with nlp.pipe, n_process, batch_size and joblib, but that's for multiple documents and I'm only doing a single document right now.
4. You can make spaCy faster by minimising the number of times it has to perform the same operations.
You can keep Spacy alive on a server and pass processing commands to it when you need to
You can serialize a Doc to reload it later, and you can further exclude attributes you don't need with doc.to_bytes(exclude=["tensor"]) or doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
5. Anything else?
Checklist
The following checklist is focused on runtime performance optimization rather than training (i.e. when one uses existing config.cfg files loaded with the convenience wrapper spacy.load(), instead of training one's own models and creating a new config.cfg file); however, most of the points still apply. This list is not comprehensive: the spaCy library is extensive and there are many ways to build pipelines and carry out tasks, so covering every case here is impractical. Regardless, this list is intended as a handy reference and starting point.
Summary
If more powerful hardware is available, use it.
Use (optimally) small models/pipelines.
Use your GPU if possible.
Process large texts as a stream and buffer them in batches.
Use multiprocessing (if appropriate).
Use only necessary pipeline components.
Save and load progress to avoid re-computation.
1. If more powerful hardware is available, use it.
CPU. Most of spaCy's work at runtime consists of CPU instructions that allocate memory, assign values to memory and perform computations; in terms of speed this work is CPU-bound rather than RAM-bound, so performance depends predominantly on the CPU. Opting for a better CPU rather than more RAM is therefore the smarter choice in most situations. As a general rule, newer CPUs with higher frequencies, more cores/threads, more cache etc. will realise faster spaCy processing times. However, simply comparing these numbers between different CPU architectures is not useful. Instead, look at benchmarks like cpu.userbenchmark.com (e.g. i5-12600K vs. Ryzen 9 5900X) and compare the single-core and multi-core performance of prospective CPUs to find those that will likely offer better performance. See Footnote (1) on hyper-threading and core/thread counts.
RAM. The practical consideration for RAM is size: larger texts require more memory capacity, while speed and latency are less important. If you have limited RAM capacity, disable NER and the parser when creating your Doc for large input text (e.g. doc = nlp("My really long text", disable = ['ner', 'parser'])). If you require these parts of the pipeline, you'll only be able to process approximately 100,000 * available_RAM_in_GB characters at a time; if you don't, you'll be able to process more than this. Note that the default spaCy input text limit is 1,000,000 characters; however, this can be changed by setting nlp.max_length = your_desired_length.
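For example, here is a minimal sketch of processing one long text with the limit raised and the memory-hungry components disabled for that call (the file name is made up for illustration):
import spacy
nlp = spacy.load("en_core_web_sm")
long_text = open("big_corpus.txt", encoding="utf8").read()  # hypothetical large input
nlp.max_length = len(long_text) + 1  # raise the default 1,000,000-character limit
doc = nlp(long_text, disable=["ner", "parser"])  # skip the memory-hungry components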
GPU. If you opt to use a GPU, processing times can be improved for certain aspects of the pipeline which make use of GPU-based computations. See the section below on making use of your GPU. The same general rule as with CPUs applies here too: generally, newer GPUs with higher frequencies, more memory, larger memory bus widths, bigger bandwidth etc. will realise faster spaCy processing times.
Overclocking. If you're experienced with overclocking and have the correct hardware to be able to do it (adequate power supply, cooling, motherboard chipset), then another effective way to gain extra performance without changing hardware is to overclock your CPU/GPU.
2. Use (optimally) small models/pipelines.
When computation resources are limited, and/or accuracy is less of a concern (e.g. when experimenting or testing ideas), load spaCy pipelines that are efficiency focused (i.e. those with smaller models). For example:
# Load a "smaller" pipeline for faster processing
nlp = spacy.load("en_core_web_sm")
# Load a "larger" pipeline for more accuracy
nlp = spacy.load("en_core_web_trf")
As a concrete example of the differences, on the same system, the smaller en_core_web_lg pipeline is able to process 10,014 words per second, whereas the en_core_web_trf pipeline only processes 684. Remember that there is often a trade-off between speed and accuracy.
3. Use your GPU if possible.
Due to the nature of neural network-based models, their computations can be performed efficiently on a GPU, leading to faster processing times. For instance, the en_core_web_lg pipeline processes 10,014 words per second on a CPU versus 14,954 on a GPU.
spaCy can be installed for a CUDA-compatible GPU (i.e. Nvidia GPUs) by calling pip install -U spacy[cuda] in the command prompt. Once a GPU-enabled spaCy installation is present, call spacy.prefer_gpu() or spacy.require_gpu() somewhere in your program before any pipelines have been loaded. Note that require_gpu() will raise an error if no GPU is available. For example:
spacy.prefer_gpu() # Or use spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")
4. Process large texts as a stream and buffer them in batches.
When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts (default is 1000), and process the texts as a stream using nlp.pipe(). For example:
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, batch_size=1000))
5. Use multiprocessing (if appropriate).
To make use of multiple CPU cores, spaCy includes built-in support for multiprocessing with nlp.pipe() using the n_process option. For example,
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, n_process=4))
Note that each process requires its own memory. Every time a new process is spawned (the default start method), the model data has to be copied into memory for that process, so the larger the model, the more overhead there is to spawn a process. Therefore, if you are only doing small tasks, it is recommended to increase the batch size and use fewer processes. For example,
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, n_process=2, batch_size=2000)) # default batch_size = 1000
Finally, multiprocessing is generally not recommended on GPUs because GPU memory is limited.
6. Use only necessary pipeline components.
Generating predictions from models in the pipeline that you don't require unnecessarily degrades performance. You can prevent this by disabling or excluding specific components, either when loading a pipeline (i.e. with spacy.load()) or during processing (i.e. with nlp.pipe()).
If you have limited memory, exclude the components you don't need, for example:
# Load the pipeline without the entity recognizer
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
If you might need a particular component later in your program, but still want to improve processing speed for tasks that don't require those components in the interim, use disable, for example:
# Load the tagger but don't enable it
nlp = spacy.load("en_core_web_sm", disable=["tagger"])
# ... perform some tasks with the pipeline that don't require the tagger
# Eventually enable the tagger
nlp.enable_pipe("tagger")
Note that the lemmatizer depends on tagger+attribute_ruler or morphologizer for a number of languages. If you disable any of these components, you’ll see lemmatizer warnings unless the lemmatizer is also disabled.
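As an illustration of the note above, if you disable the tagger, disable the lemmatizer with it to avoid the warnings:
# Disable the tagger together with the lemmatizer that depends on it
nlp = spacy.load("en_core_web_sm", disable=["tagger", "lemmatizer"])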
7. Save and load progress to avoid re-computation.
If you have been modifying the pipeline or vocabulary, updating model components, processing documents, etc., there is merit in saving your progress to reload at a later date. This requires translating the contents/structure of an object into a format that can be saved, a process known as serialization.
Serializing the pipeline
nlp = spacy.load("en_core_web_sm")
# ... some changes to pipeline
# Save serialized pipeline
nlp.to_disk("./en_my_pipeline")
# Load serialized pipeline
nlp.from_disk("./en_my_pipeline")
Serializing multiple Doc objects
The DocBin class provides an easy method for serializing/deserializing multiple Doc objects, which is also more efficient than calling Doc.to_bytes() on every Doc object. For example:
from spacy.tokens import DocBin
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts))
doc_bin = DocBin(docs=docs)
# Save the serialized DocBin to a file
doc_bin.to_disk("./data.spacy")
# Load a serialized DocBin from a file
doc_bin = DocBin().from_disk("./data.spacy")
Footnotes
(1) "Hyper-threading" is a term trademarked by Intel used to refer to their proprietary Simultaneous Multi-Threading (SMT) implementation that improves parallelisation of computations (i.e. doing multiple tasks at once). AMD has SMT as well, it just doesn't have a fancy name. In short, processors with 2-way SMT (SMT-2) allow an Operating System (OS) to treat each physical core on the processor as two cores (referred to as "virtual cores"). Processors with SMT will perform better on tasks that can make use of these multiple "cores", sometimes referred to as "threads" (e.g. the Ryzen 5600X is an 6 core/12 thread processor (i.e. 6 physical cores, but with SMT-2, it has 12 "virtual cores" or "threads")). Note that Intel has recently released a CPU architecture with e-cores, which are cores that don't have hyper-threading, despite other cores on the processor (namely, p-cores) having it, hence you will see some chips like the i5-12600k that have 10 cores with hyper-threading, but it has 16 threads not 20. This is because only the 6 p-cores have hyper-threading, while the 4 e-cores do not, hence 16 threads total.

Feed large text to PyTextRank

I would like to use PyTextRank for keyphrase extraction. How can I feed 5 million documents (each document consisting of a few paragraphs) to the package?
This is the example I see in the official tutorial.
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
doc = nlp(text)
for phrase in doc._.phrases:
ic(phrase.rank, phrase.count, phrase.text)
ic(phrase.chunks)
Is my only option to concatenate the several million documents into a single string and pass it to nlp(text)? I do not think I can use nlp.pipe(texts), as I want to create one network by computing words/phrases from all documents.
No, instead it would almost certainly be better to run these tasks in parallel. Many use cases of pytextrank have used Spark, Dask, Ray, etc., to parallelize running documents through a spaCy pipeline with pytextrank to extract entities.
For an example of parallelization with Ray, see https://github.com/Coleridge-Initiative/rclc/blob/4d5347d8d1ac2693901966d6dd6905ba14133f89/bin/index_phrases.py#L45
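As a rough, illustrative sketch (not the linked example), one way to fan batches of documents out across Ray workers, each loading its own spaCy + pytextrank pipeline, could look like this; the batch contents below are placeholders:
import ray
import spacy
import pytextrank  # registers the "textrank" component
ray.init()
@ray.remote
def extract_phrases(batch_of_texts):
    # each worker builds its own pipeline; loaded models don't ship well between processes
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")
    return [[(p.text, p.rank) for p in doc._.phrases] for doc in nlp.pipe(batch_of_texts)]
batches = [["One document.", "Another document."], ["More documents."]]  # placeholder batches
phrase_lists = ray.get([extract_phrases.remote(b) for b in batches])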
One question would be how you are associating the extracted entities with documents? Are these being collected into a dataset, or perhaps a database or key/value store?
However these results get collected, you could then construct a graph of co-occurring phrases, and also include additional semantics to help structure the results. A sister project kglab https://github.com/DerwenAI/kglab was created for these kinds of use cases. There are some examples in the Jupyter notebooks included with the kglab project; see https://derwen.ai/docs/kgl/tutorial/
FWIW, we'll have tutorials coming up at ODSC West about using kglab and pytextrank and there are several videos online (under Graph Data Science) for previous tutorials at conferences. We also have monthly public office hours through https://www.knowledgegraph.tech/ – message me #pacoid on Tw for details.

How to compute additional statistics in Evaluator?

In the TFX Evaluator, on top of the metrics described in TFMA format, I would like to compute statistics related to the performance of my model on my dataset. Naturally, I would also like a way to get access to these statistics, either through the output of the component, or by letting the component upload the statistics somewhere.
I guess that some amount of custom code would be needed (both for the computation and for returning the statistics), but I don't really know how much, or what the best way to write it would be. Any ideas on the topic?
Thanks
There are two methods by which you can achieve this, depending on where you see your functionality fitting into the TFX flow.
Writing a custom TFX component - this requires a lot of effort, and you need to define quite a few things.
Reusing existing components - instead of writing a TFX component entirely from scratch, you can inherit an existing component and customize it by overriding its executor functionality (see the sketch below).
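As a rough sketch of the second approach (module paths and hook points vary between TFX versions, so treat the names below as assumptions to verify against your installation), you could subclass the Evaluator's executor, run the standard evaluation, and then compute the extra statistics:
from tfx.components import Evaluator
from tfx.components.evaluator import executor as evaluator_executor
from tfx.dsl.components.base import executor_spec

class StatsEvaluatorExecutor(evaluator_executor.Executor):
    def Do(self, input_dict, output_dict, exec_properties):
        # run the standard TFMA evaluation first
        super().Do(input_dict, output_dict, exec_properties)
        # then compute/persist your custom statistics, e.g. next to the evaluation artifact
        eval_uri = output_dict["evaluation"][0].uri
        # ... write additional statistics under eval_uri, or upload them elsewhere ...

class StatsEvaluator(Evaluator):
    # wire the custom executor into an otherwise unchanged Evaluator
    EXECUTOR_SPEC = executor_spec.ExecutorClassSpec(StatsEvaluatorExecutor)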
I would suggest the following blogs to begin with:
Anatomy of TFX Component
Creating custom Components

Nesting pipelines in apache beam

I am looking to do the following with Apache Beam.
Specifically, pre-processing for a TensorFlow neural network.
for each file in a folder
    for each line in the file
        process the line into a 1-D list of floats
For each file, I need the result to be a 2-D list of floats.
I think I can accomplish this by creating nested pipelines.
I could create and run a pipeline inside of a ParDo of another pipeline.
This seems inefficient, but my problem seems like a pretty standard use case.
Is there a tool to do this better in apache beam?
Is there a way to restructure my problem to make it work in apache beam better?
Are nested pipelines not as bad as I think they are?
Thanks
Apache Beam is a great tool for pre-processing data for machine learning with TensorFlow. More information about this general use case and tf.Transform is available in this post.
Nothing described seems to indicate the need for "nested pipelines". Processing each line of each file in a directory is a simple TextIO.Read transformation. It is unclear what your further requirements are, but in general, splitting a line into floats and grouping lines back together are straightforward ParDo and grouping operations.
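For illustration, here is one possible shape in the Beam Python SDK that yields a (file path, 2-D list of floats) pair per file without any nesting; the file pattern and the comma-separated line format are assumptions:
import apache_beam as beam
from apache_beam.io import fileio

def file_to_matrix(readable_file):
    # one element per file: (path, list of rows, each row a list of floats)
    path = readable_file.metadata.path
    rows = [[float(x) for x in line.split(",")]
            for line in readable_file.read_utf8().splitlines() if line.strip()]
    return path, rows

with beam.Pipeline() as p:
    _ = (
        p
        | "Match files" >> fileio.MatchFiles("data/*.csv")  # hypothetical folder/pattern
        | "Read matches" >> fileio.ReadMatches()
        | "File to matrix" >> beam.Map(file_to_matrix)
        | "Print" >> beam.Map(print)  # replace with a real sink
    )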
As a general guidance, I'd avoid nested pipelines, and try to break down the problem to fit into a single pipeline.

What is the concept of CNTKTextFormatDeserializer and why use?

I am using the CNTKTextReader to read in my training and test sets. The train file is getting large (2.7 GB now, and soon to get bigger).
I don't understand what "CNTKTextFormatDeserializer" is -- the documentation I found didn't explain the big picture of what it is and why to use it; it just went into its syntax.
So, is it a way to use a binary version of these files to make them more compact?
Readers in general are just a way to make certain aspects of training easier. These include
randomization: SGD generalizes better when the data presented to it come in random order. The reader can randomize the data for you, with the shuffling happening on the fly.
distributed training: For distributed training the reader is aware of the multiple workers and can make sure they receive distinct chunks of data.
memory budget issues: The reader does not load the whole training file in memory.
language-agnostic i/o: The reader provides a cross-platform way to read data. (If you always stay in Python you might not care about this, but others do.)
The CTF format is a little verbose and indeed there is a binary format deserializer that was recently added.
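For reference, here is a minimal sketch of wiring up the CTF deserializer in Python; the file name, field names and dimensions are made up and must match your own CTF file:
from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs

reader = MinibatchSource(
    CTFDeserializer("train.ctf", StreamDefs(
        features=StreamDef(field="x", shape=784, is_sparse=False),   # hypothetical stream
        labels=StreamDef(field="y", shape=10, is_sparse=False))),    # hypothetical stream
    randomize=True)  # shuffle on the fly without loading the whole file into memory

mb = reader.next_minibatch(64)  # fetch a minibatch of 64 samples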