I have been trying to understand how to systematically make Spacy run as fast as possible for a long time and I would like this post to become a wiki-style public post if possible.
Here is what I currently know, with subsidiary questions on each point:
1. Spacy will run faster on faster hardware. For example, try a computer with more CPU cores, or more RAM/primary memory.
What I do not know:
What specific aspects of the execution of Spacy - especially the main one of instantiating the Doc object - depend more on CPU vs. RAM and why?
Is the instantiation of a Doc object a sequence of arithmetical calculations (the compiled binary of the neural networks), so the more CPU cores, the more calculations can be done at once, therefore faster? Does that mean increasing RAM would not make this process faster?
Are there any other aspects of CPUs or GPUs to watch out for, other than cores, that would make one chip better than another, for Spacy? Someone mentioned "hyper threading".
Is there any standard mathematical estimate of time per pipeline component, such as parser, relative to input string length? Like Parser, seconds = number of characters in input? / number of CPU cores
2. You can make Spacy run faster by removing components you don't need, for example by nlp = spacy.load("en_core_web_sm", disable=['tagger', 'ner', 'lemmatizer', 'textcat'])
Just loading the Spacy module itself with import spacy is slightly slow. If you haven't even loaded the language model yet, what are the most significant things being loaded here, apart from just adding functions to the namespace? Is it possible to only load a part of the module you need?
3. You can make Spacy faster by using certain options that simply make it run faster.
I have read about multiprocessing with nlp.pipe, n_process, batch_size and joblib, but that's for multiple documents and I'm only doing a single document right now.
4. You can make Spacy faster by minimising the number of times it has to perform the same operations.
You can keep Spacy alive on a server and pass processing commands to it when you need to
You can serialize a Doc to reload it later, and you can further exclude attributes you don't need with doc.to_bytes(exclude=["tensor"]) or doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
5. Anything else?
Checklist
The following checklist focuses on runtime performance optimization rather than training (i.e. when one utilises existing config.cfg files loaded with the convenience wrapper spacy.load(), instead of training their own models and creating a new config.cfg file); however, most of the points still apply to training. This list is not comprehensive: the spaCy library is extensive and there are many ways to build pipelines and carry out tasks, so including all cases here is impractical. Regardless, this list is intended as a handy reference and starting point.
Summary
If more powerful hardware is available, use it.
Use (optimally) small models/pipelines.
Use your GPU if possible.
Process large texts as a stream and buffer them in batches.
Use multiprocessing (if appropriate).
Use only necessary pipeline components.
Save and load progress to avoid re-computation.
1. If more powerful hardware is available, use it.
CPU. At runtime, most of spaCy's work consists of CPU instructions that allocate memory, assign values to memory and perform computations. Its speed is therefore bound by the CPU rather than by RAM, so opting for a better CPU over more RAM is the smarter choice in most situations. As a general rule, newer CPUs with higher frequencies, more cores/threads, more cache etc. will give faster spaCy processing times. However, simply comparing these numbers between different CPU architectures is not useful. Instead, look at benchmarks like cpu.userbenchmark.com (e.g. i5-12600k vs. Ryzen 9 5900X) and compare the single-core and multi-core performance of prospective CPUs to find those that will likely offer better performance. See Footnote (1) on hyperthreading & core/thread counts.
RAM. The practical consideration for RAM is its size: larger texts require more memory capacity, while speed and latency are less important. If you have limited RAM capacity, disable NER and parser when creating your Doc for large input text (e.g. doc = nlp("My really long text", disable = ['ner', 'parser'])). If you require these parts of the pipeline, you'll only be able to process approximately 100,000 * available_RAM_in_GB characters at a time; if you don't, you'll be able to process more than this. Note that the default spaCy input text limit is 1,000,000 characters, but this can be changed by setting nlp.max_length = your_desired_length.
GPU. If you opt to use a GPU, processing times can be improved for certain aspects of the pipeline which make use of GPU-based computations. See the section below on making use of your GPU. The same general rule as with CPUs applies here too: generally, newer GPUs with higher frequencies, more memory, larger memory bus widths, bigger bandwidth etc. will realise faster spaCy processing times.
Overclocking. If you're experienced with overclocking and have the correct hardware to be able to do it (adequate power supply, cooling, motherboard chipset), then another effective way to gain extra performance without changing hardware is to overclock your CPU/GPU.
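The RAM rule of thumb above (roughly 100,000 characters per GB when the parser and NER are enabled) can be turned into a small helper. This is only a sketch: the function names estimate_max_chars and split_into_chunks are made up for illustration, and the 100,000 figure is the approximation quoted above, not a guarantee.

```python
def estimate_max_chars(available_ram_gb, chars_per_gb=100_000):
    """Rough upper bound on characters per call when parser/NER are enabled.

    chars_per_gb is the rule-of-thumb value from the text (an approximation).
    """
    return int(available_ram_gb * chars_per_gb)


def split_into_chunks(text, max_chars):
    """Split text into pieces of at most max_chars, preferring paragraph breaks."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last blank line inside the window, if there is one.
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut + 2
        chunks.append(text[start:end])
        start = end
    return chunks


budget = estimate_max_chars(available_ram_gb=8)
chunks = split_into_chunks("aaaa\n\nbbbb\n\ncccc", max_chars=7)
```

Each chunk could then be processed separately instead of raising nlp.max_length, at the cost of losing context across chunk boundaries.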
2. Use (optimally) small models/pipelines.
When computation resources are limited, and/or accuracy is less of a concern (e.g. when experimenting or testing ideas), load spaCy pipelines that are efficiency focused (i.e. those with smaller models). For example:
# Load a "smaller" pipeline for faster processing
nlp = spacy.load("en_core_web_sm")
# Load a "larger" pipeline for more accuracy
nlp = spacy.load("en_core_web_trf")
As a concrete example of the differences, on the same system, the smaller en_core_web_lg pipeline is able to process 10,014 words per second, whereas the en_core_web_trf pipeline only processes 684. Remember that there is often a trade-off between speed and accuracy.
3. Use your GPU if possible.
Due to the nature of neural network-based models, their computations can be carried out efficiently on a GPU, leading to faster processing. For instance, the en_core_web_lg pipeline can process 10,014 vs. 14,954 words per second when using a CPU vs. a GPU.
spaCy can be installed for a CUDA compatible GPU (i.e. Nvidia GPUs) by running pip install -U spacy[cuda] in the command prompt. Once a GPU-enabled spaCy installation is present, call spacy.prefer_gpu() or spacy.require_gpu() somewhere in your program before any pipelines have been loaded. Note that require_gpu() will raise an error if no GPU is available. For example:
spacy.prefer_gpu() # Or use spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")
4. Process large texts as a stream and buffer them in batches.
When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts (default is 1000), and process the texts as a stream using nlp.pipe(). For example:
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, batch_size=1000))
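To make "stream and buffer in batches" concrete, here is a standard-library sketch of the general idea. It is not spaCy's internal code; batched and read_texts are hypothetical names. Because nlp.pipe() accepts any iterable of strings, a generator like read_texts() can be passed to it directly, so the whole corpus never has to sit in a list.

```python
from itertools import islice


def batched(stream, batch_size):
    """Yield lists of up to batch_size items from any iterable, lazily."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def read_texts():
    # Hypothetical source; in practice this might read lines from a large file.
    for i in range(5):
        yield f"Document {i}."


# Only one batch is held in memory at a time.
batch_sizes = [len(batch) for batch in batched(read_texts(), 2)]
```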
5. Use multiprocessing (if appropriate).
To make use of multiple CPU cores, spaCy includes built-in support for multiprocessing with nlp.pipe() using the n_process option. For example,
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, n_process=4))
Note that each process requires its own memory. With the spawn start method (the default on macOS and Windows), model data has to be copied into memory for every new process, so the larger the model, the more overhead there is in starting a process. Therefore, if you are just doing small tasks, it is recommended that you increase the batch size and use fewer processes. For example,
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, n_process=2, batch_size=2000)) # default batch_size = 1000
Finally, multiprocessing is generally not recommended on GPUs because GPU memory is limited.
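Since the per-process overhead mentioned above depends on how your platform starts processes, it can help to check this before choosing n_process. The sketch below uses only the standard library. The helper name process_startup_info and the dictionary keys are made up for illustration.

```python
import multiprocessing as mp
import os


def process_startup_info():
    """Report how worker processes are started on this platform."""
    method = mp.get_start_method()
    return {
        "start_method": method,
        # spawn/forkserver start workers from a fresh interpreter, so model
        # data must be transferred to each one; fork lets children share the
        # parent's memory pages until they are modified.
        "fresh_interpreter_per_process": method in ("spawn", "forkserver"),
        # Logical CPUs visible to Python; an upper bound for n_process.
        "logical_cpus": os.cpu_count(),
    }


info = process_startup_info()
```

If fresh_interpreter_per_process is true and the pipeline is large, fewer processes with a larger batch_size is likely the safer starting point.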
6. Use only necessary pipeline components.
Running models in the pipeline whose predictions you don't need degrades performance unnecessarily. One can prevent this by either disabling or excluding specific components, either when loading a pipeline (i.e. with spacy.load()) or during processing (i.e. with nlp.pipe()).
If you have limited memory, exclude the components you don't need, for example:
# Load the pipeline without the entity recognizer
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
If you might need a particular component later in your program, but still want to improve processing speed for tasks that don't require those components in the interim, use disable, for example:
# Load the tagger but don't enable it
nlp = spacy.load("en_core_web_sm", disable=["tagger"])
# ... perform some tasks with the pipeline that don't require the tagger
# Eventually enable the tagger
nlp.enable_pipe("tagger")
Note that the lemmatizer depends on tagger+attribute_ruler or morphologizer for a number of languages. If you disable any of these components, you’ll see lemmatizer warnings unless the lemmatizer is also disabled.
7. Save and load progress to avoid re-computation.
If one has been modifying the pipeline or vocabulary, made updates to model components, processed documents etc., there is merit in saving one's progress to reload at a later date. This requires one to translate the contents/structure of an object into a format that can be saved -- a process known as serialization.
Serializing the pipeline
nlp = spacy.load("en_core_web_sm")
# ... some changes to pipeline
# Save serialized pipeline
nlp.to_disk("./en_my_pipeline")
# Load serialized pipeline
nlp.from_disk("./en_my_pipeline")
Serializing multiple Doc objects
The DocBin class provides an easy method for serializing/deserializing multiple Doc objects, which is also more efficient than calling Doc.to_bytes() on every Doc object. For example:
from spacy.tokens import DocBin
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts))
doc_bin = DocBin(docs=docs)
# Save the serialized DocBin to a file
doc_bin.to_disk("./data.spacy")
# Load a serialized DocBin from a file
doc_bin = DocBin().from_disk("./data.spacy")
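The same "compute once, reload later" idea can be wrapped in a small on-disk cache keyed by the input text. This is a generic standard-library sketch, not a spaCy feature: cached and expensive_analysis are hypothetical names, and expensive_analysis merely stands in for a slow pipeline call. With spaCy, the stored payload could be the bytes from Doc.to_bytes() rather than a pickled Python object.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached(compute, text, cache_dir):
    """Return compute(text), reusing a result saved on disk when one exists."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    result = compute(text)
    path.write_bytes(pickle.dumps(result))
    return result


calls = []


def expensive_analysis(text):
    # Stand-in for a slow pipeline call; records each real computation.
    calls.append(text)
    return text.split()


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached(expensive_analysis, "One document.", cache_dir)
    second = cached(expensive_analysis, "One document.", cache_dir)  # from disk
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.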
Footnotes
(1) "Hyper-threading" is a term trademarked by Intel for their proprietary Simultaneous Multi-Threading (SMT) implementation, which improves parallelisation of computations (i.e. doing multiple tasks at once). AMD has SMT as well; it just doesn't have a fancy name. In short, processors with 2-way SMT (SMT-2) allow an Operating System (OS) to treat each physical core on the processor as two cores (referred to as "virtual cores"). Processors with SMT will perform better on tasks that can make use of these multiple "cores", sometimes referred to as "threads" (e.g. the Ryzen 5600X is a 6 core/12 thread processor, i.e. 6 physical cores, but with SMT-2 it has 12 "virtual cores" or "threads"). Note that Intel has recently released a CPU architecture with e-cores, which are cores that don't have hyper-threading, even though other cores on the processor (namely, p-cores) do. Hence you will see some chips like the i5-12600k that have 10 cores and hyper-threading, yet 16 threads rather than 20. This is because only the 6 p-cores have hyper-threading, while the 4 e-cores do not, giving 16 threads in total.
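The thread arithmetic in this footnote can be checked in a couple of lines. logical_threads is a hypothetical helper name. os.cpu_count() is the standard-library call that reports the number of logical CPUs (i.e. "threads") on the current machine.

```python
import os


def logical_threads(smt_cores, non_smt_cores=0, smt_ways=2):
    """Logical CPU count for a chip with SMT cores plus optional non-SMT cores."""
    return smt_cores * smt_ways + non_smt_cores


ryzen_5600x = logical_threads(6)                  # 6 cores, all with SMT-2
i5_12600k = logical_threads(6, non_smt_cores=4)   # 6 p-cores + 4 e-cores
this_machine = os.cpu_count()                     # logical CPUs seen by Python
```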
I profiled a model that I am running and the vast majority of the time in each step (295 of 320ms) is being taken up by "device-to-device" operations (see image). I assume this means loading data from my cpu onto my gpu and back is the bottleneck.
I am running this on a single machine. The data is stored on an SSD and being fed into a GPU.
I am using tensorflow's tf.data.Dataset API and doing all the recommended things like prefetching and num_parallel_calls=tf.data.experimental.AUTOTUNE
My questions are:
(1) Is my assumption correct?
(2) How do I reduce this huge burden on my model?
Tensorboard Profiling Overview
Not a proper answer but it's something; by using tensorflow's mixed precision training I was able to reduce the "device-to-device" time to ~ 145ms. This is still an immense burden compared to everything else profiled and I'd love to be able to reduce it further.
I don't know why this helped either. I assume that mp-training means smaller numbers of bytes are being passed around so maybe that helps.
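The "fewer bytes" guess is at least consistent with simple arithmetic: a float16 tensor occupies half the memory of a float32 tensor of the same shape. The sketch below only does that arithmetic. The shape is a made-up example, and it does not show that memory copies are what the profiler is actually measuring.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element):
    """Size in bytes of a dense tensor with the given shape."""
    return prod(shape) * bytes_per_element


shape = (64, 224, 224, 3)        # hypothetical batch of images
fp32 = tensor_bytes(shape, 4)    # float32: 4 bytes per element
fp16 = tensor_bytes(shape, 2)    # float16: 2 bytes per element
ratio = fp16 / fp32
```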
Does it make sense to implement my own branch prediction optimization in my own VM interpreter, or is it enough to run the VM on hardware that already supports branch prediction?
It could make sense in a limited sense.
For example, in a JIT compiler, when generating assembly you may decide to lay out code based on the observed branch probabilities. This only needs a very simple type of predictor that knows the overall probability but doesn't need to recognize any patterns. If you did recognize patterns you could do more sophisticated optimizations, e.g. a loop with an embedded branch that alternates every iteration could be unrolled 2x, with each copy of the body made efficient for the observed case.
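As a rough illustration of that "very simple predictor", a JIT could keep per-branch counters like the ones below. The class name, the hint strings and the four-observation threshold are all made up for this sketch. A real JIT would record these counts in its profiling tier and consume them in its code generator.

```python
class BranchProfile:
    """Counts outcomes of one guest branch to guide code layout."""

    def __init__(self):
        self.taken = 0
        self.total = 0
        self.flips = 0      # how often the outcome differed from the previous one
        self._last = None

    def record(self, taken):
        if self._last is not None and taken != self._last:
            self.flips += 1
        self._last = taken
        self.taken += int(taken)
        self.total += 1

    def taken_probability(self):
        return self.taken / self.total if self.total else 0.5

    def alternates(self):
        # True when the outcome flipped on every observation so far.
        return self.total >= 4 and self.flips == self.total - 1

    def layout_hint(self):
        if self.alternates():
            return "unroll-2x"
        return "favor-taken" if self.taken_probability() >= 0.5 else "favor-not-taken"


mostly_taken = BranchProfile()
for outcome in [True] * 9 + [False]:
    mostly_taken.record(outcome)

alternating = BranchProfile()
for outcome in [True, False] * 3:
    alternating.record(outcome)
```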
For an interpreter it seems a bit less useful, but one can imagine some sophisticated designs that fuse some adjacent instructions together into a single operation for efficiency and this might benefit from branch prediction. Similarly an interpreter might benefit from recognizing loops.
Apparently you're talking about a VM that interprets bytecode, not hardware virtualization of a CPU.
Implement how? Branch prediction in CPUs is only needed because they're pipelined, and for speculative out-of-order execution.
None of those things make sense for interpreter software if it would create more work to implement. Software pipelining can be worth it for loops over arrays to hide load and ALU latency, especially on older in-order CPUs, but that doesn't increase the total number of instructions to be run. If you don't know for sure what needs to be done next, leave the speculation to hardware OoO exec.
Note that for a pure non-JITing interpreter, control dependencies in the guest code become data dependencies in the interpreter, while a sequence of different instructions in the guest creates a control dependency in the interpreter (to dispatch to handler functions). See How exactly R is affected by Branch Prediction?
You do potentially need to care about branch prediction in the CPU that will run your code. Recently (like Intel since Haswell), CPUs are finally not bad for that, using IT-TAGE predictors: Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore.
You don't implement branch prediction in software, but for older CPUs it was worth tuning interpreters with hardware branch prediction in mind. X86 prefetching optimizations: "computed goto" threaded code has some links, especially an article by Darek Mihocka discussing how badly it sucks for older CPUs (current at the time it was written) to have one "grand central" dispatch branch, like a single switch that every instruction-handler function returns to. That means the entire pattern of which instruction tends to follow which other instruction has to be predicted for that single branch. Without something like IT-TAGE, the prediction state for a single branch is very limited.
Tuning for older CPUs can involve putting dispatch to the next instruction at the end of each handler function, instead of returning to a single dispatch loop. But again, that's not implementing branch prediction, that's tuning for it.
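To make the terminology concrete, here is a toy bytecode interpreter with a single central dispatch point. Python is used only to show the structure. The branch-prediction effects discussed above apply to interpreters compiled to native code, and the opcode set here is invented for the example.

```python
PUSH, ADD, SUB, JNZ, HALT = range(5)


def run(code):
    """Interpret a tiny stack-machine program and return the final stack."""
    stack, pc = [], 0
    while True:
        op = code[pc]
        # Central dispatch: in a native interpreter this is the one indirect
        # branch that must predict which guest instruction comes next.
        if op == PUSH:
            stack.append(code[pc + 1])
            pc += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == SUB:
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
            pc += 1
        elif op == JNZ:
            # A guest branch only selects the next value of pc: a control
            # dependency in the guest becomes a data dependency here.
            pc = code[pc + 1] if stack[-1] != 0 else pc + 2
        elif op == HALT:
            return stack
        else:
            raise ValueError(f"unknown opcode {op}")


# Count down from 3: push 3, then repeatedly subtract 1 until the top is 0.
program = [PUSH, 3, PUSH, 1, SUB, JNZ, 2, HALT]
result = run(program)
```

The tuning described above would replace the shared dispatch at the top of the loop with a separate dispatch at the end of each handler, which plain Python cannot express directly.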
Recently I looked into reinforcement learning and there was one question bugging me that I could not find an answer to: how is training done effectively using GPUs? To my understanding, constant interaction with an environment is required, which to me seems like a huge bottleneck, since this task is often non-mathematical / non-parallelizable. Yet AlphaGo, for example, uses multiple TPUs/GPUs. So how are they doing it?
Indeed, you will often have interactions with the environment in between learning steps, which will often be better off running on CPU than GPU. So, if your code for taking actions and your code for running an update / learning step are very fast (as in, for example, tabular RL algorithms), it won't be worth the effort of trying to get those on the GPU.
However, when you have a big neural network, that you need to go through whenever you select an action or run a learning step (as is the case in most of the Deep Reinforcement Learning approaches that are popular these days), the speedup of running these on GPU instead of CPU is often enough for it to be worth the effort of running them on GPU (even if it means you're quite regularly ''switching'' between CPU and GPU, and may need to copy some things from RAM to VRAM or the other way around).
When doing off-policy reinforcement learning (which means you can use transition samples generated by a "behavioral" policy, different from the one you are currently learning), an experience replay buffer is generally used. You can therefore grab a batch of transitions from this large buffer and use a GPU to optimize the learning objective with SGD (cf. DQN, DDPG).
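A minimal replay buffer can be sketched with the standard library alone. This is an illustrative version, not the implementation used by DQN or DDPG. The class name and the transition layout (state, action, reward, next_state, done) are assumptions. In practice, each sampled batch would be converted to tensors and sent to the GPU for one SGD step.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions.
        return rng.sample(list(self.transitions), batch_size)

    def __len__(self):
        return len(self.transitions)


buffer = ReplayBuffer(capacity=100)
for step in range(150):
    # Dummy transitions: the state is just the step index.
    buffer.add(step, 0, 1.0, step + 1, False)

batch = buffer.sample(32, rng=random.Random(0))
```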
One instance of CPU-GPU hybrid approach for RL is this - https://github.com/NVlabs/GA3C.
Here, multiple CPUs are used to interact with different instances of the environment. "Trainer" and "Predictor" processes then collect the interactions using multi-process queues, and pass them to a GPU for back-propagation.
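The queue-based layout can be imitated in a few lines. This sketch is not GA3C's code: it uses threads instead of processes so that it stays self-contained, it covers only the trainer side, and the transition contents are dummies. The place where a GPU update would run is marked with a comment.

```python
import queue
import threading


def actor(actor_id, steps, out_queue):
    """Simulate one environment worker producing transitions."""
    for step in range(steps):
        out_queue.put((actor_id, step, float(step)))  # dummy transition
    out_queue.put(None)  # sentinel: this actor has finished


def trainer(in_queue, n_actors, batch_size):
    """Collect transitions from all actors and group them into batches."""
    batches, current, finished = [], [], 0
    while finished < n_actors:
        item = in_queue.get()
        if item is None:
            finished += 1
            continue
        current.append(item)
        if len(current) == batch_size:
            batches.append(current)  # a GPU training step would run here
            current = []
    if current:
        batches.append(current)
    return batches


experience_queue = queue.Queue()
actors = [
    threading.Thread(target=actor, args=(i, 10, experience_queue))
    for i in range(4)
]
for thread in actors:
    thread.start()
batches = trainer(experience_queue, n_actors=4, batch_size=8)
for thread in actors:
    thread.join()
```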
I have been playing around with building some deep learning models in Python and now I have a couple of outcomes I would like to be able to show friends and family.
Unfortunately(?), most of my friends and family aren't really up to the task of installing any of the advanced frameworks that are more or less necessary to have when creating these networks, so I can't just send them my scripts in the present state and hope to have them run.
But then again, I have already created the nets, and just using the finished product is considerably less demanding than making it. We don't need advanced graph compilers or GPU compute powers for the show and tell. We just need the ability to make a few matrix multiplications.
"Just" being a weasel word, regrettably. What I would like to do is convert the whole model (connectivity, functions and parameters) to a model expressed in e.g. regular Numpy (which, though not part of the standard library, is both much easier to install and easier to bundle reliably with a script).
I fail to find any ready solutions to do this. (I find it difficult to pick specific keywords on it for a search engine). But it seems to me that I can't be the first guy who wants to use a ready-made deep learning model on a lower-spec machine operated by people who aren't necessarily inclined to spend months learning how to set the parameters in an artificial neural network.
Are there established ways of transferring a model from e.g. Theano to Numpy?
I'm not necessarily requesting those specific libraries. The main point is I want to go from a GPU-capable framework in the creation phase to one that is trivial to install or bundle in the usage phase, to alleviate or eliminate the threshold the dependencies create for users without extensive technical experience.
An interesting option for you would be to deploy your project to Heroku, as explained on this page:
https://github.com/sugyan/tensorflow-mnist
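If hosting is not an option, the "few matrix multiplications" idea from the question can also be done by hand for a plain feed-forward network. The sketch below is only an illustration: the JSON layout, the layer list and the weights are made up, and it assumes you have exported each layer's weights and biases from the training framework yourself. It is written in pure Python, so nothing needs installing. With NumPy, the loop in dense would collapse to inputs @ weights + biases.

```python
import json
import math

ACTIVATIONS = {
    "identity": lambda x: x,
    "relu": lambda x: max(0.0, x),
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
}


def dense(inputs, weights, biases, activation):
    """One fully connected layer; weights has shape [n_inputs][n_outputs]."""
    outputs = []
    for column, bias in zip(zip(*weights), biases):
        total = sum(x * w for x, w in zip(inputs, column)) + bias
        outputs.append(activation(total))
    return outputs


def predict(model, inputs):
    """Run the inputs through every layer of the exported model in order."""
    values = inputs
    for layer in model["layers"]:
        activation = ACTIVATIONS[layer["activation"]]
        values = dense(values, layer["weights"], layer["biases"], activation)
    return values


# Hypothetical export of a tiny 2-2-1 network.
MODEL_JSON = """
{"layers": [
  {"weights": [[1.0, -1.0], [0.5, 2.0]], "biases": [0.0, 1.0], "activation": "relu"},
  {"weights": [[1.0], [-1.0]], "biases": [0.5], "activation": "identity"}
]}
"""

model = json.loads(MODEL_JSON)
output = predict(model, [2.0, 1.0])
```

Convolutional or recurrent layers would need their own hand-written forward functions, so this approach is most practical for small dense networks.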