I have been trying to understand how to systematically make Spacy run as fast as possible for a long time and I would like this post to become a wiki-style public post if possible.
Here is what I currently know, with subsidiary questions on each point:
1. spaCy will run faster on faster hardware. For example, try a computer with more CPU cores, or more RAM/primary memory.
What I do not know:
What specific aspects of the execution of Spacy - especially the main one of instantiating the Doc object - depend more on CPU vs. RAM and why?
Is the instantiation of a Doc object a sequence of arithmetical calculations (the compiled binary of the neural networks), so the more CPU cores, the more calculations can be done at once, therefore faster? Does that mean increasing RAM would not make this process faster?
Are there any other aspects of CPUs or GPUs to watch out for, other than cores, that would make one chip better than another, for Spacy? Someone mentioned "hyper threading".
Is there any standard mathematical estimate of time per pipeline component, such as the parser, relative to input string length? For example: parser seconds = number of characters in input / number of CPU cores?
2. You can make Spacy run faster by removing components you don't need, for example by nlp = spacy.load("en_core_web_sm", disable=['tagger', 'ner', 'lemmatizer', 'textcat'])
Just loading the Spacy module itself with import spacy is slightly slow. If you haven't even loaded the language model yet, what are the most significant things being loaded here, apart from just adding functions to the namespace? Is it possible to only load a part of the module you need?
3. You can make Spacy faster by using certain options that simply make it run faster.
I have read about multiprocessing with nlp.pipe, n_process, batch_size and joblib, but that's for multiple documents and I'm only doing a single document right now.
4. You can make Spacy faster by minimising the number of times it has to perform the same operations.
You can keep Spacy alive on a server and pass processing commands to it when you need to
You can serialize a Doc to reload it later, and you can further exclude attributes you don't need with doc.to_bytes(exclude=["tensor"]) or doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
5. Anything else?
Checklist
The following checklist focuses on runtime performance optimization rather than training (i.e. using existing config.cfg files loaded with the convenience wrapper spacy.load(), instead of training your own models and creating a new config.cfg file); however, most of the points still apply to training. This list is not comprehensive: the spaCy library is extensive and there are many ways to build pipelines and carry out tasks, so including all cases here is impractical. Regardless, this list is intended as a handy reference and starting point.
Summary
If more powerful hardware is available, use it.
Use (optimally) small models/pipelines.
Use your GPU if possible.
Process large texts as a stream and buffer them in batches.
Use multiprocessing (if appropriate).
Use only necessary pipeline components.
Save and load progress to avoid re-computation.
1. If more powerful hardware is available, use it.
CPU. Most of spaCy's work at runtime consists of CPU instructions that allocate memory, assign values to memory and perform computations. In terms of speed, this work is CPU-bound rather than RAM-bound, so performance depends predominantly on the CPU, and opting for a better CPU rather than more RAM is the smarter choice in most situations. As a general rule, newer CPUs with higher frequencies, more cores/threads, more cache etc. will realise faster spaCy processing times. However, simply comparing these numbers between different CPU architectures is not useful. Instead, look at benchmarks like cpu.userbenchmark.com (e.g. i5-12600K vs. Ryzen 9 5900X) and compare the single-core and multi-core performance of prospective CPUs to find those that will likely offer better performance. See Footnote (1) on hyperthreading & core/thread counts.
RAM. The practical consideration for RAM is its size: larger texts require more memory capacity, while speed and latency are less important. If you have limited RAM capacity, disable NER and the parser when creating your Doc for large input text (e.g. doc = nlp("My really long text", disable=['ner', 'parser'])). If you require these parts of the pipeline, you'll only be able to process approximately 100,000 * available_RAM_in_GB characters at a time; if you don't, you'll be able to process more than this. Note that the default spaCy input text limit is 1,000,000 characters; this can be changed by setting nlp.max_length = your_desired_length.
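The rule of thumb above can be turned into a small helper for sizing input. This is only a sketch: the helper names are hypothetical, the 100,000 characters per GB figure is the estimate quoted above rather than a guarantee, and splitting on whitespace is an assumption that may cut sentences in half.

```python
from typing import List

# Rule of thumb quoted above: roughly 100,000 characters per GB of RAM
# when the parser and NER are enabled. An estimate, not a guarantee.
CHARS_PER_GB = 100_000


def max_chars_for_ram(available_ram_gb: float) -> int:
    """Estimate how many characters fit in one Doc for the given free RAM."""
    return int(CHARS_PER_GB * available_ram_gb)


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Split text into pieces of at most max_chars, preferring whitespace breaks."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so words are not cut in half.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end])
        start = end
    return chunks


print(max_chars_for_ram(8))
print(split_into_chunks("aaa bbb ccc", 5))
```

Each chunk could then be passed to the pipeline separately; joining the chunks gives back the original text unchanged.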
GPU. If you opt to use a GPU, processing times can be improved for certain aspects of the pipeline which make use of GPU-based computations. See the section below on making use of your GPU. The same general rule as with CPUs applies here too: generally, newer GPUs with higher frequencies, more memory, larger memory bus widths, bigger bandwidth etc. will realise faster spaCy processing times.
Overclocking. If you're experienced with overclocking and have the correct hardware to be able to do it (adequate power supply, cooling, motherboard chipset), then another effective way to gain extra performance without changing hardware is to overclock your CPU/GPU.
2. Use (optimally) small models/pipelines.
When computation resources are limited, and/or accuracy is less of a concern (e.g. when experimenting or testing ideas), load spaCy pipelines that are efficiency focused (i.e. those with smaller models). For example:
import spacy
# Load a "smaller" pipeline for faster processing
nlp = spacy.load("en_core_web_sm")
# Load a "larger" pipeline for more accuracy
nlp = spacy.load("en_core_web_trf")
As a concrete example of the differences, on the same system, the en_core_web_lg pipeline is able to process 10,014 words per second, whereas the transformer-based en_core_web_trf pipeline processes only 684. Remember that there is often a trade-off between speed and accuracy.
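To get a feel for what those throughput figures mean in practice, here is a back-of-the-envelope calculation. It assumes throughput stays constant regardless of document length, which is a simplification; the function name is made up for illustration.

```python
# Words-per-second figures quoted in the text above.
WORDS_PER_SECOND = {
    "en_core_web_lg": 10_014,
    "en_core_web_trf": 684,
}


def estimated_seconds(n_words: int, pipeline: str) -> float:
    """Rough processing time, assuming constant throughput."""
    return n_words / WORDS_PER_SECOND[pipeline]


for name in WORDS_PER_SECOND:
    seconds = estimated_seconds(1_000_000, name)
    print(f"{name}: about {seconds:.0f} s per million words")
```

Your own numbers will differ by hardware, so it is worth timing a representative sample of your text instead of relying on published figures.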
3. Use your GPU if possible.
Due to the nature of neural network-based models, their computations can be performed efficiently on a GPU, leading to faster processing. For instance, the en_core_web_lg pipeline can process 10,014 words per second on a CPU vs. 14,954 on a GPU.
spaCy can be installed for a CUDA-compatible GPU (i.e. Nvidia GPUs) by calling pip install -U spacy[cuda] in the command prompt. Once a GPU-enabled spaCy installation is present, you can call spacy.prefer_gpu() or spacy.require_gpu() somewhere in your program before any pipelines have been loaded. Note that require_gpu() will raise an error if no GPU is available. For example:
spacy.prefer_gpu() # Or use spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")
4. Process large texts as a stream and buffer them in batches.
When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts (default is 1000), and process the texts as a stream using nlp.pipe(). For example:
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, batch_size=1000))
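Conceptually, buffering a stream into batches looks something like the generator below. This is a plain-Python illustration of the idea, not spaCy's actual implementation, and the function name is made up.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to batch_size items without loading the whole stream."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


texts = (f"Document number {i}." for i in range(5))  # a lazy stream
for batch in batched(texts, batch_size=2):
    print(batch)
```

Because the input is consumed lazily, only one batch needs to be held in memory at a time, which is the point of streaming large volumes of text.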
5. Use multiprocessing (if appropriate).
To make use of multiple CPU cores, spaCy includes built-in support for multiprocessing with nlp.pipe() using the n_process option. For example,
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, n_process=4))
Note that each process requires its own memory. With the spawn start method (the default on Windows and macOS), model data has to be copied into memory for every new process, so the larger the model, the more overhead there is in starting a process. Therefore, if you are only doing small tasks, it is recommended that you increase the batch size and use fewer processes. For example,
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, n_process=2, batch_size=2000)) # default batch_size = 1000
Finally, multiprocessing is generally not recommended on GPUs because GPU memory is limited.
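Since the per-process overhead discussed above depends on how Python starts child processes, it can help to check which start method your platform uses. A minimal sketch using only the standard library (the helper name is hypothetical):

```python
import multiprocessing


def describe_start_method() -> str:
    """Report the multiprocessing start method in use on this platform."""
    method = multiprocessing.get_start_method()
    if method == "fork":
        note = "children initially share the parent's memory (copy-on-write)"
    else:
        note = "children start fresh and receive their own copy of the data"
    return f"{method}: {note}"


print(describe_start_method())
```

If this reports spawn, expect a noticeable start-up cost per process with large models, which is where fewer processes and larger batches tend to help.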
6. Use only necessary pipeline components.
Generating predictions from models in the pipeline that you don't require unnecessarily degrades performance. One can prevent this by either disabling or excluding specific components, either when loading a pipeline (i.e. with spacy.load()) or during processing (i.e. with nlp.pipe()).
If you have limited memory, exclude the components you don't need, for example:
# Load the pipeline without the entity recognizer
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
If you might need a particular component later in your program, but still want to improve processing speed for tasks that don't require those components in the interim, use disable, for example:
# Load the tagger but don't enable it
nlp = spacy.load("en_core_web_sm", disable=["tagger"])
# ... perform some tasks with the pipeline that don't require the tagger
# Eventually enable the tagger
nlp.enable_pipe("tagger")
Note that the lemmatizer depends on tagger+attribute_ruler or morphologizer for a number of languages. If you disable any of these components, you’ll see lemmatizer warnings unless the lemmatizer is also disabled.
7. Save and load progress to avoid re-computation.
If one has been modifying the pipeline or vocabulary, made updates to model components, processed documents etc., there is merit in saving one's progress to reload at a later date. This requires one to translate the contents/structure of an object into a format that can be saved -- a process known as serialization.
Serializing the pipeline
nlp = spacy.load("en_core_web_sm")
# ... some changes to pipeline
# Save serialized pipeline
nlp.to_disk("./en_my_pipeline")
# Load serialized pipeline
nlp.from_disk("./en_my_pipeline")
Serializing multiple Doc objects
The DocBin class provides an easy method for serializing/deserializing multiple Doc objects, which is also more efficient than calling Doc.to_bytes() on every Doc object. For example:
import spacy
from spacy.tokens import DocBin
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts))
doc_bin = DocBin(docs=docs)
# Save the serialized DocBin to a file
doc_bin.to_disk("./data.spacy")
# Load a serialized DocBin from a file
doc_bin = DocBin().from_disk("./data.spacy")
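The same "compute once, reload later" idea can be wrapped in a small on-disk cache keyed by the input text. This is a generic sketch: cached_result is a hypothetical helper, and pickle stands in for whatever serialization you actually use (with spaCy that would more likely be the DocBin or Doc.to_bytes() output shown above).

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable


def cached_result(text: str, process: Callable[[str], Any], cache_dir: Path) -> Any:
    """Return process(text), reusing a saved result if this text was seen before."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the text so the cache key is a safe, fixed-length file name.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.pkl"
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    result = process(text)
    with path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

Only load cache files you created yourself, since unpickling untrusted data is unsafe.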
Footnotes
(1) "Hyper-threading" is Intel's trademarked name for its proprietary Simultaneous Multi-Threading (SMT) implementation, which improves parallelisation of computations (i.e. doing multiple tasks at once). AMD has SMT as well, it just doesn't have a fancy name. In short, processors with 2-way SMT (SMT-2) allow an Operating System (OS) to treat each physical core on the processor as two cores (referred to as "virtual cores" or "threads"). Processors with SMT will perform better on tasks that can make use of these multiple "cores". For example, the Ryzen 5600X is a 6-core/12-thread processor: 6 physical cores which, with SMT-2, give 12 "virtual cores" or "threads". Note that Intel has recently released a CPU architecture that mixes p-cores, which have hyper-threading, with e-cores, which do not. Hence you will see chips like the i5-12600K that have 10 cores but 16 threads rather than 20: the 6 p-cores contribute 12 threads and the 4 e-cores contribute 4.
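The thread-count arithmetic in this footnote can be written out as a short calculation. The function name is made up for illustration; os.cpu_count() is the standard-library call that reports the number of logical CPUs the OS sees.

```python
import os


def logical_threads(smt_cores: int, non_smt_cores: int = 0, smt_ways: int = 2) -> int:
    """Threads exposed to the OS: SMT cores count smt_ways times, others once."""
    return smt_cores * smt_ways + non_smt_cores


print(logical_threads(6))      # Ryzen 5600X: 6 cores with SMT-2
print(logical_threads(6, 4))   # i5-12600K: 6 p-cores with SMT-2 plus 4 e-cores
print(os.cpu_count())          # logical CPUs on the machine running this
```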
Related
Does it make sense to implement own branch prediction optimization in own VM interpreter or it is enough to run VM on hardware that already has branch prediction optimization support?
It could make sense in a limited sense.
For example, in a JIT compiler, when generating assembly you may decide to lay out code based on the observed branch probabilities. This only needs a very simple type of predictor that knows the overall probability but doesn't need to recognize any patterns. If you did recognize patterns you could do more sophisticated optimizations, e.g. a loop with an embedded branch that alternates every iteration could be unrolled 2x and the body generated to be efficient for the observed case.
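The "very simple type of predictor" for the JIT case could be as little as a pair of counters per branch. Below is a minimal sketch of that idea; the class and method names are invented for illustration, and treating an unseen branch as 50/50 is an assumption.

```python
from collections import defaultdict
from typing import Dict, Hashable


class BranchProfile:
    """Counts how often each branch was taken; no pattern recognition."""

    def __init__(self) -> None:
        self._taken: Dict[Hashable, int] = defaultdict(int)
        self._total: Dict[Hashable, int] = defaultdict(int)

    def record(self, branch_id: Hashable, taken: bool) -> None:
        self._total[branch_id] += 1
        if taken:
            self._taken[branch_id] += 1

    def probability_taken(self, branch_id: Hashable) -> float:
        total = self._total.get(branch_id, 0)
        if total == 0:
            return 0.5  # assumption: no data means no preference
        return self._taken.get(branch_id, 0) / total

    def prefer_taken_layout(self, branch_id: Hashable) -> bool:
        """True if generated code should place the taken path on the hot path."""
        return self.probability_taken(branch_id) > 0.5


profile = BranchProfile()
for outcome in [True] * 9 + [False]:
    profile.record("loop_exit_check", outcome)
print(profile.probability_taken("loop_exit_check"))
```

A profile like this would report 0.5 for a branch that strictly alternates, which is why the pattern-based optimizations mentioned above need something more than overall probabilities.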
For an interpreter it seems a bit less useful, but one can imagine some sophisticated designs that fuse some adjacent instructions together into a single operation for efficiency and this might benefit from branch prediction. Similarly an interpreter might benefit from recognizing loops.
Apparently you're talking about a VM that interprets bytecode, not hardware virtualization of a CPU.
Implement how? Branch prediction in CPUs is only needed because they're pipelined, and for speculative out-of-order execution.
None of those things make sense for interpreter software if it would create more work to implement. Software pipelining can be worth it for loops over arrays to hide load and ALU latency, especially on older in-order CPUs, but that doesn't increase the total number of instructions to be run. If you don't know for sure what needs to be done next, leave the speculation to hardware OoO exec.
Note that for a pure non-JITing interpreter, control dependencies in the guest code become data dependencies in the interpreter, while a sequence of different instructions in the guest creates a control dependency in the interpreter (to dispatch to handler functions). See How exactly R is affected by Branch Prediction?
You do potentially need to care about branch prediction in the CPU that will run your code. Recently (like Intel since Haswell), CPUs are finally not bad for that, using IT-TAGE predictors: Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore.
You don't implement branch prediction in software, but for older CPUs it was worth tuning interpreters with hardware branch prediction in mind. X86 prefetching optimizations: "computed goto" threaded code has some links, especially an article by Darek Mihocka discussing how badly it sucks for older CPUs (current at the time it was written) to have one "grand central" dispatch branch, like a single switch that every instruction-handler function returns to. That means the entire pattern of which instruction tends to follow which other instruction has to be predicted for that single branch. Without something like IT-TAGE, the prediction state for a single branch is very limited.
Tuning for older CPUs can involve putting dispatch to the next instruction at the end of each handler function, instead of returning to a single dispatch loop. But again, that's not implementing branch prediction, that's tuning for it.
This is a very generic question. What is the best way to study the basic CPU models in gem5 so that I can build my own CPU models using them? Do I need to understand the base models fully? I mean, do I need to go through the code line by line to understand the functionality of those CPU models in gem5?
If your goal is only to change the timing of different pipeline stages, you can change it in your configuration script, as the cpu models in gem5 have options. You can change instruction latencies, number of functional units, cycles between fetch/decode/execute/...
You could take a look at https://github.com/gem5/gem5/tree/master/configs/common/cores/arm where the authors of these files set some options to change the structure of a CPU core. The core still uses the detailed gem5 out-of-order CPU model, but only the parameters (sizes of structures, latencies between structures ...) are modified.
Using this as an example you could change what you want without having to fully understand the code for the detailed cpu model.
Recently I looked into reinforcement learning and there was one question bugging me that I could not find an answer for: How is training effectively done using GPUs? To my understanding constant interaction with an environment is required, which to me seems like a huge bottleneck, since this task is often non-mathematical / non-parallelizable. Yet, for example, AlphaGo uses multiple TPUs/GPUs. So how are they doing it?
Indeed, you will often have interactions with the environment in between learning steps, which will often be better off running on CPU than GPU. So, if your code for taking actions and your code for running an update / learning step are very fast (as in, for example, tabular RL algorithms), it won't be worth the effort of trying to get those on the GPU.
However, when you have a big neural network, that you need to go through whenever you select an action or run a learning step (as is the case in most of the Deep Reinforcement Learning approaches that are popular these days), the speedup of running these on GPU instead of CPU is often enough for it to be worth the effort of running them on GPU (even if it means you're quite regularly ''switching'' between CPU and GPU, and may need to copy some things from RAM to VRAM or the other way around).
When doing off-policy reinforcement learning (which means you can use transition samples generated by a "behavioral" policy, different from the one you are currently learning), an experience replay buffer is generally used. Therefore, you can grab a bunch of transitions from this large buffer and use a GPU to optimize the learning objective with SGD (cf. DQN, DDPG).
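To make the experience replay idea concrete, here is a minimal buffer in plain Python. It is a sketch only: the Transition layout and class name are assumptions, and a real implementation would typically store arrays and hand each sampled minibatch to the GPU for the SGD step.

```python
import random
from collections import deque
from typing import Deque, List, Optional, Tuple

# Assumed layout: (state, action, reward, next_state, done)
Transition = Tuple[int, int, float, int, bool]


class ReplayBuffer:
    """Fixed-size buffer; the oldest transitions are dropped when full."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self._buffer: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        self._buffer.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        """Draw a uniform random minibatch without replacement."""
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self) -> int:
        return len(self._buffer)


buffer = ReplayBuffer(capacity=1000, seed=0)
for step in range(50):
    # The environment interaction (CPU side) would produce these values.
    buffer.add((step, 0, 1.0, step + 1, False))
minibatch = buffer.sample(8)  # this is what would be sent to the GPU
print(len(minibatch))
```

The environment steps fill the buffer one transition at a time, while the learner consumes large batches, which is what keeps the GPU busy despite the sequential interaction.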
One instance of CPU-GPU hybrid approach for RL is this - https://github.com/NVlabs/GA3C.
Here, multiple CPUs are used to interact with different instances of the environment. "Trainer" and "Predictor" processes then collect the interactions using multi-process queues, and pass them to a GPU for back-propagation.
I'm interested in implementing a hierarchical softmax model that can handle large vocabularies, say on the order of 10M classes. What is the best way to do this to both be scalable to large class counts and efficient? For instance, at least one paper has shown that HS can achieve a ~25x speedup for large vocabs when using a 2-level tree where each node has sqrt(N) classes. I'm interested also in a more general version for an arbitrary depth tree with an arbitrary branching factor.
There are a few options that I see here:
1) Run tf.gather for every batch, where we gather the indices and splits. This creates problems with large batch sizes and fat trees where now the coefficients are being duplicated a lot, leading to OOM errors.
2) Similar to #1, we could use tf.embedding_lookup, which would help with OOM errors but now keeps everything on the CPU and slows things down quite a bit.
3) Use tf.map_fn with parallel_iterations=1 to process each sample separately and go back to using gather. This is much more scalable but does not really get close to the 25x speedup due to the serialization.
Is there a better way to implement HS? Are there different ways for deep and narrow vs. short and wide trees?
You mention that you want GPU-class performance ("but now keeps everything on the CPU and slows things down quite a bit") and wish to use 300-unit hidden size and 10M-word dictionaries.
This means that (assuming float32), you'll need 4 * 300 * 10M * 2 bytes = 24 GB just to store the parameters and the gradient for the output layer.
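The arithmetic can be checked with a one-line function. The names are illustrative; the factor of 2 covers one copy for the parameters and one for their gradients, as stated above.

```python
def output_layer_bytes(hidden_size: int, vocab_size: int,
                       bytes_per_value: int = 4, copies: int = 2) -> int:
    """Memory for the output-layer weights plus gradients (copies=2), in bytes."""
    return bytes_per_value * hidden_size * vocab_size * copies


total = output_layer_bytes(hidden_size=300, vocab_size=10_000_000)
print(total / 1e9, "GB")
```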
Hierarchical Softmax (HSM) doesn't reduce the memory requirements - it just speeds up the training.
Realistically, you'll need a lot more GPU memory, because you'll also need to store:
other parameters and their gradients
optimizer data, e.g. velocities in momentum training
activations and backpropagated temporary data
framework-specific overhead
Therefore, if you want to do all computation on GPUs, you'll have no choice but to distribute this layer across multiple high-memory GPUs.
However, you now have another problem:
To make this concrete, let's suppose you have a 2-level HSM with 3K classes, with 3K words per class (9M words in total). You distribute the 3K classes across 8 GPUs, so that each hosts 384 classes.
What if all target words in a batch are from the same 384 classes, i.e. they belong to the same GPU? One GPU will be doing all the work, while the other 7 wait for it.
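The imbalance can be illustrated by counting how many target words in a batch land on each GPU. This sketch assumes "3K" means 3072 (so that 8 GPUs x 384 classes fits exactly), that word IDs are laid out contiguously by class, and that classes are assigned to GPUs in contiguous blocks; none of that is specified in the original.

```python
from collections import Counter
from typing import Iterable

WORDS_PER_CLASS = 3072   # assumption: "3K" words per class
CLASSES_PER_GPU = 384    # from the example: 3K classes across 8 GPUs


def gpu_for_word(word_id: int) -> int:
    """Map a word ID to the GPU hosting its class (contiguous layout assumed)."""
    class_id = word_id // WORDS_PER_CLASS
    return class_id // CLASSES_PER_GPU


def gpu_load(batch: Iterable[int]) -> Counter:
    """Number of target words each GPU has to handle for this batch."""
    return Counter(gpu_for_word(word_id) for word_id in batch)


# Worst case: every target word belongs to a class on GPU 0.
skewed_batch = [i * WORDS_PER_CLASS for i in range(CLASSES_PER_GPU)]
print(gpu_load(skewed_batch))

# Balanced case: one target word per GPU.
balanced_batch = [g * CLASSES_PER_GPU * WORDS_PER_CLASS for g in range(8)]
print(gpu_load(balanced_batch))
```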
The problem is that even if the target words in a batch belong to different GPUs, you'll still have the same performance as in the worst-case scenario if you want to do this computation in TensorFlow. (This is because TensorFlow is a "specify-and-run" framework: the computational graph is the same for the best case and the worst case.)
"What is the best way to do this to both be scalable to large class counts and efficient?"
The above inefficiency of model parallelism (each GPU must process the whole batch) suggests that one should try to keep everything in one place.
Let us suppose that you are either implementing everything on the host, or on 1 humongous GPU.
If you are not modeling sequences, or if you are, but there is only one output for the whole sequence, then the memory overhead from copying the parameters, to which you referred, is negligible compared to the memory requirements described above:
400 == batch size << number of classes == 3K
In this case, you could simply use gather or embedding_lookup (although the copying is inefficient).
However, if you do model sequences of length, say, 100, with output at every time step, then the parameter copying becomes a big issue.
In this case, I think you'll need to drop down to C++ / CUDA C and implement this whole layer and its gradient as a custom op.
I have a quad-core computer and I use the Parallel Computing Toolbox.
I set different values for the number of workers in the parallel computing settings, for example 2, 4, 8, ...
However, no matter what I set, the average CPU usage by MATLAB is exactly 25% of total CPU usage, and none of the cores runs at 100% (all are around 10%-30%). I am using MATLAB to run an optimization problem, so I really want my quad-core computer to use all its power for the computation. Please help.
Setting a number of workers (up to 4 on a quad-core) is not enough. You also need to use a command like parfor to signal to Matlab what part of the calculation should be distributed among the workers.
I am curious about what kind of optimization you're running. Normally, optimization problems are very difficult to parallelize, since the result of every iteration depends on the previous one. However, if you want to e.g. try and fit multiple models to the data, or if you have to fit multiple data sets, then you can easily run these in parallel as opposed to sequentially.
Note that having many cores may not be sufficient in terms of resources - if performing the optimization on one worker uses k GB of RAM, performing it on n workers requires at least n*k GB of RAM.
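The n*k constraint gives a quick way to pick a worker count before starting a pool. The sketch below uses Python purely for the arithmetic (the question itself concerns MATLAB), and the function name is made up.

```python
def max_workers(cores: int, total_ram_gb: float, ram_per_worker_gb: float) -> int:
    """Largest worker count allowed by both the core count and available RAM."""
    if ram_per_worker_gb <= 0:
        raise ValueError("ram_per_worker_gb must be positive")
    by_memory = int(total_ram_gb // ram_per_worker_gb)
    # Fall back to a single (serial) worker if even one does not fit comfortably.
    return max(1, min(cores, by_memory))


print(max_workers(cores=4, total_ram_gb=16, ram_per_worker_gb=2))
print(max_workers(cores=4, total_ram_gb=16, ram_per_worker_gb=8))
```

With 16 GB of RAM and 8 GB needed per run, only two of the four cores can be kept busy, no matter how many workers are configured.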