How to optimize SpaCy pipe for NER only (using an existing model, no training) - spacy

I am looking to use SpaCy v3 to extract named entities from a large list of sentences. What I have works, but it seems slower than it should be, and before investing in more machines, I'd like to know if I am doing more work than I need to in the pipe.
I've used nltk to parse everything into sentences as an iterator, then process these using "pipe" to get the named entities. All of this appears to work well, and Python appears to be hitting every CPU core on my machine fairly heavily, which is good.
nlp = spacy.load("en_core_web_trf")
for (doc, context) in nlp.pipe(lines, as_tuples=True, batch_size=1000):
    for ent in doc.ents:
        pass  # handle each entity
I understand that I can use nlp.disable_pipes to disable certain elements. Is there anything I can disable that won't impact accuracy and that isn't required for NER?

For NER only with the transformer model en_core_web_trf, you can disable ["tagger", "parser", "attribute_ruler", "lemmatizer"].
If you want to use a non-transformer model like en_core_web_lg (much faster but slightly lower accuracy), you can disable ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"] and use nlp.pipe(n_process=-1) for multiprocessing on all CPUs (or n_process=N to restrict to N CPUs).
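For example, a minimal sketch (lines is assumed to be the same iterable of (text, context) tuples as in the question):
import spacy

# Transformer pipeline: load with only the components NER needs
nlp = spacy.load("en_core_web_trf", disable=["tagger", "parser", "attribute_ruler", "lemmatizer"])

# Non-transformer alternative, plus multiprocessing across all CPU cores:
# nlp = spacy.load("en_core_web_lg", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
# docs = nlp.pipe(lines, as_tuples=True, batch_size=1000, n_process=-1)

for doc, context in nlp.pipe(lines, as_tuples=True, batch_size=1000):
    for ent in doc.ents:
        pass  # handle each entity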

Related

Gensim word2vec saves numpy arrays?

I am running the Word2Vec implementation from gensim twice, and I have a problem with the save function:
model_ = gensim.models.Word2Vec(all_doc, size=int(config['MODEL']['embed_size']),
                                window=int(config['MODEL']['window']),
                                workers=multiprocessing.cpu_count(),
                                sg=1, iter=int(config['MODEL']['iteration']),
                                negative=int(config['MODEL']['negative']),
                                min_count=int(config['MODEL']['min_count']), seed=int(config['MODEL']['seed']))
model_.save(config['BASIC']['embedding_dir'])
I obtain different outputs for each time I run it. The first time it gives an "output_embedding", an "output_embedding.trainables.syn1neg.npy" and an "output_embedding.wv.vectors.npy". But the second time it does not give the two npy files, it just generates "output_embedding".
The only thing I change from the first to the second time is the sentences I use as input (all_doc).
Why does it not generate the 3 files?
Gensim only creates the separate files when the size of the internal numpy arrays is over a certain threshold – so I suspect your all_doc corpus has a very small vocabulary in one case, and a more typically large vocabulary in the other.
When it does generate multiple files, be sure to keep them all together for later loads to work.
(If for some urgent reason you needed to change that behavior, the inherited .save() method takes an optional sep_limit argument to change the threshold - but I'd recommend against mucking with this.)
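For illustration only (this reuses model_ and the config path from the question; sep_limit is in bytes and the inherited default is 10 MiB):
# Example: raise the split threshold so arrays up to ~100 MiB stay inside the single file
model_.save(config['BASIC']['embedding_dir'], sep_limit=100 * 1024 * 1024)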
Separately: that your file names have .trainables. in them suggests you're using a pre-4.0.0 version of Gensim. There've been some improvements to Word2Vec & related algorithms in the latest Gensim, and some older code will need small changes to keep working, so you may want to upgrade to the latest version before building any more functionality on an older base.

Problem when predicting via multiprocess with Tensorflow

I have 4 (or more) models (same structure but different training data). Now I want to ensemble them to make a prediction. I want to pre-load the models and then predict one input message (one message at a time) in parallel via multiprocessing. However, the program always stops at the "session.run" step, and I could not figure out why.
I tried passing all arguments to the function in each process, as shown in the code below. I also tried using a Queue object and putting all the data (except the model object) in the queue. I also tried setting the number of processes to 1. It made no difference.
with Manager() as manager:
    first_level_test_features = manager.list()
    procs = []
    for id in range(4):
        p = Process(target=predict, args=(id, (message, models, configs, vocabs, emoji_dict, first_level_test_features)))
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
I did not get any error message since it is just stuck there. I would expect the program to start multiple processes, with each process using the model passed to it to make the prediction.
I am unsure how session sharing across different Processes would work, and this is probably where your issue comes from. Given the way TensorFlow works, I would advise implementing the ensemble call as a graph operation, so that it can be run through a single session.run call, with TF handling the parallelization of computations wherever possible.
In practice, if you have symbolic tensors representing the models' predictions, you could use a TF operation to aggregate them (tf.concat, tf.reduce_mean, tf.add_n... whichever suits your design) and end up with a single symbolic tensor representing the ensemble prediction.
I hope this helps; if not, please provide some more details as to what your setting is, notably which form your models have.

Is it possible to train an xgboost model in Python and deploy/run it in C/C++?

How much cross compatibility is there between the different language APIs?
For example, is it possible to train and save a model in Python and run it in C/C++ or any other language?
I would try this myself however my skills in non-Python languages are very limited.
You can dump the model into a text file like this:
model.get_booster().dump_model('xgb_model.txt')
Then you should parse the text dump and reproduce the prediction function in C++.
I have implemented this in a little library that I call FastForest, if you want to save some time and want to make sure you use a fast implementation:
https://github.com/guitargeek/XGBoost-FastForest
The mission of the library is to be:
Easy: deploying your xgboost model should be as painless as it can be
Fast: thanks to efficient structure-of-array data structures for storing the trees, this library goes very easy on your CPU and memory (it is about 3 to 5 times faster than xgboost in prediction)
Safe: the FastForest objects are immutable, and therefore they are an excellent choice in multithreading environments
Portable: FastForest has no dependency other than the C++ standard library
Here is a little usage example, loading the model you have dumped before and assuming the model requires 5 features:
std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4"};
FastForest fastForest("xgb_model.txt", features);
std::vector<float> input{0.0, 0.2, 0.4, 0.6, 0.8};
float output = fastForest(input.data());
When you create the FastForest you have to tell it in which order you intend to pass the features, because the text file does not store the order of the features.
Also note that the FastForest does not do the logistic transformation for you, so in order to reproduce predict_proba() you need to apply the logistic transformation:
float proba = 1./(1. + std::exp(-output));
The treelite package (research paper, documentation) enables compilation of tree-based models, including XGBoost, into optimized C code, making inference much faster than with native model libraries.
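A rough sketch of that workflow (the treelite API has changed across releases; this roughly follows the 1.x/2.x interface, and the file names are just placeholders, so check the docs for your installed version):
import numpy as np
import treelite
import treelite_runtime  # separate runtime package in treelite 1.x/2.x

# Load a saved XGBoost model and compile it into a shared library of plain C code
model = treelite.Model.load('xgb_model.bin', model_format='xgboost')
model.export_lib(toolchain='gcc', libpath='./xgb_model.so', verbose=True)

# The generated C code can be used from C/C++ directly, or via the Python runtime:
predictor = treelite_runtime.Predictor('./xgb_model.so')
X = np.random.rand(10, 5).astype('float32')   # dummy batch of 10 rows, 5 features
out = predictor.predict(treelite_runtime.DMatrix(X))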
You could consider dumping your model in a text file using
model.get_booster().dump_model('xgb_model.txt', with_stats=True)
then, after some parsing, you can reproduce the .predict() function in C/C++. Beyond that, I am not aware of a native port of xgboost to C.
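To make "after some parsing" concrete, here is a rough Python reference sketch of walking the text dump to reproduce predict_proba() for a binary:logistic model (assuming the default base_score=0.5, so the margin is just the sum of leaf values, and default f0, f1, ... feature names); the same logic is what you would port to C/C++:
import math
import re

# Node lines in the dump look like "0:[f2<2.45] yes=1,no=2,missing=1" or "3:leaf=-0.12"
SPLIT_RE = re.compile(r"(\d+):\[f(\d+)<([^\]]+)\] yes=(\d+),no=(\d+),missing=(\d+)")
LEAF_RE = re.compile(r"(\d+):leaf=([-+0-9.eE]+)")

def parse_dump(path):
    """Parse an xgboost text dump ('booster[i]:' blocks) into a list of {node_id: node} dicts."""
    trees, nodes = [], {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("booster["):
                if nodes:
                    trees.append(nodes)
                nodes = {}
                continue
            m = SPLIT_RE.match(line)
            if m:
                nid, feat, thr, yes, no, miss = m.groups()
                nodes[int(nid)] = ("split", int(feat), float(thr), int(yes), int(no), int(miss))
                continue
            m = LEAF_RE.match(line)
            if m:
                nodes[int(m.group(1))] = ("leaf", float(m.group(2)))
    if nodes:
        trees.append(nodes)
    return trees

def tree_score(nodes, x):
    """Walk one tree for a dense feature vector x (use math.nan for missing values)."""
    nid = 0
    while True:
        node = nodes[nid]
        if node[0] == "leaf":
            return node[1]
        _, feat, thr, yes, no, miss = node
        nid = miss if math.isnan(x[feat]) else (yes if x[feat] < thr else no)

def predict_proba(trees, x):
    margin = sum(tree_score(t, x) for t in trees)
    return 1.0 / (1.0 + math.exp(-margin))   # logistic transform, as noted above

trees = parse_dump("xgb_model.txt")
print(predict_proba(trees, [0.0, 0.2, 0.4, 0.6, 0.8]))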

Spacy 2.0 en_vectors_web_lg vs en_core_web_lg

What is the difference between the word vectors given in en_core_web_lg and en_vectors_web_lg? The number of keys is different: 1.1m vs 685k. I assume this means en_vectors_web_lg has broader coverage, retaining somewhat more morphological variation and therefore more distinct tokens, since both are trained on the Common Crawl corpus but end up with different numbers of tokens.
The en_vectors_web_lg package has exactly every vector provided by the original GloVe model. The en_core_web_lg model uses the vocabulary from the v1.x en_core_web_lg model, which from memory pruned out all entries which occurred fewer than 10 times in a 10 billion word dump of Reddit comments.
In theory, most of the vectors that were removed should be things that the spaCy tokenizer never produces. However, in earlier experiments the full GloVe vectors did score slightly higher on NER than the current model, so it's possible we're actually missing out on something by losing the extra vectors. I'll do more experiments on this, and likely switch the lg model to include the unpruned vector table, especially now that we have the md model, which strikes a better compromise than the current lg package.
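A quick way to inspect the two tables yourself (a sketch, assuming both packages are installed under spaCy 2.x):
import spacy

core = spacy.load("en_core_web_lg")
vectors = spacy.load("en_vectors_web_lg")

# Compare the number of keys and the shape (rows x dims) of each vector table
print(core.vocab.vectors.n_keys, core.vocab.vectors.shape)
print(vectors.vocab.vectors.n_keys, vectors.vocab.vectors.shape)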

Word2Vec: Any way to train the model faster?

I use Gensim Word2Vec to train word sets in my database.
I have about 400,000 phrases (each phrase is short; about 700 MB in total) in my PostgreSQL database.
This is how I train these data using Django ORM:
post_vector_list = []
for post in Post.objects.all():
    post_vector = my_tokenizer(post.category.name)
    post_vector.extend(my_tokenizer(post.title))
    post_vector.extend(my_tokenizer(post.contents))
    post_vector_list.append(post_vector)
word2vec_model = gensim.models.Word2Vec(post_vector_list, window=10, min_count=2, size=300)
But this job takes a lot of time and does not feel efficient.
In particular, the part that creates post_vector_list took a lot of time and memory.
I want to improve the training speed but have no idea how.
I'd appreciate your advice. Thanks.
To optimize such code, you need to collect good information about where the time is spent.
Is most of the time spent preparing post_vector_list?
If so, you will want to make sure my_tokenizer (whose code is not shown) is as efficient as possible. You may want to try to minimize the number of extend()s and append()s that are done on large lists. You might even have to take a look at your DB's configuration or options to speed up the DB-to-object mapping started inside Post.objects.all().
Is most of the time spent in the call to Word2Vec()?
If so, other steps may help (a combined sketch follows this list):
ensure you're using gensim's Cython-optimized routines – if not, you should be seeing a logged warning (and training will be up to 100X slower)
consider using a workers=4 or workers=8 optional argument to use more threads, if your machine has at least 4 or 8 CPU cores
consider using a larger min_count, which speeds training somewhat (and since vectors for words where there are only a few examples typically aren't very good anyway, doesn't lose much and can even improve the quality of the surviving words)
consider using a smaller window, since training takes longer for larger windows
consider using a smaller vector_size (previously called size), since training takes longer for larger-size vectors
consider using a more-aggressive (smaller) value for the optional sample argument, which randomly skips more of the most-frequent words. The default is 1e-03, but values of 1e-05 or 1e-06 (especially on larger corpora) can offer additional speedup, and often even improve the final vectors (by spending relatively less training time on words with an excess of usage examples)
consider using a lower-than-default (5) value for the optional epochs parameter (previously called iter). (I wouldn't recommend this unless the corpus is very large – so it already has many redundant, equally-good examples of the same words throughout.)
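Putting several of those knobs together, a hedged sketch (values are only illustrative; parameter names follow Gensim 4.x, i.e. vector_size/epochs rather than size/iter):
import multiprocessing
import gensim

word2vec_model = gensim.models.Word2Vec(
    post_vector_list,                     # or a streaming corpus, as in the next answer
    vector_size=300,                      # smaller vectors train faster
    window=5,                             # smaller window -> faster training
    min_count=5,                          # drop rare words: faster, often better vectors
    sample=1e-5,                          # more aggressive downsampling of frequent words
    workers=multiprocessing.cpu_count(),  # use more threads
    epochs=5,                             # the default; lower only for very large corpora
)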
You could use a Python generator (more precisely, a restartable iterable like the class below, since training makes multiple passes) instead of loading all the data into a list. Gensim works with streamed corpora too. The code will look something like this:
class Post_Vectors(object):
    def __init__(self, Post):
        self.Post = Post

    def __iter__(self):
        for post in self.Post.objects.all():
            post_vector = my_tokenizer(post.category.name)
            post_vector.extend(my_tokenizer(post.title))
            post_vector.extend(my_tokenizer(post.contents))
            yield post_vector

post_vectors = Post_Vectors(Post)
word2vec_model = gensim.models.Word2Vec(post_vectors, window=10, min_count=2, size=300, workers=??)
For the gensim speedup, if you have a multi-core CPU, you could use the workers parameter. (By default it is 3)