Spacy - erroneous config.file - config

While training NER with custom labels, I created a .json file in exactly the same format as the example, but with my own data.
Then I tried to convert it (both train/dev) to the binary format needed for training using the command:
python -m spacy convert train.json ./ -t spacy
which did result in creating 2 files.
The error I got while launching the training process:
[E923] It looks like there is no proper sample data to initialize the Model of component 'ner'. To check your input data paths and annotation, run: python -m spacy debug data config.cfg
The debug command output is the same.

The problem is that there are overlapping entities. For each word there should be only one tag.
One way to find the affected texts (code adapted from spacy_convert_script):
import warnings
import srsly
import spacy
from spacy.tokens import DocBin

for f in ["train.json", "dev.json"]:
    nlp = spacy.blank("en")
    db = DocBin()
    for text, annot in srsly.read_json(f):
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is None:
                msg = f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{doc.text[start:end]}' does not align with token boundaries:\n\n{repr(text)}\n"
                warnings.warn(msg)
            else:
                ents.append(span)
        try:
            doc.ents = ents  # raises ValueError when entities overlap
        except ValueError:
            print(doc.text, ents)  # see which texts cause the problem
            continue
        db.add(doc)
    db.to_disk(f.replace(".json", ".spacy"))
This simply skips the texts that cause problems. To keep one of the overlapping entities instead, replace the inner loop with:
        x = 0  # end offset of the last entity that was kept
        for start, end, label in annot["entities"]:  # assumes entities are sorted by start
            span = doc.char_span(start, end, label=label)
            if span is None:
                msg = f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{doc.text[start:end]}' does not align with token boundaries:\n\n{repr(text)}\n"
                warnings.warn(msg)
            elif start >= x:  # keep only entities that begin after the previous one ends
                ents.append(span)
                x = end
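
As a rough sketch of the same idea outside spaCy, overlapping annotations can also be removed from the raw (start, end, label) tuples before conversion. The helper name drop_overlaps and the longest-span-first preference are assumptions made for illustration and are not part of the convert script; spaCy itself ships spacy.util.filter_spans, which applies a similar longest-first rule to Span objects.

```python
def drop_overlaps(entities):
    """Keep a non-overlapping subset of (start, end, label) tuples.

    Longer spans win; ties go to the span that starts first.
    End offsets are exclusive, as in spaCy's character offsets.
    """
    # Longest first, then earliest start.
    ranked = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ranked:
        # Two half-open ranges are disjoint when one ends before the other starts.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


annotations = [(0, 8, "ORG"), (0, 14, "ORG"), (18, 24, "GPE")]
print(drop_overlaps(annotations))
```

Running the annotations through such a filter before building the DocBin means doc.ents never sees two entities covering the same token.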


How to load customized NER model from disk with SpaCy?

I have customized an NER pipeline with the following procedure:
import random
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_lg")

doc = nlp("I am going to Vallila. I am going to Sörnäinen.")
for ent in doc.ents:
    print(ent.text, ent.label_)

TRAIN_DATA = [
    ('We need to deliver it to Vallila', {
        'entities': [(25, 32, 'DISTRICT')]
    }),
    ('We need to deliver it to somewhere', {
        'entities': []
    }),
]

ner = nlp.get_pipe("ner")
ner.add_label("DISTRICT")  # register the new label before updating
optimizer = nlp.get_pipe("ner").create_optimizer()

for i in range(25):
    random.shuffle(TRAIN_DATA)
    for text, annotation in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotation)
        nlp.update([example], sgd=optimizer)
I tried to save that customized NER to disk and load it again with the following code:
ner.to_disk('/home/feru/ner')

import spacy
from spacy.pipeline import EntityRecognizer
nlp = spacy.load("en_core_web_lg", disable=['ner'])
ner = EntityRecognizer(nlp.vocab)
ner.from_disk('/home/feru/ner')
nlp.add_pipe(ner)
However, I got the following error:
---> 10 ner = EntityRecognizer(nlp.vocab)
11 ner.from_disk('/home/feru/ner')
12 nlp.add_pipe(ner)
~/.local/lib/python3.8/site-packages/spacy/pipeline/ner.pyx in
TypeError: __init__() takes at least 2 positional arguments (1 given)
This method of saving and loading a custom component from disk seems to come from an early spaCy version. What is the second argument that EntityRecognizer needs?
The general process you are following of serializing a single component and reloading it is not the recommended way to do this in spaCy. You can do it - it has to be done internally, of course - but you generally want to save and load pipelines using high-level wrappers. In this case this means that you would save like this:
nlp.to_disk("my_model") # NOT ner.to_disk
And then load it with spacy.load("my_model").
You can find more detail about this in the saving and loading docs. Since it seems you're just getting started with spaCy, you might want to go through the course too. It covers the new config-based training in v3, which is much easier than using your own custom training loop like in your code sample.
If you want to mix and match components from different pipelines, you still will generally want to save entire pipelines, and you can then combine components from them using the "sourcing" feature.
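
The pipeline-level round trip described above can be tried with a minimal sketch like the following. To keep it runnable without a trained model, it assumes a blank English pipeline with a rule-based entity_ruler standing in for the custom NER component, and a temporary directory standing in for "my_model"; both are illustrative choices, not part of the original setup.

```python
import tempfile

import spacy

# Stand-in pipeline: a blank English pipeline with a rule-based entity_ruler,
# used here only so the example runs without a trained model.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "DISTRICT", "pattern": "Vallila"}])

with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)            # save the whole pipeline, not one component
    reloaded = spacy.load(model_dir)  # config, vocab and components come back together

doc = reloaded("I am going to Vallila.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Because the directory holds the pipeline config alongside the component data, spacy.load can rebuild every component without any constructor arguments being supplied by hand.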

Chatbot using Huggingface Transformers

I would like to use Huggingface Transformers to implement a chatbot. Currently, I have the code shown below. The transformer model already takes into account the history of past user input.
Is there something else (additional code) I have to take into account for building the chatbot?
Second, how can I modify my code to run with TensorFlow instead of PyTorch?
Later on, I also plan to fine-tune the model on other data and to test different models such as BlenderBot and GPT2. I think testing these different models should be as easy as replacing the corresponding model name in AutoTokenizer.from_pretrained("microsoft/DialoGPT-small") and AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small").
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
for step in range(5):
    # encode the new user input, add the eos_token and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
    # generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
    # pretty print the last output tokens from the bot
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
Here is an example of using the DialoGPT model with TensorFlow:
from transformers import TFAutoModelForCausalLM, AutoTokenizer, BlenderbotTokenizer, TFBlenderbotForConditionalGeneration
import tensorflow as tf

chat_bots = {
    'BlenderBot': [BlenderbotTokenizer.from_pretrained('facebook/blenderbot-400M-distill'), TFBlenderbotForConditionalGeneration.from_pretrained('facebook/blenderbot-400M-distill')],
    'DialoGPT': [AutoTokenizer.from_pretrained("microsoft/DialoGPT-small"), TFAutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")],
}
key = 'DialoGPT'
tokenizer, model = chat_bots[key]

for step in range(5):
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='tf')
    if step > 0:
        bot_input_ids = tf.concat([chat_history_ids, new_user_input_ids], axis=-1)
    else:
        bot_input_ids = new_user_input_ids
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
    print(key + ": {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
>> User:How are you?
DialoGPT: I'm here
>> User:Why are you here
DialoGPT: I'm here
>> User:But why
DialoGPT: I'm here
>> User:Where is here
DialoGPT: Where is where?
>> User:Here
DialoGPT: Where is here?
If you want to compare different chatbots, you might want to adapt their decoder parameters, because they are not always identical. For example, using BlenderBot and a max_length of 50 you get this kind of response with the current code:
>> User:How are you?
BlenderBot: ! I am am great! how how how are are are???
In general, you should ask yourself which special characters are important for a chatbot in your domain and which can or should be omitted.
You should also experiment with different decoding methods such as greedy search, beam search, random sampling, top-k sampling, and nucleus sampling, and find out what works best for your use case. For more information on this topic, check out this post.
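
To make the difference between some of these decoding methods concrete, here is a small sketch that applies greedy selection, top-k filtering and nucleus (top-p) filtering to a single hand-written next-token distribution. The vocabulary, the probabilities and the helper names are invented for illustration; in transformers these strategies are selected through generate() arguments such as do_sample, top_k, top_p and num_beams rather than implemented by hand.

```python
import random

# Toy next-token distribution (made-up values that sum to 1).
probs = {"great": 0.5, "fine": 0.25, "here": 0.15, "tired": 0.07, "banana": 0.03}


def greedy(probs):
    """Greedy search: always take the single most likely token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Nucleus filtering: keep the smallest set of top tokens whose mass reaches p."""
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}


def sample(probs, rng):
    """Draw one token according to the given distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(0)
print(greedy(probs))                            # always the same token
print(sample(top_k_filter(probs, k=2), rng))    # one of the two most likely tokens
print(sample(top_p_filter(probs, p=0.8), rng))  # one of the tokens inside the nucleus
```

Greedy selection repeats the same high-probability continuation every time, which is one plausible reason for loops like the "I'm here" transcript above, while the sampling variants trade some of that determinism for variety.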

How to perform text similarity using BERT on a 10M+ corpus? Using LSH/ANNOY/faiss or sklearn?

My idea is to extract the CLS token embedding for every text in the DB and save it in a CSV file or somewhere else. Then, when a new text comes in, instead of computing exact cosine, Jaccard, Manhattan, Euclidean or other distances against every stored vector, I want to use an approximation such as LSH or ANN (ANNOY, sklearn.neighbors), or the one given here, faiss. How can that be done? My code is:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, I am a text")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
Using Tensorflow:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
and I think I can get the CLS token as follows (please correct me if I'm wrong):
last_hidden_states = outputs[0]
cls_embedding = last_hidden_states[0][0]
Please tell me whether this is the right way to do it, and how I can use LSH, ANNOY, faiss or something similar.
For every text there will be a 768-length vector, so we can create an N x 768 matrix (N = number of texts, 10M). How can I find the indices of the top-5 data points (texts) that are most similar to a given embedding?
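
Before choosing an approximate index, it can help to pin down what "top-5 most similar" should return with an exact baseline. The sketch below is not an ANN method: it is a brute-force cosine search in NumPy over random stand-in vectors (the corpus size, the seed and the function name top_k_cosine are assumptions for illustration). A library such as faiss or Annoy would replace the brute-force scoring step and approximate this result at 10M scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the N x 768 matrix of CLS embeddings (random values, small N).
corpus = rng.standard_normal((1000, 768)).astype("float32")

# L2-normalise rows so that a dot product equals cosine similarity.
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)


def top_k_cosine(query, matrix, k=5):
    """Return indices and scores of the k rows most similar to `query`."""
    query = query / np.linalg.norm(query)
    scores = matrix @ query                       # cosine similarity to every row
    candidates = np.argpartition(-scores, k)[:k]  # unordered top-k without a full sort
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]


# Query with a slightly perturbed copy of row 42; row 42 should rank first.
query = corpus[42] + 0.01 * rng.standard_normal(768).astype("float32")
indices, scores = top_k_cosine(query, corpus, k=5)
print(indices, scores)
```

The same exact search is also useful later for measuring how many true neighbours an approximate index recovers on a sample of queries.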

Export vectors from fastText to spaCy

I downloaded the 1.5 GB vectors file and used the example code from spaCy examples vectors_fast_text. I executed the following command in the terminal:
python config/ vectors_loc data/vectors/
After a few minutes with the processor at 100%, I received the following text:
class colspan 0.32231358
What happens from here? How can I export these vectors for use elsewhere, for example with my AWS S3 training templates?
I modified the example script, to load the existing data of my language, read the file word2vec and at the end write all the content in a folder (this folder needs to exist).
[LANGUAGE] = example: "pt"
[FILE_WORD2VEC] = "./data/word2vec.txt"
from __future__ import unicode_literals
import plac
import numpy
import spacy
from spacy.language import Language


def main():
    nlp = spacy.load('[LANGUAGE]')
    with open("[FILE_WORD2VEC]", 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()  # header line: "<rows> <dimensions>"
        count = 0
        for line in file_:
            count += 1
            line = line.rstrip().decode('utf8')
            pieces = line.rsplit(' ', int(nr_dim))
            word = pieces[0]
            print("{} - {}".format(count, word))
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
    nlp.to_disk("./models/new_nlp/")  # write everything to the (existing) output folder


if __name__ == '__main__':
    plac.call(main)
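
The parsing step in the script above can be tried in isolation. This is a minimal sketch that assumes the plain-text word2vec layout the script expects (a "rows dims" header, then one word and its values per line); the in-memory sample data and the function name read_word2vec_text are made up for illustration.

```python
import io

# Tiny in-memory stand-in for a word2vec text file: a "rows dims" header
# followed by one "word v1 v2 ... vN" line per entry (values are made up).
sample = io.BytesIO(
    "2 3\n"
    "casa 0.1 0.2 0.3\n"
    "São Paulo 0.4 0.5 0.6\n".encode("utf8")
)


def read_word2vec_text(file_):
    """Yield (word, vector) pairs from a binary file handle."""
    nr_row, nr_dim = (int(n) for n in file_.readline().split())
    for line in file_:
        # Split from the right so that a word containing spaces stays intact.
        pieces = line.rstrip().decode("utf8").rsplit(" ", nr_dim)
        yield pieces[0], [float(v) for v in pieces[1:]]


vectors = dict(read_word2vec_text(sample))
print(vectors)
```

Splitting from the right with a limit of nr_dim is what lets the first piece hold the whole word even if it contains a space.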
Type in the terminal:
It will take about 10 minutes to finish, depending on the size of the word2vec file. The script prints each word as it is processed, so you can follow its progress.
After that, you must type in the terminal:
python -m spacy package ./models/new_nlp/ ./my_models/
python sdist
And then you will have a "zip" file.
pip install /path/to/pt_example_model-1.0.0.tar.gz
A detailed tutorial can be found on the spaCy website:

Displaying tf.summary.text with underscores correctly in Tensorboard

I want to log a few strings with underscores to TensorBoard. However, the underscores are treated as emphasis somewhere in the pipeline. Here's some example code to illustrate the problem. I've included a few versions that attempt to escape the underscores:
import tensorflow as tf
sess = tf.InteractiveSession()
text0 = """/a/b/c_d/f_g_h_2017"""
text1 = """/a/b/c\_d/f\_g\_h\_2017"""
text2 = """/a/b/c\\_d/f\\_g\\_h\\_2017"""
summary_op0 = tf.summary.text('text', tf.convert_to_tensor(text0))
summary_op1 = tf.summary.text('text', tf.convert_to_tensor(text1))
summary_op2 = tf.summary.text('text', tf.convert_to_tensor(text2))
summary_op = tf.summary.merge([summary_op0, summary_op1, summary_op2])
summary_writer = tf.summary.FileWriter('/tmp/tensorboard', sess.graph)
summary = sess.run(summary_op)
summary_writer.add_summary(summary, 0)
Here's the output:
How can I get TensorBoard to render strings with underscores properly?
Package versions: Tensorflow 1.3.0, TensorBoard 0.1.8
This is working as intended. The docs for tf.summary.text and also for tensorboard.summary.text state that the text will be rendered using Markdown formatting—just like the text in this question and answer—and in Markdown, underscores create italics.
If you don't want this to be the case, you can consider formatting these strings as code, by using either
text0 = """`/a/b/c_d/f_g_h_2017`""" # backticks: inline code formatting
text1 = """    /a/b/c\_d/f\_g\_h\_2017""" # four-space indent: code block
This yields the following result:
(Disclaimer: I work on TensorBoard.)
According to this GitHub issue, this is a bug in the current TensorBoard with Python 3. For now, using backticks as suggested in another answer is sufficient to render the underscores correctly.
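
The backtick workaround mentioned above can be wrapped in a small helper that is applied to each string before it is passed to tf.summary.text. This is only a sketch: the name as_inline_code is hypothetical, and the handling of strings that already contain backticks follows the usual Markdown rule of using a longer backtick delimiter, which is assumed to hold for the renderer TensorBoard uses.

```python
def as_inline_code(text):
    """Wrap `text` in backticks so Markdown renders underscores literally."""
    # Find the longest run of backticks already present in the text.
    longest_run = 0
    run = 0
    for ch in text:
        run = run + 1 if ch == "`" else 0
        longest_run = max(longest_run, run)
    # The delimiter must be longer than any backtick run inside the text.
    fence = "`" * (longest_run + 1)
    pad = " " if longest_run else ""
    return f"{fence}{pad}{text}{pad}{fence}"


print(as_inline_code("/a/b/c_d/f_g_h_2017"))
```

Paths and identifiers logged this way show up in monospace, with their underscores intact rather than turned into italics.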