How do I convert simple training style data to spaCy's command line JSON format?

I have the training data for a new NER type in the "Training an additional entity type" section of the spaCy documentation.
TRAIN_DATA = [
    ("Horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),
    ("Do they bite?", {
        'entities': []
    }),
    ("horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),
    ("horses pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),
    ("they pretend to care about your feelings, those horses", {
        'entities': [(48, 54, 'ANIMAL')]
    }),
    ("horses?", {
        'entities': [(0, 6, 'ANIMAL')]
    })
]
I want to train an NER model on this data using the spacy command line application. This requires data in spaCy's JSON format. How do I write the above data (i.e. text with labeled character offset spans) in this JSON format?
After looking at the documentation for that format, it's not clear to me how to manually write data in it. (For example, do I have to partition everything into paragraphs?) There is also a convert command line utility, but it converts from non-spaCy data formats to spaCy's format, so it doesn't take the simple training style shown above as input.
I understand the examples of NER training code that use the "Simple training style", but I'd like to be able to use the command line utility for training. (Though, as is apparent from my previous spaCy question, I'm unclear on when you're supposed to use that style and when you're supposed to use the command line.)
Can someone show me an example of the above data in "spaCy's JSON format", or point to documentation that explains how to make this transformation?

There's a built-in function in spaCy that will get you most of the way there:
from spacy.gold import biluo_tags_from_offsets
It takes the "offset" style annotations you have there and converts them to the token-by-token BILUO format.
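For instance, on the first training example it produces one tag per token (a small sketch, assuming an English model such as en_core_web_sm is installed):
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.load('en_core_web_sm')
doc = nlp("Horses are too tall and they pretend to care about your feelings")
tags = biluo_tags_from_offsets(doc, [(0, 6, 'ANIMAL')])
print(tags)  # ['U-ANIMAL', 'O', 'O', ...] - one tag per token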
To put the NER annotations into the final training JSON format, you just need a bit more wrapping around them to fill out the other slots the data requires:
sentences = []
for t in TRAIN_DATA:
    doc = nlp(t[0])
    tags = biluo_tags_from_offsets(doc, t[1]['entities'])
    ner_info = list(zip(doc, tags))
    tokens = []
    for n, i in enumerate(ner_info):
        token = {"head": 0,   # dummy values; only the NER slot matters here
                 "dep": "",
                 "tag": "",
                 "orth": i[0].text,
                 "ner": i[1],
                 "id": n}
        tokens.append(token)
    sentences.append(tokens)
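From there, each token list still needs the outer document/paragraph wrapping the CLI expects. A minimal sketch of that wrapping (field names follow the spaCy v2 JSON training format; verify the exact shape against your spaCy version):
import json

train_json = [{
    "id": 0,
    "paragraphs": [
        {"raw": t[0], "sentences": [{"tokens": toks, "brackets": []}]}
        for t, toks in zip(TRAIN_DATA, sentences)
    ]
}]

with open("train.json", "w") as f:
    json.dump(train_json, f, indent=2)

# Hypothetical CLI invocation (spaCy v2), assuming a dev.json built the same way:
#   python -m spacy train en /output train.json dev.json --pipeline ner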
Make sure that you disable the non-NER pipelines before training with this data.
I've run into some issues using spacy train on NER-only data. See #1907 and also check out this discussion on the Prodigy forum for some possible workarounds.
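If you end up training in code rather than through the CLI, disabling the non-NER pipes might look like this (a sketch for spaCy v2):
# Disable every pipe except NER for the duration of training.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    # ... run the usual training loop here ...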

Related

Intent chatbot with numeric and string data

I am trying to build a chatbot using intents in a JSON file; an example intent is:
{"tag": "thanks",
 "patterns": ["Thanks", "Thank you", "That's helpful"],
 "responses": ["Happy to help!", "Any time!", "My pleasure"]
},
I have a lot of other tags, but I want the chatbot to pick its response based on the input text plus other factors, for example a speech intensity value ranging from 1 to 10.
The chatbot has been trained using TensorFlow.
How can I modify the intent file and feed the chatbot a text together with this extra information?
The chatbot operates via this code:
def chat():
    print("Start talking with the bot (type quit to stop)!")
    while True:
        inp = input("You: ")
        if inp.lower() == "quit":
            break
        results = model.predict([bag_of_words(inp, words)])
        results_index = numpy.argmax(results)
        tag = labels[results_index]
        for tg in data["intents"]:
            if tg['tag'] == tag:
                responses = tg['responses']
        print("Chatbot: ", random.choice(responses))
What I want is that if, for example, the speech intensity is 8 and the sentence is "what do you want", the response should be something like:
"Why are you nervous?"
You can add an extra field to each intent in the file and use it to trigger whatever behavior you want. Note that JSON cannot store a function call directly, so store the function's name as a string and look it up in code. For example:
{"tag": "thanks",
 "patterns": ["Thanks", "Thank you", "That's helpful"],
 "responses": ["Happy to help!", "Any time!", "My pleasure"],
 "action": "getname"
},
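A minimal dispatch sketch of that idea (getname, ACTIONS, and the intense_responses field are hypothetical names, not part of any library):
def getname():
    return "Why are you nervous?"

ACTIONS = {"getname": getname}

def respond(tag, intensity):
    for tg in data["intents"]:
        if tg["tag"] == tag:
            # Let a high speech intensity override the normal responses.
            if intensity >= 8 and "intense_responses" in tg:
                return random.choice(tg["intense_responses"])
            if "action" in tg:
                return ACTIONS[tg["action"]]()
            return random.choice(tg["responses"])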

Convert CoNLL file to a list of Doc objects

Is there a way to convert a CoNLL file into a list of Doc objects without having to parse the sentences using the nlp object? I have a list of annotations that I have to pass to an automatic component that takes Doc objects as input. I have found a way to create the doc:
doc = Doc(nlp.vocab, words=[...])
I can then use the from_array function to recreate the other linguistic features. The array can be built using index values from the StringStore object, and I have successfully created a Doc object with LEMMA and TAG information, but I cannot recreate the HEAD data. My question is how to pass HEAD data to a Doc object using the from_array method.
The confusing thing about HEAD is that for a sentence with this structure:
Ona 2
je 2
otišla 2
u 4
školu 2
. 2
The output of this code snippet:
from spacy.attrs import TAG, HEAD, DEP
doc.to_array([TAG, HEAD, DEP])
is:
array([[10468770234730083819,                    2,                  429],
       [ 5333907774816518795,                    1,                  405],
       [11670076340363994323,                    0,  8206900633647566924],
       [ 6471273018469892813,                    1,  8110129090154140942],
       [ 7055653905424136462, 18446744073709551614,                  435],
       [ 7173976090571422945, 18446744073709551613,                  445]],
      dtype=uint64)
I cannot correlate the center column of the to_array output to the dependency tree structure given above.
Thanks in advance for the help,
Daniel
OK, so I finally cracked it. The HEAD column stores the relative offset head - index as an unsigned 64-bit integer, so when the head comes before the token, the negative offset wraps around: the stored value is 18446744073709551616 (i.e. 2**64) minus the distance back to the head. This is the function I used, if anyone else needs it:
import numpy as np
from spacy.attrs import TAG, HEAD, DEP
from spacy.tokens import Doc

docs = []
for sent in sents:  # sents holds the parsed CoNLL sentences
    generated_doc = Doc(doc.vocab, words=[word["word"] for word in sent])
    heads = []
    for idx, word in enumerate(sent):
        if word["pos"] not in doc.vocab.strings:
            doc.vocab.strings.add(word["pos"])
        if word["dep"] not in doc.vocab.strings:
            doc.vocab.strings.add(word["dep"])
        # HEAD stores the relative offset head - index; taking it modulo
        # 2**64 makes negative offsets wrap around as unsigned integers.
        heads.append((word["head"] - idx) % 2**64)
    np_array = np.array(
        [[doc.vocab.strings[word["pos"]], heads[idx], doc.vocab.strings[word["dep"]]]
         for idx, word in enumerate(sent)],
        dtype=np.uint64)
    generated_doc.from_array([TAG, HEAD, DEP], np_array)
    docs.append(generated_doc)
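As a quick sanity check (a sketch using the example sentence above), the wrapped values can be decoded back into absolute head indices by reinterpreting them as signed integers:
import numpy as np

raw = np.array([2, 1, 0, 1, 18446744073709551614, 18446744073709551613],
               dtype=np.uint64)
offsets = raw.view(np.int64)               # reinterpret the bits as signed
heads = [i + int(off) for i, off in enumerate(offsets)]
print(heads)  # [2, 2, 2, 4, 2, 2] - matches the structure above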

Adding a Retokenize pipe while training NER model

I am currently attempting to train an NER model centered around property descriptions. I got a fully trained model to function to my liking; however, I now want to add a retokenize pipe to the model so that I can set the model up to train other things.
From here, I am having issues getting the retokenize pipe to actually work. Here is the definition:
from spacy.attrs import intify_attrs

def retok(doc):
    ents = [(ent.start, ent.end, ent.label) for ent in doc.ents]
    with doc.retokenize() as retok:
        string_store = doc.vocab.strings
        for start, end, label in ents:
            retok.merge(
                doc[start:end],
                attrs=intify_attrs({'ent_type': label}, string_store))
    return doc
I am adding it to my training like this:
nlp.add_pipe(retok, after="ner")
and I am adding it into the Language Factories like this:
Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
The issue I keep getting is "AttributeError: 'English' object has no attribute 'ents'". I am assuming I am getting this error because the parameter being passed to this function is not a doc but the NLP model itself. I am not really sure how to get a doc to flow into this pipe during training, and at this point I don't really know where to go from here to get the pipe to function the way I want.
Any help is appreciated, thanks.
You can potentially use the built-in merge_entities pipeline component: https://spacy.io/api/pipeline-functions#merge_entities
The example copied from the docs:
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]
merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]
If you need to customize it further, the current implementation of merge_entities (v2.2) is a good starting point:
def merge_entities(doc):
    """Merge entities into a single token.

    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged entities.

    DOCS: https://spacy.io/api/pipeline-functions#merge_entities
    """
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
    return doc
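For example, a stripped-down variant that only carries over the entity type (a sketch mirroring the retok() above, not anything built in):
def merge_entities_ner_only(doc):
    # Merge each entity span into one token, keeping only its entity type.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent, attrs={"ent_type": ent.label})
    return doc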
P.S. You are passing nlp to retok() here, which is where the error is coming from:
Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
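The factory should return the component callable instead of calling it; something like this (assuming the retok function above):
Language.factories['retok'] = lambda nlp, **cfg: retok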
See a related question: Spacy - Save custom pipeline

Tensorflow serving: Unable to base64 decode

I use the slim package's resnet_v2_152 to train a classification model.
The model is then exported to a .pb file to provide a service.
Because the input is an image, it is encoded with web-safe base64 encoding. It looks like:
serialized_tf_example = tf.placeholder(dtype=tf.string, name='tf_example')
decoded = tf.decode_base64(serialized_tf_example)
I then encode an image with base64 as follows:
img_path = '/Users/wuyanxue/Desktop/not_emoji1.jpeg'
img_b64 = base64.b64encode(open(img_path, 'rb').read())
s = str(img_b64, encoding='utf-8')
s = s.replace('+', '-').replace(r'/', '_')
My post data is structured as follows:
post_data = {
    'signature_name': 'predict',
    'instances': [{
        'inputs': {'b64': s}
    }]
}
Finally, I post an HTTP request to the server:
res = requests.post('server_address', json=post_data)
It gives me:
'{ "error": "Failed to process element: 0 key: inputs of \\\'instances\\\' list. Error: Invalid argument: Unable to base64 decode" }'
Why does this error occur, and what are possible solutions?
I had the same issue when using Python 3. I solved it by adding a b prefix, making the payload a bytes object instead of the default str:
b'{"instances" : [{"b64": "%s"}]}' % base64.b64encode(dl_request.content)
Hope that helps, please see this answer for extra info.
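Put together, the request might look like this (a sketch; dl_request.content and server_address stand in for your own image bytes and endpoint):
import base64
import requests

# Interpolating the base64 bytes into a bytes payload avoids the str/bytes mismatch.
payload = b'{"instances": [{"b64": "%s"}]}' % base64.b64encode(dl_request.content)
res = requests.post(server_address, data=payload)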
This has since been solved; here is what was going on.
post_data = {
    'signature_name': 'predict',
    'instances': [{
        'inputs': {'b64': s}
    }]
}
Note that inputs is wrapped with the 'b64' flag, which tells TensorFlow Serving to base64-decode s itself; this decoding happens inside TensorFlow Serving.
So the placeholder:
serialized_tf_example = tf.placeholder(dtype=tf.string, name='tf_example')
will receive the binary form of the input data directly, NOT the base64 form.
So, finally,
decoded = tf.decode_base64(serialized_tf_example)
is NOT necessary.
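A minimal sketch of what the corrected input side might look like (assuming JPEG inputs; verify the shapes against your own export code):
import tensorflow as tf

# TF Serving already base64-decodes the {"b64": ...} payload, so the
# placeholder receives raw image bytes that can be decoded directly.
serialized_tf_example = tf.placeholder(dtype=tf.string, shape=[None], name='tf_example')
images = tf.map_fn(lambda b: tf.image.decode_jpeg(b, channels=3),
                   serialized_tf_example, dtype=tf.uint8)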

How to feed inputs into a loaded Tensorflow model using C++

I want to create and train a model, export it and run inference in C++.
I'm following the tutorial listed here: https://www.tensorflow.org/tutorials/wide_and_deep
I'm also trying to use the SavedModel approach as described here since this is the canonical way to export TensorFlow graphs for serving:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md.
At the very end, I export the saved model as follows:
feature_spec = tf.contrib.layers.create_feature_spec_for_parsing(feature_columns)
serving_input_fn = input_fn_utils.build_parsing_serving_input_fn(feature_spec)
output = model.export_savedmodel(model_dir, serving_input_fn, as_text=True)
print('Model saved to {}'.format(output))
I see that saved_model.pbtxt has the following signature definition:
signature_def {
  key: "serving_default"
  value {
    inputs {
      key: "inputs"
      value {
        name: "input_example_tensor:0"
        dtype: DT_STRING
        tensor_shape {
          dim {
            size: -1
          }
        }
      }
    }
    outputs {
    ...
I can load the saved model on the C++ side:
SavedModelBundle bundle;
const std::string graph_path = "models/1498572863";
const std::unordered_set<std::string> tags = {"serve"};
Status status = LoadSavedModel(session_options, run_options,
                               graph_path, tags, &bundle);
I'm stuck at the last part where I need to feed the input into this model.
The Run function expects the input parameter to be of the form: std::vector<std::pair<string, Tensor>>.
I would have expected this to be a vector of pairs where the key is the feature name used in the python code and the Tensor is multiple values for that feature.
However, it seems to expect the string to be "input_example_tensor".
I'm not sure how I'm supposed to now feed the model with different features using a single Tensor.
std::vector<string> output_tensor_names = {
    "binary_logistic_head/_classification_output_alternatives/classes_tensor"};
// How do I create input_tensor?
status = bundle.session->Run({{"input_example_tensor", input_tensor}},
                             output_tensor_names, {}, &outputs);
Solution
I did something like this:
// Build a tensorflow::Example proto, serialize it, and feed the
// serialized string as the single DT_STRING input tensor.
tensorflow::Example example;
auto& tf_feature_map = *(example.mutable_features()->mutable_feature());
tf_feature_map["name"].mutable_int64_list()->add_value(15);
const std::string& serialized = example.SerializeAsString();
tensorflow::Input input({serialized});
status = bundle.session->Run({{"input_example_tensor", input.tensor()}},
                             output_tensor_names, {}, &outputs);
Your model signature suggests that it is expecting a DT_STRING tensor as input. When using tensorflow::Example, this typically means that the protocol buffer needs to be serialized into a tensor with a string as the type of its elements.
To convert the tensorflow::Example object to a string, you can use the protocol buffer methods such as SerializeToString, SerializeAsString etc.
Hope that helps.
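For reference, the Python-side equivalent of building that serialized input (a sketch, handy for checking what the C++ code should be feeding):
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "name": tf.train.Feature(int64_list=tf.train.Int64List(value=[15])),
}))
serialized = example.SerializeToString()  # feed this as the DT_STRING input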