Correct annotation to train spaCy's NER - spacy

I'm having some trouble finding the right way to annotate my data. I'm dealing with laboratory-test-related texts and I am using the following labels:
1) Test specification (e.g. voltage, length, ...)
2) Test object (e.g. battery, steel beam, ...)
3) Test value (e.g. 5 V; 5 m, ...)
Let's take this example sentence:
The battery voltage should be 5 V.
I would annotate this sentence like this:
The
battery voltage (test specification)
should
be
5 V (Test value)
.
However, if the sentence looks like this:
The voltage of the battery should be 5 V.
I would use the following annotation:
The
voltage (Test specification)
of
the
battery (Test object)
should
be
5 V (Test value)
.
Is anyone experienced in annotating data who can explain whether this is the right way? Or should I use the Test object label for battery in the first example as well? Or should I combine the labels in the second example and tag voltage of the battery as Test specification?
I am annotating the data to perform information extraction.
Thanks for any help!

All of your examples use unusual annotation formats. The typical way to tag NER data (in text) is to use an IOB/BILOU format, where each token is on one line, the file is a TSV, and one of the columns is a label. So for your data it would look like:
The
voltage U-SPEC
of
the
battery U-OBJ
should
be
5 B-VALUE
V L-VALUE
.
Pretend that is TSV, and I have omitted O tags, which are used for "other" items.
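Written out in full (a sketch for illustration only; tab characters separate the two columns, and the tags follow the BILOU scheme used above), the same sentence would look like:
The	O
voltage	U-SPEC
of	O
the	O
battery	U-OBJ
should	O
be	O
5	B-VALUE
V	L-VALUE
.	O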
You can find documentation of these schemes in the spaCy docs.
If you already have data in the format you provided, or you find it easier to make it that way, it should at least be easy to convert. For training NER, spaCy requires the data to be provided in a particular format; see the docs for details, but basically you need the input text, character spans, and the labels of those spans. Here's example data:
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
This format is trickier to produce manually than the TSV-type format above, so generally you would produce the TSV-like format, possibly using a tool, and then convert it.
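As a rough sketch of such a conversion (not from the original answer; it assumes the tokens can simply be re-joined with single spaces, which real data usually violates, and it uses the spaCy v2-style tuple format shown above):
def bilou_to_spans(token_tag_pairs):
    # Convert [(token, tag), ...] with BILOU tags into spaCy's
    # (text, {"entities": [(start, end, label)]}) training format.
    text, entities = "", []
    start, label = None, None
    for token, tag in token_tag_pairs:
        if text:
            text += " "
        tok_start = len(text)
        text += token
        if tag == "O":
            continue
        prefix, ent_label = tag.split("-", 1)
        if prefix in ("B", "U"):
            start, label = tok_start, ent_label
        if prefix in ("U", "L"):
            entities.append((start, len(text), label))
    return text, {"entities": entities}

tokens = [("The", "O"), ("voltage", "U-SPEC"), ("of", "O"), ("the", "O"),
          ("battery", "U-OBJ"), ("should", "O"), ("be", "O"),
          ("5", "B-VALUE"), ("V", "L-VALUE"), (".", "O")]
print(bilou_to_spans(tokens))
# ('The voltage of the battery should be 5 V .',
#  {'entities': [(4, 11, 'SPEC'), (19, 26, 'OBJ'), (37, 40, 'VALUE')]})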

The main rule to correctly annotate entities is to be consistent (i.e. you always apply the same rules when deciding which entity is what). I can see you already have some rules in terms of when battery voltage should be considered a test object or a test specification.
Apply those rules consistently and you'll be ok.
Have a look at the spacy-annotator.
It is a library that helps you annotate data in the way you want.
Example:
import pandas as pd
import re
from spacy_annotator.pandas_annotations import annotate as pd_annotate

# Data
df = pd.DataFrame.from_dict({'full_text': ['The battery voltage should be 5 V.', 'The voltage of the battery should be 5 V.']})

# Annotations
pd_dd = pd_annotate(df,
                    col_text='full_text',          # Column in pandas dataframe containing text to be labelled
                    labels=['test_specification', 'test_object', 'test_value'],  # List of labels
                    sample_size=1,                 # Size of the sample to be labelled
                    delimiter=',',                 # Delimiter to separate entities in GUI
                    model=None,                    # spaCy model for noisy pre-labelling
                    regex_flags=re.IGNORECASE)     # One (or more) regex flags to be applied when searching for entities in text
# Example output
pd_dd['annotations'][0]
The code will show you a user interface you can use to annotate the relevant entities.

Related

How to train data of different lengths in machine learning?

I am analyzing the text of some literary works and I want to look at the distance between certain words in the text. Specifically, I am looking for parallelism.
Since I can't know the specific number of tokens in a text, I can't simply put all the words of the text in the training data, because it would not be uniform across all training data.
For example, the text:
“I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today."
Is not the same text length as
"My fellow Americans, ask not what your country can do for you, ask what you can do for your country."
Therefore I could not make columns out of each word and then assign the distance in a row, because the lengths would be different.
How could I go about representing this in training data? I was under the assumption that training data had to be the same type and length.
In order to solve this problem you can use something called pad_sequences. The process: first transform the data with some word-embedding technique like TF-IDF or any other algorithm; after converting the textual data into vectors, use the shape method to figure out the maximum length you have; then use that maximum in the pad_sequences method. Here is how you implement this method:
'''
from keras.preprocessing.sequence import pad_sequences
padded_data = pad_sequences(your_data, maxlen=your_maximum_length, padding='post', truncating='post')
'''
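For a more end-to-end sketch (assuming integer-encoded token sequences produced with the Keras Tokenizer, which is the more common input for pad_sequences; the texts are just placeholders):
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ["I have a dream today", "ask not what your country can do for you"]

# Turn each text into a sequence of integer word indices
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad (or truncate) every sequence to the length of the longest one
max_len = max(len(s) for s in sequences)
padded_data = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
print(padded_data.shape)  # (2, max_len)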

Can I use labeled data and rule-based matching for multiclass text classification with Spacy?

I have some labeled data (around 1000 examples) of the form (text, category) and up to 10k unlabeled examples. I want to use spaCy's rule-based matching tool to define a pattern for every category. After this, I would like to train a new model using the rules and the data that I've labeled. Is this possible? I've seen a tutorial on YouTube (linked below) that does something similar, but it uses the labeled data to determine whether a sentence contains some entity. In contrast, I want to put a label on an entire paragraph.
https://www.youtube.com/watch?v=IqOJU1-_Fi0
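For illustration only (this is not part of the question; the category names and keyword patterns are made up), a per-category rule with spaCy's Matcher could look like the sketch below, and its matches could then serve as weak labels for training a text classifier:
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One rule set per category (hypothetical keywords)
matcher.add("BILLING", [[{"LOWER": "invoice"}], [{"LOWER": "payment"}]])
matcher.add("SHIPPING", [[{"LOWER": "delivery"}], [{"LOWER": "shipping"}]])

doc = nlp("The payment failed and the invoice was never sent.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)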

Applying LSA on term document matrix when number of documents are very less

I have a term-document matrix (X) of shape (6, 25931). The first 5 documents are my source documents and the last document is my target document. The column represents counts for different words in the vocabulary set. I want to get the cosine similarity of the last document with each of the other documents.
But since SVD produces an S of size (min(6, 25931),), if I use S to reduce my X, I get a 6 * 6 matrix. But in this case, I feel that I will be losing too much information, since I am reducing a vector of size (25931,) to (6,).
And when you think about it, usually the number of documents will always be less than the number of vocabulary words. In this case, using SVD to reduce dimensionality will always produce vectors of size (number of documents,).
According to everything that I have read, when SVD is used like this on a term-document matrix, it's called LSA.
Am I implementing LSA correctly?
If this is correct, then is there any other way to reduce the dimensionality and get denser vectors where the size of the compressed vector is greater than (6,)?
P.S.: I also tried using fit_transform from sklearn.decomposition.TruncatedSVD, which expects the input to be of the form (n_samples, n_features), which is why the shape of my term-document matrix is (6, 25931) and not (25931, 6). I kept getting a (6, 6) matrix, which initially confused me. But now it makes sense after I remembered the math behind SVD.
If the objective of the exercise is to find the cosine similarity, then the following approach can help. The author is only attempting to solve for that objective and not to comment on the definitions of Latent Semantic Analysis or Singular Value Decomposition mentioned by the questioner.
Let us first invoke all the required libraries. Please install them if they do not exist in the machine.
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Let us generate some sample data for this exercise.
df = {'sentence': ['one two three','two three four','four five','six seven eight nine ten']}
df = pd.DataFrame(df, columns = ['sentence'])
The first step is to get the exhaustive list of all the possible features. So collate all of the content in one place.
all_content = [' '.join(df['sentence'])]
Let us build a vectorizer and fit it now. Please note that the arguments in the vectorizer are not explained by the author as the focus is on solving the problem.
vectorizer = TfidfVectorizer(encoding = 'latin-1',norm = 'l2', min_df = 0.03, ngram_range = (1,2), max_features = 5000)
vectorizer.fit(all_content)
We can inspect the vocabulary to see if it makes sense. If needed, one could add stop words in the vectorizer above and suppress them to see if they are indeed suppressed.
print(vectorizer.vocabulary_)
Let us vectorize the sentences so we can apply cosine similarity.
s1Tokens = vectorizer.transform(df.iloc[1,])
s2Tokens = vectorizer.transform(df.iloc[2,])
Finally, the cosine similarity can be computed as follows.
cosine_similarity(s1Tokens , s2Tokens)
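To tie this back to the original six-document setup (a sketch assuming the documents are available as raw text rather than as a pre-built count matrix), the last document can be compared against the other five in a single call:
# Hypothetical list of six documents; the last one is the target
documents = ['doc one text', 'doc two text', 'doc three text',
             'doc four text', 'doc five text', 'target doc text']

vectorizer = TfidfVectorizer(norm='l2')
X = vectorizer.fit_transform(documents)          # shape: (6, vocabulary_size)

# Cosine similarity of the last row against the first five rows
similarities = cosine_similarity(X[-1], X[:-1])  # shape: (1, 5)
print(similarities)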

Any way to extract the exhaustive vocabulary of the google universal sentence encoder large?

I have some sentences for which I am creating an embedding and it works great for similarity searching unless there are some truly unusual words in the sentence.
In that case, the truly unusual words in fact contain the most similarity information of any words in the sentence, BUT all of that information is lost during embedding because the word is apparently not in the vocabulary of the model.
I'd like to get a list of all of the words known by the GUSE embedding model so that I can mask those known words out of my sentence, leaving only the "novel" words.
I can then do an exact word search for those novel words in my target corpus and achieve usability for my similar sentence searching.
e.g. "I love to use Xapian!" gets embedded as "I love to use UNK".
If I just do a keyword search for "Xapian" instead of a semantic similarity search, I'll get much more relevant results than I would using GUSE and vector KNN.
Any ideas on how I can extract the vocabulary known/used by GUSE?
I combined the earlier answer from Roee Shenberg and the solution provided here to come up with a solution, which is applicable to USE v4:
import importlib
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model("/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/")
graph = saved_model.meta_graphs[0].graph_def
fns = [f for f in saved_model.meta_graphs[0].graph_def.library.function if "ptb" in str(f).lower()];
print(len(fns)) # should be 1
nodes_with_sp = [n for n in fns[0].node_def if n.name == "Embeddings_words"]
print(len(nodes_with_sp)) # should be 1
words_tensor = nodes_with_sp[0].attr.get("value").tensor
word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # should be 400004
If you are just curious about the words, I uploaded them here.
I'm assuming you have tensorflow & tensorflow_hub installed, and you have already downloaded the model.
IMPORTANT: I'm assuming you're looking at https://tfhub.dev/google/universal-sentence-encoder/4! There's no guarantee the object graph looks the same for different versions, it's likely that modifications will be needed.
Find its location on disk - it's somewhere at /tmp/tfhub_modules unless you set the TFHUB_CACHE_DIR environment variable (Windows/Mac have different locations). The path should contain a file called saved_model.pb, which is the model, serialized using Protocol Buffers.
Unfortunately, the dictionary is serialized inside the model's Protocol Buffers file and not as an external asset, so we'll have to load the model and get the variable from it.
The strategy is to use tensorflow's code to deserialize the file, and then travel down the serialized object tree all the way to the dictionary.
import importlib
MODEL_PATH = 'path/to/model/dir' # e.g. '/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/'
# Use the tensorflow internal Protobuf loader. A regular import statement will fail.
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model(MODEL_PATH)
# reach into the object graph to get the tensor
graph = saved_model.meta_graphs[0].graph_def
function = graph.library.function
node_type, node_value = function[5].node_def
# if you print(node_type) you'll see it's called "text_preprocessor/hash_table"
# as well as get insight into this branch of the object graph we're looking at
words_tensor = node_value.attr.get("value").tensor
word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # -> 400004
Some resources that helped:
A GitHub issue relating to changing the vocabulary
A Tensorflow Google-group thread linked from the issue
Extra Notes
Despite what the GitHub issue may lead you to think, the 400k words here are not the GloVe 400k vocabulary. You can verify this by downloading the GloVe 6B embeddings (file link), extracting glove.6B.50d.txt, and then using the following code to compare the two dictionaries:
with open('/path/to/glove.6B.50d.txt') as f:
    glove_vocabulary = set(line.strip().split(maxsplit=1)[0] for line in f)

USE_vocabulary = set(word_list)  # from above

print(len(USE_vocabulary - glove_vocabulary))  # -> 281150
Inspecting the different vocabularies is interesting in and of itself, e.g. why does GloVe have an entry for '287.9'?

Tensorflow: training on JSON data to generate similar output

Assume one has JSON data containing instructions for generating the following 10x5 cell patterns, and that each cell can contain one of the following characters: _ 0 x y z
Also assume that each character can be displayed in various colors.
pattern 1:
_yx_0zzyxx
_0__yz_0y_
x0_0x000yx
_y__x000zx
zyyzx_z_0y
pattern 2:
xx0z00yy_z
zzx_0000_x
_yxy0y__yx
_xz0z__0_y
y__x0_0_y_
pattern 3:
yx0x_xz0_z
xz_x0_xxxz
_yy0x_0z00
zyy0__0zyx
z_xy0_0xz0
These were randomly generated, and are all black, but assume they were devised according to some set of rules, and in color.
The JSON for the first pattern would look something like:
{
  "width": 10,
  "height": 5,
  "cells": [
    {
      "value": "_",
      "color": "red"
    },
    {
      "value": "y",
      "color": "blue"
    }, ...
  ]
}
If one wanted to train on this data in order to generate new yet similar patterns (again, assuming these were not randomly generated), what is the recommended approach for:
reading the data in (I'd imagine putting the JSON into an Example protobuf, serializing the buffer to string with tf.parse_example, and then writing that to TFRecord files)
training on that data
generating new patterns based on the trained model
supplying seed data for the generated patterns, e.g. the first cell is the character "x" with the color blue.
I want to achieve something similar to what I've seen in style transfer with art/photos, and with music/MIDI data (see: Google Magenta). In those cases, the model is trained on a distinctive set of artwork or a melodic style, and a seed in the form of a photograph or primer melody is supplied in order to generate content similar to the data used in training.
Thanks!
I dislike preprocessing the dataset into new forms; it makes it difficult to change later on and slows future development. It's like technical debt, in my opinion.
My approach would be to keep your JSON as-is and write some simple python code (specifically a generator, which means you use yield instead of return statements) to read the JSON file and spit out samples in sequence.
Then use the tensorflow Dataset input pipeline with Dataset.from_generator(...) to take data from your input function.
https://www.tensorflow.org/programmers_guide/datasets
The Dataset pipeline provides everything you need to manage the various transformations you'll want to apply: you can buffer, shuffle, batch, prefetch, and map functions onto your data trivially, in a nice modular, testable framework that feeds naturally into your tensorflow model.
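As a rough sketch of that idea (written against the current tf.data API; the file name, field names, and the integer encoding of cells are assumptions based on the JSON shown in the question, not a definitive implementation):
import json
import tensorflow as tf

# Hypothetical character/color vocabularies based on the question
CHARS = ['_', '0', 'x', 'y', 'z']
COLORS = ['red', 'blue', 'black']

def pattern_generator(path='patterns.json'):
    # Yield one (chars, colors) pair of integer id sequences per pattern.
    with open(path) as f:
        patterns = json.load(f)  # assumed: a list of pattern objects like the one above
    for p in patterns:
        chars = [CHARS.index(c['value']) for c in p['cells']]
        colors = [COLORS.index(c['color']) for c in p['cells']]
        yield chars, colors

dataset = tf.data.Dataset.from_generator(
    pattern_generator,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # cell characters
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # cell colors
    ),
)

# The usual pipeline transformations compose on top of this
dataset = dataset.shuffle(buffer_size=100).batch(8).prefetch(tf.data.AUTOTUNE)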