Can anyone give me a few sentences on when dependency parsers fail, why they fail, and what the fix is?
Consider the sentence below:
Sands had already begun to trickle into the bottom.
Tree: (ROOT (S
(NP (NNP Sands))
(VP (VBD had)
(ADVP (RB already))
(VP (VBN begun)
(S
(VP (TO to)
(VP (VB trickle)
(PP (IN into)
(NP (DT the) (NN bottom))))))))
(. .)))
dependency parser: [nsubj(begun-4, Sands-1), nsubj:xsubj(trickle-6,
Sands-1), aux(begun-4, had-2), advmod(begun-4, already-3),
root(ROOT-0, begun-4), mark(trickle-6, to-5), xcomp(begun-4,
trickle-6), case(bottom-9, into-7), det(bottom-9, the-8),
nmod:into(trickle-6, bottom-9), punct(begun-4, .-10)]
There are two common reasons why a dependency parser fails.
1) Here the word "Sands" is a plural proper noun (NNPS), but the POS tagger outputs NNP (singular proper noun). So there is an error in the tagger, which propagates to the dependency parser, since the parser uses POS tags to generate dependencies. To handle this case, you can retrain the POS tagger on the sentences it fails on.
2) The context of the sentence may be completely new to the dependency parser. Most parsers (spaCy, Stanford, NLTK, etc.) are trained ML models, so to handle this case you can train the dependency parser separately on new sentences.
You can refer to this link to understand how to train the POS tagger and dependency parser:
https://spacy.io/usage/training#section-tagger-parser
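For example, here is a rough sketch of fine-tuning the tagger using the spaCy v2-style API from the linked page (the training example and the NNPS annotation are just illustrative; spaCy v3 changed nlp.update to take Example objects instead):

import random
import spacy

# Fine-tune an existing pipeline on the sentences the tagger gets wrong.
nlp = spacy.load("en_core_web_sm")
TRAIN_DATA = [
    ("Sands had already begun to trickle into the bottom.",
     {"tags": ["NNPS", "VBD", "RB", "VBN", "TO", "VB", "IN", "DT", "NN", "."]}),
]

optimizer = nlp.resume_training()  # keep the existing weights (spaCy v2.1+)
for i in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)

The same loop applies to the dependency parser if you provide heads and deps annotations instead of tags.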
Hope this answers your question.
I have some sentences for which I am creating an embedding, and it works great for similarity searching unless there are truly unusual words in the sentence.
In that case, those unusual words actually carry the most similarity information of any words in the sentence, but all of that information is lost during embedding because the words are apparently not in the model's vocabulary.
I'd like to get a list of all of the words known by the GUSE embedding model so that I can mask those known words out of my sentence, leaving only the "novel" words.
I can then do an exact word search for those novel words in my target corpus and achieve usability for my similar sentence searching.
e.g. "I love to use Xapian!" gets embedded as "I love to use UNK".
If I just do a keyword search for "Xapian" instead of a semantic similarity search, I'll get much more relevant results than I would using GUSE and vector KNN.
Any ideas on how I can extract the vocabulary known/used by GUSE?
I combined the earlier answer from Roee Shenberg and the solution provided here to come up with a solution that is applicable to USE v4:
import importlib

# Use TensorFlow's internal SavedModel loader; a regular import of loader_impl fails.
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model("/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/")
graph = saved_model.meta_graphs[0].graph_def

# Find the text-preprocessing function that holds the vocabulary.
fns = [f for f in graph.library.function if "ptb" in str(f).lower()]
print(len(fns))  # should be 1

# The vocabulary is stored as a constant string tensor on this node.
nodes_with_sp = [n for n in fns[0].node_def if n.name == "Embeddings_words"]
print(len(nodes_with_sp))  # should be 1

words_tensor = nodes_with_sp[0].attr.get("value").tensor
word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list))  # should be 400004
If you are just curious about the words, I have uploaded them here.
I'm assuming you have tensorflow & tensorflow_hub installed, and that you have already downloaded the model.
IMPORTANT: I'm assuming you're looking at https://tfhub.dev/google/universal-sentence-encoder/4! There's no guarantee the object graph looks the same for different versions; modifications will likely be needed.
Find its location on disk - it's somewhere under /tmp/tfhub_modules unless you set the TFHUB_CACHE_DIR environment variable (Windows/Mac have different locations). The path should contain a file called saved_model.pb, which is the model, serialized using Protocol Buffers.
Unfortunately, the dictionary is serialized inside the model's Protocol Buffers file and not as an external asset, so we'll have to load the model and get the variable from it.
The strategy is to use tensorflow's code to deserialize the file, and then travel down the serialized object tree all the way to the dictionary.
import importlib
MODEL_PATH = 'path/to/model/dir' # e.g. '/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/'
# Use the tensorflow internal Protobuf loader. A regular import statement will fail.
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model(MODEL_PATH)
# reach into the object graph to get the tensor
graph = saved_model.meta_graphs[0].graph_def
function = graph.library.function
# NOTE: the index 5 and the two-element unpacking are specific to this model
# version (see the warning above); other versions may need a search instead.
node_type, node_value = function[5].node_def
# if you print(node_type) you'll see it's called "text_preprocessor/hash_table"
# as well as get insight into this branch of the object graph we're looking at
words_tensor = node_value.attr.get("value").tensor
word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # -> 400004
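As a quick sanity check against the original use case (purely illustrative, reusing word_list from above), you can test whether individual tokens are in the recovered vocabulary:

vocab = set(word_list)
for token in ["love", "use", "Xapian"]:
    # out-of-vocabulary tokens like "Xapian" presumably get mapped to UNK
    print(token, token in vocab)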
Some resources that helped:
A GitHub issue relating to changing the vocabulary
A Tensorflow Google-group thread linked from the issue
Extra Notes
Despite what the GitHub issue may lead you to think, the 400k words here are not the GloVe 400k vocabulary. You can verify this by downloading the GloVe 6B embeddings (file link), extracting glove.6B.50d.txt, and then using the following code to compare the two dictionaries:
with open('/path/to/glove.6B.50d.txt') as f:
    glove_vocabulary = set(line.strip().split(maxsplit=1)[0] for line in f)

USE_vocabulary = set(word_list)  # from above

print(len(USE_vocabulary - glove_vocabulary))  # -> 281150
Inspecting the different vocabularies is interesting in and of itself, e.g. why does GloVe have an entry for '287.9'?
According to documentation:
spaCy's small models (all packages that end in sm) don't ship with
word vectors, and only include context-sensitive tensors. [...]
individual tokens won't have any vectors assigned.
But when I use the de_core_news_sm model, the tokens Do have entries for x.vector and x.has_vector=True.
It looks like these are context vectors, but as far as I understand the documentation, only word vectors are accessible through the vector attribute, and sm models should have none. Why does this work for a "small model"?
has_vector behaves differently than you expect.
This is discussed in the comments on an issue raised on GitHub. The gist is that since vectors are available, has_vector is True, even though those vectors are context vectors. Note that you can still use them, e.g. to compute similarity.
Quote from spaCy contributor Ines:
We've been going back and forth on how the has_vector should behave in
cases like this. There is a vector, so having it return False would be
misleading. Similarly, if the model doesn't come with a pre-trained
vocab, technically all lexemes are OOV.
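As a minimal sketch of this behaviour (assuming de_core_news_sm is installed):

import spacy

# Small model: no static word vectors, but context-sensitive tensors are still
# exposed through the same attributes.
nlp = spacy.load("de_core_news_sm")
doc = nlp("Berlin ist eine schöne Stadt.")

print(doc[0].has_vector)           # True, even though this is a context tensor
print(doc[0].vector.shape)         # a vector is still returned
print(doc[0].similarity(doc[4]))   # similarity works, just based on context tensors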
Version 2.1.0 has been announced to include German word vectors.
I installed spaCy v2.0.2 on Ubuntu 16.04. I then used
sudo python3 -m spacy download en
to download the English model.
After that I use spaCy as follows:
from spacy.lang.en import English
p = English(parser=True, tagger=True, entity=True)
d = p("This is a sentence. I am who I am.")
print(list(d.sents))
I get this error however:
File "doc.pyx", line 511, in __get__
ValueError: Sentence boundary detection requires the dependency parse, which requires a statistical model to be installed and loaded. For more info, see the documentation:
https://spacy.io/usage/models
I really can't figure out what is going on. I have this version of the 'en' model installed:
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
which I think is the default. Any help is appreciated. Thank you.
I think the problem here is quite simple – when you call this:
p = English(parser=True, tagger=True, entity=True)
... spaCy will load the English language class containing the language data and special-case rules, but no model data and weights, which enable the parser, tagger and entity recognizer to make predictions. This is by design, because spaCy has no way of knowing if you want to load in model data and if so, which package.
So if you want to load an English model, you'll have to use spacy.load(), which will take care of loading the data, and putting together the language and processing pipeline:
nlp = spacy.load('en_core_web_sm') # name of model, shortcut name or path
Under the hood, spacy.load() will look for an installed model package called en_core_web_sm, load it and check the model's meta data to determine which language the model needs (in this case, English) and which pipeline it supports (in this case, tagger, parser and NER). It then initialises an instance of English, creates the pipeline, loads in the binary data from the model package and returns the object so you can call it on your text. See this section for a more detailed explanation of this.
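Putting it together, the fix for the example in the question would look roughly like this:

import spacy

# Load the installed model package; this sets up the tagger, parser and NER,
# and sentence boundaries then come from the dependency parse.
nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a sentence. I am who I am.")
print(list(doc.sents))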
I've found a grammar online which I want to rewrite to BNF so I can use it in a grammatical evolution experiment. From what I've read online BNF is given by this form:
<symbol> := <expression> | <term>
...but I don't see where probabilities factor into it.
In a probabilistic context-free grammar (PCFG), every production is also assigned a probability. How you choose to write this probability is up to you; I don't know of a standard notation.
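One informal convention (purely illustrative) is to annotate each alternative with its probability, with the probabilities for each left-hand side summing to 1:

<S>  ::= <NP> <VP>        [1.0]
<NP> ::= <Det> <N>        [0.7]
      |  <NP> <PP>        [0.3]
<VP> ::= <V> <NP>         [0.6]
      |  <VP> <PP>        [0.4]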
Generally, the probabilities are learned rather than assigned, so the representation issue doesn't come up; the system is given a normal CFG as well as a large corpus with corresponding parse trees, and it derives probabilities by analysing the parse trees.
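For instance, here is a sketch of that process with NLTK (assuming the bundled Penn Treebank sample has been downloaded via nltk.download('treebank')):

import nltk
from nltk.corpus import treebank

# Collect productions from gold parse trees and estimate their probabilities.
productions = []
for tree in treebank.parsed_sents()[:200]:
    productions += tree.productions()

pcfg = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)
print(pcfg.productions()[:5])  # each production now carries an estimated probability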
Note that PCFGs are usually ambiguous. Probabilities are not used to decide whether a sentence is in the language but rather which parse is correct, so with an unambiguous grammar, the probabilities would be of little use.
I have my own POS data in the format below.
Sentence:
I love Stack Overflow.
POS:
I/PRP love/VBP Stack/NNP Overflow/NNP ./.
So, how do I train SyntaxNet with this data?
I also want to get this output:
(ROOT
(S
(NP (PRP I))
(VP (VBP love)
(NP (NNP Stack) (NNP Overflow)))
(. .)))
What is the format of "record_format: 'english-text'" in the SyntaxNet context.pbtxt file? What does it look like?
The output that you are interested in is a constituency parse tree. I am afraid you won't be able to use SyntaxNet to produce constituency trees without some significant code changes.
For POS tagging only, please use the CoNLL format where you fill in only columns 1, 2 and 5:
http://ilk.uvt.nl/conll/#dataformat
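For the example sentence, a CoNLL-X style file with only columns 1 (ID), 2 (FORM) and 5 (POSTAG) filled in would look roughly like this (fields are tab-separated in the actual file, the remaining columns are underscores, and sentences are separated by a blank line):

1   I          _   _   PRP   _   _   _   _   _
2   love       _   _   VBP   _   _   _   _   _
3   Stack      _   _   NNP   _   _   _   _   _
4   Overflow   _   _   NNP   _   _   _   _   _
5   .          _   _   .     _   _   _   _   _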