Finding non-existing words with spaCy?

I am new to spaCy. I have a (German) text in which I want to find all the words that are not in the dictionary (using the de_core_news_lg pipeline). Reading spaCy's documentation, the only thing I found that looked promising was Token.has_vector. When I check all the tokens in the Doc object I get by running nlp(TEXT), I find that the tokens for which has_vector is False do indeed seem to be either typos or rare words that are unlikely to be in a dictionary.
So my hypothesis is that Token.has_vector being False is equivalent to the word not being found in the dictionary. Am I correct? Is there a better way to find words that are not in the dictionary?

spaCy does not include functionality for checking if a word is in the dictionary or not.
If you've loaded a pipeline with vectors, you can use has_vector to check whether a word vector is present for a given token. This is roughly similar to checking whether a word is in a dictionary, but it depends on the vectors: for most languages the vectors simply include every word that appeared at least a certain number of times in a training corpus, so common typos and other odd strings will be present, while some legitimate words may be missing at random.
If you want to detect "real" words in some way it's best to source your own list.
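With that caveat in mind, a minimal sketch of the has_vector approach (assuming the de_core_news_lg pipeline is installed; the example text is a placeholder):

    import spacy

    # Assumes the pipeline has been downloaded, e.g. via:
    #   python -m spacy download de_core_news_lg
    nlp = spacy.load("de_core_news_lg")

    text = "Das ist ein Beispieltext mit einem Tippfehlerr darin."
    doc = nlp(text)

    # Alphabetic tokens without a word vector; with the lg pipelines this
    # roughly corresponds to words the vector training corpus never saw.
    unknown = [t.text for t in doc if t.is_alpha and not t.has_vector]
    print(unknown)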

Related

find indexed terms in non-indexed document/string

Sorry if I'm using the wrong terminology here, I'm new to Lucene :D
Assume that I have indexed all titles of the English Wikipedia in Lucene.
Let's say I'm visiting a news website. Within the article I want to convert all phrases (that match a title in the Wikipedia) into a link to the Wikipedia page.
To clarify: I don't want to put the news article into the Lucene index, but rather use the indexed WP titles to find matches within a given string (the article). We also don't want to bother with the JS/HTML stuff, just focus on Lucene for now.
I'd also like to match greedily: i.e. if the text contains "Stack Overflow", I'd like to link to SO rather than to "Stack" and "Overflow". But if I can get shorter matches as well, that would be neat, too. (I kind of want to do both, but I'll settle for either one if having both is difficult.)
Naive solution: I can see that I could query for single words iteratively and, whenever I get a hit, try the current word plus the next word, and so on until I miss. Then convert the last match into a link and continue after that, until I'm through the complete document.
But, that seems really awkward and I have a suspicion that Lucene might have some functionality that could support me here (or at least I hope so :D), but I have no clue what I'd be looking for.
Lucene's inverted index should make this a pretty fast operation, so I guess this might perform reasonably well, even with the naive approach.
Any pointers? I'm stuck :3
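For reference, a rough sketch of that naive greedy matching (in Python for brevity, with the title lookup reduced to an in-memory set instead of a Lucene query; it keeps extending each candidate phrase so the longest title wins):

    def link_titles(words, titles):
        """Return (start, end) word-index spans of the longest matching titles."""
        matches = []
        i = 0
        while i < len(words):
            best_end = None
            # Grow the candidate phrase one word at a time, remembering the longest hit.
            for j in range(i + 1, len(words) + 1):
                phrase = " ".join(words[i:j])
                if phrase in titles:
                    best_end = j
            if best_end is not None:
                matches.append((i, best_end))
                i = best_end  # continue after the match (greedy)
            else:
                i += 1
        return matches

    print(link_titles("I posted this on Stack Overflow yesterday".split(),
                      {"Stack", "Overflow", "Stack Overflow"}))  # [(4, 6)]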
There is such a thing, it's called the Tagger Handler:
Given a dictionary (a Solr index) with a name-like field, you can post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired. It’s used for named entity recognition (NER).
It seems a bit fiddly to set up, but it's exactly what I wanted :D
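For anyone landing here later, a minimal sketch of calling the tagger from Python (this assumes a Solr 7.3+ collection with the tagger request handler configured at /tag and a tagged title field, as described in the Solr Text Tagger documentation; the URL, collection name, and field names are illustrative):

    import requests  # assumed available; any HTTP client works

    article_text = open("article.txt", encoding="utf-8").read()

    # POST the raw article text; Solr returns the matched dictionary entries
    # together with their character offsets within the text.
    resp = requests.post(
        "http://localhost:8983/solr/wikipedia_titles/tag",
        params={
            "overlaps": "NO_SUB",  # keep only the longest (greedy) matches
            "tagsLimit": 5000,
            "fl": "id,title",
            "wt": "json",
        },
        data=article_text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=UTF-8"},
    )
    for tag in resp.json()["tags"]:
        # Each entry is a flat list along the lines of
        # ["startOffset", 12, "endOffset", 26, "ids", ["Stack_Overflow"]]
        print(tag)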

Storing word list in objective c

I have previously made an anagram solver where, if you gave it a set of 9 letters, the program would find every possible 3-9 letter word that could be made out of those 9 letters.
I made this in JavaScript, where a word list of 100,000+ words was stored in a single array from which suitable answers could be found.
To find every subword of a 9-letter set, the program only needed to search through the whole array once, meaning that no matter what set of 9 letters you gave the program, the list of subwords was always returned in under a second.
I am now making the same program, but in Objective-C, as part of an iOS app I intend to make.
Would there be any issues in storing a 100,000+ word list in an NSArray in Objective-C?
Issues such as memory usage, look-up speeds, etc.
Are there any better ways of storing this word list that would make lookups faster or perhaps use less memory?
(I am a novice in Objective-C.)
Thank you for your time.
The simple answer is to try it and see. You can then use Instruments.app to see the performance.
You may find Alternative Objective-C object allocation for large arrays a worthwhile read.

Identification of an important document

I have a set of text documents in Java. I have to identify the most important document (just as an expert would) using a computer.
E.g. I have 10 books on Java, and the system identifies Java Complete Reference as the most important or most relevant document (based on its similarity to the Wikipedia page about Java).
One method would be to have a reference document and find the similarity between this document and each document in the set at hand (as mentioned in the previous example), reporting the one with maximum similarity as the most important document.
I want to identify other, more efficient methods of doing this.
Please suggest other methods for finding the relevant document (in an unsupervised way if possible).
I think another mechanism would be to have a dictionary of words and a ranking map associated with each document.
For example, in the Java Complete Reference case, there would be a dictionary of keywords and their rankings:
Java-10
J2ee-5
J2SDK-10
Java5-10 etc.,
Note: if your documents are dynamic streams and their names are also dynamic, I am not sure how to handle it.
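For the reference-document approach mentioned in the question, here is a minimal unsupervised sketch using TF-IDF and cosine similarity (assuming scikit-learn is available; the document contents are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    reference = "Text of the Wikipedia page about Java ..."   # placeholder
    documents = ["Text of book 1 ...", "Text of book 2 ..."]  # placeholders

    # Fit TF-IDF on the reference plus all candidate documents,
    # then rank the candidates by cosine similarity to the reference.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([reference] + documents)
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

    best = scores.argmax()
    print(f"Most relevant document: index {best} (score {scores[best]:.3f})")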

Obj-C / iOS: Look through a document for any one of several thousand words?

As part of a document reader I'm writing for iPhone/iPad, I need the following functionality:
Search through a document of between approximately 500 and 10,000 words for words and phrases that appear in one of several lists. Each list contains between 100 and 5,000 words and phrases. When I find a word in the document that appears in one of those lists, I mark it and move on.
I will know the word lists ahead of time, but the documents will be unknown until the moment they need to be processed.
And this needs to be VERY FAST.
Any help would be greatly appreciated!
This presentation and paper present a fast multi-pattern string search algorithm. They also mention some predecessors, should this one not fit your needs.
Multifast is an open source (LGPLed) C library that implements the Aho-Corasick algorithm.
I would create a huge hashmap with the phrases and words to search against at load time, since hashmap lookups are very, very fast, especially at these sizes. Obviously a memory-hungry solution, but pretty trivial.
iOS 4 and above seems to have functionality for custom dictionaries; perhaps you could exploit that somehow?
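To make the Aho-Corasick suggestion concrete, a small sketch in Python (assuming the pyahocorasick package; an LGPL C library such as Multifast would play the same role in an iOS project):

    import ahocorasick  # pip install pyahocorasick (assumed available)

    phrases = ["stack overflow", "word list", "aho-corasick"]  # your lists, lowercased

    # Build the automaton once, up front, from the known word/phrase lists.
    automaton = ahocorasick.Automaton()
    for idx, phrase in enumerate(phrases):
        automaton.add_word(phrase, (idx, phrase))
    automaton.make_automaton()

    document = "This document mentions Stack Overflow and a word list."
    # A single pass over the document reports every occurrence of every phrase.
    for end_index, (idx, phrase) in automaton.iter(document.lower()):
        start_index = end_index - len(phrase) + 1
        print(start_index, end_index, phrase)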

What are the things we should consider while writing a Spell Checker?

I want to write a very simple spell checker. The spell checker will try to match the input word with equivalent words from the dictionary.
What can be done to find those 'equivalent words'? What analysis can be performed on two words to mark them as equivalent?
Before investing too much in trying to unravel that, I'd first look at already existing implementations like Aspell or NetSpell, for two main reasons:
There is not much point in re-inventing the wheel. Spell checking is much trickier than it first appears, and it makes sense to build on work that has already been done.
If your interest is in finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway.
Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get some feel by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
Edit Distance is the theory you need to write a spell checker. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.
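As a concrete illustration of that, a minimal (unoptimized) Levenshtein edit distance sketch:

    def edit_distance(a: str, b: str) -> int:
        """Minimum number of insertions, deletions, and substitutions turning a into b."""
        prev = list(range(len(b) + 1))  # distances for the empty prefix of a
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    print(edit_distance("recieve", "receive"))  # 2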
I just finished implementing a spell checker and used a combination of the following in getting a list of "suggested" words
Phonetic hashing of the "misspelled" word to lookup a hash of identical dictionary hashed real words (for java check out Apache Commons Codec for a suitable library). The phonetic hash of your dictionary file can be precomputed.
Edit distance between the input and the potentials (this is reasonably expensive so you need to reduce the list first with something like a phonetic hash, assuming a higher volume load - in my case, a server based spell check)
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the English language
Essentially, I weighted each potential word primarily based on edit distance and commonality. For example, if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also override any result with the known common misspellings (i.e. these always float to the top suggested results).
There may be better ways, but this worked pretty well.
You may also wish to ignore ALL CAPS words, initials, etc., so choosing what to ignore is also something to think about.
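Pulling those pieces together, a rough sketch of the suggestion-ranking step described above (edit_distance() is the function from the sketch earlier; the phonetic hash, word probabilities, and misspelling list here are crude placeholders for what you would precompute from your dictionary and corpus):

    def phonetic_hash(word):
        # Crude stand-in for a real phonetic hash (Soundex, Double Metaphone, ...):
        # keep the first letter and drop later vowels, so "receeve" and "receive" collide.
        return word[0] + "".join(c for c in word[1:] if c not in "aeiou")

    def suggest(misspelled, phonetic_index, probabilities, common_misspellings, limit=5):
        """Rank candidate corrections; lower weight is better."""
        # Known common misspellings always float to the top.
        if misspelled in common_misspellings:
            return [common_misspellings[misspelled]]
        # Reduce the search space with the phonetic hash, then rank the survivors.
        candidates = phonetic_index.get(phonetic_hash(misspelled), [])
        def weight(word):
            probability = probabilities.get(word, 0.01)  # word frequency as a percentage
            return edit_distance(misspelled, word) * 100 / probability
        return sorted(candidates, key=weight)[:limit]

    # Toy data; in practice these come from your dictionary and corpus statistics.
    words = {"receive": 0.8, "relieve": 0.5, "review": 1.2}
    index = {}
    for w in words:
        index.setdefault(phonetic_hash(w), []).append(w)
    print(suggest("receeve", index, words, {"recieve": "receive"}))  # ['receive']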
Under Linux/Unix you have ispell. Why reinvent the wheel?