How to add user-defined words to spaCy

I am new to spaCy and am using it to process medical literature. I found that the tokenizer splits Latin names made up of two words into two independent tokens, which is inappropriate for my use case. In addition, I have thousands of custom terms, mostly biological names that usually consist of two words, such as Angelica sinensis. How can I add these custom terms to spaCy so that the tokenizer recognizes each multi-word term as a single token instead of splitting it? Thank you.

If you have a list of multi-word expressions that you would like to treat as single tokens, the easiest thing to do is use an EntityRuler to mark them as entities and then use the merge_entities component.
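For example, here is a minimal sketch of that approach, assuming spaCy v3; the TAXON label, the example patterns and the blank English pipeline are just illustrative choices:

    import spacy

    # A blank pipeline is enough for rule-based matching;
    # a full model such as en_core_web_sm works the same way.
    nlp = spacy.blank("en")

    # Mark the multi-word terms as entities with an EntityRuler.
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([
        {"label": "TAXON", "pattern": "Angelica sinensis"},
        {"label": "TAXON", "pattern": "Panax ginseng"},
    ])

    # merge_entities retokenizes each entity span into a single token.
    nlp.add_pipe("merge_entities")

    doc = nlp("Angelica sinensis and Panax ginseng are widely studied.")
    print([token.text for token in doc])
    # ['Angelica sinensis', 'and', 'Panax ginseng', 'are', 'widely', 'studied', '.']

With thousands of terms you would load the patterns from a file (for example with ruler.from_disk, or by building the pattern list programmatically) rather than listing them inline.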

Related

How to deal with several numbers in camel case?

I have a variable like model_1_5_7 and I need to rename it in camel case. I need this because my models used to be functions in Python, but now I need to turn them into classes.
You could use a letter instead of the underscore, for instance p (for point), which would give model1p5p7. If the model numbers are dense you could also use a three-dimensional array named models and designate the individual models with three indices, for example models[1][5][7].
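For instance, a small sketch of that renaming rule (the helper name is hypothetical):

    import re

    # Replace each underscore that sits between two digits with "p" (for "point"),
    # then drop the remaining underscores: model_1_5_7 -> model1p5p7
    def to_camel(name: str) -> str:
        name = re.sub(r"(?<=\d)_(?=\d)", "p", name)
        return name.replace("_", "")

    print(to_camel("model_1_5_7"))  # model1p5p7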

Lucene: combining ASCII folding and stemming for French

I am implementing a Lucene search for French text. The search must work regardless of whether the user has typed accents or not, and it must also support stemming. I am currently using the Snowball-based French stemmer in Lucene 3.
On the indexing side, I have added an ASCIIFoldingFilter into my analyzer, which runs after the stemmer.
However, on the search side, the operation is not reversible: the stemmer only works if the input contains accents. For example, it stems the ité from the end of université, but with a user search input of universite, the stemmer returns universit during query analysis. Of course, since the index contains the term univers, the search for universit returns no results.
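The same asymmetry can be reproduced outside Lucene with NLTK's Snowball-based French stemmer (a sketch, assuming NLTK is installed; it implements the same Snowball algorithm family):

    from nltk.stem.snowball import FrenchStemmer

    stemmer = FrenchStemmer()
    # With the accent present the suffix rule fires; without it, it does not,
    # so the indexed term and the query term no longer stem to the same string.
    print(stemmer.stem("université"))
    print(stemmer.stem("universite"))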
A solution seems to be to change the order of stemming and folding in the analyzer: instead of stemming and then folding, do the folding before stemming. This effectively makes the operation reversible, but also significantly hobbles the stemmer since many words no longer match the stemming rules.
Alternatively, the stemmer could be modified to operate on folded input, i.e. to ignore accents, but could this result in over-stemming?
Is there a way to effectively do folded searches without changing the behavior of the stemming algorithm?
Step 1.) Use an exhaustive lemma synonym mapping file
Step 2.) ASCII (ICU) Fold after lemmatizing.
You can get exhaustive French lemmas here:
http://www.lexiconista.com/datasets/lemmatization/
Also, because lemmatizers are not destructive the way stemmers are, you can apply the lemmatizer multiple times. Perhaps your lemmatizer also contains accent-free normalizations; if so, just apply the lemmatizer again.
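In spirit, the suggested pipeline is "lemmatize first, then fold". Here is a language-agnostic sketch in Python rather than a Lucene analyzer chain; the tiny LEMMAS dict stands in for the exhaustive mapping file:

    import unicodedata

    # Rough equivalent of ASCIIFoldingFilter / ICUFoldingFilter for Latin text:
    # decompose, then strip the combining accent marks.
    def fold(text: str) -> str:
        return "".join(c for c in unicodedata.normalize("NFKD", text)
                       if not unicodedata.combining(c))

    # Stand-in for an exhaustive form -> lemma mapping file.
    LEMMAS = {"universités": "université", "université": "université"}

    def analyze(token: str) -> str:
        lemma = LEMMAS.get(token, token)  # lemmatize first (non-destructive)
        return fold(lemma)                # fold accents afterwards

    print(analyze("universités"))                           # universite
    print(analyze("universite") == analyze("université"))   # True: accent-free query still matches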

How to convert foreign characters to English characters in SQL Query?

I have to create a SQL function that converts special characters and international characters (French, Chinese, ...) to English.
Is there a built-in SQL function I can use?
Thanks for your help.
If you are after English names for the characters, that is an achievable goal, as they all have published names as part of the Unicode standard.
See for example:
http://www.unicode.org/ucd/
http://www.unicode.org/Public/UNIDATA/
Your task then is simply to turn the list of Unicode characters into a table with 100,000 or so rows. Unfortunately the names you get will be things like ARABIC LIGATURE LAM WITH MEEM MEDIAL FORM.
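For illustration (in Python rather than SQL), the standard unicodedata module exposes exactly these published names:

    import unicodedata

    for ch in "é漢ب":
        print(ch, unicodedata.name(ch))
    # é LATIN SMALL LETTER E WITH ACUTE
    # 漢 CJK UNIFIED IDEOGRAPH-6F22
    # ب ARABIC LETTER BEH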
On the other hand, if you want to actually translate the meaning, you need to be looking at machine translation software. Both Microsoft and Google have well-known cloud translation offerings and there are several other well-thought of products too.
I think the short answer is you can't unless you narrow your requirements a lot. It seems you want to take a text sample, A, and convert it into romanized text B.
There are a few problems to tackle:
Languages are typically not romanized on a single character basis. The correct pronunciation of a character is often dependent on the characters and words around it, and can even have special rules for just one word (learning English can be tough because it is filled with these, having borrowed words from many languages without normalizing the spelling).
Even if you code rules for every language you want to support, you still have homographs: words that are spelled using exactly the same characters but that have different pronunciations (and thus romanizations) depending on what is meant - for example "sow" meaning a female pig (rhyming with "cow") versus "sow" meaning to plant seeds (rhyming with "go").
And then you get into the problem of which language you are romanizing: characters and even words are not unique to one language, but the actual meaning and romanization can vary. The fact that many languages include loan words from the languages they share characters with complicates any attempt to automatically determine which language you are trying to romanize.
Given all these difficulties, what it is you actually want to achieve (what problem are you solving)?
You mention French among the languages you want to "convert" into English - yet French (with its accented characters) is already written in the Roman alphabet. Even everyday words used in English occasionally make use of accented characters, though these are rare enough that the meaning and pronunciation are understood even if the accents are omitted (e.g. résumé).
Is your problem really that you can't store unicode/extended ASCII? There are numerous ways to correct or work around that.

Indexing multilingual words in lucene

I am trying to index in Lucene a field that could have RDF literal in different languages.
Most of the approaches I have seen so far are:
Use a single index, where each document has a field for each language it uses, or
Use M indexes, M being the number of languages in the corpus.
Lucene 2.9+ has a feature called Payload that allows attributes to be attached to terms. Has anyone used this mechanism to store language information (or other attributes such as datatypes)? How does its performance compare to the other two approaches? Any pointer to source code showing how it is done would help. Thanks.
It depends.
Do you want to allow something like: "Search all english text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.
In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).
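If it helps to see the language-field approach end to end, here is a sketch using Whoosh, a pure-Python Lucene-like library, purely as an illustration of the +text:foo +language:english idea (field names and sample documents are made up):

    import tempfile
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.index import create_in
    from whoosh.qparser import QueryParser

    # One analyzed text field plus a keyword field holding the document language.
    schema = Schema(text=TEXT(stored=True), language=ID(stored=True))
    ix = create_in(tempfile.mkdtemp(), schema)

    writer = ix.writer()
    writer.add_document(text="the quick brown fox", language="english")
    writer.add_document(text="le renard brun rapide", language="french")
    writer.commit()

    with ix.searcher() as searcher:
        # Equivalent of the Lucene query  +text:fox +language:english
        query = QueryParser("text", schema).parse("fox AND language:english")
        for hit in searcher.search(query):
            print(hit["text"], hit["language"])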
So basically Lucene is a ranking algorithm; it just looks at strings and compares them to other strings. They can be encoded in different character encodings, but their similarity is the same nonetheless. Just make sure you load the SnowballAnalyzer with the stemmer for the language in question, say Spanish or Chinese, and you should get results.

What does a single sentence consist of? How to name it?

I'm designing the architecture of a text parser. Example sentence: Content here, content here.
The whole sentence is a... sentence, that's obvious. The, quick, etc. are words; , and . are punctuation marks. But what are words and punctuation marks all together, in general? Are they just symbols? I simply don't know how to name what a single sentence consists of in the most reasonable abstract way (because one might say it consists of letters/vowels, etc.).
Thanks for any help :)
What you're doing is technically lexical analysis ("lexing"), which takes a sequence of input symbols and generates a series of tokens or lexemes. So word, punctuation and white-space are all tokens.
In (E)BNF terms, lexemes or tokens are synonymous with "terminal symbols". If you think of the set of parsing rules as a tree the terminal symbols are the leaves of the tree.
So what's the atom of your input? Is it a word or a sentence? If it's words (and white-space) then a sentence is more akin to a parsing rule. In fact the term "sentence" can itself be misleading. It's not uncommon to refer to the entire input sequence as a sentence.
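A tiny sketch of that lexing step with a regular-expression tokenizer, splitting the input into word, punctuation and white-space tokens:

    import re

    # Words, single punctuation characters, and runs of white-space.
    TOKEN = re.compile(r"\w+|[^\w\s]|\s+")

    print(TOKEN.findall("Content here, content here."))
    # ['Content', ' ', 'here', ',', ' ', 'content', ' ', 'here', '.']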
A semi-common term for a sequence of non-white-space characters is a "textrun".
A common term comprising the two sub-categories "words" and "punctuation", often used when talking about parsing, is "tokens".
Depending on what stage of your lexical analysis of input text you are looking at, these would be either "lexemes" or "tokens."