Fuzziness in UIMA Ruta

Is there any option for fuzziness in word matching, or for ignoring some special cases?
For example:
STRINGLIST ANIMALLIST = {"LION","TIGER","MONKEY"};
DECLARE ANIMAL;
Document {-> MARKFAST(ANIMAL, ANIMALLIST, true)};
I need to match words against the list even when I encounter some special character, like
Tiger- or MONKEY$
According to the documentation there are different evaluators; any idea how to use them?
Or can I use SCORE or MARKSCORE?

There are several aspects to consider here. In general, UIMA Ruta does not support fuzziness in the dictionary lookup. SCORE and MARKSCORE are language elements which can be utilized to introduce some heuristic scoring (not really fuzziness) in sequential rules. In the examples you gave in your question, you do not really need fuzzy matching.
The dictionary lookup in UIMA Ruta works on the RutaBasic annotations. These annotations are automatically created and maintained by UIMA Ruta itself (and should not be changed by other analysis engines or rules directly). The RutaBasic annotations represent the smallest fragments that other annotations refer to. By default, the seeder of the RutaEngine creates annotations for words (W -> CW, SW, CAP) and many other tokens, like SPECIAL for - or $. This means that there is also a RutaBasic annotation for each of these tokens, and that the dictionary lookup can distinguish between them. As a result, Tiger and Monkey should be annotated, and the example in your question should actually work (I tested it). You may need some postprocessing in order to include the SPECIAL in ANIMAL.
I have to mention that there is also the functionality to use an edit distance in the dictionary lookup (Multi Tree Word List, TRIE). However, this functionality has not been maintained for several years. It should also support different weights for specific replacements. I do not know if this counts as fuzziness.
DISCLAIMER: I am a developer of UIMA Ruta

Related

Use MeCab to separate Japanese sentences into words, not morphemes, in VB.NET

I am using the following code to split Japanese sentences into its words:
Dim parameter = New MeCabParam()
Dim tagger = MeCabTagger.Create(parameter)
For Each node In tagger.ParseToNodes(sentence)
    If node.CharType > 0 Then
        Dim features = node.Feature.Split(",")
        Console.Write(node.Surface)
        Console.WriteLine(" (" & features(7) & ") " & features(1))
    End If
Next
An input of それに応じて大きくになります。 outputs morphemes:
それ (それ) 代名詞
に (に) 格助詞
応じ (おうじ) 自立
て (て) 接続助詞
大きく (おおきく) 自立
に (に) 格助詞
なり (なり) 自立
ます (ます) *
。 (。) 句点
Rather than words like so:
それ
に
応じて
大きく
に
なります
。
Is there a way I can use a parameter to get MeCab to output the latter? I am very new to coding so would appreciate it if you explain simply. Thanks.
This is actually pretty hard to do. MeCab, Kuromoji, Sudachi, KyTea, Rakuten-MA—all of these Japanese parsers and the dictionary databases they consume (IPADIC, UniDic, Neologd, etc.) have chosen to parse morphemes, the smallest units of meaning, instead of what you call "words", which as your example shows often contain multiple morphemes.
There are some strategies that folks usually combine to improve on this.
Experiment with different dictionaries. I've noticed that UniDic is sometimes more consistent than IPADIC.
Use a bunsetsu chunker like J.DepP, which consumes the output of MeCab to chunk together morphemes into bunsetsu. Per this paper, "We use the notion of a bunsetsu which roughly corresponds to a minimum phrase in English and consists of a content words (basically nouns or verbs) and the functional words surrounding them." The bunsetsu output by J.DepP often correspond to "words". I personally don't think of, say, a noun + particle phrase as a "word" but you might—these two are usually in a single bunsetsu. (J.DepP is also pretttty fancy, in that it also outputs a dependency tree between bunsetsu, so you can see which one modifies or is secondary to which other one. See my example.)
A last technique that you shouldn't overlook is scanning the dictionary (JMdict) for runs of adjacent morphemes; this helps find idioms or set phrases. It can get complicated because the dictionary may have a deconjugated form of a phrase in your sentence, so you might have to search both the literal sentence form and the deconjugated (lemma) form of MeCab output.
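As a rough illustration of that last strategy, here is a short Python sketch (Python rather than VB.NET, just to keep it compact) of greedy longest-run matching of adjacent morphemes against a dictionary; morphemes and jmdict_surfaces are hypothetical stand-ins for MeCab's surface output and a preloaded set of JMdict surface/lemma forms:

# Greedy longest-match of adjacent morpheme runs against a set of dictionary forms.
def chunk_by_dictionary(morphemes, dictionary, max_len=6):
    chunks, i = [], 0
    while i < len(morphemes):
        match_end = None
        # Try the longest run first and shrink it until something is in the dictionary
        # (single morphemes simply pass through unchanged).
        for j in range(min(i + max_len, len(morphemes)), i + 1, -1):
            if "".join(morphemes[i:j]) in dictionary:
                match_end = j
                break
        if match_end is None:
            chunks.append(morphemes[i])
            i += 1
        else:
            chunks.append("".join(morphemes[i:match_end]))
            i = match_end
    return chunks

# Hypothetical inputs: MeCab surfaces for the example sentence and a tiny stand-in for JMdict.
morphemes = ["それ", "に", "応じ", "て", "大きく", "に", "なり", "ます", "。"]
jmdict_surfaces = {"応じて", "なります"}
print(chunk_by_dictionary(morphemes, jmdict_surfaces))
# ['それ', 'に', '応じて', '大きく', 'に', 'なります', '。']

In practice you would also check the lemma (deconjugated) forms that MeCab reports, as described above, rather than only the literal surface runs.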
I have an open-source package called Curtiz that combines all of the above: it runs text through MeCab, chunks the output into bunsetsu with J.DepP to find groups of morphemes that belong together, identifies vocabulary by looking words up in the dictionary, separates particles and conjugated phrases, etc. It is likely not going to be useful for you, since I use it to support my own activities in learning Japanese and making Japanese learning tools, but it shows how the above pieces can be combined to get what you need in Japanese NLP.
Hopefully that's helpful. I'm happy to elaborate more on any of the above topics.

Automatically detect security identifier columns using Visions

I'm interested in using the Visions library to automate the process of identifying certain types of security (stock) identifiers. The documentation mentions that it could be used in such a way for ISBN codes but I'm looking for a more concrete example of how to do it. I think the process would be pretty much identical for the fields I'm thinking of as they all have check digits (ISIN, SEDOL, CUSIP).
My general idea is that I would create custom types for the different identifier types and could use those types to
Take a dataframe where the types are unknown and identify columns matching the types (even if it's not a 100% match)
Validate the types on a dataframe where the intended type is known
Great question and use-case! Unfortunately, the documentation on making new types probably needs a little love right now, as there were API-breaking changes with the 0.7.0 release. Both the previous link and this post from August 2020 should cover the conceptual idea of type creation in greater detail. If any of those examples break, then mea culpa and our apologies; we switched to a dispatch-based implementation to support different backends (pandas, numpy, dask, spark, etc.) for each type. You shouldn't have to worry about that for now, but if you're interested you can find the default type definitions here with their backends here.
Building an ISBN Type
We need to make two basic decisions when defining a type:
What defines the type?
What other types are our new type related to?
For the ISBN use-case O'Reilly provides a validation regex to match ISBN-10 and ISBN-13 codes. So,
What defines a type?
We want every element in the sequence to be a string which matches a corresponding ISBN-10 or ISBN-13 regex
What other types are our new type related to?
Since ISBNs are themselves strings, we can use the default String type provided by visions.
Type Definition
from typing import Sequence

import pandas as pd

from visions.relations import IdentityRelation, TypeRelation
from visions.types.string import String
from visions.types.type import VisionsBaseType

isbn_regex = "^(?:ISBN(?:-1[03])?:? )?(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$"

class ISBN(VisionsBaseType):
    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(String),
        ]
        return relations

    @staticmethod
    def contains_op(series: pd.Series, state: dict) -> bool:
        return series.str.contains(isbn_regex).all()
Looking at this closely, there are three things to take note of.
The new type inherits from VisionsBaseType
We had to define a get_relations method, which is how we relate a new type to others we might want to use in a typeset. In this case, I've used an IdentityRelation to String, which means ISBNs are subsets of String. We can also use InferenceRelations when we want to support relations which change the underlying data (say, converting the string '4.2' to the float 4.2).
A contains_op: this is our definition of the type. In this case, we are applying a regex to every element in the input and verifying that it matches the pattern provided by O'Reilly.
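To connect this back to your two goals (detecting unknown columns and validating known ones), here is a minimal usage sketch that relies only on the contains_op defined above; the helper name detect_isbn_columns and the sample data are purely illustrative. The membership form series in ISBN should work too, and for detection across many types at once you would add ISBN to a typeset.

import pandas as pd

df = pd.DataFrame({
    "maybe_isbn": ["978-0-306-40615-7", "0-19-852663-6"],
    "note": ["hello", "world"],
})

def detect_isbn_columns(frame):
    # Report the columns whose every value satisfies the ISBN type.
    return [col for col in frame.columns if ISBN.contains_op(frame[col], {})]

print(detect_isbn_columns(df))                 # ['maybe_isbn']
print(ISBN.contains_op(df["maybe_isbn"], {}))  # validate a known column -> True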
Extensions
In theory, ISBNs can also be encoded as what looks like a 10- or 13-digit integer; to work with those you might want to create an InferenceRelation between Integer and ISBN. A simple implementation would involve coercing integers to strings and applying the above regex.
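A rough sketch of just that coercion-and-check step might look like the following (the exact InferenceRelation signature depends on your visions version, so the relation wiring itself is left to the docs; the sample integers are illustrative ISBN-13-shaped values):

import pandas as pd

ints = pd.Series([9780306406157, 9780596520687])
# Coerce the integers to strings, then reuse the regex from the type definition above.
as_strings = ints.astype(str)
print(as_strings.str.contains(isbn_regex).all())  # True when every value looks like an ISBN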

Find MMT Unicode abbreviations for given symbol (e.g. given ☞, find "juri")

The usual IDEs/editors for MMT (e.g. IntelliJ + MMT plugin or jEdit) offer autocompletion for certain useful Unicode characters. For instance, I can type jle and immediately get jleftrightarrow suggested, which, upon autocompletion, is replaced by ↔.
Is there a way to find out the reverse association? E.g. I have the symbol ☞ at hand and would like to know the autocompletion abbreviation starting with j, if one exists. In that case, I would get juri.
The MMT OnlineTools I developed allow this: https://comfreek.github.io/mmteditor.
If you already have a string full of Unicode symbols that you don't know how to type, just paste it under "how do I type X?". And if you are looking for a specific abbreviation, by Unicode character or by (parts of) its name, use the "abbreviation search" feature.
Internally, my tool pulls from (a copy of) the same resource file that Dennis linked in his answer.
As far as I know, there currently isn't a good way to look up or search for the ASCII abbreviations, except to go straight for the source — which at least has the advantage that it's guaranteed to be up-to-date.
The IDE plugins all have access to an mmt.jar and load their abbreviations from a specific resource file embedded therein. You can find it here on GitHub: https://github.com/UniFormal/MMT/blob/master/src/mmt-api/resources/unicode/unicode-latex-map.
In the long term, we should consider extending that file with a third "field" that gives a short description, and e.g. have a text field in IntelliJ to search for a specific abbreviation.
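If you want to script the reverse lookup yourself against that file, something along these lines could work. It assumes the resource is a plain-text mapping with one abbreviation/character pair per line separated by "|" (check the file on GitHub for the exact format and adjust the split accordingly); the URL is just the raw view of the file linked above.

import urllib.request

# Raw view of the resource file linked above.
URL = ("https://raw.githubusercontent.com/UniFormal/MMT/master/"
       "src/mmt-api/resources/unicode/unicode-latex-map")

def reverse_lookup(symbol):
    # Collect every abbreviation whose mapped character equals the given symbol.
    text = urllib.request.urlopen(URL).read().decode("utf-8")
    hits = []
    for line in text.splitlines():
        if "|" not in line:
            continue
        abbrev, char = line.split("|", 1)
        if char.strip() == symbol:
            hits.append(abbrev.strip())
    return hits

print(reverse_lookup("☞"))  # ideally something like ['juri'], if the assumed format holds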

How to assign lexical features to new unanalyzable tokens in spaCy?

I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text which I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to make these spans into single tokens. I'd like the remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression for the token text, then the best way to do this is a custom pipeline component, added before the tagger, that sets the tags for those words.
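For spaCy 2.3, such a component could look roughly like this; the regex, the pipe name, and the example text are placeholders for whatever your scanner actually matches, and the small English model is assumed to be installed:

import re
import spacy

nlp = spacy.load("en_core_web_sm")
SKIP_PATTERN = re.compile(r"^[A-Z]{2,}\d+$")  # placeholder for your scanner's pattern

def force_propn(doc):
    # Pre-set TAG/POS for matching tokens; the tagger leaves existing tags alone.
    for token in doc:
        if SKIP_PATTERN.match(token.text):
            token.tag_ = "NNP"
            token.pos_ = "PROPN"
    return doc

nlp.add_pipe(force_propn, name="force_propn", before="tagger")

doc = nlp("The report mentions ABC123 twice.")
print([(t.text, t.pos_, t.tag_) for t in doc])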
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
The parser does use the token.norm feature, so if you set the NORM feature in the retokenizer to something extremely PROPN-like, you might have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, use a word you think would be a frequent similar proper noun in American newspaper text from 20 years ago, so if the skipped token should be like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it could help.
If you're using a model with vectors, like en_core_web_lg, you can also set the vectors for the skipped token to be the same as those of a similar word (check that the similar word has a vector first). This is how to tell the model to refer to the same row in the vector table for UNKNOWN_SKIPPED as for Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this can be tricky to do well without affecting the accuracy of the parser for other cases.
The NORM and vector modifications will also influence the tagger, so if you choose those well, you might get pretty close to the results you want.
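And to make the retokenizer-based suggestions above concrete, here is a sketch of a first-in-pipeline component that merges a skipped span while setting TAG, POS, and NORM in one step; the span indices and the example sentence are placeholders for whatever your scanner finds, and "Bush" follows the NORM suggestion above:

import spacy

nlp = spacy.load("en_core_web_sm")

def merge_skipped(doc):
    # Placeholder: pretend the scanner flagged tokens 3-5 as one unanalyzable span.
    with doc.retokenize() as retokenizer:
        retokenizer.merge(
            doc[3:6],
            attrs={"TAG": "NNP", "POS": "PROPN", "NORM": "Bush"},
        )
    return doc

nlp.add_pipe(merge_skipped, name="merge_skipped", first=True)

doc = nlp("We met with XQ 17 B downtown yesterday.")
print([(t.text, t.pos_, t.norm_, t.dep_) for t in doc])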

Is there an APL idiom to get a vector of all alphabetical characters?

I know you can get a character vector of all numbers with ∊⍕¨⍳10, but is there a platform independent idiom for getting a vector of all alphabetical characters, aside from manually typing 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'? I know that I can do ⎕AV[(⍳26)+(⎕AV⍳'a')-1] to get all lowercase characters (and uppercase by changing the 'a' to 'A') in Dyalog APL, but I presume the system variable ⎕AV isn't available in other environments.
Not really.
In Dyalog APL, what I generally do is use ⎕A for the uppercase characters and ⎕UCS 96+⍳26 for the lowercase characters. (And ⎕A,⎕UCS 96+⍳26 for the whole alphabet.)
⎕AV is usually present, but its contents are not standard. (For example, NARS2000's ⎕AV is different from Dyalog's.) By the way, in Dyalog ⎕AV is considered deprecated in favour of ⎕UCS. Any APL that implements ⎕UCS will do it the same way, because Unicode is a set standard.
If you want a guaranteed implementation-independent, readable way to define the alphabet, I would indeed recommend just storing 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ' in your workspace.
However, I would not recommend trying to write implementation-independent APL code to begin with. APL dialects are rather divergent, so this is decidedly nontrivial (if possible at all for complex code), and will be difficult to maintain.
Even though Quad names (⎕xxx) are usually case insensitive, MicroAPL distinguishes between ⎕A and ⎕a, so ⎕a,⎕A gives 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'.
Yes, in the latest versions*, write ⎕A,819⌶⎕A, for ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.
Try it online!
Documentation.
* Latest builds of 14.1, and all versions of 15.0 and up.