Seeking the right token filters for my requirements and getting desperate - lucene

I'm indexing documents which contain normal text, programming code and other non-linguistic fragments. For reasons which aren't particularly relevant, I am trying to tokenise the content into lowercased strings of normal language and single-character symbols.
Thus the input
a few words. Cost*count
should generate the tokens
[a] [few] [words] [.] [cost] [*] [count]
So far, so straightforward. But I want to handle "compound" words too, because the content can include words like order_date, object-oriented and class.method as well.
I'm following the principle that any of [-], [_] and [.] should be treated as a compound-word connector rather than a symbol IF it sits between two word characters, and as a separate symbol character if it is adjacent to a space, another symbol character, or the beginning or end of the string. I can handle all of this with a PatternTokenizer, like so:
public static final String tokenRgx = "(([A-Za-z0-9]+[-_.])*[A-Za-z0-9]+)|[^A-Za-z0-9\\s]{1}";

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    // Group 0: each whole regex match becomes one token.
    PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRgx), 0);
    TokenStream result = new LowerCaseFilter(src);
    return new TokenStreamComponents(src, result);
}
This successfully distinguishes between full stops at the end of sentences and full stops in compounds, between hyphens introducing negative numbers and hyphenated words, etc. So in the above analyzer, the input:
a few words. class.simple_method_name. dd-mm-yyyy.
produces the tokens
[a] [few] [words] [.] [class.simple_method_name] [.] [dd-mm-yyyy] [.]
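(For reference, that output can be checked with a small harness like this sketch; CodeAnalyzer is my stand-in name for the Analyzer subclass containing the createComponents() above:)

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintTokens {
    public static void main(String[] args) throws IOException {
        // CodeAnalyzer is a hypothetical name for the analyzer shown above.
        try (Analyzer analyzer = new CodeAnalyzer();
             TokenStream ts = analyzer.tokenStream("field",
                     "a few words. class.simple_method_name. dd-mm-yyyy.")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            ts.end();
        }
    }
}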
We're almost there, but not quite. Finally I want to split the compound terms into their parts RETAINING the trailing hyphen/underscore/stop character in each part. So I think I need to introduce another filter step to my analyzer so that the final set of tokens I end up with is this:
[a] [few] [words] [.] [class.] [simple_] [method_] [name] [.] [dd-] [mm-] [yyyy] [.]
And this is the piece that I am having trouble with. I presume that some kind of PatternCaptureGroupTokenFilter is required here but I haven't been able to find the right set of expressions to get the exact tokens I want emerging from the analyzer.
I know it must be possible, but I seem to have walked into a regular expression wall that blocks me. I need a flash of insight or a hint, if anyone can offer me one.
Thanks,
T
Edit:
Thanks to @rici for pointing me towards the solution.
The string which works (including support for decimal numbers) is:
String tokenRegex = "-?[0-9]+\\.[0-9]+|[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]";
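Dropped back into the analyzer, the finished component looks something like this (a sketch; the class name is mine, and the imports assume a recent Lucene):

import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.pattern.PatternTokenizer;

// Illustrative class name, not from the original post.
public final class CompoundSplittingAnalyzer extends Analyzer {
    private static final String TOKEN_REGEX =
            "-?[0-9]+\\.[0-9]+|[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]";

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Group 0: emit each whole regex match as one token.
        PatternTokenizer src = new PatternTokenizer(Pattern.compile(TOKEN_REGEX), 0);
        return new TokenStreamComponents(src, new LowerCaseFilter(src));
    }
}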

Seems to me like it would be easier to do the whole thing in one scan, using a regex like:
[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]
That uses a zero-width lookahead assertion in order to add [-._] to the preceding word only if it is immediately followed by a letter or digit. (Because (?=…) is an assertion, it doesn't include the following character in the match.)
To my mind, that won't correctly handle decimal numbers; -3.14159 will be three tokens rather than a single number token. But it depends on your precise needs.
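That caveat is easy to check with a plain Pattern/Matcher loop (a quick sketch):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DecimalCheck {
    public static void main(String[] args) {
        // Without the leading decimal-number alternative, -3.14159 splits apart.
        Pattern p = Pattern.compile(
                "[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]");
        Matcher m = p.matcher("-3.14159");
        while (m.find()) {
            System.out.print("[" + m.group() + "] "); // prints: [-] [3.] [14159]
        }
    }
}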

Related

ANTLR lexer patern [\p{Emoji}]+ is matching numbers

The ANTLR4 lexer pattern [\p{Emoji}]+ is matching numbers. See screenshot. Note that it correctly rejects alpha chars. Is there an issue with the pattern?
\p{Emoji} matches everything that has the Unicode Emoji property. Numbers do have that property, so \p{Emoji} is correct in matching them. Why though?
The Unicode standard defines any codepoint to have the Emoji property if it can appear as part of an emoji. Numbers can appear as parts of emojis (for example, the keycap emojis, which count as emojis, consist of a digit followed by a variation selector and a combining keycap), so they have that property.
If you only want to match codepoints that are emojis by themselves, you can just use the Emoji_Presentation property instead. This will fail to match combined emojis though.
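You can check these properties with ICU4J (a sketch, assuming ICU4J is on the classpath; it's the same library the UnicodeSet code quoted further down comes from):

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class EmojiProps {
    public static void main(String[] args) {
        int digit = '1';
        int grinning = 0x1F600; // GRINNING FACE
        // '1' has Emoji (it can occur inside an emoji sequence)
        // but not Emoji_Presentation (it isn't an emoji on its own).
        System.out.println(UCharacter.hasBinaryProperty(digit, UProperty.EMOJI));              // true
        System.out.println(UCharacter.hasBinaryProperty(digit, UProperty.EMOJI_PRESENTATION)); // false
        System.out.println(UCharacter.hasBinaryProperty(grinning, UProperty.EMOJI_PRESENTATION)); // true
    }
}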
If you want to match any sequence that creates an emoji, I think you'll want to match something like "Emoji_Presentation, followed by zero or more of '(Join_Control or Variation_Selector) followed by Emoji'" (here you want Emoji instead of Emoji_Presentation because that's where numbers are allowed).
However, for the purpose of allowing emojis in identifiers (as opposed to a lexer rule that matches emojis and nothing else), you don't actually have to worry about whether a number is part of an emoji or not, only that it doesn't appear as the first character of the identifier. So you could simply define your fragment for the starting character to include only Emoji_Presentation, and the fragment for continuing characters to include Emoji as well as Join_Control and Variation_Selector.
So something like this would work:
fragment IdStart
    : [_\p{Alpha}\p{General_Category=Other_Letter}\p{Emoji_Presentation}]
    ;

fragment IdContinue
    : IdStart
    // The `\p{Number}` might be redundant, I'm not sure. I don't know
    // whether there are any (non-ascii) numeric codepoints that don't
    // also have the `Emoji` property.
    | [\p{Number}\p{Emoji}\p{Join_Control}\p{Variation_Selector}]
    ;

Identifier: IdStart IdContinue*;
Of course that's assuming you actually want to allow characters besides emojis. The definition in your question only included emojis (or was meant to anyway), but since it was called Identifier, I'm assuming you just removed the other allowed categories to simplify it.
Looking at the code that seems to define emoji code points:
UnicodeSet emojiRKUnicodeSet = new UnicodeSet("[\\p{GCB=Regional_Indicator}\\*#0-9\\u00a9\\u00ae\\u2122\\u3030\\u303d]");
it looks to be including digits (as to why, check out sepp2k's excellent explanation above). You can always raise an issue if you think something is wrong.
You could also just use a character class like this instead:
Identifier
    : [\u00a9\u00ae\u2000-\u3300\ud83c\udc00-\udfff\ud83d\udc00-\udfff\ud83e\udc00-\udfff]+
    ;

Nearley Tokenizers vs Rules

I'm pretty new to nearley.js, and I would like to know what tokenizers/lexers do compared to rules. According to the website:
By default, nearley splits the input into a stream of characters. This is called scannerless parsing.
A tokenizer splits the input into a stream of larger units called tokens. This happens in a separate stage before parsing. For example, a tokenizer might convert 512 + 10 into ["512", "+", "10"]: notice how it removed the whitespace, and combined multi-digit numbers into a single number.
Wouldn't that be the same as:
Math -> Number _ "+" _ Number
Number -> [0-9]:+
I don't see what the purpose of lexers is. I see that rules are always usable in this case and there is no need for lexers.
After fiddling around with them, I found a use for tokenizers. Say we had the following:
Keyword -> "if"|"else"
Identifier -> [a-zA-Z_]+
This won't work: if we try compiling it, we get an ambiguous grammar, since "if" will be matched as both a Keyword and an Identifier. A tokenizer, however:
{
    "keyword": /if|else/,
    "identifier": /[a-zA-Z_]+/
}
Trying to compile this will not result in an ambiguous grammar, because tokenizers are smart (at least the one shown in this example, which is moo).
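For illustration, here is the usual trick sketched in Java: match the longest word first, then check whether it happens to be a keyword (moo's keywords option works along the same lines; the class name is mine):

import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyLexer {
    static final Set<String> KEYWORDS = Set.of("if", "else");
    static final Pattern WORD = Pattern.compile("[a-zA-Z_]+");

    public static void main(String[] args) {
        // "iffy" stays an identifier because the whole word is matched
        // before the keyword check is made.
        Matcher m = WORD.matcher("if iffy else elsewhere");
        while (m.find()) {
            String kind = KEYWORDS.contains(m.group()) ? "keyword" : "identifier";
            System.out.println(kind + ": " + m.group());
        }
    }
}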

Regex matching sequence of characters

I have a test string such as: The Sun and the Moon together, forever
I want to be able to type a few characters or words and be able to match this string if the characters appear in the correct sequence together, even if there are missing words. For example, the following search word(s) should all match against this string:
The Moon
Sun tog
Tsmoon
The get ever
What regex pattern should I be using for this? I should add that the supplied test strings are going to be dynamic within an app, and so I'd like to be able to use a pattern based on the search string.
Your example Tsmoon shows partial words (T), ignored case (s, m) and anything allowed between each entered character. So as a first attempt you can:
Set the ignore-case option
Between each input character, insert the regular expression that matches zero or more of anything; you can choose whether to match the shortest or longest run. (See the sketch below.)
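For illustration, here's that construction in Java regex terms (the generated pattern string works the same way with NSRegularExpression; the class and method names are mine):

import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SubsequenceMatch {
    // Build a pattern that matches the search characters in order,
    // allowing anything (shortest run) in between.
    static Pattern build(String search) {
        String body = search.chars()
                .mapToObj(c -> Pattern.quote(String.valueOf((char) c)))
                .collect(Collectors.joining(".*?"));
        return Pattern.compile(body, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    public static void main(String[] args) {
        String text = "The Sun and the Moon together, forever";
        for (String s : new String[] {"The Moon", "Sun tog", "Tsmoon", "The get ever"}) {
            System.out.println(s + " -> " + build(s).matcher(text).find()); // all true
        }
    }
}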
Try that, reading the documentation for NSRegularExpression if you're stuck, and see how it goes. If you get stuck, ask a new question showing your code and the RE you constructed, and explain what happens or doesn't work as expected.
HTH

User input text translation

I'm working on a translator that will take English language text (as user input into a UITextView) and (with a button press) replace specific words with alternatives. I have both the English words in scope plus their alternatives in separate Arrays (englishArray and alternativeArray), indexed correspondingly.
My challenge is finding an algorithm that will allow me to identify a word in the input text (a UITextView) while ignoring characters like <",.()>, look the word up in englishArray (case-insensitively), locate the corresponding word in alternativeArray, and then use that word in place of the original, writing it back to the UITextView.
Any help greatly appreciated.
NB. I have created a Category extending the NSArray functionality with an indexOfCaseInsensitiveString method that ignores case when doing an indexOfObject-style lookup, if that helps.
Tony.
I think using an NSScanner would be best to parse the string into separate words, which you could then pass to your indexOfCaseInsensitiveString method. scanCharactersFromSet:intoString: with a set of all the characters you want to ignore (including whitespace and newline characters) should get you to the start of a word, and then you can use scanUpToCharactersFromSet:intoString: with the same set to scan to the end of the word. Reading scanLocation at the beginning and end of each scan gives you the range of that word, so if you find a match in your array, you will know where in your string to make the replacement.
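The same scan-and-replace idea, sketched in Java for illustration (the linear case-insensitive lookup stands in for your indexOfCaseInsensitiveString; all names and data are mine):

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordSwap {
    public static void main(String[] args) {
        // Illustrative data; the original uses englishArray/alternativeArray.
        List<String> english = Arrays.asList("hello", "world");
        List<String> alternative = Arrays.asList("bonjour", "monde");

        String input = "Hello, world (hello)!";
        // Scan word by word, leaving punctuation and spaces in place.
        Matcher m = Pattern.compile("[A-Za-z']+").matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int i = -1;
            for (int j = 0; j < english.size(); j++) {
                if (english.get(j).equalsIgnoreCase(m.group())) { i = j; break; }
            }
            m.appendReplacement(out,
                    Matcher.quoteReplacement(i >= 0 ? alternative.get(i) : m.group()));
        }
        m.appendTail(out);
        System.out.println(out); // bonjour, monde (bonjour)!
    }
}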
Thanks for your suggestion. It's working with one exception.
I want to capture all punctuation so I can recreate the original input, but with the substituted words. Even though I have a space in my character set, the scanner is not putting the spaces into the intoString. Other characters I specify in the character set, such as '(' and ';', are represented in the intoString.
The net result is that when I recreate the input, it's perfect except that individual words run into each other.
UPDATE: I fixed that issue (NSScanner skips whitespace by default) by including:
[theScanner setCharactersToBeSkipped:nil];
Thanks again.

Comparison of Lucene Analyzers

Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception and I understand that I can avoid this by using a KeywordAnalyzer, but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.
In general, any analyzer in Lucene is tokenizer + stemmer + stop-words filter.
Tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and takes the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) use spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers, see their Javadocs.
Stemmers are used to get the base form of a word. It depends heavily on the language used. For example, for the previous English phrase something like ["i", "be", "veri", "happi"] will be produced, and for the French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use an analyzer for one language to stem text in another, the rules of the wrong language will be applied and the stemmer may produce incorrect results. The whole system doesn't fail, but search results may then be less accurate.
KeywordAnalyzer doesn't use any stemmer; it passes the whole field through unmodified. So, if you are going to search for words in English text, it isn't a good idea to use this analyzer.
Stop words are the most frequent and almost useless words. Again, it depends heavily on the language. For English these words are "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower noise in search results, so our phrase "I'm very happy", run through the full chain of a stemming analyzer, will finally be transformed to the list ["veri", "happi"].
And KeywordAnalyzer again does nothing. So KeywordAnalyzer is used for things like IDs or phone numbers, but not for usual text.
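A quick way to see the differences side by side (a sketch; EnglishAnalyzer stands in for a stemming analyzer, and the imports assume a recent Lucene):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerCompare {
    static void show(String name, Analyzer a, String text) throws IOException {
        System.out.print(name + ": ");
        try (TokenStream ts = a.tokenStream("f", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) System.out.print("[" + term + "] ");
            ts.end();
        }
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        String text = "I am very happy";
        show("standard", new StandardAnalyzer(), text); // splits + lowercases
        show("english", new EnglishAnalyzer(), text);   // also stems + drops stop words
        show("keyword", new KeywordAnalyzer(), text);   // one token, untouched
    }
}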
As for your maxClauseCount exception, I believe you get it while searching. In that case it is most probably caused by a too-complex search query. Try splitting it into several queries, or use lower-level functions.
From my own experience: I have used StandardAnalyzer and SmartChineseAnalyzer, since I have to search text in Chinese. Obviously, SmartChineseAnalyzer is better at handling Chinese. Different purposes call for choosing the most appropriate analyzer.