I'm pretty new to nearley.js, and I would like to know what tokenizers/lexers do compared to rules. According to the website:
By default, nearley splits the input into a stream of characters. This is called scannerless parsing.
A tokenizer splits the input into a stream of larger units called tokens. This happens in a separate stage before parsing. For example, a tokenizer might convert 512 + 10 into ["512", "+", "10"]: notice how it removed the whitespace, and combined multi-digit numbers into a single number.
Wouldn't that be the same as:
Math -> Number _ "+" _ Number
Number -> [0-9]:+
I don't see what the purpose of lexers is. It seems that rules are always usable in this case and there is no need for lexers.
After fiddling around with them, I found out the use of tokenizers. Say we had the following:
Keyword -> "if" | "else"
Identifier -> [a-zA-Z_]:+
This won't work: if we try compiling this, we get an ambiguous grammar, because "if" will be matched as both a Keyword and an Identifier. A tokenizer, however:
{
    "keyword": /if|else/,
    "identifier": /[a-zA-Z_]+/
}
Trying to compile this will not result in an ambiguous grammar, because tokenizers are smart (at least the one shown in this example, which is Moo): the lexer runs in a separate stage before the parser and assigns each piece of input exactly one token type, so the parser never has to choose between Keyword and Identifier.
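Just to illustrate the idea outside of nearley (this is not Moo's implementation, and all the names below are mine), the keyword/identifier trick is easy to sketch in plain Java: match the longest identifier-shaped run first, then reclassify it as a keyword only if it is an exact match.

import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyLexer {
    // The lexer owns the keyword list; the parser only ever sees token types.
    private static final Set<String> KEYWORDS = Set.of("if", "else");
    private static final Pattern WORD = Pattern.compile("[a-zA-Z_]+");

    static String classify(String lexeme) {
        // "if" becomes a keyword token; "iffy" stays an identifier token.
        return KEYWORDS.contains(lexeme) ? "keyword" : "identifier";
    }

    public static void main(String[] args) {
        Matcher m = WORD.matcher("if iffy else x");
        while (m.find()) {
            System.out.println(classify(m.group()) + ": " + m.group());
        }
    }
}

Moo offers a keywords helper for exactly this, so in practice you rarely write it by hand.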
Related
I'm indexing documents which contain normal text, programming code and other non-linguistic fragments. For reasons which aren't particularly relevant I am trying to tokenise the content into lowercased strings of normal language, and single character symbols.
Thus the input
a few words. Cost*count
should generate the tokens
[a] [few] [words] [.] [cost] [*] [count]
So far, so straightforward. But I want to handle "compound" words too, because the content can include words like order_date and object-oriented and class.method as well.
I'm following the principle that any of [-], [_] and [.] should be treated as a compound word conjunction rather than a symbol IF they are between two word characters, and should be treated as a separate symbol character if they are adjacent to a space, another symbol character, or the beginning or end of a string. I can handle all of this with a PatternTokenizer, like so:
public static final String tokenRgx = "(([A-Za-z0-9]+[-_.])*[A-Za-z0-9]+)|[^A-Za-z0-9\\s]{1}";

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRgx), 0);
    TokenStream result = new LowerCaseFilter(src);
    return new TokenStreamComponents(src, result);
}
This successfully distinguishes between full stops at the end of sentences and full stops in compounds, between hyphens introducing negative numbers and hyphenated words, etc. So in the above analyzer, the input:
a few words. class.simple_method_name. dd-mm-yyyy.
produces the tokens
[a] [few] [words] [.] [class.simple_method_name] [.] [dd-mm-yyyy] [.]
We're almost there, but not quite. Finally I want to split the compound terms into their parts RETAINING the trailing hyphen/underscore/stop character in each part. So I think I need to introduce another filter step to my analyzer so that the final set of tokens I end up with is this:
[a] [few] [words] [.] [class.] [simple_] [method_] [name] [.] [dd-] [mm-] [yyyy] [.]
And this is the piece that I am having trouble with. I presume that some kind of PatternCaptureGroupTokenFilter is required here but I haven't been able to find the right set of expressions to get the exact tokens I want emerging from the analyzer.
I know it must be possible, but I seem to have walked into a regular expression wall that blocks me. I need a flash of insight or a hint, if anyone can offer me one.
Thanks,
T
Edit:
Thanks to @rici for pointing me towards the solution.
The string which works (including support for decimal numbers) is:
String tokenRegex = "-?[0-9]+\\.[0-9]+|[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]";
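To double-check the pattern outside of Lucene, here is a quick stand-alone sketch (just a plain java.util.regex loop; the class name is mine and the LowerCaseFilter step is omitted) showing the tokens it emits:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRegexDemo {
    public static void main(String[] args) {
        String tokenRegex = "-?[0-9]+\\.[0-9]+|[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]";
        String input = "a few words. class.simple_method_name. dd-mm-yyyy. -3.14159";
        Matcher m = Pattern.compile(tokenRegex).matcher(input);
        while (m.find()) {
            System.out.print("[" + m.group() + "] ");
        }
        // Prints: [a] [few] [words] [.] [class.] [simple_] [method_] [name] [.] [dd-] [mm-] [yyyy] [.] [-3.14159]
    }
}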
Seems to me like it would be easier to do the whole thing in one scan, using a regex like:
[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]
That uses a zero-width forward assertion in order to only add [-._] to the preceding word if it is immediately followed by a letter or digit. (Because (?=…) is an assertion, it doesn't include the following character in the match.)
To my mind, that won't correctly handle decimal numbers; -3.14159 will be three tokens rather than a single number token. But it depends on your precise needs.
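For what it's worth, that caveat is easy to see with a plain java.util.regex scan (the class name below is just for the example):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadCaveat {
    public static void main(String[] args) {
        String rgx = "[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]";
        Matcher m = Pattern.compile(rgx).matcher("-3.14159");
        while (m.find()) {
            System.out.print("[" + m.group() + "] ");
        }
        // Prints: [-] [3.] [14159] -- three tokens rather than one number.
    }
}

Prepending a decimal-number alternative (as the questioner's edit above does) restores the single token.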
My problem is that I have an identical pattern of characters that I wish to parse differently given their context. In one part of the file I need to parse a version number of the form "#.#". Obviously, this is identical to a floating point number. The lexer always opts to return a floating point number. I think I can switch the order of the rules to give the version number precedence(?), but that doesn't do me any good when I need to later parse floating point numbers.
I suppose I could forget about asking the parser to return each piece of the version separately and split the floating point number into pieces, but I had hoped to have it done for me.
There is actually a bit more to the context of the version #. Its complete form is "ABC #.# XYZ" where "ABC" and "XYZ" never change. I have tried a couple of things to leverage the context of the version #, but haven't made it work.
Is there a way to provide some context to the lexer to parse the components of the version? Am I stuck with receiving a floating point number and parsing it myself?
You have a few possibilities.
The simplest one is to do the string to number conversion in the parser instead of the scanner. That requires making a copy of the number as a string, but the overhead should not be too high: malloc of short strings is well-optimized on almost all platforms. And the benefit is that the code is extremely straightforward and robust:
Parser
%union {
    char*  string;
    double number;
    // other types, including version
}
%token <string> DOTTED
%token <number> NUMBER
%type <number> number
%type <version> version
%%
number : NUMBER
       | DOTTED  { $$ = atof($1); free($1); }
       ;
version: DOTTED  { $$ = make_version($1); free($1); }
       ;
Scanner
[[:digit:]]+\.[[:digit:]]+       { yylval.string = strdup(yytext); return DOTTED; }
[[:digit:]]+\.?|\.[[:digit:]]+   { yylval.number = atof(yytext); return NUMBER; }
The above assumes that version numbers are always single-dotted, as in the OP. In applications where version numbers could have multiple dots or non-numeric characters, you would end up with three possible token types: unambiguous numbers, unambiguous version strings, and single-dotted numeric strings which could be either. Aside from adding the VERSION token type and the pattern for unambiguous version strings to the scanner, the only change is to add | VERSION to the version production in the parser.
Another possibility, if you can easily figure out in the scanner whether a number or a version is required, is to use a start condition. You can also change the condition from the parser but it's more complicated and you need to understand the parsing algorithm in order to ensure that you are communicating the state changes correctly.
Finally, you could do both conversions in the scanner, and pick the correct one when you reduce the token in the parser. If the version is a small and simple data structure, this might well turn out to be optimal.
I am studying YACC and the concept of a terminal symbol vs a token keeps coming up. Could someone explain to me what the difference is or point me to an article or tutorial that might help?
They are really two names for the same thing, but usually "terminal" is used to describe what the parser is working with, while "token" is used to describe the corresponding sequence of symbols in the source.
In a parser generator like yacc, the grammar of the language is defined in terms of an "alphabet" of "terminals". The word "alphabet" is a little confusing because its elements are strings, not letters. But from the parser's perspective, every terminal is an indivisible unit, indistinguishable from any other use of the same kind of terminal. So the source code:
total = 17 + subtotal;
will be presented to the parser as something like:
ID EQUALS NUMBER PLUS ID SEMICOLON
There is a correspondence between the stream of terminals which the parser sees and substrings of the input language. So we say that the "token" total is an instance of the "terminal" ID. There may be an unlimited number of potential tokens corresponding to a given terminal (or there may be just one, as with the terminal EQUALS), but what the parser actually works with is a smallish finite set of terminals.
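One way to picture the distinction in code (purely illustrative; yacc itself represents terminals as small integer codes, and the names here are mine):

// A small, fixed set of terminal kinds...
enum Terminal { ID, EQUALS, NUMBER, PLUS, SEMICOLON }

// ...and an unbounded set of tokens, each pairing a terminal with the
// substring of the source it came from.
record Token(Terminal kind, String lexeme) {}

class TerminalVsToken {
    public static void main(String[] args) {
        Token t1 = new Token(Terminal.ID, "total");     // a token of terminal ID
        Token t2 = new Token(Terminal.ID, "subtotal");  // a different token, same terminal
        Token t3 = new Token(Terminal.EQUALS, "=");     // only one possible lexeme here
        System.out.println(t1 + "\n" + t2 + "\n" + t3);
    }
}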
I use JAVACC to parse strings defined by a BNF grammar with initial non-terminal G.
I would like to catch errors thrown by TokenMgrError.
In particular, I want to handle the following two cases:
If some prefix of the input satisfies G, but not all of the symbols are read from the input, consider this case as normal and return AST for found prefix by a call to G().
If the input has no prefix satisfying G, return null from G().
Currently I'm getting TokenMgrErrors in both of these cases instead.
I started to modify the generated files (i.e., to change Error to Exception and add appropriate try/catch/throws statements), but I found it to be tedious. In addition, there is no way to make JAVACC generate the modified files automatically. Is there a smarter way to accomplish this?
You can always eliminate all TokenMgrErrors by including
<*> TOKEN : { <UNEXPECTED: ~[] > }
as the final rule. This pushes all your issues up to the grammar level, where you can generally deal with them more easily.
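Once lexical surprises can no longer throw TokenMgrError, they surface as ordinary ParseExceptions from the generated parser, which you can handle in plain Java. A rough sketch, assuming your generated parser class is called MyParser and your entry production is G() returning some AstNode type (all three names are placeholders):

// With the catch-all <UNEXPECTED> token in place, bad input produces a
// checked ParseException instead of a TokenMgrError.
AstNode parsePrefix(String input) {
    MyParser parser = new MyParser(new java.io.StringReader(input));
    try {
        return parser.G();   // the AST for whatever prefix satisfied G
    } catch (ParseException e) {
        return null;         // no prefix of the input satisfied G
    }
}

As long as G() does not require <EOF> at its end, any trailing input is simply left unread, which covers your first case.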
Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception and I understand that I can avoid this by using a KeywordAnalyzer but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.
In general, any analyzer in Lucene is tokenizer + stemmer + stop-words filter.
A tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and takes the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) use spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers, see their Javadocs.
Stemmers are used to get the base form of a word. This heavily depends on the language used. For example, for the previous phrase in English there will be something like ["i", "be", "veri", "happi"] produced, and for the French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use the analyzer of one language to stem text in another, rules from the wrong language will be applied and the stemmer may produce incorrect results. The whole system doesn't fail, but search results may then be less accurate.
KeywordAnalyzer doesn't use any stemmer; it passes the whole field through unmodified. So, if you are going to search for individual words in English text, it isn't a good idea to use this analyzer.
Stop words are the most frequent and almost useless words. Again, this heavily depends on the language. For English these words are "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower noise in search results, so our phrase "I'm very happy", run through an analyzer that stems and removes stop words, will finally be transformed into the list ["veri", "happi"].
And KeywordAnalyzer again does nothing here. So KeywordAnalyzer is used for things like IDs or phone numbers, not for ordinary text.
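If you want to see exactly what a given analyzer does to your text, you can dump its token stream. A minimal sketch (the package names match recent Lucene versions and may differ slightly in yours):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowTokens {
    static void dump(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            ts.end();
            System.out.println();
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "I am very happy";
        dump(new StandardAnalyzer(), text);  // word-by-word, lowercased tokens
        dump(new KeywordAnalyzer(), text);   // the whole field as a single token
    }
}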
As for your maxClauseCount exception, I believe you are getting it while searching. In that case it is most probably caused by an overly complex search query. Try to split it into several queries, or use more low-level functions.
From my own experience: I have used StandardAnalyzer and SmartCnAnalyzer, as I have to search text in Chinese. Obviously, SmartCnAnalyzer is better at handling Chinese. For different purposes, you have to choose the most appropriate analyzer.