I'm starting to study ANTLR.
The aim is to 'translate' Strings into SQL statements.
One simple example of what I want to do:
If I receive the String "name = A and age = B" --- ANTLR ---> "select * from USERS where name = 'A' and age = 'B'"
I've been reading about ANTLR and following some examples, but those just convert the input stream of characters (the source file) into an AST. How can I use ANTLR to translate the input message and then use the translated output?
Can you give me some pointers, or tell me where I can find some information about that?
I'm using the Eclipse IDE and Maven ANTLR Plugin.
ANTLR is just a parser generator. You can insert actions into the grammar that collect information or directly print output. The most common mechanism is to let ANTLR create an intermediate representation in the form of an AST or, with ANTLR 4, a parse tree. From there, you build a tree walker to either build an internal model or directly generate output. From the internal model, which represents constructs in your output language, you can then generate the output. I typically use StringTemplate for generating structured text.
When the input and output are very similar and, more importantly, the order of output is very similar, you can get away with syntax directed translation: i.e. actions directly in the grammar or actions applied directly to a parse tree.
When the order of output is very different, you have to build some form of intermediate representation. Imagine simply reading in a bunch of integers and printing them back out in reverse order. You can't do that by simply printing out the numbers as you see them; you have to store them all before you can emit the first one. This is all explained in my [shameless plug] book, Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages http://amzn.com/B00A376HGG
I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text which I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to make these spans into single tokens. I'd like the remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression for the token text, then the best way to do this is a custom pipeline component that you add before the tagger that sets tags for certain words.
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
The parser does use the token.norm feature, so if you set the NORM feature in the retokenizer to something extremely PROPN-like, you might have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, use a word you think would be a frequent similar proper noun in American newspaper text from 20 years ago, so if the skipped token should be like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it could help.
If you're using a model with vectors like en_core_web_lg, you can also set the vectors for the skipped token to be the same as a similar word (check that the similar word has a vector first). This is how to tell the model to refer to the same row in the vector table for UNKNOWN_SKIPPED as Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this can be tricky to do well without affecting the accuracy of the parser for other cases.
The NORM and vector modifications will also influence the tagger, so it's possible if you choose those well, you might get pretty close to the results you want.
I am trying create a grammar for a format that follows a type-length-value convention. Can ANTLR4 read in a length value and then parse that many characters?
NO ...
From your question (which is very short so I could miss something ...) I gather you are mixing grammar and encoding rules.
When you say type-length-value, it sounds like an encoding rule to me (how to serialize data). In my experience, you write this code yourself.
A grammar is at a higher level: it's a piece of text that describes something. ANTLR will help you break this text into tokens and then into a tree that you can navigate.
This step only handles text: if you were going that way to solve your problem, you would still have to handle type, length and value yourself.
EDIT:
with a bit of googling I found this https://github.com/NickstaDB/SerializationDumper
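As an illustration of what "write this code yourself" can look like, here is a minimal Python sketch of a type-length-value decoder. The layout it assumes (a 1-byte type, a 2-byte big-endian length, then that many value bytes) is a guess for the sake of the example, since the question does not describe the actual format, and decode_tlv is a hypothetical name.

```python
import struct


def decode_tlv(data):
    """Decode records laid out as: 1-byte type, 2-byte big-endian length, value.

    The field widths are an assumption; adjust the struct format to the real one.
    """
    records = []
    offset = 0
    while offset < len(data):
        if offset + 3 > len(data):
            raise ValueError(f"truncated header at offset {offset}")
        rec_type, length = struct.unpack_from(">BH", data, offset)
        offset += 3
        if offset + length > len(data):
            raise ValueError(f"truncated value at offset {offset}")
        # The length read from the header decides how many bytes to consume.
        records.append((rec_type, data[offset:offset + length]))
        offset += length
    return records


sample = bytes([1, 0, 2]) + b"hi" + bytes([2, 0, 0])
print(decode_tlv(sample))
```

If a value itself contains text in some language, that value is what you would then hand to an ANTLR-generated parser.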
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard testcases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (eg. recursive descent, packrat, PEG and probably various others... Coco/R is one reasonable package to do this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to this new generator) I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
but this isn't really a very extensible or maintainable solution, and like as not will have introduced other subtle issues I've not come across yet.
Your problem does not require smaller tokens to be preferred (In this case MONTH would never be matched). You need a backtracking behaviour dependent on the text being matched or not. Right?
ANTLR separates tokenization and parsing strictly. Consequently every solution to your problem will seem like a hack.
However, other parser generators are specialized for problems like yours. Packrat parsers (PEG) backtrack and allow tokenization on the fly. Try out parboiled for this purpose.
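To show the idea of matching against the text instead of committing to the longest token, here is a small Python sketch (not ANTLR and not parboiled). It tries each alternative in turn and accepts only one that consumes the whole string. The PREFIXES and ATOMS tables are a tiny hypothetical subset chosen to cover the examples in the question, and split_unit handles a single optionally prefixed unit, not full UCUM expressions.

```python
# Hypothetical subsets; the real UCUM tables are far larger.
PREFIXES = {"m": "milli"}
ATOMS = {
    "m": "metre",
    "osm": "osmole",
    "mo": "month",
    "s": "second",
    "mol": "mole",
    "l": "litre",
}


def split_unit(text):
    """Return (prefix, atom) for a simple unit, or None if nothing fits.

    An alternative is accepted only if it consumes the whole string, so a
    greedy match such as 'mo' in 'mosm' is abandoned when the rest fails.
    """
    # Alternative 1: the whole text is a bare unit atom.
    if text in ATOMS:
        return (None, text)
    # Alternative 2: a prefix followed by a unit atom.
    for prefix in PREFIXES:
        rest = text[len(prefix):]
        if text.startswith(prefix) and rest in ATOMS:
            return (prefix, rest)
    return None


for candidate in ("mosm", "mol", "mo", "ms"):
    print(candidate, split_unit(candidate))
```

A PEG or other backtracking parser generalizes this: every rule is a sequence of ordered alternatives tried against the remaining input, so no separate lexer has to decide up front where one token ends.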
It appears that the question is not being framed correctly.
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
But, according to the UCUM:
The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units.
The most to expect from the lexer is an unambiguous identification of the measurement string without regard to its semantic value. Similarly, a parser alone will be unable to validly select between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, structural (explicitly identifying the problem space) or contextual (considering the relative nature of other units in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. Antlr could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.
I'm looking for a tool that can validate whether a given text or paragraph conforms to a specific format.
For example:
I want to be able to check whether the text looks like the following:
xxx{
sss:aaa;
}
yyy();
Preferably an open-source tool, with easy rule sets written in XML or something similar.
By text I mean a string that I get from, e.g., fgets() or any other function that reads from a file.
For something like this I'd suggest a parser (see, for instance, What is Parse/parsing?). You can build one from a definition of the language that you want to parse using a parser generator like Yacc or its free GNU equivalent Bison, or any number of other parser generators, many of which are also freely available.
Most parsers are used to transform a text that complies with a grammar into some other form (e.g. an intermediate language or machine code), but that isn't necessary. In your case the parser could simply say (at a minimum) "Yes" if the text conforms to a given grammar.
Parsers for simple grammars can be built by hand but, if you have the tools available, using a parser generator is easier and more robust in my experience.
Further, the text that you've shown is similar to a portion of code written in the C language (something close to a struct declaration followed by a function call), so you would be able to re-use parts of the grammar that you need from an existing Yacc grammar for C like this one.
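To illustrate a recognizer that only answers yes or no, here is a minimal hand-written sketch in Python (not Yacc or Bison). The grammar is a guess based on the sample in the question: a text is one or more items, where an item is either name { name : name ; ... } or name ( ) ;. The names tokenize and conforms are hypothetical. The sketch first tokenizes, then reduces each token to a one-character kind and checks the kind string against a pattern, which works only because this assumed grammar has no nesting.

```python
import re

# Assumed tokens: identifiers and the punctuation seen in the sample.
TOKEN = re.compile(r"[A-Za-z_]\w*|[{}():;]")

# Assumed grammar over token kinds, where 'i' stands for an identifier:
#   text  := (block | call)+
#   block := i '{' (i ':' i ';')* '}'
#   call  := i '(' ')' ';'
GRAMMAR = re.compile(r"(?:i\{(?:i:i;)*\}|i\(\);)+")


def tokenize(text):
    """Return the token list, or None if an unknown character is found."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m is None:
            return None
        tokens.append(m.group())
        pos = m.end()
    return tokens


def conforms(text):
    """Answer yes or no: does the text match the assumed grammar?"""
    tokens = tokenize(text)
    if not tokens:
        return False
    kinds = "".join("i" if (t[0].isalpha() or t[0] == "_") else t for t in tokens)
    return GRAMMAR.fullmatch(kinds) is not None


print(conforms("xxx{\n  sss:aaa;\n}\nyyy();"))
print(conforms("xxx{ sss aaa; }"))
```

Once the format allows nested blocks, a pattern over token kinds is no longer enough, and a generated or recursive-descent parser becomes the better fit.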
I use JAVACC to parse some string defined by a bnf grammar with initial non-terminal G.
I would like to catch errors thrown by TokenMgrError.
In particular, I want to handle the following two cases:
If some prefix of the input satisfies G, but not all of the symbols are read from the input, consider this case as normal and return AST for found prefix by a call to G().
If the input has no prefix satisfying G, return null from G().
Currently I'm getting TokenMgrErrors in each of these cases instead.
I started to modify the generated files (i.e, to change Error to Exception and add appropriate try/catch/throws statements), but I found it to be tedious. In addition, automatic generation of the modified files produced by JAVACC does not work. Is there a smarter way to accomplish this?
You can always eliminate all TokenMgrErrors by including
<*> TOKEN : { <UNEXPECTED: ~[] > }
as the final rule. This pushes all your issues to the grammar level, where you can generally deal with them more easily.
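The same idea can be sketched outside JavaCC. The Python example below is an analogy, not JavaCC code: the final UNEXPECTED rule matches any single character, so the lexer never fails, and the parser decides what an unexpected token means. The toy grammar (NUMBER (PLUS NUMBER)*) stands in for G and is purely an assumption, as are the rule names other than UNEXPECTED.

```python
import re

# Ordered lexer rules. The last one is a catch-all, playing the role of
# <UNEXPECTED: ~[]>: any character no other rule accepts becomes a token.
RULES = [
    ("NUMBER", r"\d+"),
    ("PLUS", r"\+"),
    ("SPACE", r"\s+"),
    ("UNEXPECTED", r"."),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in RULES), re.DOTALL)


def tokenize(text):
    """Tokenize without ever raising: bad characters become UNEXPECTED."""
    return [
        (m.lastgroup, m.group())
        for m in LEXER.finditer(text)
        if m.lastgroup != "SPACE"
    ]


def parse_sum_prefix(text):
    """Parse the longest prefix matching NUMBER (PLUS NUMBER)*.

    Returns the list of numbers for that prefix, or None if no prefix matches.
    """
    tokens = tokenize(text)
    if not tokens or tokens[0][0] != "NUMBER":
        return None
    numbers = [int(tokens[0][1])]
    i = 1
    while (
        i + 1 < len(tokens)
        and tokens[i][0] == "PLUS"
        and tokens[i + 1][0] == "NUMBER"
    ):
        numbers.append(int(tokens[i + 1][1]))
        i += 2
    return numbers


print(parse_sum_prefix("1 + 2 @ 3"))
print(parse_sum_prefix("@ 1"))
```

Both cases from the question are then handled in the parser: a valid prefix followed by junk yields the result for the prefix, and input with no valid prefix yields None.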