Using $ as delimiter in StringTemplate from ANTLR rewriter grammars

I'm trying to write an ANTLR3 grammar that generates HTML output using StringTemplate. To avoid having to escape all the HTML tags in the template rules (e.g. \<p\><variable>\</p\>), I'd prefer to use dollar as the delimiter for StringTemplate (e.g. <p>$variable$</p>).
While the latter seems to be the default when StringTemplate is used on its own, the parser code generated by ANTLR always uses AngleBracketTemplateLexer when initializing StringTemplate.
How can I get ANTLR to generate code using DefaultTemplateLexer (i.e. the variant that uses dollar as the delimiter)?

Try passing DefaultTemplateLexer.class when constructing the StringTemplateGroup, like this:
StringTemplateGroup group = new StringTemplateGroup(reader,
        DefaultTemplateLexer.class);
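For context, a minimal end-to-end sketch (not from the answer above), assuming an ANTLR 3 grammar compiled with output=template; MyLexer, MyParser, "html.stg" and "input.txt" are illustrative names for your own generated classes, group file and input:

import java.io.FileReader;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.stringtemplate.StringTemplateGroup;
import org.antlr.stringtemplate.language.DefaultTemplateLexer;

public class TemplateSetup {
    public static void main(String[] args) throws Exception {
        // Passing DefaultTemplateLexer.class makes the group read $...$ delimiters.
        StringTemplateGroup group = new StringTemplateGroup(
                new FileReader("html.stg"), DefaultTemplateLexer.class);
        MyLexer lexer = new MyLexer(new ANTLRFileStream("input.txt"));
        MyParser parser = new MyParser(new CommonTokenStream(lexer));
        parser.setTemplateLib(group);  // generated rules now build $...$ templates
        // then invoke your start rule, e.g. parser.document();
    }
}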

Related

Is there a way to split chars with ANTLR?

I'm trying to write an ANTLR translator from Markdown to HTML, and I ran into this problem when trying to recognize bold formatting. This is my ANTLR rule:
TxtNegrita : ('**' | '__') .*? ('**' | '__') {System.out.println("<span class=\"bold\">" + getText() + "</span>");};
Unfortunately, getText() returns the entire matched string, including the ** at the beginning and the end. Is there a way to delete those characters using ANTLR? (In plain Java it is obviously possible.)
Thanks!
You’ve created a Lexer rule which results in a single token. That is the expected behavior.
That rule looks more like something I’d expect in a parser rule.
(Lexer rules begin with uppercase characters, conventionally all uppercase to make them stand out; parser rules begin with lowercase letters and result in parse trees where each node has a context that gives you access to the component parts of your parser rule.)
In ANTLR it is quite important to understand the difference between Lexer rules and parser rules.
Put simply... your input stream of characters is converted to an input stream of tokens using Lexer rules, and that stream of tokens is processed by parser rules.
Tokens are pretty much the “atoms” that parser rules deal with, and their values are simply the string of characters that matched the lexer rule.
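As a side note on the “in Java it is perfectly possible” part of the question, here is a hedged sketch (my own illustration, not the answerer's code) of stripping the two-character delimiters from the matched text, assuming the token always starts and ends with "**" or "__":

// raw is the full token text, e.g. "**bold text**"
static String boldToHtml(String raw) {
    String inner = raw.substring(2, raw.length() - 2);  // drop leading/trailing delimiters
    return "<span class=\"bold\">" + inner + "</span>";
}

// inside the lexer action this could then be called as:
// {System.out.println(boldToHtml(getText()));}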

Is there a parser tag for comments in K?

Is there any built-in tag for block, line or in-line comments for the parser generator?
e.g.
comment blocks "(*" Exp "*)" or inline comments "//" Exp.
In a parser generator like menhir, I would normally handle comments by pattern matching with the lexer, so comments wouldn't be part of the AST. Is there an equivalent in K?
If not, what is the recommended way of implementing comments?
You can declare the builtin sort #Layout to be an alternation (separated by pipes) of regular-expression terminals (e.g., r"//[^\\n]*"). Any tokens that lex as one of these terminals are simply discarded by the lexer, and the parser does not even see them. Note that this applies only to parsing terms using a generated parser or kast; parsing rules in .k files will still require the usual K syntax for comments.
Note that this is also how whitespace is parsed, so unless your language is whitespace sensitive, make sure to include in #Layout any whitespace characters which you want the parser to ignore.

awk pattern to match an XML PI at the start of a line

I have an XML document containing a number of XML Processing Instructions which are of the form:
<?cpdoc something?>
I am trying to match them in awk with the pattern
/^\<\?cpdoc/
but it's not returning anything. If I remove the ^ anchor, it works (but I have other, similar PIs that don't start a line, and I don't want those matched).
It looks as if it's being confused by the \<\?, but why is it ignoring the line-start anchor?
Don't parse XML with regex; use a proper XML/HTML parser.
Theory:
According to compiler theory, XML can't be parsed using regexes, which are based on finite state machines. Because of XML's hierarchical structure, you need a pushdown automaton and have to manipulate an LALR grammar using a tool like yacc.
realLife©®™ everyday tools in a shell:
You can use one of the following :
xmllint
xmlstarlet
saxon-lint (my own project)
Check: Using regular expressions with HTML tags
Example using XPath:
xmllint --xpath '//processing-instruction()' file.xml
Solution by OP and explanation by Ed Morton.
It works if the less-than is not escaped, as otherwise it's a word boundary. So instead of:
\<\?
I should use literal:
<\?
This is because we can't just escape any character and hope for the best; we have to know which characters are metacharacters and then escape them only when we want them treated as literals.

Modifying the ANTLR v4 auto-generated lexer?

So I am writing a small language and I am using ANTLR v4 as my tool. ANTLR auto-generates lexer and parser files when you compile your grammar file (.g4); I am compiling with javac, by the way. I want my language to have no semicolons, and the way I want to do this is: if there is an identifier or ")" as the last token in a line, the lexer will automatically insert the semicolon (similar to what the Go language does). How would I approach something like this? There are also other things like the ATN (which I think is an augmented transition network) and the DFA (which I think is a deterministic finite automaton) in the lexer file, and I don't understand them or how they relate to the lexing process. Any help is appreciated. (By the way, I am still working on the grammar file, so I don't have it fully completed.)
Several points here: the ATN and the DFA are internal structures of the parser and lexer, and not something you would touch to change parsing behavior. Also, it's not clear to me why you want the lexer to insert a semicolon at some point. What exactly do you want to accomplish by that? (Don't say "to make semicolons optional in the parser"; I mean the underlying reason.)
If you want to accept a command without a trailing semicolon you can make that optional:
assignment: simpleAssignment | complexAssignment SEMI?;
The parser will give you the content of the assignment rule regardless whether there is a trailing semicolon or not. Is that what you want?
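If you do decide to mimic Go-style automatic semicolon insertion instead, one hedged sketch of the idea (my own illustration, not the answerer's suggestion; MyLexer and the token names SEMI, ID, RPAREN and NEWLINE are assumptions about your grammar, and NEWLINE must not be skipped by the lexer) is to subclass the generated lexer and override nextToken():

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.Token;

public class SemicolonInsertingLexer extends MyLexer {
    private Token lastToken;

    public SemicolonInsertingLexer(CharStream input) { super(input); }

    @Override
    public Token nextToken() {
        Token next = super.nextToken();
        // When a line ends after an identifier or ")", hand the parser a synthetic
        // SEMI token in place of the newline.
        if (next.getType() == NEWLINE && lastToken != null
                && (lastToken.getType() == ID || lastToken.getType() == RPAREN)) {
            CommonToken semi = new CommonToken(next);
            semi.setType(SEMI);
            semi.setText(";");
            semi.setChannel(Token.DEFAULT_CHANNEL);
            next = semi;
        }
        if (next.getChannel() == Token.DEFAULT_CHANNEL) {
            lastToken = next;
        }
        return next;
    }
}

Note that in this sketch any NEWLINE that does not trigger insertion is returned unchanged, so remaining NEWLINE tokens need to be on a hidden channel or otherwise tolerated by the parser.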

Parsing a G4 file to generate doc / schema

I realize this question is a bit meta, but I essentially want to parse an ANTLR4 grammar (an actual .g4 file) to then generate documentation and other artifacts based on the grammar (not an instance of the grammar).
For example, consider the example Java grammar that contains this rule:
compilationUnit
: packageDeclaration? importDeclaration* typeDeclaration* EOF
;
I want to be able to parse the Java.g4 file and produce documentation that says "A compilationUnit contains an optional packageDeclaration, 0 or more importDeclarations, and 0 or more typeDeclarations". Or perhaps I want to produce an XSD with a data type called "compilationUnit" that contains "packageDeclaration", "importDeclaration", and "typeDeclaration" elements (with proper cardinality set).
What is the best way of accomplishing something like this? Is it to create a target (even though the goal isn't to create lexers/parsers), or is it to use the example antlr4 grammar to parse the g4 file, or is it something else?
Thanks!
This would be a very typical use of ANTLR, and convenient given the existing ANTLR 4 grammar.
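For what it's worth, a rough sketch of the second option (mine, not part of the answer), assuming you generate a parser from the ANTLRv4Lexer/ANTLRv4Parser grammars in the grammars-v4 repository; the generated class names and the grammarSpec start rule come from that grammar and may differ in your setup:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class G4DocSketch {
    public static void main(String[] args) throws Exception {
        ANTLRv4Lexer lexer = new ANTLRv4Lexer(CharStreams.fromFileName("Java.g4"));
        ANTLRv4Parser parser = new ANTLRv4Parser(new CommonTokenStream(lexer));
        ParseTree tree = parser.grammarSpec();  // parse tree of the grammar file itself
        // From here, walk the tree (e.g. with the generated base listener) and emit a
        // documentation entry or XSD type for every parserRuleSpec node, reading the
        // ?, * and + suffixes to derive the cardinalities.
    }
}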