I'm a complete noob with ANTLR, so apologies if this is a really basic question.
I'm trying to parse a file that has a weird JSON-like syntax. These files are huge, hundreds of MB, so I'm avoiding creating the parse tree and I'm just using grammar actions to manipulate the data into what I want.
As usual, I'm sending whitespace and newlines to the HIDDEN channel. However, there are a couple of cases where it'd be helpful if I could detect that the next character is one of those, because that delimits the property value.
Here's an excerpt from a file:
game_speed=4
mapmode=0
dyn_title=
{
title="e_dyn_188785"
nick=nick_the_just hist=yes
base_title="k_mongolia"
is_custom=yes
is_dynamic=yes
claim=
{
title=k_bulgaria
pressed=yes
weak=yes
}
claim=
{
title=c_karvuna
pressed=yes
}
claim=
{
title=c_tyrnovo
}
claim=
{
title=c_mesembria
pressed=yes
}
}
And here are the relevant parts of my grammar:
property: key ASSIGNMENT value { insertProp(stack[scopeLevel], $key.text, currentVal) };
key: (LOWERCASE | UPPERCASE | UNDERSCORE | DIGIT | DOT | bool)+;
value:
bool { currentVal = $bool.text === 'yes' }
| string { currentVal = $string.text.replace(/\"/gi, '') }
| number { currentVal = parseFloat($number.text, 10) }
| date { currentVal = $date.text }
| specific_value { currentVal = $specific_value.text }
| (numberArray { currentVal = toArray($numberArray.text) }| array)
| object
;
bool: 'yes' | 'no';
number: DASH? (DIGIT+ | (DIGIT+ '.' DIGIT+));
string:
'"'
( number
| bool
| specific_value
| NONALPLHA
| UNDERSCORE
| DOT
| OPEN_CURLY_BRACES
| CLOSE_CURLY_BRACES
)*
'"'
;
specific_value: (LOWERCASE | UPPERCASE | UNDERSCORE | DASH | bool)+ ;
WS: ([\t\r\n] | ' ') -> channel(HIDDEN);
NEWLINE: ( '\r'? '\n' | '\r')+ -> channel(HIDDEN);
So, as you can see, the input syntax can have property values that are strings but are not delimited by ". And, in fact, for some odd reason, sometimes the next property appears on the same line. Ignoring WS and NEWLINE means that the parser doesn't recognise that the specific_value rule has terminated, so it grabs part of the next key as well. See the output example below:
{
game_speed: 4,
mapmode: 0,
dyn_title:
{
title: 'e_dyn_188785',
nick: 'nick_the_just\t\t\this',
t: true,
base_title: 'k_mongolia',
is_custom: true,
is_dynamic: true,
claim: { title: 'k_bulgaria\n\t\t\t\tpresse', d: true, weak: true },
claim2: { title: 'c_karvuna\n\t\t\t\tpresse', d: true },
claim3: { title: 'c_tyrnovo' },
claim4: { title: 'c_mesembria\n\t\t\t\tpresse', d: true
}
},
What's an appropriate solution here to specify that specific_value shouldn't grab any characters once it reaches a WS or NEWLINE?
Thanks in advance! :D
I'd handle as much as possible in the lexer (like identifiers, numbers and strings). That could look like this in your case:
grammar JsonLike;
parse
: object? EOF
;
object
: '{' key_value* '}'
;
key_value
: key '=' value
;
key
: SPECIFIC_VALUE
| BOOL
// More tokens that can be a key?
;
value
: object
| array
| BOOL
| STRING
| NUMBER
| SPECIFIC_VALUE
;
array
: '[' value+ ']'
;
BOOL
: 'yes'
| 'no'
;
STRING
: '"' ( ~["\\] | '\\' ["\\] )* '"'
;
NUMBER
: '-'? [0-9]+ ( '.' [0-9]+ )?
;
SPECIFIC_VALUE
: [a-zA-Z_] [a-zA-Z_0-9]*
;
SPACES
: [ \t\r\n]+ -> channel(HIDDEN)
;
Resulting in the following parse: [parse tree image not preserved]
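To illustrate why matching identifiers in the lexer fixes the run-together values, here is a rough Python sketch. It is not ANTLR: the `tokenize` helper, the `PUNCT` token name and the regex table are assumptions made for illustration, loosely mirroring the lexer rules above. Each token is matched as a whole and whitespace is dropped between tokens, so `nick_the_just` ends at the tabs and `hist` starts a new token.

```python
import re

# Patterns loosely mirroring the lexer rules above (illustrative only).
# ANTLR picks the longest match; the lookahead on BOOL approximates that,
# so an identifier such as "no_thing" is not split into "no" + "_thing".
TOKEN_SPEC = [
    ("STRING", r'"(?:[^"\\]|\\["\\])*"'),
    ("NUMBER", r'-?[0-9]+(?:\.[0-9]+)?'),
    ("BOOL", r'(?:yes|no)(?![a-zA-Z_0-9])'),
    ("SPECIFIC_VALUE", r'[a-zA-Z_][a-zA-Z_0-9]*'),
    ("PUNCT", r'[={}\[\]]'),          # hypothetical name for the literal tokens
    ("SPACES", r'[ \t\r\n]+'),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return (token_type, text) pairs, leaving out the hidden SPACES tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SPACES":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for tok in tokenize('nick=nick_the_just\t\t\thist=yes'):
        print(tok)
```

The parser then only ever sees complete tokens, so it never has to decide where a value stops character by character.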
I'm trying to pull a text value out of a column containing JSON by casting to varchar, but I get an invalid argument error on Snowflake while running in Mode. This JSON has a slightly different structure than what I'm used to seeing.
Have tried these to pull out the text:
changes:comment:new_value::varchar
changes:new_value::varchar
changes:comment::varchar
JSON looks like this:
{
"comment":
{
"new_value": "Hello there. Welcome to our facility.",
"old_value": ""
}
}
I want to pull out the data in this column so the output reads:
Hello there. Welcome to our facility.
You can't extract fields from a VARCHAR. If your string is JSON, you have to convert it to the VARIANT type first, e.g. through the PARSE_JSON function.
Example below:
create or replace table x(v varchar) as select * from values('{
"comment":
{
"new_value": "Hello there. Welcome to our facility.",
"old_value": ""
}
}');
select v, parse_json(v):comment.new_value::varchar from x;
--------------------------------------------------------------+------------------------------------------+
V | PARSE_JSON(V):COMMENT.NEW_VALUE::VARCHAR |
--------------------------------------------------------------+------------------------------------------+
{ | Hello there. Welcome to our facility. |
"comment": | |
{ | |
"new_value": "Hello there. Welcome to our facility.", | |
"old_value": "" | |
} | |
} | |
--------------------------------------------------------------+------------------------------------------+
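The same distinction between "a string that happens to contain JSON" and "a parsed, navigable value" exists in most languages. As a loose analogy only (this is Python, not Snowflake, and the variable names are made up for illustration), field access fails on the raw string and works once it has been parsed:

```python
import json

raw = '{"comment": {"new_value": "Hello there. Welcome to our facility.", "old_value": ""}}'

# A plain string has no fields, so looking one up by key fails
# (roughly the situation of applying a path to a VARCHAR).
try:
    raw["comment"]
    string_lookup_failed = False
except TypeError:
    string_lookup_failed = True

# After parsing, the nested path can be followed
# (roughly what PARSE_JSON(v):comment.new_value does).
parsed = json.loads(raw)
new_value = parsed["comment"]["new_value"]

if __name__ == "__main__":
    print(string_lookup_failed)
    print(new_value)
```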
I wanted to know if there is any way to pretty-print JSON in Oracle (as this web site example does).
In XML I used:
SELECT XMLSERIALIZE(Document XMLTYPE(V_RESPONSE) AS CLOB INDENT SIZE = 2)
INTO V_RESPONSE
FROM DUAL;
And it works very well.
With Oracle 12c, you can use the JSON_QUERY() function with the RETURNING ... PRETTY clause :
PRETTY : Specify PRETTY to pretty-print the return character string by inserting newline characters and indenting
Expression :
JSON_QUERY(js_value, '$' RETURNING VARCHAR2(4000) PRETTY)
Demo on DB Fiddle :
with t as (select '{"a":1, "b": [{"b1":2}, {"b2": "z"}]}' js from dual)
select json_query(js, '$' returning varchar2(4000) pretty) pretty_js, js from t;
Yields :
PRETTY_JS | JS
--------------------------|----------------------------------------
{ | {"a":1, "b": [{"b1":2}, {"b2": "z"}]}
"a" : 1, |
"b" : |
[ |
{ |
"b1" : 2 |
}, |
{ |
"b2" : "z" |
} |
] |
} |
When you're lucky enough to get to Oracle Database 19c, there's another option for pretty printing: JSON_SERIALIZE.
This allows you to convert JSON between VARCHAR2/CLOB/BLOB, and it includes a PRETTY clause:
with t as (
select '{"a":1, "b": [{"b1":2}, {"b2": "z"}]}' js
from dual
)
select json_serialize (
js returning varchar2 pretty
) pretty_js,
js
from t;
PRETTY_JS | JS
--------------------------|----------------------------------------
{ | {"a":1, "b": [{"b1":2}, {"b2": "z"}]}
"a" : 1, |
"b" : |
[ |
{ |
"b1" : 2 |
}, |
{ |
"b2" : "z" |
} |
] |
} |
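For readers who want to check what PRETTY does conceptually, here is a small Python sketch (an analogy, not Oracle code; the exact layout Python produces differs from Oracle's). Pretty-printing only inserts newlines and indentation, so the pretty and compact forms parse to the same data:

```python
import json

compact = '{"a":1, "b": [{"b1":2}, {"b2": "z"}]}'

# Parse, then re-serialize with indentation. Only whitespace changes.
pretty = json.dumps(json.loads(compact), indent=2)

if __name__ == "__main__":
    print(pretty)
```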
Reposted from the #perl6 IRC channel, by jkramer, with permission
I'm playing with grammars and trying to parse an ini-style file but somehow Grammar.parse seems to loop forever and use 100% CPU. Any ideas what's wrong here?
grammar Format {
token TOP {
[
<comment>*
[
<section>
[ <line> | <comment> ]*
]*
]*
}
rule section {
'[' <identifier> <subsection>? ']'
}
rule subsection {
'"' <identifier> '"'
}
rule identifier {
<[A..Za..z]> <[A..Za..z0..9_-]>+
}
rule comment {
<[";]> .*? $$
}
rule line {
<key> '=' <value>
}
rule key {
<identifier>
}
rule value {
.*? $$
}
}
Format.parse('lol.conf'.IO.slurp)
Token TOP has the * quantifier on a subregex that can parse an empty string (because both <comment> and the group that contains <section> have a * quantifier on their own).
If the inner subregex matches the empty string, it can do so infinitely many times without advancing the cursor. Currently, Perl 6 has no protection against this kind of error.
It looks to me like you could simplify your code to
token TOP {
<comment>*
[
<section>
[ <line> | <comment> ]*
]*
}
There is no need for the outer [...]* group, because the last <comment> also matches comments that come before later sections.
I'm using ANTLRWorks 2.1 to create a grammar for ANTLR v4. The rParanthesis is not recognised for whatever reason. I have searched a lot but couldn't find a reason why. Can you find any errors?
grammar bracketsGrammar;
OPENINGBRACKET : '[';
CLOSINGBRACKET : ']';
lParanthesis : OPENINGBRACKET ;
rParanthesis : CLOSINGBRACKET ;
WS : ' ' ->skip;
WORD : ~[ "]+ ;
parenthesizedWord : lParanthesis WS+ WORD WS+ rParanthesis ;
fullfile: parenthesizedWord EOF ;
And my input is
[ Manuel ]
And the output is
(fullfile (parenthesizedWord (lParanthesis [) Manuel ]) <EOF>)
As you can see both [ and ] are part of the output but my rParanthesis is not recognised.
Thanks for your help
Manuel
You skip spaces in the lexer but are using the WS token inside parser rules: remove them.
It shouldn't be:
parenthesizedWord : lParanthesis WS+ WORD WS+ rParanthesis ;
but:
parenthesizedWord : lParanthesis WORD rParanthesis ;
instead.
And then the following parse tree will be created: [parse tree image not preserved]
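The reason the WS references can never match is that skipped tokens are discarded before the parser sees anything. A rough Python sketch of that behaviour follows (not ANTLR itself; the `lex` helper is an assumption that imitates longest-match lexing, with ties going to the rule listed first):

```python
import re

# Lexer rules in the same order as in the grammar above.
RULES = [
    ("OPENINGBRACKET", re.compile(r"\[")),
    ("CLOSINGBRACKET", re.compile(r"\]")),
    ("WS", re.compile(r" ")),
    ("WORD", re.compile(r'[^ "]+')),
]
SKIPPED = {"WS"}  # models "-> skip"


def lex(text):
    """Return the (type, text) tokens a parser would receive."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        if best_name not in SKIPPED:
            out.append((best_name, text[pos:best_end]))
        pos = best_end
    return out


if __name__ == "__main__":
    print(lex("[ Manuel ]"))
```

No WS token appears in the result, so a parser rule that requires WS+ between the bracket and the word cannot succeed.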
Still learning how to properly use ANTLR... Here's my problem.
Say you have a subset of a UML grammar and an ANTLR lexer/parser with the following rules:
// Parser Rules
model
: 'MODEL' IDENTIFIER list_dec
;
list_dec
: declaration*
;
declaration
: class_dec ';'
| association ';'
| generalization ';'
| aggregation ';'
;
class_dec
: 'CLASS' IDENTIFIER class_content
;
...
association
: 'RELATION' IDENTIFIER 'ROLES' two_roles
;
two_roles
: role ',' role
;
role
: 'CLASS' IDENTIFIER multiplicity
;
...
I would like the 'role' rule to only allow the IDENTIFIER token if it matches an existing class IDENTIFIER. In other words, if you are given an input file and you run the lexer/parser on it, then all the classes that are referenced (e.g. the IDENTIFIER in the association rule) should exist. The problem is that a class might not exist (yet) at the point where it is referenced, since it can be declared anywhere in the file. What is the best approach to this?
Thanks in advance...
This is probably best done after parsing. The parser creates some sort of tree for you, and afterwards you walk the tree and collect information about declared classes, and walk it a second time to validate the role tree/rule.
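A minimal sketch of that two-pass idea, in Python for brevity. The `declarations` list is a hypothetical, flattened stand-in for whatever a tree walk would produce, and the helper names are made up. The point is that collecting all class names first lets a role refer to a class declared later in the file.

```python
# Hypothetical result of walking the parse tree: (kind, name, referenced classes).
declarations = [
    ("association", "owns", ["Person", "Car"]),      # refers to classes declared later
    ("class", "Person", []),
    ("class", "Car", []),
    ("association", "drives", ["Person", "Truck"]),  # "Truck" is never declared
]


def collect_classes(decls):
    """First pass: gather every declared class name."""
    return {name for kind, name, _ in decls if kind == "class"}


def find_undeclared_roles(decls):
    """Second pass: report (association, class) pairs that reference unknown classes."""
    declared = collect_classes(decls)
    errors = []
    for kind, name, roles in decls:
        if kind == "association":
            for role in roles:
                if role not in declared:
                    errors.append((name, role))
    return errors


if __name__ == "__main__":
    for assoc, cls in find_undeclared_roles(declarations):
        print(f"association {assoc!r} references undeclared class {cls!r}")
```

Checking inside the parser while it runs, by contrast, only knows about classes declared before the point of use.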
Of course, some things could be done with a bit of custom code:
grammar G;
options {
...
}
@parser::members {
java.util.Set<String> declaredClasses = new java.util.HashSet<String>();
}
model
: 'MODEL' IDENTIFIER list_dec
;
...
class_dec
: 'CLASS' id=IDENTIFIER class_content
{
declaredClasses.add($id.text);
}
;
...
role
: 'CLASS' id=IDENTIFIER multiplicity
{
if(!declaredClasses.contains($id.text)) {
// warning or exception in here
}
}
;
...
EDIT
Or with custom methods:
@parser::members {
java.util.Set<String> declaredClasses = new java.util.HashSet<String>();
void addClass(String id) {
boolean added = declaredClasses.add(id);
if(!added) {
// 'id' was already present, do something, perhaps?
}
}
void checkClass(String id) {
if(!declaredClasses.contains(id)) {
// exception, error or warning?
}
}
}
...
class_dec
: 'CLASS' id=IDENTIFIER class_content {addClass($id.text);}
;
role
: 'CLASS' id=IDENTIFIER multiplicity {checkClass($id.text);}
;