Reposted from the #perl6 IRC channel, by jkramer, with permission
I'm playing with grammars and trying to parse an ini-style file but somehow Grammar.parse seems to loop forever and use 100% CPU. Any ideas what's wrong here?
grammar Format {
token TOP {
[
<comment>*
[
<section>
[ <line> | <comment> ]*
]*
]*
}
rule section {
'[' <identifier> <subsection>? ']'
}
rule subsection {
'"' <identifier> '"'
}
rule identifier {
<[A..Za..z]> <[A..Za..z0..9_-]>+
}
rule comment {
<[";]> .*? $$
}
rule line {
<key> '=' <value>
}
rule key {
<identifier>
}
rule value {
.*? $$
}
}
Format.parse('lol.conf'.IO.slurp)
Token TOP has the * quantifier on a subregex that can parse an empty string (because both <comment> and the group that contains <section> have a * quantifier on their own).
If the inner subregex matches the empty string, it can do so infinitely many times without advancing the cursor. Currently, Perl 6 has no protection against this kind of error.
It looks to me like you could simplify your code to
token TOP {
<comment>*
[
<section>
[ <line> | <comment> ]*
]*
}
(There is no need for the outer [...]* group, because the last <comment> also matches comments before sections.)
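The non-termination described above is not specific to Perl 6. As a rough illustration only, here is a small Python sketch of a naive `*` loop applied to a matcher that can succeed without consuming input, next to a variant that stops when no progress is made. All function names are made up for this example, and the `max_iterations` cap exists only so the demo terminates.

```python
def match_literal(text, pos, literal):
    """Return the new position if `literal` occurs at `pos`, else None."""
    if text.startswith(literal, pos):
        return pos + len(literal)
    return None


def zero_or_more(matcher, text, pos, max_iterations=1000):
    """Naive `*`: keep applying `matcher` until it fails.

    Returns (final position, iterations, hit_limit). The cap is only
    here so the demonstration stops; a real engine would spin forever.
    """
    count = 0
    while count < max_iterations:
        new_pos = matcher(text, pos)
        if new_pos is None:
            return pos, count, False
        pos = new_pos
        count += 1
    return pos, count, True


def zero_or_more_guarded(matcher, text, pos):
    """`*` with a progress check: stop when a match consumes nothing."""
    count = 0
    while True:
        new_pos = matcher(text, pos)
        if new_pos is None or new_pos == pos:
            return pos, count
        pos = new_pos
        count += 1


def inner(text, pos):
    # Stands in for a group made only of `*`-quantified parts:
    # it always succeeds, possibly without consuming anything.
    end, _, _ = zero_or_more(lambda t, p: match_literal(t, p, "#"), text, pos)
    return end


print(zero_or_more(inner, "##x", 0))          # (2, 1000, True)
print(zero_or_more_guarded(inner, "##x", 0))  # (2, 1)
```

The unguarded loop only stops because of the artificial cap; the guarded one stops as soon as an iteration matches the empty string.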
I'm a complete noob with ANTLR, so apologies if this is a really basic question.
I'm trying to parse a file that has a weird JSON-like syntax. These files are huge, hundreds of MB, so I'm avoiding creating the parse tree and I'm just using grammar actions to manipulate the data into what I want.
As usual, I'm sending Whitespaces and Newlines to the HIDDEN channel. However, there are a couple cases where it'd be helpful if I could detect that the next character is one of those, because that delimits the property value.
Here's an excerpt from a file
game_speed=4
mapmode=0
dyn_title=
{
title="e_dyn_188785"
nick=nick_the_just hist=yes
base_title="k_mongolia"
is_custom=yes
is_dynamic=yes
claim=
{
title=k_bulgaria
pressed=yes
weak=yes
}
claim=
{
title=c_karvuna
pressed=yes
}
claim=
{
title=c_tyrnovo
}
claim=
{
title=c_mesembria
pressed=yes
}
}
And here's the relevant parts of my grammar:
property: key ASSIGNMENT value { insertProp(stack[scopeLevel], $key.text, currentVal) };
key: (LOWERCASE | UPPERCASE | UNDERSCORE | DIGIT | DOT | bool)+;
value:
bool { currentVal = $bool.text === 'yes' }
| string { currentVal = $string.text.replace(/\"/gi, '') }
| number { currentVal = parseFloat($number.text, 10) }
| date { currentVal = $date.text }
| specific_value { currentVal = $specific_value.text }
| (numberArray { currentVal = toArray($numberArray.text) }| array)
| object
;
bool: 'yes' | 'no';
number: DASH? (DIGIT+ | (DIGIT+ '.' DIGIT+));
string:
'"'
( number
| bool
| specific_value
| NONALPLHA
| UNDERSCORE
| DOT
| OPEN_CURLY_BRACES
| CLOSE_CURLY_BRACES
)*
'"'
;
specific_value: (LOWERCASE | UPPERCASE | UNDERSCORE | DASH | bool)+ ;
WS: ([\t\r\n] | ' ') -> channel(HIDDEN);
NEWLINE: ( '\r'? '\n' | '\r')+ -> channel(HIDDEN);
So, as you can see, the input syntax can have property values that are strings but are not delimited by ". And, in fact, for some odd reason, sometimes the next property appears on the same line. Ignoring the WS and NEWLINE means that the parser doesn't recognise that specific_value rule terminates so it grabs part of the next key as well. See output example below:
{
game_speed: 4,
mapmode: 0,
dyn_title:
{
title: 'e_dyn_188785',
nick: 'nick_the_just\t\t\this',
t: true,
base_title: 'k_mongolia',
is_custom: true,
is_dynamic: true,
claim: { title: 'k_bulgaria\n\t\t\t\tpresse', d: true, weak: true },
claim2: { title: 'c_karvuna\n\t\t\t\tpresse', d: true },
claim3: { title: 'c_tyrnovo' },
claim4: { title: 'c_mesembria\n\t\t\t\tpresse', d: true
}
},
What's an appropriate solution here to specify that specific_value shouldn't grab any characters once it reaches a WS or NEWLINE?
Thanks in advance! :D
I'd handle as much as possible in the lexer (like identifiers, numbers and strings). That could look like this in your case:
grammar JsonLike;
parse
: object? EOF
;
object
: '{' key_value* '}'
;
key_value
: key '=' value
;
key
: SPECIFIC_VALUE
| BOOL
// More tokens that can be a key?
;
value
: object
| array
| BOOL
| STRING
| NUMBER
| SPECIFIC_VALUE
;
array
: '[' value+ ']'
;
BOOL
: 'yes'
| 'no'
;
STRING
: '"' ( ~["\\] | '\\' ["\\] )* '"'
;
NUMBER
: '-'? [0-9]+ ( '.' [0-9]+ )?
;
SPECIFIC_VALUE
: [a-zA-Z_] [a-zA-Z_0-9]*
;
SPACES
: [ \t\r\n]+ -> channel(HIDDEN)
;
Resulting in the following parse (the parse-tree image is not preserved here).
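To see why moving identifiers into the lexer stops a value from running into the next key, here is a rough Python approximation of the lexer rules above. It is only a sketch, not ANTLR: the token names mirror the grammar, while `PUNCT`, `tokenize` and `flat_pairs` are names invented for this example, and regex alternation order replaces ANTLR's longest-match rule.

```python
import re

# Token patterns modelled on the lexer rules above (PUNCT is an assumption
# standing in for the literal '=', '{', '}', '[' and ']' tokens).
TOKEN_SPEC = [
    ("STRING", r'"(?:[^"\\]|\\["\\])*"'),
    ("NUMBER", r'-?[0-9]+(?:\.[0-9]+)?'),
    ("BOOL", r'(?:yes|no)\b'),
    ("SPECIFIC_VALUE", r'[a-zA-Z_][a-zA-Z_0-9]*'),
    ("PUNCT", r'[={}\[\]]'),
    ("SPACES", r'[ \t\r\n]+'),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Split `text` into (type, lexeme) pairs, dropping whitespace tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SPACES":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def flat_pairs(tokens):
    """Pair up flat `key = value` triples (nested objects are out of scope)."""
    pairs = []
    for i in range(0, len(tokens) - 2, 3):
        key, eq, value = tokens[i:i + 3]
        if eq != ("PUNCT", "="):
            raise ValueError(f"expected '=' after {key[1]!r}")
        pairs.append((key[1], value[1]))
    return pairs


print(flat_pairs(tokenize("nick=nick_the_just hist=yes")))
# [('nick', 'nick_the_just'), ('hist', 'yes')]
```

Because whitespace ends a token before it is discarded, `nick_the_just` and `hist` arrive at the parser as two separate tokens even though the whitespace itself is hidden.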
I create my index with following custom analyzer
"analyzers":[
{
"name":"shinglewhite_analyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters":[
"map_dash"
],
"tokenizer":"whitespace",
"tokenFilters":[
"shingle"
]
}
],
"charFilters":[
{
"name":"map_dash",
"@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
"mappings":[ "_=> " ]
}
]
The problem is that a word like ice_cream in the input does not match the query ice cream, although it does match icecream. Can someone help me understand how this works and whether I have done something wrong?
We'd also like the query "ice cream" to match "ice cream", "icecream" and "ice and cream", favoring them in that order.
In order to map to a space, please use the following notation (we'll update the docs to include this information):
{
"name":"map_dash",
"@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
"mappings":[ "_=>\\u0020" ]
}
Also, by default the shingle token filter separates tokens with a space. If you want to join subsequent tokens into one without a separator you need to customize your filter like in the following example:
{
"name": "my_shingle",
"@odata.type":"#Microsoft.Azure.Search.ShingleTokenFilter",
"tokenSeparator": ""
}
With those two changes for token ice_cream your analyzer will generate: ice, icecream, cream.
I hope that helps
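As a rough way to reason about the pipeline described above (char filter, then whitespace tokenizer, then shingle filter), here is a small Python simulation. It is not the Azure Search implementation; the function names are invented, and a maximum shingle size of 2 is assumed.

```python
def map_chars(text, mappings):
    """Mimic a mapping char filter: replace each source string with its target."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def whitespace_tokenize(text):
    """Mimic a whitespace tokenizer."""
    return text.split()


def shingle(tokens, max_size=2, separator=" "):
    """Emit each token followed by the shingles that start at it.

    max_size=2 is an assumption made for this sketch.
    """
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        for size in range(2, max_size + 1):
            if i + size <= len(tokens):
                out.append(separator.join(tokens[i:i + size]))
    return out


def analyze(text, separator):
    mapped = map_chars(text, {"_": " "})
    return shingle(whitespace_tokenize(mapped), separator=separator)


print(analyze("ice_cream", separator=""))   # ['ice', 'icecream', 'cream']
print(analyze("ice_cream", separator=" "))  # ['ice', 'ice cream', 'cream']
```

With an empty separator the two-word shingle collapses to `icecream`, which matches the token list given in the answer.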
Still learning how to properly use ANTLR... Here's my problem.
Say you have a subset of a UML grammar and an ANTLR lexer/parser with the following rules:
// Parser Rules
model
: 'MODEL' IDENTIFIER list_dec
;
list_dec
: declaration*
;
declaration
: class_dec ';'
| association ';'
| generalization ';'
| aggregation ';'
;
class_dec
: 'CLASS' IDENTIFIER class_content
;
...
association
: 'RELATION' IDENTIFIER 'ROLES' two_roles
;
two_roles
: role ',' role
;
role
: 'CLASS' IDENTIFIER multiplicity
;
...
I would like the 'role' rule to only allow the IDENTIFIER token if it matches an existing class IDENTIFIER. In other words, if you are given an input file and you run the lexer/parser on it, then all the classes that are referenced (e.g. the IDENTIFIER in the association rule) should exist. The problem is that a class might not exist (yet) at the point where it is referenced, because it can be declared anywhere in the file. What is the best approach to this?
Thanks in advance...
This is probably best done after parsing. The parser creates some sort of tree for you, and afterwards you walk the tree and collect information about declared classes, and walk it a second time to validate the role tree/rule.
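The two-pass idea can be sketched independently of ANTLR. In the Python example below, the list of tuples is a hypothetical stand-in for whatever tree or listener output the parser produces; the tuple layout and the `validate` function are assumptions made for illustration.

```python
# Hypothetical flattened parse result, in file order. An association may
# reference a class that is only declared later in the file.
declarations = [
    ("association", "owns", ["Person", "Car"]),
    ("class", "Person"),
    ("class", "Car"),
    ("association", "drives", ["Person", "Truck"]),
]


def validate(declarations):
    """Return one error message per role that names an undeclared class."""
    # Pass 1: collect every declared class, wherever it appears.
    declared = {d[1] for d in declarations if d[0] == "class"}

    # Pass 2: check each role against the complete set.
    errors = []
    for d in declarations:
        if d[0] == "association":
            for role_class in d[2]:
                if role_class not in declared:
                    errors.append(
                        f"association {d[1]!r} references undeclared class {role_class!r}"
                    )
    return errors


print(validate(declarations))
# ["association 'drives' references undeclared class 'Truck'"]
```

Because the set of classes is complete before any role is checked, the forward reference from `owns` to `Person` and `Car` is accepted, while `Truck` is reported.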
Of course, some things could be done with a bit of custom code:
grammar G;
options {
...
}
@parser::members {
java.util.Set<String> declaredClasses = new java.util.HashSet<String>();
}
model
: 'MODEL' IDENTIFIER list_dec
;
...
class_dec
: 'CLASS' id=IDENTIFIER class_content
{
declaredClasses.add($id.text);
}
;
...
role
: 'CLASS' id=IDENTIFIER multiplicity
{
if(!declaredClasses.contains($id.text)) {
// warning or exception in here
}
}
;
...
EDIT
Or with custom methods:
@parser::members {
java.util.Set<String> declaredClasses = new java.util.HashSet<String>();
void addClass(String id) {
boolean added = declaredClasses.add(id);
if(!added) {
// 'id' was already present, do something, perhaps?
}
}
void checkClass(String id) {
if(!declaredClasses.contains(id)) {
// exception, error or warning?
}
}
}
...
class_dec
: 'CLASS' id=IDENTIFIER class_content {addClass($id.text);}
;
role
: 'CLASS' id=IDENTIFIER multiplicity {checkClass($id.text);}
;
I am using Facet Terms to get all the unique values and their count for a field. And I am getting wrong results.
term: web
Count: 1191979
term: misc
Count: 1191979
term: passwd
Count: 1191979
term: etc
Count: 1191979
While the actual result should be:
term: WEB-MISC /etc/passwd
Count: 1191979
Here is my sample query:
{
"facets": {
"terms1": {
"terms": {
"field": "message"
}
}
}
}
If reindexing is an option, it would be best to change the mapping and mark this field as not_analyzed:
"your_field" : { "type": "string", "index" : "not_analyzed" }
You can use the multi_field type if keeping an analyzed version of the field is desired:
"your_field" : {
"type" : "multi_field",
"fields" : {
"your_field" : {"type" : "string", "index" : "analyzed"},
"untouched" : {"type" : "string", "index" : "not_analyzed"}
}
}
This way, you can continue using your_field in the queries, while running facet searches using your_field.untouched.
Alternatively, if this field is stored, you can use a script field facet instead:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_fields.your_field.value"
}
}
}
As a last resort, if this field is not stored but the record source is stored in the index, you can try this:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_source.your_field"
}
}
}
The first solution is the most efficient. The last solution is the least efficient and may take a lot of time on a large index.
I ran into this same issue today while running a terms aggregation in a recent version of Elasticsearch. After some googling I worked out how the indexing works, and it is quite simple.
Queries can find only terms that actually exist in the inverted index
When you index the following string
"WEB-MISC /etc/passwd"
it will be passed to an analyzer. The analyzer might tokenize it into
"WEB", "MISC", "etc" and "passwd"
along with their position details. These tokens might then be filtered to lowercase, such as
"web", "misc", "etc" and "passwd"
So, after indexing, a search query can see only those four terms, not the complete string "WEB-MISC /etc/passwd". For your requirement, these are the options you can use:
1. Change the default analyzer used by Elasticsearch.
2. If analysis is not needed, turn it off by setting 'not_analyzed' on the fields in question.
3. To make already indexed data searchable this way, re-indexing is the only option.
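The effect of analysis on facet counts can be reproduced with a toy model. The Python sketch below is not Elasticsearch code; `standard_like_analyzer` is a rough, assumed approximation of an analyzed field (split on non-alphanumerics, then lowercase), and `facet_counts` simply counts documents per indexed term.

```python
import re
from collections import Counter


def standard_like_analyzer(text):
    """Rough stand-in for an analyzed field: split on non-alphanumerics, lowercase."""
    return re.findall(r"[a-z0-9]+", text.lower())


def not_analyzed(text):
    """Stand-in for a not_analyzed field: the whole value is a single term."""
    return [text]


def facet_counts(documents, analyzer):
    """Count how many documents contain each indexed term."""
    counts = Counter()
    for doc in documents:
        counts.update(set(analyzer(doc)))  # each document counts once per term
    return dict(counts)


docs = ["WEB-MISC /etc/passwd"] * 3
print(facet_counts(docs, standard_like_analyzer))
print(facet_counts(docs, not_analyzed))
```

With the analyzed variant, each of `web`, `misc`, `etc` and `passwd` gets the full document count, which is the behaviour in the question; with the not_analyzed variant there is a single term holding the complete string.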
I have briefly explained this problem and proposed two solutions here.
I have talked about multiple approaches here.
One is to use not_analyzed to preserve the string as it is. But since that has the drawback of being case sensitive, a better approach would be to use a keyword tokenizer plus a lowercase filter.
The functions "REGEX()" and "TRIM()" in this script don't work as I would expect.
The REGEX function always returns true, and the TRIM function returns the "trim_char" rather than the trimmed string. (When I write the TRIM function with FROM instead of the ",", I get an error message.)
#!/usr/bin/perl
use warnings;
use strict;
use 5.010;
use DBI;
my $dbh = DBI->connect( "DBI:CSV:", undef, undef, { RaiseError => 1, AutoCommit => 1 } );
my $table = 'artikel';
my $array_ref = [ [ 'a_nr', 'a_name', 'a_preis' ],
[ 12, 'Oberhemd', 39.80, ],
[ 22, 'Mantel', 360.00, ],
[ 11, 'Oberhemd', 44.20, ],
[ 13, 'Hose', 119.50, ],
];
$dbh->do( "CREATE TEMP TABLE $table AS IMPORT(?)", {}, $array_ref );
say "";
# purpose : test if a string matches a perl regular expression
# arguments : a string and a regex to match the string against
# returns : boolean value of the regex match
# example : ... WHERE REGEX(col3,'/^fun/i') ... matches rows
# in which col3 starts with "fun", ignoring case
my $sth = $dbh->prepare( "SELECT a_name FROM $table WHERE REGEX( a_name, '/^O/')" );
$sth->execute();
$sth->dump_results();
say "\n";
# TRIM ( [ [LEADING|TRAILING|BOTH] ['trim_char'] FROM ] string )
$sth = $dbh->prepare( "SELECT a_name, TRIM( TRAILING 'd', a_name ) AS new_name FROM $table" );
$sth->execute();
$sth->dump_results();
say "";
$dbh->disconnect();
Has somebody a piece of advice?
Edit:
DBD::SQLite : 1.25
DBD::ExampleP : 12.010007
DBD::Sponge : 12.010002
DBD::CSV : 0.26
DBD::Gofer : 0.011565
DBD::DBM : 0.03
DBD::Proxy : 0.2004
DBI : 1.609
DBD::File : 0.37
SQL::Statement : 1.23
Answer: Neat issue. Short answers from my testing with SQL::Statement-1.23 and DBD::CSV under 5.10.0 with your script:
REGEX() appears to work, but returns a number, not a boolean, which needs to be handled a bit specially:
Fix:
SELECT a_name FROM $table WHERE REGEX( a_name, '/^O/') = 1
TRIM() does not take a comma (as in your example); however, it seems unusably broken to me.
Any use of TRIM( FROM ), in my testing, greatly confused the parser about table names, and any other interesting use seemed to parse out, as you discovered, as a string literal.
Workaround:
SELECT a_name, REPLACE(a_name, 's/d\$//') AS new_name FROM $table
N.B.: you'll need to backslash that dollar sign in the s///, as I have, to keep your interpolating quotes...
Appeal: Please file bugs with test cases for this module. SQL::Statement may not be ready for prime time as an SQL engine, but we can help get it there!
You should boil your code down to the minimal example necessary to exhibit the problem, and then compare the results you get to what happens when you type those commands into the DB's command line interface (e.g. try comparing a simple "SELECT TRIM(...)" command).
Also, what DB and version are you using?
Are you sure the underlying SQL engine (DBI::SQL::Nano I guess) has implemented those functions? It may be best to select the data and process it using Perl.
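The suggestion to select the data and post-process it in the host language can be sketched as follows. The example is in Python rather than Perl, and the `rows` list is a stand-in for whatever a plain SELECT of all three columns would return; both helper functions are invented for this illustration.

```python
import re

# Stand-in for rows fetched with a plain SELECT a_nr, a_name, a_preis.
rows = [
    (12, "Oberhemd", 39.80),
    (22, "Mantel", 360.00),
    (11, "Oberhemd", 44.20),
    (13, "Hose", 119.50),
]


def names_matching(rows, pattern):
    """Client-side equivalent of WHERE REGEX(a_name, pattern)."""
    regex = re.compile(pattern)
    return [name for _, name, _ in rows if regex.search(name)]


def trim_trailing(rows, char):
    """Client-side equivalent of TRIM(TRAILING char FROM a_name)."""
    return [(name, name.rstrip(char)) for _, name, _ in rows]


print(names_matching(rows, r"^O"))  # ['Oberhemd', 'Oberhemd']
print(trim_trailing(rows, "d"))
# [('Oberhemd', 'Oberhem'), ('Mantel', 'Mantel'), ('Oberhemd', 'Oberhem'), ('Hose', 'Hose')]
```

This avoids depending on which SQL functions the driver's engine implements, at the cost of fetching rows that are later filtered out.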