I am reading the Rebol Wikipedia page.
"Parse expressions are written in the parse dialect, which, like the do dialect, is an expression-oriented sublanguage of the data exchange dialect. Unlike the do dialect, the parse dialect uses keywords representing operators and the most important nonterminals"
Can you explain what terminals and nonterminals are? I have read a lot about grammars, but did not understand what they mean. Here is another link where these words are used very often.
Definitions of terminal and non-terminal symbols are not Parse-specific, but concern grammars in general. Things like this wiki page or the intro in Grune's book explain them quite well. OTOH, if you're interested in how Red Parse works and yearn for simple examples and guidance, I suggest dropping by our dedicated chat room.
"parsing" has slightly different meanings, but the one I prefer is conversion of linear structure (string of symbols, in a broad sense) to a hierarchical structure (derivation tree) via a formal recipe (grammar), or checking if a given string has a tree-like structure specified by a grammar (i.e. if "string" belongs to a "language").
All symbols in a string are terminals, in a sense that tree derivation "terminates" on them (i.e. they are leaves in a tree). Non-terminals, in turn, are a form of abstraction that is used in grammar rules - they group terminals and non-terminals together (i.e. they are nodes in a tree).
For example, in the following Parse grammar:
greeting: ['hi | 'hello | 'howdy]
person: [name surname]
name: ['john | 'jane]
surname: ['doe | 'smith]
sentence: [greeting person]
greeting, person, name, surname and sentence are non-terminals (because they never actually appear in the linear input sequence, only in grammar rules);
hi, hello, howdy together with john, jane, doe and smith are terminals (because the parser cannot "expand" them into a set of terminals and non-terminals as it does with non-terminals, hence it "terminates" by reaching the bottom).
>> parse [hi jane doe] sentence
== true
>> parse [howdy john smith] sentence
== true
>> parse [wazzup bubba ?] sentence
== false
As you can see, terminals and non-terminals are disjoint sets, i.e. a symbol can be in one of them, but not in both; moreover, inside grammar rules, only non-terminals can appear on the left side.
One grammar can match different strings, and one string can be matched by different grammars (in the example above, it could be [greeting name surname], or [exclamation 2 noun], or even [some noun], provided that the exclamation and noun non-terminals are defined).
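For instance, here is a rough sketch of those alternative grammars; the exact definitions of exclamation and noun are assumed here, chosen so that they cover the same words as the original rules:
exclamation: ['hi | 'hello | 'howdy]
noun: ['hi | 'hello | 'howdy | 'john | 'jane | 'doe | 'smith]
>> parse [hi jane doe] [greeting name surname]
== true
>> parse [hi jane doe] [exclamation 2 noun]
== true
>> parse [hi jane doe] [some noun]
== true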
And, as usual, one picture is worth a thousand words:
Hope that helps.
Think of it like this:
A digit can be 1-9.
Now I tell you to write down a digit on a page.
So you know that you can write down 1, 2, 3, 4, 5, 6, 7, 8 or 9.
Basically, the nonterminal symbol is "digit"
and the terminal symbols are 1, 2, 3, 4, 5, 6, 7, 8, 9.
When I told you to write down a digit on a page, you wrote down 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9.
You didn't write down the word "digit"; you wrote down 1 or 2 or 3...
Do you see where I'm going?
Let's try to make our own "rules".
Let's "create" a nonterminal symbol; we will call it "Olaf".
Olaf can be a dog (NOTE: dog is a terminal)
Olaf can be a cat (NOTE: cat is a terminal)
Olaf can be a digit (NOTE: digit is a nonterminal)
Now I'm telling you that you can write down an Olaf on a page.
So that means you can write down "dog",
you can also write down "cat",
and you can also write down a digit, which means you can write down 1 or 2 or 3...
Because digit is a nonterminal symbol, you don't write down the word "digit"; you write down
the symbols that digit refers to, which is 1 or 2 or 3, etc.
In the end, only terminal symbols are written on the "page".
One more thing, something you may encounter one day: when you say "a nonterminal can be something",
there is a special term for that, and it's called a "production rule" (it can also be called a "production").
For example:
Olaf can be "dog"
Olaf can be "cat"
Olaf can be digit
We've got 3 productions here; in other words, we have 3 definitions of Olaf.
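Written out in the arrow notation commonly used for grammars, those productions would look something like this (just a sketch of the idea, not tied to any particular tool):
Olaf  -> dog | cat | digit
digit -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9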
Specifications of programming languages use these ideas quite a lot when defining the syntax of a language.
S -> Sa | SbSa | ε
I found a similar question, but I don't understand it: http://automatasteps.blogspot.co.id/2007/08/unambiguous-grammar.html
How do I change this to an unambiguous one?
My string is bbaaa.
What is the best way to treat repetitions in regexes like abc | cde | abc | cde | cde | abc or <regex1> | <regex2> | <regex3> | <regex4> | <regex5> | <regex6>, where many of the regexN will be the same literals?
To explain what I mean, I'll give an example from German. Here is a sample grammar that can parse several present-tense verb forms.
grammar Verb {
    token TOP {
        <base>
        <ending>
    }
    token base {
        geh |
        spiel |
        mach
    }
    token ending {
        e |  # 1sg
        st | # 2sg
        t |  # 3sg
        en | # 1pl
        t |  # 2pl
        en   # 3pl
    }
}
my @verbs = <gehe spielst machen>;
for @verbs -> $verb {
    my $match = Verb.parse($verb);
    say $match;
}
Endings for 1pl and 3pl (en) are the same, but for the sake of clarity it's more convenient to put them both into the token (in my real-life data inflexional paradigms are much more complex, and it's easy to get lost). The token ending works as expected, but I understand that if I put en only once, the program would work a bit faster (I've made tests with regexes consisting of many such repeated elements, and yes, the performance suffers greatly). With my data, there are lots of such repetitions, so I wonder what is the best way to treat them?
Of course, I could put the endings in an array outside the grammar, make this array .unique and then just pass the values:
my @endings = < ... >;
@endings .= unique;
...
token ending { @endings }
But taking the data out of the grammar will make it less clear. Also, in some cases it might be necessary to make each ending a separate token (token ending {<ending_1sg> | <ending_2sg> ... <ending_3pl>}), which would be impossible if they are defined outside the grammar.
If I understand you, for the sake of clarity you want to repeat regex terms, with a comment noting that each is a separate concept? Just comment the line out.
By the way, since empty regexes are ignored in this case, it's okay to begin the line with your branch operator, instead of putting it at the end. It makes things easier, especially when you need to add and remove lines. So I suggest something like this:
grammar Verb {
    # ...
    token ending {
        | e  # 1sg
        | st # 2sg
        | t  # 3sg
        | en # 1pl
        #| t  # 2pl
        #| en # 3pl
    }
}
Because what you're writing is exclusively for the human, not for the parser. Now, if you wanted the different regexes to go into different parse matches so you could access the ending as either $<_3sg> or $<_2pl> (named regexes, so both would succeed), you can't comment one out, and you're going to have to force the computer to do the extra work; obviously you'll also need to use :exhaustive or :overlap. Instead, I would suggest you make a named regex that represents both 3sg and 2pl, and define it like I did above: write them both but comment one out.
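For example, here is a minimal sketch along those lines (the token names sg1, sg2, sg3-pl2 and pl1-pl3 are made up for illustration):
grammar Verb {
    token TOP  { <base> <ending> }
    token base { geh | spiel | mach }

    # One named token per surface form; endings shared by several cells
    # of the paradigm are documented in the name and comments instead of
    # being repeated as alternatives.
    token ending {
        | <sg1>
        | <sg2>
        | <sg3-pl2>
        | <pl1-pl3>
    }

    token sg1     { e  }
    token sg2     { st }
    token sg3-pl2 { t  }  # 3sg; 2pl has the same form
    token pl1-pl3 { en }  # 1pl; 3pl has the same form
}

# $match<ending><sg3-pl2> is defined whether "macht" is read as 3sg or 2pl.
my $match = Verb.parse('macht');
say $match;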
I want to parse this:
VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z]+)*\.[a-z]+\z/i
and, of course, other variations of regular expressions.
Does someone know how to do this properly?
Thanks in advance.
Edit: I tried throwing all regex signs and chars into one lexer rule, like this:
REGEX: ( DIV | ('i') | ('@') | ('[') | (']') | ('+') | ('.') | ('*') | ('-') | ('\\') | ('(') | (')') | ('A') | ('w') | ('a') | ('z') | ('Z')
         //| ('w') | ('a')
       );
and then make a parser rule like this:
regex_assignment: (REGEX)+;
but there are recognition errors (extraneous input). This is definitely because these signs are of course already used in other rules.
The thing is, I actually don't need to process these regex assignments; I just want them to be recognized correctly, without errors. Does anyone have an approach for this in ANTLR? A solution that just recognizes this as a regex and skips it, for example, would suffice for me.
Unfortunately, there is no regex grammar yet in the ANTLR grammar repository, but similar questions have come up before, e.g. Regex Grammar. Once you have the (E)BNF you can convert that to ANTLR. Or alternatively, you can use the BNF grammar to check your own grammar rules to see if they are correctly defined. Simply throwing together all possible input chars in a single rule won't work.
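As a rough sketch of that direction (not a complete regex grammar, and it does not disambiguate a regex literal from division, which usually needs extra context such as a semantic predicate on the previous token), you could let the lexer consume the whole /.../flags literal as one opaque token:
REGEX_LITERAL
    : '/' ( '\\' . | ~[/\\\r\n] )+ '/' [a-z]*  // an escaped char, or anything except '/', '\' and newlines; then optional flags
    ;
A parser rule can then treat REGEX_LITERAL as a single token (e.g. regex_assignment: (REGEX_LITERAL)+;), or the lexer can discard it entirely with -> skip if you never need its contents.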
Say there are two grammar rules
Rule 1 B -> aB | cB
and
Rule 2 B -> Ba | Bc
I'm a bit confused about the difference between these two. Would Rule 1's expression be (a+c)*? Then what would Rule 2's expression be?
Both of those grammars yield the empty language since there is no non-recursive rule, so no sentence consisting only of terminals can be derived.
If you add the production B → ε, both grammars yield the same language, equivalent to the regular expression (a+c)*. However, the parse trees produced by the two grammars would be quite different.
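For example, after adding B → ε, the string ac can be derived from both grammars, but the derivations (and the corresponding parse trees) lean in opposite directions:
Rule 1 (right-recursive):  B ⇒ aB ⇒ acB ⇒ ac
Rule 2 (left-recursive):   B ⇒ Bc ⇒ Bac ⇒ ac
With Rule 1 the tree grows down to the right; with Rule 2 it grows down to the left.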