What are terminal and nonterminal symbols? - grammar

I am reading Rebol Wikipedia page.
"Parse expressions are written in the parse dialect, which, like the do dialect, is an expression-oriented sublanguage of the data exchange dialect. Unlike the do dialect, the parse dialect uses keywords representing operators and the most important nonterminals"
Can you explain what are terminals and nonterminals? I have read a lot about grammars, but did not understand what they mean. Here is another link where this words are used very often.

Definitions of terminal and non-terminal symbols are not Parse-specific, but are concerned with grammars in general. Things like this wiki page or intro in Grune's book explain them quite well. OTOH, if you're interested in how Red Parse works and yearn for simple examples and guidance, I suggest to drop by our dedicated chat room.
"parsing" has slightly different meanings, but the one I prefer is conversion of linear structure (string of symbols, in a broad sense) to a hierarchical structure (derivation tree) via a formal recipe (grammar), or checking if a given string has a tree-like structure specified by a grammar (i.e. if "string" belongs to a "language").
All symbols in a string are terminals, in a sense that tree derivation "terminates" on them (i.e. they are leaves in a tree). Non-terminals, in turn, are a form of abstraction that is used in grammar rules - they group terminals and non-terminals together (i.e. they are nodes in a tree).
For example, in the following Parse grammar:
greeting: ['hi | 'hello | 'howdy]
person: [name surname]
name: ['john | 'jane]
surname: ['doe | 'smith]
sentence: [greeting person]
greeting, person, name, surname and sentence are non-terminals (because they never actually appear in the linear input sequence, only in grammar rules);
hi, hello, howdy with john, jane, doe and smith are terminals (because parser cannot "expand" them into a set of terminals and non-terminals as it does with non-terminals, hence it "terminates" by reaching the bottom).
>> parse [hi jane doe] sentence
== true
>> parse [howdy john smith] sentence
== true
>> parse [wazzup bubba ?] sentence
== false
As you can see, terminal and non-terminal are disjoint sets, i.e. a symbol can be either in one of them, but not in both; moreso, inside grammar rules, only non-terminals can be written on the left side.
One grammar can match different strings, and one string can be matched by different grammars (in the example above, it could be [greeting name surname], or [exclamation 2 noun], or even [some noun], provided that exclamation and noun non-terminals are defined).
And, as usual, one picture is worth a thousand words:
Hope that helps.

think of it like that
a digit can be 1-9
now i will tell you to write down on a page a digit.
so you know that you can write down 1,2,3,4,5,6,7,8,9
basically the nonterminal symbol is "digit"
and the terminals symbols are the 1,2,3,4,5,6,7,8,9
when i told you to write down on a page a digit you wrote down 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9
you didn't wrote down the word "digit" you wrote down the 1 or 2 or 3....
do you see where i'm going ?
let's try to make our own "rules"
let's "create" a nonterminal symbol we will call it "Olaf"
Olaf can be a dog (NOTE: dog is terminal)
Olaf can be a cat (NOTE: cat is terminal)
Olaf can be a digit (NOTE: digit is nonterminal)
Now i'm telling you that you can write down on a page an Olaf.
so that's mean that you can write down "dog"
you can also write down "cat"
you can also write down a digit so that's mean you can write down 1 or 2 or 3...
because digit is nonterminal symbol you dont write down "digit" you write down
the symbols that digit is referring to which is 1 or 2 or 3 etc...
in the end only terminals symbols are written on the "page"
one more thing i have to say is something that you may encounter one day, basically when you say "a nonterminal can be something".
there is a special term for that and that's basically called a "production rule"(can also be called a "production")
for example
Olaf can be "dog"
Olaf can be "cat"
Olaf can be digit
we got 3 productions here in other words we got here 3 definitions of Olaf
specifications of Programming languages use those ideas quite a lot when defining a syntax of a language

Related

Difference between grammar rules

Say there are two grammar rules
Rule 1 B -> aB | cB
and
Rule 2 B -> Ba | Bc
I'm a bit confused as the difference of these two. Would rule 1's expression be (a+c)* ? Then what would Rule 2's expression be?
Both of those grammars yield the empty language since there is no non-recursive rule, so no sentence consisting only of terminals can be derived.
If you add the production B→ε, both grammars would yield the same language, equivalent to the regular expression (a+c)*. However, the parse trees produced by the parse would be quite different.

regexp_like to selects rows where an attribute string contains several different words

A bit new to regexp and looking for some help understanding some of the capabilities. I'am currently trying to select some sets of data that start with a word followed by a space and then several possible words.
Example 1:
I am basically looking to select data such as Product1 green, Product1 red, Product1 blue (green, red or blue basically) but not:
xyz Product1, Product1 black, Product1 white, Product1 garbage red.
I have tried to the following queries with not much success:
Where regexp_like(item, 'Product1 [green | red | blue]');
Where regexp_like(item, 'Product1 [green, red, blue]');
Where regexp_like(item, '^Product1 [green, red, blue]');
Hypothetically, does anybody know of a way I could also implement an 'AND', for example selecting items which contain the words green and red in the same attribute.
Example 2:
Similar situation, but trying to match a word after a punctuation
Where regexp_like (job, 'Commerce [[:punct:]] .*');
With this query I am looking to select jobs which have
Commerce - test
Commerce : abcdefg
These queries are not working as I would expect them to and I'm not able to quite figure out why. I am assuming I have misunderstood the construct of these regular expressions.
Any help / explanations would be greatly appreciated!
For the first, try the following
WHERE REGEXP_LIKE(ITEM, '^Product1.*(green|red|blue)')
or
WHERE REGEXP_LIKE(ITEM, '^Product1 (green|red|blue)')
or
WHERE REGEXP_LIKE(ITEM, '^Product1 +(green|red|blue)')
depending on what you expect after the Product1 - the first case allows zero or more characters of any kind, the second requires that there be a single space after Product1, and the third requires one or more blanks after Product1.
Not sure where you're going exactly on the second one. If you really want strings that begin with 'Commerce', followed by a space, followed by a punctuation character, another space, and then anything, try
WHERE REGEXP_LIKE(JOB, '^Commerce [:punct:] .*');
If instead of a punctuation character you're looking for either ':' or '-', try
WHERE REGEXP_LIKE(JOB, '^Commerce [:-] .*');
I'm no great expert on regular expressions but I'll try to offer some explanations:
^ requires that the following element be at the beginning of the string. Thus, in the first case ^Product1 means "'Product1' must be at the the start of the string".
In regular expressions parentheses are used to group expressions, so in the first case (green|red|blue) are grouped together.
| is a logical OR, so (green|red|blue) means "must be one of 'green' or 'red' or 'blue'".
Square brackets are used for character classes. You can use either predefined classes, such as :punct: or :space:, or you can make up your own as in [:-]. During regular expression interpretation a square bracket character class, no matter how long, represents a single character in the string being matched. So in the regular expression ^Commerce [:-] .* the character class [:-] means "look for either a colon or a dash". If you want to indicate that you expect multiple occurrences of characters in the class, one after another, use one of the repetition operators (* or +) after the class - so [abc]* would match all of abcabcabc.
Also keep in mind that in a regular expression every character means something, so you can't use whitespace to make regular expressions more legible because the whitespace becomes something that will be looked for when the expression is interpreted.
Share and enjoy.
Edit
Didn't notice your question about AND earlier. A simple way to AND together multiple expressions is to just put them one after another. To look for (green|red|blue), followed by a space, followed by (green|red|blue) a simple expression would be
WHERE REGEXP_LIKE(ITEM, '^Product1 (green|red|blue) (green|red|blue)')
If potentially multiple spaces were to be allowed between the colors
WHERE REGEXP_LIKE(ITEM, '^Product1 (green|red|blue) +(green|red|blue)')
could be used.
Resistance is useless.

Chomsky Language Types

I'm trying to understand the four different Chomsky language types but the definitions that I have found don't really mean anything to me. I know type 0 is free grammar, type 1 is context sensitive, type 2 is context free whilst type 3 is regular. So, could someone please explain this and put it into context, thanks.
A language is the set of words that belong to that language. Many times, however, instead of listing each and every word in the language, it is enough to specify the set of rules that generate the words of the language (and only those) to identify what is the language-in-question.
Note: there can be more than one set of rules that desrcibe the same language.
In general, the more restrictions placed on the rules, the less expressive the language (less words can be generated from the rules), but easier to recognize if a word belongs to the language the rules specify. Because of the latter, we want to identify languages with the most restrictions on their rules that will still allow us to generate the same language.
A few words about the rules: In general, you describe a formal language with four items (AKA a four-tuple):
The set of non-terminal symbols (N)
The set of terminal symbols (T)
The set of production rules (P)
The start symbol (S)
The terminal symbols (AKA letters) are the symbols that words of the language consist of, ususally a subset of lowercase English letters (e.g. 'a', 'b', 'c')
The non-terminal symbols are marking an intermediate state in the generation of a word, indicating that a possible transformation can still be applied to the intermediate "word". There is no overlap between the terminal and non-terminal symbols (i.e. the intersection of the two sets are empty). The symbols used for non-terminals are usually subsets of uppercase English letters (e.g. 'A', 'B', 'C')
The rules denote possible transformations on a series of terminal and non-terminal symbols. They are in the form of: left-side -> right-side, where both the left-side and the right-side consists of series of terminal and non-terminal symbols. An example rule: aBc -> cBa, which means that a series of symbols "aBc" (as part of intermediary "words") can be replaced with the series "cBa" during the generation of words.
The start symbol is a dedicated non-terminal symbol (usually denoted by S) that denotes the "root" or the "start" of the word generation, i.e. the first rule applied in the series of word-generation always has the start-symbol as its left-side.
The generation of a word is successfully over when all non-terminals have been replaced with terminals (so the final "intermediary word" consists only of terminal symbols, which indicates that we arrived at a word of the language-in-question).
The generation of a word is unsuccessful, when not all non-terminals have been replaced with terminals, but there are no production rules that can be applied on the current intermediary "word". In this case the generation has to strart anew from the starting symbol, following a different path of production rule applications.
Example:
L={T, N, P, S},
where
T={a, b, c}
N={A, B, C, S}
P={S->A, S->AB, S->BC, A->a, B->bb, C->ccc}
which denotes the language of three words: "a", "abb" and "bbccc"
An example application of the rules:
S->AB->aB->abb
where we 1) started from the start symbol (S), 2) applied the second rule (replacing S with AB), 3) applied the fourth rule (replacing A with a) and 4) applied the fifth rule (replacing B with bb). As there are no non-terminals in the resulting "abb", it is a word of the language.
When talking in general about the rules, the Greek symbols alpha, beta, gamma etc. denote (a potentially empty) series of terminal+non-terminal symbols; the Greek letter epsilon denotes the empty string (i.e. the empty series of symbols).
The four different types in the Chomsky hierarchy describe grammars of different expressive power (different restrictions on the rules).
Languages generated by Type 0 (or Unrestricted) grammars are most expressive (less restricted). The set of Recursively Enumerable languages contain the languages that can be generated using a Turing machine (basically a computer). This means that if you have a language that is more expressive than this type (e.g. English), you cannot write an algorithm that can list each an every (and only these) words of the language. The rules have one restriction: the left-side of a rule cannot be empty (no symbols can be introduced "out of the blue").
Languages generated by Type 1 (Context-sensitive) grammars are the Context-sensitive languages. The rules have the restriction that they are in the form: alpha A beta -> alpha gamma beta, where alpha and beta can be empty, and gamma is non-empty (exception: the S->epsilon rule, which is only allowed if the start symbol S does not appear on the right-side of any rules). This restricts the rules to have at least one non-terminal on their left-side and allows them to have a "context": the non-terminal A in this rule example can be replaced with gamma, only if it is surrounded by ("is in the context of") alpha and beta. The application of the rule preserves the context (i.e. alpha and beta does not change).
Languages generated by Type 2 (Context-free) grammars are the Context-free languages. The rules have the restriction that they are in the form: A -> gamma. This restricts the rules to have exactly one non-terminal on their left-side and have no "context". This essentially means that if you see a non-terminal symbol in an intermediary word, you can apply any one of the rules that have that non-terminal symbol on their left-side to replace it with their right-side, regardless of the surroundings of the non-terminal symbol. Most programming languages have context free generating grammars.
Languages generated by Type 3 (Regular) grammars are the Regular languages. The rules have the restriction that they are of the form: A->a or A->aB (the rule S->epsilon is permitted if the starting symbol S does not appear on the right-side of any rules), which means that each non-terminal must produce exactly one terminal symbol (and possibly one non-terminal as well). The regular expressions generate/recognize languages of this type.
Some of these restrictions can be lifted/modified in a way to keep the modified grammar have the same expressive power. The modified rules can allow other algorithms to recognize the words of a language.
Note that (as noted earlier) a language can often be generated by multiple grammars (even grammars belonging to different types). The expressive power of a language family is usually equated with the expressive power of the type of the most restrictive grammars that can generate those languages (e.g. languages generated by regular (Type 3) grammars can also be generated by Type 2 grammars, but their expressive power is still that of Type 3 grammars).
Examples
The regular grammar
T={a, b}
N={A, B, S}
P={S->aA, A->aA, A->aB, B->bB, B->b}
generates the language which contains words that start with a non-zero number of 'a's, followed by a non-zero number of 'b's. Note that is it not possible to describe a language where each word consists of a number of 'a's followed by an equal number of 'b's with regular grammars.
The context-free grammar
T={a, b}
N={A, B, S}
P={S->ASB, A->a, B->b}
generates the language which contains words that start with a non-zero number of 'a's, followed by an equal number of 'b's. Note that it is not possible to describe a language where each word consists of a number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's with context-free grammars.
The context-sensitive grammar
T={a, b, c}
N={A, B, C, H, S}
P={S->aBC, S->aSBC, CB->HB, HB->HC, HC->BC, aB->ab, bB->bb, bC->bc, cC->cc}
generates tha language which contains words that start with non-zero number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's. The role of H in this grammar is to enable "swapping" a CB combination to a BC combination, so the B's can be gathered on the left, and the C's can be gathered on the right. Note that it is not possible to describe a language where the words consist of a series of 'a's, where the number of 'a's is a prime with context-sensitive grammars, but it is possible to write an unrestricted grammar that generates that language.
There are 4 types of grammars
TYPE-0 :
Grammar accepted by Unrestricted Grammar
Language accepted by Recursively enumerable language
Automaton is Turing machine
TYPE-1 :
Grammar accepted by Context sensitive Grammar
Language accepted by Context sensitive language
Automaton is Linear bounded automaton
TYPE-2 :
Grammar accepted by Context free Grammar
Language accepted by Context free language
Automaton is PushDown automaton
TYPE-3 :
Grammar accepted by Regular Grammar
Language accepted by Regular language
Automaton is Finite state automaton
-
Chomsky is a Normal Form that each production has form:
A->BC or A->a
There can be Two variables or Only one terminal for rule in
Chomsky

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.
Every footnote beings with this: [Footnote
Every footnote ends with this: ]
There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.
So, essentially I want to match [Footnote, then match anything except ], until ] is matched.
This is the closest I have been able to get to identifying all of the footnotes:
NSString *regexString = #"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";
Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.
I have spent a lengthly amount of time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expressions, but I have not truly figured out how to include any character or line break, except right bracket.
Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
Here is some sample text that the regular expression is run on:
<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>
<p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>
<p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>
<p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>
When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--
[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]
In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.
The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footnote 1 because it contains line breaks.
Any improvements on my regular expression would be most appreciated.
Try
#"\\[Footnote[^\\]]*\\]";
This should match across newlines. No need to put a single character into a character class, either.
As a commented, multiline regex (without string escapes):
\[ # match a literal [
Footnote # match literal "Footnote"
[^\]]* # match zero or more characters except ]
\] # match ]
Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.
Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [footnote bar] foo] will match from the beginning until bar]. To avoid this, change the regex to
#"\\[Footnote[^\\]\\[]*\\]";
so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.

ANTLR Parser Question

I'm trying to parse a number of text records where elements in a record are separated by a '+' char, and where the entire record is terminated by a '#' char. For example E1+E2+E3+E4+E5+E6#
Individual elements can be required or optional. If an element is optional, its value is simply missing. For example, if E2 were missing, the input string would be: E1++E3+E4+E5+E6#.
When dealing with empty trailing elements, however, the separator char ('+') may be missing as well. If, for example, the last 3 elements were missing, the string could be: E1+E2+E3#, but it could also be:
E1+E2+E3+++#
I have tried the following rule in Antlr:
'R1' 'E1 + E2 + E3' '+'? 'E4'? '+'? 'E5'? '+'? 'E6'? '#
but Antlr complains that it's ambiguous which of course is correct (every token following E3 could be E4, E5 or E6). The input syntax is fixed (it's from a legacy mainframe system), so I was wondering if anybody has a solution to this problem ?
An alternative would be to specify all the different permutations in the rule, but that would be a major task.
Best regards and thanks,
Michael
That task sounds like excessive overkill for ANTLR, any reason you're just not splitting the string into an array using the '+' as a separator?
If it's coming from a mainframe, it most likely was intended to be processed in a trivial way.
e.g.,
C++ : http://www.cplusplus.com/reference/clibrary/cstring/strtok/
PHP : http://us3.php.net/manual/en/function.explode.php
Java: http://java.sun.com/javase/6/docs/api/java/lang/String.html#split%28java.lang.String%29
C# : http://msdn.microsoft.com/en-us/library/system.string.split%28VS.71%29.aspx
Just a thought.
If this is ambiguous, it's likely because your Es all have the same format (a more complicated case would be that your Es all just start with the same k characters where k is your lookahead, but I'm going to assume that's not the case. If it is, this will still work; it will just require an extra step.)
So it looks like you can have up to 6 Es and up to 5 +s. We'll say a "segment" is an optional E followed by a + - you can have 5 segments, and an optional trailing E.
This grammar can be represented roughly like this (imperfect ANTLR syntax since I'm not very familiar with it):
r : (e_opt? PLUS){1,5} e_opt? END
e_opt : E // whatever your E is
PLUS : '+'
END : '#'
If ANTLR doesn't support anything like {1,5} then this is the same as:
(e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) (e_opt? PLUS)?)?)?)?
which is not that clean, so maybe there is a nicer way to do it.