Which one of these grammars (if not both) is appropriate for the following context-free language? - finite-automata

I was looking at the context free language defined by the following:
{w | na(w) = nb(w)}
I understand that this is the language where any given string w will have an equal number of a's and b's. I have looked at two possible grammars for this language that both seem correct, which are as follows:
S --> aSb | bSa | SS | ɛ
S --> aSbS | bSaS | ɛ
Are either of these grammars incorrect, or do both of them work (and would therefore be equivalent)?

Related

How can I define repeating elements in CFG?

I am making a program that loads a cfg from a file and uses it to load programming languages into a syntax tree.
What is the proper way to define an identifier in a context free grammar. For now, I have a format like this:
IdentifierStart => $l$ | _;
IdentifierChar => "$l$$IdentifierChar$" | "_$IdentifierChar$" | "$i$$IdentifierChar$" | $e$;
Identifier => "$IdentifierStart$$IdentifierChar$" $w$;
Format:
$l$ = any letter
$e$ = epsilon
$i$ = any integer
$o$ = any operator
$n$ = new line
$w$ = whitespace
$a$ = any atom
Quotes mean the whitespace needs to match the inside of the quotes
While this does work, it is inefficient because it creates a deep tree when each letter could justifiably just be listed next to each other. For example,
Pragma => $n$ "#direct" $w$ $String$;
with the rules:
IdentifierStart => $l$ | _;
IdentifierChar => "$l$$IdentifierChar$" | "_$IdentifierChar$" | "$i$$IdentifierChar$" | $e$;
Identifier => "$IdentifierStart$$IdentifierChar$";
Symbol => $l$ | $i$ | $o$ | \$$Identifier$\$;
Def => $Symbol$ $Def$ | $Symbol$ | "Def";
Assignment => $Def$ \| $Assignment$ | $Def$;
Definition => $Identifier$ "=>" $Assignment$\;;
creates the following tree (where each space represents a level in the tree):
Definition:Pragma => $n$ "#pragma" $w$ $String$;
Identifier:Pragma
IdentifierStart:P
Terminal:P
IdentifierChar:ragma
Terminal:r
IdentifierChar:agma
Terminal:a
IdentifierChar:gma
Terminal:g
IdentifierChar:ma
Terminal:m
IdentifierChar:a
Terminal:a
Terminal:=
Terminal:>
Assignment:$n$ "#direct" $w$ $String$
...
While this is fine in the case of an identifier, I noticed there was a problem when I realized I had to define the file format in the same recursive manner:
File => $ValidDirective$ $File$;
ValidDirective => $Comment$ | $Include$ | $Define$ | $Undef$ | $IfPart$ | $Error$ | $Pragma$ | $String$;
Each element of the file will be stored in a sub-tree of the previous element! I don't think this is acceptable because in a program with millions of lines, it will be incredibly inefficient.
Is there any way I can fix this problem while staying true to the conventions of a CFG?
A true CFG does indeed define repetition via recursion, which leads to the nested parse tree you observed.
A programming language would typically use regular expressions (or something similar) to define the syntax of symbols like identifiers. In that case, parsing an identifier would result in a single token, rather than a tree, which might answer your concerns about inefficiency.
However, that approach doesn't apply to higher-level repetitive constructs, e.g. a StatementList or ArgumentList: for those, regular expressions are insufficient, and you need something at least as 'powerful' as a CFG. It's unclear if you think that storing a StatementList or ArgumentList as a deeply-nested tree is inefficient.
If you're obliged to use a true CFG, and you don't have control of the data structures created during the parsing process, you could run a post-process that converts recursive structures into non-recursive structures, but you may find that the efficiency gains are not that much.
Most programming language grammars don't confine themselves to pure CFGs, though.

menhir - associate AST nodes with token locations in source file

I am using Menhir to parse a DSL. My parser builds an AST using an elaborate collection of nested types. During later typecheck and other passes in error reports generated for a user, I would like to refer to source file position where it occurred. These are not parsing errors, and they generated after parsing is completed.
A naive solution would be to equip all AST types with additional location information, but that would make working with them (e.g. constructing or matching) unnecessary clumsy. What are the established practices to do that?
I don't know if it's a best practice, but I like the approach taken in the abstract syntax tree of the Frama-C system; see https://github.com/Frama-C/Frama-C-snapshot/blob/master/src/kernel_services/ast_data/cil_types.mli
This approach uses "layers" of records and algebraic types nested in each other. The records hold meta-information like source locations, as well as the algebraic "node" you can match on.
For example, here is a part of the representation of expressions:
type ...
and exp = {
eid: int; (** unique identifier *)
enode: exp_node; (** the expression itself *)
eloc: location; (** location of the expression. *)
}
and exp_node =
| Const of constant (** Constant *)
| Lval of lval (** Lvalue *)
| UnOp of unop * exp * typ
| BinOp of binop * exp * exp * typ
...
So given a variable e of type exp, you can access its source location with e.eloc, and pattern match on its abstract syntax tree in e.enode.
So simple, "top-level" matches on syntax are very easy:
let rec is_const_expr e =
match e.enode with
| Const _ -> true
| Lval _ -> false
| UnOp (_op, e', _typ) -> is_const_expr e'
| BinOp (_op, l, r, _typ) -> is_const_expr l && is_const_expr r
To match deeper in an expression, you have to go through a record at each level. This adds some syntactic clutter, but not too much, as you can pattern match on only the one record field that interests you:
let optimize_double_negation e =
match e.enode with
| UnOp (Neg, { enode = UnOp (Neg, e', _) }, _) -> e'
| _ -> e
For comparison, on a pure AST without metadata, this would be something like:
let optimize_double_negation e =
match e.enode with
| UnOp (Neg, UnOp (Neg, e', _), _) -> e'
| _ -> e
I find that Frama-C's approach works well in practice.
You need somehow to attach the location information to your nodes. The usual solution is to encode your AST node as a record, e.g.,
type node =
| Typedef of typdef
| Typeexp of typeexp
| Literal of string
| Constant of int
| ...
type annotated_node = { node : node; loc : loc}
Since you're using records, you can still pattern match without too much syntactic overhead, e.g.,
match node with
| {node=Typedef t} -> pp_typedef t
| ...
Depending on your representation, you may choose between wrapping each branch of your type individually, wrapping the whole type, or recursively, like in Frama-C example by #Isabelle Newbie.
A similar but more general approach is to extend a node not with the location, but just with a unique identifier and to use a final map to add arbitrary data to nodes. The benefit of this approach is that you can extend your nodes with arbitrary data as you actually externalize node attributes. The drawback is that you can't actually guarantee the totality of an attribute since finite maps are no total. Thus it is harder to preserve an invariant that, for example, all nodes have a location.
Since every heap allocated object already has an implicit unique identifier, the address, it is possible to attach data to the heap allocated objects without actually wrapping it in another type. For example, we can still use type node as it is and use finite maps to attach arbitrary pieces of information to them, as long as each node is a heap object, i.e., the node definition doesn't contain constant constructors (in case if it has, you can work around it by adding a bogus unit value, e.g., | End can be represented as | End of unit.
Of course, by saying an address, I do not literally mean the physical or virtual address of an object. OCaml uses a moving GC so an actual address of an OCaml object may change during a program execution. Moreover, an address, in general, is not unique, as once an object is deallocated its address can be grabbed by a completely different entity.
Fortunately, after ephemera were added to the recent version of OCaml it is no longer a problem. Moreover, an ephemeron will play nicely with the GC, so that if a node is no longer reachable its attributes (like file locations) will be collected by the GC. So, let's ground this with a concrete example. Suppose we have two nodes c1 and c2:
let c1 = Literal "hello"
let c2 = Constant 42
Now we can create a location mapping from nodes to locations (we will represent the latter as just strings)
module Locations = Ephemeron.K1.Make(struct
type t = node
let hash = Hashtbl.hash (* or your own hash if you have one *)
let equal = (=) (* or a specilized equal operator *)
end)
The Locations module provides an interface of a typical imperative hash table. So let's use it. In the parser, whenever you create a new node you should register its locations in the global locations value, e.g.,
let locations = Locations.create 1337
(* somewhere in the semantics actions, where c1 and c2 are created *)
Locations.add c1 "hello.ml:12:32"
Locations.add c2 "hello.ml:13:56"
And later, you can extract the location:
# Locations.find locs c1;;
- : string = "hello.ml:12:32"
As you see, although the solution is nice in the sense, that it doesn't touch the node data type, so the rest of your code can pattern match on it nice and easy, it is still a little bit dirty, as it requires global mutable state, that is hard to maintain. Also, since we are using an object address as a key, every newly created object, even if it was logically derived from the original object, will have a different identity. For example, suppose you have a function, that normalizes all literals:
let normalize = function
| Literal str -> Literal (normalize_literal str)
| node -> node
It will create a new Literal node from the original nodes, so all the location information will be lost. That means, that you need to update the location information, every time you derive one node from another.
Another issue with ephemera is that they can't survive the marshaling or serialization. I.e., if you store your AST somewhere in a file, and then you restore it, all nodes will loose their identity, and the location table will become empty.
Speaking of the "monadic approach" that you mentioned in comments. Though monads are magic, they still can't magically solve all the problems. They are not silver bullets :) In order to attach something to a node we still need to extend it with an extra attribute - either a location information directly or an identity through which we can attach properties indirectly. The monad can be useful for the latter though, as instead of having a global reference to the last assigned identifier, we can use a state monad, to encapsulate our id generator. And for the sake of completeness, instead of using a state monad or a global reference to generate unique identifiers, you can use UUID and get identifiers that are not only unique in a program run, but are also universally unique, in the sense that there are no other objects in the world with the same identifier, no matter how often you run your program (in the sane world). And although it looks like that generating the UUID doesn't use any state, underneath the hood it still uses an imperative random number generator, so it is sort of cheating, but still can seen as pure functional, as it doesn't contain observable effects.

Context-free grammar for L = { b^n c^n a^n , n>=1}

I have a language L, which is defined as: L = { b^n c^n a^n , n>=1}
The corresponding grammar would be able to create words such as:
bca
bbccaa
bbbcccaaa
...
How would such a grammar look like? Making two variables dependent of each other is relatively simple, but I have trouble with doing it for three.
Thanks in advance!
L = { b^n c^n a^n , n>=1}
As pointed out in the comments, this is a canonical example of a language which is not context free. It can be shown using the pumping lemma for context-free languages. Basically, consider a string like b^p c^p a^p where p is the pumping length and then show no matter what part you pump, you will throw off the balance (basically, the size of the part that's pumped is less than p, so it cannot "span" all three symbols to keep them in sync).
L = {a^m b^n c^n a^(m+n) |m ≥ 0,n ≥ 1}
As suggested in the comments, this is not context free either. It can be shown using the pumping lemma for context-free languages as well. However, given a proof (or acceptance) of the above, there is an easier way. Recall that the intersection of a regular language and a context-free language must be context free. Assume L is context-free. Then so must be its intersection with the regular language (b+c)(b+c)* a*. However, that intersection can be expressed as b^n c^n a^n (since m is forced to be zero), which we know is not context-free, a contradiction. Therefore, our assumption was wrong and L is not context free either.

Chomsky languages: how to recognize them?

I have a problem with the recognition of languages. Given a certain language, for example ancb2n, n > 0, how do I determine quickly what type belongs according Chomsky?
My idea was to determine the grammar that generates it and then up to the language but it is a long process. I think there's another way to recognize it by eye,
without writing grammars or automata.
can someone help me?
Unfortunately, associating an arbitrary language with a level of the Chomsky hierarchy is, in the general case, undecidable. (See Rice's Theorem.)
Of course, it is easy to categorize a given grammar, since the Chomsky hierarchy is defined by a simple syntactic analysis of the grammar itself. However, languages do not have unique grammars; the existence of (for example) a Type 2 (context-free) grammar for a language does not mean that there is not a Type 3 (regular) grammar for the same language.
So there are no shortcuts.
However, there is a lot to be said for experience. The language { ancb2n | n > 0 } is context-free (and not regular), as are all languages of similar forms. That it is context free is demonstrated by the grammar
L → c
L → a L b b
and the fact that it is not regular can be proven using the pumping lemma for regular languages. (The linked Wikipedia article contains, as an example of the use of the lemma, a proof for a similar language which should be easy to adapt.)
On the other hand, a language which requires three equal counts ({ ancnbn | n > 0 }) is not context-free (but is context-sensitive). (That's not the same as { ancn+mbm | n > 0 }, which is context-free.)

Automata theory : Conversion of a Context free grammar to a DFA

How to convert a Context Free Grammar to a DFA? This is easy if we have transitions like
A->a B. But when we have the transitions as A->a B c. Then how should we represent it as a DFA
There is no general procedure to convert an arbitrary CFG into a DFA. For example, consider this CFG:
S → aSb | ε
This grammar is for the language { anbn | n ≥ 0 }, which is a canonical nonregular language. Since we can only build DFAs for regular languages, there’s no way to build a DFA with the same language as this CFG
First, you should convert your language to CNF (Chomskey Normal Form).
Then steps for conversion are as such:
Convert it to left/right grammar is called a regular grammar.
Convert the Regular Grammar into Finite Automata
The transitions for automata are obtained as follows
For every production A -> aB make δ(A, a) = B that is make an are labeled ‘a’ from A to B.
For every production A -> a make δ(A, a) = final state.
For every production A -> ϵ, make δ(A, ϵ) = A and A will be final state.
No. For these Grammar no DFA can form.
why?
because it requires memory. Memory of occurence of a.
Yes . it is CFL (context free Language).
We can design a PDA (Push down automata). Here , memory ( STACK is
use ). for PUSH a and POP b