How can one define a language which does not fit in the Chomsky Hierarchy? - grammar

I'm asking this question because I've stumbled across the accepted answer of Chomsky Language Types
This quote is referring to Type-0 Grammars:
This means that if you have a language that is more expressive than
this type (e.g. English), you cannot write an algorithm that can list
each an every (and only these) words of the language
As far as I know:
There is no mathematical description for what English is so it is meaningless to argue about where it lands in the hierarchy of formal languages.
If there was, then English would certainly be recognizable by some Type-0 Grammar by virtue of it being defined by a finite amount of reasoning - where it be axioms, a grammar, anything. (If not - how could've someone define it if not by a finite amount of steps?)
Hence:
We can't start talking about how 'expressive' a grammar needs to be to generate precisely an unknown mathematical object
Therefore my problem:
How can one define a language which does not fit in the Chomsky Hierarchy?
If (?) it takes a finite amount of steps for mathematicians to define
sets with cardinalities that do not make them recursively enumerable - then grammars must exist which are more expressive than Type-0 since they (mathematicians) have followed a finite amount of rules (production rules if you will) to produce a non-RE set. Where are they?

A language is a possibly-infinite set of finite words written with some finite alphabet. Since the alphabet is finite and the length of each word is finite, the words of any language are enumerable, in the sense that there exists an enumeration. In other words, the size of any language is at most countably infinite.
However, since any subset of the Kleene closure of the alphabet is a language, the number of languages is not countably infinite. Hence, there is no enumeration of languages.
The Chomsky hierarchy is based on a formalism which can be expressed as a finite sentence with a finite alphabet (the same alphabet as the language being described, plus a couple of extra symbols). [Note 1] So the number of possible Type 0 grammars is countably infinite, and there cannot be a correspondence between the set of grammars and the set of languages.
However. The existence of languages (i.e. sets) for which no generative grammar exists does not necessarily mean that there is some other way of describing these languages which is "more expressive" than generative grammars. Any description which can be written as a finite string using a finite alphabet can only describe a countable infinity of sets. Whether or not it is the same countable infinity will depend on the formalisms, and in general there will be no algorithm which can demonstrate homomorphism. But some equivalences are known (such as the equivalence with Turing machines, which is a particularly interesting equivalence).
So, we have an interesting little conundrum, which is (of course) related to Gödel's Incompleteness Theorems. That is, there are more languages than ways of describing a language, no matter what system we use to describe a language. So the question "How do we describe a language for which no description is available?" does not have a good answer (and if we answer it, by calling some set "Sue", then there will still be an uncountable infinitude of possible sets for which no name exists).
While all this foraging into infinitudes is interesting, it has a few issues:
It has very little (if anything) to do with programming, so it's questionable whether it's on topic for StackOverflow.
Kurt Gödel and Georg Cantor, the two mathematicians responsible for most of the concepts in this answer, both suffered from severe depression. Just saying.
Notes
Although at first glance it might appear that the alphabet for a Type 0 grammar might be arbitrarily larger than the alphabet of the language being described, that is not actually the case. The grammar's alphabet consists of the target alphabet plus a finite set of non-terminals plus an → symbol; the non-terminals can be written using numbers in any convenient base, say binary. So only three additional symbols are required (and you could reduce that to two by arbitrarily designating one of the non-terminal numbers to be the arrow). (It might seem like you need a third symbol to delimit the names of non-terminals, but you can use a fibonacci encoding to produce codes which always start with a 1 and never include two 1s, so that you can use an extra 1 at the beginning to unambiguously mark the start of the symbol.)

Related

Is it possible to have more than one minimal DFA 's for a regular language?

If we create two DFA's for a language L say DFA A and DFA B. Then, after minimising the DFA's we get their corresponding equivalent minimal DFA's . Is It always the case that both minimal DFA's have same number of states?
I designed 2 DFAs for a language containing strings with 1 as their second last symbol. (The alphabet is {0,1}
I made 2 DFA's one has 3 states one Has four. I am unable to minimise any of the two.
The minimum deterministic finite automata is unique up to isomorphism.
Isomorphism effectively means "equal shape". With other words, there is only one minimum DFA and you can name the states as you want, this renaming technically creates a new automata, but all of this different possible renaming of the states are isomorphic to each other - the shape is the same, just the representation is different.
Ignoring the isomorphism, the minimum DFA is unique.

Can different program implementations have the same program semantics?

So for any given language, if we implement the same program(i.e same output for any given input) twice, using different syntax (i.e. using i++ instead of i+1) will the two programs have the same semantics? Why?
Does the same apply in case where we use different constructs (i.e. Arrays vs Arraylists)?
Thanks
Yes. Depending on the programming language, there can be (combinations of) different syntax constructs with identical semantics.
For example, we can define a programming language with 3 constructs: A and B, both of which are semantically equivalent, and composition (e.g XY for any X and Y where any of these can either be A, B or any composition thereof). Hence program A is equivalent to program B. Also AA is equal to AB, BA and BB etc.
Further, if we extend the language with C which is semantically equivalent to AA, then, for example, BC is equivalent to AAA etc.
So for any given language, if we implement the same program(i.e same output for any given input) twice, using different syntax (i.e. using i++ instead of i+1) will the two programs have the same semantics?
That question is a tautology. The answer is yes. Obviously.
If two different programs produce the same results for all possible input sets, then they do have the same semantics. By definition1.
Why?
Because that is what "same semantics" means!
Does the same apply in case where we use different constructs (i.e. Arrays vs Arraylists)?
Yes.
(One data structure might use more memory, and that might cause an OOME for one version and not the other ... for certain input datasets. But then I would argue that the programs DO NOT produce the same results for all possible inputs.)
Note that this applies to all practical programming languages. Any programming language where there are programs that can only be written one way ... is probably too restrictive to be usable.
1 - OK, so anyone who has studied programming semantics would probably have a fit when they read that. But I am trying to provide an intuitive explanation rather than one that has a decent mathematical foundation. Horses for courses ... as they say.

Are there any steps or rules to draw a DFA?

In my first lecture of "Theory of Automata", after giving some concepts of Alphabet, Language, transition function etc. and a couple of simple automata of an electric circuit with one and two switches, is this question.
I understand what an Alphabet as well as the Language of a DFA is, but are there any rules or steps to followed to reach a correct automaton for a given Language? Or we just have to imagine and think in our mind and get to a solution which satisfies the given Language?
Note:- Please keep your language as simple as you can, since this is my first lecture and I am not yet aware of concepts like regular expressions or any other thing in the subject for that matter.
If you are given a description of the language in words, say, think about all the possible strings that can apply to this language. Then, try to come up with a DFA that handles most of the strings. Then look into the boundary conditions and generate some strings. Try to accommodate it in the DFA. This might be a good starting point for you
http://www.cse.chalmers.se/~coquand/AUTOMATA/o2.pdf
I am a novice .. but as per my experience.. the boundary conditions of the language to be accepted should be drawn 1st and then the complexities can be added while looking at the conditions which will get rejected step by step ... as a start if the figure in the question would have been for a DFA which accepts L={01*0}, then the bare minimum string would be "010" ..and eventually the dfa can be constructed keeping in mind the trap states and some analysis Hope this helps !!
Steps To Construct DFA-
Following steps are followed to construct a DFA for Type-01 problems-
Step-01: Determine the minimum number of states required in the DFA.
Draw those states.
Step-02: Decide the strings for which DFA will be constructed.
Step-03: Construct a DFA for the strings decided in Step-02.
Step-04: Send all the left possible combinations to the starting state.
Do not send the left possible combinations over the dead state.

Simulating regular expressions with deterministic finite automata and the size of the alphabet

I'm currently working my way through the "Dragon Book" (Compilers: Principles, Techniques, & Tools) and I'm kind of stuck at the lexical analysis chapter, which uses DFAs (Deterministic finite automata).
A DFA is a two-dimensional array, the first dimension contains the state and the second the transition symbols. This means that every DFA state contains all the symbols of the language. The examples in the book use a small language (usually two symbols), and they make the following note at the end of the chapter: "since a typical lexical analyzer has several hundred states in its DFA and involves the ASCII alphabet of 128 input characters, the array consumes less than a megabyte".
However, for matching strings I want to match all characters, which means the entire character set, and a lot of input files use UTF-8 encoding. This causes the alphabet, and thus the size of the DFA, to rise enormously.
This is the point where I'm stuck. How are lexical analyzers, or regular expression simulators in general handling this?
Thanks!
I've had an epiphany on this problem. In lexical analysis, about the only time you want to match characters beyond the ASCII range is while doing wildcard matching, like in strings or comments. Because these are only used in wildcards, and not individually, all the characters with a value of 128 or higher can be represented as a single 'other' value. The alphabet and DFA remain small this way, while I am still able to use transition tables and match the entire unicode charset.
Here's an interesting tool that converts a regular expression to non-deterministic finite automata.

Can antlr do type-dependent parsing?

Let me ask whether antlr3 accepts the following example grammar.
for an input , x + y * z ,
it is parsed as x+(y*z) if each in {x,y,z} is a number;
it is parsed as (x+y)*z if each in {x,y,z} is an object of a particular type T;
And let me ask whether such grammars are used sometimes or very rarely for computer languages.
Thank you very much.
In general, parsers (produced by parser generators) only check syntax.
A parser (produced by any means) that can explore multiple parses (I believe ANTLR does this by backtracking; other parsing engines [GLR, Earley] do it by parallel exploration of possible parses), if augmented with semantic checking information, could reject parses that didn't meet semantic constraints.
People tend not to build such parsers in my experience, partly because it is hard to explain to users. If they don't get it, your parser isn't successful; your example is especially bad IMHO in terms of explainability. They also tend not to do this because they need that type information, and that's not always convenient to collect as you parse. The GCC parsers famously do just this this to parse statements such as
X*T;
and the parser is a bit of a mess because of the need to parse and collect this type information as it goes.
I suspect ANTLR can check semantic predicates. How easy it is to get type information of the kind you discuss to those semantic checks is another question; I have no experience here.
The GLR parsing engine used by our DMS Software Reengineering Toolkit does have "semantic" predicates. It isn't particularly easy to get real semantic type information to those predicates by architectural design; we wanted such predicates to be driven off of "syntax". But then, everything (including type inference) is driven off syntax. So we stick information purely local to the reduction being proposed. This is particulary handy in (not) recognizing as separate types of parses, the following peculiar FORTRAN construct for nested-do-termination vs. shared-do-termination:
DO 10 I=1,10,1
DO 10 J=1,10,1
A(I,J)=0
10 CONTINUE
20 CONTINUE
vs.
DO 20 I=1,10,1
DO 10 J=1,10,1
A(I,J)=0
10 CONTINUE
20 CONTINUE
To the parser, at the pure syntax level, both of these look like:
DO <INT> <VAR>=...
DO <INT> <VAR>=...
<STMTS>
<INT> CONTINUE
<INT> CONTINUE
How can one determine which CONTINUE statement belongs to which DO consrtuct with only this information? You can't.
The DMS FORTRAN parser does exactly this by having two sets of rules for DO loops, one for unshared continues, an one for shared continues. They differentiate using semantic predicates that check that the CONTINUE statement label matches the DO loop designated label. And thus the DMS FORTRAN parser gets the loop nesting right as it parses. AFAIK, all the other FORTRAN compilers parse the statements individually, and then stitch the DO loop nests together in a post pass.
And yes, while FORTRAN has this (confusing) construct, no other modern language that I know copied it.