Can ANTLR do type-dependent parsing?

I would like to know whether ANTLR 3 can handle a grammar like the following example.
For the input x + y * z:
it is parsed as x+(y*z) if each of x, y, z is a number;
it is parsed as (x+y)*z if each of x, y, z is an object of a particular type T.
I would also like to know whether such grammars are used in practice for computer languages, or only very rarely.
Thank you very much.

In general, parsers (produced by parser generators) only check syntax.
A parser (produced by any means) that can explore multiple parses (I believe ANTLR does this by backtracking; other parsing engines [GLR, Earley] do it by parallel exploration of possible parses), if augmented with semantic checking information, could reject parses that didn't meet semantic constraints.
People tend not to build such parsers in my experience, partly because the resulting language is hard to explain to users. If they don't get it, your parser isn't successful; your example is especially bad IMHO in terms of explainability. They also tend not to do this because the parser needs that type information, and that's not always convenient to collect as you parse. The GCC parsers famously do just this to parse statements such as
X*T;
and the parser is a bit of a mess because of the need to parse and collect this type information as it goes.
I suspect ANTLR can check semantic predicates. How easy it is to get type information of the kind you discuss to those semantic checks is another question; I have no experience here.
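The idea of exploring several parses and rejecting those that fail a semantic check can be sketched in a few lines of Python. This is only an illustration of the principle for the question's x + y * z input, not ANTLR or DMS code; the names candidate_parses, accept and parse, and the type labels "number" and "T", are made up for the example.

```python
def leaves(tree):
    """Collect the operand names of a parse tree given as nested tuples."""
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)


def candidate_parses(tokens):
    """Return both bracketings of `a + b * c` as (operator, left, right) tuples."""
    a, _plus, b, _star, c = tokens
    return [
        ("+", a, ("*", b, c)),   # '*' binds tighter: a+(b*c)
        ("*", ("+", a, b), c),   # '+' binds tighter: (a+b)*c
    ]


def accept(tree, types):
    """Hypothetical semantic predicate: keep a parse only if it suits the operand types."""
    operand_types = {types[name] for name in leaves(tree)}
    if operand_types == {"number"}:
        return tree[0] == "+"    # numbers: a+(b*c), so the root is '+'
    if operand_types == {"T"}:
        return tree[0] == "*"    # type T: (a+b)*c, so the root is '*'
    return False                 # mixed types: the question does not say


def parse(tokens, types):
    """Explore every candidate parse and reject those failing the predicate."""
    survivors = [t for t in candidate_parses(tokens) if accept(t, types)]
    if len(survivors) != 1:
        raise ValueError("no unique parse for these operand types")
    return survivors[0]
```

Note that the types table has to be filled in before parsing starts, which is exactly the inconvenience described above.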
The GLR parsing engine used by our DMS Software Reengineering Toolkit does have "semantic" predicates. It isn't particularly easy to get real semantic type information to those predicates by architectural design; we wanted such predicates to be driven off of "syntax". But then, everything (including type inference) is driven off syntax. So we stick to information purely local to the reduction being proposed. This is particularly handy in (not) recognizing as separate types of parses, the following peculiar FORTRAN construct for shared-do-termination vs. nested-do-termination:
DO 10 I=1,10,1
DO 10 J=1,10,1
A(I,J)=0
10 CONTINUE
20 CONTINUE
vs.
DO 20 I=1,10,1
DO 10 J=1,10,1
A(I,J)=0
10 CONTINUE
20 CONTINUE
To the parser, at the pure syntax level, both of these look like:
DO <INT> <VAR>=...
DO <INT> <VAR>=...
<STMTS>
<INT> CONTINUE
<INT> CONTINUE
How can one determine which CONTINUE statement belongs to which DO construct with only this information? You can't.
The DMS FORTRAN parser nevertheless gets this right by having two sets of rules for DO loops, one for unshared continues and one for shared continues. They differentiate using semantic predicates that check that the CONTINUE statement label matches the label designated by the DO loop. And thus the DMS FORTRAN parser gets the loop nesting right as it parses. AFAIK, all the other FORTRAN compilers parse the statements individually, and then stitch the DO loop nests together in a post pass.
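The label check itself is simple. Here is a minimal Python sketch of the post-pass variant (stitching already-parsed statements into loop nests); it is not how DMS does it, and the tuple layout of the statements and the name stitch_do_loops are assumptions made for the example.

```python
def stitch_do_loops(statements):
    """Pair each DO with the CONTINUE that terminates it.

    statements is a list of tuples: ("DO", label, var), ("CONTINUE", label)
    or ("STMT", text). Returns (do_index, continue_index) pairs.
    """
    open_loops = []   # stack of (statement index, terminating label)
    pairs = []
    for index, stmt in enumerate(statements):
        if stmt[0] == "DO":
            open_loops.append((index, stmt[1]))
        elif stmt[0] == "CONTINUE":
            # The same test a semantic predicate would make: does the
            # CONTINUE label match the label named by the innermost open DO?
            # A shared CONTINUE closes several DOs in a row.
            while open_loops and open_loops[-1][1] == stmt[1]:
                do_index, _ = open_loops.pop()
                pairs.append((do_index, index))
    return pairs


shared = [("DO", 10, "I"), ("DO", 10, "J"), ("STMT", "A(I,J)=0"),
          ("CONTINUE", 10), ("CONTINUE", 20)]
nested = [("DO", 20, "I"), ("DO", 10, "J"), ("STMT", "A(I,J)=0"),
          ("CONTINUE", 10), ("CONTINUE", 20)]
```

For shared, both DO statements pair with the statement at index 3; for nested, the inner DO pairs with index 3 and the outer one with index 4.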
And yes, while FORTRAN has this (confusing) construct, no other modern language that I know copied it.

Related

How can one define a language which does not fit in the Chomsky Hierarchy?

I'm asking this question because I've stumbled across the accepted answer of Chomsky Language Types
This quote is referring to Type-0 Grammars:
This means that if you have a language that is more expressive than
this type (e.g. English), you cannot write an algorithm that can list
each an every (and only these) words of the language
As far as I know:
There is no mathematical description for what English is so it is meaningless to argue about where it lands in the hierarchy of formal languages.
If there were, then English would certainly be recognizable by some Type-0 Grammar by virtue of it being defined by a finite amount of reasoning, whether that be axioms, a grammar, or anything else. (If not, how could someone have defined it, if not by a finite number of steps?)
Hence:
We can't start talking about how 'expressive' a grammar needs to be to generate precisely an unknown mathematical object
Therefore my problem:
How can one define a language which does not fit in the Chomsky Hierarchy?
If (?) it takes a finite number of steps for mathematicians to define sets with cardinalities that do not make them recursively enumerable, then grammars must exist which are more expressive than Type-0, since they (mathematicians) have followed a finite number of rules (production rules if you will) to produce a non-RE set. Where are they?
A language is a possibly-infinite set of finite words written with some finite alphabet. Since the alphabet is finite and the length of each word is finite, the words of any language are enumerable, in the sense that there exists an enumeration. In other words, the size of any language is at most countably infinite.
However, since any subset of the Kleene closure of the alphabet is a language, the number of languages is not countably infinite. Hence, there is no enumeration of languages.
The Chomsky hierarchy is based on a formalism which can be expressed as a finite sentence with a finite alphabet (the same alphabet as the language being described, plus a couple of extra symbols). [Note 1] So the number of possible Type 0 grammars is countably infinite, and there cannot be a correspondence between the set of grammars and the set of languages.
However. The existence of languages (i.e. sets) for which no generative grammar exists does not necessarily mean that there is some other way of describing these languages which is "more expressive" than generative grammars. Any description which can be written as a finite string using a finite alphabet can only describe a countable infinity of sets. Whether or not it is the same countable infinity will depend on the formalisms, and in general there will be no algorithm which can demonstrate homomorphism. But some equivalences are known (such as the equivalence with Turing machines, which is a particularly interesting equivalence).
So, we have an interesting little conundrum, which is (of course) related to Gödel's Incompleteness Theorems. That is, there are more languages than ways of describing a language, no matter what system we use to describe a language. So the question "How do we describe a language for which no description is available?" does not have a good answer (and if we answer it, by calling some set "Sue", then there will still be an uncountable infinitude of possible sets for which no name exists).
While all this foraging into infinitudes is interesting, it has a few issues:
It has very little (if anything) to do with programming, so it's questionable whether it's on topic for StackOverflow.
Kurt Gödel and Georg Cantor, the two mathematicians responsible for most of the concepts in this answer, both suffered from severe depression. Just saying.
Notes
1. Although at first glance it might appear that the alphabet for a Type 0 grammar might be arbitrarily larger than the alphabet of the language being described, that is not actually the case. The grammar's alphabet consists of the target alphabet plus a finite set of non-terminals plus an → symbol; the non-terminals can be written using numbers in any convenient base, say binary. So only three additional symbols are required (and you could reduce that to two by arbitrarily designating one of the non-terminal numbers to be the arrow). (It might seem like you need a third symbol to delimit the names of non-terminals, but you can use a Fibonacci encoding to produce codes which always start with a 1 and never include two consecutive 1s, so that you can use an extra 1 at the beginning to unambiguously mark the start of the symbol.)

Choosing serialization frameworks

I was reading about the downsides of using Java serialization and the need to move to serialization frameworks. There are so many frameworks, like Avro, Parquet, Thrift and Protobuf.
Question is what framework addresses what and what are all the parameters that are to be considered while choosing a serialization framework.
I would like to get hands on with a practical use case and compare/choose the serialization frameworks based on the requirements.
Can somebody please assist on this topic?
There are a lot of factors to consider. I'll go through some of the important ones.
0) Schema first or Code First
If you have a project that'll involve different languages, code first approaches are likely to be problematic. It's all very well having a Java class that can be serialised, but it might be a nuisance if that has to be deserialised in C.
Generally I favour schema first approaches, just in case.
1) Inter-object Demarcation
Some serialisations produce a byte stream that makes it possible to see where one object stops and another begins. Others do not.
So, if you have a message transport / data store that will separate out batches of bytes for you, e.g. ZeroMQ, or a database field, then you can use a serialisation that doesn't demarcate messages. Examples include Google Protocol Buffers. With demarcation done by the transport / store, the reader can get a batch of bytes knowing for sure that it encompasses one object, and only one object.
If your message transport / data store doesn't demarcate between batches of bytes, e.g. a network stream or a file, then either you invent your own demarcation markers, or use a serialisation that demarcates for you. Examples include ASN.1 BER, XML.
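One common way of inventing your own demarcation is a length prefix in front of every serialised object. The Python sketch below assumes a 4-byte big-endian length, which is a choice made for the example rather than something any of the formats above prescribes; write_frame and read_frames are hypothetical helper names.

```python
import io
import struct


def write_frame(stream, payload: bytes) -> None:
    """Write one object's bytes preceded by a 4-byte big-endian length."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_frames(stream):
    """Yield each object's bytes in turn from a stream with no other markers."""
    while True:
        header = stream.read(4)
        if not header:
            return                       # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated payload")
        yield payload


buffer = io.BytesIO()
for message in (b"first", b"", b"third object"):
    write_frame(buffer, message)
buffer.seek(0)
frames = list(read_frames(buffer))
```

Each yielded payload can then be handed to a deserialiser that does not demarcate for itself.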
2) Canonical
This is a property of a serialisation which means that the serialised data describes its own structure. In principle the reader of a canonical message doesn't have to know up front what the message structure is; it can simply work that out as it reads the bytes (even if it doesn't know the field names). This can be useful in circumstances where you're not entirely sure where the data is coming from. If the data is not canonical, the reader has to know in advance what the object structure was, otherwise the deserialisation is ambiguous.
Examples of canonical serialisations include ASN.1 BER, ASN.1 canonical PER, XML. Ones that aren't include ASN.1 uPER, possibly Google Protocol Buffers (I may have that wrong).
AVRO does something different - the data schema is itself part of the serialised data, so it is always possible to reconstruct the object from arbitrary data. As you can imagine the libraries for this are somewhat clunky in languages like C, but rather better in dynamic languages.
3) Size and Value Constrained.
Some serialisation technologies allow the developer to set constraints on the values of fields and the sizes of arrays. The intention is that code generated from a schema file incorporating such constraints will automatically validate objects on serialisation and on deserialisation.
This can be extremely useful - free, schema driven content inspection done automatically. It's very easy to spot out-of-specification data.
This is extremely useful in large, heterogeneous projects (lots of different languages in use), as the truth about what's valid and what's not comes from the schema, and only the schema, and is enforced automatically by the auto-generated code. Developers can't ignore / get round the constraints, and when the constraints change everyone can't help but notice.
Examples include ASN.1 (usually done pretty well by tool sets), XML (not often done properly by free / cheap toolsets; MS's xsd.exe purposefully ignores any such constraints) and JSON (down to object validators). Of these three, ASN.1 has by far the most elaborate of constraints syntaxes; it's really very powerful.
Examples that don't - Google Protocol Buffers. In this regard GPB is extremely irritating, because it doesn't have constraints at all. The only way of having value and size constraints is to either write them as comments in the .proto file and hope developers read them and pay attention, or some other sort of non-sourcecode approach. With GPB being aimed very heavily at heterogeneous systems (literally every language under the sun is supported), I consider this to be a very serious omission, because value / size validation code has to be written by hand for each language used in a project. That's a waste of time. Google could add syntactical elements to .proto and code generators to support this without changing wire formats at all (it's all in the auto-generated code).
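To make the cost concrete, this is roughly the kind of validation that has to be written by hand, once per language, when the schema cannot express constraints. It is a Python sketch with invented names and limits (a Reading message holding at most 16 samples, each in 0..31); nothing here comes from a real .proto file or a real generator.

```python
from dataclasses import dataclass
from typing import List

MAX_SAMPLES = 16       # hypothetical size limit, kept by hand in every language
MAX_SAMPLE_VALUE = 31  # hypothetical value limit


@dataclass
class Reading:
    """Stand-in for a generated message class that carries no constraints."""
    samples: List[int]


def validate(reading: Reading) -> None:
    """Hand-written check that a constraint-aware generator could emit instead."""
    if len(reading.samples) > MAX_SAMPLES:
        raise ValueError(f"too many samples: {len(reading.samples)}")
    for position, value in enumerate(reading.samples):
        if not 0 <= value <= MAX_SAMPLE_VALUE:
            raise ValueError(f"sample {position} out of range: {value}")
```

If the limits change, every such hand-written copy has to be found and updated, which is the risk schema-level constraints remove.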
4) Binary / Text
Binary serialisations will be smaller, and probably a bit quicker to serialise / deserialise. Text serialisations are more debuggable. But it's amazing what can be done with binary serialisations. For example, one can easily add ASN.1 decoders to Wireshark (you compile them up from your .asn schema file using your ASN.1 tools), et voila - on the wire decoding of programme data. The same is possible with GPB I should think.
ASN.1 uPER is extremely useful in bandwidth constrained situations; it automatically uses the size / value constraints to economise on bits on the wire. For example, a field valid between 0 and 15 needs only 4 bits, and that's what uPER will use. I suspect it's no coincidence that uPER features heavily in 3G, 4G and 5G protocols. This "minimum bits" approach is a whole lot more elegant than compressing a text wireformat (which is what's done a lot with JSON and XML to make them less bloaty).
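The arithmetic behind "0 to 15 needs only 4 bits" is easy to sketch in Python. This illustrates the minimum-bits idea only; it is not a uPER encoder and ignores extensibility, length determinants and everything else in the real encoding rules. The function names are made up for the example.

```python
def bits_for_range(low: int, high: int) -> int:
    """Fewest bits that can distinguish every integer in low..high."""
    if high < low:
        raise ValueError("empty range")
    return (high - low).bit_length()    # a one-value range needs 0 bits


def pack_constrained(values, low: int, high: int) -> str:
    """Encode each value as its offset from low in a fixed-width bit field."""
    width = bits_for_range(low, high)
    fields = []
    for value in values:
        if not low <= value <= high:
            raise ValueError(f"{value} violates the constraint {low}..{high}")
        fields.append(format(value - low, "b").zfill(width) if width else "")
    return "".join(fields)
```

With these definitions a 16..31 field costs the same 4 bits as a 0..15 field, because only the offset from the lower bound is sent.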
5) Values
This is a bit of an oddity. In ASN.1 a schema file can define both the structure of objects, and also values of objects. With the better tools you end up with (in your C++, Java, etc. source code) classes, and pre-defined objects of those classes already filled in with values.
Why is that useful? Well, I use it a lot for defining project constants, and to give access to the limits on constraints. For example, suppose you've got an array field with a valid length of 16 in a message. You could have a literal 16 in the field constraint, or you could cite the value of an integer value object in the constraint, with the integer also being available to developers.
--ASN.1 value that, in good tools, is built into the
--generated source code
arraySize INTEGER ::= 16
--A SET that has an array of integers that size
MyMessage ::= SET
{
field [0] SEQUENCE (SIZE(arraySize)) OF INTEGER
}
This is really handy in circumstances where you want to loop over that constraint, because the loop can be
for (int i = 0; i < arraySize; i++) {do things with MyMessage.field[i];} // arraySize is an integer in the auto-generated code built from the .asn schema file
Clearly this is fantastic if the constraint ever needs to be changed, because the only place it has to be changed is in the schema, followed by a project recompile (where every place it's used will use the new value). Better still, if it's renamed in the schema file, a recompile identifies everywhere in the project it was used (because the developer-written source code that uses it is still using the old name, which is now an undefined symbol, hence compiler errors).
ASN.1 constraints can get very elaborate. Here's a tiny taste of what can be done. This is fantastic for system developers, but is pretty complicated for the tool developers to implement.
arraySize INTEGER ::= 16
minSize INTEGER ::= 4
maxVal INTEGER ::= 31
minVal INTEGER ::= 16
oddVal INTEGER ::= 63
MyMessage2 ::= SET
{
field_1 [0] SEQUENCE (SIZE(arraySize)) OF INTEGER, -- 16 elements
field_2 [1] SEQUENCE (SIZE(0..arraySize)) OF INTEGER, -- 0 to 16 elements
field_3 [2] SEQUENCE (SIZE(minSize..arraySize)) OF INTEGER, -- 4 to 16 elements
field_4 [3] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER, -- 5 to 15 elements
field_5 [4] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(0..maxVal), -- 5 to 15 elements valued 0..31
field_6 [5] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(minVal..maxVal), -- 5 to 15 elements valued 16..31
field_7 [6] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(minVal<..maxVal), -- 5 to 15 elements valued 17..31
field_8 [7] SEQUENCE (SIZE(arraySize)) OF INTEGER(minVal<..<maxVal), -- 16 elements valued 17..30
field_9 [8] INTEGER (minVal..maxVal | oddVal), -- valued 16 to 31, and also 63
f8_indx [10] INTEGER (0..<arraySize) -- index into field 8, constrained to be within the bounds of field 8
}
So far as I know, only ASN.1 does this, and even then it's only the more expensive tools that actually pick these elements up from a schema file. This makes it tremendously useful in a large project, because literally everything related to data, its constraints and how to handle it is defined in the .asn schema and nowhere else.
As I said, I use this a lot, for the right type of project. Once one has got it pervading an entire project, the amount of time and risk saved is fantastic. It changes the dynamics of a project too; one can make late changes to a schema knowing that the entire project picks those up with nothing more than a recompile. So, protocol changes late in a project go from being high risk to something you might be content to do every day.
6) Wireformat Object Type
Some serialisation wireformats will identify the type of an object in the wireformat bytestream. This helps the reader in situations where objects of many different types may arrive from one or more sources. Other serialisations won't.
ASN.1 varies from wireformat to wireformat (it has several, including a few binary ones as well as XML and JSON). ASN.1 BER uses tag (type), length and value fields in its wireformat, so it is possible for the reader to peek at an object's tag up front and decode the byte stream accordingly. This is very useful.
Google Protocol Buffers doesn't quite do the same thing, but if all message types in a .proto are bundled up into one final oneof, and it's only ever that which is serialised, then you can achieve the same thing.
7) Tools cost.
ASN.1 tools range from very, very expensive (and really good), to free (and less good). A lot of others are free, though I've found that the best XML tools (paying proper attention to value / size constraints) are quite expensive too.
8) Language Coverage
If you've heard of it, it's likely covered by tools for lots of different languages. If not, less so.
The good commercial ASN.1 tools cover C/C++/Java/C#. There are some free C/C++ ones of varying completeness.
9) Quality
It's no good picking up a serialisation technology if the quality of the tools is poor.
In my experience, GPB is good (it generally does what it says it will). The commercial ASN.1 tools are very good, eclipsing GPB's toolset comprehensively. AVRO works. I've heard of some occasional problems with Cap'n Proto, but having not used it myself you'd have to check that out. XML works with good tools.
10) Summary
In case you can't tell, I'm quite a fan of ASN.1.
GPB is incredibly useful too for its widespread support and familiarity, but I do wish Google would add value / size constraints to fields and arrays, and also incorporate a value notation. If they did this it'd be possible to have the same project workflow as can be achieved with ASN.1. If Google added just these two features, I'd consider GPB to be pretty well nigh on "complete", needing only an equivalent of ASN.1's uPER to finish it off for those people with little storage space or bandwidth.
Note that quite a lot of it is focused on what a project's circumstances are, as well as how good / fast / mature the technology actually is.

Can different program implementations have the same program semantics?

So for any given language, if we implement the same program (i.e. same output for any given input) twice, using different syntax (e.g. using i++ instead of i+1), will the two programs have the same semantics? Why?
Does the same apply in case where we use different constructs (i.e. Arrays vs Arraylists)?
Thanks
Yes. Depending on the programming language, there can be (combinations of) different syntax constructs with identical semantics.
For example, we can define a programming language with 3 constructs: A and B, both of which are semantically equivalent, and composition (e.g. XY for any X and Y, where each of these can be A, B or any composition thereof). Hence program A is equivalent to program B. Also AA is equivalent to AB, BA and BB, etc.
Further, if we extend the language with C which is semantically equivalent to AA, then, for example, BC is equivalent to AAA etc.
So for any given language, if we implement the same program (i.e. same output for any given input) twice, using different syntax (e.g. using i++ instead of i+1), will the two programs have the same semantics?
That question is a tautology. The answer is yes. Obviously.
If two different programs produce the same results for all possible input sets, then they do have the same semantics. By definition [1].
Why?
Because that is what "same semantics" means!
Does the same apply in case where we use different constructs (i.e. Arrays vs Arraylists)?
Yes.
(One data structure might use more memory, and that might cause an OOME for one version and not the other ... for certain input datasets. But then I would argue that the programs DO NOT produce the same results for all possible inputs.)
Note that this applies to all practical programming languages. Any programming language where there are programs that can only be written one way ... is probably too restrictive to be usable.
[1] OK, so anyone who has studied programming semantics would probably have a fit when they read that. But I am trying to provide an intuitive explanation rather than one that has a decent mathematical foundation. Horses for courses ... as they say.

translation from Datalog to SQL

I am still thinking about how to translate the recursion of a Datalog program into SQL, such as
P(x,y) <- Q(x,y).
Q(x,y) <- P(x,z), A(y).
where A/1 is an EDB predicate. Thus, there is a co-dependency between P and Q. For longer queries, how can this problem be solved?
Moreover, is there any system that completely implements the translation? If there is, which system or which paper should I refer to?
If you adopt an approach of "tabling" previous conclusions and forward-chain reasoning on these to infer new conclusions, no recursive "depth" is required.
Bear in mind that Datalog requires some restrictions on rules and variables that assure finite termination and hence finitely many conclusions. Variables must have a finite range of possible values, for example.
Let's assume your example refers to constants rather than to variables:
P(x,y) <- Q(x,y).
Q(x,y) <- P(x,z), A(y).
One wrinkle is that you want A/1 to be implemented as an extended stored procedure or external code. For that I would propose tabling all the results of calling A on all possible arguments (finitely many). These are after all among the conclusions (provable statements) of your system.
Once that is done the forward-chaining inference proceeds iteratively rather than recursively. At each step consider each rule, applying it with premises (right-hand sides) that are previously obtained (tabled) conclusions if it produces a new conclusion. If no rule produces a new conclusion in the current step, halt. The proof procedure is complete.
In your example the proofs stop after all the A facts are adduced, because there are no conclusions sufficient to apply either rule to get new conclusions.
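The iterative procedure can be sketched in Python for the two rules in the question. This is a naive fixpoint loop over sets of tuples, not an optimised (semi-naive) evaluator; the function name and the optional seed facts for P and Q are assumptions added so the loop has something to derive.

```python
def forward_chain(a_facts, p_facts=(), q_facts=()):
    """Apply both rules to the tabled conclusions until nothing new appears."""
    a = set(a_facts)          # A/1 facts, stored as plain values
    p = set(p_facts)          # P/2 conclusions tabled so far
    q = set(q_facts)          # Q/2 conclusions tabled so far
    while True:
        # P(x,y) <- Q(x,y).
        new_p = q - p
        # Q(x,y) <- P(x,z), A(y).
        new_q = {(x, y) for (x, _z) in p for y in a} - q
        if not new_p and not new_q:
            return p, q       # no rule produced a new conclusion: halt
        p |= new_p
        q |= new_q
```

With only A facts, forward_chain({5, 6}) returns two empty sets, which matches the observation that neither rule can fire. Seeding P with (1, 2) yields P = {(1, 2), (1, 5), (1, 6)} and Q = {(1, 5), (1, 6)} after a few rounds.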
A possible approach is to use recursive CTEs in SQL, which provide the power of transitive closure. Relational algebra + transitive closure = Datalog.
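A minimal sketch of the recursive CTE route, run through Python's sqlite3 module. SQLite's recursive CTEs are not mutually recursive, so the sketch first substitutes Q into P by hand, giving the single rule P(x,y) <- P(x,z), A(y); that substitution, the seed table p_base and the sample rows are all assumptions made for the example, not part of the question.

```python
import sqlite3


def derive_p():
    """Compute P as the fixpoint of P(x,y) <- P(x,z), A(y) over a seed table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE p_base(x, y);   -- hypothetical seed facts for P
        CREATE TABLE a(y);           -- the EDB predicate A/1
        INSERT INTO p_base VALUES (1, 2);
        INSERT INTO a VALUES (5), (6);
    """)
    rows = conn.execute("""
        WITH RECURSIVE p(x, y) AS (
            SELECT x, y FROM p_base
            UNION                    -- UNION (not UNION ALL) drops duplicates,
                                     -- so the recursion reaches a fixpoint
            SELECT p.x, a.y FROM p CROSS JOIN a
        )
        SELECT x, y FROM p ORDER BY x, y
    """).fetchall()
    conn.close()
    return rows
```

Q can then be read off as a non-recursive view over the result, since Q(x,y) holds exactly when some P(x,z) and A(y) hold.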
Logica does something like this. It translates a datalog-like language into SQL for Google BigQuery, PostgreSQL and SQLite.

How do I perform binary addition on a mod type in Ada?

Very specific issue here…and no this isn’t homework (left that far…far behind). Basically I need to compute a checksum for code being written to an EPROM and I’d like to write this function in an Ada program to practice my bit manipulation in the language.
A section of a firmware data file for an EPROM is being changed by me and that change requires a new valid checksum at the end so the resulting system will accept the changed code. This checksum starts out by doing a modulo 256 binary sum of all data it covers and then other higher-level operations are done to get the checksum which I won’t go into here.
So now how do I do binary addition on a mod type?
I assumed if I use the “+” operator on a mod type it would be summed like an integer value operation…a result I don’t want. I’m really stumped on this one. I don’t want to really do a packed array and perform the bit carry if I don’t have to, especially if that’s considered “old hat”. References I’m reading claim you need to use mod types to ensure more portable code when dealing with binary operations. I’d like to try that if it’s possible. I'm trying to target multiple platforms with this program so portability is what I'm looking for.
Can anyone suggest how I might perform binary addition on a mod type?
Any starting places in the language would be of great help.
Just use a modular type, for which the operators do unsigned arithmetic.
type Word is mod 2 ** 16;
for Word'Size use 16;
Addendum: For modular types, the predefined logical operators operate on a bit-by-bit basis. Moreover, "the binary adding operators + and – on modular types include a final reduction modulo the modulus if the result is outside the base range of the type." The function Update_Crc is an example.
Addendum: §3.5.4 Integer Types, ¶19 notes that for modular types, the results of the predefined operators are reduced modulo the modulus, including the binary adding operators + and –. Also, the shift functions in §B.2 The Package Interfaces are available for modular types. Taken together, the arithmetical, logical and shift capabilities are sufficient for most bitwise operations.
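For comparison, the wrap-around that "+" performs on an Ada type declared as mod 2 ** 8 is the same arithmetic as the explicit reduction in this Python sketch of the modulo 256 sum from the question. It is not Ada code, and checksum_mod_256 is a made-up name; the further checksum steps the question mentions are not shown.

```python
def checksum_mod_256(data: bytes) -> int:
    """Sum all bytes, reducing modulo 256 after every addition."""
    total = 0
    for byte in data:
        # In Ada, "+" on a `mod 2 ** 8` type performs this reduction itself.
        total = (total + byte) % 256
    return total
```

For example, the bytes 200 and 100 sum to 300, which reduces to 44.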