(L)BNF grammar for GAMS - system

Is there a place I can find Backus–Naur Form (BNF) grammar for General Algebraic Modeling System (GAMS)? I have searched the web but have turned up empty handed. I figure it must exist somewhere.

Related

It is possible to use Perl6 grammar on raster data ? (Use Case : Cloud Optimized GeoTIFF Validation)

A few questions scratch an itch around perl6 grammars and raster (binary in general) data. For what I understand, the text approach is to work at the grapheme-level trhough grammars, may we approach raster data that way ? Can we make custom grapheme definition to approach raster data or a basic unit of binary data to parse them using Grammars ?
Seeing that perl6 is defined by perl6 grammars, can we define similar grammars as kind of "validation" test with a basic case being if the grammar can parse the data, the data is well-formed and is structurally validated ? Using this approach for text data, it is kind of obvious with grammars as the basic unit are text-oriented but can we customize those back-end definition (by example, it's possible to overwrite the :sigspace to make rules and tokens parse with a another separator for grapheme) to enable the power of grammars in the binary data territory ?
Thanks!
For the background part:
During the past few weeks, I begin to learn-ish Perl6 by personal interest. After seeing this talk at FOSDEM 2019 and I begin to ask myself (and the people around me) about using using grammars to inspect/parse binary data. My usecase will be for example to replicate the Cloud Optimized Geotiff validator without the support of a GDAL binding (I didn't see one yet in perl6). It's clearly a learning project for me.
The Spec for Cloud Optimized Geotiff
For now, the basic idea is to parse the binary structure with the help of perl6 grammars if it possible as a first basic step, hoping to be able to inspect the data and metadata as a main goal.
Note : Not native speaker, if some parts need rewriting/precisions feel free to point out.
As only comments where posted, I will summarize all the answers I got from the comments here, my further research and the #perl6 IRC chatroom.
Concerning the support of binding for X library (in the test case, it was GDAL), the strategy inside the perl6 community is to either leverage :
Use the Inline::Foo modules aiming at launching and accessing the ecosystem of the Foo language (by example : Inline::Perl5, Inline::Python and so on). List of Inline::X modules from the Perl6 Module Directory ;
Use or write a binding using NativeCall to bind to dynamic libraries who follow the C Calling Covention ;
Use or write a native perl6 module.
Concerning the parsing of binary data, I'll split the subject in two parts :
Generally speaking ;
Leveraging Grammars ;
1. Generally speaking
Leveraging the P5pack module or using Inline::Perl5 to use the unpack/pack is actually (with perl6.c) the best to parse binary data structure (the former seems favoraed as it's native module).
Go to see first comment from #raiph to a SO anwser showing a basic use case.
2. Leveraging the grammars
With perl6.c, grammars can only parse text.
However, the question about parsing binary data seems to be moderatly hot (based on feedbacks seen on the #perl6 irc channel) and a few to document, yet not implemetend, seems to pave the way with a hope to see it happens in a future (near or distant?).
The last part of the #raiph's anwser list a lot of ressources aiming at that direction. Moreover, in the Synopses 05 - Regexes and Rules : line 432 a :bytes modifier is evoked. We will have to see at which point those modifiers will be implemented and what is missing to bring them to the language.
On the #perl6 irc channel, MasterDuke said « also, i think the nqp binary reading/writing ops that jnthn recently specced and nine implemented were a prerequisite for anything further ». I still have to investigated what exactly he is talking about but it seems to go to the good direction.
One of the main point, IMO, is the related to the grapheme definition based on the UTF-8 one. If we were able to overwrite the grapheme definition to a custom one for specialized grammar as we can for now overwrite the :sigspace modifier to affect what is the separators for rulesand tokens, we will access a new way to operate around data structure and grammars. For now, the grapheme is defined in the string-level not the grammar-level or meta. See #timotimo comments linking to the UTF-8 document describing the Grapheme Cluster Boundary Rules.
A way to bend the rules was linked by #jjmererlo : Parsing GFX3 format with perl6 grammars.

How to Write a Source to Source Compiler API

I am doing a little research on source to source compilation but now that I am getting an understanding of Source to Source compilation. I am wondering are there any examples of API's for these source to source compilers.
I mean an Interface Descriptor to pass the source code of one programming language to another compiler to be compile? Please if so can you point me to these examples or could you give me tips (Just pure explanation) on writing one am still in research okay.
Oh I should note I am researching this for several days an I have came across things such as ROSE, DMS and LLVM. As said its purely research so I dont know whats the best approach I know I wouldn't use ROSE for it is only for C/C++. LLVMs' seems promising but I am new to LLVM. Oh my aim is to create a transpiler for 4 language support (Is that feasible). Which is why I just need expert Advice :)
Yes, you can have a procedural API for doing source-to-source translation. These are pretty straightforward in the abstract: define a core data structure to represent AST nodes, then define APIs to "parse file to AST", "visit tree nodes", "inspect tree nodes", "modify tree nodes", "spit out text". They get messy in the concrete, especially if the API is specific the language being translated; too much of the details of that language get wound into the APIs. While traditional, this is really a rather clumsy way to define source-to-source translators, because you then have to write tons of procedural code invoking the APIs to do the translation.
You can instead define them using a program transformation system (PTS) using source to source transformations based on surface syntax; these are patterns written using the notation of your to-be-compiled language, and your target-language, in the form of "if you see this, then replace it by that", operating on syntax trees not text strings. This means you can inspect the transforms simply by staring at them. So can your fellow programmer.
One such translation rule might look like:
rule tranlate_add_to(t: access_path, u: access_path):COBOL -> Java
" add \t to \u "
-> " \object_for\(\u\).\u += \object_for\(\t\).\t; ";
with a left-hand side "add \t to \u " specifying a COBOL fragment (this) to be replaced by the right-hand side " \object_for... " representing corresponding Java code (that). This rule uses a helper function "object_for" to decide where in a target Java program, a global variable in a the source COBOL program will be placed. (There's no avoiding writing such a function if you are translating Java to COBOL. You can argue about how sophisticated). In practice, the way such a rule works is the pattern ASTs of each side are constructed, and then the patterns are matched against a parsed AST; a match causes the corresponding subtree to be spliced into place where the match was found. (All this low level tree matching and splicing has to be done... procedurally, but somebody else has already implemented that in a PTS).
In our experience, you need one to two thousand such rules to translate one language to another. The plethora of rules comes from the combinatorics of language syntax constructs for the source language (and their perhaps different interpretations according to types; "a+b" means different things when a is an int vs when a is a string) and the target language opportunities. A nice plus of such rewrites is that one can build a somewhat simpler base translation, and apply additional rewrites from the target language to itself to clean up and optimize the translated result.
Many PTS are purely based on source-to-source surface syntax rewrites. We have found that combining both PTS and a procedural API, and making it possible to segue between them makes for very nice tool: you can use the rewrites where convenient, and procedural APIs where they don't work so well (the "object_for" function suggested above is easier to code as a procedure).
See lot more detail on how our DMS Software Reengineering Toolkit encodes such transformation rules (the one above is code in DMS style), in a language agnostic (well, parameterized) fashion. DMS offers a "pure" procedural API as OP requested with some 400 functions, but DMS encourages its users to lean heavily on the rewrites and only code as a little as necessary agains the procedural API. It would be "straightforward" (at least as straightforward as practical) to build your "4 language support" this way.
Don't underestimate the amount of effort to build such translators, even with a lot of good technical machinery as a foundation. Langauges tend to be complex beasts, and their translations doubly so. And you have to decide if you want a truly crummy translation or a good one.
I have been using ROSE compiler framework to write a source to source translator. ROSE can parse a language that it supports and create an AST from it. It provides different APIs (found in SageInterface) to perform transformation and analysis on the AST. After the transformation, you can unparse the transformed AST to produce your target source code.
If ROSE does not support parsing your input language, you can write your own parser while utilizing ROSE's SageBuilder API to build the AST. If your target language is one of the languages which ROSE supports, then you can rely on ROSE's unparser to get the target code. But if ROSE does not support your target language, then you can write your own unparser as well using different AST traversal mechanism provided by ROSE.

Antlr and PL/I grammar

Right now we would like to have the grammar of PL/I, COBOL based on Antlr4. Is there anyone provide these grammars
If not, can you please share your thought/experience on developing these grammars from scratch
Thanks
I assume you mean IBM PL/I and COBOL. (Not many other PL/Is around, but I don't think that really changes the answer much).
The obvious place to look for mature ANTLR grammars is ANTLR3 grammar library; no PL/1 or COBOL grammars there. The Antlr V4 (a very new, radical, backwards incompatible reengineering of ANTLR3) main page talks about Java and C#; no hint of PL/1 or COBOL there; given its newness, no surprise. If you are really lucky, somebody may have one they will give you and speak up.
Developing such grammars is difficult for several reasons (based on personal experience building production-quality parsers for these two specific items, using a very strong parser system different than ANTLR [see my bio for more details]):
The character set and column layout rules (columns 1-5, 6 and 72-80 are special) may be an issue: the languages you describe are typically written in EBCDIC historically in punch-card 80 column format without line break characters between lines. Translation to ASCII sometimes produces nasty glitches; the ASCII end-of-line character is occasionally found in the middle of COBOL literal strings as a binary value, but because it has the same exact code in EBCDIC and ASCII, after translation it will (be and) appear to be an ASCII newline break character. Character strings can also be long but split across multiple lines; but columns 72-80 by definition have to be ignored. Column 6 may contain a "D" character, which affects interpretation of the following source lines as "debug" or "not". This means you need to get 80 column processing right. I don't know what ANTLR has to support processing characters-in-column-areas. You'll also need to worry about DBCS encoding of string literals, and variations of that if the source code is used in non-English speaking countries, such as Japan.
These languages are large and complex; IBM has had 40 years to decorate them with cruft. The IBM COBOL manual is some 600 pages ... then you discover that COBOL also includes a Report Writer, which is another 600 page document. Capturing all the nuances of the lexical tokens and the grammar rules will take effort, and you have to do that from the IBM manuals, which don't contain nice BNF-style descriptions, which means guessing from the textual description and some examples. For COBOL, expect several thousand grammar rules; PL/1 is less complicated in the abstract. Expect a certain amount of "lies"; we've encountered a number of places where the reference documentation clearly says certain things are not legal, and yet the IBM compilers (based on real, running source code) accepts them, and vice versa. The only way you find these is by empirical experiments.
Both languages have constructs that are difficult to parse, e.g., requiring arbitrary lookahead and/or local ambiguity. ANTLR4 is much better than ANTLR3 from my understanding on these, but that doesn't mean these aspects will be easy. PL/1 is particularly nasty in this regard: it has no keywords, but hundreds of keywords-in-context. To resolve these one has to get the lexer and the parser to cooperate, and even then there may be many locally ambiguous parses. ANTLR3 doesn't do these well; ANTLR4 is supposed to be better but I don't know how it handles this, if it does at all.
To verify these parsers are right, you will need to run them on millions of lines of code (which means you have to have access to such code samples), and correct any errors you find. This takes a long time (in our case, several years of more or less continuous work/improvement to get production quality grammars that work on large code bases). You might be miraculously faster than this; good luck.
You need to build a preprocessor for COBOL (COPY ... REPLACING), whose details are poorly documented, and eventually another one for PL/1 (which I understand to be fully Turing capable).
After you build a parser, you need to capture a syntax tree; here ANTLR4 is supposed to be pretty good in that it will capture one for the grammar you give it. That may or may not be the AST you want; with several thousand grammar rules, I'd expect not. ANTLR3 requires you to add, manually, indications of where and how to form the AST.
After you get the AST, you'll want to do something with it. This means you will need to build at least symbol tables (mappings from identifier instances to their declarations and any related type information). ANTLR provides nothing special to support this AFAIK except for support in walking the ASTs. This, too, is hard to get right, COBOL has crazy rules about how an unqualified identifier reference can be interpreted as to a specific data field if there are no other conflicting interpretations. (There's lots more to Life After Parsing if you want to have good semantic information about the program; see my bio for more details; for each of these semantic aspects you have develop them and then for validation go back and run them on large code bases again.).
TL;DR
Building parsers (well, "front ends") for these languages is a lot of work no matter what parsing engine you choose. Likely explains why they aren't already in ANTLR's grammar zoo.
Have a look at the OpenSource Cobol-85 Parser from ProLeap, based on antlr4 and creating ASTs and ASGs as well.
And, best of all, it really works !
https://github.com/uwol/proleap-cobol-parser
I am not aware of a comparable PLI-grammar, but a very good start is the EBNF-definition from Ralf Lämmel (CWI, Amsterdam) & Chris Verhoef (WINS, Universiteit van Amsterdam)
http://www.cs.vu.nl/grammarware/browsable/os-pli-v2r3/

How Do I Design Abstract Semantic Graphs?

Can someone direct me to online resources for designing and implementing abstract semantic graphs (ASG)? I want to create an ASG editor for my language. Being able to edit the ASG directly has a number of advantages:
Only identifiers and literals need to be typed in and identifiers are written only once, when they're defined. Everything else is selected via the mouse.
Since the editor knows the language's grammar, there are no more syntax errors. The editor prevents them from being created in the first place.
Since the editor knows the language's semantics, there are no more semantic errors.
There are some secondary advantages:
Since all the reserved words are easily separable, a program can be written in one locale and viewed in other. On-the-fly changes of locale are possible.
All the text literals are easily separable, so changes of locale are easily made, including on-the-fly changes.
I'm not aware of a book on the matter, but you'll find the topic discussed in portions of various books on computer language. You'll also find discussions of this surrounding various projects which implement what you describe. For instance, you'll find quite a bit of discussion regarding the design of Scratch. Most workflow engines are also based on scripting in semantic graphs.
Allow me to opine... We've had the technology to manipulate language structurally for basically as long as we've had programming languages. I believe that the reason we still use textual language is a combination of the fact that it is more natural for us as humans, who communicate in natural language, to wield, and the fact that it is sometimes difficult to compose and refactor code when proper structure has to be maintained. If you're not sure what I mean, try building complex expressions in Scratch. Text is easier and a decent IDE gives virtually as much verification of correct structure.*
*I don't mean to take anything away from Scratch, it's a thing of beauty and is perfect for its intended purpose.

Shallow parsing with ANTLR

I'm trying to develop a solution able to extract, in a closed-context, certain actions.
For example, in a context of booking cinema tickets, if a user says:
"I'd like to go to the cinema tomorrow night, it would be Casablanca, I'd like to be at the last row, please"
I've designed grammars for getting the name of the film, desired seat, date and hour of the projection, etc.
However, though I've thought about ANTLR for developing such solution, I don't really know if it has such functionality, I mean, if I can define several root symbols.
ANTLR has methods of addressing ambiguities in grammars. These methods are in improved in ANTLR 4, but when it comes to processing ambiguous languages (especially human language), you'll face one giant limitation that will inevitably make ANTLR unsuitable for the task:
ANTLR eventually resolves an ambiguity by deciding that one specific option among multiple potential options is the correct solution. Since this resolution happens at a very early stage in the parsing process with ANTLR, it's very difficult to incorporate semantic logic in this decision making process (as opposed to logic involving syntax alone).
Edit: One thing that's particularly interesting about ANTLR 4 in the context of NLP is the fact that ANTLR 4 uses an augmented transition network as the basis for its parser. Somewhere in there I know it would be possible to modify it for use in natural language processing, but to date haven't figured out just how to make it work. Reference: I developed the optimized version of the ANTLR 4 runtime, which is currently slightly behind the reference branch but I'll catch up later this summer.
ANTLR isn't well suited to parse human languages: they're too ambiguous. Try NLP instead. Here's a list of natural language processing toolkits.