Semantic Model from a grammar - grammar

I would like to ask for some thoughts about two concepts: Domain Object and Semantic Model.
I really want to understand what a Domain Object / Semantic Model is for, and what it is not for.
As far as I've been able to figure out, given a grammar it is absolutely advisable to keep these concepts separate.
However, I can't quite figure out how to do it. For example, given this small grammar, how do you build a Domain Object or a Semantic Model?
That is exactly what I'm trying to figure out...
Most books suggest this approach for going through an AST: instead of translating directly while you walk the AST, you create a semantic model and then connect an interpreter to it.
Example (SQL Syntax Tree):
Instead of generating a SQL statement directly, I create a semantic model, and then I'm able to connect an interpreter that translates this semantic model into a SQL statement.
Abstract Syntax Tree -> Semantic Model -> Interpreter
This way, I could have one interpreter for Transact-SQL and another one for SQLite.
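A minimal sketch of that pipeline in Python, under stated assumptions: the dictionary standing in for the syntax tree, the `SelectQuery` class, and the functions `build_model`, `to_tsql` and `to_sqlite` are all hypothetical names invented for illustration. It only shows the shape of the idea: one dialect-neutral model, two interpreters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SelectQuery:
    """Semantic model: what the query means, independent of any SQL dialect."""
    table: str
    columns: tuple
    limit: Optional[int] = None


def build_model(ast: dict) -> SelectQuery:
    """Walk a (hypothetical, dict-shaped) syntax tree once and fill the model."""
    return SelectQuery(
        table=ast["from"],
        columns=tuple(ast["select"]),
        limit=ast.get("limit"),
    )


def to_tsql(q: SelectQuery) -> str:
    """Interpreter for Transact-SQL, which limits rows with TOP."""
    top = f"TOP ({q.limit}) " if q.limit is not None else ""
    return f"SELECT {top}{', '.join(q.columns)} FROM {q.table}"


def to_sqlite(q: SelectQuery) -> str:
    """Interpreter for SQLite, which limits rows with LIMIT."""
    limit = f" LIMIT {q.limit}" if q.limit is not None else ""
    return f"SELECT {', '.join(q.columns)} FROM {q.table}{limit}"


ast = {"select": ["id", "name"], "from": "users", "limit": 10}
model = build_model(ast)
print(to_tsql(model))    # SELECT TOP (10) id, name FROM users
print(to_sqlite(model))  # SELECT id, name FROM users LIMIT 10
```

Only the two small interpreter functions know about a dialect; the tree walk and the model are shared.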

The terms "domain object" and "semantic model" aren't really standard terms from the compiler literature, so you'll get lots of random answers.
The usual terms related to parsing are "concrete syntax tree" (which matches the shape of the grammar rules) and "abstract syntax tree" (an attempt to make a tree that contains less accidental detail, although it might not be worth the trouble).
Parsing is only a small part of the problem of processing a language. You need a lot of semantic interpretation of the syntax, however you represent it (AST, CST, ...). This includes concepts such as:
Name resolution (for every identifier, where is it defined? used?)
Type resolution (for every identifier/expression/syntax construct, what is the type of that entity?)
Type checking (is that syntax construct used in a valid way?)
Control flow analysis (what order are the program parts executed in, possibly even parallel/dynamic/constraint-determined)
Data flow analysis (where are values defined? consumed?)
Optimization (replacement of one set of syntax constructs by another semantically equivalent set with some nice property [executes faster after compilation is common]), at high or low levels of abstraction
High-level code generation, e.g., translating sets of syntactic constructs in the language into equivalent sets in the targeted [often assembly-language-like] language
Each of these concepts more or less builds on top of the preceding ones.
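To make the first few items concrete, here is a small sketch in Python of name resolution, type resolution and type checking layered on a tree. The tuple-shaped tree, the node kinds (`let`, `var`, `num`, `str`, `add`) and the function names are assumptions made up for this illustration; they do not come from ANTLR or any other tool.

```python
def type_of(expr, scope):
    """Resolve the type of an expression node, checking names and types."""
    kind = expr[0]
    if kind == "num":
        return "int"
    if kind == "str":
        return "str"
    if kind == "var":
        # Name resolution: the identifier must have been defined earlier.
        name = expr[1]
        if name not in scope:
            raise NameError(f"undefined identifier: {name}")
        return scope[name]
    if kind == "add":
        # Type checking: both operands must have the same type.
        left, right = type_of(expr[1], scope), type_of(expr[2], scope)
        if left != right:
            raise TypeError(f"cannot add {left} and {right}")
        return left
    raise ValueError(f"unknown node kind: {kind}")


def analyze(program):
    """Process ('let', name, expr) statements in order; return name -> type."""
    scope = {}
    for _let, name, expr in program:
        scope[name] = type_of(expr, scope)
    return scope


program = [
    ("let", "x", ("num", 1)),
    ("let", "y", ("add", ("var", "x"), ("num", 2))),
]
print(analyze(program))  # {'x': 'int', 'y': 'int'}
```

Type checking here depends on type resolution, which depends on name resolution, which is the layering described above.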
The closest I can come to "semantic model" is that high-level code generation. That takes a lot of machinery that you have to build on top of trees.
ANTLR parses. You have to do/supply the rest.

Is Graql compiled or translated to gremlin?

I'm wrapping my head around Grakn a little to understand its added value, and I wonder whether Graql is compiled or translated to Gremlin traversal steps.
This makes me wonder about the difference in expressivity between SPARQL and Graql, given that the former has so far not been fully translated into Gremlin; that seems to be an open problem. Is Graql fundamentally simpler than SPARQL, and would that explain why it is fully translated, if that is the case? If not, is there any limitation in translating it to Gremlin steps at this point?
I'll try to shed some light on your questions.
To begin with, Graql was designed to be a high-level, human-readable query language. The main idea was to abstract the node-edge graph data structure into concepts that are specific to a given user-defined domain. That way users don't need to worry about the underlying graph representation and low-level Gremlin constructs; instead they can work with high-level terms they defined themselves or are already familiar with.
Now, implementation-wise, Graql is an abstraction over Gremlin: it translates the high-level queries to Gremlin traversals, which can then be executed against a specific graph. However, the mapping between Graql and Gremlin is not 1-1. In fact, Graql uses only a subset of Gremlin, the part needed to capture the intended behaviours of the Graql language. It was never our intention to find such a mapping, as the goal was to translate high-level queries to queries understandable by the underlying graph processor.
Now, on the efficiency of the traversal generation. Graql queries can be decomposed into properties (has, isa, sub, etc.) and fragments. Each fragment has a defined Gremlin counterpart, and each property can contain multiple fragments. The fragment translation is unambiguous; however, there is a lot of freedom in picking and arranging the fragments that go into a property. Keeping in mind that queries contain multiple properties, this makes the arrangement a non-trivial task. To perform this arrangement, which in Gremlin is left to the user, we implemented a query processor. The idea of the processor is to pick an arrangement and ordering of the fragments such that the resulting query executes as fast as possible. This is reminiscent of SQL query processors, and the motivation is exactly the same: to abstract the query optimisation away from the user.
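The flavour of that arrangement problem can be sketched with a toy greedy planner in Python. This is not Grakn's actual planner: the `Fragment` fields, the idea that each property offers alternative fragments, the cost numbers and the greedy strategy are all assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    needs: frozenset  # variables that must already be bound
    binds: frozenset  # variables bound once this fragment has run
    cost: float       # made-up cost estimate


def plan(properties):
    """properties maps a property to its alternative fragments.

    Greedily pick, at each step, the cheapest fragment that can run with
    the variables bound so far, using one fragment per property.
    """
    bound, pending, ordered = set(), dict(properties), []
    while pending:
        ready = [
            (frag.cost, prop, frag)
            for prop, alternatives in pending.items()
            for frag in alternatives
            if frag.needs <= bound
        ]
        if not ready:
            raise ValueError("no fragment can run with the bound variables")
        _, prop, best = min(ready, key=lambda entry: entry[0])
        ordered.append(best)
        bound |= best.binds
        del pending[prop]
    return ordered


fs = frozenset
query = {
    "n value 'Alice'": [Fragment("index lookup on value", fs(), fs("n"), 1.0)],
    "x has name n": [
        Fragment("owner -> attribute", fs("x"), fs("n"), 5.0),
        Fragment("attribute -> owner", fs("n"), fs("x"), 2.0),
    ],
    "x isa person": [
        Fragment("scan all persons", fs(), fs("x"), 1000.0),
        Fragment("check type", fs("x"), fs(), 0.5),
    ],
}
print([f.name for f in plan(query)])
# ['index lookup on value', 'attribute -> owner', 'check type']
```

A greedy choice like this is cheap to compute but can miss the best overall ordering, which is one reason a real planner is harder than this sketch suggests.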
We are actively working on the query planning component, and although it gives no guarantee of producing the optimal plan in all cases, we are trying to make the produced plans converge to optimal solutions.

ANTLR: Source to Target Language Conversion

I have a fair understanding of ANTLR and grammars. Is it correct to say that ANTLR can do source language to target language conversion, like ASP to JSP or COBOL to JSP? If yes, could you point me to some information/tutorials/links to explore the possibilities?
The idea is to programmatically translate huge amounts of code from source to target using ANTLR.
Thanks
The basic steps to building a translator in Antlr4 are to:
generate a parse tree from an input text in the source language
repeatedly walk the parse tree to analyze its nodes, adding and evolving properties (decorator pattern) associated with individual parse tree nodes -- the properties will describe the change(s) required to represent the content of the node in the target language.
perform a final walk of the parse tree to collect and output the target language text.
The form and content of the properties and the progression of creation and evolution will be entirely dependent on the nature of the source and target languages and the architect's conversion strategy.
Since Antlr parse-tree walks can be logically independent of one another, specific conversion aspects can be addressed in separate walks. For example, one walk can evaluate (possibly among other things) whether individual perform until statements will be converted to if or while statements. Another walk can be dedicated to analyzing variable names to ensure they are created/accessed in the correct scope and determining the naming and scope of any target language required temporary variables. Etc.
Given that the conversion is a one-time affair, there is no fundamental penalty to implementing 5, 10, or even more walks. Just the 'whatever makes sense in your case' practicality.
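The multi-walk idea can be sketched without ANTLR at all. The Python below uses a hand-built tree in place of a real parse tree, and a side table keyed by node identity to hold the per-node properties. The `Node` class, the node kinds, the property names and the output format are all hypothetical; the loop conversion assumes the condition is tested before each iteration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


props = {}  # side table: id(node) -> properties attached to that node


def prop(node):
    return props.setdefault(id(node), {})


def walk(node, visit):
    """Apply one analysis pass to every node, parents before children."""
    visit(node)
    for child in node.children:
        walk(child, visit)


def decide_loops(node):
    # Walk 1: a loop that runs *until* a condition becomes `while not`.
    if node.kind == "perform_until":
        prop(node)["loop"] = "while not"


def rename_identifiers(node):
    # Walk 2: map source names onto names that are legal in the target.
    if node.kind == "ident":
        prop(node)["target_name"] = node.text.lower().replace("-", "_")


def emit(node, depth=0):
    """Final walk: collect target text from the properties set earlier."""
    pad = "    " * depth
    if node.kind == "program":
        return "\n".join(emit(c, depth) for c in node.children)
    if node.kind == "perform_until":
        cond, *body = node.children
        head = f"{pad}{prop(node)['loop']} ({emit(cond)}):"
        return "\n".join([head] + [emit(s, depth + 1) for s in body])
    if node.kind == "call":
        args = ", ".join(emit(a) for a in node.children)
        return f"{pad}{node.text.lower()}({args})"
    if node.kind == "ident":
        return prop(node)["target_name"]
    raise ValueError(f"unknown node kind: {node.kind}")


tree = Node("program", children=[
    Node("perform_until", children=[
        Node("ident", "WS-DONE"),
        Node("call", "STEP", [Node("ident", "WS-COUNT")]),
    ]),
])
for analysis in (decide_loops, rename_identifiers):
    walk(tree, analysis)
print(emit(tree))
```

Each pass only reads the tree and writes properties, so passes can be added, removed or reordered without touching the others.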
The (relevant) caveat addressed in the other QA is how to handle conversions where there is no simple or near identity between statements in the two languages. Converting a unique source language statement then requires creating a target language run-time package that implements the corresponding function.
GenPackage (I am the author) automates the generation of a basic conversion project. The generated project represents but one possible architectural approach and leaves substantial work to be done to tailor it to any particular end use.

General strategy for designing a flexible language application using ANTLR4

Requirement:
I am trying to develop a language application using ANTLR4. The language in question is not important. The important thing is that the grammar is vast (easily more than 2000 rules). I want to perform a number of operations:
Extract a bunch of information, such as call graphs, variable names, constant expressions, etc.
Any number of transformations:
If a loop can be expanded, we go ahead and expand it.
If we can eliminate dead code, we might choose to do that.
We might choose to rename all variables to conform to some norms.
Each of these operations can be applied independently of the others. After applying these steps, I want to rewrite the input so that it stays as close as possible to the original.
For example, we might want to eliminate loops and rename the variables, and then output the result in the original language format.
Questions:
I see a need to build a custom Tree (read AST) for this. So that I can modify the tree with each of the transformations. However when I want to generate the output, I lose the nice abilities of the TokenStreamRewriter. I have to specify how to write each of the nodes of the tree and I lose the original input formatting for the places I didn't do any transformations. Does antlr4 provide a good way to get around this problem?
Is an AST the best way to go? Or do I build my own object representation? If so, how do I create that object efficiently? Creating an object representation is a very big pain for such a vast language, but it may be better in the long run. Again, how do I get back the original formatting?
Is it possible to work just on the parse tree?
Are there similar language applications which do the same thing? If so what strategy do they use?
Any input is welcome.
Thanks in advance.
In general, what you want is called a Program Transformation System (PTS).
PTSs generally have parsers, build ASTs, can prettyprint the ASTs to recover compilable source text. More importantly, they have standard ways to navigate/inspect/modify the ASTs so that you can change them programmatically.
Many offer these capabilities in the form of pattern-matching code fragments written in the surface syntax of the language being transformed; this avoids forever having to know excruciatingly fine details about which nodes are in your AST and how they are related to their children. This is incredibly useful when you have big, complex grammars, as most of our modern (and legacy) languages seem to have.
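The parse / modify / prettyprint cycle can be tried on a small scale with Python's standard `ast` module (Python 3.9+ for `ast.unparse`). This is not a PTS, and it transforms Python rather than an arbitrary language; the `RenameToSnakeCase` class and `to_snake` helper are illustrative names. The edit is written procedurally against node types, which is exactly the detail that surface-syntax patterns let you avoid.

```python
import ast
import re


def to_snake(name: str) -> str:
    """Convert camelCase to snake_case, e.g. totalCount -> total_count."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


class RenameToSnakeCase(ast.NodeTransformer):
    """Rewrite every variable reference to follow a naming norm."""

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = to_snake(node.id)
        return node


source = "totalCount = itemCount + 1  # running total\n"
tree = ast.parse(source)                # parse: text -> AST
tree = RenameToSnakeCase().visit(tree)  # transform: modify the AST
print(ast.unparse(tree))                # prettyprint: AST -> text
# total_count = item_count + 1
```

Note that the comment and the double space from the input are gone in the output: a plain AST round trip regenerates the text and does not preserve the original formatting, which is the problem raised in the question.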
More sophisticated PTSs (very few) provide additional facilities for teasing out the semantics of the source code. It is pretty hard to analyze/transform most code without knowing what scopes individual symbols belong to, or their type, and many other details such as data flow. Full disclosure: I build one of these.

Generating random but still valid expressions based on yacc/bison/ANTLR grammar

Is it possible? Any tool available for this?
You can do this with any system that gives you access to the base grammar. ANTLR and YACC compile your grammar away, so you don't have it anymore. In ANTLR's case, the grammar has been turned into code; you're not going to get it back. In YACC's case, you end up with parser tables, which contain the essence of the grammar; you could walk such parse tables, if you understood them well enough, to do what I describe below.
It is easy enough to traverse a set of explicitly represented grammar rules and randomly choose expansions/derivations. By definition this will get you valid syntax.
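A sketch of that random derivation in Python, assuming the grammar is available as plain data. The toy arithmetic grammar, the `generate` function and the depth limit are made up for this example; the depth limit is one simple way to force termination, by always taking the first (non-recursive) alternative once the budget is used up.

```python
import random

# Explicitly represented grammar rules: nonterminal -> list of alternatives.
# The first alternative of each rule leads to a finite derivation.
GRAMMAR = {
    "expr": [["term"], ["term", "+", "expr"], ["term", "-", "expr"]],
    "term": [["factor"], ["factor", "*", "term"]],
    "factor": [["NUMBER"], ["(", "expr", ")"]],
}


def generate(symbol, rng, depth=0, max_depth=8):
    """Return a list of tokens derived from `symbol` by random expansion."""
    if symbol == "NUMBER":
        return [str(rng.randint(0, 9))]
    if symbol not in GRAMMAR:
        return [symbol]  # a literal terminal such as "+" or "("
    alternatives = GRAMMAR[symbol]
    # Past the depth budget, stop choosing randomly so the derivation ends.
    chosen = alternatives[0] if depth >= max_depth else rng.choice(alternatives)
    tokens = []
    for part in chosen:
        tokens.extend(generate(part, rng, depth + 1, max_depth))
    return tokens


rng = random.Random(0)  # fixed seed so runs are repeatable
sentence = " ".join(generate("expr", rng))
print(sentence)
```

Every sentence produced this way is derivable from the grammar, so it is syntactically valid by construction.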
What it won't do is get you valid code. The problem here is that most languages really have context-sensitive syntax; most programs aren't valid unless the declared identifiers are used in a way consistent with their declaration and scoping rules. The latter requires a full semantic check.
Our DMS Software Reengineering Toolkit is used to parse code in arbitrary languages [using a grammar], build ASTs, analyze and transform those trees, and finally prettyprint valid (syntactic) text. DMS provides direct access to the grammar rules and to tree-building facilities, so it is pretty easy to generate random syntactic trees (and prettyprint them). Making sure they are semantically valid is hard with DMS too; however, many of DMS's front ends can take a (random) tree and do semantic checking, so at least you'd know whether the tree was semantically valid.
What you do if it says "no" is still an issue. Perhaps you can generate identifier names in a way that guarantees at least not-inconsistent usage, but I suspect that would be language-dependent.
yacc and bison turn your grammar into a finite state machine. You should be able to traverse the state machine randomly to find valid inputs.
Basically, at each state you can either shift a new token on to the stack and move to a new state or reduce the top token in the stack based on a set of valid reductions. (See the Bison manual for details about how this works).
Your random generator will traverse the state machine making random but valid shifts or reductions at each state. Once you reach the terminal state you have a valid input.
For a human readable description of the states you can use the -v or --report=state option to bison.
I'm afraid I can't point you to any existing tools that can do this.

Programming constructs

A wise man told me that learning how a syntax works does not make you a good programmer; what matters is grasping programming constructs like iterators and conditionals, which means you can pick up any syntax more easily.
How would one go about learning these constructs?
The easiest construct you mention is a conditional.
The basic pattern of a conditional is:
if <some-condition> then
    <do-action>
else
    <do-other-action>
end if
This basic pattern is expressed in many different ways according to the language of choice, but is the basic decision-making building block of any program.
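As one concrete rendering of the pattern, here is how it might look in Python; the function name and the threshold are arbitrary choices for the example.

```python
def describe(temperature_c):
    # <some-condition>
    if temperature_c >= 30:
        return "hot"      # <do-action>
    else:
        return "not hot"  # <do-other-action>


print(describe(35))  # hot
print(describe(12))  # not hot
```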
An iterator is a construct which abstracts the physical layout of a data structure, allowing you to iterate (pass through) it without worrying about where in memory each element in the data structure is.
So, for example, you can define a data structure such as any of Array, Vector, Deque, Linked List, etc.
When you go to iterate, or pass through the data structure one element at a time, the iterator presents you with an interface in which each element in the data structure follows sequentially, allowing you to loop through with a basic for loop structure:
for <element> in <data-structure>
    <do-action>
end loop
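A short Python sketch of the same idea: one loop, written once, works over a built-in list, a deque and a hand-made linked list, because each of them exposes an iterator. The `LinkedList` class and the `total` function are illustrative names, not part of any library.

```python
from collections import deque


class LinkedList:
    """A minimal singly linked list that supports iteration."""

    class _Node:
        def __init__(self, value, next_node=None):
            self.value = value
            self.next_node = next_node

    def __init__(self, values=()):
        self._head = None
        # Build back to front so iteration yields the original order.
        for value in reversed(list(values)):
            self._head = self._Node(value, self._head)

    def __iter__(self):
        # The iterator hides how the elements are linked together in memory.
        node = self._head
        while node is not None:
            yield node.value
            node = node.next_node


def total(items):
    """The same for loop works for any iterable data structure."""
    result = 0
    for item in items:
        result += item
    return result


print(total([1, 2, 3]), total(deque([1, 2, 3])), total(LinkedList([1, 2, 3])))
# 6 6 6
```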
As for other constructs, take a look at some books on Data Structures and Algorithms (usually a 2nd-year level computer science course).
Syntax is only a technical form for expressing your solution. The way you implement your solution and the concepts you use in it are what make the difference between a beginner and an experienced developer. Programming languages are the means, not the wits!