How to recognize non reserved keywords in yacc? [duplicate] - sql

I am using Flex & bison on Linux. I have the following setup:
// tokens
CREATE { return token::CREATE;}
SCHEMA { return token::SCHEMA; }
RECORD { return token::RECORD;}
[_a-zA-Z0-9][_a-zA-Z0-9]* { yylval->strval = strdup(yytext); return token::NAME;}
...
// rules
CREATE SCHEMA NAME ...
CREATE RECORD NAME ...
...
Everything worked just fine. But if users enter: "create schema record ..." (where 'record' is the name of the schema to be created), the parser reports a syntax error, since Flex returns 'record' as the RECORD token and there is no rule "CREATE SCHEMA RECORD". I understand that keywords can be escaped, but that makes for an awkward user experience. My question is:
"How can I design the above rules so that it accepts 'create schema record ...' and matches this input to 'CREATE SCHEMA NAME ...'?"
Thanks!

"Semi-reserved" words are common in languages which have a lot of reserved words. (Even modern C++ has a couple of these: override and final.) But they create some difficulties for traditional scanners, which generally assume that a keyword is a keyword.
The lemon parser generator, which not coincidentally was designed for parsing SQL, has a useful "fallback" feature, where a token which is not valid in context can be substituted by another token (without changing the semantic value). Unfortunately, bison does not implement this feature, nor does any other parser generator I know of. However, in many cases it is possible to implement the feature in bison grammars. For example, in the simple case presented here, we can substitute:
create_statement: CREATE RECORD NAME ...
| CREATE SCHEMA NAME ...
with:
create_statement: CREATE RECORD name
| CREATE SCHEMA name
name: NAME
| CREATE
| RECORD
| SCHEMA
| ...
Obviously, care needs to be taken that the (semi-)keywords in the list of alternatives for name are not valid in the context in which name is used. This may require the definition of a variety of name productions, valid for different contexts. (This is where lemon-style fallbacks are more convenient.)
If you do this, it is important that the semantic values of the keywords be correctly set up, either by the scanner or by the reduction rule of the name non-terminal. If there is only one name non-terminal, it is probably more efficient to do it in the reduction actions (because it avoids unnecessary allocation and deallocation of strings, where the deallocation will complicate the other grammar rules in which the keywords appear), so that the name rule would actually look like this:
name: NAME
| CREATE { $$ = strdup("CREATE"); }
| RECORD { $$ = strdup("RECORD"); }
| SCHEMA { $$ = strdup("SCHEMA"); }
| ...
There are, of course, many other possible ways to deal with the semantic value issue.
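To make the idea concrete outside of bison, here is a minimal hand-written sketch in Java of the same trick: the name position accepts either a plain name or any of the keywords, and a keyword used as a name gets its spelling as its semantic value. The class and method names are hypothetical, keywords are assumed to be case-insensitive (the question types them in lower case), and the trailing "..." of the real statements is ignored, so this illustrates the grammar shape rather than replacing the generated parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy hand-written parser: keywords are accepted wherever a name is expected. */
class SemiReservedDemo {
    // Keyword set taken from the question; the class and method names are hypothetical.
    static final Set<String> KEYWORDS = Set.of("CREATE", "SCHEMA", "RECORD");

    /** Splits on whitespace only; a real Flex scanner does far more. */
    static List<String> scan(String input) {
        List<String> words = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Assumption: keywords are matched case-insensitively. */
    static boolean isKeyword(String word, String keyword) {
        return word.toUpperCase(Locale.ROOT).equals(keyword);
    }

    /** name: NAME | CREATE | RECORD | SCHEMA */
    static String parseName(String word) {
        String upper = word.toUpperCase(Locale.ROOT);
        if (KEYWORDS.contains(upper)) {
            return upper; // plays the role of $$ = strdup("RECORD")
        }
        if (!word.matches("[_a-zA-Z0-9]+")) {
            throw new IllegalArgumentException("not a name: " + word);
        }
        return word;
    }

    /** create_statement: CREATE SCHEMA name | CREATE RECORD name */
    static String parseCreate(String input) {
        List<String> words = scan(input);
        if (words.size() != 3 || !isKeyword(words.get(0), "CREATE")) {
            throw new IllegalArgumentException("expected: CREATE SCHEMA|RECORD name");
        }
        String kind;
        if (isKeyword(words.get(1), "SCHEMA")) {
            kind = "schema";
        } else if (isKeyword(words.get(1), "RECORD")) {
            kind = "record";
        } else {
            throw new IllegalArgumentException("expected SCHEMA or RECORD, got " + words.get(1));
        }
        return kind + ":" + parseName(words.get(2));
    }

    public static void main(String[] args) {
        System.out.println(parseCreate("create schema record")); // schema:RECORD
        System.out.println(parseCreate("create record orders")); // record:orders
    }
}
```

The point to notice is that the second word is still checked against the keyword list, while the third word is never rejected for being a keyword. That is exactly what the name non-terminal achieves in the grammar.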

You shouldn't do this, for the same reason you can't have a variable in C++ named for, while, or class. But if you really want to, look into Start Conditions (it'll be messy).

Related

How to implement type checking in a Listener

My script grammar contains the following:
if_statement
: IF condition_block (ELSE IF condition_block)* (ELSE statement_block)?
;
condition_block
: expression statement_block
;
expression
: expression op=(LTEQ | GTEQ | LT | GT) expression #relationalExpression
| expression op=(EQ | NEQ) expression #equalityExpression
| expression AND expression #andExpression
| expression OR expression #orExpression
| atom #atomExpression
;
atom
: OPAR expression CPAR #parenExpression
| INT #numberAtom
| (TRUE | FALSE) #booleanAtom
| STRING #stringAtom
;
What I would like to do, is to make sure that the user doesn't compare e.g. an INT to a STRING.
I use a Listener to provide errors to the user when they create a script. So what I want to do is something like
public override void EnterRelationalExpression([NotNull] ScriptEvaluatorParser.RelationalExpressionContext context)
{
<..compare context.expression(0) to context.expression(1) here
and add an error if not the same base type...>
base.EnterRelationalExpression(context);
}
Doing this in a Visitor is easy
object left = Visit(context.expression(0));
object right = Visit(context.expression(1));
<...compare types...>
But how do I do the same in the Listener? I can new up a Visitor and do it that way, but I was wondering if there is a better way to do the check without having to new up a Visitor.
I’ve done this before by adding a type stack to my listener.
I use the exit*() listener hooks (you can’t really have any useful information about children in the enter*() methods, as the children have not been visited).
As an expression is exited, I can determine the type directly if it’s a simple type (or by looking its type up in a symbol table if it’s an identifier), then push the type on the type stack. For expressions like your equalityExpression, I pop the top two items from the type stack and check their compatibility (and, of course, then push a boolean type on the type stack).
For and and or expressions, just pop the top two items, ensure they’re boolean and then push boolean.
This does depend on having a symbol table available to resolve identifier types, and is a bit of a work-around for listeners not returning values, but it has worked well for me. I like the visitor handling the navigation and ensuring all nodes are visited. But, as Bart mentions, if you’re comfortable with using visitors to accomplish this, there’s not really one way that’s “better” than another.
You can also look into adding locals to your rules to hold that resulting type. This avoids the need for a type stack, and the management of that stack, but makes your grammar target language specific (which I like to avoid). You’d still need to leverage the exit*() methods since children would have to be visited before the locals were populated (BTW, locals are just a way of adding additional fields to the ParseTreeContext for nodes.)
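The type-stack approach described above can be sketched roughly as follows. This is a standalone Java illustration with no ANTLR dependency: the exit* method names mirror the labelled alternatives in the question's grammar, but in a real listener they would override the generated base listener and take a context argument. The Type enum and the error message format are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Sketch of a type stack driven by listener exit hooks (no ANTLR dependency). */
class TypeStackChecker {
    enum Type { INT, BOOLEAN, STRING }

    private final Deque<Type> types = new ArrayDeque<>();
    final List<String> errors = new ArrayList<>();

    // Atoms know their own type, so they simply push it.
    void exitNumberAtom()  { types.push(Type.INT); }
    void exitBooleanAtom() { types.push(Type.BOOLEAN); }
    void exitStringAtom()  { types.push(Type.STRING); }
    // atomExpression and parenExpression need no hook:
    // the child's type is already on top of the stack.

    void exitRelationalExpression() { checkSameType("relational"); }
    void exitEqualityExpression()   { checkSameType("equality"); }
    void exitAndExpression()        { checkBothBoolean("and"); }
    void exitOrExpression()         { checkBothBoolean("or"); }

    private void checkSameType(String kind) {
        Type right = types.pop(); // the left operand was pushed first, so right is on top
        Type left = types.pop();
        if (left != right) {
            errors.add(kind + ": cannot compare " + left + " to " + right);
        }
        types.push(Type.BOOLEAN); // a comparison always yields a boolean
    }

    private void checkBothBoolean(String kind) {
        Type right = types.pop();
        Type left = types.pop();
        if (left != Type.BOOLEAN || right != Type.BOOLEAN) {
            errors.add(kind + ": operands must be BOOLEAN, got " + left + " and " + right);
        }
        types.push(Type.BOOLEAN);
    }

    public static void main(String[] args) {
        // Exit order a tree walker would produce for: 1 < "a"
        TypeStackChecker checker = new TypeStackChecker();
        checker.exitNumberAtom();
        checker.exitStringAtom();
        checker.exitRelationalExpression();
        System.out.println(checker.errors); // [relational: cannot compare INT to STRING]
    }
}
```

Because the walker exits children before their parent, the two operand types are always on the stack by the time the enclosing expression is exited.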

How does `Yacc` identify function calls?

I am trying to figure out how yacc identifies function calls in C code. For example: if there is a function call like my_fun(a,b); then which rules does this statement reduce to?
I am using the C grammar found here: C Grammar
Following that grammar manually, I figured out that we only have two choices in a translation unit: everything has to be either a function definition or a declaration. Now, all declarations start with type_specifier, storage_class_specifier, etc., but none of them starts with IDENTIFIER.
Now in case of a function call the name would be IDENTIFIER. This leaves me unclear as to how it will be parsed and which rules will be used exactly?
According to the official yacc specification, everything is handled by user-supplied routines. When you have a function call, the name is of course an IDENTIFIER, and it is parsed using the user-defined procedures. According to the specification, the user can specify the input in terms of individual input characters, or in terms of higher-level constructs such as names and numbers. The user-supplied routine may also handle idiomatic features such as comment and continuation conventions, which typically defy easy grammatical specification.
Do have a look. By the way, you are supposed to do thorough research before posting questions here.

Get a name of a method parameter using Javassist

I have a CtMethod instance, but I don't know how to get names of parameters (not types) from it. I tried getParameterTypes, but it seems it returns only types.
I'm assuming it's possible, because libraries I'm using don't have sources, just class files and I can see names of method parameters in IDE.
It is indeed possible to retrieve the arguments' names, but only if the code has been compiled with debug symbols; otherwise you won't be able to do it.
To retrieve this information you have to access the method's local variable table. For further information about this data structure, I suggest you check section 4.7.13, The LocalVariableTable Attribute, of the JVM spec. As I usually say, the JVM spec may look bulky, but it's an invaluable friend when you're working at this level!
Accessing the local variable table attribute of your ctmethod
CtMethod method = .....;
MethodInfo methodInfo = method.getMethodInfo();
LocalVariableAttribute table = (LocalVariableAttribute) methodInfo.getCodeAttribute().getAttribute(javassist.bytecode.LocalVariableAttribute.tag);
You now have the local variable attribute in the table variable.
Detecting the number of localVariables
int numberOfLocalVariables = table.tableLength();
Now keep two things in mind regarding the value of numberOfLocalVariables:
1st: local variables defined inside your method's body are also counted by tableLength();
2nd: if the method is non-static, the this variable is counted as well.
The order of your local variable table will be something like:
|this (if non static) | arg1 | arg2 | ... | argN | var1 | ... | varN|
Retrieving the argument name
Now if you want to retrieve, for example, arg2's name from the previous example, it's in the 3rd position in the array. Hence you do the following:
// remember it's an array so it starts at 0, meaning if you want position 3 => use index 2
int frameWithNameAtConstantPool = table.nameIndex(2);
String variableName = methodInfo.getConstPool().getUtf8Info(frameWithNameAtConstantPool);
You now have your variable's name in variableName.
Side Note: I've taken you through the scenic route so you could learn a bit more about Java (and Javassist) internals. But there are already tools that do this kind of operation for you; I can remember at least one by name, called paranamer. You might want to take a look at that too.
Hope it helped!
If you don't actually want the names of the parameters, but just want to be able to access them, you can use "$1, $2, ..." as seen in this tutorial.
It works with Javassist 3.18.2 (and later, at least up to 3.19 anyway) if you cast, like so:
LocalVariableAttribute nameTable = (LocalVariableAttribute)methodInfo.getCodeAttribute().getAttribute(LocalVariableAttribute.tag);

Can I index source code using Lucene?

I would like to index source code using Lucene. The source code has already been pre-analysed using a compiler plugin. The output of the compiler is a list of IDs that appear in the source code. Each ID includes information about
the module the ID was defined in (as opposed to used in),
the source span where the ID appears (i.e. line:col-line:col), and
whether the ID is defined at this location or merely used here.
For example, given this source code module (in pseudo-code)
module MyModule
from MyOtherModule import bar
foo = ...
print bar
here's what the compiler might output when compiling MyModule:
MyModule.foo,3:1-3:3,definition
MyOtherModule.bar,4:7-4:9,use
Note how all IDs that appear in the output are fully qualified, even though they might not appear that way in the source. This is why we use a compiler, it allows us to do more exact code search than just purely text-based search.
Question: Is it possible to write a custom tokenizer and analyzer that indexes the compiler output shown above in a way that the metadata (i.e. the fully qualified ID and whether the ID was defined or used at the given location) is kept and available when scoring the documents?
To be more precise, I'd like each term to be associated with the module where it was defined (e.g. foo would have associated metadata: defining module=MyModule). I want each posting in the posting list to store whether this particular appearance of an ID was a definition or a use of that ID.
In addition, I'd like to have Lucene store the non-qualified ID as synonyms for the qualified ID. This would allow users to search for "foo" and retrieve all documents that contain the IDs "Module1.foo" and "Module2.foo".
It's probably easier to put the various attributes into Lucene fields, so that you can query like:
parse module:MyModule use:yes
which would return only hits on 'parse' in 'MyModule' where 'parse' was used rather than defined.
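As a rough illustration of what "putting the attributes into fields" means, the sketch below splits each compiler output line into separate values and filters on them. It is plain Java and deliberately makes no Lucene calls; in Lucene each value would presumably become a field on an indexed document. The class, method, and field names are hypothetical, and the module field is assumed to be the defining module taken from the qualified ID.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits compiler output lines into fields; plain Java, no Lucene calls. */
class IdFieldsDemo {
    /** One occurrence of an ID, with each attribute held separately. */
    static final class Occurrence {
        final String module; // defining module, taken from the qualified ID
        final String name;   // unqualified ID, so a search for "foo" finds every module's foo
        final String span;
        final boolean isUse;

        Occurrence(String module, String name, String span, boolean isUse) {
            this.module = module;
            this.name = name;
            this.span = span;
            this.isUse = isUse;
        }
    }

    /** Parses one line such as "MyModule.foo,3:1-3:3,definition". */
    static Occurrence parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("bad line: " + line);
        }
        int dot = parts[0].lastIndexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("ID is not qualified: " + parts[0]);
        }
        return new Occurrence(parts[0].substring(0, dot), parts[0].substring(dot + 1),
                parts[1], parts[2].equals("use"));
    }

    /** Mimics a query like "name module:M use:yes"; a null module matches any module. */
    static List<Occurrence> search(List<Occurrence> index, String name, String module, boolean use) {
        List<Occurrence> hits = new ArrayList<>();
        for (Occurrence o : index) {
            if (o.name.equals(name) && (module == null || o.module.equals(module)) && o.isUse == use) {
                hits.add(o);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Occurrence> index = new ArrayList<>();
        index.add(parse("MyModule.foo,3:1-3:3,definition"));
        index.add(parse("MyOtherModule.bar,4:7-4:9,use"));
        for (Occurrence hit : search(index, "bar", null, true)) {
            System.out.println(hit.module + "." + hit.name + " at " + hit.span);
        }
    }
}
```

Keeping the unqualified name in its own field is one simple way to get the synonym behaviour asked for in the question, since a search on the name alone matches the ID regardless of its module.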

What are the steps I need to do to complete this programming assignment?

I'm having a hard time understanding what I'm supposed to do. The only thing I've figured out is I need to use yacc on the cminus.y file. I'm totally confused about everything after that. Can someone explain this to me differently so that I can understand what I need to do?
INTRODUCTION:
We will use lex/flex and yacc/Bison to generate an LALR parser. I will give you a file called cminus.y. This is a yacc format grammar file for a simple C-like language called C-minus, from the book Compiler Construction by Kenneth C. Louden. I think the grammar should be fairly obvious.
The Yahoo group has links to several descriptions of how to use yacc. Now that you know flex it should be fairly easy to learn yacc. The only base type is int. An int is 4 bytes. Booleans are handled as ints, as in C. (Actually the grammar allows you to declare a variable as a type void, but let's not do that.) You can have one-dimensional arrays.
There are no pointers, but references to array elements should be treated as pointers (as in C).
The language provides for assignment, IF-ELSE, WHILE, and function calls and returns.
We want our compiler to output MIPS assembly code, and then we will be able to run it on SPIM. For a simple compiler like this with no optimization, an IR should not be necessary. We can output assembly code directly in one pass. However, our first step is to generate a symbol table.
SYMBOL TABLE:
I like Dr. Barrett’s approach here, which uses a lot of pointers to handle objects of different types. In essence the elements of the symbol table are identifier, type and pointer to an attribute object. The structure of the attribute object will differ according to the type. We only have a small number of types to deal with. I suggest using a linear search to find symbols in the table, at least to start. You can change it to hashing later if you want better performance. (If you want to keep in C, you can do dynamic allocation of objects using malloc.)
First you need to make a list of all the different types of symbols that there are (there are not many) and what attributes would be necessary for each. Be sure to allow for new attributes to be added, because we have not covered all the issues yet. Looking at the grammar, the question of parameter lists for functions is a place where some thought needs to be put into the design. I suggest more symbol table entries and pointers.
TESTING:
The grammar is correct, so taking the existing grammar as it is and generating a parser, the parser will accept a correct C-minus program but it won’t produce any output, because there are no code snippets associated with the rules.
We want to add code snippets to build the symbol table and print information as it does so.
When an identifier is declared, you should print the information being entered into the symbol table. If a previous declaration of the same symbol in the same scope is found, an error message should be printed.
When an identifier is referenced, you should look it up in the table to make sure it is there. An error message should be printed if it has not been declared in the current scope.
When closing a scope, warnings should be generated for unreferenced identifiers.
Your test input should be a correctly formed C-minus program, but at this point nothing much will happen on most of the production rules.
SCOPING:
The most basic approach has a global scope and a scope for each function declared.
The language allows declarations within any compound statement, i.e. scope nesting. Implementing this will require some kind of scope numbering or stacking scheme. (Stacking works best for a one-pass compiler, which is what we are building.)
(disclaimer) I don't have much experience with compiler classes (as in school courses on compilers) but here's what I understand:
1) You need to use the tools mentioned to create a parser which, when given input, will tell the user whether the input is a correct program according to the grammar defined in cminus.y. I've never used yacc/bison so I don't know how it is done, but this is what seems to be required:
(input) file-of-some-sort which contains the program to be parsed
(output) reply-of-some-sort which tells whether the (input) is correct with respect to the provided grammar.
2) It also seems that the parser needs to check for variable consistency (i.e., you can't use a variable you haven't declared, same as in most programming languages), which is done via a symbol table. In short, every time something is declared you add it to the symbol table. When you encounter an identifier, if it is not one of the language keywords (like if or while), you look it up in the symbol table to determine whether it has been declared. If it is there, go on. If it's not, print some-sort-of-error.
Note: point(2) there is a simplified take on a symbol table; in reality there's more to them than I just wrote but that should get you started.
I'd start with yacc examples - see what yacc can do and how it does it. I guess there must be some big example-complete-with-symbol-table out there which you can read to understand further.
Example:
Let's take input A:
int main()
{
int a;
a = 5;
return 0;
}
And input B:
int main()
{
int a;
b = 5;
return 0;
}
and assume we're using C syntax for parsing. Your parser should deem Input A all right, but should yell "b is undeclared" for Input B.
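The declare/look-up behaviour for inputs A and B can be sketched with a stack of scopes. This is a minimal Java illustration under simplifying assumptions, not the assignment's required design: each entry stores only a "referenced" flag instead of a type and attribute object, an assignment such as a = 5 counts as a reference, and the class, method names, and message texts are hypothetical. In the real project the same calls would be made from the C code snippets attached to the yacc rules.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal scope-stack symbol table; it tracks only a "referenced" flag per name. */
class ScopedSymbolTable {
    // The innermost scope is at the head of the deque.
    // Map value: has the name been referenced yet?
    private final Deque<Map<String, Boolean>> scopes = new ArrayDeque<>();
    final List<String> messages = new ArrayList<>();

    void openScope() {
        scopes.push(new LinkedHashMap<>());
    }

    /** Pops the innermost scope and warns about names that were never referenced. */
    void closeScope() {
        for (Map.Entry<String, Boolean> entry : scopes.pop().entrySet()) {
            if (!entry.getValue()) {
                messages.add("warning: " + entry.getKey() + " is never referenced");
            }
        }
    }

    /** Adds a name to the innermost scope, or reports a duplicate declaration. */
    void declare(String name) {
        Map<String, Boolean> current = scopes.peek();
        if (current.containsKey(name)) {
            messages.add("error: " + name + " is already declared in this scope");
        } else {
            current.put(name, false);
        }
    }

    /** Looks a name up from the innermost scope outwards. */
    void reference(String name) {
        for (Map<String, Boolean> scope : scopes) { // iterates innermost scope first
            if (scope.containsKey(name)) {
                scope.put(name, true);
                return;
            }
        }
        messages.add("error: " + name + " is undeclared");
    }

    public static void main(String[] args) {
        // Body of main() from input B: "int a; b = 5;"
        ScopedSymbolTable table = new ScopedSymbolTable();
        table.openScope();
        table.declare("a");
        table.reference("b");
        table.closeScope();
        System.out.println(table.messages);
        // [error: b is undeclared, warning: a is never referenced]
    }
}
```

Running the same sequence for input A (declare a, then reference a) leaves the message list empty, which corresponds to the parser accepting the program silently.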