Best way to save source line information in an ANTLR4.7.1 lexer/parser

All,
I'm fairly new to ANTLR, so the solution may be trivial, but it escapes me. (I have much experience with parsers and scanners, just not with ANTLR-generated ones.)
I'm recoding an assembler for a 32-bit (National Series 32000) CPU. It originally was coded using C++/(f)lex/yacc/bison, but is being ported to Java8. Part of my requirements is that I produce a listing file that contains addresses, generated code, source line, etc.
I have an object that can contain all of the information I need (e.g., source line, generated code, etc.) and I would like to associate said object with each token. My questions are:
1) What is the best way to capture a source line? I considered using the lexer (+ modes) to capture a source line, but found no way to capture a source line and reject (or push back) the input to make it available for subsequent processing. I know that CharStream buffers its entire input stream in one fell swoop. Would subclassing CharStream to construct my container and capture source line contents be an appropriate approach?
2) How to associate my container object with each token? I suspect subclassing Token and creating a custom TokenFactory is required, but am uncertain how to connect a custom CharStream to Token. (This is why I liked the concept of using the lexer to capture individual lines.)
Thanks for any help!

There's no need to capture position information manually. Each token (normally an instance of CommonToken) comes with line and char offset values, plus a few more, such as the token index (the index of the token in the token stream) and start/stop indices, which give you the character indexes in the original text input.
The resulting parse tree also contains references to the tokens or symbols that make up a rule context or terminal node. So you can look up positions at any time, always connected to a particular parser rule.
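For the listing-file part of the question, one possible approach is to keep the source text split into lines and look the line up by the line number a token reports. The sketch below is plain Java with no ANTLR dependency; ListingLines and its methods are hypothetical names, and the listing column layout is only an assumption. In a real ANTLR 4 program the line number would come from Token.getLine(), which is 1-based.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps a token's 1-based line number back to the source line text. */
final class ListingLines {
    private final List<String> lines = new ArrayList<String>();

    ListingLines(String source) {
        // Split on '\n' only, which is (as far as I know) how the ANTLR lexer counts lines;
        // limit -1 keeps trailing empty lines so indexes stay aligned.
        for (String line : source.split("\n", -1)) {
            if (line.endsWith("\r")) {
                line = line.substring(0, line.length() - 1); // drop the CR of a CRLF pair
            }
            lines.add(line);
        }
    }

    /** Returns the text of the given 1-based line, e.g. the value of Token.getLine(). */
    String sourceLine(int tokenLine) {
        return lines.get(tokenLine - 1);
    }

    /** Formats one listing row: address, generated code (as hex text), line number, source. */
    String listingRow(int address, String hexCode, int tokenLine) {
        return String.format("%08X  %-12s %5d  %s",
                address, hexCode, tokenLine, sourceLine(tokenLine));
    }
}
```

With this, the container object only needs to remember a line number (or a token), and the source line is recovered when the listing is written.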

Related

How to create a refactor tool?

I want to learn how to create refactor tools.
As an example: I want to create migration scripts
for when some library removes deprecated function
and we want to transform code to use the newer adequate functionality.
My idea was to use ANTLR to parse the code into AST,
use some pattern matching on this tree to modify contents,
and output the modified contents.
However, from what I read ANTLR isn't preserving formatting in AST tree,
therefore it would be hard to get unbroken content back.
Do you have a solution that would comply with:
allows me to modify code with preserving formatting
(optionally) allows me to use AST transformations for code transformation
(optionally) can transform a variety of languages, like ANTLR
Question is not limited to one particular language,
I'd be happy to hear about solutions created for different languages.
If you want a
general purpose tool to parse source code from arbitrary languages producing ASTs
apply procedural or preferably source-to-source pattern-directed rewrite rules to manipulate the ASTs
regenerate valid source code retaining formatting and comments
I know of only two systems at present that can do this robustly.
RASCAL Metaprogramming language, a research platform
Semantic Designs' (my company) DMS Software Reengineering Toolkit (DMS)
You probably don't want to try building frameworks like this yourself; these tools both have decades of PhD level investment to make them practical.
One issue that occurs repeatedly is the mistake of thinking that having a parser (e.g., ANTLR) solves most of the problem. See my essay on Life After Parsing. A key insight is that you can't transform "just the syntax (ASTs)" without context; you have to take into account the language semantics, and whatever framework you choose to use had better help you do semantic analysis to support the rewrite rules.
There are other (general purpose) program transformation systems out there. Many are research. Few have been used to do serious software reengineering in practice. RASCAL has been applied to some quite interesting tasks but not in any commercial context that I know. DMS has been used in production for over 20 years to carry out massive code base changes including refactoring, API revision, and fully automated language migrations.
ANTLR has a TokenStreamRewriter class that is very good at preserving your source input.
It has very robust capabilities. It allows you to delete, insert, or replace text in the input stream. It actually stores up a series of pending changes and then applies them when you ask for the modified input stream (it even allows for rolling back changes, as well as multiple sets of changes).
A couple of examples from a recent presentation I did that touched on the Rewriter:
    private void plus0(RefactorUtilContext ctx, String pName) {
        for (var match : plus0PatternA.findAll(ctx, ANY_EXPR_XPATH)) {
            var matchCtx = (AddSubExprContext) (match.getTree());
            rewriter.delete(pName, matchCtx.op, matchCtx.rhs.getStop());
        }
        for (var match : plus0PatternB.findAll(ctx, ANY_EXPR_XPATH)) {
            var matchCtx = (AddSubExprContext) (match.getTree());
            rewriter.delete(pName, matchCtx.lhs.getStart(), matchCtx.op);
        }
    }

    private void times1(RefactorUtilContext ctx, String pName) {
        for (var match : times1PatternA.findAll(ctx, ANY_EXPR_XPATH)) {
            var matchCtx = (MulDivExprContext) (match.getTree());
            // rewriter.delete(pName, matchCtx.op, matchCtx.rhs.getStop());
            rewriter.insertBefore(pName, matchCtx.op, "/* ");
            rewriter.insertAfter(pName, matchCtx.rhs.getStart(), " */");
        }
        for (var match : times1PatternB.findAll(ctx, ANY_EXPR_XPATH)) {
            var matchCtx = (MulDivExprContext) (match.getTree());
            // rewriter.delete(pName, matchCtx.lhs.getStart(), matchCtx.op);
            rewriter.insertBefore(pName, matchCtx.lhs.getStart(), "/* ");
            rewriter.insertAfter(pName, matchCtx.op, " */");
        }
    }
TokenStreamRewriter, basically, just stores a set of instructions about how to modify your input stream, so everything about your input stream that you don't modify is, ummmm, unmodified :).
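To make the "stored instructions" idea concrete, here is a toy sketch in plain Java. It is not ANTLR's implementation and does not use the ANTLR API; MiniRewriter and its methods are hypothetical names, and tokens are represented as plain strings (whitespace included) purely for illustration. The point is only that edits are recorded by token index and the original token list is never changed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of a rewriter: pending edits are recorded and applied only in getText(). */
final class MiniRewriter {
    private final List<String> tokens;                       // original tokens, never modified
    private final Map<Integer, String> before = new HashMap<>();
    private final Map<Integer, String> after = new HashMap<>();
    private final Set<Integer> deleted = new HashSet<>();

    MiniRewriter(List<String> tokens) {
        this.tokens = tokens;
    }

    /** Records text to emit just before the token at the given index. */
    void insertBefore(int index, String text) {
        before.merge(index, text, String::concat);
    }

    /** Records text to emit just after the token at the given index. */
    void insertAfter(int index, String text) {
        after.merge(index, text, String::concat);
    }

    /** Records that tokens from..to (inclusive) should be skipped on output. */
    void delete(int from, int to) {
        for (int i = from; i <= to; i++) {
            deleted.add(i);
        }
    }

    /** Replays the token list, applying the recorded edits; untouched tokens pass through as-is. */
    String getText() {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (before.containsKey(i)) out.append(before.get(i));
            if (!deleted.contains(i)) out.append(tokens.get(i));
            if (after.containsKey(i)) out.append(after.get(i));
        }
        return out.toString();
    }
}
```

Because whitespace and comments are ordinary entries in the token list, anything that is not explicitly deleted or wrapped comes out exactly as it went in.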
You may also wish to look into the XPath capabilities that ANTLR has. These allow you to find very specific patterns in the parse tree to locate the portions you would like to refactor. As the name suggests, the syntax is very similar to XPath for XML documents, but works on the parse tree instead of an XML DOM.
Note: all of these operate on the parse tree, not an AST. An AST would necessarily be of your own design, so ANTLR wouldn't know its structure. Also, it's in the nature of ASTs to drop irrelevant information (like comments, whitespace, etc.), so they're just not a good starting point for anything where you want to preserve formatting.
I put together a quite small project for this presentation and it's on GitHub at LittleCalc. The most relevant file is LittleCalcExecutionVisitor.
You can start a REPL and test things out by running LittleCalcRepl.java

Is there a way to get exported constants from an Objective-C framework? [duplicate]

I'm trying to find a constant (something like a secret token) from inside of an iOS app in order to build an app using an undocumented web API (by the way, I'm not into something illegal). So far, I have the decrypted app executable on my Mac (jailbreak + SSH + dumping decrypted executable as file). I can use the strings command to get a readable list of strings, and I can use the class-dump tool (http://stevenygard.com/projects/class-dump/) to get a list of interface definitions (headers) of the classes. Although this gives me an idea of the app's inner workings, I still can't find the constants I'm looking for. There are literally thousands of string definitions in the strings command dump. Is there any way to dump the strings so that I get the names of the NSString constants along with their values? I don't need the implementation details of the methods; I know that it's compiled and all I can get is assembly code. But if I can get the names of the string constants (both in strings dump and class dump) and also the string values (in strings dump), I think there may be a way to associate them together.
Thanks,
Can.
Unfortunately, no, unless there's some black magic tool out there that I'm unaware of, or unless the executable was built with debug symbols (which is likely not the case). If there are debug symbols, you should be able to run it through a debugger and get variable names.
At compile time, the compiler strips off the name of the constant, and replaces all occurrences of the constant in the code with the address of its location in memory (which is usually the same byte offset as inside the executable). Because of this, the original variable naming of the constant is lost, leaving only the value. Hence, the reason you can't find the constants anywhere.
Something I would try in order to find the secret token is to capture all the data traffic that the app creates, and then look for the same patterns in the binary. If the token is indeed in there, and it isn't obfuscated somehow, then at least that narrows it down for you greatly.
Good luck! RE can be very rewarding but sometimes it really sucks.

ANTLR4 - Generate code from non-file inputs?

Where do we start to manually build a CST from scratch? Or does ANTLR4 always require the lex/parse process as our input step?
I have some visual elements in my program that represent code structures.
e.g. a square represents a class, while a circle embedded within that square represents a method.
Now I want to turn those into code. How do I use ANTLR4 to do this, at runtime (using ANTLR4.js)? Most of the ANTLR examples seem to rely on lexing and parsing existing code to get to a syntax tree. So rather than:
input code->lex->parse->syntax tree->output code (1)
I want
manually create syntax tree->output code (2)
(Later, as the user adds code to that class and its methods, then ANTLR will be used as in (1).)
EDIT Maybe I'm misunderstanding this. Do I create some custom data structure and then run the parser over it? i.e. write structures to some in-memory format->parse->output code (3)?
IIUC, you could use StringTemplate directly.
By way of background, Antlr itself builds an in-memory parse-tree and then walks it, incrementally calling StringTemplate to output code snippets qualified by corresponding parse-tree node data. That Antlr uses an internal parse-tree is just a convenience for simplifying walking (since Antlr is built using Antlr).
If you have your own data structure, regardless of its specific implementation, procedurally process it to progressively call ST templates to emit the corresponding code. And, you can directly use the same templates that Antlr uses (JavaScript.stg), if they meet your requirements.
Of course, if your data structure is of a nature that can be lex'd/parsed into a standard Antlr parse-tree, you can then use a standard Antlr visitor to call and populate node-specific templates.
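As a rough illustration of "procedurally process your own data structure and emit code", here is a small sketch. It is written in Java rather than JavaScript and uses a StringBuilder where a real solution would probably call StringTemplate templates; ClassShape, MethodShape, and ShapeCodeEmitter are hypothetical names standing in for the square and circle elements from the question, and the emitted JavaScript-style class syntax is just an assumed target.

```java
import java.util.ArrayList;
import java.util.List;

/** Walks a hand-built structure (no lexing or parsing involved) and emits source text. */
final class ShapeCodeEmitter {

    /** Hypothetical model of a "circle": one method inside a class. */
    static final class MethodShape {
        final String name;
        MethodShape(String name) { this.name = name; }
    }

    /** Hypothetical model of a "square": a class containing methods. */
    static final class ClassShape {
        final String name;
        final List<MethodShape> methods = new ArrayList<>();
        ClassShape(String name) { this.name = name; }
    }

    /** Emits one class; each append stands in for filling a template with node data. */
    static String emit(ClassShape c) {
        StringBuilder sb = new StringBuilder();
        sb.append("class ").append(c.name).append(" {\n");
        for (MethodShape m : c.methods) {
            sb.append("    ").append(m.name).append("() {\n");
            sb.append("    }\n");
        }
        sb.append("}\n");
        return sb.toString();
    }
}
```

The structure being walked here never went through a lexer or parser, which is the situation described in (2) of the question.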

can hard coded strings in a compiled exe be changed?

Let's say you have some code in your app with a hard-coded string.
If somevalue = "test123" Then
End If
Once the application is compiled, is it possible for someone to modify the .exe file and change 'test123' to something else? If so, would it only work if the string contained the same number of characters?
It's possible but not necessarily straightforward. For example, if your string is loaded in memory, someone could use a memory editing tool to modify the value at the string's address directly.
Alternatively, they could decompile your app, change the string, and recompile it to create a new assembly with the new string. However, whether this is likely to happen depends on your app and how important it is for that string to be changed.
You could use an obfuscator to make it a bit harder to do but, ultimately, a determined cracker would be able to do it. The question is whether that string is important enough to worry about and, if so, maybe consider an alternative approach such as using a web service to provide the string.
Strings hard-coded without any obfuscation techniques can easily be found inside compiled executables by opening them up in any hex editor. Once found, replacing the string is possible in two ways:
1. Easy way (*conditions apply)
If the following conditions apply in your case, this is a very quick-fire way of modifying the hard-coded strings in the executable binary.
length(new-string) <= length(old-string)
No logic in the code to check for executable modification using CRC.
This is a viable option ONLY if the new string is the same length as or shorter than the old string. Use a hex editor to find occurrences of the old string and replace them with the new string. Pad any leftover space with NUL bytes, i.e. 0x00.
For example, old-long-string in the binary is modified to a shorter new-string and padded with null characters to the same length as the original string in the binary executable file.
Note that such modifications to the executable file are detected by any code that verifies the checksum of the binary file against the pre-calculated checksum of the original binary executable file.
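The easy way can be sketched in a few lines of Java. This is only an illustration on an in-memory byte array: BinaryStringPatch and its methods are hypothetical names, the string is assumed to be stored as single-byte ASCII (many .NET executables store string literals as UTF-16, which would need a different encoding), and only the first occurrence is patched. The CRC-32 helper shows why a stored checksum would reveal the change.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch: overwrite a string inside a binary image, padding the remainder with 0x00. */
final class BinaryStringPatch {

    /** Returns the index of the first occurrence of needle in haystack, or -1. */
    static int indexOf(byte[] haystack, byte[] needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) j++;
            if (j == needle.length) return i;
        }
        return -1;
    }

    /** Returns a patched copy of image; the file size and all other offsets stay the same. */
    static byte[] patch(byte[] image, String oldString, String newString) {
        byte[] oldBytes = oldString.getBytes(StandardCharsets.US_ASCII);
        byte[] newBytes = newString.getBytes(StandardCharsets.US_ASCII);
        if (newBytes.length > oldBytes.length) {
            throw new IllegalArgumentException("new string must not be longer than the old one");
        }
        int at = indexOf(image, oldBytes);
        if (at < 0) {
            throw new IllegalArgumentException("old string not found");
        }
        byte[] out = image.clone();
        for (int i = 0; i < oldBytes.length; i++) {
            out[at + i] = i < newBytes.length ? newBytes[i] : 0; // pad leftover space with NUL
        }
        return out;
    }

    /** CRC-32 of the whole image, as an integrity check might compute it. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }
}
```

Comparing crc32 of the original and patched arrays gives different values, which is exactly what an integrity check embedded in the application would rely on.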
2. Harder way (applicable in almost all cases)
Decompiling the binary back to source code opens up the possibility of modifying any strings (and even code) and rebuilding it to obtain a new binary executable.
There exist dozens of such de-compiler tools to decompile vb.net (Visual Studio.net, in general). An excellent detailed comparison of the most popular ones (ILspy, JustDecompile, DotPeek, .NET Reflector to name a few ) can be found here.
There do exist scenarios in which even the harder way will NOT be successful. This is the case when the original developer has used obfuscation techniques to prevent the strings from being detected and modified in the executable binary. One such obfuscation technique is storing encrypted strings.

How can I access hidden tokens in ANTLR AST?

I am trying to write a manual tree walker in Java for an AST generated by ANTLR V3. The AST is built using island grammars similar to the one specified in ANTLR: call a rule from a different grammar.
In the AST, I have a node for an expression list with each expression as a child node. Now I need to know the line numbers of the COMMAs which separated the expressions. The COMMAs were present in parsing but removed during the AST rewrite.
I see some resources (here and here) pointing to the usage of CommonTokenStream.getTokens, but I am not sure how I can access the CommonTokenStream while processing the AST. Is there any way I can get the CommonTokenStream used to build the AST?
The complete list of tokens is accessible through CommonTokenStream.getTokens(), which you can call before you call the tree walker. The list of tokens would be an argument to the walker. There's no need to change CommonTree, unless you want the recovered information embedded in the tree.
I've used the token list to associate hidden tokens such as comments and explicit line numbers (think FORTRAN) with the closest visible token. This was done post-processing the AST and looking at the line, column, and char-index information which is available for both the tokens in the list and the nodes in the AST.
My attempts at doing that during AST construction resulted in hacky, unmaintainable code. The post-processing code, OTOH, is Programming-101 algorithmic.
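A minimal sketch of that kind of post-processing pass, in plain Java with no ANTLR dependency: Tok is a hypothetical stand-in for the entries returned by CommonTokenStream.getTokens(), and the attachment policy (hidden tokens go to the next visible token, trailing ones to the last visible token) is just one assumption about what "closest" could mean.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: associate hidden tokens (comments etc.) with a neighbouring visible token. */
final class HiddenTokenAttacher {

    /** Hypothetical stand-in for a token from the full token list. */
    static final class Tok {
        final String text;
        final int line;
        final boolean hidden; // true for off-channel tokens such as comments
        Tok(String text, int line, boolean hidden) {
            this.text = text;
            this.line = line;
            this.hidden = hidden;
        }
    }

    /** Maps each visible token that has hidden neighbours to those hidden tokens. */
    static Map<Tok, List<Tok>> attach(List<Tok> tokens) {
        Map<Tok, List<Tok>> result = new LinkedHashMap<>();
        List<Tok> pending = new ArrayList<>();
        Tok lastVisible = null;
        for (Tok t : tokens) {
            if (t.hidden) {
                pending.add(t);                       // wait for the next visible token
            } else {
                if (!pending.isEmpty()) {
                    result.computeIfAbsent(t, k -> new ArrayList<Tok>()).addAll(pending);
                    pending.clear();
                }
                lastVisible = t;
            }
        }
        if (!pending.isEmpty() && lastVisible != null) {
            // Hidden tokens at the end of input attach to the last visible token.
            result.computeIfAbsent(lastVisible, k -> new ArrayList<Tok>()).addAll(pending);
        }
        return result;
    }
}
```

The tree walker can then receive this map (or the raw token list) as an argument and look up line numbers from the attached tokens without any change to the tree classes.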