Is there a way to consistently find ends of the Solidity functions in the corresponding EVM assembly? - solidity

I've been working on a project that analyzes the EVM assembly of Solidity smart contracts. Currently I am stuck on the problem of finding the endings of all the contract functions in the assembly. There is a brute-force approach of simulating the EVM and simply tracking at which line execution reaches the end, but producing a complete EVM simulator is, I am afraid, well beyond my capabilities. I am searching for a simpler solution, if there is one.
So far I've managed to (almost) consistently find the beginnings of the functions (the corresponding JUMPDESTs) in the assembly, assuming that I have access to the contract's ABI. The idea is quite simple. At the top of the EVM assembly file there are multiple blocks that look like this:
PUSH4 0x8ac28d5a
GT
PUSH2 0x191
JUMPI
DUP1
and also like this:
PUSH4 0xfeaf968c
EQ
PUSH2 0xc82
JUMPI
PUSH2 0x2f4
JUMP
JUMPDEST
DUP1
Let's call them "header blocks" (if there is an official name, I am sorry for my illiteracy :) ). Each header block compares the hash of the method signature that came in the calldata against a constant and decides whether to jump to the JUMPDEST that corresponds to the beginning of the desired function. But there is a catch. As you can see, there is a GT at the top of the first header block. Why would we compare hashes with less/greater? It turns out the header blocks do not perform a linear search over all the signatures. Instead, they perform some kind of logarithmic (binary) search, as far as I can deduce (please correct me if I am wrong). And, as we can see in the second header block, in some cases they decide to jump unconditionally somewhere else, seemingly in the middle of the search process. In reality, at that point they already have enough information to infer that no function in this assembly has the required signature hash, so we can deduce that those "else" JUMPs go straight to the fallback.
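To make the dispatcher-scanning idea concrete, here is a minimal Python sketch (not anything the Solidity compiler provides; the function name and the one-opcode-per-line format are my assumptions) that walks a textual disassembly, looks for the PUSH4 <selector> / EQ / PUSH2 <dest> / JUMPI pattern and records a selector-to-entry-offset map. Only the EQ blocks point at function bodies; the GT blocks are the binary-search splits and are skipped. Real dispatchers vary between compiler versions, so treat this purely as an illustration:

def find_function_entries(disassembly):
    # One opcode per line, e.g. "PUSH4 0xfeaf968c"; returns {selector: entry_offset}.
    lines = [ln.strip() for ln in disassembly.splitlines() if ln.strip()]
    entries = {}
    for i, line in enumerate(lines):
        if not line.startswith("PUSH4 "):
            continue
        selector = line.split()[1]
        window = lines[i + 1:i + 4]          # expect EQ, PUSH2 <dest>, JUMPI
        if len(window) == 3 and window[0] == "EQ" \
                and window[1].startswith("PUSH2 ") and window[2] == "JUMPI":
            entries[selector] = int(window[1].split()[1], 16)
    return entries

example = """
PUSH4 0xfeaf968c
EQ
PUSH2 0x0c82
JUMPI
"""
print(find_function_entries(example))   # {'0xfeaf968c': 3202}

Matching the recovered selectors against the first four bytes of keccak256 of each ABI signature is then enough to name the entry points.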
So this is the context of what I have done so far: I am able to obtain the list of the beginnings of all the functions, including the fallback. Obtaining the list of the ends of the functions is what I am currently struggling with. My hypothesis was that I could split the whole assembly file at the JUMPDESTs of the function beginnings (and the dispatch part with the header blocks), and each part except the first would correspond to one Solidity function. Unfortunately, this is easily disproven by looking at the assembly of a basic contract with only a couple of functions. You can experiment yourself at godbolt.org (a little example here). There will be a number of auxiliary "functions" created by the Solidity compiler, so my approach is not viable. Are there any approaches to finding the endings of the functions without simulating the EVM?
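For completeness, the splitting hypothesis itself is easy to express; a rough Python sketch (names are mine) of cutting a disassembly into segments at the recovered entry offsets could look like the following. The reason it fails is exactly what the question describes: the segments also swallow the compiler-generated internal routines, so a segment boundary is not a function end.

def split_at_entries(instructions, entry_offsets):
    # instructions: ordered list of (byte_offset, opcode_text) pairs.
    # Starts a new segment every time a known function-entry JUMPDEST is reached.
    entry_offsets = set(entry_offsets)
    segments, current = [], []
    for offset, opcode in instructions:
        if offset in entry_offsets and current:
            segments.append(current)
            current = []
        current.append((offset, opcode))
    if current:
        segments.append(current)
    return segments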

Related

ANTLR4: What is the best approach to implement C like include file handling?

I am implementing a lexer/parser for the real-time language OpenPEARL. For better structuring of my test suite I want to implement include file handling similar to C/C++. The parser itself uses the visitors. What would be the best approach to implement this? One thing that concerns me about instantiating a nested parser is that the included file does not need to contain a complete program, depending on where it is included.
Cheers
Marcel
I can't speak for ANTLR, but in general one implements a C-like preprocessor in the lexer.
You accomplish this by having a stack of input streams, with the base of the stack being the source file. You read input from the stream on top of the stack.
When an include is encountered in the lexer, a new stream is pushed on top of the stack, and reading continues (now from the new stream). When a stream encounters EOF, you pop the stack and continue; if the stack is empty, the lexer emits an EOF token.
You can abuse these streams to implement macros. On a macro call, simply push a new stream that represents the macro body. When you encounter a macro parameter name, push a stream for the argument that was supplied for that parameter.
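For what it's worth, the stream-stack idea is only a few lines outside of ANTLR. Here is a rough Python sketch (the include directive syntax and the PEARL-ish sample are made up) that reads from the stream on top of a stack, pushes a new stream on an include directive and pops at EOF:

import io

def read_with_includes(text, files):
    # files maps an include name to its contents (stands in for the file system).
    stack = [io.StringIO(text)]
    while stack:
        line = stack[-1].readline()
        if line == "":                      # EOF on the current stream: pop it
            stack.pop()
            continue
        stripped = line.strip()
        if stripped.startswith("#include "):
            name = stripped.split(maxsplit=1)[1].strip('"')
            stack.append(io.StringIO(files[name]))   # push the included stream
            continue
        yield line.rstrip("\n")

files = {"defs.inc": "DCL x FIXED;\n"}
source = 'MODULE demo;\n#include "defs.inc"\nEND;\n'
print(list(read_with_includes(source, files)))
# ['MODULE demo;', 'DCL x FIXED;', 'END;']

In a real lexer the same push/pop would happen on character or token streams rather than on raw lines, but the control flow is identical.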
I have seen implementations where include handling has been done in the (parser) grammar. Doing it in the lexer like Ira suggests is certainly possible, but with some extra work.
However, full include handling is more than simply switching input streams: there is also macro handling, line splicing, trigraph handling, charizing and stringizing, plus an evaluator for #if(def) directives. I have implemented all of that in my Windows Resource File Parser, which was written for ANTLR 2.7 and hence needs an update, but it is certainly good for getting ideas.
In this project I handle include files outside of the normal ANTLR parsing chain, which more closely follows the preprocessor approach you often see for C/C++.

Using ANTLR4 lexing for Code Completion in Netbeans Platform

I am using ANTLR4 to parse code in my Netbeans Platform application. I have successfully implemented syntax highlighting using ANTLR4 and Netbeans mechanisms.
I have also implemented simple code completion for two of my tokens. At the moment I am using a simple implementation from a tutorial, which searches for a whitespace and starts the completion process from there. This works, but it requires the user to type a whitespace before triggering code completion.
My question: is it possible, or even intended, to use ANTLR's lexer to determine which tokens have just been read from the input, in order to determine the correct completion item?
I would appreciate any pointer in the right direction to improve this behaviour.
Not really an answer, but I do not have enough reputation points to post comments.
is it possible or even contemplated using ANTLR's lexer to determine which tokens are currently read from the input to determine the correct completion item?
Have a look here: http://www.antlr3.org/pipermail/antlr-interest/2008-November/031576.html
and here: https://groups.google.com/forum/#!topic/antlr-discussion/DbJ-2qBmNk0
Bear in mind that the first post was written in 2008 and the current ANTLR v4 is very different from the version available at the time, which is why Sam's opinion on this topic appears to have evolved.
My personal experience: most of what you are asking is probably doable with ANTLR, but you would have to know ANTLR very well. A more straightforward option is to use ANTLR to gather information about the context and then use your own heuristics to decide what needs to be shown in that context.
The ANTLRv3 grammar https://sourceware.org/git/?p=frysk.git;a=blob_plain;f=frysk-core/frysk/expr/CExpr.g;hb=HEAD implements context-sensitive completion of C expressions (no macros).
For instance, if fed the string:
a_struct->a<tab>
it would list just the fields of "a_struct" starting with "a" (the tab could, technically, be any character or marker).
The technique it used was to:
modify a C grammar to recognize both IDENT and IDENT_TAB tokens
for IDENT_TAB capture the partial expression AST and "TOKEN_TAB" and throw them back to 'main' (there are hacks to help capture the AST)
'main' then performs a type-eval on the partial expression (computing the expression's type, not its value) and uses that to expand TOKEN_TAB
The same technique, while not exactly ideal, can certainly be used in ANTLRv4.
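Stripped of the ANTLR machinery, the core of that technique is small. Here is a toy Python illustration (the symbol and field tables are invented, standing in for what a real type-eval over the AST would produce) that splits the input at the completion marker, resolves the type of the partial expression and lists the matching fields:

FIELDS = {"struct A": ["a_first", "a_second", "b_other"]}
SYMBOLS = {"a_struct": "struct A *"}

def complete(source, marker="\t"):
    expr = source[:source.index(marker)]         # text typed before the <tab>
    base, _, prefix = expr.rpartition("->")      # e.g. base="a_struct", prefix="a"
    struct_type = SYMBOLS[base.strip()].rstrip("* ").strip()
    return [f for f in FIELDS[struct_type] if f.startswith(prefix)]

print(complete("a_struct->a\t"))   # ['a_first', 'a_second']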

Is it feasible to use Antlr for source code completion?

I don't know if this question is valid, since I'm not very familiar with source code parsing. My goal is to write a source code completion function for an existing programming language (language "X") for learning purposes.
Is Antlr(v4) suitable for such a task or should the necessary AST/Parse Tree creation and parsing be done by hand, assuming no existing solutions exists?
I haven't found much information about that specific topic, apart from lists of compiler books, but a compiler is not what I'm after.
The code completion in GoWorks is completely implemented using ANTLR 4. The following video shows the level of completion of this code completion engine. The code completion example runs from 5 minutes through the end of the video.
Intro to Tunnel Vision Labs' GoWorks IDE (Preview Release)
I have been working on code completion algorithms for many years, and strongly believe that there is no better solution (automated or manual) for producing a code completion solution for a new language that meets the requirements for what I would call highly-responsive code completion. If you are not interested in that level of performance or accuracy, other solutions may be easier for you to get involved with (I don't work with those personally, because I am too easily disappointed in the results).
Xtext uses ANTLR3 and has good autocomplete facilities. The problem is that it generates a separate parser (again using ANTLR3) for autocomplete processing, derived from AbstractInternalContentAssistParser. This multi-thousand-line piece of code shows that the Xtext team found ANTLR3's error recovery alone to be insufficient.
Meanwhile, ANTLR4 has a method, parser.getExpectedTokensWithinCurrentRule(), which lists the possible token types for a given position. It works when used in a ParseTreeListener. What remains is semantics, scoping, etc., which is outside ANTLR's scope.

A macro highlighted as keyword: pascal

While looking in the sample code for FunkyOverlayWindow, I just found a pretty interesting declaration:
pascal OSStatus MyHotKeyHandler(
EventHandlerCallRef nextHandler,
EventRef theEvent,
void *userData
);
Here, pascal is highlighted as a keyword (pink in standard Xcode color scheme). But I just found it's a macro, interestingly enough defined in file CarbonCore/ConditionalMacros.h as:
#define pascal
So, what is (or was) it supposed to do? Maybe it had some special use in the past?
While this discussion might not be well suited here, it would be interesting to know why Apple is still using Carbon, if that relates to the answer. I have no experience with Carbon, but this code appears to set a keyboard event handler, which makes me wonder whether there are any advantages over the Cocoa approach. Won't Carbon ever be removed completely?
Under the 68k Classic Mac OS runtime (i.e., before PowerPC or x86), C and Pascal used different calling conventions, so C applications had to declare the appropriate convention when calling into libraries using the Pascal conventions (including most of the operating system). The macro was implemented in contemporaneous compilers (e.g., MPW, Metrowerks, Think C).
In all newer runtimes, and in all modern compilers, the keyword is no longer recognized, so the ConditionalMacros.h header defines it away. There are a few comments in that file which may help explain a bit more -- read through it, if you're game.
You have encountered a calling convention.
The pascal calling convention for the x86 is described here.
It is very interesting that it was defined away to nothing, which, as you noticed, means that it is not used anymore. It was common in the old days in x86-land, especially in the Microsoft Windows APIs, because of the ability of the processor to remove parameters from the stack at the end of a call with its special RET n instruction. Functions with the pascal calling convention were sometimes advantageous because the caller wasn't explicitly adjusting the stack pointer after each call returned.
If someone with more knowledge of why the macro still exists in the codebase but is defined away comes along and gives you a better answer, I will happily withdraw this one.

Why decompilers can't produce original code, theoretically

I searched the internet but did not find a concrete answer as to why decompilers are unable to reproduce the original source code. Somewhere it was written that it is similar to the halting problem, but it did not say how. So what are the theoretical and technical limitations that prevent a perfect decompiler?
It is, quite simply, a many-to-one problem. For example, in C:
b++;
and
b+=1;
and
b = b + 1;
may all get compiled to the same set of operations once the compiler and optimizer are done. The compiler reorders things, drops ineffective operations, and rewrites entire sections of code. By the time it is done, it has no idea what you wrote, just a pretty good idea of what you intended to happen at a raw-CPU (or vCPU) level.
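The same many-to-one point can be demonstrated with CPython's own compiler; in this small sketch, constant folding makes both spellings compile to byte-for-byte identical bytecode on recent CPython versions, so nothing in the compiled code records which one was written:

import dis

def spelled_out():
    return 2 * 3

def folded():
    return 6

dis.dis(folded)     # just "load the constant 6 and return it"
# After constant folding, the two code objects carry the same raw bytecode.
print(spelled_out.__code__.co_code == folded.__code__.co_code)   # True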
It is even smart enough to remove variables that aren't needed:
{
a=5;
b=func();
c=a+b;
d=func2(c);
}
## gets rewritten as:
REGISTERA=func()
REGISTERA+=5
return(func2(REGISTERA))
For starters, the variable names are never preserved when your program is compiled, so the best a decompiler could possibly do would be to use meaningless variable names throughout your reconstituted program. Compiling is generally a one-way transformation, like a one-way hash function: it may be possible to generate something else that hashes to the same value, but it is highly unlikely that the decompiled program will be exactly the same as your original.
Compilers throw out information; not all the information that is in the source code is in the compiled code. For example, in compiled Java you can't tell the difference between a parameterized and an unparameterized generic type, because that information is only used by the compiler; some annotations are only used at compile time and are not included in the compiled output. That doesn't mean you couldn't get some sort of source code by decompiling; it just wouldn't match, nor would it be as informative as, the actual source code.
There is usually not a 1-to-1 correspondence between source code and compiled code. If an essentially infinite number of possible sources could result in the same object code (given unbounded variable name lengths, etc.), how is a decompiler to guess which one to spit out?