Why can't decompilers reproduce the original source code, even in theory? - decompiler

I searched the internet but did not find a concrete, satisfactory answer as to why decompilers are unable to produce the original source code. Somewhere it was written that the problem is similar to the halting problem, but it did not say how. So what are the theoretical and technical limitations that prevent a perfect decompiler?

It is, quite simply, a many-to-one problem. For example, in C:
b++;
and
b+=1;
and
b = b + 1;
may all get compiled to the same set of operations once the compiler and optimizer are done. The compiler reorders things, drops ineffective operations, and rewrites entire sections of code. By the time it is done, it has no idea what you wrote, just a pretty good idea of what you intended to happen at the raw-CPU (or vCPU) level.
It is even smart enough to remove variables that aren't needed:
{
a=5;
b=func();
c=a+b;
d=func2(c);
}
gets rewritten as:
REGISTERA=func()
REGISTERA+=5
return(func2(REGISTERA))

For starters, variable names are normally not preserved when your program is compiled (debug information aside), so the best a decompiler could possibly do is use meaningless variable names throughout the reconstituted program. Compiling is generally a one-way transformation, like a one-way hash function. As with a hash, it may be possible to generate something else that produces the same output, but it is highly unlikely that the decompiled program will be exactly the same as your original.

Compilers throw out information; not all the information that is in the source code is in the compiled code. For example, in compiled Java you generally can't tell the difference between a use of a parameterized generic type and a use of the raw (unparameterized) type, because that information is only used by the compiler; likewise, some annotations are only used at compile time and are not included in the compiled output. That doesn't mean you couldn't get some sort of source code by decompiling; it just wouldn't match the actual source code, nor would it be as informative.

There is usually not a 1-to-1 correspondence between source code and compiled code. If an essentially infinite number of possible sources could result in the same object code (given unbounded variable name lengths, etc.), how is a decompiler to guess which one to spit out?
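The many-to-one point made in the answers above can be tried out directly. The following is only a rough sketch in Python rather than C (an assumption made for convenience, since CPython exposes its compiled bytecode through the built-in compile function); the helper name bytecode_of and the three sample strings are made up for illustration. Note that CPython, unlike a typical C compiler, keeps variable names, so this shows only part of the information loss: comments, spacing, and redundant parentheses.

```python
def bytecode_of(source):
    """Compile a source string and return the raw bytecode of the module."""
    return compile(source, "<example>", "exec").co_code


# Three different source texts that say the same thing.
variants = [
    "c = a + b",
    "c = (a + b)",
    "c   =   a+b   # add a and b",
]

# Collect the distinct bytecode sequences produced by the variants.
distinct = {bytecode_of(src) for src in variants}
print(len(distinct), "distinct bytecode sequence(s) for", len(variants), "sources")
```

If all variants collapse to a single bytecode sequence, a decompiler looking only at that sequence has no basis for choosing which of the source texts to print.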

Related

Is there a way to consistently find ends of the Solidity functions in the corresponding EVM assembly?

I've been working on a project that analyzes the EVM assembly of Solidity smart contracts. Currently I am stuck on the problem of finding the end of each contract function in the assembly. There is a brute-force approach: simulate the EVM and simply track at which line execution finishes. But producing a complete EVM simulator is, I am afraid, well beyond my capabilities. I am searching for a simpler solution, if there is one.
So far I've managed to (almost) consistently find the beginnings of the functions (the corresponding JUMPDESTs) in the assembly, assuming that I have access to the contract's ABI. The idea is quite simple. At the top of the EVM assembly file there are multiple blocks that look like this:
PUSH4 0x8ac28d5a
GT
PUSH2 0x191
JUMPI
DUP1
and also like this:
PUSH4 0xfeaf968c
EQ
PUSH2 0xc82
JUMPI
PUSH2 0x2f4
JUMP
JUMPDEST
DUP1
Let's call them "header blocks" (if there is an official name, I apologize for my illiteracy :) ). Each header block compares the hash of the method signature that came in the calldata against a constant and decides whether to jump to the JUMPDEST that corresponds to the beginning of the desired function. But there is a catch. As you can see, there is a GT at the top of the first header block. Why would we compare hashes with less/greater? Because the header blocks do not perform a linear search over all the signatures. Instead, they do some kind of logarithmic search, as far as I can deduce (please correct me if I am wrong). And, as we can see in the second header block, in some cases they unconditionally jump somewhere else, seemingly in the middle of the search. In reality, they simply have enough information at that point to infer that no function in this assembly has the required signature hash. So we can deduce that those "else" JUMPs go straight to the fallback.
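The header-block scan described above could be sketched roughly as follows. This is a minimal illustration, not a complete dispatcher parser: it assumes the assembly is available as text with one instruction per line, exactly as in the snippets above, and it only recognises the four-instruction pattern PUSH4 / EQ / PUSH / JUMPI. The function name find_dispatch_targets is made up for this example.

```python
import re

# Assumed textual forms, taken from the listing style used in the question.
SELECTOR_RE = re.compile(r"PUSH4\s+(0x[0-9a-fA-F]+)")
DEST_RE = re.compile(r"PUSH\d+\s+(0x[0-9a-fA-F]+)")


def find_dispatch_targets(asm_text):
    """Map each selector compared with EQ to the jump destination that follows it."""
    lines = [ln.strip() for ln in asm_text.splitlines() if ln.strip()]
    targets = {}
    for i in range(len(lines) - 3):
        selector = SELECTOR_RE.fullmatch(lines[i])
        dest = DEST_RE.fullmatch(lines[i + 2])
        # Only EQ blocks name a concrete function; GT blocks just steer the search.
        if selector and lines[i + 1] == "EQ" and dest and lines[i + 3] == "JUMPI":
            targets[selector.group(1).lower()] = int(dest.group(1), 16)
    return targets


sample = """
PUSH4 0x8ac28d5a
GT
PUSH2 0x191
JUMPI
DUP1
PUSH4 0xfeaf968c
EQ
PUSH2 0xc82
JUMPI
PUSH2 0x2f4
JUMP
JUMPDEST
DUP1
"""
for selector, dest in find_dispatch_targets(sample).items():
    print(selector, "->", hex(dest))
```

The GT block is deliberately skipped, since it only narrows the search and does not identify a function on its own. The selectors found this way would still have to be matched against the ABI to recover function names.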
So this is the context of what I have done so far. I am able to obtain the list of the beginnings of all the functions, including the fallback. Obtaining the list of the ends of the functions is what I am currently struggling with. My hypothesis was that I could split the whole assembly file at the JUMPDESTs of the function beginnings (plus the dispatch part with the header blocks), so that each part except the first would correspond to one Solidity function. Unfortunately, this is easily disproven by looking at the assembly of a basic contract with only a couple of functions. You can experiment yourself at godbolt.org (a little example here). There will be a number of auxiliary "functions" created by the Solidity compiler, so my approach is not viable. Are there any approaches for finding the ends of the functions without simulating the EVM?

Can you have the compiler check the names of variables and cause build errors on certain names?

I honestly can't find anything like this anywhere on the internet; it seems like an obvious feature that is missing.
Basically, I'd like the compiler to check the names of every local variable and cause a build to fail if certain variable names are used. Just as an example, I'd like the build to fail if someone tries to use "x" as a variable for anything.
I get that determined folks would find ways around it, but I'd like to see if it could be done.
I'm most interested in hearing whether this can be done in Visual Studio, since that's what I use the most, but I'd be interested in hearing about any feature like this for any language or compiler, if you know of one.
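To make the idea concrete, here is a minimal sketch of such a check written as a separate pre-build step rather than as a compiler feature. It targets Python source and uses Python's standard ast module, purely as an illustration; it is not a Visual Studio feature, and the names BANNED_NAMES and find_banned_names are hypothetical. It only catches plain assignments, not function parameters or attribute names.

```python
import ast

BANNED_NAMES = {"x"}  # hypothetical naming policy for this example


def find_banned_names(source, banned=BANNED_NAMES):
    """Return (line, name) pairs for every assignment to a banned variable name."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # ast.Store marks a name that is being assigned to, not merely read.
        if (
            isinstance(node, ast.Name)
            and isinstance(node.ctx, ast.Store)
            and node.id in banned
        ):
            hits.append((node.lineno, node.id))
    return sorted(hits)


example = "def f():\n    x = 1\n    y = x + 1\n    return y\n"
for line, name in find_banned_names(example):
    print(f"line {line}: variable name '{name}' is not allowed")
```

A build script could run a check like this first and exit with a non-zero status when the list is non-empty, which is one way to make the build fail.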

How to find the size of a reg in verilog?

I was wondering if there is a way to compute the size of a reg in Verilog. I researched it quite a bit and found $size(a), but it only exists in SystemVerilog and won't work in my Verilog program.
Does anyone know an alternative for this?
As a side note, I'm also having some trouble with my test bench: when I update a value in the file, that change is not taken into consideration when I simulate. I've been told I might be using an old test bench, but the one I am continuously simulating is the only one available in this project.
EDIT:
To give you an idea of what's the problem: in my code there is a "start" signal and when it is set to 1, the operation starts. Otherwise, it stays idle. I began writing the test bench with start=0, tested it and simulated it, then edited the test bench by setting start to 1. But when I simulate it, the start signal remains 0 in the waveform. I tried to check whether I was using another test bench, but it is the only test bench I am using in this project.
Given that I was on a deadline, I reworked the code so that it would adapt to the "frozen" test bench. I am now getting all the results I want, but I wanted to test some other features of my code, so I created a new project and copy-pasted the code into new files (including the same test bench). But when I ran a simulation, the waveform displayed wrong results (even though I was using the exact same code in all modules and in the test bench). Any idea why?
Any help would be appreciated :)
There is a standardised way to do this, but it requires you to use the VPI, which I don't think you get with ModelSim's student edition. In short, you have to write C code and dynamically link it to the simulator. In the C code, you can get object properties using routines such as vpi_get. Useful properties might be vpiSize, which is what you want, vpiLeftRange, vpiRightRange, and so on.
Having said all that, Verilog is essentially a static language, and objects have to be declared with a static width using constant expressions. Having a run-time method to determine an object's size is therefore of pretty limited value (since you should already know it), and may not solve whatever problem you actually have. Your question would make more sense for VHDL (and SystemVerilog?), which are much more dynamic.
Note on Icarus: the developers have pushed lots of SystemVerilog features back into the main language. If you take advantage of this, you may find that your code is not portable.
Second part of your question: you need to be specific about what your problem actually is.

Problems with Code in the Frege REPL

While trying to learn Frege I copied some code from Dierk's Real World Frege to the online REPL and tried to execute it (see also How to execute a compiled code snipped in Frege online repl). The scripts I've tried don't compile :-(
What am I doing wrong?
Here are examples of what does not compile:
println ( 2 *-3 ) -- unlike haskell, this will work!
and the whole ValuesAndVariables.fr code
It is unavoidable that over the course of more than a year, an evolving language (and its libraries) change so that older code will not compile anymore.
It would be nice if we could see an example instead of a generalization like "most".
The next best thing would be to have an issue in Dierk's project that points to the error(s).
But the very best would be an effort to find out what is wrong. This would also intensify your learning process.
Here are two resources that could help:
https://github.com/Frege/frege/wiki/New-or-Changed-Features -- the release notes for every release, contains a summary of things that have changed between releases, and especially the reasons why code would not compile anymore and how to correct it.
http://www.frege-lang.org/doc/fregedoc.html -- the library docs. May explain possible errors like import not found, or missing identifiers.
Go, give it a try. And I'm convinced Dierk will be happy to accept pull requests.
Edit: Fixes for announced errors.
The error in:
println ( 2 *-3 )
stems indeed from a syntactical change.
As of recently, adjacent operators must be separated by at least one space.
Hence
println (2 * -3)
However, the error message you got here was:
can't resolve `*-`, did you mean `-` perhaps?
which could have triggered the idea that it tries to interpret *- as a single operator.
The other error in ValuesAndVariables1.fr is indeed a show stopper for a beginner. The background is that we have one pi that has type Double and one that has type Float and potentially many more through type class Floating, so one needs to tell which one to print.
The following will work:
import Prelude.Math -- unless already imported
println Float.pi
println (pi :: Double)
The online REPL at http://try.frege-lang.org is currently based on Frege V3.23.370-g898bc8c. Dierk's code examples are based on V3.21.500-g88270a0 (which can be seen in the Gradle build file).
It seems that the Frege developers decided to change the Frege syntax slightly between those versions. The result is that you will not be able to run these code snippets as written in the online REPL anymore.

Common blocks, FORTRAN, and DLLs

I am a modeler who programs. I would never call myself a programmer, yet I program in C# and in FORTRAN. I have a FORTRAN model that I have connected to some C# code through a DLL. I have found that I must have a common block in order to keep the variables in memory in the DLL. I have also found that I cannot use more than one include statement. The common blocks in my include file for the common variables are all unlabeled. Chapman (2008), "Fortran 95/2003 for Scientists and Engineers", states "The unlabeled COMMON statement should never be used ...".
How can I ensure that I do not have corrupted memory in my common file? I guess I can experiment, but I was hoping to have some sound advice on this. I am using Lahey-F ver 7.2 within Microsoft Visual Studio 2008.
Anyone, any thoughts?
As a programmer who models, what I'd like to know is exactly why Chapman states that the unlabelled COMMON should not be used. From what I can remember, the blank / unnamed common block is global and must be defined in the main program.
The only way to be sure about this is probably to make a simple Fortran DLL and then disassemble it to see what it has done with the common block and where it put it.
Also it'd be useful if you could paste examples of errors etc. when you try to use a named common. It may be that there is a better solution once we understand exactly what's not working.