What language is a .NET executable written in?

I thought it would be Common Intermediate Language, but when I open one in Notepad it does not look like that at all. Does it just look uglier in reality than in tutorials? Or is it some bytecode form that is further compiled from CIL?

It is CIL. But CIL is the name of the binary format, not of the textual "assembler" you're thinking of.
Can you possibly imagine that .NET assemblies would be text files?

A .NET executable is a binary file that has a PE header (same as a native executable, but with slightly different values). The PE header tells the OS to load the CLR, which in turn loads the assembly.
The content beyond the header is a binary representation of the CIL code, plus some metadata and other stuff. The text you see in tutorials is the text representation of CIL, in much the same way that the assembly language code you see in a tutorial about assembly language programming is just the text representation of the binary machine code.
See http://www.yetanotherchris.me/home/2010/7/12/inside-net-assemblies-part-1.html (among many others) for more information.
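For a concrete picture, here is a trivial C# method together with (roughly) the CIL text a disassembler such as ildasm would show for it; the exact output depends on compiler version and build settings, so treat this as a sketch:

    // C# source
    static int Add(int a, int b)
    {
        return a + b;
    }

    // Roughly what ildasm shows for the compiled method (Release build):
    //
    // .method private hidebysig static int32 Add(int32 a, int32 b) cil managed
    // {
    //   .maxstack 8
    //   ldarg.0   // push the first argument
    //   ldarg.1   // push the second argument
    //   add       // pop both, push the sum
    //   ret       // return the top of the stack
    // }
    //
    // In the .dll/.exe itself these instructions occupy just a few bytes
    // (ldarg.0, for example, is the single byte 0x02), which is why the
    // file looks like gibberish in Notepad.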

A .NET executable is usually not written directly; it is compiled from another language such as C#, F# or VB.NET.
The contents of a .NET executable can be viewed with the ILDASM tool.
First there is a manifest, which is used for reflection, signatures and other metadata purposes.
Secondly there are the MSIL instructions themselves. These are in a kind of bytecode format, but ILDASM will show you what the instructions are.
And there are sometimes resources such as imagery, sounds or other content packed into the executable.
The executable is compiled to native code either at install time (this is what NGen does; I think it is uncommon) or just in time, as a precursor to execution. The resulting native code can be cached for reuse. (This is what I was told during PDC 2001, so it might be out of date.)
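As a small illustration of the manifest/metadata part, here is a hedged C# sketch that loads an assembly and dumps some of the same information ILDASM renders; the path "X.dll" is just a placeholder:

    using System;
    using System.Reflection;

    class MetadataDump
    {
        static void Main()
        {
            // "X.dll" is a placeholder; point this at any managed assembly.
            Assembly asm = Assembly.LoadFrom("X.dll");

            // The assembly's identity, taken from its manifest.
            Console.WriteLine(asm.FullName);

            // Referenced assemblies, also recorded in the manifest.
            foreach (AssemblyName dep in asm.GetReferencedAssemblies())
                Console.WriteLine("  references " + dep.Name);

            // Types come from the metadata tables that sit next to
            // the MSIL in the file.
            foreach (Type t in asm.GetTypes())
                Console.WriteLine("  type " + t.FullName);
        }
    }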

Related

Can different file extension executables be disassembled into the same instruction set OpCode?

This is a question from someone clueless about disassembly and decompiling in general, so bear with me. I am curious to know whether executable files (for example, those with the extensions listed in http://pcsupport.about.com/od/tipstricks/a/execfileext.htm ) can be disassembled into assembly language so that I can analyze opcode patterns across files.
My logic is that once all these different file types are in opcode form, they are all on the same level, regardless of language barriers, etc., so it would be easier to analyze them.
How feasible is this?
EDIT: Example: I have an .exe file and an .app file. If I disassembled both, could I compare their opcodes on the same OS? If not, what about executables from the same OS; for example, could I compare opcodes across any two disassembled Windows executables?
EDIT2: How will obfuscators affect my efforts?
In short, no.
The problem is that there is no practical universal instruction set. In practice, every computer architecture has its own instruction set (or sometimes several instruction sets). A native executable format like .exe is compiled to the machine's instruction set, which will differ based on the ISA targeted.
I'm not deeply familiar with the .app format, but it appears to be a macOS application bundle: a directory that contains, among other resources, a Mach-O executable. So if you have an .exe and an .app targeting the same ISA, you could conceivably disassemble and compare them.
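If you want to check up front whether two Windows executables even target the same ISA, the PE header records it in the COFF "Machine" field. A minimal C# sketch (offsets per the published PE/COFF layout; error handling omitted):

    using System;
    using System.IO;

    class PeMachine
    {
        // Returns the COFF "Machine" field, which identifies the ISA.
        static ushort MachineOf(string path)
        {
            byte[] b = File.ReadAllBytes(path);
            // e_lfanew at 0x3C points to the "PE\0\0" signature;
            // the 2-byte Machine field follows 4 bytes after it.
            int peOffset = BitConverter.ToInt32(b, 0x3C);
            return BitConverter.ToUInt16(b, peOffset + 4);
        }

        static void Main(string[] args)
        {
            foreach (string path in args)
            {
                ushort m = MachineOf(path);
                string isa = m switch
                {
                    0x014C => "x86",
                    0x8664 => "x64",
                    0xAA64 => "ARM64",
                    _      => "other (0x" + m.ToString("X4") + ")"
                };
                Console.WriteLine(path + ": " + isa);
            }
        }
    }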
Obfuscation makes things much harder because it is difficult to get a reliable disassembly, let alone deal with stuff like self-modifying code.

What is the difference between byte code files like Java bytecode and machine code executables like ELF?

What are the differences between byte code binary executables such as Java class files, Parrot bytecode files or CLR assemblies, and machine code executables such as ELF, Mach-O and PE?
What are the distinctive differences between the two?
For example, what part of a class file corresponds to the .text section in the ELF structure?
Or: the ELF and PE headers record a target architecture, but the class file does not.
Java class file
ELF file
PE file
Byte code is, as imulsion noted, an intermediate step, right before compilation into machine code. Because the last step is left to load time (and often runtime, as is the case with Just-In-Time (JIT) compilation), byte code is architecture independent: the runtime (the CLR for .NET or the JVM for Java) is responsible for mapping the byte code opcodes to their underlying machine code representation.
By comparison, native code (Windows: PE, PE32+; OS X/iOS: Mach-O; Linux/Android/etc.: ELF) is compiled code, suited for a particular architecture (Android/iOS: ARM; most everything else: Intel 32-bit (i386) or 64-bit). These are all very similar, but still require sections (or, in Mach-O parlance, "load commands") to set up the memory structure of the executable as it becomes a process (old DOS supported the ".com" format, which was a raw memory image). In all the above, you can say, roughly, the following (a short sketch after this list shows how to read the section list out of an ELF file):
Sections with a "." are created by the compiler, and are "default" or expected to have default behavior
The executable has the main code section, usually called "text" or ".text". This is native code, which can run on the specific architecture
Strings are stored in a separate section. These are used for hard-coded output (what you print out) as well as symbol names.
Symbols - which are what the linker uses to put together the executable with its libraries (Windows: DLLs, Linux/Android: shared objects, OS X/iOS: .dylibs or frameworks) - are stored in a separate section. Usually there is also a "PLT" (Procedure Linkage Table), which lets the compiler put in stubs for the functions you call (printf, open, etc.) that the dynamic linker can connect when the executable loads.
Import table (in Windows parlance.. In ELF this is a DYNAMIC section, in OS X this is a LC_LOAD_LIBRARY command) is used to declare additional libraries. If those aren't found when the executable is loaded, the load fails, and you can't run it.
The export table (for libraries/dylibs/etc.) lists the symbols which the library (or, in Windows, even an .exe) exports so that others can link against them.
Constants are usually in what you see as the ".rodata" section.
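Here is the promised sketch: a hedged C# program that lists the section names of a 64-bit little-endian ELF file, with the standard Elf64 header offsets hard-coded and error handling omitted:

    using System;
    using System.IO;
    using System.Text;

    class ElfSections
    {
        static void Main(string[] args)
        {
            byte[] b = File.ReadAllBytes(args[0]);

            // ELF magic: 0x7F 'E' 'L' 'F'; byte 4 is 2 for 64-bit files.
            if (b[0] != 0x7F || b[1] != (byte)'E' || b[2] != (byte)'L' ||
                b[3] != (byte)'F' || b[4] != 2)
                throw new InvalidDataException("not a 64-bit ELF file");

            long shoff     = BitConverter.ToInt64(b, 0x28);  // section header table offset
            int  shentsize = BitConverter.ToUInt16(b, 0x3A); // size of one entry
            int  shnum     = BitConverter.ToUInt16(b, 0x3C); // number of entries
            int  shstrndx  = BitConverter.ToUInt16(b, 0x3E); // index of name-string section

            // sh_offset of the string table lives 0x18 bytes into its header.
            long strtab = BitConverter.ToInt64(b, (int)(shoff + shstrndx * shentsize) + 0x18);

            for (int i = 0; i < shnum; i++)
            {
                long hdr  = shoff + i * (long)shentsize;
                int  name = BitConverter.ToInt32(b, (int)hdr); // sh_name: offset into string table
                int  p    = (int)(strtab + name);
                int  end  = p;
                while (b[end] != 0) end++;                     // names are NUL-terminated
                Console.WriteLine(Encoding.ASCII.GetString(b, p, end - p));
            }
        }
    }

Run against a typical Linux binary it prints .text, .rodata, .data, .bss and friends.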
Hope this helps. Really, your question was vague.
TG
Byte code is a 'halfway' step. So the Java compiler (javac) will turn the source code into byte code. Machine code is the next step: the runtime takes the byte code, turns it into machine code (which can be read by the computer) and then executes your program by running that machine code. Computers cannot read source code directly, and compilers for platforms like Java and .NET deliberately stop at this halfway step rather than translating straight to machine code, because the halfway step is what keeps the compiled program portable across architectures.
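To see how architecture-neutral such a halfway format is, here is a small hedged C# sketch that reads just the header of a Java class file: unlike an ELF or PE header, there is no machine/architecture field to read, only a magic number and a format version:

    using System;
    using System.IO;

    class ClassFileHeader
    {
        static void Main(string[] args)
        {
            using var f = new BinaryReader(File.OpenRead(args[0]));

            // Class files are big-endian: magic 0xCAFEBABE, then minor
            // and major format version. Nothing in the header names a
            // CPU architecture.
            byte[] magic = f.ReadBytes(4);
            if (magic[0] != 0xCA || magic[1] != 0xFE ||
                magic[2] != 0xBA || magic[3] != 0xBE)
                throw new InvalidDataException("not a class file");

            int minor = (f.ReadByte() << 8) | f.ReadByte();
            int major = (f.ReadByte() << 8) | f.ReadByte();
            Console.WriteLine($"class file format version {major}.{minor}");
        }
    }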
Note that ELF binaries don't necessarily need to be machine/arch specific per se.
The interesting piece is the "interpreter" header field: it holds the path of a loader program that is executed instead of the actual binary. That loader is then responsible for loading the actual program, loading and linking libraries, etc. This is how e.g. ld.so comes in.
Theoretically one could create an ELF binary that holds java bytecode (or a complete jar). This just needs some appropriate "interpreter" program which starts up a JVM and loads the code from the binary into it.
Not sure whether this actually has been done before, but certainly possible.
The same can be done with pretty much any non-native code.
It also could serve for direct multi-arch support via some VM like qemu:
Let the target platform (libc + linker scripts) put the arch name into the interpreter program name (e.g. /lib/ld.so.x86_64, /lib/ld.so.armhf, ...).
Then, on a particular arch (e.g. x86_64), the one with the native arch name would point to the original ld.so, while the others would point to a special loader that calls up user-mode emulation, something like qemu-XXX.
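Reading that "interpreter" field yourself is straightforward; it lives in a PT_INTERP entry of the program header table. A hedged C# sketch, again for 64-bit little-endian ELF files with Elf64 offsets hard-coded and no error handling:

    using System;
    using System.IO;
    using System.Text;

    class ElfInterp
    {
        const int PT_INTERP = 3;

        static void Main(string[] args)
        {
            byte[] b = File.ReadAllBytes(args[0]);

            long phoff     = BitConverter.ToInt64(b, 0x20);  // program header table offset
            int  phentsize = BitConverter.ToUInt16(b, 0x36);
            int  phnum     = BitConverter.ToUInt16(b, 0x38);

            for (int i = 0; i < phnum; i++)
            {
                int hdr = (int)(phoff + i * phentsize);
                if (BitConverter.ToInt32(b, hdr) != PT_INTERP)
                    continue;
                long off = BitConverter.ToInt64(b, hdr + 0x08); // p_offset
                long len = BitConverter.ToInt64(b, hdr + 0x20); // p_filesz
                // The segment holds the NUL-terminated loader path,
                // e.g. /lib64/ld-linux-x86-64.so.2.
                Console.WriteLine(Encoding.ASCII.GetString(b, (int)off, (int)len - 1));
            }
        }
    }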

When is MSIL generated?

Let's say I have a C# Windows class library in my solution and I build it in my VS2010 IDE.
The output here in my bin directory is X.dll
1) X.dll does not contain MSIL at this stage but "compressed byte code".
Is this true?
2) This "compressed byte code" is converted to MSIL somehow.
When does this occur?
3) When X.dll is accessed, the JIT compiler of the CLR takes the portion of MSIL it needs and converts it into machine code.
Am I good on this final part?
Can anybody help in filling in the gaps in my understanding here?
X.dll contains MSIL bytecode after you build it with Visual Studio. You can prove this by disassembling it with ildasm.
At some time between assembly loading and the actual execution of the code, the MSIL is translated to native code. I am not familiar with where exactly this is done, but I would suspect at assembly load.
To build upon Dark Falcon's response, on question no. 2: the IL code (not compressed byte code) is converted to native code by the JIT compiler, method by method, each at its first call. So a shorter answer for no. 2 is "when the method is first called". Fields are not subject to this; properties are disguised methods themselves.
As for the third question, no, the IL is not JITed when it is loaded. See above.
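You can even watch this method-by-method behavior from code: each method body is normally JIT-compiled at its first call, and RuntimeHelpers.PrepareMethod asks the runtime to compile one eagerly instead. A minimal sketch using those real APIs:

    using System;
    using System.Reflection;
    using System.Runtime.CompilerServices;

    class JitDemo
    {
        static int Square(int x) => x * x;

        static void Main()
        {
            // Normally Square's IL would be JIT-compiled the first time
            // it is called. PrepareMethod asks the runtime to compile it
            // now, ahead of any call.
            MethodInfo m = typeof(JitDemo).GetMethod(
                "Square", BindingFlags.NonPublic | BindingFlags.Static);
            RuntimeHelpers.PrepareMethod(m.MethodHandle);

            // This call now runs the already-compiled native code.
            Console.WriteLine(Square(5));
        }
    }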

What form is a DLL in & what makes it processor dependent

I know a DLL contains one or more exported functions that are compiled, linked, and stored separately.
My question is not about how to create it, but about the form in which it is stored: is it going to be 0's & 1's, or assembly commands like ADD, MUL, DIV, MOV, CALL, RETURN, etc.?
Also, what makes it processor dependent (like x86, x87, IBM 700 instruction set)?
Can someone please explain it briefly?
First of all, everything in a computer is in the form of "0's & 1's". The fact that the computer can display some of these as text, pictures, sounds, 3D models, etc. is just a matter of how you interpret them. But down there, at the metal, it's all just "0's & 1's" (also known as bits). Note though that they are always grouped together in groups of 8, and these are called "bytes". It's really for the sake of efficiency, because operating with every bit individually would be too tedious. Actually, today's computers don't even operate on single bytes anymore (or rather - they do it very rarely). Mostly you operate with 4 or 8 bytes at a time, depending on whether you have a 32-bit or 64-bit CPU (that's in layman's terms, it's actually a bit more complicated than that).
As for a .DLL file - like an .EXE file, it contains bytes that describe instructions that a CPU can execute. The CPU takes these bytes directly from the .DLL/.EXE and executes them without any further modifications. That's why these files are CPU-specific. In different CPU architectures the same combination of bytes means different things, so a .DLL/.EXE will run correctly only on the CPU for which it was designed. On other CPUs these bytes will mean some other instructions, and when run, the program will most likely do some utter nonsense and crash immediately.
The assembly commands you mentioned also deserve an explanation. "Assembler" is not a language that a CPU can understand. It's a language a human can understand. It was created because writing directly in machine code (the bytes that the CPU actually understands) is very difficult. What you get is utter gibberish on the screen (try opening some .EXE file in Notepad!) but every bit has to be precisely set for it to work.
So assembly language is basically the same thing, except these instructions are written in text that humans can read. For every machine code that a CPU can understand, there is an instruction with a human-friendly name. An assembler simply reads these instructions and replaces them with the bytes that represent the actual instructions for the CPU to execute. It's a 1:1 operation. Every command in assembly language matches a single machine instruction (again, in layman's terms).
So you see, there isn't even a single assembly language. Every CPU architecture has its own assembly language, because they each have different instructions.
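To see how literal "the CPU takes these bytes and executes them" is, here is a hedged C# sketch (Windows only, x86/x64 only; the byte encodings are standard x86: B8 imm32 = mov eax,imm32, C3 = ret) that puts six raw machine-code bytes into executable memory and calls them as a function:

    using System;
    using System.Runtime.InteropServices;

    class RawMachineCode
    {
        [DllImport("kernel32.dll")]
        static extern IntPtr VirtualAlloc(IntPtr addr, UIntPtr size,
                                          uint allocType, uint protect);

        delegate int IntFunc();

        static void Main()
        {
            // mov eax, 42 ; ret  -- raw x86/x64 machine code.
            byte[] code = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

            // Ask Windows for memory we are allowed to execute.
            // 0x3000 = MEM_COMMIT | MEM_RESERVE, 0x40 = PAGE_EXECUTE_READWRITE.
            IntPtr mem = VirtualAlloc(IntPtr.Zero, (UIntPtr)code.Length,
                                      0x3000, 0x40);
            Marshal.Copy(code, 0, mem, code.Length);

            // Treat those bytes as a function and call it.
            var f = Marshal.GetDelegateForFunctionPointer<IntFunc>(mem);
            Console.WriteLine(f()); // prints 42
        }
    }

An assembler performing the 1:1 translation described above would produce exactly those six bytes from the two-line listing "mov eax, 42" / "ret".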
Note though that all this applies to native .DLL/.EXE files. .NET files are different - they don't contain machine code, but rather instructions for an abstract, nonexistent CPU. It's like Java bytecodes. When a .NET .DLL/.EXE is run, the .NET runtime translates it from the abstract instructions to the instructions that the specific CPU can understand. They use a lot of tricks to make this very fast, so these files run almost as fast as simple .DLL/.EXE files.
Does this clear things up? :)
Native DLLs (not .NET assemblies) usually contain machine code that can only be run on a certain platform. The machine code is a sequence of bytes that the processor treats as instructions (ADD, MOV, etc.).
In Windows, DLLs are stored in the PE format, which is basically a collection of sections holding the information about how to map the file into memory. Some sections contain the program's code (which is of course processor dependent), others contain the program's data, others the exported and imported functions, and so on.
Managed code is compiled to an intermediate language that is JITed by the run-time as it is executed. Therefore, your DLL won't contain any processor-dependent code and you'll be able to execute your program on any platform with the relevant run-time.
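You can tell the two kinds of DLL apart without running them: a managed DLL has a non-empty CLR runtime header (data directory 14) in its PE optional header. A hedged C# sketch, with offsets per the PE/COFF spec and no error handling:

    using System;
    using System.IO;

    class ManagedCheck
    {
        static bool IsManaged(string path)
        {
            byte[] b = File.ReadAllBytes(path);
            int pe  = BitConverter.ToInt32(b, 0x3C);      // "PE\0\0" offset
            int opt = pe + 4 + 20;                        // optional header start
            ushort magic = BitConverter.ToUInt16(b, opt); // 0x10B=PE32, 0x20B=PE32+
            // The data directories start at offset 96 (PE32) or 112 (PE32+)
            // within the optional header; entry 14 is the CLR runtime header.
            int dirs = opt + (magic == 0x20B ? 112 : 96);
            uint rva = BitConverter.ToUInt32(b, dirs + 14 * 8);
            return rva != 0;
        }

        static void Main(string[] args)
        {
            foreach (string p in args)
                Console.WriteLine(p + ": " + (IsManaged(p) ? "managed (IL)" : "native"));
        }
    }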
It depends on your DLL. Generally, a DLL contains executable code just like an EXE file. Those code DLLs are processor dependent, since the code can only be executed on a specific platform. The code is stored using the same "format" as in an EXE file (binary machine code).
However, a DLL can sometimes contain only data: such DLLs are called "resource DLLs" and are not processor dependent at all. They act as containers for data files used by applications.
Note that many DLLs are hybrids: they contain both code and resources. For example, most of the DLLs which make up the user part of the Windows operating system are hybrids: you can open them using Visual Studio or a resource explorer to see the resources (the data segments) they contain, or open them with Dependency Walker or dumpbin to see the functions (the code segments) they contain.
(Of course this answer is really Windows-specific; I don't know about .so files, which are the Linux equivalent of a DLL.)
Both a DLL and an EXE contain executable code.
In the case of a DLL, it doesn't have the necessary parts to be directly executable. It must be called from another piece of executable code. One DLL can call another, but all must ultimately be called from an EXE.
So the rules about what's compatible with what processor that apply to EXEs also apply to DLLs.

DLL and LIB files - what and why?

I know very little about DLL's and LIB's other than that they contain vital code required for a program to run properly - libraries. But why do compilers generate them at all? Wouldn't it be easier to just include all the code in a single executable? And what's the difference between DLL's and LIB's?
There are static libraries (LIB) and dynamic libraries (DLL) - but note that .LIB files can be either static libraries (containing object files) or import libraries (containing symbols to allow the linker to link to a DLL).
Libraries are used because you may have code that you want to use in many programs. For example if you write a function that counts the number of characters in a string, that function will be useful in lots of programs. Once you get that function working correctly you don't want to have to recompile the code every time you use it, so you put the executable code for that function in a library, and the linker can extract and insert the compiled code into your program. Static libraries are sometimes called 'archives' for this reason.
Dynamic libraries take this one step further. It seems wasteful to have multiple copies of the library functions taking up space in each of the programs. Why can't they all share one copy of the function? This is what dynamic libraries are for. Rather than building the library code into your program when it is compiled, it can be run by mapping it into your program as it is loaded into memory. Multiple programs running at the same time that use the same functions can all share one copy, saving memory. In fact, you can load dynamic libraries only as needed, depending on the path through your code. No point in having the printer routines taking up memory if you aren't doing any printing. On the other hand, this means you have to have a copy of the dynamic library installed on every machine your program runs on. This creates its own set of problems.
As an example, almost every program written in 'C' will need functions from a library called the 'C runtime library', though few programs will need all of the functions. The C runtime comes in both static and dynamic versions, so you can determine which version your program uses depending on particular needs.
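In C# terms, the dynamic case looks like this: the program merely declares a function that lives in a DLL, and the loader maps one shared copy of that DLL into each process that asks for it. A minimal sketch using a well-known Windows API:

    using System;
    using System.Runtime.InteropServices;

    class DynamicLinkDemo
    {
        // Resolved at load/call time from the shared user32.dll mapping;
        // the function's code is not copied into this executable.
        [DllImport("user32.dll", CharSet = CharSet.Unicode)]
        static extern int MessageBoxW(IntPtr hWnd, string text,
                                      string caption, uint type);

        static void Main()
        {
            MessageBoxW(IntPtr.Zero, "Hello from a shared DLL!",
                        "Dynamic linking", 0);
        }
    }

On modern .NET you can also load a native library only when it is actually needed, via System.Runtime.InteropServices.NativeLibrary.Load, which mirrors the "no point having the printer routines in memory if you aren't printing" point above.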
Another aspect is security (obfuscation). Once a piece of code is extracted from the main application and put in a "separate" dynamic-link library, it is easier to attack and analyse (reverse-engineer), since it has been isolated. When the same piece of code is kept in a LIB (static library), it is part of the compiled (linked) target application, and it is thus harder to isolate (differentiate) that piece of code from the rest of the target binaries.
One important reason for creating a DLL/LIB rather than just compiling the code into an executable is reuse and relocation. The average Java or .NET application (for example) will most likely use several 3rd party (or framework) libraries. It is much easier and faster to just compile against a pre-built library, rather than having to compile all of the 3rd party code into your application. Compiling your code into libraries also encourages good design practices, e.g. designing your classes to be used in different types of applications.
A DLL is a library of functions that are shared among other executable programs. Just look in your windows/system32 directory and you will find dozens of them. When your program creates a DLL it also normally creates a lib file so that the application *.exe program can resolve symbols that are declared in the DLL.
A .lib is a library of functions that are statically linked to a program -- they are NOT shared by other programs. Each program that links with a *.lib file has all the code in that file. If you have two programs A.exe and B.exe that link with C.lib then each A and B will both contain the code in C.lib.
How you create DLLs and libs depend on the compiler you use. Each compiler does it differently.
One other difference lies in performance.
Because a DLL is loaded at run time by the .exe(s), calls into it go through an extra level of indirection (the import table), so there can be a small overhead relative to static linking; in exchange, the loaded DLL image can be shared between processes, saving memory.
With a .lib, on the other hand, the code is linked statically at compile time into every process that requests it. Each .exe carries its own copy, which removes the indirection at the cost of duplicating the code.