Can different file extension executables be disassembled into the same instruction set OpCode? - executable

This is a question from someone clueless about disassembly and decompiling in general, so bear with me. I am curious to know if executable file extensions (for example, listed in http://pcsupport.about.com/od/tipstricks/a/execfileext.htm ) can be disassembled into assembly language so then I can analyze opcode patterns across files.
My logic is that once all these different file extensions are in opcode form, they are all on the same level, regardless of language barriers, etc, so it would be easier to analyze them.
How feasible is this?
EDIT: Example. I have an .exe file and an .app file. If I disassembled both, could I compare them across opcode on the same OS? If not, how about executable files from the same OS. For example, for all executable files on Windows, if I disassembled both, could I compare opcode across each?
EDIT2: How will obfuscators affect my efforts?

In short, no.
The problem is that there is no practical universal instruction set. In practice, every computer architecture has its own instruction set (or sometimes several instruction sets). A native executable format like .exe is compiled to the machine's instruction set, which will differ based on the ISA targeted.
I'm not familiar with the .app format, but it appears to be some sort of archive containing executable code. So if you have an exe and app targeting the same ISA, you could conceivably diassemble and compare.
Obfuscation makes things much harder because it is difficult to get a reliable disassembly, let alone deal with stuff like self modifying code.

Related

Can malware binaries be in packed form?

Recently I'm reading malware analysis. I'm going through this Malware Repository (https://github.com/ytisf/theZoo). Here we can find malware binaries. Can binaries be in packed form? If so, how can we say that these binaries are packed or not?
PS: Packers compress a program and will try to hide internals from us(sort of compression or encryption). I got a doubt regarding this. Can binaries be in the packed form or not?
Edit2: In this repository, they just zipped it to be safe which is not actual packing I'm talking about. After unzipping, we will get a binary. Whether that can be in packed form or not?
First of all, the distinction you make between "packers" and archiver programs (ZIP, etc) or compression programs doesn't appear to have any basis.
A "packed" executable cannot be executed directly. It must be unpacked first. This is exactly the same as (say) a ZIP file containing malware, or a malware file that has been compressed with a standard compression program.
What about a "packed" executable that has been created by a program that does the "packing" in a secret way ... to evade detection? Well that won't work. The malware still has to be unpacked before it can be executed. So that means that the bad gut now has a second problem: getting the unpacker onto the victims machine. And once someone (an anti-hacker) gets hold of the super-secret unpacker, it is no longer secret. It can be reverse engineered ... or simply used as-is by an AV product on suspicious binary files.
The only practical use of "packing" that I can think of is to add self-unpacking functionality to the malware. The malware (as distributed) would consist of an executable with a small amount of code that implemented the unpacker. The rest of the executable would be packed code that implements the nasty stuff. When the user runs the malware, it would unpack the packed code, load it into memory and start executing.
However, there are potential ways to detect or prevent this kind of thing.
If the unpacker writes the executable code into a file prior to loading it, an AV product could detect that.
If the packer attempts to load code into itself, there are ways that could be blocked; e.g. using memory protection hardware + the OS, etc to stop the unpacker from creating memory segments containing executable code; see https://en.wikipedia.org/wiki/Executable_space_protection.
An AV could look for the signature in the packed code, or it cold look for a signature in the unpacker code.
In short, malware could use some kind of "packing" to hide itself, but there must be an executable component somewhere to unpack it.
If so, how can we say that these binaries are packed or not?
If the malware is distributed as a non-executable you figure out what is going to unpack it, and then see if that process is going to give you an executable.
If the malware is a self-unpacking executable, you reverse engineer the unpacking component to figure out how it works.

How do large software systems composed of multiple executables work?

First of all let me clear up some confusion arising from my potential misuse of vocabulary in the question:
By 'executable' I mean a single executable file that is build from sources containing one main function (my background is in C++) and potentially lots of classes and the like. This 'large software system' is a collection of such executables that communicate with each other and work together to achieve some goal.
I'm used to writing simple programs that have a clear entry point and exit conditions. What would be this entry point in such a software system? Which executable starts first and how do I know which one it is? There is no one global main function after all, is it? When are all other executables launched and who calls them? What other files compose such system? How are they bundled together? How is the system loaded on the target machine?
Question is way too vague, but I'll try and take a stab at it.
Which executable executes first - this would depend on requirements and the developer. If it's a sequential flow, there would definitely be an order of executing executables. For parallel processing, the return codes of each executable would be examined to determine their result.
Who calls other executables. This can be done by calling your initial executable from a shell script, and based on it's return code, deciding the next course of action. Instead of shell script, you can also opt for job schedulers like Tivoli or Cron jobs i believe.
What other files compose the system.
Well that would depend on the system being built. This is really extremely vague to even attempt to answer.
How are they bundled.
That would depend on the target system. Java apps could be .jar, in windows you can have .exe
How is the system loaded.
Again way too vague to answer

Extract Objective-c binary

Is it possible to extract a binary, to get the code that is behind the binary? With Class-dump you can see the implementation addresses, but is it possible to also see the code thats IN the implementation addresses? Is there ANY way to do it?
All your code compiles to single instructions, placed in the text section of your executable. The compiler is responsible for translating your higher level language to the processor specific instructions, which are simpler. Reverting this process would be nearly impossible, unless the code is quite simple. Some problems are ambiguity of statements, and the overall readability: local variables, for instance, will be nothing but an offset address.
If you want to read the disassembled code (the instructions of which the higher level code was compiled to) use this command in an executable:
otool -tV file
You can decompile (more accurately, disassemble) a binary and get it's assembly, but there is no way to get back the original Objective-C.
My curiosity begs me to ask why you want to do this!?
otx http://otx.osxninja.com/ is a good tool for symbolicating the otool based disassembly
It will handle both x86_64 and i386 disassembly.
and
Mach-O-Scope https://github.com/smorr/Mach-O-Scope is a a tool built on top of otx to dump it all into a sqlite3 database for browsing and annotating.
It won't give you the original source -- but it will get you pretty close providing you with the messages that are being sent around in methods.

Executable files obfuscated with different encryptors

I would like to test (and validate) an application that analyses executable files obfuscated with UPX, ASProtect, PECompact, etc... Does someone knows a place where I can find (dummy) samples apps obfuscated with different algorithms?
Many software use packers to compress/obfuscate their binaries. However, instead of searching for such already obfuscated applications, I think you can download packers and obfuscate some application by yourself. You can then use these packed binaries for testing/validating PE analyzer. Some of the well known packers are:
http://upx.sourceforge.net/
http://www.bitsum.com/pecompact.php
http://www.aspack.com/asprotect.html

What form is DLL & what makes it processor dependent

I know DLL contains one or more exported functions that are compiled, linked, and stored separately..
My question is about not about how to create it.. but it is all about in what form it is stored.. Is it going to be in the form of 0's & 1's.. or in assembly commands ADD, MUL, DIV, MOV, CALL, RETURN etc..
Also what makes it to be processor dependent.. (like x86, x87, IBM 700 instruction set)..
Can someone please explain it little briefly..!
First of all, everything in a computer is in the form of "0's & 1's" . The fact that the computer can display some of these as text, pictures, sounds, 3D models, etc. is just a matter of how you interpret them. But down there, at the metal, it's all just "0's & 1's" (also known as bits). Note though that they are always grouped together in groups of 8, and these are called "bytes". It's really for the sake of efficiency, because operating with every bit individually would be too tedious. Actually, todays computers don't even operate on single bytes anymore (or rather - they do it very rarely). Mostly you operate with 4 or 8 bytes at a time, depending on whether you have a 32-bit or 64-bit CPU (that's in layman's terms, it's actually a bit more complicated than that).
As for a .DLL file - like an .EXE file, it contains bytes that describe instructions that a CPU can execute. The CPU takes these bytes directly from the .DLL/.EXE and executes them without any further modifications. That's why these files are CPU-specific. In different CPU architectures the same combination of bytes means different things, so a .DLL/.EXE will run correctly only on the CPU for which it was designed. On other CPUs these bytes will mean some other instructions, and when run, the program will most likely do some utter nonsense and crash immediately.
The assembly commands you mentioned also deserve an explanation. "Assembler" is not a language that a CPU can understand. It's a language a human can understand. It was created because writing directly in machine code (the bytes that the CPU actually understands) is very difficult. What you get is utter gibberish on the screen (try opening some .EXE file in Notepad!) but every bit has to be precisely set for it to work.
So assembly language is basically the same thing, except these instructions are written in text that humans can read. For every machine code that a CPU can understand, there is am instruction with a human-friendly name. An assembly compiler simply reads these instructions and replaces them with the bytes that represent the actual instructions for the CPU to execute. It's a 1:1 operation. Every command in assembly language matches a single machine instruction (again, in layman's terms).
So you see, there isn't even a single assembly language. Every CPU architecture has its own assembly language, because they each have different instructions.
Note though that all this applies to native .DLL/.EXE files. .NET files are different - they don't contain machine code, but rather instructions for an abstract, nonexistent CPU. It's like Java bytecodes. When a .NET .DLL/.EXE is run, the .NET runtime translates it from the abstract instructions to the instructions that the specific CPU can understand. They use a lot of tricks to make this very fast, so these files run almost as fast as simple .DLL/.EXE files.
Does this clear things up? :)
Native DLLs (not .NET assemblies) usually contain machine code that can only be run on a certain platform. The machine code is a sequence of bytes that the processor treats as instructions (ADD, MOV, etc.).
In Windows, dll's are stored in the PE format which is basically a collection of sections that holds the information about how to map it into memory. some sections contains the program's code (which is of course processor dependent), others contains the program's data, other the exported and imported functions and so on.
Managed code is compiled to some intermediate language that is JITed by the run-time as it is executed. therefore, your dll won't contain any processor dependent code and you'll be able to execute your program on any platform with the relevant run-time.
it depends on your DLL. generally, a DLL contains executable code as an EXE file. those code DLLs are processor dependent since the code can only be executed on a specific platform. the code is stored using the same "format" as an EXE file (binary machine code).
however, a DLL can sometimes contains only data: they are then called "resource DLL" and are not processor dependent at all. they act as a container for data files used by applications.
note that many DLLs are hybrids: they contain both code and resources. for example, most DLLs which comprises the user part of the Windows operating system are hybrid: you can open them using Visual Studio or a Resource Explorer to see the resources (the data segments) they contain, or open them with Dependency Walker or dumpbin to see the functions (the code segments) they contain.
(of course this answer is really Windows specific, i don't know for .so files which are the linux equivalent of a DLL)
Both a DLL and an EXE contain executable code.
In the case of a DLL it doesn't have the necessary parts to be directly executable. It must be called from an other piece of executable code. One DLL can call another, but all must ultimately be called from and EXE.
So the rules about what's compatible with what processor that apply to EXEs also apply to DLLs.