How do large software systems composed of multiple executables work? - system

First of all let me clear up some confusion arising from my potential misuse of vocabulary in the question:
By 'executable' I mean a single executable file that is built from sources containing one main function (my background is in C++) and potentially lots of classes and the like. This 'large software system' is a collection of such executables that communicate with each other and work together to achieve some goal.
I'm used to writing simple programs that have a clear entry point and exit conditions. What would be this entry point in such a software system? Which executable starts first and how do I know which one it is? There is no one global main function after all, is there? When are all other executables launched and who calls them? What other files compose such a system? How are they bundled together? How is the system loaded on the target machine?

Question is way too vague, but I'll try and take a stab at it.
Which executable executes first. This would depend on the requirements and the developer. If it's a sequential flow, there would definitely be a defined order in which the executables run. For parallel processing, the return code of each executable would be examined to determine its result.
Who calls other executables. This can be done by calling your initial executable from a shell script and, based on its return code, deciding the next course of action. Instead of a shell script, you can also opt for a job scheduler such as Tivoli or cron, I believe.
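The "launch one executable, check its return code, decide what to do next" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the step names are hypothetical, and the current Python interpreter stands in for the separate executable files a real system would launch.

```python
import subprocess
import sys


def run_step(code):
    """Run one child process and return its exit code.

    Assumption: a real system would start a separate executable file here;
    the current interpreter is used only so the sketch is self-contained.
    """
    return subprocess.run([sys.executable, "-c", code]).returncode


def run_pipeline(steps):
    """Run (name, code) steps in order and stop at the first failure.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed = []
    for name, code in steps:
        if run_step(code) != 0:
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Hypothetical steps: the second one reports failure with exit code 3.
    steps = [
        ("extract", "import sys; sys.exit(0)"),
        ("transform", "import sys; sys.exit(3)"),
        ("load", "import sys; sys.exit(0)"),
    ]
    print(run_pipeline(steps))
```

A shell script or a job scheduler plays the same role as run_pipeline here: it is the piece that knows the order and reacts to exit codes.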
What other files compose the system.
Well that would depend on the system being built. This is really extremely vague to even attempt to answer.
How are they bundled.
That would depend on the target system. Java apps could be packaged as .jar files; on Windows you would have .exe files.
How is the system loaded.
Again, way too vague to answer.

Related

Can different file extension executables be disassembled into the same instruction set OpCode?

This is a question from someone clueless about disassembly and decompiling in general, so bear with me. I am curious to know if executable file extensions (for example, listed in http://pcsupport.about.com/od/tipstricks/a/execfileext.htm ) can be disassembled into assembly language so then I can analyze opcode patterns across files.
My logic is that once all these different file extensions are in opcode form, they are all on the same level, regardless of language barriers, etc, so it would be easier to analyze them.
How feasible is this?
EDIT: Example. I have an .exe file and an .app file. If I disassembled both, could I compare them across opcode on the same OS? If not, how about executable files from the same OS. For example, for all executable files on Windows, if I disassembled both, could I compare opcode across each?
EDIT2: How will obfuscators affect my efforts?
In short, no.
The problem is that there is no practical universal instruction set. In practice, every computer architecture has its own instruction set (or sometimes several instruction sets). A native executable format like .exe is compiled to the machine's instruction set, which will differ based on the ISA targeted.
I'm not familiar with the .app format, but it appears to be some sort of archive containing executable code. So if you have an exe and an app targeting the same ISA, you could conceivably disassemble and compare them.
Obfuscation makes things much harder because it is difficult to get a reliable disassembly, let alone deal with things like self-modifying code.
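Before comparing opcodes it helps to check that two files target the same ISA at all. The sketch below is a rough illustration, not a complete parser: it recognises the container format from the first bytes of a file and, for ELF only, reads the machine field from the header. The function names are made up for this example, and the machine table covers just a few common values.

```python
import struct

# A few well-known ELF e_machine values; many more exist.
ELF_MACHINES = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}

# Magic numbers of thin (single-architecture) Mach-O files, both byte orders.
MACHO_MAGICS = (b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",
                b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe")


def container_format(header):
    """Guess the executable container format from the leading bytes."""
    if header[:2] == b"MZ":
        return "PE"
    if header[:4] == b"\x7fELF":
        return "ELF"
    if header[:4] in MACHO_MAGICS:
        return "Mach-O"
    return "unknown"


def elf_isa(header):
    """Return the ISA an ELF file targets, read from its e_machine field."""
    if header[:4] != b"\x7fELF" or len(header) < 20:
        raise ValueError("not an ELF header")
    # Byte 5 of the identification block gives the byte order (1 = little).
    byte_order = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, "other (%d)" % machine)
```

Two files with the same format can still target different ISAs, which is why the format check alone is not enough to decide whether an opcode comparison makes sense.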

Why Does a .dll file allow programs to be modularized?

Excerpt from Microsoft's "What is a .dll?":
"By using a DLL, a program can be modularized into separate
components. For example, an accounting program may be sold by module.
Each module can be loaded into the main program at run time if that
module is installed. Because the modules are separate, the load time
of the program is faster, and a module is only loaded when that
functionality is requested. Additionally, updates are easier to apply
to each module without affecting other parts of the program. For
example, you may have a payroll program, and the tax rates change each
year. When these changes are isolated to a DLL, you can apply an
update without needing to build or install the whole program again."
Ref:http://support.microsoft.com/kb/815065
DLLs are:
loaded at runtime
able to be "dynamically loaded" (by multiple programs at the same time)
- which saves resources
- which lowers disk space requirements
But why do they promote "modularizing" programs? What would happen if there weren't .dll files? Could someone provide/expand on the example?
Modular programs provide a way of making a particular functionality available to many programs without having to include the same code in all of them. Also, they allow greater compatibility between programs since they would essentially use the same methods in common DLLs to obtain the same results.
One would write a program in a modular fashion such that different parts of the program could be maintained separately. Say you had some clever way of reading and writing your own data format to files. Say you make improvements to that technique. If the code for reading and writing the files lived in a DLL, you would only need to update the DLL. The program itself would remain unchanged.
If you have one monolithic EXE, you have to
pay for all the extra time relinking it, even if 1 source file changed (this is painful if it's > 80 MB, as is the case in large projects),
ship the entire EXE, when you could ship only a single DLL that is a fraction of the size (for patches/updates).
Breaking it up into DLLs you
have pluggability: The EXE is the host application and others can write DLLs that "plug into" the host via a well-defined interface. DLLs can be interchanged as long as they conform to the interface.
can share code across other DLLs and EXEs.
can have some DLLs be optionally loaded on demand, only if they're used, and unloaded when they're not needed
similar to above, have optional functionality. With a single EXE you have to download everything, even if some components are rarely used. With DLLs, you could have a system that downloads and installs features as needed.
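The pluggability and load-on-demand points above can be illustrated with a small host that loads modules from files at run time. This is only an analogy in Python, not how Windows loads DLLs: the Host class, the run(amount) interface and the plugin names are all hypothetical, chosen to echo the payroll/tax example from the quoted article.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_plugin(path):
    """Load a module from a file at run time, roughly as a host loads a DLL."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # The "well-defined interface" (an assumption of this sketch):
    # every plugin must export a callable named run.
    if not callable(getattr(module, "run", None)):
        raise TypeError(path.name + " does not implement run()")
    return module


class Host:
    """Host application that loads a plugin only when it is first used."""

    def __init__(self, plugin_dir):
        self.plugin_dir = plugin_dir
        self.loaded = {}

    def call(self, name, amount):
        if name not in self.loaded:
            self.loaded[name] = load_plugin(self.plugin_dir / (name + ".py"))
        return self.loaded[name].run(amount)


def demo():
    """Install two plugins, use one, and report which ones were loaded."""
    with tempfile.TemporaryDirectory() as tmp:
        plugin_dir = Path(tmp)
        (plugin_dir / "tax.py").write_text(
            "def run(amount):\n    return amount * 0.25\n")
        (plugin_dir / "payroll.py").write_text(
            "def run(amount):\n    return amount + 1\n")
        host = Host(plugin_dir)
        return host.call("tax", 100), sorted(host.loaded)
```

In demo() the payroll plugin is installed but never loaded, because nothing asked for it. Replacing tax.py with a new version would change the result without touching the host, which is the update scenario described above.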
The biggest advantage of DLLs is probably during development of the original program. Without DLLs (or some other precompiled form, such as a static library) you wouldn't be able to integrate with existing libraries without including the original source code. By including an existing library as a DLL you don't need the source, since it's all encapsulated in the DLL. It would be a nightmare to develop in frameworks like .NET without DLLs, since you constantly include other libraries...
The alternative to breaking your program down in n > 1 pieces is to keep it in n == 1 piece. Why is this bad? Well it isn't always bad (maybe the BIOS is a good example?). But for user programs it usually is. Why? First we need to define what a program is.
What is a program?
A simple "program", roughly speaking, consists of an entry point (i.e. offset to the main function), functions and global variables. A function consists of instructions and information about what local variables are needed to run the function. To be executed a program must be loaded in primary memory/RAM (the aforementioned information). Because our program has functions (and not just jump statements), that implies the existence of a stack, which implies the existence of a containing environment managing the stack. (I suppose you could have a program that manages its own stack but I'd argue then your program is not a program anymore but an environment.) This environment contains the program, starts in the entry point and executes each instruction, be it "go to this part of the RAM and add it to whatever is in this register" or "If this register is all 0 then jump ahead this many instructions and resume execution there" indefinitely or until the program gives control back to its environment. (This is somewhat simplified - context switches in multi-process environments, illegal memory access, illegal instructions, etc. can also cause control to be taken from the program.)
Anyway, so we have two options: either load the entire program at once or have it stored and loaded in pieces.
n == 1
There are some advantages to doing it all at once:
Once the program is in memory no disk access is required to execute further (unless the program explicitly asks there to be).
Since the program is compiled/linked before execution begins you can do everything without any sort of string names/comparisons - go directly to the address (or an offset).
Functions are never out of sync with one another.
n > 1
Keeping everything in one piece has some disadvantages, though, which mirror those advantages and are the reasons for splitting a program up:
Most programs don't execute all code paths most of the time. I think there are some studies showing that, in most programs, most of the execution time is spent in a fraction of the instructions present in the program. In other words, something like 20% of the program is executed 80% of the time (I just made that particular figure up, but you get the idea). If we divide our program up enough and only load sets of instructions (i.e. functions) as they are needed, then we won't waste time loading the 80% we'll never use in this execution of the program. Along these lines, we can ultimately fit more concurrently executing programs in our RAM at once if we only end up loading the fraction of each program we need.
Most programs share similar functions (i.e. storing data/trees/hashes/sorting/etc., reading input, writing output, etc.) and if each program has its own local copy then you can't reuse instruction code.
Many programs depend on the existence of others and are maintained by separate companies/groups/individuals. By releasing versioned modules we don't have to synchronize releases all the time.
Conclusion
These aren't the only points to consider, just the first ones that came to mind. I'd recommend reading about compilers, linkers and operating systems. That will answer this question more thoroughly than I can, along with the other questions I'm sure this has brought up. To recap, DLLs aren't the "best" way of packaging executable programs in all situations and circumstances; they have particular uses, advantages and disadvantages.

How can I handle platform-specific modules in Go?

I'm writing a command-line utility in Go that (as part of its operation) needs to get a password from the user. There's a great gopass module for Unix that does this, and I know how to write one for the Windows console. The problem is that the Windows module obviously won't build on *nix, and the *nix version won't build on Windows. Since Go lacks any preprocessor support (as far as I can tell), I have absolutely no idea what the right way to approach this is. I know it's possible, since Go itself must do this for its own libraries, but the tooling I'm used to (conditional imports/preprocessors/etc.) seems to be missing.
Go has build constraints, which can be specified either as a comment near the top of a .go file or as part of the file name.
One set of constraints is for the target operating system, so you can have one file for Windows and one for, e.g., Linux, and implement the same function in two different ways in the two. For example, a file whose name ends in _windows.go is only built when targeting Windows, and one ending in _linux.go only when targeting Linux.
More information on build constraints is at http://golang.org/pkg/go/build/#hdr-Build_Constraints

Compiling Massive VB.NET Project

Compiling my current project (one exe with ~90,000 LOC plus ~100 DLLs) takes about half an hour or more, depending on the speed of the workstation.
The build process is one of running devenv from Powershell scripts. This works very well with no problems.
The problem is that it is slow. I want to speed up this build process.
MSBuild (using VS-2005) is one option but there's a bug specifying icons to the vb compiler/linker on the command line such that it won't successfully link.
What other options are there to "make" VB.NET programs?
(Faster workstation is not an option.)
Do you absolutely have to compile the whole solution every time? With that many assemblies it seems unlikely that they all need to be built unless they actually change. If your solution is made up of multiple projects, you might consider creating multiple solutions in your build environment. One master solution could contain all the projects, another that includes the ones that change most often. You can then configure your build process to focus on the projects that have changed. Depending on the source control system you use, you may be able to query the system to determine which projects have changed since the last build, and only build those projects.
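One way to find out which projects changed since the last build, when the source control system can't be queried, is to fingerprint each project directory and compare against the previous run. The sketch below assumes one directory per project under a common root; the project and file names in demo() are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def project_fingerprint(project_dir):
    """Hash the relative path and contents of every file under a project."""
    digest = hashlib.sha256()
    for path in sorted(p for p in project_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(project_dir).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def changed_projects(root, previous):
    """Return (names of projects whose fingerprint changed, new fingerprints)."""
    current = {d.name: project_fingerprint(d)
               for d in sorted(root.iterdir()) if d.is_dir()}
    changed = [name for name, fp in current.items()
               if previous.get(name) != fp]
    return changed, current


def demo():
    """Build twice; only the project edited in between is reported again."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for project in ("Printing", "Reports"):
            (root / project).mkdir()
            (root / project / "Main.vb").write_text("' version 1\n")
        first, state = changed_projects(root, {})
        (root / "Reports" / "Main.vb").write_text("' version 2\n")
        second, state = changed_projects(root, state)
        return first, second
```

A build script would persist the fingerprint dictionary between runs (for example as a JSON file) and pass only the changed projects, plus anything that references them, to the compiler.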
There's NAnt, and CruiseControl.NET for continuous builds.
You mentioned that getting a faster PC is not an option, but how much memory do you have? 2GB should be the minimum for a developer machine. Also, using a fast 10K RPM hard disk makes a big difference.
Have you tried disabling any virus scanner during your build?
If you can, upgrade to the 3.5 version of MSBuild. It can build solution files and adds multiprocessor support, enabling it to build projects in parallel.
The caveat is that you need to be using project references so it knows what to build.
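Why project references matter for a parallel build can be shown with a small dependency sort: once the tool knows what each project references, it can group projects into waves whose members don't depend on each other. This is a sketch of the general idea using Python's graphlib (3.9+), not a description of how MSBuild schedules work internally; the project names are hypothetical.

```python
from graphlib import TopologicalSorter


def build_waves(references):
    """Group projects into waves; each wave depends only on earlier waves.

    `references` maps a project to the set of projects it references.
    Projects within one wave could be built in parallel.
    """
    sorter = TopologicalSorter(references)
    sorter.prepare()  # raises graphlib.CycleError on circular references
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    references = {
        "App.exe": {"Reports.dll", "Printing.dll"},
        "Reports.dll": {"Core.dll"},
        "Printing.dll": {"Core.dll"},
        "Core.dll": set(),
    }
    for wave in build_waves(references):
        print(wave)
```

Without the reference information every project would have to be built one after another in a hand-maintained order, which is the situation the caveat above warns about.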
Also, how long is it taking now? Have you looked at the CPU/Memory Usage (using something like PerfMon) to see if it is a bottleneck?
There's not much you can do to make the build process any faster short of adding more cores, CPU power, and memory to your machine, but that isn't an option in your case.
Most large projects are not self-contained in a single EXE. More often, logical units are moved into separate assemblies, each of which can be either a DLL or an EXE. The end result is a whole bunch of little assemblies, instead of one enormous one.
To cite one example, one project that I worked on was enormous, consisting of 700+ forms and tens of thousands of classes. Functionally related forms, such as those related to printing, report generation, user interrogation, etc., were self-contained in their own EXEs. If I was working on the reports, I'd exclude all projects not related to reports from the build process, which brought the compilation time down from a half hour to a few seconds.
This programming style can be tricky, but when it is done right, it simply works and works flawlessly.
If you have a large number of projects then you should try to reduce them. You can always split them up into DLLs later. The fewer projects there are, the faster the build, especially if they have to be built in a certain order.
Breaking them in smaller solutions is also an option.

Cross platform file-access tracking

I'd like to be able to track file read/writes of specific program invocations. No information about the actual transactions is required, just the file names involved.
Is there a cross platform solution to this?
What are various platform specific methods?
On Linux I know there's strace/ptrace (if there are faster methods that'd be good too). I think on mac os there's ktrace.
What about Windows?
Also, it would be amazing if it would be possible to block (stall out) file accesses until some later time.
Thanks!
The short answer is no. There are plenty of platform specific solutions which all probably have similar interfaces, but they aren't inherently cross platform since file systems tend to be platform specific.
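For the Linux strace route mentioned in the question, the file names can be pulled out of a trace log with a small parser. This is a sketch under assumptions: the log is taken to come from something like `strace -f -e trace=open,openat,creat -o trace.log <command>`, the line format shown in the test data is what strace typically prints, and only successful opens are counted.

```python
import re

# Matches open/openat/creat calls: the first quoted argument is taken as the
# path, and the number after "=" is the return value (negative on failure).
OPEN_CALL = re.compile(
    r'\b(?:open|openat|creat)\('
    r'[^"]*"(?P<path>[^"]*)"'
    r'(?P<rest>.*)\)\s*=\s*(?P<ret>-?\d+)'
)


def files_touched(trace_lines):
    """Return (files opened for reading, files opened for writing)."""
    reads, writes = set(), set()
    for line in trace_lines:
        match = OPEN_CALL.search(line)
        if match is None or int(match["ret"]) < 0:
            continue  # not an open call, or the open failed
        opened_for_write = (
            "creat(" in line
            or re.search(r"O_WRONLY|O_RDWR", match["rest"]) is not None
        )
        (writes if opened_for_write else reads).add(match["path"])
    return reads, writes
```

This only records which files were opened and in which mode; it does not confirm that data was actually read or written, and it ignores paths containing escaped quotes.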
How do I do it well on each platform?
Again, it will depend on the platform :) For Windows, if you want to track reads/writes in flight, you might have to go with IFS. If you just want to get notified of changes, you can use ReadDirectoryChangesW or the NTFS change journal.
I'd recommend using the NTFS change journal only because it tends to be more reliable.
On Windows you can use the command line tool Handle or the GUI version Process Explorer to see which files a given process has open.
If you're looking to get this information in your own program, you can use the IFS kit from Microsoft to write a file system filter. The file system filter will see all file system operations for all processes. File system filters are used in AV software to scan files before they are opened or to scan newly created files.
As long as your program launches the processes you want to monitor, you can write a debugger, and then you'll be notified every time a process starts or exits. When a process starts, you can inject a DLL to hook the CreateFile calls for that individual process. The hook can then use a pipe or a socket to report file activity to the debugger.