Cross-platform file-access tracking

I'd like to be able to track file read/writes of specific program invocations. No information about the actual transactions is required, just the file names involved.
Is there a cross-platform solution to this?
What are the various platform-specific methods?
On Linux I know there's strace/ptrace (if there are faster methods, that'd be good too). I think on macOS there's ktrace.
What about Windows?
Also, it would be amazing if it would be possible to block (stall out) file accesses until some later time.
Thanks!

The short answer is no. There are plenty of platform-specific solutions, which probably all have similar interfaces, but they aren't inherently cross-platform since file systems tend to be platform-specific.
How do I do it well on each platform?
Again, it will depend on the platform :) For Windows, if you want to track reads/writes in flight, you might have to go with a file system filter (the IFS Kit). If you just want to be notified of changes, you can use ReadDirectoryChangesW or the NTFS change journal.
I'd recommend using the NTFS change journal, if only because it tends to be more reliable.
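If being notified of changes (rather than tracking a specific process) is enough and you're in managed code, .NET's FileSystemWatcher is a convenient wrapper over ReadDirectoryChangesW. A minimal sketch - note that it watches a directory tree, not a particular program, and the path is just a placeholder:

    using System;
    using System.IO;

    class WatchDemo
    {
        static void Main()
        {
            // Watch a directory tree for changes; the path is just a placeholder.
            using var watcher = new FileSystemWatcher(@"C:\data")
            {
                IncludeSubdirectories = true,
                NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite | NotifyFilters.Size
            };

            watcher.Created += (s, e) => Console.WriteLine($"created: {e.FullPath}");
            watcher.Changed += (s, e) => Console.WriteLine($"changed: {e.FullPath}");
            watcher.Deleted += (s, e) => Console.WriteLine($"deleted: {e.FullPath}");
            watcher.Renamed += (s, e) => Console.WriteLine($"renamed: {e.OldFullPath} -> {e.FullPath}");

            watcher.EnableRaisingEvents = true;
            Console.WriteLine("Watching; press Enter to stop.");
            Console.ReadLine();
        }
    }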

On Windows you can use the command-line tool Handle or the GUI version, Process Explorer, to see which files a given process has open.
If you're looking for a way to get this information in your own program, you can use the IFS Kit from Microsoft to write a file system filter. The file system filter will see all file system operations for all processes. File system filters are used in AV software to scan files before they are opened, or to scan newly created files.

As long as your program launches the processes you want to monitor, you can write a debugger, and then you'll be notified every time a process starts or exits. When a process starts, you can inject a DLL that hooks the CreateFile calls in that process. The hook can then use a pipe or a socket to report file activity back to the debugger.

How do large software systems composed of multiple executables work?

First of all let me clear up some confusion arising from my potential misuse of vocabulary in the question:
By 'executable' I mean a single executable file that is built from sources containing one main function (my background is in C++) and potentially lots of classes and the like. This 'large software system' is a collection of such executables that communicate with each other and work together to achieve some goal.
I'm used to writing simple programs that have a clear entry point and exit conditions. What would be the entry point in such a software system? Which executable starts first, and how do I know which one it is? There is no one global main function after all, is there? When are all the other executables launched, and who calls them? What other files compose such a system? How are they bundled together? How is the system loaded onto the target machine?
The question is way too vague, but I'll take a stab at it.
Which executable executes first: this depends on the requirements and the design. If it's a sequential flow, there will be a defined order in which the executables run. For parallel processing, the return code of each executable would be examined to determine its result.
Who calls the other executables: this can be done by calling your initial executable from a shell script and, based on its return code, deciding the next course of action. Instead of a shell script, you can also opt for job schedulers like Tivoli, or cron jobs I believe.
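To make the return-code idea concrete, here is a minimal C# launcher sketch that runs one executable, checks its exit code, and decides whether to continue; step1.exe and step2.exe are made-up names:

    using System;
    using System.Diagnostics;

    class Launcher
    {
        // Run one executable and wait for its exit code.
        static int Run(string exePath, string args = "")
        {
            using var p = Process.Start(new ProcessStartInfo(exePath, args))
                          ?? throw new InvalidOperationException("failed to start " + exePath);
            p.WaitForExit();
            return p.ExitCode;
        }

        static void Main()
        {
            // "step1.exe" / "step2.exe" are hypothetical names for the system's executables.
            if (Run("step1.exe") != 0)
            {
                Console.Error.WriteLine("step1 failed; skipping the rest of the flow.");
                return;
            }
            Run("step2.exe");
        }
    }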
What other files compose the system?
Well, that would depend on the system being built. This is really too vague to even attempt to answer.
How are they bundled?
That would depend on the target platform. Java apps could be packaged as .jar files; on Windows you might have .exe files.
How is the system loaded?
Again, way too vague to answer.

Is Dropbox considered a Distributed File System?

I was just reading this https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems
The definition of a DFS seems to describe Dropbox exactly, but it isn't in the list of examples, and I assume it would be listed if it really were one.
So what is different about Dropbox which makes it not fall into this category?
Usually, when talking about distributed file systems, you expect properties that Dropbox doesn't support. For example, if you and I share a folder, I can create a file called "work.txt" in it and you can create a file "work.txt" in it, and if we do it fast enough (or while we're not syncing with Dropbox) we'll end up with conflicting copies of the same file.
A similar example would be if we both edit the same file concurrently - we'll have conflicting copies, which is something a distributed file system should prevent. In the link you refer to, this is called "Concurrency transparency; all clients have the same view of the state of the file system".
Another example of a property Dropbox doesn't support: if my computer fails (e.g., my hard drive is corrupted), I might lose data that wasn't yet uploaded to Dropbox. There is a small window in which I think my data has been written to the local disk, but if my computer fails, I lose that data.
Lastly, I'm not sure how Dropbox handles file locks. For example, MS Office takes locks on .doc files to ensure no one else is working on them at the same time. I don't think Dropbox supports this feature.
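To make the file-lock point concrete, here's a minimal C# sketch of roughly the kind of exclusive open such applications rely on: the file is opened with no sharing, so any other process (including a sync client) that tries to open it hits a sharing violation until the handle is closed. The file name is a placeholder:

    using System;
    using System.IO;

    class LockDemo
    {
        static void Main()
        {
            // "report.doc" is a placeholder. FileShare.None denies all other handles
            // until this stream is closed, so other processes get a sharing violation.
            using var doc = new FileStream("report.doc", FileMode.OpenOrCreate,
                                           FileAccess.ReadWrite, FileShare.None);

            Console.WriteLine("File locked; press Enter to release it.");
            Console.ReadLine();
        }
    }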
I've written a blog post about some of the complexities of implementing a distributed file system; you might find it helpful as well.

Why does a .dll file allow programs to be modularized?

Excerpt from Microsoft's "What is a .dll?":
"By using a DLL, a program can be modularized into separate
components. For example, an accounting program may be sold by module.
Each module can be loaded into the main program at run time if that
module is installed. Because the modules are separate, the load time
of the program is faster, and a module is only loaded when that
functionality is requested. Additionally, updates are easier to apply
to each module without affecting other parts of the program. For
example, you may have a payroll program, and the tax rates change each
year. When these changes are isolated to a DLL, you can apply an
update without needing to build or install the whole program again."
Ref: http://support.microsoft.com/kb/815065
DLLs are:
loaded at run time
can be "dynamically loaded" (by multiple programs at the same time)
- which saves resources
- lowers disk space requirements
But why do they promote "modularizing" programs? What would happen if there were no .dll files? Could someone provide/expand on the example?
Modular programs provide a way of making a particular functionality available to many programs without having to include the same code in all of them. Also, they allow greater compatibility between programs since they would essentially use the same methods in common DLLs to obtain the same results.
One would write a program in a modular fashion such that different parts of the program could be maintained separately. Say you had some clever way of reading and writing your own data format to files. Say you make improvements to that technique. If the code for reading and writing the files lived in a DLL, you would only need to update the DLL. The program itself would remain unchanged.
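The question is about DLLs in general, but the same point is easy to sketch with a .NET class library: the main program only knows an interface, and the implementation lives in a separate DLL that can be replaced on its own. Everything here (IFileCodec, FileCodec.dll) is a made-up name for illustration:

    using System;
    using System.IO;
    using System.Reflection;

    // The interface the host program compiles against; the implementation ships separately.
    public interface IFileCodec
    {
        byte[] Read(string path);
        void Write(string path, byte[] data);
    }

    class Host
    {
        static void Main()
        {
            // Load whatever version of FileCodec.dll is sitting next to the EXE.
            // Updating the file-format code only means replacing that one file.
            var asm = Assembly.LoadFrom(Path.Combine(AppContext.BaseDirectory, "FileCodec.dll"));

            foreach (var type in asm.GetTypes())
            {
                if (typeof(IFileCodec).IsAssignableFrom(type) && !type.IsAbstract)
                {
                    var codec = (IFileCodec)Activator.CreateInstance(type);
                    codec.Write("example.dat", new byte[] { 1, 2, 3 });
                    break;
                }
            }
        }
    }

In practice IFileCodec would live in a small shared assembly referenced by both the EXE and the codec DLL so the type matches on both sides; that shared contract is exactly the "well-defined interface" the pluggability point further down describes.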
If you have one monolithic EXE, you have to
pay for all the extra time relinking it, even if only one source file changed (this is painful if it's > 80 MB, as is the case in large projects),
ship the entire EXE when you could ship just a single DLL that is a fraction of the size (for patches/updates).
Breaking it up into DLLs, you
have pluggability: The EXE is the host application and others can write DLLs that "plug into" the host via a well-defined interface. DLLs can be interchanged as long as they conform to the interface.
can share code across other DLLs and EXEs.
can have some DLLs be optionally loaded on demand, only if they're used, and unloaded when they're not needed
similar to above, have optional functionality. With a single EXE you have to download everything, even if some components are rarely used. With DLLs, you could have a system that downloads and installs features as needed.
The biggest advantage of DLLs is probably during development of the original program. Without DLLs you wouldn't be able to integrate with existing libraries without including the original source code. By including an existing library as a DLL you don't need the source, since it's all encapsulated in the DLL. It would be a nightmare to develop in frameworks like .NET without DLLs, since you are constantly pulling in other libraries...
The alternative to breaking your program down into n > 1 pieces is to keep it in n == 1 piece. Why is this bad? Well, it isn't always bad (maybe the BIOS is a good example?). But for user programs it usually is. Why? First we need to define what a program is.
What is a program?
A simple "program", roughly speaking, consists of an entry point (i.e. offset to the main function), functions and global variables. A function consists of instructions and information about what local variables are needed to run the function. To be executed a program must be loaded in primary memory/RAM (the aforementioned information). Because our program has functions (and not just jump statements), that implies the existence of a stack, which implies the existence of a containing environment managing the stack. (I suppose you could have a program that manages its own stack but I'd argue then your program is not a program anymore but an environment.) This environment contains the program, starts in the entry point and executes each instruction, be it "go to this part of the RAM and add it to whatever is in this register" or "If this register is all 0 then jump ahead this many instructions and resume execution there" indefinitely or until the program gives control back to its environment. (This is somewhat simplified - context switches in multi-process environments, illegal memory access, illegal instructions, etc. can also cause control to be taken from the program.)
Anyway, so we have two options: either load the entire program at once or have it stored and loaded in pieces.
n == 1
There are some advantages to doing it all at once:
Once the program is in memory no disk access is required to execute further (unless the program explicitly asks there to be).
Since the program is compiled/linked before execution begins you can do everything without any sort of string names/comparisons - go directly to the address (or an offset).
Functions are never out of sync with one another.
n > 1
There are some disadvantages, though, which mirror the advantages:
Most programs don't execute all code paths most of the time. I think there are studies showing that, in most programs, most of the execution time is spent in a small fraction of the instructions present in the program. In other words, something like 20% of the program is executed 80% of the time (I just made that particular figure up - but you get the idea). If we divide our program up enough and only load instruction sets (i.e. functions) as they are needed, then we won't waste time loading the 80% we'll never use in this execution of the program. Along these lines, we can ultimately fit more concurrently executing programs into our RAM at once if we only load the fraction of each program we actually need.
Most programs share similar functions (i.e. storing data/trees/hashes/sorting/etc., reading input, writing output, etc.) and if each program has its own local copy then you can't reuse instruction code.
Many programs depend on the existence of others and are maintained by separate companies/groups/individuals. By releasing versioned modules we don't have to synchronize releases all the time.
Conclusion
These aren't the only points to consider, but they're the first ones that came to mind. I'd recommend reading about compilers, linkers and operating systems; that will answer this question, and the other questions I'm sure it has raised, more thoroughly than I can. To recap: DLLs aren't the "best" way of packaging executable programs in all situations and circumstances - they have a particular use, with advantages and disadvantages.

Notifications for file system changes?

I'm implementing a free-disk-space bar that updates while files are being copied. I need some way of being notified of file system changes. What's the best way to go about doing this?
The File System Events Programming Guide has all the info you need. You want to register with the File System Events API (OS X 10.5 and later).
To monitor operations on individual files you can use kqueue file change notifications. Uli Kusterer has a nice Obj-C wrapper called UKKQueue.
You can get it here: http://zathras.de/angelweb/sourcecode.htm
If you want to watch an entire folder, FSEvents (and the SCEvents wrapper) will probably be of more use.

How to run an unmanaged executable from memory rather than disk

I want to embed a command-line utility in my C# application, so that I can grab its bytes as an array and run the executable without ever saving it to disk as a separate file (this avoids storing the executable as a separate file and avoids needing the ability to write temporary files anywhere).
I cannot find a method to run an executable from just its byte stream. Does Windows require it to be on disk, or is there a way to run it from memory?
If Windows requires it to be on disk, is there an easy way in the .NET Framework to create a virtual drive/file of some kind and map the file to the executable's memory stream?
You are asking for a very low-level, platform-specific feature to be implemented in a high-level, managed environment. Anything's possible...but nobody said it would be easy...
(BTW, I don't know why you think temp file management is onerous. The BCL does it for you: http://msdn.microsoft.com/en-us/library/system.io.path.gettempfilename.aspx )
Allocate enough memory to hold the executable. It can't reside on the managed heap, of course, so like almost everything in this exercise you'll need to PInvoke. (I recommend C++/CLI, actually, so as not to drive yourself too crazy). Pay special attention to the attribute bits you apply to the allocated memory pages: get them wrong and you'll either open a gaping security hole or have your process be shut down by DEP (i.e., you'll crash). See http://msdn.microsoft.com/en-us/library/aa366553(VS.85).aspx
Locate the executable in your assembly's resource library and acquire a pinned handle to it.
Memcpy() the code from the pinned region of the managed heap to the native block.
Free the GCHandle.
Call VirtualProtect to prevent further writes to the executable memory block.
Calculate the address of the executable's Main function within your process' virtual address space, based on the handle you got from VirtualAlloc and the offset within the file as shown by DUMPBIN or similar tools.
Place the desired command line arguments on the stack. (Windows Stdcall convention). Any pointers must point to native or pinned regions, of course.
Jump to the calculated address. Probably easiest to use _call (inline assembly language).
Pray to God that the executable image doesn't have any absolute jumps in it that would've been fixed up by calling LoadLibrary the normal way. (Unless, of course, you feel like re-implementing the brains of LoadLibrary during step #3).
Retrieve the return value from the EAX register.
Call VirtualFree.
Steps #5 and #11 should be done in a finally block and/or use the IDisposable pattern.
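To give a feel for the core of this, here is a hedged C# P/Invoke sketch of just the allocate/copy/protect/call/free sequence. It assumes the bytes are a self-contained, position-independent blob of machine code whose entry point is at offset 0 and matches the delegate signature - not a full PE image, which would still need all the loader work (imports, relocations) described above:

    using System;
    using System.Runtime.InteropServices;

    class NativeBlobRunner
    {
        [DllImport("kernel32.dll", SetLastError = true)]
        static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize, uint flAllocationType, uint flProtect);

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool VirtualProtect(IntPtr lpAddress, UIntPtr dwSize, uint flNewProtect, out uint lpflOldProtect);

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool VirtualFree(IntPtr lpAddress, UIntPtr dwSize, uint dwFreeType);

        const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, MEM_RELEASE = 0x8000;
        const uint PAGE_READWRITE = 0x04, PAGE_EXECUTE_READ = 0x20;

        [UnmanagedFunctionPointer(CallingConvention.StdCall)]
        delegate int EntryPoint();

        // 'code' is assumed to be a self-contained, position-independent machine-code
        // blob whose entry point is at offset 0 - not a PE image.
        static int RunBlob(byte[] code)
        {
            IntPtr mem = VirtualAlloc(IntPtr.Zero, (UIntPtr)(uint)code.Length,
                                      MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
            if (mem == IntPtr.Zero) throw new InvalidOperationException("VirtualAlloc failed");
            try
            {
                Marshal.Copy(code, 0, mem, code.Length);              // copy the bytes into the native block
                VirtualProtect(mem, (UIntPtr)(uint)code.Length,
                               PAGE_EXECUTE_READ, out _);             // now executable, no longer writable (keeps DEP happy)
                var entry = Marshal.GetDelegateForFunctionPointer<EntryPoint>(mem);
                return entry();                                       // the int return value comes back via EAX/RAX
            }
            finally
            {
                VirtualFree(mem, UIntPtr.Zero, MEM_RELEASE);          // always release the block
            }
        }
    }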
The other main option would be to create a RAMdrive, write the executable there, run it, and clean up. That might be a little safer since you aren't trying to write self-modifying code (which is tough in any case, but especially so when the code isn't even yours). But I'm fairly certain it will require even more platform API calls than the dynamic code injection option -- all of them requiring C++ or PInvoke, naturally.
Take a look at the "In Memory" section of this paper. Realize that it's from a remote DLL injection perspective, but the concept should be the same.
Remote Library Injection
Creating a RAMdisk or dumping the code into memory and then executing it are both possible, but extremely complicated solutions (possibly more so in managed code).
Does it need to be an executable? If you package it as an assembly, you can use Assembly.Load() from a memory stream - a couple of trivial lines of code.
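A sketch of that route, assuming the tool has been compiled as a .NET assembly and embedded as a resource (the resource name is a placeholder):

    using System;
    using System.IO;
    using System.Reflection;

    class EmbeddedAssemblyRunner
    {
        static void Main()
        {
            // "MyApp.Tool.dll" is a placeholder for the embedded resource's name.
            using var stream = Assembly.GetExecutingAssembly()
                                       .GetManifestResourceStream("MyApp.Tool.dll")
                               ?? throw new InvalidOperationException("embedded resource not found");
            using var ms = new MemoryStream();
            stream.CopyTo(ms);

            var asm = Assembly.Load(ms.ToArray());        // loaded straight from memory, nothing on disk
            var entry = asm.EntryPoint
                        ?? throw new InvalidOperationException("assembly has no entry point");
            var args = entry.GetParameters().Length == 0
                ? null
                : new object[] { Array.Empty<string>() }; // pass empty args if its Main takes string[]
            entry.Invoke(null, args);
        }
    }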
Or if it really has to be an executable, what's actually wrong with writing a temp file? It'll take a few lines of code to dump it to a temp file, execute it, wait for it to exit, and then delete the temp file - it may not even get out of the disk cache before you've deleted it! Sometimes the simple, obvious solution is the best solution.
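And the temp-file version really is only a few lines; a sketch (the .exe extension is given explicitly so the file can be launched normally):

    using System;
    using System.Diagnostics;
    using System.IO;

    class TempFileRunner
    {
        // 'exeBytes' would come from an embedded resource, as in the question.
        static int RunFromTempFile(byte[] exeBytes, string arguments)
        {
            // Use an explicit .exe extension so the file can be launched normally.
            string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".exe");
            File.WriteAllBytes(path, exeBytes);
            try
            {
                using var p = Process.Start(new ProcessStartInfo(path, arguments) { UseShellExecute = false })
                              ?? throw new InvalidOperationException("failed to start " + path);
                p.WaitForExit();
                return p.ExitCode;
            }
            finally
            {
                File.Delete(path);    // clean up once the child has exited
            }
        }
    }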
This is explicitly not allowed in Vista+. You can use some undocumented Win32 API calls in XP to do this but it was broken in Vista+ because it was a massive security hole and the only people using it were malware writers.