What kernel level operations are performed when editing a file? - file-io

Can anybody please explain what kernel-level operations are performed when a file is edited? What I'm confused about is whether a new inode is created every time a file is edited. Please explain the steps, if possible. I have searched the internet, but found no satisfactory answers there.
Thanks in advance.

There's no single general answer, because this depends on what the application does when it's editing the file, what system it's running on, and what the file is stored on. It might be creating new temporary files, or clobbering and rewriting the original file, or using memory mapping, or using versioned filesystem features, or doing network file system operations, etc etc.
Instead of trying to answer this in the abstract, pick an open source editor you're interested in, and read through its source code and debug it to understand what it in particular is doing. Then if you have questions, you can read the API docs to figure out what kernel operations the functions it's calling map to or rely on.
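That said, to give you a feel for why the inode part of your question is "it depends", here is a minimal sketch (in C#, since that's the language used elsewhere on this page) of two save strategies an editor might use; the method names and temp-file naming are made up. Rewriting in place keeps the same underlying file (and so the same inode on Unix-like systems), while saving to a temporary file and swapping it over the original creates a new file, which on most Unix filesystems means a new inode.

using System.IO;

class SaveStrategies
{
    // Strategy 1: truncate and rewrite the existing file in place.
    // The path keeps pointing at the same underlying file.
    static void SaveInPlace(string path, string text)
    {
        using (var stream = new FileStream(path, FileMode.Truncate, FileAccess.Write))
        using (var writer = new StreamWriter(stream))
        {
            writer.Write(text);
        }
    }

    // Strategy 2: write a temporary file, then swap it over the original.
    // The saved result is a brand-new file, replacing the old one.
    static void SaveViaTempAndReplace(string path, string text)
    {
        string temp = path + ".tmp";        // hypothetical temp name
        File.WriteAllText(temp, text);
        File.Replace(temp, path, null);     // swap the new file into place
    }
}

Running an editor under a tracing tool (strace on Linux, for example) will show which of these patterns, or which more elaborate variant, it actually uses.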

Related

Attaching a specific piece of non-intrusive info to a file or folder to keep a connection to a program

This is going to be a question with a lot of hypotheticals, but it's been on my mind for a while now and I finally want to get some perspectives on how to tackle this "issue". For the sake of the question, I'll make up an example requirement of how the program I want to make would work on a conceptual level without too many specifics.
The Problem
I want to create a program to keep track of miscellaneous info for files and folders. This miscellaneous info can be anything from comments and authors to more specific info like the original source of the file (a URL, for example), categories, tags, and more. All this info is kept track of in an SQLite database.
Now... how would you create a connection between the file (or folder) and the database? Whatever file is added to the program should continue to operate independently of the program, meaning you should be able to edit, copy, move, rename or do anything else with the file you would usually do with your OS of choice - even delete it.
You should even be able to archive it, zip it, upload it somewhere or do other things that temporarily or permanently remove the file from your system, without losing the connection to the database. The program itself doesn't ever actually touch the files themselves, except to generate a new entry in the database, but obviously, there should be some kind of reference in the file to a database entry in the program.
Yes, I know that if you delete the file, you would have a dead entry in the database. For now, just treat this as an unfortunate reality that can't be solved unless you incorporate the file more closely into the program.
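For reference, the database side of this is the easy part; the hard part is the file-to-row link discussed below. A minimal sketch of the kind of table I have in mind, assuming the Microsoft.Data.Sqlite package (any SQLite binding would do) and with made-up column names:

using Microsoft.Data.Sqlite;

class MetadataStore
{
    public static void Init(string dbPath)
    {
        using (var connection = new SqliteConnection("Data Source=" + dbPath))
        {
            connection.Open();
            var command = connection.CreateCommand();
            // One row of miscellaneous info per tracked file.
            command.CommandText =
                "CREATE TABLE IF NOT EXISTS files (" +
                " id INTEGER PRIMARY KEY AUTOINCREMENT," +
                " comment TEXT," +
                " author TEXT," +
                " source_url TEXT," +
                " tags TEXT)";
            command.ExecuteNonQuery();
        }
    }
}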
Possible solutions and why I decided against them
Reference inside Filename
Probably the most obvious choice, you could just have a reference inside the filename to point to a database entry, for example by including the id at the start of the filename:
#1 my-example-file.txt
#12814 this-is-one-of-many-files.txt
Obviously, that goes against what I established earlier, as you would be restricted from freely renaming the file. You would always have to keep in mind to not mess with the id inside the filename, or else the connection to your program is broken. Unfortunately, that is the best bet I currently have, but I would like to avoid using that approach if possible.
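Just to be concrete, parsing the id back out of the filename would be trivial, something like the following sketch (the "#" prefix is only my example format from above, nothing standard):

using System.Text.RegularExpressions;

static class FilenameId
{
    public static bool TryParseId(string fileName, out long id)
    {
        var match = Regex.Match(fileName, @"^#(\d+)\s");   // e.g. "#12814 this-is-one-of-many-files.txt"
        if (match.Success)
            return long.TryParse(match.Groups[1].Value, out id);
        id = 0;
        return false;
    }
}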
Alternate Data Streams (ADS)
A pretty cool feature I recently discovered that's available on NTFS file systems: ADS allows you, to grossly simplify it, to store additional named streams of data alongside a file. You could attach a data stream to your file that saves the id for the database entry in the program, and a regular user would never be able to mess directly with that.
However, since this is a feature reserved for specific file systems, there are some ugly side effects to ADS, as you can easily lose that part of the file by:
moving/copying it to a file system that doesn't support ADS, such as the file systems most often used in removable drives
uploading it to a cloud then later downloading it
moving it to another OS that might not support ADS or treats it in an unexpected way
zipping it
Thus I can't really rely on ADS either.
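For reference, here is roughly how the ADS approach would look in C#. This is only a sketch: it assumes Windows/NTFS and a .NET runtime whose file APIs accept the "file:streamname" syntax (recent .NET Core/.NET versions do; classic .NET Framework rejects the colon), and the "dbid" stream name is just my example:

using System.IO;

static class AdsTag
{
    public static void WriteId(string path, long id)
    {
        // The id lives in a named stream; the visible file contents are untouched.
        File.WriteAllText(path + ":dbid", id.ToString());
    }

    public static long? ReadId(string path)
    {
        try
        {
            return long.Parse(File.ReadAllText(path + ":dbid"));
        }
        catch (FileNotFoundException)
        {
            return null;   // the file exists but has no such stream attached
        }
    }
}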

Is Dropbox considered a Distributed File System?

I was just reading this https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems
The definition of a DFS seems to describe Dropbox exactly, but it isn't in the list of examples, which I think it would be if it were one.
So what is different about Dropbox which makes it not fall into this category?
Usually, when talking about distributed file systems, you expect properties that Dropbox doesn't support. For example, if you and I share a folder, I can create a file called "work.txt" in it and you can create a file "work.txt" in it, and if we do it fast enough (or while we're not syncing with Dropbox) we'll have conflicting copies of the same file.
A similar example would be if we both edit the same file concurrently - we'll have conflicting copies, which is something a distributed file system should prevent. In the link you refer to, this is called "Concurrency transparency; all clients have the same view of the state of the file system".
Another example of a property Dropbox doesn't support: if my computer fails (e.g., my hard drive is corrupted), I might lose data that wasn't yet uploaded to Dropbox. There is a window during which I believe my data is safe because it was written to the local disk, but if my computer fails before it syncs, I lose that data.
Lastly, I'm not sure how Dropbox handles file locks. For example, MS Office takes locks on .doc files to ensure no one else is working on them at the same time. I don't think Dropbox supports this feature.
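To make the locking point concrete, here is a minimal C# sketch of what taking an exclusive lock on a file roughly amounts to at the file-system level (an illustration, not how Office actually implements its locking):

using System.IO;

class ExclusiveOpen
{
    static FileStream OpenExclusively(string path)
    {
        // FileShare.None: while we hold this handle, any other process that tries to
        // open the file gets an IOException. Dropbox only syncs bytes; it does not
        // coordinate locks like this between machines the way a distributed file system could.
        return new FileStream(path, FileMode.Open, FileAccess.ReadWrite, FileShare.None);
    }
}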
I've written a blog post about some of the complexities of implementing a distributed file system; you might find it helpful as well.

How to extract info from a file

This may be a beginner's question. I've tried searching for info but couldn't find anything. Part of my work requires me to convert a specific, proprietary file type. Unfortunately the software is no longer supported and can't be found. I have no idea where to start on this. I would like to write a little utility to basically convert the file for me into a standard format. Question is, where do I start? Conceptually, what am I looking at here? Is this even possible?
You could start by understanding what is stored in the file. Is there a pattern to the data? What is the pattern, and how is it repeated?
Then open the file in binary mode and try to confirm whether there is indeed a pattern. If there is one, you should be able to spot it, even in binary.
And lots of patience :-)
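If it helps to get started, here is a small hex-dump sketch in C#; printing the file 16 bytes per line as hex plus printable ASCII makes headers, repeated record structures and embedded strings much easier to spot than raw bytes:

using System;
using System.IO;
using System.Text;

class HexDump
{
    static void Main(string[] args)
    {
        byte[] data = File.ReadAllBytes(args[0]);
        for (int offset = 0; offset < data.Length; offset += 16)
        {
            var hex = new StringBuilder();
            var ascii = new StringBuilder();
            for (int i = offset; i < Math.Min(offset + 16, data.Length); i++)
            {
                hex.AppendFormat("{0:X2} ", data[i]);
                // Show printable ASCII characters, dots for everything else.
                ascii.Append(data[i] >= 0x20 && data[i] < 0x7F ? (char)data[i] : '.');
            }
            Console.WriteLine($"{offset:X8}  {hex,-48} {ascii}");
        }
    }
}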

Implement a self extracting archive?

I know I can use 7z or WinRAR, but I want to learn this for myself.
How would I implement a self-extracting archive? I can use C# or C++, but let me run down the problem.
When I open the exe, I need some kind of GUI asking where to extract the files. Once the user says OK, I should obviously extract them. I've already implemented a simple example in C# WinForms, BUT my problem is HOW do I get the filenames and binary contents of the files into an exe?
Once upon a time I asked "Is it safe to add extra data to end of exe?" and the answer suggested that if I just add data to the end of the exe it may be picked up by a virus scanner. Now, it's pretty easy to write the length of the archive as the last 4 bytes and just append the data to my generic exe, and I do believe my process can read its own exe, so this could work. But it feels hacky, and I'd rather not have people accuse me of writing a virus just because I'm using this technique. What's the proper way to implement this?
Note: I checked the self-extracting tag and many of the questions are about how to manipulate self-extracting archives, not how to implement one. Except this one, which is asking something else: Self-extracting self-checking executable
-edit- I made two self-extracting archives with 7z and compared them. It looks like... well, it IS the 7z.sfx file, but with a regular 7z archive appended. So... there is nothing wrong with doing this? Is there a better way? I'm targeting Windows and can use the C# compiler to help, but I don't know how much extra work or how difficult it may be programmatically, and maybe adding data to the end of the exe isn't bad?
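To be concrete about what I mean by the length footer, the extracting side would look roughly like this. This is just a sketch, assuming whatever built the exe appended [archive bytes][4-byte little-endian length] at the end; file names are placeholders:

using System;
using System.Diagnostics;
using System.IO;

class SelfExtractStub
{
    static void Main()
    {
        // Read this process's own exe, which has the payload and its length appended.
        string exePath = Process.GetCurrentProcess().MainModule.FileName;
        byte[] exe = File.ReadAllBytes(exePath);

        int archiveLength = BitConverter.ToInt32(exe, exe.Length - 4);
        var archive = new byte[archiveLength];
        Array.Copy(exe, exe.Length - 4 - archiveLength, archive, 0, archiveLength);

        // "archive" now holds the embedded payload (e.g. a zip); hand it to whatever
        // archive library you use and extract it to the folder the user picked.
        File.WriteAllBytes("payload.bin", archive);
    }
}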
It is possible. I used the following technique once, when we needed to distribute updates for the application, but the computers were configured so that the end user had no permission to change application files. The update was supposed to log on to an administrator account and update the required files (so we came across the identical problem: how to distribute many files as a single executable).
The solution was file resources in C#. All you need to do is:
Create a resource file in your C# project (file ending with .resx).
Add a new resource of type "file". You can easily add existing files as byte[] resources.
In the program you can simply extract the resource as a file:
// Create (or overwrite) the target file and copy the embedded resource bytes into it.
// Wrapping the streams in "using" guarantees they are flushed and closed.
using (var file = new System.IO.FileStream("C:\\PathToFile", System.IO.FileMode.Create))
using (var writer = new System.IO.BinaryWriter(file))
{
    writer.Write(UpdateApplication.Data.DataValue, 0, UpdateApplication.Data.DataValue.Length);
}
(Here UpdateApplication.Data denotes the binary resource.)
Our solution lacked compression, but I believe this is easily achieved with libraries such as C#ZipLib.
I hope this solution is virus-scanner-safe, as this method creates a complete, valid executable file.

Software configuration management tool for hundreds of binary files, many are large

Note: I've tried searching; Stack Overflow is near useless. I am not sure what kind of tool I need.
At my organization we need to keep track of the software configuration for many types of computers, including the binary installers and automation scripts. Change is infrequent, but the latest version of the configuration is several gigabytes in size.
We are trying to use Mercurial to store changes but it is just too slow, even without many revisions at all. I did an hg status but killed it after it took 10 minutes without finishing.
We are looking for a way to store the current configuration as well as keeping the old configurations around just in case. I have never done anything like this before and do not know what tools are available or even suitable for such tasks. Can someone point me in the right direction or tell me how they are solving this problem? Thanks
Since hard disk space is cheap and being able to view binary differences isn't very helpful, perhaps the best option you have is to store each configuration in a new directory that is indexed somehow. Example below:
/software/configs/2009-03-15
/software/configs/2009-09-28
/software/configs/2009-09-30
Given the size of your files and the infrequent number of changes, this would allow you to pick a configuration from a given 'tag' without the overhead of revision control.
If you pack your files into a single tar file and generate a SHA-512 hash, then you can be reasonably sure that no one has tampered with your files since they were archived.
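As a sketch of that last step, computing and recording the digest in C# might look like the following; building the tar itself is left to whatever archiver you already use, and the archive path is just a placeholder matching the directory layout above:

using System;
using System.IO;
using System.Security.Cryptography;

class ArchiveDigest
{
    static void Main()
    {
        string archive = "/software/configs/2009-09-30.tar";   // hypothetical archive of one config directory
        using (var sha = SHA512.Create())
        using (var stream = File.OpenRead(archive))
        {
            byte[] digest = sha.ComputeHash(stream);
            // Store the hex digest next to the archive; recompute and compare later to detect tampering.
            File.WriteAllText(archive + ".sha512", BitConverter.ToString(digest).Replace("-", ""));
        }
    }
}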
While I don't know the specific details of how to implement this strategy in Mercurial, I have been working with git and git-fat. It sets up a general procedure that is likely to be feasible in Mercurial as well. Basically, the idea is that whenever you add a binary file to the repository, under the hood the repo creates a symlink to the file, which is actually stored in another location as a checksummed object.
This allows large files to be tracked by the repo, without storing the actual data inside. It requires the data to be stored in some other location (perhaps in a binary management system).
It might take some configuration to do it in Mercurial, but I think it's an elegantly simple solution.
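To illustrate the idea, here is a toy sketch (not git-fat's actual on-disk format): the large file is copied into a content-addressed store keyed by its checksum, and only a small pointer stays where the file used to be.

using System;
using System.IO;
using System.Security.Cryptography;

class FatStore
{
    static void Stash(string repoFile, string storeDir)
    {
        byte[] data = File.ReadAllBytes(repoFile);
        string hash;
        using (var sha = SHA1.Create())
            hash = BitConverter.ToString(sha.ComputeHash(data)).Replace("-", "").ToLowerInvariant();

        File.WriteAllBytes(Path.Combine(storeDir, hash), data);   // real data, stored outside the repo
        File.WriteAllText(repoFile, "fat-pointer: " + hash);      // tiny stand-in tracked by the repo
    }
}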