Ensure a file is not changed while trying to remove it - locking

In a POSIX environment, I want to remove a file from disk, but calculate its checksum before removing it, to make sure it was not changed. Is locking enough? Should I open it, unlink, calculate checksum, and then close it (so the OS can remove its inode)? Is there any way to ensure no other process has an open file descriptor on the file?
To give a bit of context, the code performs synchronization of files across hosts, and there's an opportunity for data loss if a remote host removes a file but the file is being changed locally.

Your proposal of open,unlink,checksum,close won't work as is, because you'll be stuck if the checksum doesn't match (there is no POSIX-portable way of creating a link to a file given by a file descriptor). A better variant is rename,checksum,unlink,close, which lets you undo the rename or redo the copy if the checksum doesn't match. You'll still need to think of what you want to do if a third program has recreated the file in the meantime.
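A minimal sketch of that rename,checksum,unlink,close sequence in Python (the helper name, the .removing suffix and the use of SHA-256 are assumptions for illustration, not part of any standard API):

import hashlib
import os

def remove_if_unchanged(path, expected_sha256):
    # Atomically take the name out of play so nothing opens it by name.
    hidden = path + ".removing"
    os.rename(path, hidden)
    h = hashlib.sha256()
    with open(hidden, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    if h.hexdigest() == expected_sha256:
        os.unlink(hidden)        # checksum matched: really remove it
        return True
    # Checksum mismatch: undo the rename. Note this clobbers any file a
    # third program has recreated at `path` in the meantime.
    os.rename(hidden, path)
    return False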
POSIX offers only cooperative locks. If you have control over the programs that may modify the file, make sure they use locks; if that's not an option, you're stuck without locks.
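For example, every cooperating process could take an advisory whole-file lock before touching the file; a sketch (shared.dat is a placeholder name):

import fcntl
import os

# Advisory (cooperative) POSIX record lock: it restrains only those
# processes that also take the lock, not arbitrary writers.
fd = os.open("shared.dat", os.O_RDWR)
fcntl.lockf(fd, fcntl.LOCK_EX)      # blocks until the exclusive lock is ours
try:
    pass                            # read, checksum or modify the file here
finally:
    fcntl.lockf(fd, fcntl.LOCK_UN)
    os.close(fd)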
There is no portable way to see what (or even whether) processes have opened a file. On most Unix systems, lsof will show you, but this is not universal, not robust (a program could open the files just after lsof has finished looking), and incomplete (if the files are exported over NFS, there may be no way to know about active clients).
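If a best-effort check is still worth something to you, something like this works where lsof is installed (the helper name is made up, and the answer is stale the moment it returns):

import subprocess

def appears_open(path):
    # lsof exits with 0 when at least one process has the file open.
    result = subprocess.run(["lsof", path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0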
You may benefit from looking at what other synchronization programs are doing, such as rsync and unison.

Related

Attaching a specific piece of non-intrusive info to a file or folder to keep a connection to a program

This is going to be a question with a lot of hypotheticals, but it's been on my mind for a while now and I finally want to get some perspectives on how to tackle this "issue". For the sake of the question, I'll make up an example requirement of how the program I want to make would work on a conceptual level without too many specifics.
The Problem
I want to create a program to keep track of miscellaneous info for files and folders. This miscellaneous info can be anything from comments, authors, to more specific info like the original source of the file (a URL for example), categories, tags, and more. All this info is kept track of in an SQLite database.
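For concreteness, the backing table could look something like this (the schema is invented for this sketch):

import sqlite3

conn = sqlite3.connect("fileinfo.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS file_info (
        id      INTEGER PRIMARY KEY,  -- the reference the file must carry
        comment TEXT,
        author  TEXT,
        source  TEXT,                 -- e.g. the original URL
        tags    TEXT                  -- or a separate tags/join table
    )
""")
conn.commit()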
Now... how would you create a connection from the file (or folder) to the database? Whatever file is added to the program, the file should continue to operate independently of the program, meaning you should be able to edit, copy, move, rename or do anything else with the file that you would usually do with your OS of choice - even deleting it.
You should even be able to archive it, zip it, upload it somewhere or do other things that temporarily or permanently remove the file from your system, without losing the connection to the database. The program itself never actually touches the files, except to generate a new entry in the database, but obviously there should be some kind of reference in the file to a database entry in the program.
Yes, I know that if you delete the file, you would have a dead entry in the database. For now, just treat this as an unfortunate reality that can't be solved unless you incorporate the file more closely into the program.
Possible solutions and why I decided against them
Reference inside Filename
Probably the most obvious choice, you could just have a reference inside the filename to point to a database entry, for example by including the id at the start of the filename:
#1 my-example-file.txt
#12814 this-is-one-of-many-files.txt
Obviously, that goes against what I established earlier, as you would be restricted from freely renaming the file. You would always have to remember not to mess with the id inside the filename, or else the connection to your program is broken. Unfortunately, that is the best bet I currently have, but I would like to avoid using that approach if possible.
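For what it's worth, pulling the id back out is trivial; a sketch matching the #<id> prefix shown above:

import re

def id_from_filename(name):
    # Return the database id from a "#<id> " prefix, or None if absent.
    m = re.match(r"#(\d+)\s+", name)
    return int(m.group(1)) if m else None

print(id_from_filename("#12814 this-is-one-of-many-files.txt"))  # 12814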
Alternate Data Streams (ADS)
A pretty cool feature I recently discovered that's available on NTFS file systems: to grossly simplify it, ADS allows you to store additional named streams of data alongside a file. You could attach a data stream to your file that saves the id for the database entry in the program, and a regular user would never be able to mess directly with that.
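For illustration, on NTFS an alternate stream is addressed by appending :streamname to the path and can be used with ordinary file APIs (the dbid stream name is an assumption of this sketch):

# Windows/NTFS only: "file.txt:dbid" names an alternate data stream.
with open("my-example-file.txt:dbid", "w") as ads:
    ads.write("12814")      # store the database id out of the user's sight

with open("my-example-file.txt:dbid") as ads:
    print(ads.read())       # -> 12814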
However, since this is a feature reserved for specific file systems, there are some ugly side effects to ADS: you can easily lose that part of the file by:
moving/copying it to a file system that doesn't support ADS, such as the file systems most often used in removable drives
uploading it to a cloud then later downloading it
moving it to another OS that might not support ADS or treats it in an unexpected way
zipping it
Thus I can't really rely on ADS either.

How to get the file back after using "unlink" in R?

I accidentally deleted some of my useful files. The files were deleted and I could not find them in the Recycle Bin. I want to know how I can get them back.
I am using Windows 8.1. All the files in My Documents were deleted using unlink in R. I tried using R-delete to recover them, but it can only recover files deleted through the Recycle Bin, not files unlinked from R.
Thank you.
Though I am not an expert in R, I assume your files have been unlinked at the filesystem level. You can't expect to find them in your operating system's recycle bin. If they are very important, the only real solution is:
stop immediately doing anything with your computer;
take the time to read up and understand the procedure from another computer;
access your hard drive (or whatever holds the files) from another mounted filesystem/operating system (boot from a USB stick, for instance);
use some undelete tool adapted to your filesystem.
You don't say which filesystem you use; maybe there is a tool usable from within your running OS, and that may be easier; but anyway, don't use your computer any more than necessary before doing it...

Prevent renaming of file from another binary on Mac OS

I am working with multiple processes that write to the same directory.
I have a directory dir1/
My process creates a file a.txt under dir1/. However, the other process creates a-temp1.txt and renames it to a.txt. I don't have control over the other process since that code comes from a library. Can I prevent a-temp1.txt from being renamed?
There's nothing you can do that the other process can't undo. Your best hope (other than changing your program to work sanely) is that the other process doesn't try too hard to do the rename. That is, it tries the simple approach and gives up if that fails.
In particular, you can set the UF_IMMUTABLE flag on either file and that will prevent one from being renamed to replace the other. You can set the flag using chflags(). Using Cocoa, you could also use [someURL setResourceValue:@YES forKey:NSURLIsUserImmutableKey error:NULL].
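A sketch of the chflags() route from Python, assuming the file layout from the question:

import os
import stat

# Set the user-immutable flag: renames over it and writes to it now fail.
os.chflags("dir1/a.txt", stat.UF_IMMUTABLE)

# ...and clear it again later. Any process running as the same user can
# do this too, which is why it only stops a half-hearted renamer.
os.chflags("dir1/a.txt", 0)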
Keep in mind that you won't be able to change the file in any other way, either, until that flag is removed. If the other process is determined to rename the file, it has permission to remove the flag just like your process does.
Also keep in mind that a system such as this is inherently race-prone.
You really ought to use separate names for the files, or separate directories, or ditch that library that doesn't give you the control you need.
Set the user immutable flag with chflags(..., UF_IMMUTABLE) (the uchg flag in the chflags shell command). This will keep the other process from changing your file unless it takes action to clear the bit. Of course, I don't know how the other process will react to you putting things in its way, but that wasn't the question.
You can use chflags() on an HFS+ (Mac OS X) file system to set the UF_APPEND attribute. (Do a man 2 chflags.) That will permit appending to the file, but not deleting or renaming, even by the same user.
You can, but it unlikely will solve your problem. I strongly suspect this is an X-Y problem, and almost certainly the correct solution is to redesign some part of this system entirely, probably by changing your file names, using unique temporary files, moving to another directory, or reworking the usage of the library (libraries only do what callers tell them to do; and libraries are just code anyway). You shouldn't try to defeat another process; you're all working for the same user.
All that said, sure, you can prevent your own userid from renaming over the file. Just deny yourself permission. You can modify the file:
chmod 400 a.txt
That says that you can read the file but may not write it. However, if you already have an open file handle, you may continue to use it (so you can keep writing to the file, even though another process running as the same user may not).
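A quick demonstration of that open-handle behavior (a sketch; a.txt must already exist and be writable when first opened):

import os

fd = os.open("a.txt", os.O_WRONLY | os.O_APPEND)
os.chmod("a.txt", 0o400)            # read-only from now on...
os.write(fd, b"still writing\n")    # ...but the existing descriptor still works
os.close(fd)

# A fresh attempt to open for writing now raises PermissionError:
# os.open("a.txt", os.O_WRONLY)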
Similarly, you may change permissions on the directory:
chmod 500 .
This would prevent the rename because file names are kept in the directory.

How to remove .efs file extension from 1000's of recovered files in one folder

I recently recovered a 1.5TB external HDD that crashed. The program I used to recover the files was Active Undelete Enterprise; it's excellent. When the files were successfully recovered, they were all saved with a .efs extension, so files looked like mydocument.docx.efs. At first I thought they were encrypted and needed to be decrypted; I spent 10 minutes on it before realizing I just need to remove the .efs from the entire filename and mydocument.docx works perfectly. The problem is that I now have over 55,000 files within hundreds of folders where I need to simply remove the .efs after each file. Does anyone know how to do this?
From a command prompt window, navigate to the top level directory where these files reside.
Type the command
DIR /S/B >>filelist.txt
This command will give you a bare format file listing of the current directory plus all nested subdirectories without any extraneous information. The list will be contained in the text file named "filelist.txt" or whatever else you choose to call it. I would then use this text file in a text editor to convert every line of text from, for example,
C:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs
to
rename c:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs file1.png
to give a simple example of a file that I just found on my system using this method.
You will need to use a text editor with a columnar editing capability since you have to modify so many files. Old programmer's editors such as CodeWright made this really simple, while modern editors such as Eclipse or Notepad++ make this a little more difficult and may require a columnar editing plugin, depending on the version. You basically have to make a columnar copy of all of the text in the file, and then paste the copy off to the far right - far enough that a second column of filenames and paths won't overwrite any of the existing file names and paths. You can then use columnar editing features to select and delete the path names in the 2nd column, since the rename command requires that the 2nd argument be simply the base filename and extension without the path information. Finally, use the columnar editing features to prepend every line with "RENAME ". If you attempt to do this without columnar editing features, you will find it slow going!
An alternate way to do this is to use a command formed from a "regular expression" to create the rename command. If you are not familiar with "regular expressions", ask a programmer friend as this is not an easy topic to learn from scratch. If you are familiar with regular expressions, this is probably the simplest way to perform this task. I haven't used them in many years and no longer recall the exact syntax to use or I would tell you myself.
Regardless of what kind of editor you use, the goal is to turn this ASCII file list of paths and filenames into a batch file (simply rename file1.txt to file1.bat when you are finished editing). You can then run the batch file by typing file1.bat at a command prompt.
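If a scripting language is available, the whole list-edit-run cycle can be skipped. Here is a sketch in Python that walks the tree and strips the extra extension (the top-level directory is a placeholder; it skips a rename rather than clobber an existing file):

import os

top = r"C:\recovered"                     # placeholder: where the files live
for dirpath, dirnames, filenames in os.walk(top):
    for name in filenames:
        if name.lower().endswith(".efs"):
            src = os.path.join(dirpath, name)
            dst = src[:-4]                # drop the trailing ".efs"
            if not os.path.exists(dst):   # don't overwrite an existing file
                os.rename(src, dst)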
I have just run into this same problem myself, using the same really wonderful tool that you used. I am writing this while waiting for the undelete program to finish. That it restores files with this extra extension seems very counter-intuitive, so I will look for an option to make it not do this when it finishes. If I find one, I will post a new answer here that is more specific to this tool. Otherwise, I am going to have to rename a kazillion files, just as you had to.
You experienced this problem because the disk that you recovered your files to "does not support encryption", according to the Active@ UNDELETE documentation. The documentation offers no further explanation of what kinds of disks support encryption, etc.
They offer a Decrypt command that restores the files' proper names as a post-processing step. Unfortunately, this requires that you "include" each and every file to be decrypted, with no support for wildcards or recursing into subdirectories, so that is a non-starter in my opinion, given that both of us have hundreds of thousands of files to be renamed.
I did find that by selecting a normal fixed (non-removable) hard drive as the destination of the recovery effort, the resulting files do not end up "encrypted" (i.e., they are recovered with the proper file name and extension). I originally chose a large USB-based flash drive, and the files were stored in their "encrypted" state (not really encrypted, but flagged as potentially so, hence the .efs extension). Of course, this meant that I had to run the command all over again after switching to a regular hard drive (it takes about 16 hours to recover 80GB worth of files due to the presence of many sector CRC errors).

Software configuration management tool for hundreds of binary files, many are large

Note: I've tried searching; Stack Overflow's search is near useless. I am not sure what kind of tool I need.
At my organization we need to keep track of the software configuration for many types of computers, including the binary installers and automation scripts. Change is infrequent, but the latest version of the configuration is several gigs.
We are trying to use Mercurial to store changes, but it is just too slow, even without many revisions at all. I ran hg status but killed it after 10 minutes when it still hadn't finished.
We are looking for a way to store the current configuration as well as keep the old configurations around just in case. I have never done anything like this before and do not know what tools are available or even suitable for such tasks. Can someone point me in the right direction or tell me how others are solving this problem? Thanks
Since hard disk space is cheap and being able to view binary differences isn't very helpful, perhaps the best option you have is to store each configuration in a new directory that is indexed somehow. Example below:
/software/configs/2009-03-15
/software/configs/2009-09-28
/software/configs/2009-09-30
Given the size of your files and the infrequent number of changes, this would allow you to pick a configuration from a given 'tag' without the overhead of revision control.
If you pack your files into a single tar file and generate a SHA-512 hash, then you can be reasonably sure that no one has tampered with your files since they were archived.
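A sketch of that pack-and-hash step, using the directory layout from the example above:

import hashlib
import tarfile

# Pack one configuration into a tarball...
with tarfile.open("config-2009-09-30.tar", "w") as tar:
    tar.add("/software/configs/2009-09-30", arcname="2009-09-30")

# ...and record its SHA-512 so later tampering is detectable.
h = hashlib.sha512()
with open("config-2009-09-30.tar", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest())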
While I don't know the specific details of how to implement this strategy in Mercurial, I have been working with git and git-fat. It sets up a general procedure that is likely to be feasible in Mercurial as well. Basically, the idea is that whenever you add a binary file to the repository, under the hood the repo creates a symlink to the file, which is actually stored in another location as a checksummed object.
This allows large files to be tracked by the repo, without storing the actual data inside. It requires the data to be stored in some other location (perhaps in a binary management system).
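The underlying trick is plain content-addressing; a sketch (the store location and helper name are invented here, and real tools like git-fat do this through filters rather than an explicit call):

import hashlib
import os
import shutil

STORE = "/var/binstore"      # placeholder for the external object store

def stash(path):
    # Hash the file's contents to get a stable object name.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    # Keep the data as a checksummed object outside the repository...
    shutil.copy2(path, os.path.join(STORE, digest))
    # ...and leave only a tiny pointer for the repo to track.
    with open(path, "w") as f:
        f.write(digest + "\n")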
It might take some configuration to do it in mercurial, but I think it's an elegantly simple solution.