Best way to verify a file header

I'm trying to verify that a file header (metadata) contains the correct information and was not corrupted.
For example:
Someone manipulates the file size declared in the header of an archive to exploit a zero-day. How do I check that the file is actually the size the header claims?
What is the best way to do it?
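As an illustration of the kind of cross-check being asked about, here is a minimal Python sketch that assumes a made-up header layout (a 16-byte header with a 4-byte little-endian payload-size field at offset 8); real archive formats each need their own parser.

```python
# Hypothetical header layout for illustration only: 16-byte header,
# declared payload size stored as a 4-byte little-endian integer at offset 8.
import os
import struct

def header_size_matches(path, header_len=16):
    actual_payload = os.path.getsize(path) - header_len
    with open(path, "rb") as f:
        header = f.read(header_len)
    declared = struct.unpack_from("<I", header, 8)[0]
    return declared == actual_payload
```

The general principle is the same regardless of format: parse the declared value out of the header, compare it against what the file on disk actually contains, and reject the file on any mismatch.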

Related

Remove portion of a remote file using SFTP

I would like to remove a portion of a file using the SFTP protocol. Example:
This is a sample text. --> This is a text.
The standard protocol write operation takes an offset and a string of data as inputs, but the interface provides no way to specify a file length, so characters cannot be removed and the file can only ever grow in size. In the above example, if I attempted to update the file using write, the resulting output would be:
This is a text.e text.
What is the proper way of removing characters in SFTP? Is there perhaps a terminating character which is used to signal the end of a file when using write? Or do I have to just delete the entire file and re-upload a new one?
You cannot remove part of a file's contents. Not even from a local file, let alone from a remote file over SFTP.
You have to overwrite the file, or at least the whole part of the file starting from the first changed byte, and then truncate the file in case the new contents are shorter than the original.
You can use the copy-data SFTP extension to copy data within the file itself and avoid re-uploading the existing contents, provided your SFTP server supports the extension. In OpenSSH, it is supported only since the very recent version 9.0.
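For illustration, here is a minimal Python sketch of the overwrite-then-truncate approach using Paramiko (the library, host, credentials and remote path are assumptions, not part of the question):

```python
# Sketch, assuming Paramiko and a hypothetical host/path: overwrite the file
# from the first changed byte, then truncate away the leftover tail.
import paramiko

new_text = b"This is a text."

client = paramiko.SSHClient()
client.load_system_host_keys()
client.connect("sftp.example.com", username="user")  # hypothetical server/credentials
sftp = client.open_sftp()

with sftp.open("sample.txt", "r+b") as f:  # hypothetical remote path
    f.seek(0)                              # the change starts at offset 0 here
    f.write(new_text)                      # overwrite with the new, shorter contents
    f.truncate(len(new_text))              # drop the trailing bytes of the old file

sftp.close()
client.close()
```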

NSFileSystemFileNumber is changed after file is edited/updated in objective c

I am working on a file management system in Cocoa, much like Dropbox.
My problem is that when I edit any text file, its NSFileSystemFileNumber changes.
I want a unique NSFileSystemFileNumber that stays the same even if the edited file is moved out of its folder.
In short, I just want to know how to fetch that moved file's old or original path from the database.
Is there any alternative way to solve this problem?
Thanks in advance!
It depends on how the editor's save functionality is implemented. Each editor behaves differently, and it sounds like the one you are using does the following:
Delete existing file.
Create new file.
Write file data.
Hence you get a new inode each time. Others might:
Truncate existing file.
Write file data.
which would result in the same inode each time.
There is nothing you can do about this, so you will need to track file changes using the name or something similar, not the inode.
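A quick way to see the difference (illustrative only, in Python rather than Cocoa; NSFileSystemFileNumber is the file's inode number, which os.stat exposes as st_ino):

```python
# Compare the inode before and after two different "save" strategies.
import os

path = "demo.txt"  # hypothetical file

with open(path, "w") as f:
    f.write("original")
inode_before = os.stat(path).st_ino

# Strategy 1: delete existing file, create new file, write data -> new inode
os.remove(path)
with open(path, "w") as f:
    f.write("edited")
print(inode_before == os.stat(path).st_ino)  # usually False

# Strategy 2: truncate existing file, write data -> same inode
inode_before = os.stat(path).st_ino
with open(path, "w") as f:                   # "w" truncates the file in place
    f.write("edited again")
print(inode_before == os.stat(path).st_ino)  # True
```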

Ways to achieve de-duplicated file storage within Amazon S3?

I am wondering about the best way to achieve de-duplicated (single-instance storage) file storage within Amazon S3. For example, if I have 3 identical files, I would like to store the file only once. Is there a library, API, or program out there to help implement this? Is this functionality present in S3 natively? Perhaps something that checks the file hash, etc.
I'm wondering what approaches people have used to accomplish this.
You could probably roll your own solution to do this. Something along the lines of:
To upload a file:
Hash the file first, using SHA-1 or stronger.
Use the hash to name the file. Do not use the actual file name.
Create a virtual file system of sorts to save the directory structure - each file can simply be a text file that contains the calculated hash. This 'file system' should be placed separately from the data blob storage to prevent name conflicts - like in a separate bucket.
To upload subsequent files:
Calculate the hash, and only upload the data blob file if it doesn't already exist.
Save the directory entry with the hash as the content, like for all files.
To read a file:
Open the file from the virtual file system to discover the hash, and then get the actual file using that information.
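A minimal sketch of this whole-file scheme using boto3 (bucket names are hypothetical, and SHA-256 stands in for the "SHA-1 or stronger" hash):

```python
# Content-addressed storage sketch: blobs keyed by hash, plus pointer files.
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
DATA_BUCKET = "my-dedup-blobs"    # hypothetical: content-addressed blob storage
INDEX_BUCKET = "my-dedup-index"   # hypothetical: the "virtual file system"

def upload(local_path, logical_name):
    # 1. Hash the file contents.
    sha = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    digest = sha.hexdigest()

    # 2. Upload the blob only if it is not already stored.
    try:
        s3.head_object(Bucket=DATA_BUCKET, Key=digest)
    except ClientError:
        # Treat any failure as "not present" for this sketch; a real
        # implementation should check specifically for a 404.
        s3.upload_file(local_path, DATA_BUCKET, digest)

    # 3. Save the directory entry: a tiny pointer object whose content is the hash.
    s3.put_object(Bucket=INDEX_BUCKET, Key=logical_name, Body=digest.encode())

def download(logical_name, local_path):
    # Read the pointer file to find the hash, then fetch the blob.
    digest = s3.get_object(Bucket=INDEX_BUCKET, Key=logical_name)["Body"].read().decode()
    s3.download_file(DATA_BUCKET, digest, local_path)
```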
You could also make this technique more efficient by uploading files in fixed-size blocks - and de-duplicating, as above, at the block level rather than the full-file level. Each file in the virtual file system would then contain one or more hashes, representing the block chain for that file. That would also have the advantage that uploading a large file which is only slightly different from another previously uploaded file would involve a lot less storage and data transfer.
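And a sketch of the block-level variant, with an arbitrary 4 MiB block size: each block would be uploaded under its own hash (skipping blocks that already exist), and the directory entry would store the ordered list of block hashes.

```python
# Split a file into fixed-size blocks and hash each one; the resulting list
# is the "block chain" to store in the virtual file system entry.
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # arbitrary block size for the sketch

def block_hashes(path):
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes
```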

Changing the hash of files

I have a folder full of binary files and I want to make a change to these files so that their hashes will change. I want to do this in a fashion that doesn't permanently corrupt the files, meaning that the change should still allow each file to operate normally, or that I should be able to undo the change at any point in time.
Does anyone know of a script that I could use to do this, or maybe a program that will automate it?
Cheers
UPDATE
It's an edge case that I am trying to deal with. I have a system that only allows me to store a file with a given hash once. Hence I want to change the content hash of the file so that the file can be stored. Note that the system in question is not one I control or can change.
Couldn't I just add a random 1 to the end of the file and then remove it afterward without breaking anything? I'm just not sure how to script this, as in how to modify the binary data in this way. Note that I'm in a Windows environment.
Without knowing the format of the files, we can't tell. It may in fact be impossible - for instance if these binary files are self-signed with some private key. Changing any single bit within the file is likely to render it invalid.
Is your hash calculated purely from the contents, and not any other metadata that you can change (such as filename or modified date)? If so, you're probably out of luck. If the hash is meant to detect when the content changes, but you're trying to change the hash without actually changing the content, you've clearly got a problem...
What is the hash used for? Why do you want to change it? There may be an alternative solution if you could give us more information about the bigger picture.
EDIT: One alternative is to effectively create your own container format - so while a file is stored in your container format, it's not usable in its original form, but it can be extracted easily. Your container could be as simple as "add four bytes at the end as a seed to disturb the hash" - "extracting" the file would just involve copying it and removing the last four bytes. But the important point is that what you end up with isn't an MP3 file or whatever you started with - it's your custom format, simple as it is. You need to package/extract the file any time you interact with the store.
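A minimal Python sketch of that pad-and-strip idea (the padded file is not usable in its original format until the trailing bytes are removed, as noted above):

```python
# Append a few random bytes to change the content hash; strip them to restore.
import os
import secrets

PAD_LEN = 4  # "add four bytes at the end as a seed to disturb the hash"

def pad(path):
    # Append random bytes so the content hash changes.
    with open(path, "ab") as f:
        f.write(secrets.token_bytes(PAD_LEN))

def unpad(path):
    # Remove the trailing bytes to restore the original file.
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.truncate(size - PAD_LEN)
```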

Comparing uncompressed local files to compressed files stored on Amazon S3?

We put hundreds of image files on Amazon S3 that our users need to synchronize to their local directories. In order to save storage space and bandwidth, we zip the files stored on S3.
On the user's end, they have a Python script that runs every 5 minutes to get a current list of files and download new/updated files.
My question is: what's the best way to determine what is new or changed and needs to be downloaded?
Currently we add an additional metadata header to the compressed file which contains the MD5 value of the uncompressed file...
We start with a file like this:
image_file_1.tif 17MB MD5 = xxxx1234
We compress it (with 7zip) and upload it to S3 (with Python/Boto):
image_file_1.tif.z 9MB MD5 = yyy3456 x-amz-meta-uncompressedmd5 = xxxx1234
The problem is we can't get a large list of files from S3 that includes the x-amz-meta-uncompressedmd5 header without an additional API call for EACH one (slow for hundreds/thousands of files).
Our most practical solution is to have users get a full list of files (without the extra headers) and download the files that do not exist locally. If a file does exist locally, do an additional API call to get the full headers and compare the local MD5 checksum against x-amz-meta-uncompressedmd5.
I'm thinking there must be a better way.
You could include the MD5 hash of the uncompressed image into the compressed filename.
So image_file_1.tif could become image_file_1.xxxx1234.tif.z
Your user's Python script which does the synchronising would therefore have the information needed to determine whether it needs to fetch the file again from S3, and could either strip out the MD5 part of the filename or keep it, depending on what you want to do.
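For example, a small Python sketch (the filename convention is the one proposed above, and the parsing details are an assumption) of how the sync script could read such a key and decide whether to download:

```python
# Parse keys like "image_file_1.xxxx1234.tif.z" and compare against local files.
import hashlib
import os

def parse_key(key):
    # "image_file_1.xxxx1234.tif.z" -> ("image_file_1.tif", "xxxx1234")
    stem = key[:-2]                      # drop the ".z" suffix
    name, md5, ext = stem.rsplit(".", 2)
    return f"{name}.{ext}", md5

def needs_download(key, local_dir):
    local_name, remote_md5 = parse_key(key)
    local_path = os.path.join(local_dir, local_name)
    if not os.path.exists(local_path):
        return True
    with open(local_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest() != remote_md5
```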
Or, you could maintain, on S3, a single file containing the full file list including the MD5 metadata. The Python script then just needs to fetch that single file, parse it, and decide what to do.
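A sketch of that manifest approach with boto3, assuming a hypothetical bucket and a JSON manifest mapping compressed keys to uncompressed MD5s:

```python
# Fetch one manifest object, then decide locally which files are new or changed.
import hashlib
import json
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "my-image-bucket"       # hypothetical bucket
MANIFEST_KEY = "manifest.json"   # hypothetical key: {"image_file_1.tif.z": "xxxx1234", ...}

def local_md5(path):
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest()

def files_to_download(local_dir):
    manifest = json.loads(s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)["Body"].read())
    stale = []
    for key, uncompressed_md5 in manifest.items():
        local_path = os.path.join(local_dir, key[:-2])  # strip the ".z" suffix
        if not os.path.exists(local_path) or local_md5(local_path) != uncompressed_md5:
            stale.append(key)
    return stale
```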