Backup application with single instance functionality - file-upload

I am currently working on an application that backs up the files on your machine to a server (hosted by the company itself), so that you can recover data after an HDD crash. I have implemented a Single Instance feature across users.
Single Instance: a file that has already been uploaded to the server is not uploaded again. When another copy of the exact same file is backed up, no actual upload takes place; only database changes are made to link it to the previously uploaded file.
The issue arises when the same file (one that has not been uploaded before) is uploaded simultaneously by more than one user. At the start, no existing instance of the file is detected (the database is updated only after a successful upload/backup), so all the uploads run at once. What would be the best way to implement single instance in this situation?
One idea is to simply let all the instances upload as they are, so more than one copy of the file temporarily resides on the server. Then, whenever another backup of the same file is taken later, I would remove all the previous copies and link them to a single one. This avoids making users wait on each other and is less complex, at the cost of some disk space, and probably only for a while (until the next upload of the same file).
Thanks for your thoughts in advance.

Calculate the hash (signature) of the file before upload and store it in the DB.
Then - start uploading.
If an identical file is marked for upload while the first file is still uploading (you will know because you already saved the hash), hold the second upload until the first one finishes successfully, and then just link to it. If the first upload fails, you can go to the second source and upload from there.
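A minimal sketch of this flow, assuming a simple SQLite table and a hypothetical `upload` callable (the table layout and function names are illustrative, not from the original answer):

```python
import hashlib
import sqlite3
import time

db = sqlite3.connect("backup_index.db")
db.execute("""CREATE TABLE IF NOT EXISTS files (
                  hash TEXT PRIMARY KEY,
                  status TEXT,            -- 'uploading' or 'complete'
                  storage_path TEXT)""")

def file_hash(path, chunk_size=1 << 20):
    """Hash the file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def backup(path, upload):
    """`upload` is a hypothetical callable that performs the actual transfer."""
    digest = file_hash(path)
    try:
        # Claim the hash *before* uploading; the primary key makes the claim atomic,
        # so only one client wins if several start with the same new file at once.
        db.execute("INSERT INTO files (hash, status) VALUES (?, 'uploading')", (digest,))
        db.commit()
    except sqlite3.IntegrityError:
        # Another client is uploading (or has already uploaded) identical content.
        while True:
            row = db.execute("SELECT status, storage_path FROM files WHERE hash = ?",
                             (digest,)).fetchone()
            if row and row[0] == "complete":
                return row[1]                # just link to the existing copy
            if row is None:
                return backup(path, upload)  # first uploader failed: take over
            time.sleep(5)                    # first upload still running: wait, re-check

    try:
        storage_path = upload(path)          # the real transfer to the server
        db.execute("UPDATE files SET status = 'complete', storage_path = ? WHERE hash = ?",
                   (storage_path, digest))
        db.commit()
        return storage_path
    except Exception:
        # Release the claim so a waiting client can retry from its own copy.
        db.execute("DELETE FROM files WHERE hash = ?", (digest,))
        db.commit()
        raise
```

In a real deployment the table would live in the shared server-side database rather than a local SQLite file, so the claim is visible to every user.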

Related

FTP Server Download Returns blank files

We have a process in place, built on Excel VBA, that uploads a file to an FTP server. On the other side, our client downloads it. Very occasionally, and seemingly at random, they complain that the file they received is blank (though the file name is the same). We then check at our end and see that the file that was uploaded was never blank. So here comes the problem: we're always arguing about whether it was our error or theirs.
I figured that there might be a couple of reasons behind it but I have a few questions to ask before coming to conclusions:
If, say, the file was never uploaded (a possibility), what happens when the client runs a download process at their end? Can that download process generate a blank file with the same name as our output file? It sounds impossible to me but since the client is following up on this issue, I have to ask this silly question.
How does the mechanism work - what are the steps that happen on the FTP server the moment my process finishes uploading the file? I sometimes see that as soon as I upload the file, a 0 KB file is created, and then a second (or less) later the file with the right size appears. Could it be that their process runs right before the actual file content is written?
Thank you in advance for your help!

FTP client sees a file that isn't there... How can I successfully delete/overwrite this "ghost" file?

So we have a client that creates "training packages" and then uploads them via ftp to their website. They create the training packages in PowerPoint, and then use some program to convert them into html/swf files and package them within a folder. When they upload, they use Filezilla, and just transfer the entire folder over. The folder is uniquely named, uses no spaces or special characters.
These files have uploaded fine for about a year. Recently, they've run into a problem. Whenever they try to upload a training package folder, they are immediately presented with the "This file already exists, do you want to overwrite?" message. Except... the folder they're moving is brand new, and the file it's asking to overwrite DOESN'T EXIST. When they choose "Overwrite", the file looks like it transfers, but the file size is wrong, and the training package doesn't work correctly.
This happens with every training package they try to upload. It's not just a badly outputted package. Also, it's always the same file that has the problem--it's the main "player" for the training package, and though it contains different content for every package, it is the same file name (cplayer.swf) every time.
Things they've tried without success:
-Re-uploading the file again by itself, and overwriting
-Deleting the "bad" file and re-uploading the single file - Get the overwrite message again, even though the file DOES NOT EXIST.
-Renaming the file on the server and re-uploading the single file - Get the overwrite message.
-Renaming the single file locally within the package and uploading/renaming it - Won't let us rename because the file already exists.
-Used another FTP client - Same results as above, so not a client specific problem.
-Used a different FTP login - Same results as above, so not a permissions problem.
Other things of note:
-The file is small--it's not a time out problem. Plus, all other files upload fine, and some are a lot larger.
-They've emailed this file to me, and I've uploaded it successfully.
I am completely at my wits end. Does anyone have any ideas where I can at least troubleshoot a little further?
Thanks for the non-help, the downvote, and the general lack of response on what was a pretty serious issue for me.
In case anyone else has a similar problem, here's what was going on:
Antivirus software (specifically Malwarebytes) was blocking THIS ONE SINGLE FILE. All I had to do was exclude the folder that contained the file.

Prevent file being overwritten

Imagine there are 3 or more independent locations where a file can be modified. These locations communicate with each other through email or physical mail (direct flash drive transfer). Although there is plenty of room for error - simultaneous edits to the file can mess things up - this client won't change his workflow much. He would rather call everyone to say he is working on the latest update, or tell the others that he is waiting for the third person's latest update. Anyway, at some point after several exchanges, due to one participant's unintentional error, THE LAST VERSION of the file eventually gets mixed up. From that point on, everyone searches for the latest version BY LOOKING AT THE CONTENT of the file.
The client wants a central location (he actually has one: a folder on his PC) where everybody (including himself) can copy any new or possibly-new file, but where the latest version of the file is protected from being overwritten. From this location he must be able to easily copy, send, or open the file and work on it.
So, here is my concept (2 steps):
step 1: I wrote an add-in for the main application in which this file is created or edited. The add-in prompts the user to assign a version number to the file on every save command invoked from the editing application. In fact the file can be re-saved multiple times without being considered modified (file attributes such as creation and save times do not mean much here). That said, the user can cancel my add-in and still save the file without creating a new file version.
step 2: multiple solutions:
solution A: I'm thinking of having a folder/file watcher to prevent the last version of the file from being overwritten. As you know, FileSystemWatcher fires its change/delete events AFTER THE FACT, so I would have to copy the overwritten file back after the fact (with some tricks; see the sketch after this list).
solution B: have a database that stores all versions of the file, plus a shell extension to extract/view files from the database. Move all copied/pasted files into the database (my program's folder) and restore the latest file in the working folder after the watcher fires a change/delete event.
solution C: find built-in Windows tools (APIs etc.) that I can largely rely on, with some programming on top.
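For solution A, here is a rough sketch of the restore-after-the-fact idea. It uses Python's third-party watchdog package as a stand-in for .NET's FileSystemWatcher; the paths and file names are illustrative assumptions only:

```python
import shutil
import time
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

WATCHED_DIR = Path(r"C:\central\working")                 # illustrative paths
PROTECTED_FILE = WATCHED_DIR / "contract.docx"
LAST_VERSION = Path(r"C:\central\versions\contract_latest.docx")

class RestoreLastVersion(FileSystemEventHandler):
    def on_modified(self, event):
        if Path(event.src_path) != PROTECTED_FILE:
            return
        # The event fires *after* the overwrite has already happened, so the
        # only option is to copy the protected last version back over it.
        # The content check keeps the restore itself from re-triggering a copy.
        if PROTECTED_FILE.read_bytes() != LAST_VERSION.read_bytes():
            shutil.copy2(LAST_VERSION, PROTECTED_FILE)

    on_created = on_modified      # treat create and modify the same way

observer = Observer()
observer.schedule(RestoreLastVersion(), str(WATCHED_DIR), recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()
```

Note the inherent race: between the overwrite and the restore there is a short window in which the wrong content is visible, which is exactly the "after the fact" limitation mentioned above.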
Any ideas?
Thanks in advance.

How do services like Dropbox implement delta encoding if their files are stored in the cloud?

Dropbox claims that during syncing only the portions of a file that changed are transmitted back to the main server, which is obviously great functionality, but how do they apply those changes to files stored in Amazon S3? For example, say a 30-page document on the user's desktop has changes only on page 4. Dropbox syncs the blocks representing those changes, but what happens on the backend if the files they store are in the cloud? Does that mean they have to download the 30-page document from S3 to their server, replace the blocks representing page 4, and then upload it back to the cloud? I doubt this is the case, because it would be fairly inefficient. The other option I can think of is that Amazon S3 supports updating a stored file by byte range, so that, for example, a PUT request to file X for bytes 100-200 would replace those bytes with the body of the PUT request. So I'm curious how companies that build on cloud services such as Amazon's implement this type of syncing.
Thanks
As S3 and similar storage services don't offer filesystem capabilities, anything that pretends to store files and directories needs to emulate a file system. When doing this, files are often split into pages of a certain size, with each page stored as a separate object in the storage. This way a changed block requires uploading only one page, for example, and not the whole file. I should note that with files like office documents this approach can be faulty if the file size changes - for example, if you insert a page at the beginning or delete a page, every page boundary shifts and the complete file has to be re-uploaded. We didn't analyze how Dropbox in particular does its job; I've just described the common scenario. There also exist various "patch algorithms", where a patch can be created locally (if Dropbox has an older local copy in its cache) and then applied to one or more blocks on the server.
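A toy sketch of that page-splitting idea, where `upload_page` stands in for whatever PUT call the storage backend offers and the page size is an arbitrary choice:

```python
import hashlib

PAGE_SIZE = 4 * 1024 * 1024   # 4 MB pages; the real size is an implementation decision

def split_into_pages(path):
    """Yield (page_index, page_bytes, page_hash) for a local file."""
    with open(path, "rb") as f:
        index = 0
        while True:
            page = f.read(PAGE_SIZE)
            if not page:
                break
            yield index, page, hashlib.sha256(page).hexdigest()
            index += 1

def sync_file(path, remote_page_hashes, upload_page):
    """Upload only the pages whose hash differs from what the server already holds.

    `remote_page_hashes` maps page index -> hash of the page stored server-side;
    `upload_page(index, data)` is a hypothetical call that PUTs one page object.
    """
    for index, page, digest in split_into_pages(path):
        if remote_page_hashes.get(index) != digest:
            upload_page(index, page)
```

This also makes the caveat above visible: inserting bytes near the start of the file shifts every later page boundary, so almost all page hashes change and nearly the whole file gets re-uploaded (content-defined chunking, as used by rsync-style tools, is one way around that).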
There are several synchronization tools that transfer deltas over the wire, such as rsync, rdiff, rdiff-backup, etc. For bi-directional synchronization with S3 there are paid services such as s3rsync, for example. For purely client-side synchronization, tools like zsync can be considered (which is what many people use to roll out app updates).
An alternative approach would be to tar-ball a directory, generate a delta file (using rdiff or xdelta3), and upload the delta file by using a timestamp as part of the key. In order to sync, all you need to do is to perform these 2 checks client-side:
You have all the delta files from S3. If not, pull them and apply them to generate the latest backup state.
Your last backup state matches your current directory. If not, generate a new delta file and push it to S3.
The main concern here is the additional client-side space utilization of at least 100%. But this approach also lets you revert changes if needed.
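A hedged sketch of the push side of that scheme, assuming the xdelta3 command-line tool is installed and with the S3 upload abstracted behind a hypothetical `upload_delta` function:

```python
import subprocess
import tarfile
import time

def make_tarball(directory, out_path):
    """Pack the directory into a single archive so a delta can be computed over one file."""
    with tarfile.open(out_path, "w") as tar:
        tar.add(directory, arcname=".")

def make_delta(old_tar, new_tar, delta_path):
    # xdelta3 -e -s SOURCE TARGET DELTA : encode the difference between two archives
    subprocess.run(["xdelta3", "-e", "-s", old_tar, new_tar, delta_path], check=True)

def apply_delta(old_tar, delta_path, new_tar):
    # xdelta3 -d -s SOURCE DELTA TARGET : rebuild the newer archive from the older one
    subprocess.run(["xdelta3", "-d", "-s", old_tar, delta_path, new_tar], check=True)

def push_new_state(directory, last_state_tar, upload_delta):
    """Create a delta between the last known state and the current directory, then push it.

    `upload_delta(key, path)` is a hypothetical function that PUTs the delta to S3;
    the timestamp in the key keeps the deltas ordered for replay.
    """
    current_tar = "current_state.tar"
    make_tarball(directory, current_tar)
    delta_path = "delta_%d.xd3" % int(time.time())
    make_delta(last_state_tar, current_tar, delta_path)
    upload_delta(delta_path, delta_path)
```

The pull side is the mirror image: download any delta keys you don't have yet and run `apply_delta` over them in timestamp order to rebuild the latest state.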

How do I force a file to be deleted? Windows Server 2008

On my site a user may upload a file (pic, zip, audio, video, whatever) and then decide to replace it with a newer revision. For example, a user uploads a file, makes a post, and then decides to put up a new revision replacing the old one (let's say it's a large zip or tar.gz file). There's a good chance people are downloading it at that moment, if he sent out an email or even an IM in the home-user case.
The problem: I need to replace the file while people may still be downloading it, and it may be some minutes before it can be deleted. I don't want my code to stall until it can delete the file, or to check every second whether it is unused (especially bad if another user can start a download and take a long time, creating a cycle).
How do I delete the file while users are downloading it? I don't care if their downloads stop; I just care that the file can be replaced and that new downloads get the new revision.
What about referencing the files indirectly?
A mapping script maps a virtual file entry on your site to a real file. If the user wants to upload a new revision of his file, you just update the mapping, not the real file.
You can install a daily task that scans all files and deletes all files without a mapping and without open connections.
lajuette's answer is right; the easiest solution is to work around the file locking altogether:
When a user uploads file foo.zip, internally store it as foo-v1.zip.
Create a mapping file somewhere (database, code, whatever) that maps foo.zip to foo-v1.zip.
Rather than exposing a direct link to the file, expose a link to a service that gets the file: mysite.com/Download?foo.zip or something. This service uses the mapping to determine which version of the file to send to the client.
When a new version is uploaded, create foo-v2.zip and update the mapping file.
It wouldn't be that hard to write a scheduled task that cleans up old, un-mapped files.
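A minimal sketch of that indirection, using Flask purely for illustration (any web framework works the same way); the mapping dictionary, routes, and file names are illustrative assumptions:

```python
from flask import Flask, abort, send_file

app = Flask(__name__)

# In practice this mapping lives in a database and is updated atomically
# whenever a new revision finishes uploading.
FILE_MAP = {
    "foo.zip": "uploads/foo-v2.zip",   # illustrative entries
    "bar.pdf": "uploads/bar-v1.pdf",
}

@app.route("/Download/<name>")
def download(name):
    real_path = FILE_MAP.get(name)
    if real_path is None:
        abort(404)
    # Clients that are mid-download keep streaming the old physical file from
    # their already-open handle; new requests get whatever the map points at now.
    return send_file(real_path, as_attachment=True, download_name=name)

def publish_new_revision(name, new_physical_path):
    """Called after an upload finishes: flip the mapping, never touch the old file."""
    FILE_MAP[name] = new_physical_path
```

The old physical files are never deleted in the request path; the scheduled cleanup task mentioned above removes any file that no mapping points to and that has no open connections.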
If you're opposed to using a database, and if the filenames are in a fixed format (such as user/id.ext), you could append a revision number to the id, enumerate the folder using a pattern (user/id-*), and use the latest revision.
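A small sketch of that pattern-based approach, assuming names of the form user/id-REV.ext (the naming scheme and helper functions are illustrative):

```python
import glob
import os
import re

def latest_revision(folder, file_id, ext):
    """Find the highest revision of <folder>/<id>-<rev>.<ext> without any database."""
    pattern = os.path.join(folder, "%s-*.%s" % (file_id, ext))
    best_path, best_rev = None, -1
    for path in glob.glob(pattern):
        match = re.search(r"-(\d+)\.%s$" % re.escape(ext), path)
        if match and int(match.group(1)) > best_rev:
            best_rev, best_path = int(match.group(1)), path
    return best_path

def next_revision_path(folder, file_id, ext):
    """Where to write a newly uploaded revision."""
    current = latest_revision(folder, file_id, ext)
    rev = 0 if current is None else int(re.search(r"-(\d+)\.", current).group(1)) + 1
    return os.path.join(folder, "%s-%d.%s" % (file_id, rev, ext))
```

New uploads go to the next revision path, download code always serves the latest revision, and older revisions can be swept up later once nothing is still reading them.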