I have a problem. Everyday I have to upload my whole source code (it is a directory with several directories and files) to a server over VPN. The size of source code is around 250 MB. What I do everyday is, compress it (that reduces it's size to around 100 MB), then I transfer this zipped file over ftp to the server and finally unzip it there. It takes me around 20 minutes to transfer that.
I am sure there has got to be a better way of doing this than what I am doing. Either suggest me a better compression mechanism or faster upload method.
If you could set up a Version Control server it would be great, Mercurial and Git are perfect for this.
The other option is using rsync, which is a synchronizing tool that only uploads the differences between the two versions, avoiding repetitive transmission of data.
I'm assuming a UNIX-like environment here, but on windows the options are pretty much the same.
PD: this question is more fitted for superuser.
Related
I am using XAMPP and PHPMyAdmin and I'm trying to load English Wikipedia. Since the file is so big (1.7GB), it take a lot of time. I'm wondering if there is any way to resume the loading process. I have no problem with TimeOut or something like that. The problem is that if my firefox crashes for any reason, the process must start from the scratch.
The part which says allow interrupt is already checked with a check mark. But the problem is that for such a big file that I am loading, it's really difficult to expect to be done without any interrupt. If the laptop is shut down or restarted or so, the process is repeated from the beginning. Is there any way to solve this problem?
In the meantime, I am using
$cfg['UploadDir'] = 'upload';
and load the file from the upload directory on my computer.
Thanks in advance
First, I would recommend against using phpMyAdmin for such a large file. You're going to be constrained by PHP/Apache resource limits for things such as execution time and memory used (or, apparently, some Firefox resource on the client side), to a degree that even if it works properly will have to be done in so many small chunks that it's just not ideal. Even using the UploadDir functionality, you're going to be limited in ways that make it non-ideal to import your file this way. I suggest using the command-line tool for importing a file of this size.
Secondly, if you're going to use phpMyAdmin anyway, it's better to uncompress the file and deal with the raw .sql. This is not intuitive, because of course you think the smaller filesize is better, but phpMyAdmin has to first uncompress the compressed file before it can begin working with it, which can cause problems such as the resource limits (or even running out of disk space). phpMyAdmin can pick up an aborted import, but if you're spending 95% of the execution time uncompressing the file each time, you're going to make very, very slow progress. Actually, I wonder if you're even getting the full file uncompressed on execution before PHP kills the process due to timeout.
phpMyAdmin can pick up execution part way through; you can select which line to begin the import from. If you restart your computer part way through the export, you can use this means to resume your partial import.
Let's say I have a 10 MB file and go through these steps:
Open it in my favorite programming language for Read/Write
Erase everything in the stream
Write exactly 10 MB of random back to the same stream
Save the changes to disk
Delete the file through normal means
Can I be certain that the new 10 MB successfully overwrote the old 10 MB on a sector level in the hard drive? Or is it possible that the "erase everything in the stream" step deletes the old file and potentially writes the new 10 MB in a new location?
The data may still be accessible by a professional who knows what they're doing and can access the raw data on the disk (i.e. without going through the filesystem).
Your program is basically equivalent to the Linux shred command, which contains the following warning:
CAUTION: Note that shred relies on a very important assumption:
that the file system overwrites data in place. This is the traditional
way to do things, but many modern file system designs do not satisfy this
assumption. The following are examples of file systems on which shred is
not effective, or is not guaranteed to be effective in all file system modes:
log-structured or journaled file systems, such as those supplied with
AIX and Solaris (and JFS, ReiserFS, XFS, Ext3, etc.)
file systems that write redundant data and carry on even if some writes
fail, such as RAID-based file systems
file systems that make snapshots, such as Network Appliance's NFS server
file systems that cache in temporary locations, such as NFS
version 3 clients
compressed file systems
There's other situations as well, such as SSDs with wear leveling.
no, since on any modern file system commits are atomic, you can be almost 100% certain the 10Mb did not overwrite the old 10Mb, and that's before we consider journaled file systems that actually guarantee this.
Short answer: No.
This might depend on your language and OS. I have a feeling that the stream calls are passed to the OS and the OS then decides what to do, so I'd lean towards your second question being correct just to err on the safe side. Furthermore, magnetic artifacts will be present after a deletion which can still be used to recover said data. Even overwriting the same sectors with all zeros could leave behind the data in a faded state. Generally it is recommended to make several deletion passes. See here for an explanation or here for an open source C# file shredder.
For Windows you could use the SDelete command line utility which implements the Department of Defense clearing and sanitizing standard:
Secure delete applications overwrite a deleted file's on-disk data
using techiques that are shown to make disk data unrecoverable, even
using recovery technology that can read patterns in magnetic media
that reveal weakly deleted files.
Of particular note:
Compressed, encrypted and sparse are managed by NTFS in 16-cluster
blocks. If a program writes to an existing portion of such a file NTFS
allocates new space on the disk to store the new data and after the
new data has been written, deallocates the clusters previously
occupied by the file.
I am trying to run a moses server on Amazon ec2 ebs-backed instance. The languages models and translation models are about 200GB in total. I am thinking to have a moses installed instance loads languages models and translation models stored on s3. But i do not know how to configure moses.ini file in order to make moses knowing the path of ttable-file and lmodel-file. If anyone has done this before, any help would be greatly appreciated!!
Thank you.
I wouldn't recommend Amazon S3 for this. Amazon S3 is used for efficiently distributing files across the web. But if your whole purpose is to just read these files inside a VM - then saving these in S3 is not the correct choice. Refer to this answer for more details.
To answer your question, yes it is possible to mount an S3 bucket as a folder inside your server using S3FS. Here are instructions for Ubuntu and Red Hat.
But other ideal approaches are:
If you don't have enough space in the hard disk, then install Moses server on a different partition, format it using BTRFS and enable Transparent Compression. This will automatically compress/decompress files when you save/retrieve from the hard disk, so you will end up using much much less space. Also in a lot of benchmarks, Transparent Compression is shown to be faster, since lesser amount of data is transferred between hard disk and RAM. Specially when involving large files.
You can always attach a secondary EBS disk to your running VM (like a secondary hard disk). Use that for storing the translations/models (and you can also combine enable transparent compression as above too)
Run a separate VM without EBS, and just using the normal instance storage, and use that to store the translations alone. Now in your Moses server, you can mount the translations alone from this separate non-EBS VM using SSHFS
Overall, don't use S3, there are other much better ways.
Edit: Added link
Lately I've been having problems reading big files on a network drive and I just can't pinpoint what I may be doing wrong. I tried both in C++ (Unmanaged) and in C# and had about the same performances on both...which were somewhat abysmal.
Sometimes it will read at 4 KB/s a file on the network, but if this file is located on the local HD it will achieve easily the maximum data rate the HD can output. That is with reading 64 KB chunks at a time... I tried with bigger buffers up to insane numbers, or smaller and it doesn't make much differences.
I tried async IO in C# with BeginRead on the FileStream and OVERLAPPED IO in C++ as well as synchronous reads and they all had the same problems, which is being slow on the network.
The only solution we came up with is to copy the file using the OS CopyFile function on the local HD before actually reading the file but I'm not too satisfied with this approach. It just seems like CopyFile is doing something we are not that makes it incredibly faster than our approach.
Anyone has a clue as to why this is?
We would have to guess, since you aren't showing us your code. So my guess is that Windows file copy is opening the file with the FILE_FLAG_SEQUENTIAL_SCAN flag which in turn causes the file system/cache to choose optimal block sizes and submit read requests in anticipation of read calls that havn't been submitted yet.
We only can assume that you have been trying really all possible methods of reading/writing. Have you been reading synchronously or asynchronously? Did you try I/O completion ports? Or ReadFileEx() function? I would guess that the Windows CopyFile() function detects that you want to read a file from network and will use different method for reading then it would use for disk access.
If you have really exhausted all possible reading methods, and if you really need thing to be solved, then I would suggest to check out a bit on what is the CopyFile() function doing. There are numerous tools for doing that. E.g.: this one (or some other -- links on the same page).
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I make extensive use of source control for anything that relates to a project I'm working on (source, docs etc.) and I've never lost anything that way.
However, I have had two or three crashes (spread over the last 4 years) on my development machine that forced me to reinstall my system and reconfigure my apps (eclipse, vim, Firefox, etc.). For weeks after reinstalling, I was missing one little app or another, some PHP or Python module wasn't there, stuff like that.
While this is not fatal, it's very annoying and sucks up time. Because it seemed so rare, I didn't bother about an actual solution, but meanwhile I've developed a mindset where I just don't want stuff like that happening anymore.
So, what are good backup solutions for a development machine? I've read this very similar question, but that guy really wants something different than me.
What I want is to have spare harddrives on the shelf and reduce my recovery time after a crash to something like an hour or less.
Thinking about this, I figured there might also be a way to use the backup mechanism for keeping two or more dev workstations in sync, so I can continue work at a different PC anytime.
EDIT: I should've mentioned that
I'm running Linux
I want incremental backup, so that it's cheap to do it frequently (once or twice a day)
RAID is good, but I'm on a laptop most of the time, no second hd in there, no E-SATA and I'm not sure about RAIDing to a USB drive: would that actually work?
I've seen sysadmins use rsync, has anybody had any experiences with that?
I would set up the machine how you like it and then image it. Then, you can set up rsync(or even SVN) to backup your homedir nightly/etc.
Then when your computer dies, you can reimage, and then redeploy your home dir.
The only problem would be upgraded/new software, but the only way to deal completely with that would be to do complete nightly backups of your drive(s).
Thanks, this sounds like a good suggestion. I think it should be possible to also update the image regularly (to get software updates / installs), but maybe not that often. E. g. I could boot the image in a VM and perform a global package update or something.
Hanno
You could create an image of your workstation after you've installed & configured everything. Then when your computer crashes, you can just restore the image.
A (big) downside to this, is that you won't have any updates or changes you've made since you created the image.
Cobian Backup is a reliable backup system for Windows that will perform scheduled backups to an external drive.
You could create a hard drive image. Restoring from a backup image restores everything to the exact state that it was at the time you took the image.
Or you could create an installer that installs just about everything needed.
Since you expressed interest in rsync, here's an article that covers how to make a bootable backup image via rsync for Debian Linux:
http://www.debian-administration.org/articles/575
Rsync is fast and easy for local and network syncing and is by nature incremental.
You can use RAID-1 for that. It’s the synchronize type, not the backup type.
I use RAID mirroring in conjunction with an external hard drive using Vista's system backup utility to backup the entire machine. That way I can easily fix a hard drive failure, but in the event my system becomes corrupted, I can restore from the E-SATA drive (which I only connect for backup).
Full disclosure: I've never had to restore the backup, so it's kind of like the airbag in your car; hopefully it works when you need it, but there's no way to be sure. Also, the backup process is manual (it can be automated) so I'm only as safe as the last backup.
You can use the linux "dd" command line utility to clone a hard drive.
Just boot from a linux cd, clone or restore your drive and reboot.
It works great for Windows/Mac drives too.
This will clone partition 1 of the first hard drive (/dev/sda) to partition 1 of the second drive (/dev/sdb)
dd if=/dev/sda1 of=/dev/sdb1
This will clone partition 1 of the first hard drive to a FILE on the second drive.
dd if=/dev/sda1 of=/media/drive2/backup/2009-02-25.iso
Simply swap the values for if= and of= to restore the drive.
If you boot from the Ubuntu live CD it will automount your USB drives making it easy to perform the backup/restore with external drive(s).
CAUTION: Verify the identity of your drives BEFORE running the above commands. It's easy to overwrite the wrong drive if you not careful.
Guess this is not exactly what you are looking for, but I just document all what I install and configure on a machine. Google Docs lets me do this from anywhere, and keeps the document intact when the machine crashes.
A good step by step document usually reduces the recovery time to one day or so
If you use a Mac, just plug in an external hard drive and Time Machine will do the rest, creating a complete image of your machine on the schedule you set. I restored from a Time Machine image when I swapped out the hard drive in my MacBook Pro and it worked like a charm.
One other option that a couple of guys use at my company is to have their development environment on a large Linux server. They just use their local machines to run an NX client to access the remote desktop (NX is much faster than VNC) - this has the advantages of fast performance, automatic backup of their files on the server, and the fact that they're developing on the same hardware that our customers use.
No matter what solution you use, it is always a good idea to have a secondary backup, too. Secondary backup should be off-site and include your essential work (source code, important docs). In case something happens to your main site (fire at the office, somebody breaks in and steals all your hardware, etc.), you would still be able to recover, eventually.
There are many online backup solutions. You could just get a remote storage at a reliable provider (e.g. Amazon S3) and sync your work on a daily basis. The solution depends on the type of access you can get, but rsync is probably the tool you would use for that.