I'm running into a persistent problem with rsync when using the --files-from option. This option lets me pass rsync a pregenerated list of the files I want to transfer. Currently I'm using the following:
rsync -a -s -H --log-file=./rsync-logfile_1460 --files-from=./f.1460 /source/path /dest/path
So the file list is in the file f.1460. This file list corresponds to around 250k files from a single directory that has 500k files in it (I know, it's insane, but that's how this genomics application spits out its results). The process moves along pretty well until it has transferred around 100k files, and then it hangs for around 7 hours. Then it starts up again, moves another 10k to 20k files, and hangs for another 7 hours. Rinse. Lather. Repeat. This is a problem as we are moving a large amount of data (hundreds of TB) from one data storage system to another.
There is more than enough space on the target file system, and the source file system doesn't have any faults. Note that both file systems are Lustre running over ZFS. The rsync is running on a node with 128 cores and 256 GB of RAM, so I'm not resource limited.
Does anyone know if this is a problem with rsync? Are there better options I can use to get around this? I'm not seeing errors in the rsync log when it hangs - it sits there for ~7 hours not doing much of anything.
Any insight would be appreciated.
I am currently running Geth and Mist on Linux with an SSD and would like to move some (or all) of the chain data to an external drive to conserve space.
I understand there is a command line option to move the data directory:
geth --datadir <path to data directory>
My concerns are:
Will implementing this now slow syncing due to the external drive being much slower than my internal SSD?
Will it cause a re-sync of the entire blockchain?
Currently, I run the following script on my Bitcoin blocks directory, and it avoids both of those issues by keeping high-throughput data on the SSD and moving large but less frequently accessed data to the external drive.
#!/bin/bash
set -e
# Destination for the block files, no trailing slash. Replace with your own path.
BLK_TARGET=/mnt/ssd/core/blocks
# List the .dat block files in the top level of the current directory.
find . -maxdepth 1 -name '*.dat' -type f -printf '%f\n' > tomove
# Move each file to the target and leave a symlink behind in its place.
while read -r line; do
  echo "$line"
  mv "$line" "$BLK_TARGET/$line"
  ln -s "$BLK_TARGET/$line" "$line"
done < tomove
rm tomove
echo Done
When I tried something similar on the Ethereum network, it triggered a re-sync of the entire chain.
Can anyone recommend a similar process for Ethereum's geth client, or address my two concerns?
Current directory sizes on my SSD are:
:~$ du -sh .ethereum/*
37G .ethereum/geth
0 .ethereum/geth.ipc
12K .ethereum/history
28K .ethereum/keystore
I have listed three links below that I used as references when I dealt with this issue. I would advise reading them after my answer.
It depends. As long as Mist can access that directory you should be fine; however, if Mist is not set up with the external drive properly, it will have to resync the entire chain, which will take longer. Additionally, if you delete your chain data and start with the fast sync flag, that should speed things up. I have included a link on setting up an external hard drive as well; it is the last link below.
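For reference, this is roughly what that looks like when launching Geth directly. This is only a sketch: the path is a placeholder for your external drive, and the exact flag depends on your Geth version (older releases used --fast, newer ones --syncmode fast):
geth --datadir /mnt/external/.ethereum --syncmode fast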
https://www.reddit.com/r/ethereum/comments/68emnn/does_ethereum_wallet_have_pruning_availability/
https://www.reddit.com/r/ethereum/comments/7hzb78/is_there_a_way_to_prune_my_chaindata_folder_on/
https://ethereum.stackexchange.com/questions/3338/how-can-i-specify-an-external-harddrive-as-the-download-target-for-the-mist-bloc
External SSDs tend to be significantly slower than internal. (Internal SSDs come in two main categories, SATA and NVMe, and while NVMe is significantly faster than SATA, SATA is still significantly faster than external.) As of mid-2019 I had difficulty syncing the chain with an external drive that could only manage about 150 MB/s read/write, though there are much faster external drives.
(If you'd like a guide to speed-testing your external drive, this article walks through how in section 7 ('Disk Performance Checkpoint').)
Assuming that you do still want to set up Geth to write to the external drive: if you run geth --datadir /path/to/external/drive and the chaindata folder in the new location is empty, then Geth will begin syncing from the beginning of the chain. There's an easy way to avoid this, though: copy the chaindata from the internal drive onto the external drive. (This will take some time, but far, far less than syncing the chain.) I would recommend copying, and only deleting the old chaindata after you see that Geth is cooperating with the new location; you wouldn't want to discover that your new chaindata has issues without having a backup.
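As a rough sketch of that copy step (assuming the default data directory ~/.ethereum and an external drive mounted at /mnt/external; adjust the paths to your setup, and stop Geth/Mist first so the database isn't modified mid-copy):
# copy the existing data directory onto the external drive
rsync -a --info=progress2 ~/.ethereum/ /mnt/external/.ethereum/
# point Geth at the new location; it should pick up the copied chaindata instead of re-syncing
geth --datadir /mnt/external/.ethereum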
I am trying to copy more than 500,000 files (around 1 TB) through ssh, but the pipe fails because I exceed the time limit for my ssh session on the remote computer.
So I moved on to archiving and compressing (using tar and gzip) all the files on the remote computer, but even if I leave the process running in the background, it gets killed once my ssh session on the remote computer times out.
Finally, I moved on to compressing the files one by one and then tarring them (based on a suggestion that archiving a large number of files consumes a lot of time), but I get an "argument list too long" error.
Since all these files are spread across 20 such folders, I do not want to go into each one and split it into further folders before archiving and compressing.
Any suggestions would be really helpful.
Definitely tar and gz either the whole thing or the 20 directories individually (I would do the latter to divide and conquer at least a little.) That reduces overall transfer time and provides a good error check on the other end.
Use rsync through ssh to do the transfer. If it gets hosed in the middle, use rsync --append to pick up where you left off.
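As a minimal sketch of that approach (folder, host, and path names here are placeholders; run the tar step inside screen/tmux or with nohup on the remote machine so it survives the ssh timeout):
# on the remote machine: archive one directory at a time
nohup tar czf /scratch/folder01.tar.gz folder01 &
# from the local machine: pull the archive, resuming with --append if the transfer is interrupted
rsync -avhP --append user@remote:/scratch/folder01.tar.gz /local/destination/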
I plan on using a GCE cluster and gsutil to transfer ~50 TB of data from Amazon S3 to GCS. So far I have a good way to distribute the load over however many instances I'll have to use, but I'm getting pretty slow transfer rates in comparison to what I achieved with my local cluster. Here are the details of what I'm doing:
Instance type: n1-highcpu-8-d
Image: debian-6-squeeze
typical load average during jobs: 26.43, 23.15, 21.15
average transfer speed on a 70 GB test (for a single instance): ~21 Mbps
average file size: ~300 MB
.boto process count: 8
.boto thread count: 10
I'm calling gsutil on around 400 S3 files at a time:
gsutil -m cp -InL manifest.txt gs://my_bucket
I need some advice on how to make this transfer faster on each instance. I'm also not 100% sure whether the n1-highcpu-8-d instance is the best choice. I was thinking of possibly parallelizing the job myself using Python, but I think that tweaking the gsutil settings could yield good results. Any advice is greatly appreciated.
If you're seeing 21Mbps per object and running around 20 objects at a time, you're getting around 420Mbps throughput from one machine. On the other hand, if you're seeing 21Mbps total, that suggests that you're probably getting throttled pretty heavily somewhere along the path.
I'd suggest that you may want to use multiple smaller instances to spread the requests across multiple IP addresses; for example, using 4 n1-standard-2 instances may result in better total throughput than one n1-standard-8. You'll need to split up the files to transfer across the machines in order to do this.
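As a rough illustration of splitting the work (file names are placeholders, and the gsutil flags are just the ones from your question, assuming the S3 URLs are being piped in on stdin via -I):
# split the list of S3 URLs into chunks of, say, 100 lines each
split -l 100 s3-urls.txt part_
# on each instance, feed one chunk to gsutil on stdin with a per-chunk log file
cat part_aa | gsutil -m cp -I -n -L transfer-aa.log gs://my_bucket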
I'm also wondering, based on your comments, how many streams you're keeping open at once. In most of the tests I've seen, you get diminishing returns from extra threads/streams by the time you've reached 8-16 streams, and often a single stream is at least 60-80% as fast as multiple streams with chunking.
One other thing you may want to investigate is what download/upload speeds you're seeing; copying the data to local disk and then re-uploading it will let you get individual measurements for download and upload speed, and using local disk as a buffer might speed up the entire process if gsutil is blocking reading from one pipe due to waiting for writes to the other one.
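For example, something along these lines would give you separate download and upload timings (bucket names and paths are placeholders, and this assumes your .boto file has credentials for both S3 and GCS):
# time the S3 download to local disk
time gsutil -m cp s3://source-bucket/some-prefix/* /local/scratch/
# time the upload from local disk to GCS
time gsutil -m cp /local/scratch/* gs://my_bucket/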
One other thing you haven't mentioned is which zone you're running in. I'm presuming you're running in one of the US regions rather than an EU region, and downloading from Amazon's us-east S3 location.
Use the parallel_thread_count and parallel_process_count values in your boto configuration file (usually ~/.boto).
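For reference, a sketch of the relevant section of the .boto file (the numbers here just mirror the values in the question; tune them for your instance size):
[GSUtil]
parallel_process_count = 8
parallel_thread_count = 10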
You can get more info on the -m option by typing:
gsutil help options
I am working on an open source backup utility that backs up files and transfers them to various external locations such as Amazon S3, Rackspace Cloud Files, Dropbox, and remote servers through FTP/SFTP/SCP protocols.
Now, I have received a feature request for doing incremental backups (for cases where the backups are large and become expensive to transfer and store). I have been looking around and someone mentioned the rsync utility. I performed some tests with this but am unsure whether it is suitable, so I would like to hear from anyone who has some experience with rsync.
Let me give you a quick rundown of what happens when a backup is made. Basically it'll start dumping databases such as MySQL, PostgreSQL, MongoDB, Redis. It might take a few regular files (like images) from the file system. Once everything is in place, it'll bundle it all in a single .tar (additionally it'll compress and encrypt it using gzip and openssl).
Once that's all done, we have a single file that looks like this:
mybackup.tar.gz.enc
Now I want to transfer this file to a remote location. The goal is to reduce the bandwidth and storage cost. Let's assume this little backup package is about 1 GB in size. We use rsync to transfer it to a remote location and then remove the backup file locally. Tomorrow a new backup file will be generated; it turns out a lot more data has been added in the past 24 hours, so we build a new mybackup.tar.gz.enc file and it comes out at about 1.2 GB.
Now, my question is: Is it possible to transfer just the 200MB that got added in the past 24 hours? I tried the following command:
rsync -vhP --append mybackup.tar.gz.enc backups/mybackup.tar.gz.enc
The result:
mybackup.tar.gz.enc 1.20G 100% 36.69MB/s 0:00:46 (xfer#1, to-check=0/1)
sent 200.01M bytes
received 849.40K bytes
8.14M bytes/sec
total size is 1.20G
speedup is 2.01
Looking at the sent 200.01M bytes I'd say the "appending" of the data worked properly. What I'm wondering now is whether it transferred the whole 1.2GB in order to figure out how much and what to append to the existing backup, or did it really only transfer the 200MB? Because if it transferred the whole 1.2GB then I don't see how it's much different from using the scp utility on single large files.
Also, if what I'm trying to accomplish is at all possible, what flags do you recommend? If it's not possible with rsync, is there any utility you can recommend to use instead?
Any feedback is much appreciated!
The nature of gzip is such that small changes in the source file can result in very large changes to the resultant compressed file - gzip will make its own decisions each time about the best way to compress the data that you give it.
Some versions of gzip have the --rsyncable switch, which sets the block size that gzip works at to match rsync's. This results in slightly less efficient compression (in most cases) but limits changes in the output file to roughly the same area as the changes in the source file.
If that's not available to you, then it's typically best to rsync the uncompressed file (using rsync's own compression if bandwidth is a consideration) and compress at the end (if disk space is a consideration). Obviously this depends on the specifics of your use case.
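As a rough sketch of the --rsyncable route (this only works if your gzip build supports the flag, and note that encrypting the archive afterwards will scramble the output again, so it helps most for the unencrypted .tar.gz; paths follow the question):
# build the archive with rsync-friendly compression
tar cf - /path/to/backup | gzip --rsyncable > mybackup.tar.gz
# subsequent runs should then transfer mostly just the changed regions
rsync -vhP mybackup.tar.gz backups/mybackup.tar.gz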
It sent only what it says it sent - only transferring the changed parts is one of the major features of rsync. It uses some rather clever checksumming algorithms (and it sends those checksums over the network, but this is negligible - several orders of magnitude less data than transferring the file itself; in your case, I'd assume that's the .01 in 200.01M) and only transfers those parts it needs.
Note also that there already are quite powerful backup tools based on rsync - namely, Duplicity. Depending on the license of your code, it may be worthwhile to see how they do this.
Note: since rsync 3.0.0, --append WILL BREAK your file contents if there are any changes in the existing data, because it no longer verifies the already-transferred portion of the file.
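If you want resumable transfers that still re-check the existing data, something like the following should be safer (a sketch, assuming rsync >= 3.0.0 on both ends):
rsync -vhP --append-verify mybackup.tar.gz.enc backups/mybackup.tar.gz.enc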
I run an image sharing website that has over 1 million images (~150GB). I'm currently storing these on a hard drive in my dedicated server, but I'm quickly running out of space, so I'd like to move them to Amazon S3.
I've tried rsync, and it took over a day just to scan and create the list of image files. After another day of transferring, it was only 7% complete and had slowed my server down to a crawl, so I had to cancel.
Is there a better way to do this, such as gzipping them into an archive on another local hard drive and then transferring and unpacking that single file?
I'm also wondering whether it makes sense to store these files in multiple subdirectories or is it fine to have all million+ files in the same directory?
One option might be to perform the migration in a lazy fashion.
All new images go to Amazon S3.
Any requests for images not yet on Amazon trigger a migration of that one image to Amazon S3. (queue it up)
This should fairly quickly get all recent or commonly fetched images moved over to Amazon and will thus reduce the load on your server. You can then add another task that migrates the others over slowly whenever the server is least busy.
Given that the files do not exist (yet) on S3, sending them as an archive file should be quicker than using a synchronization protocol.
However, compressing the archive won't help much (if at all) for image files, assuming that the image files are already stored in a compressed format such as JPEG.
Transmitting ~150 Gbytes of data is going to consume a lot of network bandwidth for a long time. This will be the same if you try to use HTTP or FTP instead of RSYNC to do the transfer. An offline transfer would be better if possible; e.g. sending a hard disc, or a set of tapes or DVDs.
Putting a million files into one flat directory is a bad idea from a performance perspective. While some file systems cope with this fairly well, with O(log N) filename lookup times, others degrade to O(N) per lookup; multiply that by N to access every file in the directory. An additional problem is that utilities that need to access files in order of file name may slow down significantly if they have to sort a million file names. (This may partly explain why rsync took a day to do the indexing.)
Putting all of your image files in one directory is a bad idea from a management perspective; e.g. for doing backups, archiving stuff, moving stuff around, expanding to multiple discs or file systems, etc.
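If you do restructure, a minimal sketch of one common approach is to shard by a hash prefix of the file name (this is only an illustration; the extension and directory depth are assumptions to adjust for your data):
# move each image into one of 256 subdirectories keyed by the first two hex chars of an md5 of its name
for f in *.jpg; do
  d=$(printf '%s' "$f" | md5sum | cut -c1-2)
  mkdir -p "$d"
  mv "$f" "$d/$f"
done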
One option you could use instead of transferring the files over the network is to put them on a hard drive and ship it to Amazon's Import/Export service. That way you don't have to worry about saturating your server's network connection, etc.