I am making a workflow in Snakemake 5.8.2 which takes as input four huge files from S3 (280 GB each). The first rule just concatenates the files.
When I run the workflow, it seems to download only a 5 GB chunk of each file, deletes it, and then fails to concatenate. I know AWS transfers large objects in 5 GB parts, but I expected Snakemake to handle this in the background.
Am I missing something? Is this a bug?
Thanks,
Ilya.
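A minimal sketch of the setup described, assuming Snakemake's built-in S3 remote provider is used for the inputs (bucket and key names here are hypothetical; credentials can also come from the usual AWS environment variables):

```python
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider

S3 = S3RemoteProvider()

rule concat:
    input:
        # four large objects fetched from S3 before the rule runs
        S3.remote(expand("mybucket/part{i}.bin", i=range(1, 5)))
    output:
        "concatenated.bin"
    shell:
        "cat {input} > {output}"
```

With this layout, the downloads are performed by Snakemake's remote layer (boto3 underneath), which is where the multipart behaviour would have to be handled.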
Related
Here is an example of restoring Postgres from a single file on S3: the file is read to stdout and the stream is redirected into the pg_restore tool. But what if there are several gz files on S3? Is there a way to make pg_restore read them without downloading them into any temp folder?
About restoring in a loop
First of all, it is unclear how Postgres handles this situation. I have 50-100 gz files of different sizes. Yes, they have names and can be sorted. But will Postgres perform a correct restore when I use only a single file at a time?
Also, a loop leads to downloading all the files into some folder. The files can be big, so it is better to restore them from S3 directly.
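One property worth knowing here: concatenated gzip files form a single valid gzip stream, so a sorted list of parts can be streamed through one gunzip with no temp folder. (This only helps for plain-format dumps fed to psql; a custom-format archive read by pg_restore cannot be split and re-joined this way.) A sketch, with the S3 side shown as a comment since it needs a real bucket (bucket and database names hypothetical):

```shell
# With real objects, "-" as the aws s3 cp destination streams each
# part to stdout, so the whole restore can be piped:
#
#   aws s3 ls s3://mybucket/dump/ | awk '{print $4}' | sort \
#     | while read -r key; do aws s3 cp "s3://mybucket/dump/$key" -; done \
#     | gunzip | psql mydb
#
# The concatenation property itself, demonstrated locally:
printf 'part1\n' | gzip > a.gz
printf 'part2\n' | gzip > b.gz
cat a.gz b.gz | gunzip
```

gunzip reads both gzip members back to back and emits one continuous stream, which is exactly what the consumer on the right-hand side of the pipe sees.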
I'm using rclone in order to copy some files to an S3 bucket (deep archive). The command I'm using is:
rclone copy --ignore-existing --progress --max-delete 0 "/var/vmail" foo-backups:foo-backups/vmail
This makes rclone copy files that I know for sure already exist in the bucket. I tried removing the --ignore-existing flag (which IMHO is badly named, as it does exactly the opposite of what you'd initially expect), but I still get the same behaviour.
I also tried adding --size-only, but that doesn't fix the "bug" either.
How can I make rclone copy only new files?
You could use rclone sync; check out https://rclone.org/commands/rclone_sync/:
Doesn’t transfer unchanged files, testing by size and modification time or MD5SUM. Destination is updated to match source, including deleting files if necessary.
It turned out to be a bug in rclone. https://github.com/rclone/rclone/issues/3834
I am running a snakemake pipeline on a couple of files that I symlinked from another directory.
However, it seems that Snakemake does not recognize the input files when they are symlinked. Is this supposed to happen, or is there some way around this? I am getting a 'missing input files' error.
When I cp my input file into the exact same directory and run the script, it works fine. It's not a big deal to have to cp stuff around, but with a ton of data this might be a little more trouble than it's worth.
Version is 4.6.0
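One workaround, if resolving the links is acceptable, is to hand the workflow fully resolved paths instead of the symlinks themselves (file names here are hypothetical):

```shell
# Create a symlink pointing at data that lives elsewhere, then
# resolve it so the workflow sees the real file path, not the link.
echo "ACGT" > real_input.fastq
ln -sf "$(pwd)/real_input.fastq" linked_input.fastq
readlink -f linked_input.fastq
```

readlink -f follows the link all the way to the real file, so the path it prints can be passed to the Snakefile in place of the symlink.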
The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually. And I can script it too. But I want to use the makefile to upload my image file my_image.jpg iff the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use that to determine if dependencies need to be rebuilt or not?
The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
An S3 object will require copying if:
the sizes of the two s3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the s3 object does not exist under the specified bucket and prefix destination.
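If you'd rather keep the dependency check inside make itself, a common workaround is a local stamp file touched after each successful upload; make then compares local mtimes rather than querying S3, so it only tracks what this machine last uploaded. A sketch (bucket name taken from the question, stamp directory hypothetical; recipe lines must be tab-indented):

```make
# .s3-stamps/<file>.uploaded records when <file> was last uploaded.
.s3-stamps/%.uploaded: %
	@mkdir -p $(@D)
	aws s3 cp $< s3://mybucket.mydomain.com/
	@touch $@

.PHONY: upload
upload: .s3-stamps/my_image.jpg.uploaded
```

If the actual modification date of the object in the bucket matters (e.g. someone else may upload), aws s3 sync is the more robust tool, since it compares against S3 directly.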
I think you'll need to make S3 look like a file system to make this work. On Linux it is common to use FUSE to build adapters like that. Here are some projects to present S3 as a local filesystem. I haven't tried any of those, but it seems like the way to go.
We are using S3DistCp to copy files from S3 to HDFS using a manifest file - i.e., we pass the --copyFromManifest argument in the S3DistCp command. At the S3DistCp step, however, only some of the files listed in the manifest are copied. I am not sure where we should start looking for problems - i.e., why are some files being copied and others not?
Thanks
Maybe the problem is that you have files with the same name but in different directories. In that case you will need to change the way you construct the baseName and srcDir fields. Please describe how you build your manifest file.
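For context, each line of an S3DistCp manifest is a JSON record along these lines (values hypothetical). If two records end up with the same baseName, one copy can clobber the other at the destination, which matches the symptom of files silently going missing:

```json
{"path": "s3://mybucket/dir-a/data.csv", "baseName": "data.csv", "srcDir": "s3://mybucket/dir-a", "size": 1024}
{"path": "s3://mybucket/dir-b/data.csv", "baseName": "data.csv", "srcDir": "s3://mybucket/dir-b", "size": 2048}
```

Including the distinguishing directory component in baseName (e.g. "dir-a/data.csv") is one way to keep the destination paths unique.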