Best way to specify local inputs and S3 inputs without modifying rules? - snakemake

I have a snakemake workflow that runs on a local HPC. I also want to use this workflow on AWS, using boto to push files to and from S3 as needed. This avoids IO issues when used in conjunction with --no-shared-fs. I was thinking the best way to do this was just to use --default-remote-provider S3 and --default-remote-prefix to specify the bucket name. However, when this option is invoked, it applies S3.remote and the bucket prefix to all the inputs, including executable programs such as bwa, samtools, bedtools, etc. When boto downloads these programs, it doesn't preserve their executable permissions.
Is there a good way to specify "local" inputs vs S3 inputs without having to modify the rules? That way I only need to maintain one version of the workflow when I change or improve it.
Thanks for any help!

You can enclose the local files with local(...), analogous to the way temp files are marked with temp(...). This way, they will be excluded from the application of the default remote provider.
Note that tools should usually not be listed as input or output files. Consider using the snakemake conda integration and bioconda for software deployment and specification:
http://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management
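As a minimal sketch (the rule and file names here are made up for illustration), a rule combining both kinds of inputs could look like this:

rule map_reads:
    input:
        # resolved against the --default-remote-prefix bucket via the default S3 provider
        fq = "samples/{sample}.fastq.gz",
        # wrapped in local(), so it is excluded from the default remote provider
        ref = local("refs/genome.fa")
    output:
        "mapped/{sample}.bam"
    conda:
        # let conda/bioconda provide bwa and samtools instead of listing them as inputs
        "envs/bwa.yaml"
    shell:
        "bwa mem {input.ref} {input.fq} | samtools sort -o {output} -"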

Related

Does gitlab-ci have a way to download a script and store it on the local file system so it could be run?

Does gitlab-ci have a way to download a script and store it on the local file system so it could be run? It looks like others have asked similar questions (see below).
One way to do it would be to use curl (but curl has to exist in the CI runner):
curl -o ./myscript -k https://example.com/myscript.sh
This was from https://stackoverflow.com/a/22800194/3281336.
If I have a script that I would like to use in multiple CI-pipelines, I'd like to have a way to download the script to the local file system to use in the pipeline. NOTE: I don't have the ability to create a custom runner or docker image in my given situation.
If the script were available via git or an https website, what are my alternatives?
Some search results
https://docs.gitlab.com/ee/ci/yaml/includes.html - GitLab supports a way to include files, even from Git repos. This might work; I just haven't read how.
How to run a script from file in another project using include in GitLab CI? - Similar, but the answer uses a multi-project pipeline and a trigger, which is really (I think) a different answer.
.gitlab-ci.yml to include multiple shell functions from multiple yml files - Similar, but that question is dealing with scripts in YAML files and I'm dealing with a standalone script.
How to include a PowerShell script file in a GitLab CI YAML file - So far this is the closest to my question, and some might consider it the same even though it is asking about a PowerShell script. The answer said it wasn't possible to include a script this way (so maybe this is not possible using the GitLab CI syntax).
If it is possible, please let me know how to do this.
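For reference, the curl approach above would look roughly like this inside a job definition (the job name, script name and URL are placeholders, and curl has to exist in the runner image):

run-shared-script:
  stage: test
  before_script:
    # download the shared script and make it executable
    - curl -fsSL -o myscript.sh https://example.com/myscript.sh
    - chmod +x myscript.sh
  script:
    - ./myscript.sh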

Is there a way to generate YAML instead of JSON CloudFormation templates using the Serverless Framework?

The Serverless framework is such a great tool. I use it wherever possible.
I would like to know if there is a way to update the serverless.yml file so that it outputs YAML instead of JSON when generating CloudFormation templates. In the .serverless folder they are in JSON format, but it would be really great if they could be generated as YAML instead.
I'd prefer not to have to rely on converters like https://www.json2yaml.com/, great as they are.
Any help greatly appreciated.
There's always a way, but the simple end-user answer is no.
The serverless-framework has a naming-strategy file per provider, and for AWS it's hard-coded to cloudformation-template-[create|update]-stack.json. When the file writer does its job, it looks at the extension and runs the JSON writer.
However, as per the AWS naming file in their repo, they've made it available to be modified by writing a custom plugin. As long as your plugin changes the naming strategy to anything that ends in .yml, the file-writing service will switch to a YAML writing strategy.
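A rough, untested sketch of such a plugin follows; the overridden method names (getCoreTemplateFileName, getCompiledTemplateFileName) are assumptions based on the naming file mentioned above, so check them against the framework version you are running:

'use strict';

class YamlTemplateNames {
  constructor(serverless) {
    const naming = serverless.getProvider('aws').naming;
    // Override the hard-coded .json file names so the writer picks a YAML strategy.
    // These method names are assumed from the AWS naming file; verify before use.
    naming.getCoreTemplateFileName = () => 'cloudformation-template-create-stack.yml';
    naming.getCompiledTemplateFileName = () => 'cloudformation-template-update-stack.yml';
  }
}

module.exports = YamlTemplateNames;

The plugin would then need to be registered in the plugins section of serverless.yml.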

Can make figure out file dependencies in AWS S3 buckets?

The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually, and I can script it too. But I want to use the makefile to upload my image file my_image.jpg only if the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use them to determine whether dependencies need to be rebuilt or not?
The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
A s3 object will require copying if:
the sizes of the two s3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the s3 object does not exist under the specified bucket and prefix destination.
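For the layout in the question, a sync invocation might look like the following; the include patterns are just an assumption about which file types need uploading:

aws s3 sync . s3://mybucket.mydomain.com/ --exclude "*" --include "*.jpg" --include "*.mp4"

Since sync compares size and last-modified time per object, it gives you the "only upload if newer" behaviour without make having to see the S3 timestamps at all.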
I think you'll need to make S3 look like a file system to make this work. On Linux it is common to use FUSE to build adapters like that, and there are several projects that present S3 as a local filesystem. I haven't tried any of them, but it seems like the way to go.

What is the optimal way to store data-files for testing using travis-ci + Docker?

I am trying to set up testing of the repository using travis-ci.org and Docker. However, I couldn't find any documentation about the policy on storage usage.
To perform a set of tests (test.sh) I need a set of input files to run on, which are very big (up to 1 GB, averaging around 500 MB).
One idea is to wget the files directly in the test.sh script, but it would be inefficient to download the input files again for every test run.
The other idea is to create a separate Docker image containing the test files and mount it as a volume, but it would not be nice to push such a big image to a public registry.
Is there a general recommendation for such tests?
Have you considered using Travis File Cache?
You can write your test.sh script in a way so that it will only download a test file if it was not available on the local file system yet.
In your .travis.yml file, you specify which directories should be cached after a successful build. Travis will automatically restore that directory and files in it at the beginning of the next build. As your test.sh script will then notice the file exists already, it will simply skip the download and your build should be a little faster.
Note that the Travis cache works by creating an archive file and putting it on some cloud storage, from which it is downloaded again at the start of later builds. However, the assumption is that this network traffic will likely stay inside that "cloud", potentially even in the same data center. This should still give you some benefit in terms of build time and lower use of resources in your own infrastructure.
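As a concrete illustration (the directory name, file name and URL are placeholders), the .travis.yml part could be:

cache:
  directories:
    - test-data

and test.sh could guard the download with something like:

# fetch the input only if the cached copy is missing
if [ ! -f test-data/input.dat ]; then
  mkdir -p test-data
  wget -O test-data/input.dat https://example.com/input.dat
fi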

Which S3 manager creates '_$folder$' files for pseudo-folders?

I know that S3 has no folders, and I'm working with an inherited application that has some buckets filled with folder_name_$folder$ items. I know that a lot of different tools use these files, or other folder sigils, depending on the tool, to help represent 'folders' to various visual interfaces. I'm wondering which one uses this particular convention.
I'd like to remove them so that my various rake tasks that run down lists of files can go faster, but I'm afraid I'll end up breaking some tool that someone else in the company uses. Can anyone say which tools create these keys, and what functionality, if any, removing them would break? S3fox? The main AWS console?
The _$folder$ keys are created by S3Fox; the AWS Console doesn't create them. You can safely delete them if you like.
They are also created by Hadoop S3 Native FileSystem (NativeS3FileSystem).
The $folder$ keys are also created by Hadoop and Spark applications during file-move operations.
You can remove them using the AWS CLI:
aws s3 rm s3://path --recursive --exclude "*" --include "*\$folder\$"
They seem to be created by 3Hub as well.
I've encountered this issue when creating an Athena partition for which there was no such key in S3. These files can be safely removed.
See also https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/