How to supply snakemake options within the Snakefile itself? - snakemake

I would like to specify Singularity bind paths inside the Snakefile (i.e., the Snakemake script) rather than via the command line. I believe this could be done somehow via their API, e.g. from snakemake import something, etc. How do I achieve this?
Broadly speaking, how do we supply options/arguments to Snakemake via its API from within a Snakefile?

I made a pipeline that does a couple of things, one of which is downloading samples. Starting 30 downloads at the same time is a waste of resources, so I wanted to limit the number of parallel downloads without always having to pass --resources parallel_downloads=1 on the command line. I noticed that snakemake.workflow exists when a Snakefile is executed, and there I added this as a resource:
workflow.global_resources.update({'parallel_downloads': 1})
I have no experience with Singularity, so I don't fully understand what you want, but my guess is that this is stored somewhere in the workflow object and you can change it there.
P.S. This is not at all an official API, and it is not guaranteed to keep working between versions.
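For reference, a minimal sketch of how a global resource like the one above can then be consumed by a rule. The rule name, output path, and download URL are made up, and, as noted, touching the workflow object is not a stable API:

# Snakefile (sketch): cap concurrent downloads via a global resource.
workflow.global_resources.update({"parallel_downloads": 1})

rule download_sample:
    output:
        "data/{sample}.fastq.gz"
    resources:
        parallel_downloads=1   # at most one job using this resource runs at a time
    shell:
        "wget -O {output} https://example.com/{wildcards.sample}.fastq.gz"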

Related

Multiple users executing the same workflow

Are there guidelines regarding how to share a Snakemake workflow among multiple users on the same data under Linux, or is the whole thing considered bad practice?
Let me explain in case it's not clear:
Suppose user A executes a workflow in directory dir/. Assume the workflow terminates successfully, and that they then properly set read/write permissions recursively on all output and intermediate files and on the .snakemake/ subdirectory for other users.
User B subsequently navigates to dir/, adds input files to the workflow, then executes it. Can anything go wrong?
TL;DR: I'm asking about non-concurrent execution of the same workflow by distinct users on the same system, and on the same data on disk. Is Snakemake designed for such use cases?
It's possible to run snakemake --nolock, which will prevent locking of the directory, so multiple runs can be made from inside the same directory. However, without the lock there is now an opening for errors caused by concurrent runs trying to modify the same files. It's probably OK if you are certain this will be avoided, e.g. if you are in constant communication with the other user about which files will be modified.
An alternative is to create a third directory/path and put all the data there. This way you can work from separate directories/paths and avoid costly recomputes.
I would say that from the point of view of Snakemake, and workflow management in general, it's OK for user B to add or update input files and re-run the pipeline. After all, one of the advantages of a workflow management system is updating results according to new input. The problem is that user A could find her results updated without being aware of it.
Off the top of my head and without more detail, this is what I would suggest: make Snakemake read the list of input files from a table (pandas comes in handy for this) or from some configuration file. Keep this sample sheet under version control (with git/GitHub) together with the Snakefile and other source code.
When users update the working directory with new files, they will also need to update the sample sheet for Snakemake to "see" the new input, and other users will know about it via version control. I prefer this setup over dumping files in a directory and letting Snakemake process whatever is in there.
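A minimal sketch of that setup, assuming a tab-separated sample sheet samples.tsv with a "sample" column (the file name, column name, and result paths are made up):

# Snakefile (sketch): drive the workflow from a version-controlled sample sheet.
import pandas as pd

samples = pd.read_csv("samples.tsv", sep="\t")["sample"].tolist()

rule all:
    input:
        expand("results/{sample}.txt", sample=samples)

Adding a new input then means adding a row to samples.tsv and committing it, which both triggers the new work and documents the change for other users.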

Snakemake - delete all non-output files produced by a workflow

I have a workflow that produces tons of files, most of which are not the output of any rule (they are intermediate results). I'd like to have the option of deleting everything that is not the output of any rule after the workflow is complete. This would be useful for archiving.
Right now the only way I found to do that is to define all outputs of all rules as protected, and then run snakemake --delete-all-output. Two questions:
1. Is this the way to go, or is there a better solution?
2. Is there a way to automatically define all outputs as protected, or do I have to go through the entire code and wrap all outputs with protected()?
Thanks!
Maybe the option --list-untracked helps?
--list-untracked, --lu
    List all files in the working directory that are not used in the workflow. This can be used e.g. for identifying leftover files. Hidden files and directories are ignored.
In addition to #dariober's suggestion, here are a few ideas:
It sounds like you know this already, but you could wrap unneeded output in temp(), which will cause Snakemake to delete it automatically. You can combine this with --notemp for debugging. With temp(), deletion happens progressively, not after the workflow is complete.
Another option may be to use the onsuccess hook defined by Snakemake. From the docs, "The onsuccess handler is executed if the workflow finished without error." So, say, if throughout the workflow you put unneeded files in a temp/ folder or similar, you could use shutil.rmtree("temp") in onsuccess, which would delete all your unneeded files only after the workflow finished successfully, as you require. (Note also the similar onerror, should you need it.)
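A minimal Snakefile sketch combining both ideas; the temp/ directory, rule name, and file patterns are illustrative only, and some downstream rule is assumed to consume the intermediate output:

# Snakefile (sketch): per-output cleanup with temp() plus a final sweep in onsuccess.
import os
import shutil

rule intermediate:
    input:
        "data/{sample}.txt"
    output:
        temp("temp/{sample}.intermediate")   # removed as soon as no downstream rule needs it
    shell:
        "cp {input} {output}"

onsuccess:
    # final sweep: delete anything still parked under temp/ once the whole workflow succeeds
    if os.path.isdir("temp"):
        shutil.rmtree("temp")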

serverless - aws - SecureLambdaFunction env

I have the following case:
I'm setting several environment variables in my serverless.yml file, like:
ONE_CLIENT_SECRET=${ssm:/one/key_one~true}
ONE_CLIENT_PUBLIC=${ssm:/one/key_two~true}
ANOTHER_SERVICE_KEY=${ssm:/two/key_one~true}
ANOTHER_SERVICE_SECRET=${ssm:/two/key_two~true}
Let's say I have around 10 environment variables; when I try to deploy, I get the following error:
An error occurred: SecureLambdaFunction - Lambda was unable to configure your environment variables because the environment variables you have provided exceeded the 4KB limit. String measured: JSON_WITH_MY_VARIABLES_HERE
So I cannot deploy. I have an idea of what the problem is, but I don't have a clear path to solving it, so my questions are:
1) How can I extend the 4KB limit?
2) Assuming my variables are set using SSM (I'm using the EC2 Parameter Store to save them), how does it work behind the scenes? (This is more a question for the Serverless team or someone who knows the topic.)
- When I run sls deploy, does it fetch the values and include them in the .zip file (this is what I think it does, I just want to confirm), or does it fetch the values when I execute the Lambdas? I'm asking because when I go to the AWS Lambda console I can see them set there.
Thanks!
After taking a deeper look, I came to the following conclusion:
Using the pattern ONE_CLIENT_SECRET=${ssm:/one/key_one~true} means that the Serverless framework downloads the values at deploy time and embeds them into the project. This is where the problem comes from, and you can see it after uploading the project: your variables end up in plain text in the Lambda console.
My solution was to use a middy middleware to load SSM values when the Lambda executes. This means you need to structure your project so that no code runs until the variables are available, and you need a good strategy for caching the variables (cold start); otherwise, it will add time to every execution.
The 4KB limit cannot be changed, and after reading about it, that seems reasonable.
So, short story: if you hit this problem, find the combination of runtime middleware and embedded values that works best for you.
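The answer above uses middy, which is Node.js specific. As a rough illustration of the same idea in Python with boto3, i.e. fetching and caching the secrets at runtime instead of baking them into environment variables (the parameter name and handler are hypothetical):

import boto3

_ssm = boto3.client("ssm")
_cache = {}  # module-level cache: survives warm invocations, so SSM is only hit on cold starts

def get_param(name):
    if name not in _cache:
        resp = _ssm.get_parameter(Name=name, WithDecryption=True)
        _cache[name] = resp["Parameter"]["Value"]
    return _cache[name]

def handler(event, context):
    secret = get_param("/one/key_one")  # fetched at runtime, never stored in env vars
    # ... use secret ...
    return {"statusCode": 200}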

Is there a way to make Gitlab CI run only when I commit an actual file?

New to Gitlab CI/CD.
What is the proper construct to use in my .gitlab-ci.yml file to ensure that my validation job runs only when a "real" check-in happens?
What I mean is: I observe that the moment I create a merge request, say (which of course creates a new branch), the CI/CD process runs. That is, the branch creation itself, despite the fact that no files have changed, causes the .gitlab-ci.yml file to be processed and pipelines to be kicked off.
Ideally I'd only want this to happen when there is actually a change to a file, a file addition, etc. In common-sense terms, I don't want CI/CD running on trivial operations that don't actually change the state of the software under development.
I'm passably familiar with except and only, but these don't seem to be able to limit things the way I want. Am I missing a fundamental category or recipe?
I'm afraid what you ask is not possible within GitLab CI.
There could be a way to use the CI_COMMIT_SHA predefined variable, since that will be the same in your new branch as in your source branch.
Still, the pipeline will run before it can determine or compare SHAs in a custom script or condition.
GitLab runs pipelines for branches or tags, not for individual commits. Pushing to a repo triggers a pipeline, and creating a branch is in fact pushing a change to the repo.

amazon s3 renaming and overwriting files, recommendations and risks

I have a bucket with two kinds of file names:
[Bucket]/[file]
[Bucket]/[folder]/[file]
For example, I could have:
MyBucket/bar
MyBucket/foo/bar
I want to rename all the [Bucket]/[folder]/[file] files to [Bucket]/[file] files (and thus overwriting / discarding the [Bucket]/[file] files).
So, as in the previous example, I want MyBucket/foo/bar to become MyBucket/bar (and overwrite / discard the original MyBucket/bar).
I tried two methods:
Using s3cmd's move command: s3cmd mv s3://MyBucket/foo/bar s3://MyBucket/bar
Using Amazon's SDK for php: rename(s3://MyBucket/foo/bar, s3://MyBucket/bar)
Both methods seem to work, but considering I have to do this as a batch process on thousands of files, my questions are:
Which method is preferred?
Are there other better methods?
Must I delete the old files prior to the move/rename? (it seems to work fine without it, but I might not be aware of risks involved)
Thank you.
Since I asked this question about 5 months ago, I have had some time to gain some insights, so I will answer it myself:
From what I have seen, there is no major difference performance-wise. I can imagine that calling s3cmd from within PHP might be costly, due to invoking an external process for each request; but then again, Amazon's SDK uses cURL to send its requests, so there is not much of a difference.
One difference I did notice is that Amazon's SDK tends to throw cURL exceptions (seemingly randomly, and rarely), whereas s3cmd did not crash at all. My scripts run on tens of thousands of files, so I had to learn the hard way to deal with these cURL exceptions.
My theory is that cURL crashes when there is a communication conflict on the server (for example, when two processes try to use the same resource). I am working on a development server on which several processes sometimes access S3 with cURL simultaneously; these are the only situations in which cURL exhibited this behaviour.
In conclusion:
Using s3cmd might be more stable, but using the SDK allows more versatility and better integration with your PHP code, as long as you remember to handle the rare cases (I'd say 1 in every 1000 requests, when several processes run simultaneously) in which the SDK throws a cURL exception.
Since both methods, s3cmd and the SDK, will eventually issue the same REST call, you can safely choose whichever is best for you.
When you move a file, if the target exists it is always replaced; so if you don't want this behavior, you will need to check whether the target key already exists before deciding whether to perform the move operation.
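As a side note, S3 has no native rename: a "move" is a copy followed by a delete, and copying onto an existing key silently overwrites it. A rough sketch of that pattern, including the optional existence check mentioned above, in Python with boto3 rather than PHP or s3cmd, purely to illustrate the underlying calls (bucket and key names are hypothetical):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def key_exists(bucket, key):
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:
        return False

def move(bucket, src_key, dst_key, overwrite=True):
    # An S3 "move" is a copy followed by a delete; the copy overwrites any existing target.
    if not overwrite and key_exists(bucket, dst_key):
        return False  # skip: target already exists
    s3.copy_object(Bucket=bucket, Key=dst_key,
                   CopySource={"Bucket": bucket, "Key": src_key})
    s3.delete_object(Bucket=bucket, Key=src_key)
    return True

# e.g. move("MyBucket", "foo/bar", "bar") replaces MyBucket/bar with MyBucket/foo/bar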