minimal requirements for specifying file paths in a cloud bucket - snakemake

I am very familiar with GCP but new to Snakemake.
I have a simple working example with a snakemake file that just has:
rule sed_example:
    input:
        "in{sample}.txt"
    output:
        "out{sample}.txt"
    shell:
        "sed 'y/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM/' {input} > {output}"
This runs fine on my local machine with the command:
snakemake -s Snakefile --verbose --cores=1 out{1,2,3,4,5}.txt
If I run it without specifying the output filenames, I get the error:
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
But I guess that is expected.
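(As far as I can tell, the usual fix is a first rule that expands the wildcard into concrete targets, e.g.:

SAMPLES = ["1", "2", "3", "4", "5"]

rule all:
    input:
        expand("out{sample}.txt", sample=SAMPLES)

With an all rule like that at the top, a bare snakemake --cores=1 has concrete files to build.)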
Next, I want to run the same thing but with the files in a GCP/GCS bucket. I put the files there with gsutil rsync, and I can even list them from inside a Snakefile with the snippet below, so my Google auth setup seems fine:
from os.path import join
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider()
GS_PREFIX = "snakemake-cluster-test-02-bucket"
samples, *_ = GS.glob_wildcards(GS_PREFIX + '/in{sample}.txt')
print(samples)
The input files are there:
$ gsutil ls gs://snakemake-cluster-test-02-bucket/
gs://snakemake-cluster-test-02-bucket/Snakefile
gs://snakemake-cluster-test-02-bucket/in1.txt
gs://snakemake-cluster-test-02-bucket/in2.txt
gs://snakemake-cluster-test-02-bucket/in3.txt
gs://snakemake-cluster-test-02-bucket/in4.txt
gs://snakemake-cluster-test-02-bucket/in5.txt
I was hoping it would be as easy as:
snakemake -s Snakefile --verbose --cores=1 --default-remote-provider GS --default-remote-prefix gs://snakemake-cluster-test-02-bucket/ out{1,2,3,4,5}.txt
but that gives me the error:
MissingRuleException:
No rule to produce out4.txt (if you use input functions make sure that they don't raise unexpected exceptions).
I guess ultimately I don't understand how Snakemake generates the absolute paths for files, as the auto-magic does not work for me. I tried various ways to specify the bucket and filename.
I tried:
snakemake -s Snakefile --verbose --cores=1 --default-remote-provider GS --default-remote-prefix gs://snakemake-cluster-test-02-bucket/ out{1,2,3,4,5}.txt
MissingRuleException:
No rule to produce out3.txt (if you use input functions make sure that they don't raise unexpected exceptions).
snakemake -s Snakefile --verbose --cores=1 --default-remote-provider GS --default-remote-prefix gs://snakemake-cluster-test-02-bucket/ gs://snakemake-cluster-test-02-bucket/out{1,2,3,4,5}.txt
ValueError: Bucket names must start and end with a number or letter.
I tried changing the inputs and/or outputs in the snakefile to GS.remote("in{sample}.txt") instead of just "in{sample}.txt" and got various errors such as:
File "/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/remote/GS.py", line 224, in parse
if len(m.groups()) != 2:
AttributeError: 'NoneType' object has no attribute 'groups'
I also tried variations of:
GS.remote("gs://snakemake-cluster-test-02-bucket/in{sample}.txt")
GS.remote("snakemake-cluster-test-02-bucket/in{sample}.txt")
GS.remote("gs://snakemake-cluster-test-02-bucket/in1.txt")
GS.remote("snakemake-cluster-test-02-bucket/in1.txt")
here is the output of my most common error:
(snakemake) alex-mbp-923:snakemake-example $ snakemake -np --verbose --cores=1 --default-remote-provider GS --default-remote-prefix gs://snakemake-cluster-test-02-bucket/ out{1,2,3,4,5}.txt
/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/google/auth/_default.py:69: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/
warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
[the same UserWarning is printed three more times]
['1', '2', '3', '4', '5']
Building DAG of jobs...
Full Traceback (most recent call last):
File "/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/__init__.py", line 626, in snakemake
success = workflow.execute(
File "/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/workflow.py", line 655, in execute
dag.init()
File "/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 172, in init
job = self.update(self.file2jobs(file), file=file, progress=progress)
File "/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 1470, in file2jobs
raise MissingRuleException(targetfile)
snakemake.exceptions.MissingRuleException: No rule to produce out1.txt (if you use input functions make sure that they don't raise unexpected exceptions).
MissingRuleException:
No rule to produce out1.txt (if you use input functions make sure that they don't raise unexpected exceptions).
(snakemake) alex-mbp-923:snakemake-example $ cat Snakefile
from os.path import join
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider()
GS_PREFIX = "snakemake-cluster-test-02-bucket"
samples, *_ = GS.glob_wildcards(GS_PREFIX + '/in{sample}.txt')
print(samples)
rule sed_example:
    input:
        "in{sample}.txt"
    output:
        "out{sample}.txt"
    shell:
        "sed 'y/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM/' {input} > {output}"
What am I missing? Clearly I am not specifying the paths correctly but I can't figure out what the correct way should be.
Specifying
GS_PREFIX = "snakemake-cluster-test-02-bucket"
vs
GS_PREFIX = "gs://snakemake-cluster-test-02-bucket"
doesn't seem to matter, and I guess that's OK.
Other examples I looked at:
https://github.com/bhattlab/bhattlab_workflows/blob/master/preprocessing/10x_longranger_GCP.snakefile
https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html
https://blog.liang2.tw/posts/2017/08/snakemake-google-cloud/
https://www.bsiranosian.com/bioinformatics/large-scale-bioinformatics-in-the-cloud-with-gcp-kubernetes-and-snakemake/
Regards,
Alex

It seems that in the snakemake command the --default-remote-prefix option needs just the bare bucket name, "snakemake-cluster-test-02-bucket", with no gs:// scheme and no trailing slash.
The rest is assumed, since you already pass GS for the --default-remote-provider option.
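For example, with the Snakefile from the question left unchanged (plain relative paths in input and output), the invocation would look like this (a sketch, not tested on your bucket):

snakemake -s Snakefile --verbose --cores=1 \
    --default-remote-provider GS \
    --default-remote-prefix snakemake-cluster-test-02-bucket \
    out{1,2,3,4,5}.txt

Snakemake prepends the prefix itself, so a plain target like out1.txt becomes snakemake-cluster-test-02-bucket/out1.txt on GCS; with the gs:// form, the scheme ends up inside the bucket name, which is presumably where the "Bucket names must start and end with a number or letter" error comes from.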
Boris

Related

snakemake rule won't run the complete command

I am working on this snakemake pipeline where the last rule looks like this:
rule RunCodeml:
    input:
        '{ProjectFolder}/{Fastas}/codeml.ctl'
    output:
        '{ProjectFolder}/{Fastas}/codeml.out'
    shell:
        'codeml {input}'
This rule does not run and the error seems to be that the program codeml can't find the .ctl file because it looks for an incomplete path: '/work_beegfs/sunam133/Anas_plasmids/Phylo_chromosome/Acinet_only/Klebs_Esc_SCUG/cluster_536/co'
although the command seems correct:
shell:
codeml /work_beegfs/sunam133/Anas_plasmids/Phylo_chromosome/Acinet_only/Klebs_Esc_SCUG/cluster_536/codeml.ctl
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
And here the output from running with -p option:
error when opening file /work_beegfs/sunam133/Anas_plasmids/Phylo_chromosome/Acinet_only/Klebs_Esc_SCUG/cluster_1083/co
tell me the full path-name of the file? Can't find the file. I give up.
I find this behavior very strange and I can't figure out what is going on. Any help would be appreciated.
Thanks!
D.
OK, so the problem was not Snakemake but the program I am calling (codeml), which limits the length of the path string it accepts for the control file.
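One possible workaround (a sketch, untested; it assumes codeml resolves relative paths against the working directory and writes codeml.out there) is to run codeml from inside the directory that holds the control file, so the path it sees stays short:

rule RunCodeml:
    input:
        '{ProjectFolder}/{Fastas}/codeml.ctl'
    output:
        '{ProjectFolder}/{Fastas}/codeml.out'
    shell:
        # cd first so codeml only ever sees the short name codeml.ctl
        'cd $(dirname {input}) && codeml codeml.ctl'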

Submitting results to Kaggle competition from command line regardless of kernel type or file name, in or out of Kaggle

Within Kaggle: how can I submit my results to a Kaggle competition regardless of kernel type or file name?
And if I am in a notebook outside Kaggle (Colab, Jupyter, Paperspace, etc.)?
Introduction (you can skip this part)
I was looking around for a method to do that: in particular, being able to submit at any point within the notebook (so you can test different approaches), with a file of any name (to keep things separated), and any number of times (respecting the Kaggle limits).
I found many webs explaining the process like
Making Submission
1. Hit the "Publish" button at the top of your notebook screen.
If you have written an output file, then you have an "Output" tab.
2. Output > Submit to Competition
However, they fail to clarify that the kernel must be of type "Script" and not "Notebook".
That has some limitations that I haven't fully explored.
I just wanted to be able to submit whatever file from the notebook, just like any other command within it.
The process
Well, here is the process I came up with.
Suggestions, corrections, comments, and improvements are welcome. Specifically, I'd like to know whether this method has drawbacks compared to the one described above.
Process:
Install required libraries
Provide your kaggle credentials
using the file kaggle.json OR
setting some environment variables with your kaggle credentials
Submit with a simple command.
Q: Where do I get my kaggle credentials?
A: You get them from https://www.kaggle.com > 'Account' > "Create new API token"
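The downloaded kaggle.json is a single-line JSON file of this shape (the values below are placeholders):

{"username":"abc","key":"12341341"}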
1. Install required libraries
# Install required libraries
!pip install --upgrade pip
!pip install kaggle --upgrade
2. Provide your kaggle credentials -- setting some environment variables with your kaggle credentials
# Add your PRIVATE credentials
# Do not use "!export KAGGLE_USERNAME=..." and do not put quotes around the values
%env KAGGLE_USERNAME=abc
%env KAGGLE_KEY=12341341
# Verify
!export -p | grep KAGGLE_USERNAME
!export -p | grep KAGGLE_KEY
See Note below.
2. Provide your kaggle credentials -- using the file kaggle.json
%mkdir --parents /root/.kaggle/
%cp /kaggle/input/<your_private_dataset>/kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
How you get the file there is up to you.
One simple way is this:
Download the kaggle.json to your computer
In kaggle, create a private dataset (Your_Profile > Datasets > New Dataset)
Add the kaggle.json to that Dataset
Add the private Dataset to your notebook ( Data > Add Data > Datasets > Your Datasets)
This may seem a bit cumbersome, but sooner or later your API credentials may change, and updating the file in one place (the dataset) will update it in all your notebooks.
3. Submit with a simple command.
Here <competition-name> is the code name of the competition. You can get it from the URL of the competition or from the "My submissions" section of the competition page.
# Submit
!kaggle competitions submit -c <competition-name> -f submission.csv -m "Notes"
# example:
!kaggle competitions submit -c bike-sharing-demand -f submission.csv -m "Notes"
# View results
!kaggle competitions submissions -c <competition-name>
# example:
!kaggle competitions submissions -c bike-sharing-demand
Note:
If you are concerned about the security of your credentials and/or want to share the kernel, you can type the two commands with your credentials in the "Console" instead of in the notebook (example below). They will be valid/available during that session only.
import os
os.environ['KAGGLE_USERNAME'] = "here DO use double quotes"
os.environ['KAGGLE_KEY'] = "here DO use double quotes"
You can find the console at the bottom of your kernel.
PS: Initially this was posted here, but as the answer grew, the Markdown display broke in Kaggle (though not elsewhere), so I had to move it out of Kaggle.

shell() function in run does not use singularity

EDIT
I have now posted this question as an issue on the Snakemake Bitbucket, since this seems to be unexpected behavior.
I am using snakemake with the --use-singularity option.
When I use a classic rule of the form:
singularity: "mycontainer"

rule myrule:
    input:
    output:
    shell:
        "somecommand"
with the somecommand only present in the singularity container, everything goes fine.
However, when I need to use some python code in the run part of the rule, the command is not found.
rule myrule:
    input:
    output:
    run:
        # some python code here
        shell("somecommand")
The only workaround I found is to use
shell("singularity exec mycontainer somecommand")
but this is not optimal.
I am either missing something, such as an option, or this is a missing feature in snakemake.
What I would like to obtain is to use the shell() function with the --use-singularity option.
Snakemake doesn't allow using --use-conda with a run block, and this is why:
The run block of a rule (see Rules) has access to anything defined in the Snakefile, outside of the rule. Hence, it has to share the conda environment with the main Snakemake process. To avoid confusion we therefore disallow the conda directive together with the run block. It is recommended to use the script directive instead (see External scripts).
I bet --use-singularity is not allowed with run block for the same reason.
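If so, the docs' recommendation carries over: move the run body into an external script, which does execute inside the container under --use-singularity. A rough sketch (the file names and container are placeholders, not from the question):

rule myrule:
    input:
        "in.txt"
    output:
        "out.txt"
    singularity:
        "mycontainer.sif"
    script:
        "scripts/myrule.py"

# scripts/myrule.py, executed inside the container when --use-singularity is set
from snakemake.shell import shell

# some python code here, then:
shell("somecommand {snakemake.input} > {snakemake.output}")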

scp fails with "protocol error: filename does not match request"

I have a script that uses SCP to pull a file from a remote Linux host on AWS. After running the same code nightly for about 6 months without issue, it started failing today with protocol error: filename does not match request. I reproduced the issue on some simpler filenames below:
$ scp -i $IDENT $HOST_AND_DIR/"foobar" .
# the file is copied successfully
$ scp -i $IDENT $HOST_AND_DIR/"'foobar'" .
protocol error: filename does not match request
# used to work, I swear...
$ scp -i $IDENT $HOST_AND_DIR/"'foobarbaz'" .
scp: /home/user_redacted/foobarbaz: No such file or directory
# less surprising...
The reason for my single quotes was that I was grabbing a file with spaces in the name originally. To deal with the spaces, I had done $HOST_AND_DIR/"'foo bar'" for many months, but starting today, it would only accept $HOST_AND_DIR/"foo\ bar". So, my issue is fixed, but I'm still curious about what's going on.
I Googled the error message, but I don't see any real mentions of it, which surprises me.
Both hosts involved have OpenSSL 1.0.2g in the output of ssh -v localhost, and bash --version says GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu)
Any ideas?
I ended up having a look through the source code and found the commit where this error is thrown:
GitHub Commit
check in scp client that filenames sent during remote->local directory copies satisfy the wildcard specified by the user.
This checking provides some protection against a malicious server
sending unexpected filenames, but it comes at a risk of rejecting
wanted files due to differences between client and server wildcard
expansion rules.
For this reason, this also adds a new -T flag to disable the check.
They have added a new -T flag that disables this check, so the change is backwards compatible. However, I suppose we should still find out why the filenames we're using are being flagged.
In my case, I had [] characters in the filename that needed to be escaped using one of the options listed here. For example:
scp USERNAME@IP_ADDR:"/tmp/foo\[bar\].txt" /tmp
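And if you trust the server and just want the old behaviour back, the -T flag added by that commit disables the client-side check, e.g.:

# -T skips the new filename check; use it only against servers you trust
scp -T -i $IDENT $HOST_AND_DIR/"'foo bar'" .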

What does "--logtostderr" mean in the command line while using tensorflow's object detection api?

When training object detection models using TensorFlow, we always run:
python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_pets.config
But I wonder: what is the functionality of --logtostderr? What happens if I omit it?
As the name implies, it sends the logs to the STDERR standard stream, which allows you to append 2> somefilecontainingthelogs.txt at the end of the command to capture them.
You can read more about STDIN, STDOUT and STDERR here: http://www.learnlinux.org.za/courses/build/shell-scripting/ch01s04.html
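To make that concrete, here is the command from the question with the logs captured in a file (the log filename is just an example):

# --logtostderr routes the logs to STDERR, so 2> collects them in a file
python train.py --logtostderr --train_dir=training/ \
    --pipeline_config_path=training/ssd_mobilenet_v1_pets.config 2> training.log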
If you were to not include the --logtostderr parameter, the logs would typically be sent over to STDOUT; practically if you were to run the command as you have in your question, the result would be the same. But if you were using 2> for redirecting the logs to a file, omitting the --logtostderr would no longer log anything and the logs would appear on the screen since STDOUT is not redirected to a file.