setting FTP proxy with snakemake FTP provider - snakemake

I'm trying to download files from an FTP server using Snakemake's snakemake.remote.FTP provider, as follows:
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider()
chrList = [*range(1, 23)]
...
# Download 1K genomes vcf files
rule download1kgenomes:
    input:
        FTP.remote(expand("ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr{chr}.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz", chr=chrList), keep_local=True, immediate_close=True),
        FTP.remote(expand("ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr{chr}.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.tbi", chr=chrList), keep_local=True, immediate_close=True),
        FTP.remote("ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/20140625_related_individuals.txt", keep_local=True, immediate_close=True),
        FTP.remote("ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel", keep_local=True, immediate_close=True)
    output:
        expand(config["refPanelDir"] + "/ALL.chr{chr}.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz", chr=chrList),
        expand(config["refPanelDir"] + "/ALL.chr{chr}.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.tbi", chr=chrList),
        config["refPanelDir"] + "/20140625_related_individuals.txt",
        config["refPanelDir"] + "/integrated_call_samples_v3.20130502.ALL.panel"
    params:
        outdir = config["refPanelDir"]
    run:
        shell("mv {input} {params.outdir}")
This works perfectly well on a normal internet connection, and I think it's a great Snakemake feature. Unfortunately, my university is behind a proxy, and when I try this at the office I get error messages when the connection to the remote server fails:
File "/my/path/to/python3.9/site-packages/ftputil/error.py", line 199, in __exit__
raise FTPOSError(*exc_value.args, original_error=exc_value) from exc_value
ftputil.error.FTPOSError: [Errno 110] Connection timed out
Debugging info: ftputil 5.0.1, Python 3.9.5 (linux)
Does anybody know whether it is possible to specify an FTP proxy for Snakemake, and if so, how?
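As far as I can tell, FTPRemoteProvider (which goes through ftputil/ftplib) exposes no proxy option. One hedged workaround, if your proxy supports the classic "USER user@host" relay convention, is to open the FTP connection yourself and fetch the files in a plain run/shell rule. Everything below is a sketch built on that assumption; the ftp_proxy environment variable and its host:port format are also assumptions about your site's setup:

```python
import ftplib
import os

def proxy_login_user(user: str, target_host: str) -> str:
    # Many FTP proxies relay a session when you log in to the proxy
    # itself with "realuser@real-host" as the username.
    return f"{user}@{target_host}"

def connect_via_ftp_proxy(target_host, user="anonymous", passwd="anonymous@"):
    # Read the proxy address from the conventional environment variable
    # (assumed to look like "proxy.example.edu:2121").
    proxy = os.environ.get("ftp_proxy", "")
    host, _, port = proxy.replace("ftp://", "").partition(":")
    ftp = ftplib.FTP()
    ftp.connect(host, int(port or 21))
    ftp.login(proxy_login_user(user, target_host), passwd)
    return ftp
```

Whether this works depends entirely on the proxy type; an HTTP-only proxy will not relay FTP this way.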
Regards

Related

Prefect not finding .env file

While running a Prefect flow from PyCharm everything works fine, but when I start it from Prefect Server, the flow doesn't find the .env file with my credentials and fails with my own assertion error from this code:
class MyDotenv:
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        dotenv_file = ".\\04_keep_local\\.env"
        assert os.path.isfile(dotenv_file), "\n-> Couldn't locate .env file!"
        dotenv.load_dotenv(dotenv_file)
I've used these commands on my virtual environment (venv) to start the server and the agent:
prefect backend server
prefect server start
prefect agent local start
Any ideas?
Did you perhaps start your LocalAgent in a directory that doesn't contain your .env file?
The LocalAgent runs a flow as a subprocess of itself, meaning the directory your flow runs in is the directory in which you executed prefect agent local start.
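One way to make the lookup independent of the agent's working directory is to resolve the .env path relative to the flow script itself. This is a minimal sketch; a ".env" sitting next to the script is an assumption about your project layout:

```python
from pathlib import Path

# Resolve the .env file relative to this script rather than the
# agent's current working directory, so the flow finds it no matter
# where "prefect agent local start" was executed.
dotenv_file = Path(__file__).resolve().parent / ".env"
```

dotenv.load_dotenv(str(dotenv_file)) would then behave the same from PyCharm and from the agent.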

minimal requirements for specifying file paths in a cloud bucket

I am very familiar with GCP but new to Snakemake.
I have a simple working example with a Snakefile that contains just:
rule sed_example:
    input:
        "in{sample}.txt"
    output:
        "out{sample}.txt"
    shell:
        "sed 'y/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM/' {input} > {output}"
This runs fine on my local machine with the command:
snakemake -s Snakefile --verbose --cores=1 out{1,2,3,4,5}.txt
If I run it without specifying the output filenames, I get the error:
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
But I guess that is expected.
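(That error is indeed expected: targets given on the command line must be concrete files. If you'd rather not list the outputs on the command line, the usual Snakemake pattern is an aggregate target rule without wildcards at the top of the Snakefile; the sample list here is an assumption based on the files shown below:)

```
rule all:
    input:
        expand("out{sample}.txt", sample=[1, 2, 3, 4, 5])
```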
Next, I want to run the same thing with the files in a GCP/GCS bucket. I put the files there with gsutil rsync, and I can even list them from inside a Snakefile using the snippet below, so my Google auth setup seems fine:
from os.path import join
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider()
GS_PREFIX = "snakemake-cluster-test-02-bucket"
samples, *_ = GS.glob_wildcards(GS_PREFIX + '/in{sample}.txt')
print(samples)
The input files are there:
$ gsutil ls gs://snakemake-cluster-test-02-bucket/
gs://snakemake-cluster-test-02-bucket/Snakefile
gs://snakemake-cluster-test-02-bucket/in1.txt
gs://snakemake-cluster-test-02-bucket/in2.txt
gs://snakemake-cluster-test-02-bucket/in3.txt
gs://snakemake-cluster-test-02-bucket/in4.txt
gs://snakemake-cluster-test-02-bucket/in5.txt
I was hoping it would be as easy as:
snakemake -s Snakefile --verbose --cores=1 --default-remote-provider GS --default-remote-prefix gs://snakemake-cluster-test-02-bucket/ out{1,2,3,4,5}.txt
but that gives me the error:
MissingRuleException:
No rule to produce out4.txt (if you use input functions make sure that they don't raise unexpected exceptions).
I guess ultimately I don't understand how Snakemake generates the absolute paths for files, as the auto-magic does not work for me. I tried various ways to specify bucket/filename...
I tried:
snakemake -s Snakefile --verbose --cores=1 --default-remote-provider GS --default-remote-prefix gs://snakemake-cluster-test-02-bucket/ out{1,2,3,4,5}.txt
MissingRuleException:
No rule to produce out3.txt (if you use input functions make sure that they don't raise unexpected exceptions).
snakemake -s Snakefile --verbose --cores=1 --default-remote-provider GS --default-remote-prefix gs://snakemake-cluster-test-02-bucket/ gs://snakemake-cluster-test-02-bucket/out{1,2,3,4,5}.txt
ValueError: Bucket names must start and end with a number or letter.
I tried changing the inputs and/or outputs in the snakefile to GS.remote("in{sample}.txt") instead of just "in{sample}.txt" and got various errors such as:
File "/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/remote/GS.py", line 224, in parse
if len(m.groups()) != 2:
AttributeError: 'NoneType' object has no attribute 'groups'
I also tried variations of:
GS.remote("gs://snakemake-cluster-test-02-bucket/in{sample}.txt")
GS.remote("snakemake-cluster-test-02-bucket/in{sample}.txt")
GS.remote("gs://snakemake-cluster-test-02-bucket/in1.txt")
GS.remote("snakemake-cluster-test-02-bucket/in1.txt")
Here is the output of my most common error:
(snakemake) alex-mbp-923:snakemake-example $ snakemake -np --verbose --cores=1 --default-remote-provider GS --default-remote-prefix gs://snakemake-cluster-test-02-bucket/ out{1,2,3,4,5}.txt
/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/google/auth/_default.py:69: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/
warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
['1', '2', '3', '4', '5']
Building DAG of jobs...
Full Traceback (most recent call last):
File "/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/__init__.py", line 626, in snakemake
success = workflow.execute(
File "/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/workflow.py", line 655, in execute
dag.init()
File "/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 172, in init
job = self.update(self.file2jobs(file), file=file, progress=progress)
File "/Users/alex/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/dag.py", line 1470, in file2jobs
raise MissingRuleException(targetfile)
snakemake.exceptions.MissingRuleException: No rule to produce out1.txt (if you use input functions make sure that they don't raise unexpected exceptions).
MissingRuleException:
No rule to produce out1.txt (if you use input functions make sure that they don't raise unexpected exceptions).
(snakemake) alex-mbp-923:snakemake-example $ cat Snakefile
from os.path import join
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider()
GS_PREFIX = "snakemake-cluster-test-02-bucket"
samples, *_ = GS.glob_wildcards(GS_PREFIX + '/in{sample}.txt')
print(samples)
rule sed_example:
    input:
        "in{sample}.txt"
    output:
        "out{sample}.txt"
    shell:
        "sed 'y/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM/' {input} > {output}"
What am I missing? Clearly I am not specifying the paths correctly but I can't figure out what the correct way should be.
Specifying
GS_PREFIX = "snakemake-cluster-test-02-bucket"
vs
GS_PREFIX = "gs://snakemake-cluster-test-02-bucket"
doesn't seem to matter, and I guess that's OK.
Other examples I looked at:
https://github.com/bhattlab/bhattlab_workflows/blob/master/preprocessing/10x_longranger_GCP.snakefile
https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html
https://blog.liang2.tw/posts/2017/08/snakemake-google-cloud/
https://www.bsiranosian.com/bioinformatics/large-scale-bioinformatics-in-the-cloud-with-gcp-kubernetes-and-snakemake/
Regards,
Alex
It seems that in the snakemake command, the --default-remote-prefix option needs just the bare bucket name, "snakemake-cluster-test-02-bucket", without the gs:// scheme.
The scheme is implied, since you pass GS as the --default-remote-provider option.
Boris
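(To expand on why the scheme breaks things: Snakemake simply prepends the prefix to each local path before the GS provider parses out "bucket/key", so a "gs://" prefix ends up inside the would-be bucket name and trips the "Bucket names must start and end with a number or letter" validation. A simplified sketch of the joining, not Snakemake's actual code:)

```python
def apply_default_remote_prefix(prefix: str, path: str) -> str:
    # Simplified sketch of what --default-remote-prefix does: the prefix
    # is joined in front of every plain input/output path, and the GS
    # provider then reads the first path component as the bucket name.
    return f"{prefix.rstrip('/')}/{path}"

print(apply_default_remote_prefix("snakemake-cluster-test-02-bucket", "out1.txt"))
# → snakemake-cluster-test-02-bucket/out1.txt
```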

Do snakemake-wrappers get fetched once and stored locally, or fetched every time from the remote URL?

When Snakemake wrappers are used in Snakemake rules, do they get fetched from the remote URL every time, or are they stored locally? I don't see them stored anywhere in the .snakemake directory, which makes me think they get fetched every time, even when the same version of a wrapper script is used.
Unfortunately, Snakemake does fetch the wrapper script from the remote URL every time, and there currently doesn't appear to be a way to change this behavior. One alternative is to keep a local clone of the wrapper repository and point Snakemake to it using --wrapper-prefix.
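A sketch of that workaround; the clone location /opt/snakemake-wrappers is hypothetical, so substitute your own path:

```shell
# Keep a local clone of the wrapper repository (one-time setup):
#   git clone https://github.com/snakemake/snakemake-wrappers.git /opt/snakemake-wrappers
# Then run the workflow against the clone instead of the remote URL:
#   snakemake --cores 1 --wrapper-prefix "file:///opt/snakemake-wrappers/"
WRAPPER_PREFIX="file:///opt/snakemake-wrappers/"
```

With a file:// prefix, wrapper lookups resolve locally and work offline (as long as the clone contains the wrapper versions your rules request).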
Here is the error message when internet is not available:
RuleException:
WorkflowError in line 16 of /Users/blah/Downloads/Snakefile:
URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
File "/Users/blah/Downloads/Snakefile", line 16, in __rule_tabix
File "//anaconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run

'gunzip' is not recognized as an internal or external command, operable program or batch file. System command 'gunzip' failed

I am trying to analyse my raw GNSS data with the GNSS Analyser app from https://github.com/google/gps-measurement-tools. The installation guide includes the following step:
4.2 gunzip installation
The automatic ftp code inside GnssAnalysis will download ephemeris zip files, and attempt to
unzip them using gunzip.
Download gzip.exe from here http://ftp.gnu.org/gnu/gzip/gzip-1.9.zip
Extract the files from the zip file, rename gzip.exe to gunzip.exe
Move gunzip.exe to somewhere in your Windows path (type path in the Windows
Command Prompt to see what your path is, typically you will find a directory
C:\Windows\system32 and you can put gunzip.exe there.)
However, after downloading the archive, I can't find a gzip.exe file, so I tried renaming the gzip.c and gzip.h files instead. That did not work, and I got the error above when attempting to process my own raw data.
For what it's worth, I just successfully imported a DB from a backup file with:
gzip -d < C:\Users\my-user\Downloads\my-db-backup.sql.gz | mysql -u root -p MY_DB_NAME
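If the GnssAnalysis tool only needs a working gunzip-equivalent and a Python interpreter is available, the decompression itself can also be done with Python's standard library. A minimal sketch (the file paths you pass in are your own):

```python
import gzip
import shutil

def gunzip(src_path: str, dst_path: str) -> None:
    # Decompress a .gz file to dst_path -- a pure-Python stand-in for
    # the missing gunzip.exe, streaming so large files are not loaded
    # into memory at once.
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```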

Aerospike Docker - (100L, 'UDF: Execution Error 1')

I deployed an Aerospike container using the official Docker Hub image. When I try to execute test_list = client.llist(key, 'test_list'), my Python client script raises the following error:
exception.UDFError: (100L, 'UDF: Execution Error 1', 'src/main/llist/llist_operations.c', 93)
I looked at the Aerospike logs and found that each time this code is executed, the error below gets printed:
: WARNING (udf): (src/main/mod_lua.c:599) Lua Create Error: module 'llist' not found:
no field package.preload['llist']
no file './llist.lua'
no file '/usr/local/share/luajit-2.0.3/llist.lua'
no file '/usr/local/share/lua/5.1/llist.lua'
no file '/usr/local/share/lua/5.1/llist/init.lua'
no file '/opt/aerospike/sys/udf/lua/llist.lua'
no file '/opt/aerospike/sys/udf/lua/external/llist.lua'
no file '/opt/aerospike/usr/udf/lua/llist.lua'
no file './llist.so'
no file '/usr/local/lib/lua/5.1/llist.so'
no file '/usr/local/lib/lua/5.1/loadall.so'
no file '/opt/aerospike/sys/udf/lua/llist.so'
no file '/opt/aerospike/sys/udf/lua/external/llist.so'
no file '/opt/aerospike/usr/udf/lua/llist.so'
: INFO (udf): (udf.c:954) lua error, ret:1
I could not find the relevant Lua files or a Lua installation in the container. My code works fine when I run it directly on the host. Is there some extra configuration that needs to be done for the container?
LDTs (Large Data Types, which llist belongs to) were dropped in Aerospike server 3.15.
https://www.aerospike.com/docs/guide/ldt_guide.html
Excerpt:
Aerospike has removed the Large Data Type feature as of server version 3.15 after deprecating this functionality 12 months earlier. Please see the removal notice and deprecation notice. The features listed below are no longer in Aerospike servers.
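If you are migrating off llist, the replacement on current servers is an ordinary list bin manipulated with the client's list operations (list_append and friends in the aerospike Python client). A minimal sketch; the namespace, set, and bin names are placeholders:

```python
def append_to_list(client, namespace, set_name, user_key, bin_name, value):
    # Ordinary list bins replace the removed llist LDT: build the record
    # key and append the value to the list stored in bin_name.
    key = (namespace, set_name, user_key)
    client.list_append(key, bin_name, value)
```

With a connected client this would be called as, e.g., append_to_list(client, "test", "demo", "user1", "test_list", 42).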