How to use Snakemake for memory management?

I am trying to force Snakemake to run the jobs of one rule sequentially, to avoid running out of memory.
rule run_eval_all:
    input:
        expand(config["out_model"] + "{iLogit}.rds", iLogit=MODELS)

rule eval_model:
    input:
        script = config["src_est"] + "evals/script.R",
        model = config["out_model"] + "{iLogit}.rds",
    output:
        "out/{iLogit}.rds"
    threads: 5
    resources:
        mem_mb = 100000
    shell:
        "{runR} {input.script} "
        "--out {output}"
I run the rule with snakemake --cores all --resources mem_mb=100000 run_eval_all, but I keep getting errors like this:
x86_64-conda-linux-gnu % snakemake --resources mem_mb=100000 run_eval_all
Traceback (most recent call last):
File "/local/home/zhakaida/mambaforge/envs/r_snake/bin/snakemake", line 10, in <module>
sys.exit(main())
File "/local/home/zhakaida/mambaforge/envs/r_snake/lib/python3.9/site-packages/snakemake/__init__.py", line 2401, in main
resources = parse_resources(args.resources)
File "/local/home/zhakaida/mambaforge/envs/r_snake/lib/python3.9/site-packages/snakemake/resources.py", line 85, in parse_resources
for res, val in resources_args.items():
AttributeError: 'list' object has no attribute 'items'
If I run snakemake --cores all run_eval_all, it works, but the jobs run in parallel (as expected) and sometimes exhaust memory and crash. What is the proper way to declare memory requirements in Snakemake?

The error comes from a known bug in parsing the --resources argument in Snakemake 6.5.1: https://github.com/snakemake/snakemake/issues/1069.
Update to Snakemake 6.5.3 or later and check whether the problem persists.
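
Once --resources is parsed correctly, the per-rule declaration together with the global cap is what serializes the jobs: each eval_model job claims mem_mb=100000 and the workflow is capped at --resources mem_mb=100000, so Snakemake schedules only one of them at a time, regardless of --cores. A minimal, self-contained sketch of that interplay (model names, paths, and the Rscript call are placeholders; the memory figures mirror the question):

# Hypothetical Snakefile; run with: snakemake --cores all --resources mem_mb=100000
MODELS = ["a", "b", "c"]                # placeholder model names

rule run_eval_all:
    input:
        expand("out/{iLogit}.rds", iLogit=MODELS)

rule eval_model:
    input:
        "models/{iLogit}.rds"           # placeholder input path
    output:
        "out/{iLogit}.rds"
    threads: 5
    resources:
        mem_mb=100000  # each job reserves the full budget, so at most one runs at a time
    shell:
        "Rscript eval.R {input} --out {output}"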

Related

Python script that used to work is now getting automatically killed in Ubuntu

I was once able to run the Python script below on my Ubuntu machine without the memory errors I was getting on Windows.
import pandas as pd
import numpy as np

# create a pandas dataframe for each input file
dfs1 = pd.read_csv('s1.csv', encoding='utf-8', names=list(range(0, 107)), dtype='string', na_filter=False)
dfs2 = pd.read_csv('s2.csv', encoding='utf-8', names=list(range(0, 107)), dtype='string', na_filter=False)
dfr = pd.read_csv('r.csv', encoding='utf-8', names=list(range(0, 107)), dtype='string', na_filter=False)

# combine them into one dataframe
dfs12r = pd.concat([dfs1, dfs2, dfr], ignore_index=True)  # without ignore_index the line numbers are not adjusted

# bag of words is coming
wordlist = []
for line in range(8052):
    for row in range(106):
        # print(line, row, dfs12r[row][line])
        if dfs12r[row][line] not in wordlist:
            wordlist.append(dfs12r[row][line])
wordlist.sort()
# print(wordlist)
print(len(wordlist))  # 12350

dfBOW = pd.DataFrame(np.zeros((len(dfs12r.index), len(wordlist))), dtype='int')

# create the dictionary
wordDict = dict.fromkeys(wordlist, 'default')
counter = 0
for word in wordlist:
    wordDict[word] = counter
    counter += 1
# print(wordDict)

# scan every word in dfs12r and +1 the respective cell in dfBOW
for line in range(8052):
    for row in range(107):
        dfBOW[wordDict[dfs12r[row][line]]][line] += 1
Unfortunately, probably after some automatic Ubuntu updates, I am now getting the bare message "Killed" when trying to run the script, without any further explanation.
Through simple print statements I know that the script is interrupted inside the final for loop.
I understand that I should be able to make the script more memory efficient, but I am also hoping for guidance on how to get Ubuntu to run the same script like it used to. (Through the top command I can see that all of my memory, including swap, is being used while inside this loop.)
Could paging have been disabled somehow by the updates? Any advice is welcome.
I still have 16 GB of RAM and use Ubuntu 20.04 (the specs are the same before and after the script stopped working). I dual boot from the same SSD.
Below is the error I am getting from the same script on Windows:
Traceback (most recent call last):
File "D:\sharedfiles\Organised\WorkSpace\ptixiaki\github\ptixiaki\code\makingthedata\2.1 Approach (Same as 2 but turning all words to lowercase)\2.1_CSVtoDataframe\CSVtoBOW.py", line 60, in <module>
dfBOW[wordDict[dfs12r[row][line]]][line]+=1
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\series.py", line 1143, in __setitem__
self._maybe_update_cacher()
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\series.py", line 1279, in _maybe_update_cacher
ref._maybe_cache_changed(cacher[0], self, inplace=inplace)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\frame.py", line 3950, in _maybe_cache_changed
self._mgr.iset(loc, arraylike, inplace=inplace)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\internals\managers.py", line 1141, in iset
blk.delete(blk_locs)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\internals\blocks.py", line 388, in delete
self.values = np.delete(self.values, loc, 0) # type: ignore[arg-type]
File "<__array_function__ internals>", line 5, in delete
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\numpy\lib\function_base.py", line 4555, in delete
new = arr[tuple(slobj)]
MemoryError: Unable to allocate 501. MiB for an array with shape (12234, 10736) and data type int32
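
For scale: the dense dfBOW frame is roughly 24,000 rows (3 × 8,052 lines) by about 12,350 integer columns, and the cell-by-cell += appears to make pandas copy whole blocks (the np.delete call in the Windows traceback), which is consistent with both the OOM kill on Ubuntu and the MemoryError on Windows. A hedged sketch of the same counting step that stores only non-zero counts, one Counter per line, instead of a dense matrix (paths and column count taken from the question; not a drop-in replacement for the rest of the script):

import pandas as pd
from collections import Counter

# Load and combine the three files exactly as in the question.
frames = [pd.read_csv(f, encoding='utf-8', names=list(range(107)),
                      dtype='string', na_filter=False)
          for f in ('s1.csv', 's2.csv', 'r.csv')]
dfs12r = pd.concat(frames, ignore_index=True)

# One Counter per line: only the words that actually occur in that line are
# stored, instead of a dense (n_lines x n_words) integer DataFrame.
bow_per_line = [Counter(row) for row in dfs12r.itertuples(index=False)]

# Example lookup: how many times a (hypothetical) word appears in line 0.
print(bow_per_line[0].get('some_word', 0))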

Snakemake: Data-dependent conditional execution of rules, IndexError

When executing the Snakemake pipeline below, I get an error: IndexError: list index out of range. I think it's because fastqc_pretrim is being executed for all SAMPLES. However, not all samples pass basecalling QC, so only some of them will have files to process here. I am trying to use checkpointing to handle this. Looking at the log, we can see it is trying to run fastqc_pretrim for sample "FAQ20773_pass_barcode01_68fda206_1". However, if you look above that line in the log, FAQ20773_fail_barcode03_68fda206_0 is actually the only sample that produced a .fastq.gz file. I'm not sure why the correct sample is not the one being run.
LOG:
snakemake --use-conda --jobs 1 -pr
['FAQ20773_fail_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_2', 'FAQ20773_fail_barcode03_68fda206_0', 'FAQ20773_fail_barcode02_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_1']
The flag 'directory' used in rule guppy_basecall_persample is only valid for outputs, not inputs.
Building DAG of jobs...
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode01_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_2
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode03_68fda206_0
['basecall/FAQ20773_fail_barcode03_68fda206_0/pass/fastq_runid_68fda20603fe08e9e2a4eef8718997203b603497_0_0.fastq.gz']
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode02_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_1
[]
Updating job all.
Using shell: /usr/bin/bash
[Thu Aug 26 13:13:51 2021]
rule fastqc_pretrim:
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log
jobid: 19
reason: Missing output files: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
wildcards: sample=FAQ20773_pass_barcode01_68fda206_1
resources: tmpdir=/tmp
/home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py
Activating conda environment: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4
Traceback (most recent call last):
File "/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py", line 41, in <module> shell( File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/shell.py", line 130, in __new__ cmd = format(cmd, *args, stepout=2, **kwargs) File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/utils.py", line 427, in format return fmt.format(_pattern, *args, **variables) File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 161, in format return self.vformat(format_string, args, kwargs)
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 165, in vformat result, _ = self._vformat(format_string, args, kwargs, used_args, 2)
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 205, in _vformat obj, arg_used = self.get_field(field_name, args, kwargs)
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 278, in get_field obj = obj[i]
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/io.py", line 1536, in __getitem__ return super().__getitem__(key)
IndexError: list index out of range
[Thu Aug 26 13:13:52 2021]
Error in rule fastqc_pretrim:
jobid: 19
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log (check log file(s) for error message)
conda-env: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4
RuleException:
CalledProcessError in line 60 of /mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile:
Command 'source /home/hvasquezgross/miniconda3/bin/activate '/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4'; /home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py' returned non-zero exit status 1.
File "/mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile", line 60, in __rule_fastqc_pretrim
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Snakefile:
import glob

configfile: "config.yaml"
inputdirectory = config["directory"]
SAMPLES, = glob_wildcards(inputdirectory + "/{sample}.fast5", followlinks=True)
print(SAMPLES)

wildcard_constraints:
    sample="\w+\d+_\w+_\w+\d+_.+_\d"

##### target rules #####
rule all:
    input:
        expand('basecall/{sample}/sequencing_summary.txt', sample=SAMPLES),
        "qc/multiqc.html"

rule make_indvidual_samplefiles:
    input:
        inputdirectory + "/{sample}.fast5",
    output:
        "lists/{sample}.txt",
    shell:
        "basename {input} > {output}"

checkpoint guppy_basecall_persample:
    input:
        directory=directory(inputdirectory),
        samplelist="lists/{sample}.txt",
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    params:
        config["basealgo"]
    shell:
        "guppy_basecaller -i {input.directory} --input_file_list {input.samplelist} -s {output.directory} -c {params} --compress_fastq -x \"auto\" --gpu_runners_per_device 3 --num_callers 2 --chunks_per_runner 200"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.guppy_basecall_persample.get(**wildcards).output[1]
    print(checkpoint_output)
    exparr = expand("basecall/{sample}/pass/{runid}.fastq.gz", sample=wildcards.sample,
                    runid=glob_wildcards(os.path.join(checkpoint_output, "pass/", "{runid}.fastq.gz")).runid)
    print(exparr)
    return exparr

rule fastqc_pretrim:
    input:
        aggregate_input
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip"  # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: ""
    log:
        "logs/fastqc_pretrim/{sample}.log"
    threads: 1
    wrapper:
        "0.77.0/bio/fastqc"

rule multiqc:
    input:
        #expand("basecall/{sample}.fastq.gz", sample=SAMPLES)
        expand("qc/fastqc_pretrim/{sample}_fastqc.zip", sample=SAMPLES)
    output:
        "qc/multiqc.html"
    params:
        ""  # Optional: extra parameters for multiqc.
    log:
        "logs/multiqc.log"
    wrapper:
        "0.77.0/bio/multiqc"
I think you are making things more complicated than necessary by using checkpoint and wrapper. This is what I would do, more or less:
rule guppy_basecall_persample:
    input:
        ...
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    shell:
        r"""
        guppy ...
        """

rule fastqc_pretrim:
    input:
        directory="basecall/{sample}/",
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip"
    shell:
        r"""
        fastqc {input.directory}/pass/*.fastq.gz
        """

SentencePiece in Google Colab

I want to use SentencePiece, from https://github.com/google/sentencepiece, in a Google Colab project where I am training an OpenNMT model. I'm a little confused about how to set up the SentencePiece binaries in Google Colab. Do I need to build them with cmake?
When I install it with pip install sentencepiece and include sentencepiece in the "transforms" in my config, I get the error below.
After running this command (taken from the OpenNMT translation tutorial):
!onmt_build_vocab -config en-sp.yaml -n_sample -1
I get:
Traceback (most recent call last):
File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 63, in main
build_vocab_main(opts)
File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 32, in build_vocab_main
transforms = make_transforms(opts, transforms_cls, fields)
File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/transform.py", line 176, in make_transforms
transform_obj.warm_up(vocabs)
File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/tokenize.py", line 110, in warm_up
load_src_model.Load(self.src_subword_model)
File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
Below is how my config is written. I'm not sure where the "not a string" is coming from.
## Where the samples will be written
save_data: en-sp/run/example
## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt
## Where the model will be saved
save_model: drive/MyDrive/Europarl/model/model
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    europarl:
        path_src: train_europarl-v7.es-en.es
        path_tgt: train_europarl-v7.es-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 1
    valid:
        path_src: dev_europarl-v7.es-en.es
        path_tgt: dev_europarl-v7.es-en.en
        transforms: [sentencepiece]
        skip_empty_level: silent

world_size: 1
gpu_ranks: [0]
...
EDIT: I Googled the issue some more and found a Google Colab project that builds sentencepiece with cmake: https://colab.research.google.com/github/mymusise/gpt2-quickly/blob/main/examples/gpt2_quickly.ipynb#scrollTo=dDAup5dxDXZW. However, even after building with cmake, I'm still getting this issue.
To fix the issue, I had to filter and tokenize my dataset and then train a SentencePiece model on it. I used the scripts from this helpful repository, https://github.com/ymoslem/MT-Preparation, to do everything, and now my model is training!
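
For context, the sentencepiece transform in OpenNMT-py loads a trained subword model at warm-up (the Load(self.src_subword_model) call in the traceback), so the "not a string" TypeError most likely means no subword model path was configured and LoadFromFile received None. A hedged sketch of training such a model with the sentencepiece Python package and pointing the config at it (file names and vocab_size are made up for illustration):

import sentencepiece as spm

# Train source- and target-side subword models on the (filtered) training data.
spm.SentencePieceTrainer.train(
    input='train_europarl-v7.es-en.es', model_prefix='source', vocab_size=50000)
spm.SentencePieceTrainer.train(
    input='train_europarl-v7.es-en.en', model_prefix='target', vocab_size=50000)

# Then reference the trained models in the OpenNMT-py YAML so the transform
# has something to load, e.g.:
#   src_subword_model: source.model
#   tgt_subword_model: target.model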

Some error/warning messages when running a TensorFlow implementation

When running a TensorFlow implementation, I got the following error/warning messages, which do not include the line of Python code that causes the issue. At the same time, the result is still generated. I am not sure what these messages indicate.
Exception ignored in: <bound method Session.__del__ of <tensorflow.python.client.session.Session object at 0x2b48ec89f748>>
Traceback (most recent call last):
File "/data/tfw/lib/python3.4/site- packages/tensorflow/python/client/session.py", line 140, in __del__
File "/data/tfw/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 137, in close
UnboundLocalError: local variable 'status' referenced before assignment
Today I also encountered this exception while running a multi-layer perceptron model on Windows 10 64-bit with Python 3.5 and TensorFlow 0.12.
I have seen this answer for the exception:
"It is induced by a different GC sequence: if Python collects the session first, the program exits successfully; if Python collects the SWIG memory (tf_session) first, the program exits with a failure."
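
One way to sidestep the garbage-collection ordering entirely is not to rely on Session.__del__ at interpreter shutdown, but to close the session explicitly (or use it as a context manager). A minimal sketch against the session API of that TensorFlow generation:

import tensorflow as tf

x = tf.constant(42)

# The context manager closes the session deterministically, so nothing is
# left for __del__ to clean up while the interpreter is shutting down.
with tf.Session() as sess:
    print(sess.run(x))

# Equivalent without the context manager:
sess = tf.Session()
print(sess.run(x))
sess.close()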

TensorFlow distributed master worker save fails silently; the checkpoint file isn't created but no exception is raised

In a distributed TensorFlow environment, the master worker fails to save a checkpoint.
saver.save returns OK (it does not raise an exception and returns the checkpoint file path), but the returned checkpoint file does not exist.
This does not match the description in the TensorFlow API documentation.
Why? How do I fix it?
=============
the related code is below:
def def_ps(self):
    self.saver = tf.train.Saver(max_to_keep=100, keep_checkpoint_every_n_hours=3)

def save(self, idx):
    ret = self.saver.save(self.sess, self.save_model_path, global_step=None, write_meta_graph=False)
    if not os.path.exists(ret):
        msg = "save model for %u path %s not exists." % (idx, ret)
        lg.error(msg)
        raise Exception(msg)
=============
the log is below:
2016-06-02 21:33:52,323 root ERROR save model for 2 path model_path/rl_model_2 not exists.
2016-06-02 21:33:52,323 root ERROR has error:save model for 2 path model_path/rl_model_2 not exists.
Traceback (most recent call last):
File "d_rl_main_model_dist_0.py", line 755, in run_worker
model_a.save(next_model_idx)
File "d_rl_main_model_dist_0.py", line 360, in save
Trainer.save(self,save_idx)
File "d_rl_main_model_dist_0.py", line 289, in save
raise Exception(msg);
Exception: save model for 2 path model_path/rl_model_2 not exists.
===========
This does not match the TensorFlow API documentation, which defines Saver.save as below:
https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#Saver
tf.train.Saver.save(sess, save_path, global_step=None, latest_filename=None, meta_graph_suffix='meta', write_meta_graph=True)
Returns:
A string: path at which the variables were saved. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created.
Raises:
TypeError: If sess is not a Session.
ValueError: If latest_filename contains path components.
The tf.train.Saver.save() method is a little... surprising when you run in distributed mode. The actual file is written by the process that holds the tf.Variable op, which is typically a process in "/job:ps" if you've used the example code to set things up. This means that you need to look in save_path on each of the remote machines that have variables to find the checkpoint files.
Why is this the case? The Saver API implicitly assumes that all processes have the same view of a shared file system, like an NFS mount, because that is the typical setup we use at Google. We've added support for Google Cloud Storage in the latest nightly versions of TensorFlow, and are investigating HDFS support as well.
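
So the os.path.exists(ret) check in the question runs on the worker, while the variable data may have been written on the machine hosting "/job:ps"; the check can fail even though the save succeeded. A hedged sketch of a save-and-verify step that assumes save_path sits on a filesystem shared by all tasks (e.g. an NFS mount); names and paths are illustrative:

import os
import tensorflow as tf

saver = tf.train.Saver(max_to_keep=100, keep_checkpoint_every_n_hours=3)

def save_and_check(sess, save_model_path):
    # save_model_path should live on storage every task can reach (shared NFS
    # mount); otherwise the data files only exist on the ps host.
    ret = saver.save(sess, save_model_path, write_meta_graph=False)
    # get_checkpoint_state reads the "checkpoint" index file written next to ret.
    state = tf.train.get_checkpoint_state(os.path.dirname(save_model_path))
    if state is None or not state.model_checkpoint_path:
        raise RuntimeError("checkpoint not visible at %s" % ret)
    return ret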