Change the directory name and fastq file name with snakemake

I would like to change the folder names and the names of the read files. I found something similar here (Move and rename files from multiple folders using snakemake):
workdir: "/path/to/workdir/"
import pandas as pd
from io import StringIO
sample_file = StringIO("""fastq sampleID
BOB_1234/fastq/BOB_1234.R1.fastq.gz TAG_1/fastq/TAG_1.R1.fastq.gz
BOB_1234/fastq/BOB_1234.R2.fastq.gz TAG_1/fastq/TAG_1.R2.fastq.gz
BOB_3421/fastq/BOB_3421.R1.fastq.gz TAG_2/fastq/TAG_2.R1.fastq.gz
BOB_3421/fastq/BOB_3421.R2.fastq.gz TAG_2/fastq/TAG_2.R2.fastq.gz""")
df = pd.read_table(sample_file, sep="\s+", header=0)
sampleID = df.sampleID
fastq = df.fastq
rule all:
    input:
        expand("{sample}", sample=df.sampleID)
rule move_and_rename_fastqs:
    input: fastq = lambda w: df[df.sampleID == w.sample].fastq.tolist()
    output: "{sample}"
    shell:
        """echo mv {input.fastq} {output}"""
I get an error:
MissingOutputException in line 19 of snakefile:
Job Missing files after 5 seconds:
TAG_1/fastq/TAG_1.R1.fastq.gz
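A likely reason for the MissingOutputException, judging only from the rule above: the shell command is `echo mv ...`, which prints the move command instead of running it, so the declared output never appears. Also, `mv` does not create the target directory (TAG_1/fastq/) by itself. The sketch below applies the same rename table in plain Python rather than in a Snakemake rule; the helper name `move_and_rename` and its `root` argument are hypothetical and not part of the question.

```python
import shutil
from pathlib import Path

# Old path -> new path, copied from the sample table in the question.
RENAMES = {
    "BOB_1234/fastq/BOB_1234.R1.fastq.gz": "TAG_1/fastq/TAG_1.R1.fastq.gz",
    "BOB_1234/fastq/BOB_1234.R2.fastq.gz": "TAG_1/fastq/TAG_1.R2.fastq.gz",
    "BOB_3421/fastq/BOB_3421.R1.fastq.gz": "TAG_2/fastq/TAG_2.R1.fastq.gz",
    "BOB_3421/fastq/BOB_3421.R2.fastq.gz": "TAG_2/fastq/TAG_2.R2.fastq.gz",
}


def move_and_rename(root, renames=RENAMES):
    """Move every existing source under `root` to its new name (hypothetical helper)."""
    root = Path(root)
    moved = []
    for old, new in renames.items():
        src, dst = root / old, root / new
        if not src.exists():
            continue  # skip sources that are not present
        # Create the target folder first; a bare move would fail without it.
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        moved.append(new)
    return moved
```

Inside the Snakefile, the equivalent change would presumably be to drop the `echo` and create the output directory before the `mv`, but that has not been tested against the original workflow.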

Related

how to add "wildcard-specific wildcard" via snakemake checkpoint

I can't find the right words to describe my need, so please see the code.
My ideal workflow would do this:
I know which categories will be created during the workflow (cates).
I don't know how many files, or which files, will be created in each category (files).
For each category, rule create_file runs first.
Once the checkpoint has been triggered, I know which files have been created for each category.
Then, for each file created by the create_file rule, a mock rule append_to_file_name takes the file as input and performs the operation.
The files produced by create_file are specific to the cate wildcard, so I call my need a "wildcard-specific" wildcard.
cates=["A", "B"]
# pretend that you don't know about the files about to be created
files={
    "A": ["a.txt", "b.txt"],
    "B": ["c.txt", "d.txt"]
}
def get_append_to_file_output(wildcards):
    files = glob_wildcards(f"{wildcards.cate}/{{file}}.txt").sample
    appened = expand(f"{wildcards.cate}/{{file}}_append.txt", file = files)
    return appened
rule all:
    input:
        get_append_to_file_output,
checkpoint create_file:
    output: ddd=directory("{cate}"),
    run:
        from pathlib import Path
        Path(output.ddd).mkdir(parents=True, exist_ok=True)
        for file in files[wildcards.cate]:
            Path(file).touch()
rule append_to_file_name:
    input: ddd="{cate}",
    output: "{cate}/{file}_append.txt",
    run:
        from pathlib import Path
        Path(output[0]).touch()
cates=["A", "B"]
# pretend that you don't know about the files about to be created
files={
    "A": ["a", "b"],
    "B": ["c", "d"]
}
def get_append_to_file_output(wildcards):
    cate_dir = checkpoints.create_file.get(**wildcards).output.ddd
    files = glob_wildcards(f"{cate_dir}/{{file}}.txt").file
    print(files)
    appened = expand(f"results/append_filename/{wildcards.cate}/{{file}}_append.txt", file = files)
    return appened
rule all:
    input:
        expand("results/flags/{cate}_aggregate.flag", cate=cates)
checkpoint create_file:
    output:
        ddd=directory("results/create_file/{cate}"),
    run:
        from pathlib import Path
        Path(output.ddd).mkdir(parents=True, exist_ok=True)
        for file in files[wildcards.cate]:
            Path(f"{output.ddd}/{file}.txt").touch()
rule append_to_filename:
    input:
        ddd=rules.create_file.output.ddd
    output:
        appended="results/append_filename/{cate}/{file}_append.txt"
    shell:
        "cp {input.ddd}/{wildcards.file}.txt {output.appended}"
rule fake_aggregate:
    input:
        get_append_to_file_output
    output:
        touch("results/flags/{cate}_aggregate.flag")
I've solved this by adding a fake aggregating rule
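To make the role of the input function easier to see, here is a plain-Python stand-in for what get_append_to_file_output computes once the checkpoint directory exists. It does not use Snakemake at all; `appended_targets` and the `out_root` default are hypothetical names chosen for this sketch, and `Path.glob` stands in for glob_wildcards.

```python
from pathlib import Path


def appended_targets(cate_dir, cate, out_root="results/append_filename"):
    """List the *_append.txt targets for one category (hypothetical helper)."""
    # Only look at what is on disk, which is why this must run after the
    # step that creates the files (the checkpoint in the Snakefile).
    stems = sorted(p.stem for p in Path(cate_dir).glob("*.txt"))
    return [f"{out_root}/{cate}/{stem}_append.txt" for stem in stems]
```

The per-category flag file in the workflow above exists only to give this list a rule with a `{cate}` wildcard to hang on.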

Import multiple files in pandas

I am trying to import multiple files in pandas. I have created 3 files in the folder:
['File1.xlsx', 'File2.xlsx', 'File3.xlsx'], as returned by files = os.listdir(cwd)
import os
import pandas as pd
cwd = os.path.abspath(r'C:\Users\abc\OneDrive\Import Multiple files')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file), ignore_index=True)
df.head()
# df.to_excel('total_sales.xlsx')
print (files)
Upon running the code, I get the following error, even though the file does exist in the folder:
FileNotFoundError: [Errno 2] No such file or directory: 'File1.xlsx'
Ideally, I want code where I define the files in a list and then read them in a loop using the folder path and that list.
I think the following should work
import os
import pandas as pd
cwd = os.path.abspath(r'C:\Users\abc\OneDrive\Import Multiple files')
paths = [os.path.join(cwd,path) for path in os.listdir(cwd) if path.endswith('.xlsx')]
df = pd.concat((pd.read_excel(path) for path in paths), ignore_index=True)
df.head()
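The idea is to get a list of full paths, then read them all in and concatenate them into a single dataframe on the next line. os.listdir returns bare file names, so pd.read_excel(file) looks for them relative to the current working directory rather than in the folder that was listed, hence the FileNotFoundError.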
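For the variant the question asks about (an explicit list of file names plus a folder path), a minimal sketch could look like this. The helper name `build_paths` and its `ext` parameter are hypothetical; only the standard library is used, so the pandas call itself is left to the caller.

```python
import os


def build_paths(folder, names, ext=".xlsx"):
    """Join each listed name onto `folder`, keeping only names ending in `ext`."""
    return [os.path.join(folder, name) for name in names if name.endswith(ext)]


# Explicit list of files, as requested in the question.
file_list = ["File1.xlsx", "File2.xlsx", "File3.xlsx"]
paths = build_paths(r"C:\Users\abc\OneDrive\Import Multiple files", file_list)
```

The resulting `paths` can then be passed to pd.read_excel one at a time and combined with pd.concat, as in the answer above.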

How are nested checkpoints resolved in snakemake?

I need to use nested checkpoints in snakemake, because for every dynamically created file I have to create further dynamic files. So far, I have been unable to resolve the two checkpoints properly. Below is a minimal toy example.
It seems that the second checkpoint is not even executed until the first checkpoint has been resolved, so a single aggregate rule won't work.
I don't know how to invoke the two checkpoints and resolve the wildcards.
import os.path
import glob
rule all:
    input:
        'collect/all_done.txt'
#generate a number of files
checkpoint create_files:
    output:
        directory('files')
    run:
        import random
        r = random.randint(1,10)
        for x in range(r):
            output_dir = output[0] + '/' + str(x+1)
            import os
            if not os.path.isdir(output_dir):
                os.makedirs(output_dir, exist_ok=True)
            output_file=output_dir + '/test.txt'
            print(output_file)
            with open(output_file, 'w') as f:
                f.write(str(x+1))
checkpoint create_other_files:
    input: 'files/{i}/test.txt'
    output: directory('other_files/{i}/')
    shell:
        '''
        L=$(( $RANDOM % 10))
        for j in $(seq 1 $L);
        do
            mkdir -p {output}/{j}
            cp -f {input} {output}/$j/test2.txt
        done
        '''
def aggregate(wildcards):
    i_wildcard = checkpoints.create_files.get(**wildcards).output[0]
    print('in_def_aggregate')
    print(i_wildcard)
    j_wildcard = checkpoints.create_other_files.get(**wildcards).output[0]
    print(j_wildcard)
    split_files = expand('other_files/{i}/{j}/test2.txt',
        i =glob_wildcards(os.path.join(i_wildcard, '{i}/test.txt')).i,
        j = glob_wildcards(os.path.join(j_wildcard, '{j}/test2.txt')).j
    )
    return split_files
#non-sense collect function
rule collect:
    input: aggregate
    output: touch('collect/all_done.txt')
Currently, I get the following error from snakemake:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 collect
1 create_files
3
[Thu Nov 14 14:45:01 2019]
checkpoint create_files:
output: files
jobid: 2
Downstream jobs will be updated after completion.
Job counts:
count jobs
1 create_files
1
files/1/test.txt
files/2/test.txt
files/3/test.txt
files/4/test.txt
files/5/test.txt
files/6/test.txt
files/7/test.txt
files/8/test.txt
files/9/test.txt
files/10/test.txt
Updating job 1.
in_def_aggregate
files
[Thu Nov 14 14:45:02 2019]
Error in rule create_files:
jobid: 2
output: files
InputFunctionException in line 53 of /TL/stat_learn/work/feldmann/Phd/Projects/HIVImmunoAdapt/HIVIA/playground/Snakefile2:
WorkflowError: Missing wildcard values for i
Wildcards:
Removing output files of failed job create_files since they might be corrupted:
files
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I am interested in having the files /other_files/{checkpoint_1_wildcard}/{checkpoint_2_wildcard}/test2.txt
I am not entirely sure what you were trying to do, so I rewrote quite a bit of it. Does this clarify the problem?
import glob
import random
from pathlib import Path
rule all:
    input:
        'collect/all_done.txt'
checkpoint first:
    output:
        directory('first')
    run:
        for i in range(random.randint(1,10)):
            Path(f"{output[0]}/{i}").mkdir(parents=True, exist_ok=True)
            Path(f"{output[0]}/{i}/test.txt").touch()
checkpoint second:
    input:
        'first/{i}/test.txt'
    output:
        directory('second/{i}')
    run:
        for j in range(random.randint(1,10)):
            Path(f"{output[0]}/{j}").mkdir(parents=True, exist_ok=True)
            Path(f"{output[0]}/{j}/test2.txt").touch()
rule copy:
    input:
        'second/{i}/{j}/test2.txt'
    output:
        'copy/{i}/{j}/test2.txt'
    shell:
        """
        cp -f {input} {output}
        """
def aggregate(wildcards):
    outputs_i = glob.glob(f"{checkpoints.first.get().output}/*/")
    outputs_i = [output.split('/')[-2] for output in outputs_i]
    split_files = []
    for i in outputs_i:
        outputs_j = glob.glob(f"{checkpoints.second.get(i=i).output}/*/")
        outputs_j = [output.split('/')[-2] for output in outputs_j]
        for j in outputs_j:
            split_files.append(f"copy/{i}/{j}/test2.txt")
    return split_files
rule collect:
    input:
        aggregate
    output:
        touch('collect/all_done.txt')
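The key difference from the question's version is the order of resolution: the rule collect has no wildcards, so calling the second checkpoint with **wildcards cannot supply a value for i (hence "Missing wildcard values for i"). The rewritten aggregate asks the first checkpoint for all values of i and only then queries the second checkpoint once per i. The following plain-Python sketch shows just that control flow; `list_first` and `list_second` are hypothetical callables standing in for the two checkpoint lookups, not Snakemake API.

```python
def aggregate_nested(list_first, list_second):
    """Build copy/{i}/{j}/test2.txt targets from two dependent lookups."""
    targets = []
    for i in list_first():        # outer level must be known first
        for j in list_second(i):  # inner level is looked up once per i
            targets.append(f"copy/{i}/{j}/test2.txt")
    return targets


# Toy stand-ins: two outer entries with a different number of inner entries.
inner = {"0": ["0"], "1": ["0", "1"]}
example = aggregate_nested(lambda: ["0", "1"], lambda i: inner[i])
```

In the Snakefile, each `checkpoints.second.get(i=i)` call raises an internal exception until that checkpoint has run, so Snakemake re-evaluates the function as more of the outputs become available.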

Snakemake read input from file

I am trying to use a file that is written during the run as input to another rule, but it always gives me the error FileNotFoundError: [Errno 2] No such file or directory:
Is there a way to fix this, or another implementation with the same logic?
def vc_list(wildcards):
    my_list = []
    with open(wildcards.mydir+"/file_B.txt", 'r') as data_in:
        for line in data_in:
            my_list.append(line.strip())
    return(my_list)
# rule A will process file_A.txt and give me file_B.txt
rule A:
    input: "{mydir}/file_A.txt"
    output: "{mydir}/file_B.txt"
    shell: "seq 1 5 > {output}" # assume that `seq 1 5` is the output from processing the file
rule B:
    input: "{value}"
    output: "{value}.vc"
    shell: "pythoncode.py {input} {output}"
# rule C will process file_B.txt to give me a list of values that will be used to expand the input, then will use rule B to produce them
rule C:
    input:
        processed_file = rules.A.output, #"{mydir}/file_B.txt",
        my_list = lambda wildcards: expand("{mydir}/{value}.vc", mydir=wildcards.mydir, value=vc_list(wildcards))
    output: "{mydir}/done.txt"
    shell: "touch {output}"
#I always have the error that "{mydir}/file_B.txt" does not exist
The error now:
test_loop.snakefile:
FileNotFoundError: [Errno 2] No such file or directory: 'read_file/file_B.txt'
Wildcards:
mydir=read_file
Thanks,
The answer to my question is to use a checkpoint, since dynamic will be deprecated.
Here is how the logic should be changed:
rule:
    input: 'done.txt'
checkpoint A:
    output: 'B.txt'
    shell: 'seq 1 2 > {output}'
rule N:
    input: "genome.fa"
    output: '{num}.bam'
    shell: "touch {output}"
rule B:
    input: '{num}.bam'
    output: '{num}.vc'
    shell: "touch {output}"
def aggregate_input(wildcards):
    with open(checkpoints.A.get(**wildcards).output[0], 'r') as f:
        return [num.rstrip() + '.vc' for num in f]
rule C:
    input: aggregate_input
    output: touch('done.txt')
Credit goes to Eric Lim
Your script fails even before the workflow starts, during the pipeline construction phase.
So there is nothing surprising about rules A and B: Snakemake reads their input and output sections and finds no problem with them. Then it starts reading rule C, where the input section calls the vc_list() function, which in turn tries to read the file 'read_file/file_B.txt' before the workflow has even started! Of course it doesn't find the file, and it produces the error.
As for what to do, you need to clarify the task first. Most probably you are trying to use dynamic information in a rule's input. In this case you need to use dynamic files or checkpoints.
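The timing problem can be reproduced without Snakemake. In this sketch, `read_values` plays the role of vc_list(), and the write in the middle stands in for rule A; the function names and the temporary directory are assumptions made for illustration only.

```python
import os
import tempfile


def read_values(path):
    """Read one value per line; raises FileNotFoundError if `path` is missing."""
    with open(path) as handle:
        return [line.strip() for line in handle]


def demo():
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "file_B.txt")
    # "Construction" time: the producing step has not run yet, so reading fails.
    try:
        read_values(path)
        early = "ok"
    except FileNotFoundError:
        early = "missing"
    # Stand-in for rule A (seq 1 5 > file_B.txt).
    with open(path, "w") as handle:
        handle.write("\n".join(str(n) for n in range(1, 6)) + "\n")
    # After the producer has run, the same call succeeds.
    return early, read_values(path)
```

A checkpoint gives Snakemake the same "read only after the producer has run" ordering, because the input function is re-evaluated once the checkpoint's output exists.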

snakemake rules: Passing on variables outside of the file name

So far I have used snakemake to generate individual plots. This has worked great! Now, though, I want to create a rule that creates a combined plot across the topics, without explicitly putting the topic name in the plot's file name. See the combined_plot rule below.
topics=["soccer", "football"]
params=[1, 2, 3, 4]
rule all:
    input:
        expand("plot_p={param}_{topic}.png", topic=topics, param=params),
        expand("combined_p={param}_plot.png", param=params),
rule plot:
    input:
        "data_p={param}_{topic}.csv"
    output:
        "plot_p={param}_{topic}.png"
    shell:
        "plot.py --input={input} --output={output}"
rule combined_plot:
    input:
        # all data_p={param}_{topic}.csv files
    output:
        "combined_p={param}_plot.png"
    shell:
        "plot2.py " + # one "--input=" and one "--output" for each csv file
Is there a simple way to do this with snakemake?
If I understand correctly, the code below should be more straightforward as it replaces the lambda and the glob with the expand function. It will execute the two commands:
plot2.py --input=data_p=1_soccer.csv --input=data_p=1_football.csv --output combined_p=1_plot.png
plot2.py --input=data_p=2_soccer.csv --input=data_p=2_football.csv --output combined_p=2_plot.png
topics=["soccer", "football"]
params=[1, 2]
rule all:
    input:
        expand("combined_p={param}_plot.png", param=params),
rule combined_plot:
    input:
        csv= expand("data_p={{param}}_{topic}.csv", topic= topics)
    output:
        "combined_p={param}_plot.png",
    run:
        inputs= ['--input=' + x for x in input.csv]
        shell("plot2.py {inputs} --output {output}")
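In the rule above, the doubled braces in {{param}} keep param as a wildcard while expand fills in only topic. The command string that results for one value of param can be sketched in plain Python as follows; `combined_plot_command` is a hypothetical helper, not part of Snakemake or of plot2.py.

```python
def combined_plot_command(param, topics):
    """Assemble the plot2.py call for one param value (hypothetical helper)."""
    # One --input=... argument per topic, mirroring the expand() over topics.
    inputs = [f"--input=data_p={param}_{topic}.csv" for topic in topics]
    output = f"combined_p={param}_plot.png"
    return " ".join(["plot2.py", *inputs, "--output", output])


print(combined_plot_command(1, ["soccer", "football"]))
```

For param=1 this prints the first of the two commands listed above.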
I got a working version by using a lambda function (taking the wildcards as 'wcs') as input (see here), and by using run instead of shell. In the run section I could first define a variable before executing the command with shell(...).
Instead of referring to the files with glob I could also have directly used the topics in the lambda function.
If anyone with more experience sees this, please tell me if this is the "right" way to do it.
from glob import glob
topics=["soccer", "football"]
params=[1, 2]
rule all:
    input:
        expand("plot_p={param}_{topic}.png", topic=topics, param=params),
        expand("combined_p={param}_plot.png", param=params),
rule plot:
    input:
        "data_p={param}_{topic}.csv"
    output:
        "plot_p={param}_{topic}.png"
    shell:
        "echo plot.py {input} {output}"
rule combined_plot:
    input:
        lambda wcs: glob("data_p={param}_*.csv".format(**wcs))
    output:
        "combined_p={param}_plot.png"
    run:
        inputs=" ".join(["--input " + inp for inp in input])
        shell("echo plot2.py {inputs}")