I am trying to process bulk RNA-seq data using salmon through snakemake in a conda/mamba environment.
I am receiving the following error when running snakemake:
(snakemake) pratik@pratik:~/Desktop/ra-fls$ snakemake --cores
Building DAG of jobs...
MissingInputException in line 75 of /home/pratik/Desktop/ra-fls/Snakefile:
Missing input files for rule salmon_quant:
fastq/SRR3350597_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_1.fastq.gz
This is my Snakefile:
DATASETS = ["SRR3350543_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq",
"SRR3350544_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq",
"SRR3350545_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq",
"SRR3350546_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq",
"SRR3350547_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq",
"SRR3350548_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq",
"SRR3350549_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq",
"SRR3350550_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq",
"SRR3350551_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq",
"SRR3350552_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq",
"SRR3350553_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq",
"SRR3350554_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq",
"SRR3350555_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq",
"SRR3350556_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq",
"SRR3350557_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq",
"SRR3350558_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq",
"SRR3350559_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq",
"SRR3350561_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq",
"SRR3350562_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq",
"SRR3350563_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq",
"SRR3350564_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq",
"SRR3350565_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq",
"SRR3350566_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq",
"SRR3350567_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq",
"SRR3350568_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq",
"SRR3350569_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq",
"SRR3350570_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq",
"SRR3350571_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq",
"SRR3350572_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq",
"SRR3350573_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq",
"SRR3350574_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq",
"SRR3350575_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq",
"SRR3350576_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq",
"SRR3350577_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq",
"SRR3350578_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq",
"SRR3350579_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq",
"SRR3350580_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq",
"SRR3350581_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq",
"SRR3350582_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq",
"SRR3350583_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq",
"SRR3350584_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq",
"SRR3350585_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq",
"SRR3350586_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq",
"SRR3350587_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq",
"SRR3350588_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq",
"SRR3350589_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq",
"SRR3350590_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq",
"SRR3350591_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq",
"SRR3350592_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq",
"SRR3350593_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq",
"SRR3350595_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq",
"SRR3350596_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq",
"SRR3350597_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq",
"SRR3350598_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq",
"SRR3350599_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq",
"SRR3350600_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq",
"SRR3350601_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq",
"SRR3350602_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq",
"SRR3350603_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq",
"SRR3350604_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq",
"SRR3350605_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq",
"SRR3350606_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq",
"SRR3350607_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq",
"SRR3350608_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq",
"SRR3350609_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq",
"SRR3350610_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq",
"SRR3350611_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq",
"SRR3350612_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq"]
SALMON = "/home/pratik/anaconda3/envs/salmon/bin/salmon"
rule all:
    input: expand("quants/{dataset}/quant.sf", dataset=DATASETS)
rule salmon_quant:
    input:
        r1 = "fastq/{sample}_1.fastq.gz",
        r2 = "fastq/{sample}_2.fastq.gz",
        index = "gencode.v38_salmon_1.5.0"
    output:
        "quants/{sample}/quant.sf"
    params:
        dir = "quants/{sample}"
    shell:
        "{SALMON} quant -i {input.index} -l A -p28 --validateMappings \
        --gcBias -o {params.dir} \
        -1 {input.r1} -2 {input.r2}"
I have tried changing the file paths for the r1 and r2 inputs. However, I think I am either missing something or have something extra.
Here is my ls output. The fastq folder contains all of the fastq.gz files, the transcriptome index is in the gencode.v38_salmon_1.5.0 folder, and the quants folder is empty:
(snakemake) pratik@pratik:~/Desktop/ra-fls$ ls
fastq gencode.v38.transcripts.fa.gz Snakefile
gencode.v38_salmon_1.5.0 quants sra_explorer_fastq_aspera_download.sh
Here is the fastq folder:
(snakemake) pratik@pratik:~/Desktop/ra-fls/fastq$ ls
SRR3350543_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350543_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350544_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350544_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350545_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350545_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350546_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350546_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350547_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350547_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350548_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350548_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350549_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350549_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350550_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350550_GSM2112323_RA_knee_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350551_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350551_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350552_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350552_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350553_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350553_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350554_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350554_GSM2112324_RA_knee_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350555_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350555_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350556_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350556_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350557_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350557_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350558_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350558_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350559_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350559_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350561_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350561_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350562_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350562_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350563_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350563_GSM2112325_RA_knee_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350564_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350564_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350565_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350565_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350566_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350566_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350567_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350567_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350568_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350568_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350569_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350569_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350570_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350570_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350571_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350571_GSM2112326_RA_knee_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350572_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350572_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350573_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350573_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350574_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350574_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350575_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350575_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350576_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350576_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350577_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350577_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350578_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350578_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350579_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350579_GSM2112327_RA_knee_5_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350580_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350580_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350581_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350581_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350582_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350582_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350583_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350583_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350584_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350584_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350585_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350585_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350586_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350586_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350587_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350587_GSM2112328_RA_hip_1_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350588_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350588_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350589_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350589_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350590_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350590_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350591_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350591_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350592_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350592_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350593_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350593_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350595_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350595_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350596_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350596_GSM2112329_RA_hip_2_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350597_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350598_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350598_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350599_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350599_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350600_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350600_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350601_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350601_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350602_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350602_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350603_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350603_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350604_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350604_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350605_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350605_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350606_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350606_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350607_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350607_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350608_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350608_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350609_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350609_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350610_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350610_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350611_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350611_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR3350612_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR3350612_GSM2112331_RA_hip_4_Homo_sapiens_RNA-Seq_2.fastq.gz
I think the Snakefile is OK; fastq/SRR3350597_GSM2112330_RA_hip_3_Homo_sapiens_RNA-Seq_1.fastq.gz is simply missing. Look at your ls output for the fastq folder: SRR3350597 has only the _2.fastq.gz file, so Snakemake cannot find the _1 input for that sample.
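A quick way to catch this kind of problem before Snakemake builds the DAG is to check every expected input path yourself. The following is only a minimal sketch: the helper name missing_fastqs is hypothetical, and it assumes the same fastq/{sample}_1.fastq.gz and fastq/{sample}_2.fastq.gz naming that the salmon_quant rule uses.

```python
from pathlib import Path


def missing_fastqs(samples, fastq_dir="fastq", mates=("1", "2")):
    """Return the expected paired-end FASTQ paths that do not exist on disk.

    Hypothetical helper; assumes files are named <sample>_<mate>.fastq.gz.
    """
    missing = []
    for sample in samples:
        for mate in mates:
            path = Path(fastq_dir) / f"{sample}_{mate}.fastq.gz"
            if not path.is_file():
                missing.append(str(path))
    return missing
```

Since a Snakefile is Python, a call such as print(missing_fastqs(DATASETS)) placed after the DATASETS list should list every file that would trigger a MissingInputException.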
I am processing files with Nextflow. Each file has a sample ID, and I would like to carry this sample ID across processes, so I'm using tuples. The relevant snippet of the code is here:
process 'rsem_quant' {
    input:
    val genome from params.genome
    tuple val(sampleId), file(read1), file(read2) from samples_ch
    output:
    tuple sampleId , path "${sampleId}.genes.results" into rsem_ce
    script:
    """
    module load RSEM
    rsem-calculate-expression --star --keep-intermediate-files \
    --sort-bam-by-coordinate --star-output-genome-bam --strandedness reverse \
    --star-gzipped-read-file --paired-end $genome \
    $read1 $read2 $sampleId
    """
}
The problem is that when using a tuple as an output, I get the following error:
No such variable: sampleId
If I remove the tuple and just output either part (sampleId or the path), it works fine. Any help is appreciated.
I was unable to reproduce the error with the code supplied. I suspect your output block needs to declare the 'sampleId' variable with the val qualifier (and wrap the path in parentheses):
output:
    tuple val(sampleId), path("${sampleId}.genes.results") into rsem_ce
A minimal example to run RSEM on paired-end reads (using Conda) might look like:
nextflow.enable.dsl=2
params.ref_name = 'GRCh38_GENCODE_v31'
params.ref_fasta = 'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh38.primary_assembly.genome.fa.gz'
params.ref_gtf = 'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.primary_assembly.annotation.gtf.gz'
params.strandedness = 'reverse'
include { gunzip as gunzip_fasta } from './gzip.nf'
include { gunzip as gunzip_gtf } from './gzip.nf'
process 'rsem_prepare_ref' {
    conda 'rsem star samtools'
    input:
    val ref_name
    path ref_fasta
    path ref_gtf
    output:
    path "${ref_name}"
    """
    mkdir "${ref_name}"
    rsem-prepare-reference \\
        --gtf "${ref_gtf}" \\
        --star \\
        "${ref_fasta}" \\
        "${ref_name}/${ref_name}"
    """
}
process 'rsem_calculate_expression' {
    tag { sample }
    conda 'rsem star samtools'
    input:
    tuple val(sample), path(reads)
    path ref_name
    output:
    tuple val(sample), path("${sample}.genes.results")
    script:
    def (read1, read2) = reads
    """
    rsem-calculate-expression \\
        --star \\
        --sort-bam-by-coordinate \\
        --star-output-genome-bam \\
        --strandedness "${params.strandedness}" \\
        --star-gzipped-read-file \\
        --paired-end \\
        "${read1}" \\
        "${read2}" \\
        "${ref_name}/${ref_name}" \\
        "${sample}"
    """
}
workflow {
    reads = Channel.fromFilePairs( './data/*_{1,2}.fastq.gz' )
    ref_fasta = gunzip_fasta( params.ref_fasta )
    ref_gtf = gunzip_gtf( params.ref_gtf )
    rsem_prepare_ref( params.ref_name, ref_fasta, ref_gtf )
    rsem_calculate_expression( reads, rsem_prepare_ref.out )
}
Contents of gzip.nf:
process gunzip {
    tag { gzfile.name }
    input:
    path gzfile
    output:
    path "${gzfile.getBaseName()}"
    when:
    gzfile.getExtension() == "gz"
    """
    gzip -dc "${gzfile}" > "${gzfile.getBaseName()}"
    """
}
Run using:
nextflow run test.nf -resume -ansi-log false
Results:
N E X T F L O W ~ version 21.04.3
Launching `main.nf` [awesome_poincare] - revision: 51040c89cc
[cf/ffec1a] Cached process > gunzip_fasta (GRCh38.primary_assembly.genome.fa.gz)
[ce/b7a04b] Cached process > gunzip_gtf (gencode.v38.primary_assembly.annotation.gtf.gz)
[f1/bcb8e3] Cached process > rsem_prepare_ref
[de/f7906e] Submitted process > rsem_calculate_expression (HBR_Rep2)
[1e/3984da] Submitted process > rsem_calculate_expression (UHR_Rep1)
[59/907f56] Submitted process > rsem_calculate_expression (UHR_Rep3)
[26/41db23] Submitted process > rsem_calculate_expression (HBR_Rep1)
[e8/2c98fe] Submitted process > rsem_calculate_expression (UHR_Rep2)
[03/bbb42b] Submitted process > rsem_calculate_expression (HBR_Rep3)
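The sample names in the log (HBR_Rep1, UHR_Rep2, and so on) come from the way Channel.fromFilePairs groups files: each emitted item is a tuple of a grouping key (the part matched by the wildcard) and the list of matching files, which is why the process input is declared as tuple val(sample), path(reads). The grouping can be sketched roughly in Python as below. This is an illustration only, not Nextflow's implementation: the function name, the regular expression, and the choice to drop incomplete pairs are assumptions of the sketch.

```python
import re
from collections import defaultdict


def group_read_pairs(filenames, pattern=r"(.+)_([12])\.fastq\.gz$"):
    """Group paired-end FASTQ names into {sample: [read1, read2]}.

    Hypothetical helper mimicking a '*_{1,2}.fastq.gz' glob.
    """
    pairs = defaultdict(dict)
    for name in filenames:
        match = re.match(pattern, name)
        if match:
            sample, mate = match.groups()
            pairs[sample][mate] = name
    # Keep only complete pairs, ordered read 1 then read 2 (sketch assumption).
    return {
        sample: [mates["1"], mates["2"]]
        for sample, mates in sorted(pairs.items())
        if "1" in mates and "2" in mates
    }
```

For example, the names HBR_Rep1_1.fastq.gz and HBR_Rep1_2.fastq.gz (hypothetical file names inferred from the log tags) would be grouped under the key HBR_Rep1.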
I am new to Nextflow, and here is a practice script that I wanted to test before using it for a real job.
#!/usr/bin/env nextflow
params.cns = '/data1/deliver/phase2/CNVkit/*.cns'
cns_ch = Channel.fromPath(params.cns)
cns_ch.view()
The output of this script is:
N E X T F L O W ~ version 21.04.0
Launching `cnvkit_call.nf` [festering_wescoff] - revision: 886ab3cf13
/data1/deliver/phase2/CNVkit/002-002_L4_sorted_dedup.cns
/data1/deliver/phase2/CNVkit/015-002_L4.SSHT89_sorted_dedup.cns
/data1/deliver/phase2/CNVkit/004-005_L1_sorted_dedup.cns
/data1/deliver/phase2/CNVkit/018-008_L1.SSHT31_sorted_dedup.cns
/data1/deliver/phase2/CNVkit/003-002_L3_sorted_dedup.cns
/data1/deliver/phase2/CNVkit/002-004_L6_sorted_dedup.cns
Here 002-002, 015-002, 004-005, etc. are sample IDs. I am trying to write a simple process that outputs a file such as ${sample.id}_sorted_dedup.calls.cns, but I am not sure how to extract these IDs and use them in the output name.
process cnvcalls {
    input:
    file(cns_file) from cns_ch
    output:
    file("${sample.id}_sorted_dedup.calls.cns") into cnscalls_ch
    script:
    """
    cnvkit.py call ${cns_file} -o ${sample.id}_sorted_dedup.calls.cns
    """
}
How can I revise the process cnvcalls to make it work with sample.id?
There are lots of ways to extract the sample names/IDs from filenames. One way could be to split on the underscore and take the first element:
params.cns = '/data1/deliver/phase2/CNVkit/*.cns'
cns_ch = Channel.fromPath(params.cns)
process cnvcalls {
    input:
    path(cns_file) from cns_ch
    output:
    path("${sample_id}_sorted_dedup.calls.cns") into cnscalls_ch
    script:
    sample_id = cns_file.name.split('_')[0]
    """
    cnvkit.py call "${cns_file}" -o "${sample_id}_sorted_dedup.calls.cns"
    """
}
Though, my preference would be to input the sample name/id alongside the input file using a tuple:
params.cns = '/data1/deliver/phase2/CNVkit/*.cns'
cns_ch = Channel.fromPath(params.cns).map {
    tuple( it.name.split('_')[0], it )
}
process cnvcalls {
    input:
    tuple val(sample_id), path(cns_file) from cns_ch
    output:
    path "${sample_id}_sorted_dedup.calls.cns" into cnscalls_ch
    """
    cnvkit.py call "${cns_file}" -o "${sample_id}_sorted_dedup.calls.cns"
    """
}
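The split-on-underscore logic used in both versions can be checked outside Nextflow against the file names from the question. The Python sketch below mirrors the Groovy expression name.split('_')[0]; the function name is hypothetical, and it assumes that sample IDs never contain an underscore.

```python
from pathlib import PurePosixPath


def sample_id_from_path(path):
    """Return the text before the first underscore in the file name.

    Hypothetical helper; assumes sample IDs contain no underscores.
    """
    return PurePosixPath(path).name.split("_")[0]


def calls_filename(path):
    """Build the output name used in the cnvcalls process for a given input."""
    return f"{sample_id_from_path(path)}_sorted_dedup.calls.cns"
```

Note that IDs such as 015-002 are recovered correctly even when the rest of the name contains dots (for example 015-002_L4.SSHT89_sorted_dedup.cns), because only the first underscore matters.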
@BASENAME@ does not appear to work in the install_dir: parameter of the Meson custom_target() function.
protoc = find_program('protoc')
protobuf_sources = [
  'apples.proto',
  'oranges.proto',
  'pears.proto'
]
protobuf_generated_go = []
foreach protobuf_definition : protobuf_sources
  protobuf_generated_go += custom_target('go_' + protobuf_definition,
    command: [protoc, '--proto_path=@CURRENT_SOURCE_DIR@', '--go_out=paths=source_relative:@OUTDIR@', '@INPUT@'],
    input: protobuf_definition,
    output: '@BASENAME@.pb.go',
    install: true,
    install_dir: 'share/gocode/src/github.com/foo/bar/protobuf/go/@BASENAME@/'
  )
endforeach
I need the generated files to end up in a directory based on the basename of the input file:
share/gocode/src/github.com/foo/bar/protobuf/go/apples/apples.pb.go
share/gocode/src/github.com/foo/bar/protobuf/go/oranges/oranges.pb.go
share/gocode/src/github.com/foo/bar/protobuf/go/pears/pears.pb.go
If I use @BASENAME@ in install_dir: to try to create the directory needed, it does not expand, and instead just creates a literal '@BASENAME@' directory:
share/gocode/src/github.com/foo/bar/protobuf/go/@BASENAME@/apples.pb.go
share/gocode/src/github.com/foo/bar/protobuf/go/@BASENAME@/oranges.pb.go
share/gocode/src/github.com/foo/bar/protobuf/go/@BASENAME@/pears.pb.go
How can the required install directory, based on the basename, be achieved?
(There are just 3 files in the above example; I actually have 30+ files.)
Yes, it looks as though placeholders like @BASENAME@ are not supported in the install_dir parameter, since this feature is aimed at file names, not directories. But because the loop variable is a string, you can build the directory name from it inside the loop:
foreach protobuf_definition : protobuf_sources
  ...
  install_dir: '.../go/@0@'.format(protobuf_definition.split('.')[0])
endforeach
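To make the intended result concrete, the mapping from input file to install directory can be sketched in Python. This only illustrates the string handling (split on '.' and take the first element, as in the answer); it is not Meson code, and the function name and default base directory are assumptions taken from the paths in the question.

```python
def install_dir_for(proto_file, base="share/gocode/src/github.com/foo/bar/protobuf/go"):
    """Return the per-file install directory derived from the file's basename.

    Hypothetical helper mirroring split('.')[0]; a name with several dots
    is cut at the first dot.
    """
    basename = proto_file.split(".")[0]
    return f"{base}/{basename}"


def installed_path_for(proto_file):
    """Return the full expected location of the generated .pb.go file."""
    basename = proto_file.split(".")[0]
    return f"{install_dir_for(proto_file)}/{basename}.pb.go"
```

For 'apples.proto' this gives the share/gocode/src/github.com/foo/bar/protobuf/go/apples directory requested in the question.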
I am building a workflow in snakemake and would like to reuse one of the rules for two different input sources. The input could be either source1 or source1+source2, and the output directory would vary depending on the input. Since this was quite complicated to do in a single rule and I didn't want to copy the full rule, I would like to create two rules with different input/output that run the same command.
Is it possible to make this work? The DAG is resolved correctly, but the jobs don't go through on the cluster (ERROR : bamcov_cmd not defined).
An example is below (both rules use the same command at the end):
This is the command:
def bamcov_cmd():
    return( (deepTools_path+"bamCoverage " +
             "-b {input.bam} " +
             "-o {output} " +
             "--binSize {params.bw_binsize} " +
             "-p {threads} " +
             "--normalizeTo1x {params.genome_size} " +
             "{params.read_extension} " +
             "&> {log}") )
This is the rule:
rule bamCoverage:
    input:
        bam = file1+"/{sample}.bam",
        bai = file1+"/{sample}.bam.bai"
    output:
        "bamCoverage/{sample}.filter.bw"
    params:
        bw_binsize = bw_binsize,
        genome_size = int(genome_size),
        read_extension = "--extendReads"
    log:
        "bamCoverage/logs/bamCoverage.{sample}.log"
    benchmark:
        "bamCoverage/.benchmark/bamCoverage.{sample}.benchmark"
    threads: 16
    run:
        bamcov_cmd()
This is the optional rule2:
rule bamCoverage2:
    input:
        bam = file2+"/{sample}.filter.bam",
        bai = file2+"/{sample}.filter.bam.bai"
    output:
        "bamCoverage/{sample}.filter.bw"
    params:
        bw_binsize = bw_binsize,
        genome_size = int(genome_size),
        read_extension = "--extendReads"
    log:
        "bamCoverage/logs/bamCoverage.{sample}.log"
    benchmark:
        "bamCoverage/.benchmark/bamCoverage.{sample}.benchmark"
    threads: 16
    run:
        bamcov_cmd()
What you asked is possible in Python.
It depends on whether you have just Python code in the file, or Python and Snakemake.
I will answer that first, and then follow up with an alternative, because I would suggest setting it up differently so you don't have to do it this way.
Just Python:
from fileContainingMyBamCovCmdFunction import bamcov_cmd
rule bamCoverage:
    ...
    run:
        bamcov_cmd()
For a concrete example, see how this file accesses buildHeader and buildSample. These files are being called by a Snakefile. It should work the same for you.
https://github.com/LCR-BCCRC/workflow_exploration/blob/master/Snakemake/modules/py_buildFile/buildFile.py
EDIT 2017-07-23 - Updating code segment below to reflect user comment
Snakemake and Python:
include: "fileContainingMyBamCovCmdFunction.suffix"
rule bamCoverage:
    ...
    run:
        shell(bamcov_cmd())
EDIT END
If the function is truly specific to the bamCoverage call, you can put it back in the rule if you prefer. This implies it's not being called elsewhere, which may be true.
Be careful when annotating files using '.' notation. I use '_' instead, as I find it makes it easier to avoid creating cyclical dependencies.
Also, if you do end up keeping the two rules separate, you will likely end up with ambiguity errors.
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=ruleorder#handling-ambiguous-rules
When possible, it's best practice to have rules generate unique outputs.
As for alternatives, consider setting up the code like this?
from subprocess import call
rule all:
    input:
        "path/to/file/mySample.bw"
        # OR
        # "path/to/file/mySample_filtered.bw"
rule bamCoverage:
    input:
        bam = file1+"/{sample}.bam",
        bai = file1+"/{sample}.bam.bai"
    output:
        "bamCoverage/{sample}.bw"
    params:
        bw_binsize = bw_binsize,
        genome_size = int(genome_size),
        read_extension = "--extendReads"
    log:
        "bamCoverage/logs/bamCoverage.{sample}.log"
    benchmark:
        "bamCoverage/.benchmark/bamCoverage.{sample}.benchmark"
    threads: 16
    run:
        # input, output, params, threads and log are available directly in a run block
        callString = deepTools_path + "bamCoverage " \
            + "-b " + str(input.bam) \
            + " -o " + str(output) \
            + " --binSize " + str(params.bw_binsize) \
            + " -p " + str(threads) \
            + " --normalizeTo1x " + str(params.genome_size) \
            + " " + str(params.read_extension) \
            + " &> " + str(log)
        call(callString, shell=True)
rule filterBam:
    input:
        "{pathFB}/{sample}.bam"
    output:
        "{pathFB}/{sample}_filtered.bam"
    run:
        callString = "samtools view -bh -F 512 " + str(input) \
            + " > " + str(output)
        call(callString, shell=True)
Thoughts?
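One point worth spelling out: a function like bamcov_cmd() only returns a string, so calling it on its own inside run: executes nothing. The string has to be handed to something that runs it (shell() in Snakemake, which also fills in placeholders such as {input.bam}). The plain-Python sketch below illustrates that separation under stated assumptions: echo stands in for the real bamCoverage binary, and the placeholder names and the run_template helper are hypothetical.

```python
import subprocess


def bamcov_cmd():
    """Return a shared command template; nothing is executed here."""
    # 'echo' is a stand-in so the sketch runs without deepTools installed.
    return "echo bamCoverage -b {bam} -o {out} --binSize {binsize} -p {threads}"


def run_template(template, **values):
    """Fill in the template and actually run the resulting command."""
    cmd = template.format(**values)
    done = subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True)
    return cmd, done.stdout.strip()
```

Both rules could then share the one template and differ only in the values passed in, for example run_template(bamcov_cmd(), bam="a.bam", out="a.bw", binsize=10, threads=2).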
I am trying to add mono to core-image-minimal for a P2020RDB custom Linux distro. Here is my bblayers.conf file:
# LAYER_CONF_VERSION is increased each time build/conf/bblayers.conf
# changes incompatibly
LCONF_VERSION = "6"
BBPATH = "${TOPDIR}"
BBFILES ?= ""
BBLAYERS ?= " \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/poky/meta \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/poky/meta-yocto \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/poky/meta-yocto-bsp \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/meta-freescale \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/meta-freescale-internal \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/meta-freescale-extra \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/meta-mono \
"
BBLAYERS_NON_REMOVABLE ?= " \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/poky/meta \
/home/testuser/QorIQ-SDK-V1.9-20151210-yocto/sources/poky/meta-yocto \
"
Now, when I try to build image using bitbake core-image-minimal, I get following output from it:
Loading cache: 100% |##############################################################################################################| ETA: 00:00:00
Loaded 1496 entries from dependency cache.
NOTE: Resolving any missing task queue dependencies
Build Configuration:
BB_VERSION = "1.26.0"
BUILD_SYS = "x86_64-linux"
NATIVELSBSTRING = "Debian-8.6"
TARGET_SYS = "powerpc-fsl-linux-gnuspe"
MACHINE = "p2020rdb"
DISTRO = "fsl-qoriq"
DISTRO_VERSION = "1.9"
TUNE_FEATURES = "m32 spe ppce500v2"
TARGET_FPU = "ppc-efd"
meta
meta-yocto
meta-yocto-bsp = "(detachedfromb74ea96):ddf114933ccfc6e3ce51a10e8e8f95e514b73578"
meta-freescale = "(detachedfrom7fb32a2):7fb32a20983a0ebd5503eb42e851550b0deb8679"
meta-freescale-internal = "(detachedfrom220bff8):220bff8b2030e5af7393b5870d74c6f0af0d76d1"
meta-freescale-extra = "(nobranch):ced26c806cb566b1400a2f4f26a94d8d44d13233"
meta-mono = "daisy:f01b4f7a98d07abcf4c1f845c057199e112fb7d6"
NOTE: Preparing RunQueue
NOTE: Executing SetScene Tasks
NOTE: Executing RunQueue Tasks
NOTE: Tasks Summary: Attempted 1248 tasks of which 1248 didn't need to be rerun and all succeeded.
It seems the mono repository is found. I then prepare an SD card using this image, and it boots without problems on the target board; however, the mono command is not available. What am I missing?
Add
IMAGE_INSTALL_append = " mono"
to your local.conf. Just adding a layer doesn't add any package to your image.
Even better, create your own image, and add mono to IMAGE_INSTALL in that recipe.