Problems compressing files using Nextflow - gzip

I am trying to compress files with the suffix '.js' in a Nextflow pipeline.
My problem is that the 'result.tar.gz' archive only contains soft links to the original files and not the actual files.
Does anybody know an answer to that problem?
My example code:
#!/usr/bin/env nextflow

a_ch = Channel.fromPath('a.js')
b_ch = Channel.fromPath('b.js')

process testTar {

    publishDir ".", mode: 'copy', pattern: "*.tar.gz"

    input:
    path "a.js" from a_ch
    path "b.js" from b_ch

    output:
    path("result.tar.gz") into results_ch

    """
    tar -czvf "result.tar.gz" *.js
    """
}
Thank you in advance.

I don't know which tar you have, but try adding an h to the options, e.g. -chzvf, so that tar dereferences (follows) the symbolic links and archives the files they point to.
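Nextflow stages input files into the process work directory as symlinks by default, which is why the archive ends up containing links. A minimal sketch of the process with the dereference flag added (assuming GNU tar):

process testTar {

    publishDir ".", mode: 'copy', pattern: "*.tar.gz"

    input:
    path "a.js" from a_ch
    path "b.js" from b_ch

    output:
    path("result.tar.gz") into results_ch

    script:
    """
    # -h (--dereference) archives the files the symlinks point to,
    # not the links themselves
    tar -chzvf "result.tar.gz" *.js
    """
}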

Related

Error caused by missing output files while running Nextflow

I get an error when I run Nextflow, consisting of the following message:
Error executing process > 'BWA_INDEX (Homo_sapiens_assembly38_chr1.fasta)'
Caused by:
Missing output file(s) FASTA.* expected by process 'BWA_INDEX(Homo_sapiens_assembly38_chr1.fasta)'
I use the following script:
#!/usr/bin/env nextflow

params.PublishDir = "/home/nextflow_test/genesFilter"
params.pathFasta = "/home/nf-core/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38_chr1.fasta"

InputFasta = file(params.pathFasta)

process BWA_INDEX {

    tag { InputFasta.name }

    publishDir (
        path: "${params.PublishDir}",
        mode: 'copy',
        overwrite: 'true',
        saveAs: "${params.PublishDir}/${it}"
    )

    input:
    path InputFasta

    output:
    file("FASTA.*") into bwa_indexes

    script:
    """
    bwa-mem2 index "${InputFasta}"
    """
}

ch_bwa = bwa_indexes
Nevertheless, in the work directory (given after the error message) the process does run correctly and the output files are generated, just not in my desired output directory. I tried replacing "file" with "path" in the line:
output:
file("FASTA.*")
I also tried replacing "FASTA.*" with "${params.PublishDir}/FASTA.*",
but the error still appears. I don't know exactly why it happens. Could it be due to the use of params to specify the inputs and outputs?
Thanks in advance!
Missing output file(s) FASTA.* expected by process 'BWA_INDEX(Homo_sapiens_assembly38_chr1.fasta)'
Nextflow is expecting files matching the glob pattern FASTA.* in the working directory, but they could not be found when the process exited (successfully). You just need to tell Nextflow what files to expect in your output declaration. The files that bwa-mem2 index Homo_sapiens_assembly38_chr1.fasta should have created might look like:
Homo_sapiens_assembly38_chr1.fasta.0123
Homo_sapiens_assembly38_chr1.fasta.amb
Homo_sapiens_assembly38_chr1.fasta.ann
Homo_sapiens_assembly38_chr1.fasta.bwt.2bit.64
Homo_sapiens_assembly38_chr1.fasta.bwt.8bit.32
Homo_sapiens_assembly38_chr1.fasta.pac
The following output declaration should be sufficient to find these files:
output:
path("${InputFasta}.*") into bwa_indexes
Note that only files that are declared in your output block are published to the publishDir. Also, the 'saveAs' publishDir parameter must be a closure for it to work correctly. You will need to fix this (or just remove the line entirely) to make your example work.
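Putting these together, a minimal sketch of the corrected process (with saveAs dropped, since publishDir's path already determines the destination) might look like:

process BWA_INDEX {

    tag { InputFasta.name }

    publishDir (
        path: "${params.PublishDir}",
        mode: 'copy',
        overwrite: true
    )

    input:
    path InputFasta

    output:
    // declare the index files that bwa-mem2 actually produces
    path("${InputFasta}.*") into bwa_indexes

    script:
    """
    bwa-mem2 index "${InputFasta}"
    """
}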

Discard part of filename in Snakemake: "Wildcards in input files cannot be determined from output files"

I am running into a "WildcardError: Wildcards in input files cannot be determined from output files" problem with Snakemake. The issue is that I don't want to keep the variable part of my input file names. For instance, suppose I have these files:
$ mkdir input
$ touch input/a-foo.txt
$ touch input/b-wsdfg.txt
$ touch input/c-3523.txt
And I have a Snakemake file like this:
subjects = ['a', 'b', 'c']
result_pattern = "output/{kind}.txt"

rule all:
    input:
        expand(result_pattern, kind=subjects)

rule step1:
    input:
        "input/{kind}-{fluff}.txt"
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """
I want the output file names to just have the part I'm interested in. I understand the principle that every wildcard in the input needs a corresponding wildcard in the output. So is what I'm trying to do a sort of anti-pattern? For instance, I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other. Should I be renaming my input files prior to feeding them into snakemake?
I want the output file names to just have the part I'm interested in [...]. I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other.
It seems to me you need to decide how to resolve such conflicts. If the input files are:
input/a-bar.txt
input/a-foo.txt <- Note duplicate {a}
input/b-wsdfg.txt
input/c-3523.txt
How do you want the output files to be named, and according to what criteria? The answer is independent of snakemake, but depending on your circumstances you could include Python code within the Snakefile to handle such conflicts automatically.
Basically, once you make such decisions you can work on the solution.
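For instance, a hypothetical parse-time check (the names here are illustrative) that fails fast when two input files map to the same {kind} could look like:

import collections
import glob
import os

# Map each {kind} prefix to the input files that share it
kind_to_files = collections.defaultdict(list)
for path in glob.glob("input/*-*.txt"):
    kind = os.path.basename(path).split("-", 1)[0]
    kind_to_files[kind].append(path)

# Refuse to run if any prefix is ambiguous
duplicates = {k: v for k, v in kind_to_files.items() if len(v) > 1}
if duplicates:
    raise ValueError("Ambiguous input files per kind: %s" % duplicates)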
But suppose there are no file name conflicts, it seems like the wildcard system doesn't handle cases where you want to remove some variable fluff from a filename
The variable part can be handled using python's glob patterns, evaluated per wildcard through an input function (a bare glob.glob call in the input would run once at parse time, with {kind} still a literal brace pattern, and match nothing):

import glob
...

rule step1:
    input:
        lambda wc: glob.glob("input/%s-*.txt" % wc.kind)
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """
You could even be more elaborate and use a dedicated function to match files given the {kind} wildcard:
def get_kind_files(wc):
    ff = glob.glob("input/%s-*.txt" % wc.kind)
    if len(ff) != 1:
        raise Exception('Expected exactly 1 file for kind "%s"' % wc.kind)
    # Possibly more checks that you got the right file
    return ff

rule step1:
    input:
        get_kind_files,
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """

Accessing file path from a config.yaml in Snakemake

I'm working with Snakemake for NGS analysis. I have a list of input files, stored in a YAML file as follows:
DATASETS:
  sample1: /path/to/input/bam
  .
  .
A very simplified skeleton of my Snakemake file, as described earlier in Snakemake: How to use config file efficiently and https://www.biostars.org/p/406452/, is as follows:
rule all:
    input:
        expand("report/{sample}.xlsx", sample = config["DATASETS"])

rule call:
    input:
        lambda wildcards: config["DATASETS"][wildcards.sample]
    output:
        "tmp/{sample}.vcf"
    shell:
        "some mutect2 script"

rule summarize:
    input:
        "tmp/{sample}.vcf"
    output:
        "report/{sample}.xlsx"
    shell:
        "processVCF.py"
This complains about missing input files for rule all. I'm really not sure what I am missing here: could someone perhaps point out where I should start looking to solve my problem?
The problem persists even when I execute snakemake -n tmp/sample1.vcf, so it seems to be related to an inability to pass the input file to the rule call. I have a nagging feeling that I'm really missing something trivial here.
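For reference, the skeleton above assumes the YAML is actually loaded into config. A minimal sketch of how that is usually wired up (the filename config.yaml is an assumption; the file can also be supplied with --configfile on the command line):

# Load the YAML so that config["DATASETS"] is defined
configfile: "config.yaml"

rule all:
    input:
        expand("report/{sample}.xlsx", sample = config["DATASETS"])

Note that expanding over config["DATASETS"] iterates the dictionary keys, i.e. the sample names, which is what the report/{sample}.xlsx pattern expects.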

Snakemake always rebuilds targets, even when up to date

I'm new to snakemake and running into some behavior I don't understand. I have a set of fastq files in a directory reads/raw_fastq, with file names following the standard Illumina convention:
SAMPLENAME_SAMPLENUMBER_LANE_READ_001.fastq.gz
I'd like to create symbolic links in a directory reads/renamed_raw_fastq that simplify the names to follow the pattern:
SAMPLENAME_READ.fastq.gz
My aim is that as I add new fastq files to the project, snakemake will create symlinks only for the newly added files.
My snakefile is as follows:
# Get sample names from read file names in the "raw" directory
readRootDir = 'reads/'
readRawDir = readRootDir + 'raw_fastq/'

import os
samples = list(set([x.split('_', 1)[0] for x in os.listdir(readRawDir)]))
samples.sort()

# Generate simplified names
readRenamedRawDir = readRootDir + 'renamed_raw_fastq/'
newNames = expand(readRenamedRawDir + "{sample}_{read}.fastq.gz", sample = samples, read = ["R1", "R2"])

# Create symlinks
import glob

def getRawName(wildcards):
    rawName = glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0]
    return rawName

rule all:
    input: newNames

rule rename:
    input: getRawName
    output: "reads/renamed_raw_fastq/{sample}_{read}.fastq.gz"
    shell: "ln -sf {input} {output}"
When I run snakemake, it tries to generate the symlinks as expected but:
Always tries to create the target symlinks, even when they already exist and have later timestamps than the source fastq files.
Throws errors like:
MissingOutputException in line 68 of /work/nick/FAW-MIPs/renameRaw.snakefile:
Missing files after 5 seconds:
reads/renamed_raw_fastq/Ben21_R2.fastq.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
It's almost like snakemake isn't seeing the output files it creates. Can anyone suggest what I might be missing here?
Thanks!
I think
ln -sf {input} {output}
gives a symlink pointing to a missing file, i.e., it doesn't point to the source file: ln stores the relative {input} path verbatim, and a relative target is resolved from the symlink's own directory (reads/renamed_raw_fastq/), not from the directory snakemake runs in, so the link dangles. You could fix it by e.g. using absolute paths, like:
def getRawName(wildcards):
    rawName = os.path.abspath(glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0])
    return rawName
(As an aside, I would make sure that renaming fastq files the way you do doesn't result in a name-collision, for example when the same sample is sequenced on different lanes of the same flow cell.)
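Alternatively (assuming GNU coreutils ln, which supports -r/--relative), you can let ln compute a correct relative target itself; a minimal sketch:

rule rename:
    input: getRawName
    output: "reads/renamed_raw_fastq/{sample}_{read}.fastq.gz"
    # -r rewrites the stored target relative to the link's location,
    # so the original relative paths from getRawName would also work
    shell: "ln -srf {input} {output}"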

glob_wildcards on multiple directories with different file names

I am trying to write a rule that takes two files from different directories and puts the output of the rule into the same directory, as in the file structure below:
DIR_A
    dir1
        file1.clean.vcf
    dir2
        file2.clean.vcf
    dir3
        file1.output.vcf
        file2.output.vcf
so far I have tried using glob_wildcards:

(DIR, NAME) = glob_wildcards("DIR_A/{dir}/{name}.clean.vcf")

input: expand("DIR_A/{dir}/{name}.clean.vcf", dir=DIR, name=NAME)
output: "DIR_A/dir3/{name}.output.vcf"
but it throws an error:
MissingInputException in line 80 of DIR_A:
Missing input files for rule convert_output:
DIR_A/dir1/file2.clean.vcf
DIR_A/dir2/file1.clean.vcf
Adding zip to the input:
input: expand("DIR_A/{dir}/{name}.clean.vcf", zip, dir=DIR, name=NAME)
gives, on a dry run ($ snakemake -s snakefile -n):
rule conv_output:
    input: DIR_A/dir1/file1.clean.vcf, DIR_A/dir2/file2.clean.vcf
This is accepted by snakemake and prevents the above error, but now file1.clean.vcf and file2.clean.vcf are both inputs to the rule, while the {name} wildcard makes the rule run once per file. This ends up as a many-to-one mapping rather than the one-to-one I am looking for.
Is there a way to set this up so that the rule conv_output acts on each of the files separately and puts the output in dir3? Any help would be greatly appreciated!!
Using Python, pair each input VCF name to its path, then use that mapping to specify the input path in the Snakemake rule. The example below works for the directory structure given in the question.
from pathlib import Path

def pair_name_to_infiles():
    # get all *.clean.vcf files recursively under DIR_A
    vcf_path = Path('DIR_A').glob('**/*.clean.vcf')

    # pair each vcf name to its infile path using a dictionary
    vcf_infiles_dict = {}
    for f in vcf_path:
        vcf_name = f.name.replace('.clean.vcf', '')
        vcf_infiles_dict[vcf_name] = str(f)
    return vcf_infiles_dict

# map each vcf name to its infile path
vcf_infiles_dict = pair_name_to_infiles()

rule all:
    input:
        expand('DIR_A/dir3/{vcf_name}.output.vcf', vcf_name=vcf_infiles_dict.keys())

rule foo:
    input:
        lambda wildcards: vcf_infiles_dict[wildcards.vcf_name]
    output:
        'DIR_A/dir3/{vcf_name}.output.vcf'
    shell:
        'touch {output}'
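The lambda input function is the key design choice here: the dictionary lookup is deferred until Snakemake has resolved the {vcf_name} wildcard from the requested output path, so each output file maps back to exactly one input file, giving the one-to-one behaviour asked for.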