STAR snakemake wrapper with configfile - snakemake

I'm trying to use the STAR wrapper for snakemake (described here) but would like to include an option for an annotation file using the --sjdbGTFfile flag. My goal is to specify a gtf annotation file by referring to a configfile, but I can't seem to get it to work.
Relevant snippet of my config file:
annotations: ref_files/cal09_human/GRCh38.103_Cal09.gtf
The relevant rule looks like this:
rule map_reads:
    message:
        """
        Mapping trimmed reads from {wildcards.sample} to host genome
        """
    input:
        fq1 = "tmp/{sample}_R1_trimmed.fastq.gz",
        fq2 = "tmp/{sample}_R2_trimmed.fastq.gz"
    params:
        index = config["genome_index"],
        extra = "--sjdbGTFfile {config[annotations]} \
                 --sjdbOverhang 149 \
                 --outFilterType BySJout \
                 --outFilterMultimapNmax 10 \
                 --alignSJoverhangMin 5 \
                 --alignSJDBoverhangMin 1 \
                 --outFilterMismatchNmax 999 \
                 --outFilterMismatchNoverReadLmax 0.04 \
                 --alignIntronMin 20 \
                 --alignIntronMax 1000000 \
                 --alignMatesGapMax 1000000 \
                 --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
                 --outSAMtype BAM SortedByCoordinate \
                 --runMode alignReads"
    output:
        "star_mapping/{sample}_Aligned.sortedByCoord.out.bam"
    log:
        "logs/star/{sample}.log"
    threads: 20
    wrapper:
        "0.74.0/bio/star/align"
I was specifying the config file as if it were within a shell command (like described here), but it's not working:
Jun 03 16:49:52 ..... started STAR run
Jun 03 16:49:52 ..... loading genome
Jun 03 16:50:32 ..... processing annotations GTF
FATAL error, could not open file pGe.sjdbGTFfile={config[annotations]}
Jun 03 16:50:32 ...... FATAL ERROR, exiting
I ran the following directly in the terminal and it runs just fine:
STAR --runThreadN 24 \
--genomeDir ref_files/cal09_human_star_index \
--sjdbGTFfile ref_files/cal09_human/GRCh38.103_Cal09.gtf \
--sjdbOverhang 149 \
--outFilterType BySJout \
--outFilterMultimapNmax 10 \
--alignSJoverhangMin 5 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outFilterIntronMotifs RemoveNoncanonicalUnannotated \
--outSAMtype BAM SortedByCoordinate \
--runMode alignReads
I double-checked that I didn't make an obvious mistake like misspecifying the file path in the config file. Does anyone know of a way to achieve what I'm trying to do?

extra="--sjdbGTFfile {config[annotations]} \
--sjdbOverhang 149 ..."
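It seems {config[annotations]} is not replaced with the value from the config dictionary; it is passed to STAR literally, as the error message shows.
If you are on Python 3.6+, try an f-string (note the f prefix and the quoted key):
extra=f"--sjdbGTFfile {config['annotations']} \
--sjdbOverhang 149 ..."
If the string also contained wildcards such as {sample}, their braces would need to be doubled ({{sample}}) inside the f-string.
Or, old style:
extra="--sjdbGTFfile %s \
--sjdbOverhang 149 ..." % config["annotations"]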
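To make the difference concrete, here is a minimal plain-Python sketch that runs outside Snakemake. The config dictionary is a hypothetical stand-in for what Snakemake would load from the config file, and the argument string is shortened. It shows that a plain string keeps the placeholder verbatim, while an f-string, %-formatting, or an explicit str.format call all fill in the path.

```python
# Hypothetical stand-in for the dictionary Snakemake loads from the config file.
config = {"annotations": "ref_files/cal09_human/GRCh38.103_Cal09.gtf"}

# A plain string literal: nothing is substituted, the braces stay as they are.
plain = "--sjdbGTFfile {config[annotations]} --sjdbOverhang 149"

# f-string: the expression is evaluated when the string is created,
# so the key must be quoted like in any other Python expression.
with_fstring = f"--sjdbGTFfile {config['annotations']} --sjdbOverhang 149"

# Old-style %-formatting gives the same result.
with_percent = "--sjdbGTFfile %s --sjdbOverhang 149" % config["annotations"]

# str.format accepts the unquoted {config[annotations]} form, which is
# why that spelling is valid in places where the string gets formatted.
with_format = plain.format(config=config)

print(plain)
print(with_fstring)
```

The first print shows the literal placeholder that ended up in the STAR error message; the second shows the path filled in.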

Delete unwanted Snakemake Outputs

I have looked at a few other posts about Snakemake and deleting unneeded data to free up disk space. I have written a rule called BamRemove whose touch() output is requested in rule all. However, the workflow manager isn't recognizing it, and I am getting this error: WildcardError in line 35 of /PATH:
No values given for wildcard 'SampleID'. I don't see why. Any help getting this to work would be appreciated.
sampleIDs = d.keys()
rule all:
input:
expand('bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam', sampleID=sampleIDs),
expand('bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai', sampleID=sampleIDs),
expand('logs/{SampleID}_removed.txt', sampleID=sampleIDs) #Line 35
# Some tools require unzipped fastqs
rule AnnotateUMI:
input: 'bams/{sampleID}_unisamp_L001_001.star_rg_added.sorted.dmark.bam'
output: 'bams/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam',
# Modify each run
params: '/data/Test/fastqs/{sampleID}_unisamp_L001_UMI.fastq.gz'
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar AnnotateBamWithUmis \
-i {input} \
-f {params} \
-o {output}')
rule SortSam:
input: rules.AnnotateUMI.output
output: 'bams/{sampleID}_Qsorted.MarkUMI.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx110g -jar /data/Tools/picard.jar SortSam \
INPUT={input} \
OUTPUT={output} \
SORT_ORDER=queryname')
rule MItag:
input: rules.SortSam.output
output: 'bams/{sampleID}_Qsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar SetMateInformation \
-i {input} \
-o {output}')
rule GroupUMI:
input: rules.MItag.output
output: 'bams/{sampleID}_grouped.Qsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar GroupReadsByUmi \
-i {input} \
-s adjacency \
-e 1 \
-m 20 \
-o {output}')
rule ConcensusUMI:
input: rules.GroupUMI.output
output: 'bams/{sampleID}_concensus.Qunsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.2.jar CallMolecularConsensusReads \
--input={input} \
--min-reads=1 \
--output={output}')
rule STARmap:
input: rules.ConcensusUMI.output
output:
log = 'bams/{sampleID}_UMI_Concensus_Log.final.out',
bam = 'bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam'
params: 'bams/{sampleID}_UMI_Concensus_'
threads: 32
run:
# Each user needs to set genome path
shell('STAR \
--runThreadN {threads} \
--readFilesIn {input} \
--readFilesType SAM PE \
--readFilesCommand samtools view -h \
--genomeDir /data/reference/star/STAR_hg19_v2.7.5c \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--limitBAMsortRAM 220000000000 \
--outFileNamePrefix {params}')
rule Index:
input: rules.STARmap.output.bam
output: 'bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai'
params:
threads: 32
run:
shell('samtools index {input}')
rule BamRemove:
input:
AnnotateUMI_BAM = rules.AnnotateUMI.output,
AnnotateUMI_BAI = '{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bai',
SortSam = rules.SortSam.output,
MItag = rules.MItag.output,
GroupUMI = rules.GroupUMI.output,
ConcensusUMI = rules.ConcensusUMI.output
output: touch('logs/{SampleID}_removed.txt')
threads: 32
run:
shell('rm {input}')
expand('logs/{SampleID}_removed.txt', sampleID=sampleIDs) #Line 35
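              ^^^^^^^^                ^^^^^^^^
The error occurs because the wildcard is spelled {SampleID} in the pattern while the keyword passed to expand() is sampleID; wildcard names are case-sensitive. Make the spelling consistent throughout the script, including the output of rule BamRemove, which also uses {SampleID}.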
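The same case-sensitivity can be seen with plain Python string formatting. This is only an analogy using str.format, not Snakemake's own expand(), but the placeholder lookup fails for the same reason: the name inside the braces has to match the keyword exactly.

```python
# Pattern with the capitalised placeholder, as in the failing line.
bad_pattern = "logs/{SampleID}_removed.txt"
# Pattern with the placeholder spelled like the keyword argument.
good_pattern = "logs/{sampleID}_removed.txt"

missing = None
try:
    # The keyword is 'sampleID', so '{SampleID}' has no value to use.
    bad_pattern.format(sampleID="A")
except KeyError as err:
    missing = err.args[0]

# With matching spelling the substitution succeeds.
fixed = good_pattern.format(sampleID="A")

print("missing placeholder:", missing)
print("fixed path:", fixed)
```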

Have Snakemake recognize complete files upon relaunch

I have created this Snakemake workflow. The pipeline works really well; however, if any rule fails and I relaunch, Snakemake doesn't recognize all the completed files. For instance, Sample A finishes all the way through and creates all files for rule all, but Sample B fails at rule AnnotateUMI. When I relaunch, Snakemake wants to run all jobs for both A and B, instead of just B. What do I need to do to get this to work?
sampleIDs = ['A', 'B']
rule all:
input:
expand('PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam', sampleID=sampleIDs),
expand('PATH/bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai', sampleID=sampleIDs),
expand('/PATH/logfiles/{sampleID}_removed.txt', sampleID=sampleIDs)
# Some tools require unzipped fastqs
rule AnnotateUMI:
# Modify each run
input: 'PATH/{sampleID}_unisamp_L001_001.star_rg_added.sorted.dmark.bam'
# Modify each run
output: 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam'
# Modify each run
params: 'PATH/{sampleID}_unisamp_L001_UMI.fastq.gz'
threads: 36
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar AnnotateBamWithUmis \
-i {input} \
-f {params} \
-o {output}')
rule SortSam:
input: rules.AnnotateUMI.output
# Modify each run
output: 'PATH/{sampleID}_Qsorted.MarkUMI.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx110g -jar PATH/picard.jar SortSam \
INPUT={input} \
OUTPUT={output} \
SORT_ORDER=queryname')
rule MItag:
input: rules.SortSam.output
# Modify each run
output: 'PATH/{sampleID}_Qsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar SetMateInformation \
-i {input} \
-o {output}')
rule GroupUMI:
input: rules.MItag.output
# Modify each run
output: 'PATH/{sampleID}_grouped.Qsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar GroupReadsByUmi \
-i {input} \
-s adjacency \
-e 1 \
-m 20 \
-o {output}')
rule ConcensusUMI:
input: rules.GroupUMI.output
# Modify each run
output: 'PATH/{sampleID}_concensus.Qunsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar PATH/fgbio-2.0.2.jar CallMolecularConsensusReads \
--input={input} \
--min-reads=1 \
--output={output}')
rule STARmap:
input: rules.ConcensusUMI.output
# Modify each run
output:
log = 'PATH/{sampleID}_UMI_Concensus_Log.final.out',
bam = 'PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam'
# Modify each run
params: 'PATH/{sampleID}_UMI_Concensus_'
threads: 32
run:
# Each user needs to set genome path
shell('STAR \
--runThreadN {threads} \
--readFilesIn {input} \
--readFilesType SAM PE \
--readFilesCommand samtools view -h \
--genomeDir PATH/STAR_hg19_v2.7.5c \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--limitBAMsortRAM 220000000000 \
--outFileNamePrefix {params}')
rule Index:
input: rules.STARmap.output.bam
# Modify each run
output: 'PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai'
params:
threads: 32
run:
shell('samtools index {input}')
rule BamRemove:
input:
AnnotateUMI_BAM = rules.AnnotateUMI.output,
# Modify each run and include in future version to delete
#AnnotateUMI_BAI = 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bai',
SortSam = rules.SortSam.output,
MItag = rules.MItag.output,
GroupUMI = rules.GroupUMI.output,
ConcensusUMI = rules.ConcensusUMI.output,
STARmap = rules.STARmap.output.bam,
Index = rules.Index.output
# Modify each run
output: touch('PATH/logfiles/{sampleID}_removed.txt')
threads: 32
run:
shell('rm {input.AnnotateUMI_BAM} {input.SortSam} {input.MItag} {input.GroupUMI} {input.ConcensusUMI}')

How to run 'run_squad.py' on google colab? It gives 'invalid syntax' error

I downloaded the file first using:
!curl -L -O https://github.com/huggingface/transformers/blob/master/examples/legacy/question-answering/run_squad.py
Then used following code:
!python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir models/bert/ \
--data_dir data/squad \
--overwrite_output_dir \
--overwrite_cache \
--do_train \
--train_file /content/train.json \
--version_2_with_negative \
--do_lower_case \
--do_eval \
--predict_file /content/val.json \
--per_gpu_train_batch_size 2 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--threads 10 \
--save_steps 5000
Also tried following:
!python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--do_train \
--do_eval \
--do_lower_case \
--train_file /content/train.json \
--predict_file /content/val.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 584 \
--doc_stride 128 \
--output_dir /content/
Both commands fail with the same error:
File "run_squad.py", line 7
^ SyntaxError: invalid syntax
What exactly is the issue? How can I run the .py file?
SOLVED: The error occurred because I was downloading the GitHub page for the script (an HTML document) rather than the script itself. Once I copied the 'Raw' link and downloaded from that, the code ran.
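For anyone hitting the same thing, here is a small sketch of how a GitHub "blob" page URL maps to the corresponding "Raw" URL. The helper name blob_to_raw is made up for illustration. It only rewrites the URL string and does not check that the file or branch still exists.

```python
from urllib.parse import urlsplit


def blob_to_raw(url: str) -> str:
    """Turn a github.com/<owner>/<repo>/blob/<ref>/<path> URL into its raw URL."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: owner, repo, the literal 'blob', then ref and file path.
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _blob, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


page_url = (
    "https://github.com/huggingface/transformers/blob/master/"
    "examples/legacy/question-answering/run_squad.py"
)
raw_url = blob_to_raw(page_url)
print(raw_url)
```

Passing the resulting URL to curl instead of the page URL downloads the Python source rather than the HTML page around it.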

How to convert configure options for use with cmake

I have a script for building a project that I need to upgrade from using configure to cmake. The original configure command is
CFLAGS="$SLKCFLAGS" \
CXXFLAGS="$SLKCFLAGS" \
./configure \
--with-clang \
--prefix=$PREFIX \
--libdir=$PREFIX/lib${LIBDIRSUFFIX} \
--incdir=$PREFIX/include \
--mandir=$PREFIX/man/man1 \
--etcdir=$PREFIX/etc/root \
--docdir=/usr/doc/$PRGNAM-$VERSION \
--enable-roofit \
--enable-unuran \
--disable-builtin-freetype \
--disable-builtin-ftgl \
--disable-builtin-glew \
--disable-builtin-pcre \
--disable-builtin-zlib \
--disable-builtin-lzma \
$GSL_FLAGS \
$FFTW_FLAGS \
$QT_FLAGS \
--enable-shared \
--build=$ARCH-slackware-linux
I am not familiar enough with cmake to know how to do the equivalent. I would prefer a command line option but am open to modifying the CMakeLists.txt file as well.

Compiling tensorflow: undefined reference to `clSetUserEventStatus@OPENCL_1.1'

PS: Question is at the end, the following is just background information that might help.
I'm trying to compile tensorflow, but getting this error:
bazel-out/host/bin/_solib_local/_U@local_Uconfig_Usycl_S_Ssycl_Csyclrt___Uexternal_Slocal_Uconfig_Usycl_Ssycl_Slib/libComputeCpp.so:
undefined reference to `clSetUserEventStatus@OPENCL_1.1'
Using the "strace" tool, I've isolated the exact command that's failing:
(
cd /home/rh/.cache/bazel/_bazel_rh/81919f16ea125cb9f08993f06569f022/execroot/tensorflow && \
exec env - PATH=/home/rh/.nvm/versions/node/v6.9.5/bin:/home/rh/perl5/bin:/home/rh/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rh/bin:/usr/local/java/jdk1.8.0_74/bin:/home/rh/.local/bin:/home/rh/myscripts \
/usr/bin/clang++-3.6 \
-o \
bazel-out/host/bin/tensorflow/python/gen_functional_ops_py_wrappers_cc \
-Lbazel-out/host/bin/_solib_local/_U@local_Uconfig_Usycl_S_Ssycl_Csyclrt___Uexternal_Slocal_Uconfig_Usycl_Ssycl_Slib \
-Wl,-rpath,$ORIGIN/../../_solib_local/_U@local_Uconfig_Usycl_S_Ssycl_Csyclrt___Uexternal_Slocal_Uconfig_Usycl_Ssycl_Slib \
-pthread \
-Wl,-no-as-needed \
-B/usr/bin/ \
-Wl,-no-as-needed \
-Wl,--build-id=md5 \
-Wl,--hash-style=gnu \
-Wl,-S \
-Wl,@bazel-out/host/bin/tensorflow/python/gen_functional_ops_py_wrappers_cc-2.params \
-Wl,--no-undefined
)
strace also confirms that this command IS reading /usr/local/computecpp/lib/libComputeCpp.so, which contains the aforementioned clSetUserEventStatus symbol.
I verified that libComputeCpp.so contains clSetUserEventStatus by checking the output of the nm command:
nm /usr/local/computecpp/lib/libComputeCpp.so | grep clSetUserEventStatus
>> U clSetUserEventStatus@@OPENCL_1.1
So here is my question: clang (and/or its linker) is reading the file that contains the symbol that it's complaining about... why is it "missing something" and complaining that it's undefined?
How can I get this compile to work?
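One detail that may help when reading the nm output: the type letter U marks a symbol as undefined in that file, meaning the file refers to the symbol and expects some other library to provide it. The sketch below reads that type letter from nm-style lines. The function names and the two sample lines are made up for illustration and are not taken from the real library.

```python
def parse_nm_line(line: str):
    """Split one line of default nm output into (address, type letter, name)."""
    fields = line.split()
    if len(fields) == 2:
        # Undefined symbols have no address column.
        return None, fields[0], fields[1]
    if len(fields) == 3:
        return fields[0], fields[1], fields[2]
    raise ValueError(f"unexpected nm line: {line!r}")


def is_undefined(line: str) -> bool:
    """True when the nm type letter is 'U' (referenced but not defined here)."""
    _address, type_letter, _name = parse_nm_line(line)
    return type_letter == "U"


# Illustrative lines only, shaped like default nm output.
sample_lines = [
    "                 U clSetUserEventStatus",
    "0000000000001234 T some_defined_function",
]
for sample in sample_lines:
    print(parse_nm_line(sample)[2], "undefined" if is_undefined(sample) else "defined")
```

Under that reading, a line starting with U shows that the library needs the symbol, not that it supplies it. Running nm -D --defined-only on the candidate libraries shows which ones actually define it.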