output of one rule as input of another - snakemake

I am new to snakemake and I'm trying to write a complex pipeline with many steps and branching points. One of the earlier steps is a STAR alignment.
Here I want to use the genome alignment for some steps and the transcriptome alignment for others. The rule produces two output files, and I want to use each of these as input for other rules in Snakemake.
If possible I want to avoid hard-coding the actual filenames and let Snakemake deal with them for me.
rule star:
    input:
        reads=samples.iloc[0,1].split(",")
    output:
        tx_align=temp("/".join([output_directory, "star", samplename+"Aligned.toTranscriptome.out.bam"])),
        genome_align="/".join([output_directory, "star", samplename+"Aligned.sortedByCoord.out.bam"])
    params:
        index=config["resources"]["star_index"],
        gtf=config["resources"]["gtf"],
        prefix="/".join([output_directory, "star", samplename])
    log: config["root_dir"]+"/"+str(samples["samplename"])+"star.log"
    threads: 10
    shell:
        """
        STAR --runMode alignReads \
            --runThreadN {threads} \
            --readFilesCommand zcat \
            --readFilesIn {input.reads} \
            --genomeDir {params.index} \
            --outFileNamePrefix {params.prefix} \
            --twopassMode Basic \
            --sjdbGTFfile {params.gtf} \
            --outFilterType BySJout \
            --limitSjdbInsertNsj 1200000 \
            --outSAMstrandField intronMotif \
            --outFilterIntronMotifs None \
            --alignSoftClipAtReferenceEnds Yes \
            --quantMode TranscriptomeSAM GeneCounts \
            --outSAMtype BAM SortedByCoordinate \
            --outSAMattrRGline ID:{samplename}, SM:sm1 \
            --outSAMattributes All \
            --outSAMunmapped Within \
            --outSAMprimaryFlag AllBestScore \
            --chimSegmentMin 15 \
            --chimJunctionOverhangMin 15 \
            --chimOutType Junctions \
            --chimMainSegmentMultNmax 1 \
            --genomeLoad NoSharedMemory
        """
So then I want to use something like:
rule rsem:
    input:
        rules.star.output[0]
    output:
        somefile
    run:
        etc
I'm not even sure if this is possible.

EDIT:
Never mind, here is the solution:
rule rule1:
    input: some_input
    output:
        out1=output1,
        out2=output2
    shell:
        "command {input} {output.out1} {output.out2}"

rule rule2:
    input: rules.rule1.output.out1
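For reference, the same pattern applied to the rules from the original question might look like this (a minimal sketch; the downstream command and output path are placeholders, not part of the question):

rule rsem:
    input:
        # refer to the named output of rule star; Snakemake fills in the filename
        tx_bam=rules.star.output.tx_align
    output:
        "somefile"
    shell:
        "some_command {input.tx_bam} {output}"

Using the named form (output.tx_align) instead of the positional form (output[0]) keeps the rule readable and robust if the order of the outputs ever changes.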

Pyspark df.count(): Why does it work with only one executor?

I am trying to read data from Kafka and I want to get a count of the records.
It takes a long time because it runs with only one executor. How can I increase the parallelism?
spark = SparkSession.builder.appName('oracle_read_test') \
    .config("spark.driver.memory", "30g") \
    .config("spark.driver.maxResultSize", "64g") \
    .config("spark.executor.cores", "10") \
    .config("spark.executor.instances", "15") \
    .config('spark.executor.memory', '30g') \
    .config('num-executors', '20') \
    .config('spark.yarn.executor.memoryOverhead', '32g') \
    .config("hive.exec.dynamic.partition", "true") \
    .config("orc.compress", "ZLIB") \
    .config("hive.merge.smallfiles.avgsize", "40000000") \
    .config("hive.merge.size.per.task", "209715200") \
    .config("dfs.blocksize", "268435456") \
    .config("hive.metastore.try.direct.sql", "true") \
    .config("spark.sql.orc.enabled", "true") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .getOrCreate()

df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("includeHeaders", "true") \
    .option("subscribe", "test") \
    .load()

df.count()
How many partitions does your topic have? If only one, then you cannot have more executors.
Otherwise, --num-executors exists as a flag to spark-submit.
Also, this code only counts the records returned in one batch, not the entire topic. Counting the entire topic would take even longer.
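If you want to see how much parallelism the read actually has, a minimal sketch (assuming the same df as above, read as a batch) is:

# the number of Spark partitions of the Kafka source mirrors the topic's partition count,
# which caps how many executors can work on it in parallel
print(df.rdd.getNumPartitions())

# heavier downstream processing can be spread over more executors by repartitioning,
# at the cost of a shuffle; the initial read itself is still bounded by the topic
df = df.repartition(20)

Repartitioning does not speed up the read itself; only adding partitions to the topic does that.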

Sample Input from file

I am trying to create the input for rules from a sample file. The sample file contains a column SampleID which should be used as the sample wildcard. Per SampleID, I want to extract the paths of the normal and tumor BAMs from the columns Path_Normal and Path_Tumor of the data frame.
For this I tried like this:
import pandas as pd

input_table = "sampletable.tsv"
samples = pd.read_table(input_table).set_index("SampleID", drop=False)

rule all:
    input:
        expand("/directory/sm_mutect2_paired/vcf/{sample}.mt2.vcf", sample=samples.index)

rule Mutect2:
    input:
        tumor = samples[samples['SampleID']=="{sample}"]['Path_Tumor'],
        normal = samples[samples['SampleID']=="{sample}"]['Path_Normal']
    output:
        "/directory/sm_mutect2_paired/vcf/{sample}.mt2.vcf"
    conda:
        "envs/gatk_mutect2_paired.yaml"
    shell:
        "gatk --java-options '-Xmx16G -XX:+UseParallelGC -XX:ParallelGCThreads=16' Mutect2 \
        -R /directory/ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta \
        {input.tumor} \
        {input.normal} \
        -L /directory/GATK_interval_files_Agilent/S07604514_hs_hg38/S07604514_Covered.bed \
        -O {output} \
        --af-of-alleles-not-in-resource 2.5e-06 \
        --germline-resource /directory/GATK_gnomad/af-only-gnomad.hg38.vcf.gz \
        -pon /home/zyto/unger/GATK_PoN/1000g_pon.hg38.vcf.gz"
...
When doing a dry run I do not get an error message, but the execution fails because the input is empty, which becomes apparent when looking at the log:
gatk --java-options '-Xmx16G -XX:+UseParallelGC -XX:ParallelGCThreads=16' Mutect2 -R /directory/GATK_ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta -L /directory/GATK_interval_files_Agilent/S07604514_hs_hg38/S07604514_Covered.bed -O /directory/WES_Rezidiv_HNSCC_Clonality/sm_mutect2_paired/vcf/HL05_Rez_HL05_NG.mt2.vcf --af-of-alleles-not-in-resource 2.5e-06 --germline-resource /directory/GATK_gnomad/af-only-gnomad.hg38.vcf.gz -pon /directory/GATK_PoN/1000g_pon.hg38.vcf.gz
The two input files should appear between "Mutect2" and "-R".
So it looks like I am doing something wrong when defining the inputs...
You need to defer the determination of the input files of that rule to the so-called DAG phase, when jobs and wildcard values are known. This works via input functions. I would strongly recommend doing the official Snakemake tutorial, which covers this topic in depth.
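A minimal sketch of such input functions, assuming the samples data frame from the question (indexed by SampleID); only the input section changes, the rest of the rule stays as in the question:

rule Mutect2:
    input:
        # input functions (here lambdas) are evaluated once the {sample} wildcard is known
        tumor = lambda wildcards: samples.loc[wildcards.sample, "Path_Tumor"],
        normal = lambda wildcards: samples.loc[wildcards.sample, "Path_Normal"]
    # output, conda and shell as in the question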

Can't get $TPU_NAME environment variable to work properly

I'm a newbie! I'm trying to train a BERT model from scratch on a Kaggle kernel. I can't get the BERT run_pretraining.py script to work on TPUs, though it works fine on CPUs. I'm guessing the issue is with the $TPU_NAME environment variable.
!python run_pretraining.py \
--input_file='gs://xxxxxxxxxx/*' \
--output_dir=/kaggle/working/model/ \
--do_train=True \
--do_eval=True \
--bert_config_file=/kaggle/input/bert-bangla-test-config/config.json \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=20 \
--num_warmup_steps=2 \
--learning_rate=2e-5 \
--use_tpu=True \
--tpu_name=$TPU_NAME
If the script is using a tf.distribute.cluster_resolver.TPUClusterResolver() (https://www.tensorflow.org/api_docs/python/tf/distribute/cluster_resolver/TPUClusterResolver), then you can simply instantiate the TPUClusterResolver without any arguments, and it'll automatically pick up the TPU_NAME (https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/python/tpu/client/client.py#L47).
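If you have a Python entry point where you control the resolver yourself, a minimal TF 2.x sketch of that approach looks like this (not the BERT script itself, just an illustration of the resolver call):

import tensorflow as tf

# with no constructor arguments the resolver picks up TPU_NAME from the environment
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)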
Okay, I found a rookie solution :P
Run:
import os
os.environ
From the returned dictionary you can get the address; just copy-paste it. It will be in the form 'TPU_NAME': 'grpc://xxxxxxx'.
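A slightly less manual variant of the same idea, assuming the kernel really does export TPU_NAME:

import os

# read the grpc address directly instead of copy-pasting it from os.environ
tpu_address = os.environ.get("TPU_NAME", "")
print(tpu_address)

The printed value can then be passed to --tpu_name.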

Snakemake: HISAT2 index build and alignment using touch

Following my previous question: Snakemake: HISAT2 alignment of many RNAseq reads against many genomes UPDATED.
I wanted to run the hisat2 alignment using touch in Snakemake.
I have several genome index files with suffixes .1.ht2l to .8.ht2l
bob.1.ht2l
...
bob.8.ht2l
steve.1.ht2l
...
steve.8.ht2l
and several RNAseq samples
flower_kevin_1.fastq.gz
flower_kevin_2.fastq.gz
flower_daniel_1.fastq.gz
flower_daniel_2.fastq.gz
I need to align all RNAseq reads against each genome.
workdir: "/path/to/dir/"

(HISAT2_INDEX_PREFIX,) = glob_wildcards('/path/to/dir/{prefix}.fasta')
(SAMPLES,) = glob_wildcards("/path/to/dir/{sample}_1.fastq.gz")

rule all:
    input:
        expand("{prefix}.{sample}.bam", zip, prefix=HISAT2_INDEX_PREFIX, sample=SAMPLES)

rule hisat2_build:
    input:
        database="/path/to/dir/{prefix}.fasta"
    output:
        done = touch("{prefix}")
    threads: 2
    shell:
        "/Tools/hisat2-2.1.0/hisat2-build -p {threads} {input.database} {wildcards.prefix}"

rule hisat2:
    input:
        hisat2_prefix_done = "{prefix}",
        fastq1="/path/to/dir/{sample}_1.fastq.gz",
        fastq2="/path/to/dir/{sample}_2.fastq.gz"
    output:
        bam = "{prefix}.{sample}.bam",
        txt = "{prefix}.{sample}.txt",
    log: "{prefix}.{sample}.snakemake_log.txt"
    threads: 50
    shell:
        "/Tools/hisat2-2.1.0/hisat2 -p {threads} -x {wildcards.prefix}"
        " -1 {input.fastq1} -2 {input.fastq2} --summary-file {output.txt} |"
        " /Tools/samtools-1.9/samtools sort -@ {threads} -o {output.bam}"
The output gives me bob and steve aligned ONLY against ONE RNAseq sample (i.e. flower_kevin). I don't know how to solve this. Any suggestions would be helpful.
I solved the problem by removing zip from rule all. Criticism of the syntax of the code is still welcome.
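For reference, without zip the expand() in rule all generates every prefix/sample combination, which is what is wanted here:

rule all:
    input:
        # cartesian product: every genome prefix combined with every sample
        expand("{prefix}.{sample}.bam", prefix=HISAT2_INDEX_PREFIX, sample=SAMPLES)

With zip, expand pairs the two lists element-wise instead, so each genome is only requested against a single sample.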

Is there a way to convert images embedded in a PDF from jpg/gif/whatever to png or gif?

The biggest part of the question is in the title...
I have big PDF files made from concatenated scanned documents which are something like press articles: text + images. The important part is the text, not the pictures...
That's why I thought (according to this article) of converting all the images in the PDF to PNG or GIF...
Thanks for all your suggestions; I have already spent too much time trying to optimize my Ghostscript command-line options :-p
FYI, here is my current Ghostscript 9.14 command line in production:
gs -q -sDEVICE=pdfwrite \
-dSAFER -dNOPAUSE -dBATCH -dQUIET -dPDFSETTINGS=/ebook \
-dColorImageResolution=150 -dGrayImageResolution=150 -dMonoImageResolution=800 \
-dPreserveOPIComments=false -dPreserveOverprintSettings=false \
-dUCRandBGInfo=/Remove -dProcessColorModel=/DeviceRGB -dMaxInlineImageSize=0 \
-dDetectDuplicateImages=true -dFastWebView=false -dUseFlateCompression=true \
-dAutoFilterGrayImages=false -dAutoFilterColorImages=false \
-dColorImageDownsampleThreshold=1.2 \
-sOUTPUTFILE=/tmp/screen_20140602103745.pdf \
-c "512000000 setvmthreshold /QFactor 0.80 /Blend 1 /ColorTransform 1 /HSamples [2 1 1 2] /VSamples [2 1 1 2]" \
-f /usr/bases/dicodrp/pdf/pdf_concatenes/20140602103745.pdf
I get about 40% compression and something that is just readable, but I think I can improve the readability just by changing the image compression type (I get those noisy JPEG artifacts...).
No, I can't increase the dpi because that will increase the file size... :-)