BWA alignment "fail to locate the index files" - indexing

This question has been asked previously, but unfortunately for me the solutions posted did not resolve my issue. I am trying to use BWA to align my ddradseq paired end reads to a reference genome, and keep running into the issue of the program throwing the error [E::bwa_idx_load_from_disk] fail to locate the index files.
This is the code I am using:
# Load the BWA module:
module load bwa
bwa mem -t 2 genome/genome.fasta \
    processrad_out/sample1.1.fq.gz processrad_out/sample1.2.fq.gz \
    > bwa_out/sample1.sam
My data is stored in the 'processrad_out' directory, the indexed genome is in the 'genome' directory, and I want the output to be stored in the 'bwa_out' directory.
I have already indexed the genome, and have tried running my script directly in the indexed genome directory, but still seem to have the same issue. These are the indexed files I have:
LeachsGenome.fasta
LeachsGenome.fasta.ann
LeachsGenome.fasta.fai
LeachsGenome.fasta.sa
LeachsGenome.fasta.amb
LeachsGenome.fasta.bwt
LeachsGenome.fasta.pac
Previously when I indexed I was missing the .fai file, but I managed to correct that issue. Despite this, the program is still having difficulty locating the indexed files. I have also tried re-indexing, as that seems to have worked for some people, but no such luck. I know my paths are correct and I can't quite figure out what the correct solution is.
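One way to double-check what BWA will look for is sketched below in Python. It assumes that `bwa mem` needs the five files written by `bwa index` (.amb, .ann, .bwt, .pac, .sa) next to the exact prefix given on the command line. The helper name `missing_bwa_index_files` is made up for this illustration.

```python
from pathlib import Path

# Suffixes written by `bwa index`. The .fai file comes from `samtools faidx`
# and is not one of the files bwa mem loads.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(prefix):
    """Return the expected index files that do not exist for this prefix."""
    return [
        prefix + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not Path(prefix + suffix).is_file()
    ]


if __name__ == "__main__":
    # Same prefix as in the bwa mem command above.
    for path in missing_bwa_index_files("genome/genome.fasta"):
        print("missing:", path)
```

If the listing above is the content of the 'genome' directory, a check like this would pass for the prefix `genome/LeachsGenome.fasta` and report all five files as missing for `genome/genome.fasta`.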
I am fairly new to bioinformatics and am trying to learn as much as I can. Any suggestions are welcome and highly appreciated!

Related

How to properly use batch flag in snakemake to subset DAG

I have a workflow written in snakemake that balloons at one point to a large number of output files. This happens because there are many combinations of my wildcards yielding on the order of 60,000 output files.
With this number of input files, DAG generation is very slow for subsequent rules to the point of being unusable. In the past, I've solved this issue by iterating with bash loops across subsetted configuration files where all but one of the wildcards is commented out. For example, in my case I had one wildcard (primer) that had 12 possible values. By running snakemake iteratively for each value of "primer", it divided up the workflow into digestible chunks (~5000 input files). With this strategy DAG generation is quick and the workflow proceeds well.
Now I'm interested in using the new --batch flag available in snakemake 5.7.4 since Johannes suggested to me on twitter that it should basically do the same thing I did with bash loops and subsetted config files.
However, I'm running into an error that's confusing me. I don't know if this is an issue I should be posting on the github issue tracker or I'm just missing something basic.
The rule my workflow is failing on is as follows:
rule get_fastq_for_subset:
    input:
        fasta="compute-workflow-intermediate/06-subsetted/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.fasta",
        fastq="compute-workflow-intermediate/04-sorted/{sample}.{direction}.SSU.{group}.fastq"
    output:
        fastq=temp("compute-workflow-intermediate/07-subsetted-fastq/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.full.fastq"),
        fastq_revcomp="compute-workflow-intermediate/07-subsetted-fastq/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.full.revcomped.fastq"
    conda:
        "envs/bbmap.yaml"
    shell:
        "filterbyname.sh names={input.fasta} include=t in={input.fastq} out={output.fastq} ; "
        "revcompfastq_according_to_pyNAST.py --inpynast {input.fasta} --infastq {output.fastq} --outfastq {output.fastq_revcomp}"
I set up a target in my all rule to generate the output specified in the rule, then tried running snakemake with the batch flag as follows:
snakemake --configfile config/config.yaml --batch get_fastq_for_subset=1/20 --snakefile Snakefile-compute.smk --cores 40 --use-conda
It then fails with this message:
WorkflowError:
Batching rule get_fastq_for_subset has less input files than batches. Please choose a smaller number of batches.
Now the thing I'm confused about is that my input for this rule should actually be on the order of 60,000 files. Yet it appears snakemake is getting fewer than 20 input files. Perhaps it's only counting 2, as in the number of input files that are specified by the rule? But this doesn't really make sense to me...
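To make that guess concrete, here is a minimal Python sketch of how a batch could be cut out of one job's list of input files. It is an illustration written for this post, not Snakemake's code: the function `take_batch`, the way the remainder is handled, and the file names are all assumptions.

```python
import math


def take_batch(items, idx, batches):
    """Return batch number idx (1-based) out of `batches` slices of items."""
    items = sorted(items)
    if len(items) < batches:
        # Mirrors the situation in the error message: fewer inputs than batches.
        raise ValueError("fewer input files than batches")
    batch_len = math.floor(len(items) / batches)
    start = (idx - 1) * batch_len
    if idx == batches:
        # In this sketch the last batch also takes the remainder.
        return items[start:]
    return items[start:start + batch_len]


if __name__ == "__main__":
    many = ["file{:02d}.fastq".format(i) for i in range(45)]
    print(len(take_batch(many, 1, 20)))  # 45 inputs split into 20 batches
    try:
        take_batch(["a.fasta", "a.fastq"], 1, 20)  # two inputs, like one job of the rule
    except ValueError as err:
        print("error:", err)
```

If batching works along these lines, it would only be useful for a rule whose single job lists many input files, not for a rule that is instantiated many times with two inputs each. That is a guess rather than something confirmed here.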
The source code for snakemake is here, showing how snakemake counts things but it's a bit beyond me:
https://snakemake.readthedocs.io/en/stable/_modules/snakemake/dag.html
If anyone has any idea what's going on, I'd be very grateful for your help!
Best,
Jesse

Snakemake › Access multiple keys from config file

I have a question about the proper handling of the config file. I have been trying to solve my issue for a couple of days now but, with the best will, I just can't find out how to do it. I know that this question is maybe quite similar to others here, and I really tried to use them, but I didn't really get it. I hope that some things about how snakemake works will become clearer once I have solved this problem.
I'm just switching to snakemake and I thought I just can easily convert my bash script. To get familiar with snakemake I started trying a simple Data-Processing pipeline. I know I could solve my case while defining every variable within the snakefile. But I want to use an external config file.
First, for better understanding, I decided to post only the code that I thought would work somehow. I have already played around with different versions of a "rule all" and the "lambda" functions, but nothing has worked so far and posting them would just be confusing. I'm really a bit embarrassed and confused about why I can't get this working. The variable differs from the key because I always had a version where I redefine the variable, like:
sample=config["samples"]
I would be incredibly thankful for an example code.
What I'd like to have is:
The config file:
samples:
- SRX1232390
- SRX2312380
names:
- SomeData
- SomeControl
adapters:
- GATCGTAGC
- GATCAGTCG
And then I thought I can just call the keys like different variables.
rule download_fastq:
    output:
        "fastq/{name}.fastq.gz"
    shell:
        "fastq-dump {wildcards.sample} > {output}"
later there will be more rules, so I thought for them I also just need a key:
rule trimming_cutadapt:
    input:
        "fastq/{name}.fastq"
    output:
        "ctadpt_{name}.fastq"
    shell:
        "cutadapt -a {adapt}"
I also tried something with a config file like this:
samples:
    Somedata: SRX1232131
    SomeControl: SRX12323
But in the end I also didn't find the final solution nor would I know how to add a third "variable" then.
I hope it is somehow understandable what I want to have. It would be very awesome if someone could help me.
EDIT:
Ok - I reworked my code and tried to dig into everything. I fear my understanding lacks in connecting the things I read in this case. I would really appreciate some tips which will probably help me to understand my confusion.
First of all: Rather than try to download data from a pipeline I decided to do this in a config step. I tried out two different versions now:
Based on this answer I tried version one. I like the version with the two files. But I'm stuck in how to deal with the variables now in things like using them with the lambda function or everything where you normally would write "config["sample"]".
So my problem here is that I don't know how to proceed or what the correct syntax is now to call the variables.
#version one
configfile: "config.yaml"
sample_file = config["sample_file"]
import pandas as pd
sample = pd.read_table(sample_file)['samples']
adapt = pd.read_table(sample_file)['adapters']
rule trimming_cutadapt:
    input:
        data=expand("../data/{sample}.fastq", name = pd.read_table(sample_file)['names']),
        lambda wildcards: ???
    output:
        "trimmed/ctadpt_{sample}.fastq"
    shell:
        "cutadapt -a {adapt}"
So I went back to try to understand using and defining the wildcards. So (among other things) I looked into the example snakefile and the example rules of Johannes. And of course into the man. Oh and the Thing about the zip function.
At least I don't get an error anymore that it can't deal with wildcards or whatever. Now it's just doing nothing. And I can't find out why because I don't get any information. Additionally, I marked some points which I don't understand.
#version two
configfile: "config_ChIP_Seq_Pipeline.yaml"
rule all:
    input:
        expand("../data/{sample}.fastq", sample=config["samples"])
        #when to write the lambda or the expand in a rule all and when into the actual rule?
rule trimming_cutadapt:
    input:
        "../data/{sample}.fastq"
    params:
        adapt=lambda wildcards: config[wildcards.sample]["adapt"] #why do I have to write .sample? when I have to use wildcard.XXX in the shell part?
    output:
        "trimmed/ctadpt_{sample}.fastq"
    shell:
        "cutadapt -a {params.adapt}"
As a testfile I used this one.
My configfile in version 1:
sample_file: "sample.tab"
and the tab file:
samples names adapters
test_1 input GACCTA
and the configfile from version two:
samples:
- test_1
adapt:
- GTACGTAG
Thanks for your help and patience!
Cheers
You can look at this post to see how to store and access sample information.
Then you can look at Snakemake documentation here, more specifically at the zip function, which might help you as well.
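As a rough illustration of the zip idea in plain Python, the parallel lists from the first config file in the question can be combined into one lookup keyed by name. The dictionary below stands in for what `configfile:` would load. The names `by_name` and `adapter_for` are made up for this sketch, and it assumes the three lists are in matching order.

```python
# Stand-in for the dictionary Snakemake builds from the YAML config file.
config = {
    "samples": ["SRX1232390", "SRX2312380"],
    "names": ["SomeData", "SomeControl"],
    "adapters": ["GATCGTAGC", "GATCAGTCG"],
}

# zip() walks the three lists in parallel, so the entries at position i
# of each list end up in the same record.
by_name = {
    name: {"sample": sample, "adapter": adapter}
    for name, sample, adapter in zip(
        config["names"], config["samples"], config["adapters"]
    )
}


def adapter_for(name):
    """Look up the adapter sequence that belongs to a name."""
    return by_name[name]["adapter"]


if __name__ == "__main__":
    for name, record in by_name.items():
        print(name, record["sample"], record["adapter"])
```

Inside a Snakefile, a lookup like this could then be called from a params function, for example `adapt=lambda wildcards: adapter_for(wildcards.name)`, and used in the shell command as `{params.adapt}`.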

How to fix 'File name too long' errors when using Snakemake

When using Snakemake, I store the values for my variables as part of the filenames (ex. "processed/count_{project}.tsv"). Recently, I've started using R formulas with many covariates as a variable. Now I get an error because the filename is too long for the operating system. Has anyone else run into this issue and have any suggestions? Is there a canonical Snakemake approach for this problem?
Personally, I don't think it is a good idea to store information into the filename.
Rather, I would create a temp file in tabular or YAML format linking the file in question to its covariates or other data. Then read this file in R or elsewhere to extract the relevant information.
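A minimal Python sketch of such a lookup table follows. The hashing step is my own addition, not part of the suggestion above: it is just one way to get a short, stable ID per formula. The names `short_id`, `write_lookup` and the column `formula_id` are hypothetical.

```python
import csv
import hashlib


def short_id(formula, length=10):
    """Derive a short, stable identifier from a long formula string."""
    return hashlib.sha1(formula.encode("utf-8")).hexdigest()[:length]


def write_lookup(formulas, path):
    """Write a tab-separated table linking each short ID to its formula."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["formula_id", "formula"])
        for formula in formulas:
            writer.writerow([short_id(formula), formula])


if __name__ == "__main__":
    formulas = ["~ age + sex + batch + treatment + age:treatment"]
    write_lookup(formulas, "formula_lookup.tsv")
    print(short_id(formulas[0]))
```

The short ID could then take the place of the formula in the filename, for example "processed/count_{formula_id}.tsv", and the R script would read the table to recover the full formula.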
One idea is to use paths instead, since paths are allowed to be longer than individual file names.

Ignore includes with #pycparser and define multiple Subgraphs in #pydot

I am new to stackoverflow, but I got a lot of help until now, thanks to the community for that.
I'm trying to create a piece of software that shows me caller dependencies for legacy code.
I'm parsing a directory of C code with pycparser, and for each file I want to create a subgraph with pydot.
Two questions:
When parsing a C file, the parser follows the #includes, and I also get functions from the included files in my AST. How can I know whether a function comes from an included file or from the actual file, or how can I ignore the #includes?
For each file I want to create a subgraph, and then add all functions in this file to this subgraph. I don't know how many subgraphs I have to create...
I have a set of files, where each file is a frozenset with the functions of this file
Is something like this possible?
for files in SetOfFiles:
    #how to create subgraph with name of files?
    for function in files:
        self.graph.add_node(pydot.Node(function)) #--> add node to subgraph "files"
I hope you got my challenge... any ideas?
Thanks!
EDIT:
I solved the question about pydot, it was quite easy... So I'm left with my pycparser problem :(
for files in ListOfFuncs:
    cluster_x = pydot.Cluster(files, label=files)
    for functions in files:
        cluster_x.add_node(pydot.Node(functions))
    graph.add_subgraph(cluster_x)
I can address the pycparser part. The preprocessor leaves #line directives that specify which file and line the code came from, and pycparser consumes those. You can get that information from the AST it creates (see tests for an example).
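A small sketch of that idea: the source string below imitates preprocessor output with line markers, and the function definitions are then filtered by the file recorded in each node's `coord`. The file names `util.h` and `main.c` and the variable names are invented for the example. With real code, the text would come from running the preprocessor first.

```python
from pycparser import c_ast, c_parser

# Imitation of preprocessed output: the '# <line> "<file>"' markers tell the
# parser which file the following code came from.
SOURCE = '''
# 1 "util.h"
int helper(int x) { return x + 1; }
# 1 "main.c"
int main(void) { return helper(1); }
'''

ast = c_parser.CParser().parse(SOURCE, filename="main.c")

# Every top-level function definition, including the one from the header.
all_functions = [
    node.decl.name for node in ast.ext if isinstance(node, c_ast.FuncDef)
]

# Only the definitions whose recorded source file is the file of interest.
own_functions = [
    node.decl.name
    for node in ast.ext
    if isinstance(node, c_ast.FuncDef) and node.coord.file == "main.c"
]

if __name__ == "__main__":
    print("all:", all_functions)
    print("defined in main.c:", own_functions)
```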

Inconsistent Behavior In A Batch File's For Statement

I've done very little with batch files but I'm trying to track down a strange bug I've been encountering on a legacy system.
I have a number of .exe files in a particular folder. This script is supposed to duplicate them with a different file name.
Code From Batch File
for %%i in (*.exe) do copy \\networkpath\folder\%%i \\networkpath\folder\%%i.backup.exe
(Note: The source and destination folders are THE SAME)
Example Of Desired Behavior:
File1.exe --> Becomes --> File1.exe.backup.exe
File2.exe --> Becomes --> File2.exe.backup.exe
Now first, let me say that this is not the approach I would take. I know there are other (potentially more straight forward) ways to do this. I also know that you might wonder WHY on earth we care about creating a FileX.exe.backup.exe. But this script has been running for years and I'm told the problem only started recently. I'm trying to pinpoint the problem, not rewrite the code (even if it would be trivial).
Example Buggy Output:
File1.exe.backup.exe
File1.exe.backup.exe.backup.exe
File1.exe.backup.exe.backup.exe.backup.exe
File1.exe.backup.exe.backup.exe.backup.exe.backup.exe
File1.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe
File1.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe
etc...
File2.exe.backup.exe
File2.exe.backup.exe.backup.exe
File2.exe.backup.exe.backup.exe.backup.exe
File2.exe.backup.exe.backup.exe.backup.exe.backup.exe
File2.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe
File2.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe
Not knowing anything about batch files, I looked at this and figured that the condition of the for statement was being re-evaluated after each iteration - creating a (near) infinite loop of copying (I can see that, eventually, the copy will fail when the names get too long).
This would explain the behaviour I'm seeing. And when I cleaned the directory in question so that it had only the original File1.exe file and ran the script, it produced the buggy output. The problem is that I CANNOT replicate the behaviour anywhere else!?!
When I create a folder locally with a few .exe files and run the script - I get the expected output. And yes, if I run it again, I get one instance of 'File1.exe.backup.exe.backup.exe' (and each time I run it again, it increases in length by one). But I cannot get it to enter the near-infinite loop case.
It's been driving me crazy.
The bug is occurring on a networked location - so I've tried to recreate it on one - but again, no success. Because it's a shared network location, I wondered if it could have something to do with other people accessing or modifying files in the folder and even introduced delays and wrote a tiny program to perform actions in the same folder - but without any success.
The documentation I can find on the 'for' statement doesn't really help, but all of the tests I've run seem to suggest that the in (*.exe) section is only evaluated once at the beginning of execution.
Does anyone have any suggestions for what might be going on here?
I agree with Andriy M's comment - it looks to be related to Windows 7 Batch Script 'For' Command Error/Bug
The following change should fix the problem:
for /f "eol=: delims=" %%i in ('dir /b *.exe') do copy \\networkpath\folder\%%i \\networkpath\folder\%%i.backup.exe
Any file that starts with a semicolon (highly unlikely, but it can happen) would be skipped with the default EOL of semicolon. To be safe you should set EOL to some character that could never start a file name (or any path). That is why I chose the colon - it cannot appear in a folder or file name, and can only appear after a drive letter. So it should always be safe.
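The point of the `dir /b` version, as I understand it, is that the complete list of names is collected before the first copy is made, so newly created backups cannot be picked up by the loop. Here is the same idea sketched in Python. The function name `backup_exes` and the demo file names are only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def backup_exes(folder):
    """Copy every .exe in folder to <name>.backup.exe and return the new names."""
    folder = Path(folder)
    # Snapshot the matching files first; copies made below cannot join this list.
    originals = sorted(p for p in folder.glob("*.exe") if p.is_file())
    created = []
    for src in originals:
        dst = src.with_name(src.name + ".backup.exe")
        shutil.copy2(src, dst)
        created.append(dst.name)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("File1.exe", "File2.exe"):
            (Path(tmp) / name).write_bytes(b"")
        print(backup_exes(tmp))
```

Running it a second time on the same folder would still add one more `.backup.exe` level per existing file, as described in the question, but a single run cannot loop on its own output.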
Copy also supports wildcard characters in the target path. You can use
copy \\networkpath\folder\*.exe \\networkpath\folder\*.backup.exe