How to use pysam.view to emulate all functions of samtools view - samtools

I am trying to use pysam.view() to filter out certain alignments from a BAM file. The problem I am facing is how to include several regions in the filter.
pysam.view() emulates the samtools view command, which allows one to enter several regions separated by the space character, e.g.:
samtools view opts bamfile chr1:2010000-20200000 chr2:2010000-20200000
But the corresponding pysam.view call:
pysam.view(opts, bamfile, '1:2010000-20200000 2:2010000-20200000')
does not work. It does not return any alignments. I'm quite sure the problem lies in how to specify the list of regions, since the following command works fine:
pysam.view(opts, bamfile, '1:2010000-20200000')
and returns alignments.
My question is: does pysam.view support multiple regions and how does one specify this list? I have searched for documentation regarding this but not found anything.

The short answer to your question is that the format you'd use is
pysam.view(opts, bamfile, '1:2010000-20200000', '2:2010000-20200000')
(Also note that the number indicating the end of each of your regions is ~10x larger than the beginning - it seems you might have intended 2010000-2020000 instead.)
I have tested it using the following code:
import pysam
my_bam_file = '/path/to/my/bam_file.bam'
alignments1 = pysam.view(my_bam_file, '1:2010000-4000000')
alignments2 = pysam.view(my_bam_file, '1:5000000-6000000')
alignments3 = pysam.view(my_bam_file, '1:2010000-4000000', '1:5000000-6000000')
print(len(alignments1) + len(alignments2) == len(alignments3))
[Output:] True
However, this way of extracting alignments is not very efficient, as the output you get is one large str rather than individual alignments. To get a list of separate alignments, use the following code:
import pysam
my_bam_file = '/path/to/my/bam_file.bam'
imported = pysam.AlignmentFile(my_bam_file, mode = 'rb')
regions = ('1:2010000-20200000','2:2010000-20200000')
alignments = []
for region in regions:
    bam = imported.fetch(region=region, until_eof=True)
    alignments.extend([alignment for alignment in bam])
Each element of alignments then ends up being a pysam.AlignedSegment object, which you can work with further using the functions in the pysam API.
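For example, here is a minimal sketch (the mapping-quality threshold and the output path are made up for illustration, not part of the original question) of filtering those AlignedSegment objects and writing the kept reads to a new BAM file:

import pysam

my_bam_file = '/path/to/my/bam_file.bam'
imported = pysam.AlignmentFile(my_bam_file, mode='rb')

# Hypothetical filter: keep only mapped reads with mapping quality >= 30
# in one region, writing them to a new BAM file that reuses the same header.
filtered_path = '/path/to/my/filtered.bam'
with pysam.AlignmentFile(filtered_path, mode='wb', template=imported) as out:
    for alignment in imported.fetch(region='1:2010000-2020000'):
        if not alignment.is_unmapped and alignment.mapping_quality >= 30:
            out.write(alignment)

pysam.index(filtered_path)  # index the result so it can itself be fetched by region later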

Related

Can we pass dataframes between different notebooks in databricks and sequentially run multiple notebooks?

I have a notebook which processes a file and creates a data frame in a structured format.
Now I need to import that data frame into another notebook, but the problem is that before running the notebook I need to check whether it should run at all; it should only run for some scenarios.
Usually, to import all the data structures, we use %run. But in my case it needs to be a combination of an if clause and then a notebook run:
if "dataset" in path: %run ntbk_path
This gives the error "path not exist".
if "dataset" in path: dbutils.notebook.run(ntbk_path)
With this one I cannot get all the data structures.
Can someone help me to resolve this error?
To implement it correctly you need to understand how things work:
%run is a separate directive that should be put into a separate notebook cell; you can't mix it with Python code. Plus, it can't accept the notebook name as a variable. What %run does is evaluate the code from the specified notebook in the context of the current Spark session, so everything that is defined in that notebook (variables, functions, etc.) is available in the caller notebook.
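As a minimal sketch of that shared-context behaviour (the notebook names here are made up):

------ called notebook ./utils
def add_one(x):
    return x + 1

shared_df = spark.range(0, 10)

------ caller notebook, cell containing only the %run directive
%run ./utils

------ caller notebook, next Python cell
print(add_one(41))        # functions defined in ./utils are available
print(shared_df.count())  # and so are variables, including DataFrames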
dbutils.notebook.run is a function that takes a notebook path, plus parameters, and executes it as a separate job on the current cluster. Because it's executed as a separate job, it doesn't share the context with the current notebook, and everything that is defined in it won't be available in the caller notebook (you can return a simple string as the execution result, but it has a relatively small maximum length). One of the problems with dbutils.notebook.run is that scheduling a job takes several seconds, even if the code is very simple.
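For example, a small sketch of that string-return mechanism (the notebook path is hypothetical):

------ called notebook /Shared/child
result = "42"
dbutils.notebook.exit(result)  # dbutils.notebook.exit is how a value is passed back to the caller

------ caller notebook
returned = dbutils.notebook.run("/Shared/child", 300)
print(returned)  # prints "42"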
How can you implement what you need?
If you use dbutils.notebook.run, then in the called notebook you can register a temp view, and the caller notebook can read data from it (the examples are adapted from this demo).
Called notebook (Code1; it requires two parameters: name for the view name and n for the number of entries to generate):
name = dbutils.widgets.get("name")
n = int(dbutils.widgets.get("n"))
df = spark.range(0, n)
df.createOrReplaceTempView(name)
Caller notebook (let's call it main):
if "dataset" in "path":
view_name = "some_name"
dbutils.notebook.run(ntbk_path, 300, {'name': view_name, 'n': "1000"})
df = spark.sql(f"select * from {view_name}")
... work with data
It's even possible to do something similar with %run, but it requires a kind of "magic". It relies on the fact that you can pass arguments to the called notebook using the $arg_name="value" syntax, and you can even refer to values specified in widgets. But in any case, the check of the value will happen in the called notebook.
The called notebook could look as follows:
flag = dbutils.widgets.get("generate_data")

dataframe = None
if flag == "true":
    dataframe = ...  # create the dataframe here
and the caller notebook could look as follows:
------ cell in python
if "dataset" in "path":
    gen_data = "true"
else:
    gen_data = "false"
dbutils.widgets.text("gen_data", gen_data)

------ cell for %run
%run ./notebook_name $generate_data=$gen_data

------ again in python
dbutils.widgets.remove("gen_data")  # remove the widget
if dataframe is not None:  # dataframe is defined
    # do something with dataframe

Many inputs to one output, access wildcards in input files

Apologies if this is a straightforward question, I couldn't find anything in the docs.
Currently my workflow looks something like this. I'm taking a number of input files created as part of this workflow, and summarizing them.
Is there a way to avoid this manual regex step to parse the wildcards in the filenames?
I thought about an "expand" of cross_ids and config["chromosomes"], but unsure to guarantee conistent order.
rule report:
    output:
        table="output/mendel_errors.txt"
    input:
        files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
    params:
        req="h_vmem=4G",
    run:
        df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])
        for i, fn in enumerate(input.files):
            # open fn / make calculations etc // stat =
            # manual regex of filename to get chrom, cross // chrom, cross =
            df.loc[i] = stat, chrom, cross
This seems a bit awkward, since this information must already be in the environment somewhere.
(via Johannes Köster on the google group)
To answer your question:
expand uses itertools.product from the standard library. Hence, you could write
from itertools import product
product(config["chromosomes"], cross_ids)

How to use Bioproject ID, for example, PRJNA12997, in biopython?

I have an Excel file listing more than 2000 organisms, where each one of them has a BioProject ID associated with it (like PRJNA12997). The idea is to use these IDs to get the sequences for a later multiple alignment with five other sequences that I have in a text file.
Can anyone help me understand how I can do this using Biopython? At least the part with the BioProject ID.
You can first get the info using Bio.Entrez:
from Bio import Entrez
Entrez.email = "Your.Name.Here#example.org"
# This call to efetch fails sometimes with a 400 error.
handle = Entrez.efetch(db="bioproject", id="PRJNA12997")
I've been trying, and Entrez.read(handle) doesn't seem to work. But if you do record_xml = handle.read() you'll get the XML entry for this record. From this XML you can get the numeric ID for the project, in this case 12997.
handle = Entrez.esearch(db="nuccore", term="12997[BioProject]")
search_results = Entrez.read(handle)
Now you can efetch from your search results. At this point you should use Biopython to parse whatever you get in the efetch step, playing with the rettype: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/
for result in search_results["IdList"]:
    entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta")
    this_seq_in_fasta = entry.read()
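If you'd rather end up with parsed records than raw FASTA strings, here is a small sketch (the output filename is made up; the rettype/retmode values follow the table linked above):

from Bio import Entrez, SeqIO

Entrez.email = "Your.Name.Here@example.org"

handle = Entrez.esearch(db="nuccore", term="12997[BioProject]")
search_results = Entrez.read(handle)

records = []
for result in search_results["IdList"]:
    entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta", retmode="text")
    records.append(SeqIO.read(entry, "fasta"))  # one SeqRecord per fetched sequence
    entry.close()

# Write all fetched sequences to a single FASTA file, ready for multiple alignment
SeqIO.write(records, "bioproject_12997_sequences.fasta", "fasta")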

scilab : index in variable name loop

I would like to read some images with Scilab, and I use the function imread like this:
im01=imread('kodim01t.jpg');
im02=imread('kodim02t.jpg');
im03=imread('kodim03t.jpg');
im04=imread('kodim04t.jpg');
im05=imread('kodim05t.jpg');
im06=imread('kodim06t.jpg');
im07=imread('kodim07t.jpg');
im08=imread('kodim08t.jpg');
im09=imread('kodim09t.jpg');
im10=imread('kodim10t.jpg');
I would like to know if there is a way to do something like below in order to optimize the code:
for i = 1:5
    im&i = imread('kodim0&i.jpg');
end
Thanks in advance.
I see two possible solutions: using execstr, or using some kind of list/matrix.
Execstr
First create a string of the command to execute with msprintf and then execute it with execstr. Note that in the msprintf conversion the right number of leading zeros is inserted by the %02d format specifier described here.
for i = 1:5
    // In Scilab, a quote inside a single-quoted string is escaped by doubling it
    cmd = msprintf('im%d = imread(''kodim%02dt.jpg'');', i, i);
    execstr(cmd);
end
List/Matrix
This is probably the more intuitive option, using an indexable container such as a list.
// This list could be generated using msprintf from the example above
file_names_list = list("kodim01t.jpg", "kodim02t.jpg", "kodim03t.jpg");

// Create an empty list to contain the images
opened_images = list();

for i = 1:length(file_names_list)
    // Open the image and insert it at the end of the list
    // (lists are indexed with parentheses, not square brackets)
    opened_images($+1) = imread(file_names_list(i));
end

Apache Pig: Merging list of attributes into a single tuple

I receive data in the form
id1|attribute1a,attribute1b|attribute2a|attribute3a,attribute3b,attribute3c....
id2||attribute2b,attribute2c|..
I'm trying to merge it all into a form where I just have a bag of tuples of an id field followed by a tuple containing a list of all my other fields merged together.
(id1,(attribute1a,attribute1b,attribute2a,attribute3a,attribute3b,attribute3c...))
(id2,(attribute2b,attribute2c...))
Currently I load it like this:
my_data = load '$input' USING PigStorage('|') as
    (id:chararray, attribute1:chararray, attribute2:chararray, ...);
Then I've tried all combinations of FLATTEN, TOKENIZE, GENERATE, TOTUPLE, BagConcat, etc. to massage it into the form I want, but I'm new to Pig and just can't figure it out. Can anyone help? Any open-source UDF libraries are fair game.
Load each line as an entire string, and then use the features of the built-in STRSPLIT UDF to achieve the desired result. This relies on there being no tabs in your list of attributes, and assumes that | and , are not to be treated any differently in separating out the different attributes. Also, I modified your input a little bit to show more edge cases.
input.txt:
id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c
id2||attribute2b,attribute2c,|attribute4a|,attribute5a
test.pig:
my_data = LOAD '$input' AS (str:chararray);
split1 = FOREACH my_data GENERATE FLATTEN(STRSPLIT(str, '\\|', 2)) AS (id:chararray, attr:chararray);
split2 = FOREACH split1 GENERATE id, STRSPLIT(attr, '[,|]') AS attributes;
DUMP split2;
Output of pig -x local -p input=input.txt test.pig:
(id1,(attribute1a,attribute1b,attribute2a,,,attribute3a,attribute3b,attribute3c))
(id2,(,attribute2b,attribute2c,,attribute4a,,attribute5a))