Is there any way to pass a bed file or similar with target regions to samtools coverage? - samtools

I would like to get the coverage and meandepth of different regions from a bam file. I guess samtools coverage is a good way to do that but I wasn't able to find a way to pass a file with my target regions.
Is there any way to do that?

samtools coverage does not accept BED files. samtools bedcov does, but its output is different.
As an alternative, you can wrap samtools coverage in a while loop that iterates over the intervals (passing each one with the -r option) and appends the output to a file (myfile.cov in the example below).
Here is the bash code:
while read -r chr start end _
do
    # BED starts are 0-based, samtools regions are 1-based: add 1 to the start.
    samtools coverage -q5 -Q20 --ff UNMAP,SECONDARY,QCFAIL,DUP -r "$chr:$((start + 1))-$end" sample.bam | grep -v '^#' >> myfile.cov
done < path/to/my.bed
I hope they will implement the BED input soon because it is really useful.
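If only the mean depth per region is needed, the samtools bedcov output mentioned above can be post-processed instead of looping. This is a minimal sketch, assuming tab-separated bedcov output in which the last column is the summed per-base depth of the region; the helper name bedcov_mean_depth is made up for illustration.

```shell
# Hypothetical helper: reads `samtools bedcov` output on stdin and appends
# the mean depth (summed per-base depth / region length) to each line.
# Assumes columns 2 and 3 are the BED start and end, and that the last
# column is the depth sum reported by bedcov.
bedcov_mean_depth() {
    awk 'BEGIN { FS = OFS = "\t" }
         $3 > $2 { print $0, $NF / ($3 - $2) }'
}

# Usage (assumes samtools is installed and sample.bam is sorted and indexed):
#   samtools bedcov path/to/my.bed sample.bam | bedcov_mean_depth > myfile.meandepth
```

Unlike samtools coverage, this gives only the depth, not the covered-bases or coverage-percentage columns, so it is a partial substitute.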

Related

Process part of a line through the shell pipe

I would like to process part of each line of command output, leaving the rest untouched.
Problem
Let's say I have some du output:
❯ du -xhd 0 /usr/lib/gr*
3.2M /usr/lib/GraphicsMagick-1.3.40
584K /usr/lib/grantlee
12K /usr/lib/graphene-1.0
4.2M /usr/lib/graphviz
4.0K /usr/lib/grcrt1.o
224K /usr/lib/groff
Now I want to process each path with another command, for example running pacman -Qo on it, leaving the remainder of the line untouched.
Approach
I know I can use awk '{print $2}' to get only the path, and could probably use it in a convoluted for loop to weld it back together, but maybe there is an elegant way, ideally easy to type on the fly, producing this in the end:
3.2M /usr/lib/GraphicsMagick-1.3.40/ is owned by graphicsmagick 1.3.40-2
584K /usr/lib/grantlee/ is owned by grantlee 5.3.1-1
12K /usr/lib/graphene-1.0/ is owned by graphene 1.10.8-1
4.2M /usr/lib/graphviz/ is owned by graphviz 7.1.0-1
4.0K /usr/lib/grcrt1.o is owned by glibc 2.36-7
224K /usr/lib/groff/ is owned by groff 1.22.4-7
Workaround
This is the convoluted contraption I am living with for now:
❯ du -xhd 0 /usr/lib/gr* | while read line; do echo "$line $(pacman -Qqo $(echo $line | awk '{print $2}') | paste -s -d',')"; done | column -t
3.2M /usr/lib/GraphicsMagick-1.3.40 graphicsmagick
584K /usr/lib/grantlee grantlee,grantleetheme
12K /usr/lib/graphene-1.0 graphene
4.2M /usr/lib/graphviz graphviz
4.0K /usr/lib/grcrt1.o glibc
224K /usr/lib/groff groff
But multiple parts of it are pacman-specific.
du -xhd 0 /usr/lib/gr* | while read line; do echo "$line" | awk -n '{ORS=" "; print $1}'; pacman --color=always -Qo $(echo $line | awk '{print $2}') | head -1; done | column -t
3.2M /usr/lib/GraphicsMagick-1.3.40/ is owned by graphicsmagick 1.3.40-2
584K /usr/lib/grantlee/ is owned by grantlee 5.3.1-1
12K /usr/lib/graphene-1.0/ is owned by graphene 1.10.8-1
4.2M /usr/lib/graphviz/ is owned by graphviz 7.1.0-1
4.0K /usr/lib/grcrt1.o is owned by glibc 2.36-7
224K /usr/lib/groff/ is owned by groff 1.22.4-7
This is a more generic solution, but what if there are three columns of output and I want to process only the middle one?
It grows in complexity, and I thought there must be a simpler way avoiding duplication.
Use a bash loop
(
IFS=$'\t'
while read -r -a fields; do
fields[1]=$(pacman -Qo "${fields[1]}")
printf '%s\n' "${fields[*]}"
done
)
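The same loop can be wrapped in a function so that the field number and the command are parameters, which also covers the "three columns, process only the middle one" case. This is a sketch rather than a polished tool: the name map_field is hypothetical, it requires bash, it assumes tab-separated input (as du produces), and because read treats tab as whitespace here, empty fields are collapsed.

```shell
# Hypothetical helper (bash): map_field N CMD [ARGS...]
# Replaces tab-separated field N (1-based) of every stdin line with the
# first line of output of CMD [ARGS...] run on that field's value.
map_field() {
    local n=$1; shift
    local -a fields
    local IFS=$'\t'          # split on tabs and re-join with tabs
    while read -r -a fields; do
        # </dev/null keeps CMD from swallowing the rest of the loop's input.
        fields[n-1]=$("$@" "${fields[n-1]}" </dev/null | head -n 1)
        printf '%s\n' "${fields[*]}"
    done
}

# Usage with the commands from the question:
#   du -xhd 0 /usr/lib/gr* | map_field 2 pacman -Qo
```

Keeping only the first output line mirrors the head -1 in the question's second workaround; drop that part if multi-line output should be preserved.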
Use a simple shell loop.
du -xhd 0 /usr/lib/gr* |
while read -r size path; do
    # pacman -Qo already prints "<path> is owned by <package> <version>",
    # so only the size needs to be prepended.
    pacman --color=always -Qo "$path" |
    awk -v sz="$size" '{ printf "%s %s\n", sz, $0 }'
done
If you want to split out parts of the output from pacman, Awk makes that easy to do; for example, judging by the expected output in the question, the package name is in Awk's $(NF-1) and the version in $NF.
(Sorry, don't have pacman here; perhaps edit your question to show its output if you need more details. Going forward, please take care to ask the actual question you need help with, so you don't have to move the goalposts by editing after you have received replies - this is problematic for many reasons, not least of which because the answers you already received will seem wrong or unintelligible if they no longer answer the question as it stands after your edit.)
These days, many tools have options to let you specify which fields exactly you want to output, and a formatting option to produce them in machine-readable format. The pacman man page mentions a --machinereadable option, though it does not seem to be of particular use here. Many modern tools will produce JSON, which can be unwieldy to handle in shell scripts, but easy if you have a tool like jq which understands JSON format (less convenient if the only available output format is XML; some tools will let you get the result as CSV, which is mildly clumsy but relatively easy to parse). Maybe also look for an option like --format for specifying how exactly to arrange the output. (In curl it's called -w/--write-out.)

ksh cmd one-liner to grep for several PIDs at once

I got a bunch of processes that I need to check CPU affinity for, so I got this one liner:
for i in `ps -Fae | grep proc_name| awk '{print $2}'`; do taskset -acp $i;done
but I have a problem, taskset shows all the child processes' pid too so I get a massive line of numbers along with their cpu affinity.
I want to pipe the above line into an egrep 'pid1|pid2' so I can filter out all the child processes.
I tried this:
for i in `ps -Fae | grep proc_name| awk '{print $2}'`; do taskset -acp $i;done | xargs egrep 'ps -Fae | grep proc_name| awk '{print $2}''
but my ksh shell didn't like the awk brackets at all.
So I have two questions:
can taskset be changed to show only parent pid?
how do I write the last bit where I egrep only the parent pid?
Filter inside the loop:
for i in $(ps -Fae | grep proc_name| grep -v grep | awk '{print $2}'); do
taskset -acp "$i" | grep "$i"
done
It sounds like you're asking for this syntax if it were bash (see https://mywiki.wooledge.org/BashFAQ/001, I'm not sure what the equivalent robust read loop syntax is for ksh):
while IFS= read -r i; do
taskset -acp "$i"
done < <(ps -Fae | awk '/proc_name/{print $2}') |
grep -E 'pid1|pid2'
but that's pretty fragile, e.g. if pid1 appeared as a substring of some other pid. If you edit your question to provide concise, testable sample input (i.e. the output of ps -Fae and the associated output of taskset) plus the expected output then we can be of more help.
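One way to avoid the substring problem is to compare the PID field exactly instead of using a regular expression. The sketch below assumes taskset prints lines of the form "pid 1234's current affinity list: 0-3" (the format in an English locale); the helper name filter_taskset_pids is made up for illustration.

```shell
# Hypothetical helper: keep only those taskset output lines whose PID is
# exactly one of the PIDs passed as arguments.
filter_taskset_pids() {
    awk -v pids="$*" '
        BEGIN {
            n = split(pids, p, " ")
            for (i = 1; i <= n; i++) want[p[i]] = 1
        }
        $1 == "pid" {
            pid = $2
            sub(/[^0-9].*$/, "", pid)   # "1234\047s" -> "1234"
            if (pid in want) print
        }
    '
}

# Usage (proc_name is the placeholder from the question):
#   pids=$(pgrep -x proc_name)
#   for pid in $pids; do taskset -acp "$pid"; done | filter_taskset_pids $pids
```

Because the comparison is on the whole number, PID 12 no longer matches the line for PID 123.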

Extract user-specified sequence from the reverse strand of a FASTA file using samtools

I have a list of regions with start and end points.
I used the samtools faidx ref.fa <region> command. This command gave me the forward strand sequence for that region.
In the samtools manual there is an option to extract reverse strand but I could not figure out how to use that.
Does anybody know how to run this command for reverse strand in samtools?
My regions are like:
LG2:124522-124572 (Forward)
LG3:250022-250072 (Reverse)
LG29:4822278-4822318 (Reverse)
LG12:2,595,915-2,596,240 (Forward)
LG16:5,405,500-5,405,828 (Reverse)
As you noticed, samtools has the option --reverse-complement (or -i) to output the sequence from the reverse strand.
As far as I know, samtools does not support a region notation which permits specifying the strand.
A quick solution would be to separate your region file into forward and reverse locations and run samtools twice.
The steps below are rather verbose, just so the steps are clear. It's fairly straight-forward to clean this up with process substitution in bash, for example.
# Separate the strand regions.
# Use grep and sed twice, or awk (below).
grep -F '(Forward)' regions.txt | sed 's/ (Forward)//' > forward-regions.txt
grep -F '(Reverse)' regions.txt | sed 's/ (Reverse)//' > reverse-regions.txt
# Above as an awk one-liner.
awk '{ strand = ($2 == "(Forward)") ? "forward" : "reverse"; print $1 > (strand "-regions.txt") }' regions.txt
# Run samtools, marking the strand as +/- in the FASTA output.
samtools faidx ref.fa -r forward-regions.txt --mark-strand sign -o forward-sequences.fa
samtools faidx ref.fa -r reverse-regions.txt --mark-strand sign -o reverse-sequences.fa --reverse-complement
# Combine the FASTA output to a single file.
cat forward-sequences.fa reverse-sequences.fa > sequences.fa
rm forward-sequences.fa reverse-sequences.fa
Just want to mention that you may need to update samtools to the latest version if you run into problems. In my case, samtools v1.2 didn't work, but v1.10 did.

Script for swapping subfolders recursively

I have been struggling with a problem for days.
I have a directory structure like:
Dir/Documents/PersonA/
Dir/Documents/PersonB/
Dir/Pictures/PersonA/
Dir/Pictures/PersonC/
Dir/Mails/PersonA/
Dir/Mails/PersonB/
And I would like to change it recursively with all content to:
Dir/PersonA/Documents/
Dir/PersonA/Pictures/
Dir/PersonA/Mails/
Dir/PersonB/Documents
Dir/PersonB/Mails
Dir/PersonC/Pictures
I tried some scripting, but got stuck because some of the dirs contain spaces, and because I am not good enough.....
In a shell (bash, sh, zsh, etc.), this could be done with a for loop:
for person in $(cd Dir/Documents/ ; ls); do
for d in Documents Pictures Mails; do
mv "Dir/$d/${person}" "Dir/${person}/$d"
done
done
EDIT: added double-quote around mv arguments
EDIT2: for isn't kind enough to deal with spaces. (+ add mkdir)
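# A glob plus quoted variables keeps names with spaces intact
# (parsing ls output and an unquoted $e do not).
for e in Dir/*/*; do
    persondir=$(printf '%s\n' "$e" | awk -F/ '{print $1 "/" $3}')
    subfolder=$(printf '%s\n' "$e" | awk -F/ '{print $1 "/" $3 "/" $2}')
    mkdir -p "$persondir"
    mv "$e" "$subfolder"
done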
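The same swap can be done with shell parameter expansion alone, without calling awk for every entry. This is a minimal sketch: the function name swap_levels is hypothetical, it only moves directories, and it leaves the emptied category directories (Dir/Documents and so on) in place.

```shell
# Hypothetical helper: swap_levels DIR
# Moves every DIR/<category>/<person> to DIR/<person>/<category>.
swap_levels() {
    local root=$1 entry person category
    # The glob is expanded once, before anything is moved, so directories
    # created by the loop itself are not visited.
    for entry in "$root"/*/*/; do
        [ -d "$entry" ] || continue     # the glob matched nothing
        entry=${entry%/}                # drop the trailing slash
        person=${entry##*/}             # last path component
        category=${entry%/*}            # parent directory ...
        category=${category##*/}        # ... reduced to its name
        mkdir -p "$root/$person"
        mv "$entry" "$root/$person/$category"
    done
}

# Usage:
#   swap_levels Dir
#   rmdir Dir/Documents Dir/Pictures Dir/Mails   # succeeds only if they are empty
```

Run it on a copy first; running it a second time on the already swapped tree would swap the levels back.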

Makefile variable not set from grep output

I am trying to set the variable COGLINE to the output of my grep line (which searches my config.json file for the regex "cogs"). When I execute the grep line it correctly outputs the proper line number, but when I echo the variable it comes up blank.
COGLINE = $(grep -n \"cogs\" ~/Desktop/Repos/pronghorn/config.json | cut -f1 -d:)
all:
	grep -n \"cogs\" ~/Desktop/Repos/pronghorn/config.json | cut -f1 -d:
	echo $(COGLINE)
Here is the output:
GlennMBP:test glenn$ make all
grep -n \"cogs\" ~/Desktop/Repos/pronghorn/config.json | cut -f1 -d:
2
echo
You can see that the line number is properly found as "2", but the variable comes up blank as if it were not set. What am I doing wrong?
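grep is not a make function. In a makefile, $(...) is make's own variable and function expansion, not shell command substitution, so that COGLINE = line is a make assignment whose value refers to an undefined variable and expands to nothing.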
You either need to use
COGLINE := $(shell grep -n \"cogs\" ~/Desktop/Repos/pronghorn/config.json | cut -f1 -d:)
if you want that run at make parse time and want it in a make variable.
Or
all:
	COGLINE=$$(grep -n \"cogs\" ~/Desktop/Repos/pronghorn/config.json | cut -f1 -d:); \
	echo "$${COGLINE}"
to run it when the recipe for all executes and have the result in a shell variable.
There are middle grounds as well but those are the two basic ideas.