Extract a user-specified sequence from the reverse strand of a FASTA file using samtools

I have a list of regions with start and end points.
I used the samtools faidx ref.fa <region> command. This command gave me the forward strand sequence for that region.
In the samtools manual there is an option to extract reverse strand but I could not figure out how to use that.
Does anybody know how to run this command for reverse strand in samtools?
My regions are like:
LG2:124522-124572 (Forward)
LG3:250022-250072 (Reverse)
LG29:4822278-4822318 (Reverse)
LG12:2,595,915-2,596,240 (Forward)
LG16:5,405,500-5,405,828 (Reverse)

As you noticed, samtools has the option --reverse-complement (or -i) to output the sequence from the reverse strand.
As far as I know, samtools does not support a region notation which permits specifying the strand.
A quick solution would be to separate your region file into forward and reverse locations and run samtools twice.
The steps below are rather verbose, just so the steps are clear. It's fairly straight-forward to clean this up with process substitution in bash, for example.
# Separate the strand regions.
# Use grep and sed twice, or awk (below).
grep -F '(Forward)' regions.txt | sed 's/ (Forward)//' > forward-regions.txt
grep -F '(Reverse)' regions.txt | sed 's/ (Reverse)//' > reverse-regions.txt
# Above as an awk one-liner.
awk '{ strand=($2 == "(Forward)") ? "forward" : "reverse"; print $1 > (strand "-regions.txt") }' regions.txt
# Run samtools, marking the strand as +/- in the FASTA output.
samtools faidx ref.fa -r forward-regions.txt --mark-strand sign -o forward-sequences.fa
samtools faidx ref.fa -r reverse-regions.txt --mark-strand sign -o reverse-sequences.fa --reverse-complement
# Combine the FASTA output to a single file.
cat forward-sequences.fa reverse-sequences.fa > sequences.fa
rm forward-sequences.fa reverse-sequences.fa
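If the original order of the regions matters, the same idea could be sketched as one samtools call per region instead of two batch runs. The helper name faidx_commands below is hypothetical, and it only prints the commands rather than running them. It assumes the two-column "region (Strand)" layout from the question and that neither the reference path nor the region names contain shell metacharacters.

```shell
# Hypothetical helper (not part of samtools): print one "samtools faidx"
# command per region, adding --reverse-complement only for regions whose
# second column is "(Reverse)".
# Usage: faidx_commands ref.fa regions.txt
# Assumption: the reference path and region names need no shell quoting.
faidx_commands() {
    awk -v ref="$1" '
        NF >= 2 {
            flag = ($2 == "(Reverse)") ? " --reverse-complement" : ""
            printf "samtools faidx --mark-strand sign%s %s %s\n", flag, ref, $1
        }' "$2"
}
```

Piping the printed commands to sh, for example faidx_commands ref.fa regions.txt | sh > sequences.fa, would then write the sequences in the same order as the region list.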

I just want to mention that you may need to update samtools to the latest version if you run into problems. In my case, samtools v1.2 didn't work, but v1.10 did.

Related

Delete everything before first pattern match with sed/awk

Let's say I have a line looking like this:
/Users/random/354765478/Tests/StoreTests/Base64Tests.swift
In this example, I would like to get the result:
Tests/StoreTests/Base64Tests.swift
How can I delete everything before the first pattern match (either Sources or Tests) using sed or awk?
I am using sed 's/^.*\(Tests.*\).*$/\1/' right now but it's failing:
echo '/Users/random/354765478/Tests/StoreTests/Base64Tests.swift' | sed 's/^.*\(Tests\)/\1/'
Tests.swift
Here's another example using Sources (which seems to work):
echo '/Users/random/741672469/Sources/Store/StoreDataSource.swift' | sed 's/^.*\(Sources\)/\1/'
Sources/Store/StoreDataSource.swift
I would like to keep everything from the first, not the last, Sources or Tests pattern match onward.
Any help would be appreciated!
How can I do if I want to get everything before the first pattern match (either Sources or Tests).
Easier to use a grep -o here:
grep -Eo '(Sources|Tests)/.*' file
Tests/StoreTests/Base64Tests.swift
Sources/Store/StoreDataSource.swift
# where input file is
cat file
/Users/random/354765478/Tests/StoreTests/Base64Tests.swift
/Users/random/741672469/Sources/Store/StoreDataSource.swift
Breakdown:
Regex pattern (Sources|Tests)/.* match any text that starts with Sources/ or Tests/ until end of the line.
-E: enables extended regex mode
-o: prints only matched text instead of full line
Alternatively you may use this awk as well:
awk 'match($0, /(Sources|Tests)\/.*/) {
print substr($0, RSTART)
}' file
Tests/StoreTests/Base64Tests.swift
Sources/Store/StoreDataSource.swift
Or this sed:
sed -E 's~.*/((Sources|Tests)/.*)~\1~' file
Tests/StoreTests/Base64Tests.swift
Sources/Store/StoreDataSource.swift
With your shown samples, please try the following GNU grep command. It finds the very first occurrence of /Sources or /Tests and prints everything from that string to the end of the line.
grep -oP '^.*?\/\K(Sources|Tests)\/.*' Input_file
Using sed
$ sed -E 's~([^/]*/)+((Tests|Sources).*)~\2~' input_file
Tests/StoreTests/Base64Tests.swift
would like to get everything before the first, and not the last Sources or Tests pattern match.
The first thing is to understand why this happens. You are using
sed 's/^.*\(Tests.*\).*$/\1/'
Observe that * is greedy, i.e. it matches as much as possible, so the leading .* always runs up to the last Tests. A non-greedy match would find the first Tests, but sed does not support that. If you are on Linux, there is a good chance you have the perl command, which does. Let the content of file.txt be
/Users/random/354765478/Tests/StoreTests/Base64Tests.swift
then
perl -p -e 's/^.*?(Tests.*)$/\1/' file.txt
gives output
Tests/StoreTests/Base64Tests.swift
Explanation: -p -e engages a sed-like mode. The changes to the regular expression are: parentheses no longer need escaping, the first .* (greedy) became .*? (non-greedy), and the last .* was dropped as superfluous (the capturing group already extends to the end of the line).
(tested in perl 5, version 30, subversion 0)
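Where perl is not available, a non-greedy match can be approximated in sed itself by marking the first match and then cutting at the marker. The following is a minimal sketch rather than a definitive solution: the function name from_first_match is hypothetical, and it assumes a sed that supports -E (GNU and BSD sed both do).

```shell
# Keep everything from the first /Sources/ or /Tests/ path component onward.
# Step 1: replace the leftmost "/Sources/" or "/Tests/" with a newline
#         followed by the component name and a slash.
# Step 2: delete everything up to and including that newline.
# Lines without a match pass through unchanged.
from_first_match() {
    sed -E 's#/(Sources|Tests)/#\
\1/#; s#^.*\n##'
}

printf '%s\n' '/Users/random/354765478/Tests/StoreTests/Base64Tests.swift' | from_first_match
```

This works because s replaces only the leftmost match by default, and a newline is a safe marker since it cannot already occur inside a line that sed has read.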

How to place quotes mark in ansible task with grep, awk, sed

My task searches the CMD column for the config file, in order to find the application's config directory and its PID.
---
- hosts: all
  pre_tasks:
    - name: Check if process is running
      become: yes
      shell: 'ps -e --format="pid cmd" | grep process.cfg | sed -e "s/[[:space:]]\+/ /g"| grep -v color'
      register: proces_out
The output of this command looks like this:
32423 /var/local/bin/application -c /var/local/etc/process.cfg
But I think Ansible has trouble with two greps in one command. I need both because without the inverted "grep -v color" the annoying "grep --color=auto " line appears, and then I can't cut out the PID that I need in another task (which kills the process), because the real process is on the second line.
My second idea was to use awk, which I think would be the best tool for this case. But if I use double quotation marks in the --format parameter and in the sed command, and single quotation marks around the awk program, they don't cooperate. Even when I keep them balanced, they interfere with each other.
AWK idea:
shell: 'ps -e --format="pid cmd" | grep process.cfg | sed -e "s/[[:space:]]\+/ /g"| awk 'FNR == 2''
I would like a hint on the best way to avoid these quoting conflicts, so that I can use the output afterwards in variables:
## PID
{{ proces_out.stdout.split(' ')[0] }}
## application
{{ proces_out.stdout.split(' ')[1] }}
## config
{{ proces_out.stdout.split(' ')[3] }}
But I think Ansible has trouble with two greps in one command
That is certainly not true.
without the inverted "grep -v color" the annoying "grep --color=auto " line appears, and then I can't cut out the PID that I need in another task (which kills the process), because the real process is on the second line.
You are running into the classic case of the grep process matching its own regex, as will happen in a lot of "simple" cases. What you want is a regex that matches your string but does not match itself. In that example above it would be:
shell: 'ps -e --format="pid cmd" | grep "process[.]cfg" | sed -e "s/[[:space:]]\+/ /g"'
This works because process[.]cfg matches process.cfg but does not match process[.]cfg. I also fixed your regex: in a regex, . means any character, which doesn't appear to be what you really wanted.
As for the --color bit, you can likely side-step that nonsense by using the full path to grep, which makes bash execute the binary itself rather than some alias that adds --color=auto. I actually wouldn't have expected the colors to show up in an Ansible run, because it's not the right $TERM, but systems are weird.
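The self-match trick can be checked without a running process. The sketch below feeds grep a canned stand-in for the ps output; the function name sample_ps and the second PID are made up for illustration.

```shell
# Canned stand-in for "ps -e --format='pid cmd'" output.
# The second line imitates grep's own entry; its PID is made up.
sample_ps() {
    cat <<'EOF'
32423 /var/local/bin/application -c /var/local/etc/process.cfg
32500 grep process[.]cfg
EOF
}

# The regex process[.]cfg matches only the literal text "process.cfg",
# so the line showing grep's own command line is filtered out.
sample_ps | grep 'process[.]cfg'
```

Only the application line should be printed, because the text "process[.]cfg" in the second line has a "[" where the regex requires a literal dot.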
Thank you, Matthew, for that solution, but I found a different option to avoid the unnecessary output.
The syntax is almost the same, but I added an additional field to --format: ppid, the parent process ID. In most cases, I believe, the real process has a parent process ID of 1 in the output, which makes it easy to pick the line I want.
It looks like this:
shell: >
  ps -e --format="ppid pid cmd" |
  grep process.cfg |
  sed -e "s/[[:space:]]\+/ /g"
register: output_process
And output looks like this:
1 54345 /var/local/bin/application -c /var/local/etc/process.cfg
6435 6577 grep --color=auto process.cfg
Now it's easy: we can pick out the PID in Ansible:
- name: Kill process
  become: yes
  shell: "kill {{ output_process.stdout_lines[0].split(' ')[2] }}"
What does it do? It selects line 0 (the first line), splits it on spaces and takes the third element. The output has a leading space before the ppid, which is why the PID is the third element.
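Because awk splits fields on runs of whitespace and ignores leading blanks, the sed step and the index offset could arguably be dropped. The following is only a sketch on canned output: the function name pid_of_config is hypothetical, and it keeps the assumption from above that the real process has a parent process ID of 1.

```shell
# Print the PID (second field) of the first line whose PPID is 1 and
# whose command line mentions process.cfg. awk trims leading blanks,
# so the PID is always field 2 regardless of column padding.
pid_of_config() {
    awk '$1 == 1 && /process[.]cfg/ { print $2; exit }'
}

# Canned stand-in for: ps -e --format="ppid pid cmd"
printf '%s\n' \
    '    1 54345 /var/local/bin/application -c /var/local/etc/process.cfg' \
    ' 6435  6577 grep --color=auto process.cfg' | pid_of_config
```

In a real task the function body would sit at the end of the ps pipeline instead of grep and sed.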
Thank you again for your solution, Matthew; it might be helpful in another case.

Using grep, awk and sed in a one-line command results in a "No such file or directory" error

..And I know why:
I have an XML document with lots of information inside. I need to extract what I need and eventually print it to a new file.
The XML (well, part of it; the rows just keep repeating):
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toMartha/"
errordir="%home%/../../../home/samba/user/Outbound/toMartha/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Martha"
interval="600"
defaults="sender.name=me_myself, receiver.name=Martha"
sendfilename="true"
mimetype="application/standard"/>
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toJosh/"
errordir="%home%/../../../home/samba/user/Outbound/toJosh/error"
sentdir="%home%/../../../home/samba/user/data/Sent/Josh"
interval="600"
defaults="sender.name=me_myself, receiver.name=Josh"
sendfilename="true"
mimetype="application/standard"/>
<module classname="org.openas2.processor.receiver.AS2DirectoryPollingModule"
outboxdir="%home%/../../../home/samba/user/Outbound/toPamela/"
errordir="%home%/../../../home/samba/user/Outbound/toPamela/error"
interval="600"
defaults="sender.name=me_myself, receiver.name=Pamela"
sendfilename="true"
mimetype="application/standard"/>
I need to extract the folder after "Outbound" and clean it from quotes or slashes.
Also, I need to exclude the "/error" so I get only 1 result for each of them.
My command is:
grep -o -v "/error" "Outbound/" config.xml | awk -F"Outbound/" '{print $2}' | sed -e "s/\/\"//g" > /tmp/sync_users
The error is: grep: Outbound/: No such file or directory which of course means that I'm giving to grep too many arguments (?) - If i remove the -v "/error" it would work but would print also the names with "/error".
Can someone help me?
EDIT:
As some pointed out in their example (thanks for the time you put in), I'd need to extract these words based on the sample above:
toMartha
toJosh
toPamela
It could be interesting to use sed in this case:
sed -e '\#/Outbound/#!d' -e '\#/error"$#d' -e 's#.*/Outbound/##;s#/\{0,1\}"$##' Config.xml
An awk version, assuming (for the final print) that the folder is always exactly one level below Outbound, as shown:
awk -F '/' '$0 !~ /\/Outbound\// || /\/error"$/ {next} {print $(NF-1)}' Config.xml
Lose the grep altogether:
$ awk '/outboxdir/{gsub(/^.+Outbound\/|\/" *\r?$/,""); print}' file
toMartha
toJosh
toPamela
/outboxdir/ only processes records that contain outboxdir
gsub removes the unwanted parts of the record
Also added removal of trailing spaces at the end of the record and a CRLF fix for files that originate on Windows
To give grep multiple patterns, they have to be separated by newlines or supplied through repeated pattern options (-e, -f, ...). However, -v inverts the match as a whole; you can't invert only one pattern.
For what you're after, you can use PCRE (the -P option) for its lookaround ability:
grep -o -P '(?<=Outbound\/)[^\/]+(?!.*\/error)' config.xml
The regex tries to
match at least one non-slash character: [^\/]+
preceded by Outbound/: the positive lookbehind (?<=Outbound\/)
and not followed by anything that contains /error: the negative lookahead (?!.*\/error)
With your first sample input:
$ grep -o -P '(?<=Outbound\/)[^\/]+(?!.*\/error)' test.txt
toMartha
toJosh
toPamela
How about:
grep -i "outbound" your_file | awk -F"Outbound/" '{print $2}' | sed -e 's/error//' -e 's/\/\"//' | uniq
Should work :)
You can use match in gawk with a capturing group in the regex:
awk 'match($0, /^.*\/Outbound\/([^\/]+)\/([^\/]*)\/?"$/, a){
if(a[2]!="error"){print a[1]}
}' config.xml
You get:
toMartha
toJosh
toPamela
grep can accept multiple patterns with the -e option (aka --regexp, even though it can be used with --fixed-strings too, go figure). However, -v (--invert-match) applies to all of the patterns as a group.
Another solution would be to chain two calls to grep:
grep -v "/error" config.xml | grep "Outbound/" | awk -F"Outbound/" '{print $2}' | sed -e "s/\/\"//g"
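The difference between a group-wide -v and a chained grep can be seen on a few lines taken from the question's sample. This is a small sketch; the function names sample_config, neither and outbound_only are hypothetical.

```shell
# Three lines from the question's XML sample, used as canned input.
sample_config() {
    cat <<'EOF'
outboxdir="%home%/../../../home/samba/user/Outbound/toMartha/"
errordir="%home%/../../../home/samba/user/Outbound/toMartha/error"
interval="600"
EOF
}

# -v with two -e patterns keeps only lines that match NEITHER pattern.
neither() { sample_config | grep -v -e '/error' -e 'Outbound/'; }

# Chaining inverts "/error" alone, then keeps the Outbound/ lines.
outbound_only() { sample_config | grep -v '/error' | grep 'Outbound/'; }

neither
outbound_only
```

The first call should print only the interval line, while the second should print only the outboxdir line, which is the one the awk and sed stages need.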

How to access an online txt file with AWK?

I would like to use an online database instead of a local file in AWK.
For instance:
awk 'END{print NR}' somelocalfile.txt
returns number of lines inside the file.
Now my question is: how can I count the number of lines in an online txt file like this one? I would prefer a one-liner.
I can wget the file and then run the awk command on it locally, but I think there may be a more efficient approach.
I would suggest using wget:
wget -qO - http://path.com/tofile.txt | awk 'END{print NR}'
-q means quiet, so wget prints nothing to the terminal. -O sets the output file, and '-O -' sends the download to stdout.
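The pipe works because awk reads standard input when it is given no file operand. A minimal sketch of that behaviour, with printf standing in for the wget download (the function name count_lines is hypothetical):

```shell
# Count the lines arriving on standard input; awk falls back to stdin
# because no file name is passed.
count_lines() {
    awk 'END { print NR }'
}

# printf stands in for: wget -qO - <url>
printf 'first\nsecond\nthird\n' | count_lines
```

The same function could sit at the end of the wget pipeline in place of the inline awk program.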

Remove character from awk result

I'm running an awk command that prints output with a ":" in the result. How can I remove that? Is there a way to do it all in one awk command?
The command I'm running is:
fdisk -l | awk '/Disk/{print $2}'
Which gives:
/dev/sda:
Thanks
This should do the trick:
fdisk -l | awk -F'[ :]+' '/^Disk \// {print $2}'
/dev/sda
Explanation:
-F'[ :]+' sets the field separator to one or more spaces or colons.
And I match /^Disk \/, to prevent some false positives (the forward slash needs to be escaped by a backslash).
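The effect of the separator and the anchored pattern can be tried on canned text without root access. In this sketch the function name disk_names is hypothetical and the sample lines only imitate the general shape of fdisk -l output; they are not real output.

```shell
# Print the device name from "Disk /dev/...:" header lines.
# The separator [ :]+ swallows the trailing colon, so $2 is the bare path.
disk_names() {
    awk -F'[ :]+' '/^Disk \// { print $2 }'
}

# Made-up lines imitating fdisk -l output.
printf '%s\n' \
    'Disk /dev/sda: 500.1 GB, 500107862016 bytes' \
    'Disk identifier: 0x000a1b2c' \
    '/dev/sda1   *   2048   1026047   512000   83  Linux' | disk_names
```

The "Disk identifier" line shows the kind of false positive that a bare /Disk/ pattern would also pick up.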
For a list of all /dev/{disks}, you can try lsblk with -o {flags}, and you don't need to be superuser either. N.B. PATH is a column header.
lsblk -o PATH
That gives you all disks and partitions (including loop devices), as you would see from "fdisk -l".
lsblk -o {flags} can give you a lot more information. Try this one for fun (and see Google or the man page for more):
lsblk -o NAME,LABEL,PATH,MOUNTPOINT