How to add double quotes in a specific column - awk

How to add double quotes to the gene_id?
My original file:
##gtf-version 3
Bany_Scaf1 maker gene 201136 207903 . + . Alias "maker-Bany_Scaf1-snap-gene-2.23"; Dbxref "InterPro:IPR019774" "Pfam:PF00351"; ID Bany_03723; Name Bany_03723; Ontology_term "GO:0016714" "GO:0055114"; gene_id Bany_03723
Bany_Scaf1 maker transcript 201136 207903 . + . Alias "maker-Bany_Scaf1-snap-gene-2.23-mRNA-1"; Dbxref "InterPro:IPR019774" "Pfam:PF00351"; ID "Bany_03723-RA"; Name "Bany_03723-RA"; Ontology_term "GO:0016714" "GO:0055114"; Parent Bany_03723; _AED "0.06"; _QI "45|1|1|1|1|1|7|425|530"; _eAED "0.06"; gene_id Bany_03723; original_biotype mrna; transcript_id "Bany_03723-RA"
Bany_Scaf1 maker exon 201136 201304 . + . ID "Bany_03723-RA:1"; Parent "Bany_03723-RA"; gene_id Bany_03723; transcript_id "Bany_03723-RA"
Bany_Scaf1 maker exon 202687 202770 . + . ID "Bany_03723-RA:2"; Parent "Bany_03723-RA"; gene_id Bany_03723; transcript_id "Bany_03723-RA"
Bany_Scaf1 maker exon 202886 202921 . + . ID "Bany_03723-RA:3"; Parent "Bany_03723-RA"; gene_id Bany_03723; transcript_id "Bany_03723-RA"
Bany_Scaf1 maker exon 203004 203820 . + . ID "Bany_03723-RA:4"; Parent "Bany_03723-RA"; gene_id Bany_03723; transcript_id "Bany_03723-RA"
Bany_Scaf1 maker exon 206097 206223 . + . ID "Bany_03723-RA:5"; Parent "Bany_03723-RA"; gene_id Bany_03723; transcript_id "Bany_03723-RA"
Bany_Scaf1 maker exon 206649 206878 . + . ID "Bany_03723-RA:6"; Parent "Bany_03723-RA"; gene_id Bany_03723; transcript_id "Bany_03723-RA"
Bany_Scaf1 maker exon 207304 207903 . + . ID "Bany_03723-RA:7"; Parent "Bany_03723-RA"; gene_id Bany_03723; transcript_id "Bany_03723-RA"
I hope to change all gene_id Bany_xxxxxx to gene_id "Bany_xxxxxx".
I tried this:
sed -E 's#(Parent|gene_id|ID) ([0-9A-Za-z.]+)#\1 \"\2\"#g'
But the double quotes were added in the wrong place, like:
gene_id "Bany"_03723
What should I do...

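Your attempt fails because the character class [0-9A-Za-z.] does not include the underscore, so the match stops at Bany. With sed:
$ sed -E 's/gene_id ([^;]+)/gene_id "\1"/' file
This takes everything after gene_id up to the next ; (or the end of the line) and wraps it in double quotes. It assumes a single space between gene_id and the value; if they are separated by a tab, change the space to \t.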
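For comparison, the same substitution can be sketched in Python with re.sub. The helper name quote_gene_id is made up for illustration, and the pattern assumes the gene_id value contains no whitespace, double quotes, or semicolons.

```python
import re

def quote_gene_id(line: str) -> str:
    """Wrap an unquoted gene_id value in double quotes."""
    # [^\s";]+ cannot start on a quote, so values that are already
    # quoted are left untouched.
    return re.sub(r'(gene_id\s+)([^\s";]+)', r'\1"\2"', line)

if __name__ == "__main__":
    print(quote_gene_id('Parent "Bany_03723-RA"; gene_id Bany_03723; transcript_id "Bany_03723-RA"'))
```

Only gene_id is touched here; ID, Name, and Parent keep whatever quoting they already have.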

Python chess: Check for passed pawns

In a chess position, I wish to check whether any passed pawn exists for white.
Is it possible to do so using the python-chess library? If not, how can I implement it?
def checkForPassedPawn(position: chess.Board, side_to_move: chess.Color):
    # ... check for passed pawn
    # return a boolean value
I could not find any built-in method that detects passed pawns.
You'll have to look at the pawn positions yourself. There are many ways to do that. For instance, you could take the board's string representation as a starting point:
r n b q k b n r
p p . . . p p p
. . . . . . . .
. . p P p . . .
. . . . . P . .
. . . . . . . .
P P P P . . P P
R N B Q K B N R
This is the kind of string you get with str(position).
Then you could put each column in a separate list:
lines = str(position).replace(" ", "").splitlines()
columns = list(zip(*lines))
This gives you:
[
('r', 'p', '.', '.', '.', '.', 'P', 'R'),
('n', 'p', '.', '.', '.', '.', 'P', 'N'),
('b', '.', '.', 'p', '.', '.', 'P', 'B'),
('q', '.', '.', 'P', '.', '.', 'P', 'Q'),
('k', '.', '.', 'p', '.', '.', '.', 'K'),
('b', 'p', '.', '.', 'P', '.', '.', 'B'),
('n', 'p', '.', '.', '.', '.', 'P', 'N'),
('r', 'p', '.', '.', '.', '.', 'P', 'R')
]
If the current player is white, you can then take the leftmost "P" in each tuple (the most advanced white pawn on that file) and check that no "p" appears further left, either in the same tuple, the previous one, or the next one. If none does, that pawn is passed.
For the black player the logic is the same, and it is useful to reverse the tuples first so that the most advanced black pawn again comes first.
Here is an implementation of that idea:
import chess
def checkForPassedPawn(position: chess.Board, side_to_move: chess.Color):
    selfpawn = "pP"[side_to_move]
    otherpawn = "Pp"[side_to_move]
    lines = str(position).replace(" ", "").splitlines()
    if side_to_move == chess.BLACK:
        lines.reverse()
    # turn rows into columns and vice versa
    columns = list(zip(*lines))
    for colnum, col in enumerate(columns):
        if selfpawn in col:
            # index of the most advanced own pawn on this file
            rownum = col.index(selfpawn)
            if (otherpawn not in col[:rownum]
                    and (colnum == 0 or otherpawn not in columns[colnum-1][:rownum])
                    and (colnum == 7 or otherpawn not in columns[colnum+1][:rownum])):
                # rows run from rank 8 down for white, from rank 1 up for black
                rank = 8 - rownum if side_to_move == chess.WHITE else rownum + 1
                return f"{'abcdefgh'[colnum]}{rank}"
    return None  # no passed pawn found
position = chess.Board()
position.push_san("e4")
position.push_san("d5")
position.push_san("f4")
position.push_san("e5")
position.push_san("exd5")
position.push_san("c5") # Now white pawn at d5 is a passed pawn
print(position)
passedpawn = checkForPassedPawn(position, chess.WHITE)
print("passed white pawn:", passedpawn)
position.push_san("d4")
position.push_san("e4") # Now black pawn at e4 is a passed pawn
print(position)
passedpawn = checkForPassedPawn(position, chess.BLACK)
print("passed black pawn:", passedpawn)
Output:
r n b q k b n r
p p . . . p p p
. . . . . . . .
. . p P p . . .
. . . . . P . .
. . . . . . . .
P P P P . . P P
R N B Q K B N R
passed white pawn: d5
r n b q k b n r
p p . . . p p p
. . . . . . . .
. . p P . . . .
. . . P p P . .
. . . . . . . .
P P P . . . P P
R N B Q K B N R
passed black pawn: e4
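If you need every passed pawn rather than the first one found, the same idea can be sketched directly on the text diagram. This is only an illustration: passed_pawns is a made-up helper, and it assumes its input is the 8-line diagram shown above (rank 8 first, squares separated by spaces), so it does not need the library at all.

```python
def passed_pawns(board_text: str, white: bool) -> list:
    """Return the squares (e.g. 'd5') of all passed pawns for one side."""
    rows = [line.split() for line in board_text.strip().splitlines()]
    own, other = ("P", "p") if white else ("p", "P")
    result = []
    for r, row in enumerate(rows):
        for c, piece in enumerate(row):
            if piece != own:
                continue
            # Rows the pawn still has to cross: above it for white, below it for black.
            ahead = rows[:r] if white else rows[r + 1:]
            # Same file plus the adjacent files, clipped to the board edges.
            files = range(max(c - 1, 0), min(c + 1, 7) + 1)
            if all(a[f] != other for a in ahead for f in files):
                result.append("abcdefgh"[c] + str(8 - r))
    return result
```

With a python-chess board you would call it as passed_pawns(str(position), white=True); bool(...) of the result gives the yes/no answer asked for in the question.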

Why does pddl find the solution incomplete

Below is a small part of a river crossing problem written in pddl. I tried to find the solution in two different tools (editor.planning.domains and stripsfiddle.herokuapp.com) but both of them gave the same result.
;domain;
(define (domain RiverCrossing)
(:requirements :strips :typing)
(:types
Farmer Fox - passengers
)
(:predicates
(onLeftBank ?p - passengers)
(onRightBank ?p - passengers)
)
(:action crossRiverLR
:parameters (?f - Farmer)
:precondition ( and (onLeftBank ?f))
:effect( and (onRightBank ?f) )
)
(:action crossRiverRL
:parameters (?f - Farmer)
:precondition ( and (onRightBank ?f))
:effect( and (onLeftBank ?f) )
)
(:action crossRiverLRf
:parameters ( ?fx - Fox ?f - Farmer)
:precondition ( and (onLeftBank ?f) (onLeftBank ?fx) )
:effect( and (onRightBank ?fx) (onRightBank ?f) )
)
(:action crossRiverRLf
:parameters (?f - Farmer ?fx - Fox)
:precondition ( and (onRightBank ?f) (onRightBank ?fx) )
:effect( and (onLeftBank ?f) (onLeftBank ?fx) )
)
)
Problem
(define (problem RCP)
(:domain RiverCrossing)
(:objects
farmer - Farmer
fox - Fox
)
(:init
(onRightBank farmer) (onLeftBank fox)
)
(:goal
(and
(onLeftBank farmer) (onRightBank fox)
)
)
)
Both planners give the same result; the farmer does not end up on the left bank:
Solution found in 2 steps!
1. crossRiverRL farmer
2. crossRiverLRf fox farmer
Can anyone help me figure out the point I am missing?
Thanks in advance,
I figured out the problem: the effects set the new situation (onRightBank) but never negate the previous one (onLeftBank), so the planner can treat the farmer as being on both banks at once.
Below is a sample of the correction, which I applied to all effects:
(:action crossRiverLR
:parameters (?f - Farmer)
:precondition ( and (onLeftBank ?f))
:effect( and (onRightBank ?f)
(not (onLeftBank ?f)) ; **** adding this solved the problem. ****
)
)
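To see why the missing negative effects matter, here is a small sketch that replays the reported plan by hand. It is not how any planner is implemented; apply_action and the tuple encoding of facts are assumptions made for illustration.

```python
def apply_action(state, pre, add, delete=()):
    """Apply a STRIPS-style action: check preconditions, drop delete effects, add add effects."""
    assert set(pre) <= state, "precondition not satisfied"
    return (state - set(delete)) | set(add)

L, R = "onLeftBank", "onRightBank"
init = {(R, "farmer"), (L, "fox")}
goal = {(L, "farmer"), (R, "fox")}

# Original domain: effects only add facts, so old facts are never removed.
s = apply_action(init, pre=[(R, "farmer")], add=[(L, "farmer")])          # crossRiverRL farmer
s = apply_action(s, pre=[(L, "farmer"), (L, "fox")],
                 add=[(R, "fox"), (R, "farmer")])                         # crossRiverLRf fox farmer
goal_met_without_deletes = goal <= s   # True: the farmer is on both banks

# Corrected domain: each crossing also deletes the old bank fact.
t = apply_action(init, pre=[(R, "farmer")], add=[(L, "farmer")], delete=[(R, "farmer")])
t = apply_action(t, pre=[(L, "farmer"), (L, "fox")],
                 add=[(R, "fox"), (R, "farmer")],
                 delete=[(L, "fox"), (L, "farmer")])
goal_met_with_deletes = goal <= t      # False: the farmer still has to cross back

# One more crossRiverRL farmer reaches the goal in the corrected domain.
t = apply_action(t, pre=[(R, "farmer")], add=[(L, "farmer")], delete=[(R, "farmer")])
goal_met_after_third_step = goal <= t  # True
```

In the original domain the two-step plan really does satisfy the goal as written, which is why both tools reported it as a solution.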

print line + next two lines with awk if next two lines matches

I have a file that has an entry for a transcript and then the following line(s) are the associated exons. Sometimes this may be one exon and so one subsequent line, sometimes there are 'n' exons and so 'n' subsequent lines like so :
1 Cufflinks transcript 63846957 63847511
1 Cufflinks exon 63846957 63847511
1 Cufflinks transcript 63851691 63852040
1 Cufflinks exon 63851691 63852040
2 Cufflinks transcript 8442356 8443964
2 Cufflinks exon 8442356 8442368
2 Cufflinks exon 8443768 8443964
2 Cufflinks exon 8444000 8444578
2 Cufflinks transcript 8258988 8259803
2 Cufflinks exon 8258988 8259271
2 Cufflinks exon 8259370 8259803
I would like to print the transcript and its associated exon lines only if exactly two exons follow the transcript. For this example, only the last three lines would be extracted (one transcript line and two exon lines).
How can this be done with awk?
You can save up lines in an array, then print them once you are sure about the number of exons.
#!/usr/bin/awk -f
BEGIN {
number_of_exons = 0;
}
END {
print_if_two_exons();
}
$3 == "transcript" {
print_if_two_exons();
transcript = $0;
}
$3 == "exon" {
exons[number_of_exons++] = $0;
}
function print_if_two_exons() {
if (transcript && number_of_exons == 2) {
print transcript;
for (i = 0; i < number_of_exons; i++) {
print exons[i];
}
}
delete exons;
number_of_exons = 0;
}
Output:
2 Cufflinks transcript 8258988 8259803
2 Cufflinks exon 8258988 8259271
2 Cufflinks exon 8259370 8259803
$ cat tst.awk
/transcript/ { prt() }
{ buf = buf $0 ORS; ++cnt }
END { prt() }
function prt() {
if ( cnt == 3 ) {
printf "%s", buf
}
buf = ""
cnt = 0
}
$ awk -f tst.awk file
2 Cufflinks transcript 8258988 8259803
2 Cufflinks exon 8258988 8259271
2 Cufflinks exon 8259370 8259803
$ cat awk-script
function set_all(s,t,e) {
exon=e;tran=t;str=s
}
/transcript/{set_all($0,1,0)}
/exon/{
if(tran){
if(exon<2)
set_all(str"\n"$0,tran,exon+1)
else
set_all("",0,0)
} else
set_all("",0,0)
}
END {
print str
}
$ awk -f awk-script file
2 Cufflinks transcript 8258988 8259803
2 Cufflinks exon 8258988 8259271
2 Cufflinks exon 8259370 8259803
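A very straightforward method, explained as follows:
The variables exon and tran track the number of consecutive exon lines and whether a transcript line has been seen, respectively.
The function set_all sets the values of str, exon, and tran in one call.
Because str is printed only in END, the script reports just the last transcript group in the file.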
You can use a PCRE to do this.
In ruby:
$ ruby -e 'buf=$<.read
buf.scan(/.*transcript.*\n+.*exon.*\n.*exon.*\n(?=(?:.*transcript)|\z)/)
.each { |m| puts m }'
2 Cufflinks transcript 8258988 8259803
2 Cufflinks exon 8258988 8259271
2 Cufflinks exon 8259370 8259803
Perl:
$ perl -0777 -lane 'while (/(.*transcript.*\n+.*exon.*\n+.*exon.*\n+)(?=(?:.*transcript)|\z)/g) {print $1;}' file
Similar in Python, GNU grep, etc
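A rough Python version of the same regex, as a sketch: it uses ^ with re.MULTILINE and \Z for end of input (Python's re has traditionally used \Z where Perl uses \z). The helper name two_exon_blocks is made up, and the pattern assumes the input ends with a newline.

```python
import re

# transcript line, two exon lines, then either another transcript or end of input
PATTERN = re.compile(
    r'^.*transcript.*\n+.*exon.*\n+.*exon.*\n+(?=.*transcript|\Z)',
    re.MULTILINE,
)

def two_exon_blocks(text: str) -> list:
    """Return every transcript block that is followed by exactly two exon lines."""
    # No capturing groups, so findall returns the full matched blocks.
    return PATTERN.findall(text)
```

Call it on the whole file contents, for example two_exon_blocks(open("file").read()).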

awk to update value in field of out file using contents of another

In the out.txt below I am trying to use awk to update the contents of $9. The out.txt is created by the awk before the pipe |. If $9 contains a + or -, then $8 of out.txt is used as a key to look up $2 of file2. When a match is found (there will always be one), the $3 value of that file2 line is appended to $9 of out.txt, separated by a :. So the original +6 in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors after the | that I cannot seem to fix. Thank you :).
out.txt tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
file2 space-delimited
2 ISG15 NM_005101.3 948846-948956 949363-949919
desired output tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3 . . .
Description
lines 1, 3, 5: `$9` updated with `:` and the value of `$3` in `file2`
lines 2 and 4 are skipped as these do not have a `+` or `-` in them
awk
awk -v extra=50 -v OFS='\t' '
NR == FNR {
count[$2] = $1
for(i = 1; i <= $1; i++) {
low[$2, i] = $(2 + 2 * i)
high[$2, i] = $(3 + 2 * i)
mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
}
next
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
for(i = 1; i <= count[$8]; i++)
if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
if($4 > mid[$8, i]) {
sign = "+"
value = high[$8, i]
}
else {
sign = "-"
value = low[$8, i]
}
diff = (value > $4) ? value - $4 : $4 - value
$9 = (diff > 50) ? ">50" : (sign diff)
break
}
if(i > count[$8]) {
$9 = ">50"
}
}
1
' FS='[- ]' file2 FS='\t' file1 | awk if($6 == "-" || $6 == "+") printf ":" ; 'FNR==NR {a[$2]=$3; next} a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
As far as I can tell, your awk code is OK and your bash usage is wrong.
FS='[- ]' file2 FS='\t' file1 |
awk if($6 == "-" || $6 == "+")
printf ":" ;
'FNR==NR {a[$2]=$3; next}
a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
I don't know what that's supposed to do. This much is certain, though: on the second line, the awk code needs to be quoted, and an if statement has to sit inside an action block (awk '{ if (...) ... }'). The bash error message stems from the fact that bash is interpreting the unquoted awk code, and ( is not a valid shell-script token after if.
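Going by the desired output in the question, the second stage could look roughly like the sketch below, written in Python to keep the logic easy to follow. Everything here is an assumption read off the question: the helper name annotate, that out.txt is tab-delimited, and that only values of $9 that start with + or - get the suffix (line 2 contains a - but is left alone in the desired output).

```python
def annotate(out_lines, file2_lines):
    """Append ':<transcript id>' to field 9 when it starts with '+' or '-'."""
    # gene name (field 2 of file2) -> transcript id (field 3 of file2)
    transcript = {}
    for line in file2_lines:
        f = line.split()
        if len(f) >= 3:
            transcript[f[1]] = f[2]

    result = []
    for line in out_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) >= 9 and f[8][:1] in ("+", "-") and f[7] in transcript:
            f[8] = f[8] + ":" + transcript[f[7]]
        result.append("\t".join(f))
    return result
```

The header row and values such as >50 pass through unchanged because field 9 does not start with a sign.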

Compare two files and print matching lines with some lines after match

I have two files file1.txt and file2.txt.
file1.txt
DS496218 40654 42783
DS496218 40654 42783
DS496218 40654 42783
file2.txt
###
DS496108 ena gene 99942 102567 . -
DS496128 ena mRNA 99942 102567 . -
DS496118 ena three_prime_UTR 99942 100571
###
DS496218 ena gene 40654 42783 . -
DS496108 ena mRNA 99942 102567 . -
DS496108 ena three_prime_UTR 99942 100571
###
DS496128 ena gene 99942 102567 . -
DS496133 ena mRNA 99942 102567 . -
DS496139 ena three_prime_UTR 99942 100571
###
I want to match columns 1, 2 and 3 of file1.txt with columns 1, 4 and 5 of file2.txt. If they match, print the matching line and the lines that follow it up to the next ###, but don't print the ### itself. I tried this awk command:
awk -F'\t' 'NR==FNR{c[$1$2$3]++;next};c[$1$4$5] > 0' file1.txt file2.txt > out.txt
Without seeing your expected output it's a guess but it sounds like this is what you want:
awk '
NR==FNR { a[$1,$2,$3]; next }
($1,$4,$5) in a { found=1 }
/^###/ { found=0 }
found
' file1 file2