I have a file that has an entry for a transcript and then the following line(s) are the associated exons. Sometimes this may be one exon and so one subsequent line, sometimes there are 'n' exons and so 'n' subsequent lines like so :
1 Cufflinks transcript 63846957 63847511
1 Cufflinks exon 63846957 63847511
1 Cufflinks transcript 63851691 63852040
1 Cufflinks exon 63851691 63852040
2 Cufflinks transcript 8442356 8443964
2 Cufflinks exon 8442356 8442368
2 Cufflinks exon 8443768 8443964
2 Cufflinks exon 8444000 8444578
2 Cufflinks transcript 8258988 8259803
2 Cufflinks exon 8258988 8259271
2 Cufflinks exon 8259370 8259803
I would like to print out the transcript and associated exon lines only if there are exactly two exons after the transcript. For this example, only the last three lines would be extracted (one transcript line and two exon lines).
How can this be done with awk?
You can save up lines in an array, then print them once you are sure about the number of exons.
#!/usr/bin/awk -f
BEGIN {
number_of_exons = 0;
}
END {
print_if_two_exons();
}
$3 == "transcript" {
print_if_two_exons();
transcript = $0;
}
$3 == "exon" {
exons[number_of_exons++] = $0;
}
function print_if_two_exons() {
if (transcript && number_of_exons == 2) {
print transcript;
for (i = 0; i < number_of_exons; i++) {
print exons[i];
}
}
delete exons;
number_of_exons = 0;
}
Output:
2 Cufflinks transcript 8258988 8259803
2 Cufflinks exon 8258988 8259271
2 Cufflinks exon 8259370 8259803
$ cat tst.awk
/transcript/ { prt() }
{ buf = buf $0 ORS; ++cnt }
END { prt() }
function prt() {
if ( cnt == 3 ) {
printf "%s", buf
}
buf = ""
cnt = 0
}
$ awk -f tst.awk file
2 Cufflinks transcript 8258988 8259803
2 Cufflinks exon 8258988 8259271
2 Cufflinks exon 8259370 8259803
$ cat awk-script
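function set_all(s,t,e) {
    exon=e;tran=t;str=s
}
# Print the buffered block only if it holds a transcript plus exactly two exons
function emit() {
    if(tran && exon==2) print str
}
# A new transcript closes the previous block, then starts a fresh one
/transcript/{emit(); set_all($0,1,0)}
/exon/{
    if(tran){
        if(exon<2)
            set_all(str"\n"$0,tran,exon+1)
        else
            set_all("",0,0)   # third exon: discard this block
    } else
        set_all("",0,0)
}
END {
    emit()
}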
$ awk -f awk-script file
2 Cufflinks transcript 8258988 8259803
2 Cufflinks exon 8258988 8259271
2 Cufflinks exon 8259370 8259803
This is a straightforward method; here is how it works:
The variable tran flags that a transcript line has been seen, exon counts the exon lines that follow it, and str accumulates the lines to print.
The function set_all sets str, tran, and exon in a single call.
You can use a PCRE to do this.
In ruby:
$ ruby -e 'buf=$<.read
buf.scan(/.*transcript.*\n+.*exon.*\n.*exon.*\n(?=(?:.*transcript)|\z)/)
.each { |m| puts m }' file
2 Cufflinks transcript 8258988 8259803
2 Cufflinks exon 8258988 8259271
2 Cufflinks exon 8259370 8259803
Perl:
$ perl -0777 -lane 'while (/(.*transcript.*\n+.*exon.*\n+.*exon.*\n+)(?=(?:.*transcript)|\z)/g) {print $1;}' file
Similar in Python, GNU grep, etc
I have a tab-delimited file like this:
chr20 102 K245 A T 56.0 AC.02 AC=0.1;DC=45;AC_old=452;DP=21;sample=kj;sample_name=DKl;New_sample=rdf
chr10 8742 JH245 G T 86.0 AC.742 AC=2.1;DC=75;AC_old=42;DP=1;sample=KHS;sample_name=WEKl;New_sample=ASEf
chrX 2302 XS245 G A 786.0 AC.452 AC=8;DC=5;AC_old=4A2;DP=5;sample=SED;sample_name=MHNSKl;New_sample=rdf
I need to extract only AC, DC, and sample, like this:
chr20 102 K245 A T 56.0 AC.02 AC=0.1 DC=45 sample=kj
chr10 8742 JH245 G T 86.0 AC.742 AC=2.1 DC=75 sample=KHS
chrX 2302 XS245 G A 786.0 AC.452 AC=8 DC=5 sample=SED
I tried grep like this, but it did not serve the purpose:
grep -wF "AC|DC|sample" < file.txt
Please try the following, written and tested in GNU awk using only the samples shown.
awk '
match($0,/AC\.[0-9]+/){
  val1=value=""
  val1=substr($0,1,RSTART+RLENGTH-1)
  num=split($NF,arr,";")
  for(i=1;i<=num;i++){
    if(arr[i]~/^(AC=|DC=|sample=)/){
      value=(value?value OFS:"")arr[i]
    }
  }
  print val1,value
}
' Input_file
Explanation: a detailed, line-by-line explanation of the above.
awk '                                      ##Start the awk program.
match($0,/AC\.[0-9]+/){                    ##Use match() to find the regex AC\.[0-9]+ in the line.
  val1=value=""                            ##Reset val1 and value.
  val1=substr($0,1,RSTART+RLENGTH-1)       ##val1 holds the line from its start through the end of the matched text.
  num=split($NF,arr,";")                   ##Split the last field on ";" into arr.
  for(i=1;i<=num;i++){                     ##Loop over all items of the last field.
    if(arr[i]~/^(AC=|DC=|sample=)/){       ##Check whether this item starts with AC=, DC= or sample=.
      value=(value?value OFS:"")arr[i]     ##Append the item to value, separated by OFS.
    }
  }
  print val1,value                         ##Print val1 and value.
}
' Input_file                               ##Name of the input file.
You can use
awk -F\; '$1 ~ /AC|DC|sample/{print $1 OFS $2 OFS $5}' file
Here,
-F\; sets the field separator to ;
$1 ~ /AC|DC|sample/ only takes lines having AC, DC or sample in Field 1
{print $1 OFS $2 OFS $5} - prints Field 1, 2 and 5 with spaces as separators.
See the online demo:
s='chr20 102 K245 A T 56.0 AC.02 AC=0.1;DC=45;AC_old=452;DP=21;sample=kj;sample_name=DKl;New_sample=rdf
chr10 8742 JH245 G T 86.0 AC.742 AC=2.1;DC=75;AC_old=42;DP=1;sample=KHS;sample_name=WEKl;New_sample=ASEf
chrX 2302 XS245 G A 786.0 AC.452 AC=8;DC=5;AC_old=4A2;DP=5;sample=SED;sample_name=MHNSKl;New_sample=rdf'
awk -F\; '$1 ~ /AC|DC|sample/{print $1 OFS $2 OFS $5}' <<< "$s"
Output:
chr20 102 K245 A T 56.0 AC.02 AC=0.1 DC=45 sample=kj
chr10 8742 JH245 G T 86.0 AC.742 AC=2.1 DC=75 sample=KHS
chrX 2302 XS245 G A 786.0 AC.452 AC=8 DC=5 sample=SED
You may use this awk:
awk -F '[\t;]+' -v OFS='\t' '{s=""; for (i=1; i<=6; ++i) s = (i == 1 ? "" : s OFS) $i; for (i=6; i<=NF; ++i) if ($i ~ /^([AD]C|sample)[=.]/) s = s OFS $i; print s}' file
chr20 102 K245 A T 56.0 AC.02 AC=0.1 DC=45 sample=kj
chr10 8742 JH245 G T 86.0 AC.742 AC=2.1 DC=75 sample=KHS
chrX 2302 XS245 G A 786.0 AC.452 AC=8 DC=5 sample=SED
A more readable version:
awk -F '[\t;]+' -v OFS='\t' '
{
s = ""
for (i=1; i<=6; ++i)
s = (i == 1 ? "" : s OFS) $i
for (i=6; i<=NF; ++i)
if ($i ~ /^([AD]C|sample)[=.]/)
s = s OFS $i
print s
}' file
In the awk below I am trying to extract and compare each substring in $4 that starts with p.. If the first three letters are the same as the last three (there is a number in between), then that p. is updated to p.(3 letters)(digits)(=); the () are only there to show that there are 3 entries and are not needed. If the three letters are different, that line is unchanged. In the file below, line 1 is an example. In my actual data there are about 10,000 rows with about 50 columns, but $4 is the only one that will have these p. values in it. The format of the p. will always be three letters followed by a 1-4 digit number followed by 3 more letters. I think the awk attempt below will extract each p. and split on the ;, but I am not sure how to compare to check whether the three letters are the same. Thank you :).
file tab-delimited
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110Glu;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110Glu
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
desired output tab-delimited
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
awk
awk '
BEGIN { OFS="\t" }
$4 ~ /:NM/ {
ostring=""
# split $4 by ";" and cycle through them
nNM=split($4,NM,";")
for (n=1; n<=nNM; n++) {
if (n>1) ostring=(ostring ";") # append ";"
if (match(NM[n],/p[.].*/)) {
# copy up to "p."
ostring=(ostring substr(NM[n],1,RSTART+1))
# Get the substring after "p."
VAL=substr(NM[n],RSTART+2)
# Get its length
lenVAL=length(VAL)
# store aa array
aa=[{while(length($4)=3){print substr($044,1,3);gsub(/^./,"")}]}' file
Extended GNU awk solution:
awk 'NR==1; NR > 1{
len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
if (len == 1){ print; next }
res = ""
for (i=1; i < len; i++) {
s = seps[i];
if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
seps[i] = substr(s, 1, length(s) - 3)"=";
}
}
for (i=1; i <= len; i++)
res = res a[i] (seps[i]? seps[i]:"");
$4 = res; print
}' FS='\t' OFS='\t' file
The output:
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
Time performance measurement:
Input testfile:
$ wc -l testfile
10000 testfile
time(awk 'NR==1; NR > 1{
len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
if (len == 1){ print; next }
res = ""
for (i=1; i < len; i++) {
s = seps[i];
if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
seps[i] = substr(s, 1, length(s) - 3)"=";
}
}
for (i=1; i <= len; i++)
res = res a[i] (seps[i]? seps[i]:"");
$4 = res; print
}' FS='\t' OFS='\t' testfile >/dev/null)
real 0m0.269s
user 0m0.256s
sys 0m0.000s
time(awk 'BEGIN { FS=OFS="\t" }
NR>1 {
head = ""
tail = $4
while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
tail = substr(tail,RSTART+RLENGTH)
}
$4 = head tail
}
{ print }' testfile >/dev/null)
real 0m0.470s
user 0m0.416s
sys 0m0.008s
With GNU awk for the 3rd arg to match():
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR>1 {
head = ""
tail = $4
while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
tail = substr(tail,RSTART+RLENGTH)
}
$4 = head tail
}
{ print }
$ gawk -f tst.awk file
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
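For awks without the gawk-only third argument to match(), the same loop can probably be written with RSTART, RLENGTH, and substr() alone. This is a minimal sketch under that assumption, not a tested drop-in for the real 50-column data; the temporary file and the variable names tmp, out, and m are made up for the illustration, and the three-letter groups are spelled out because some older awks lack {3}.

```shell
# Build a small tab-delimited sample modelled on the question's file.
tmp=$(mktemp)
printf 'Chr\tStart\tExonicFunc.refGene\tAAChange.refGene\n' > "$tmp"
printf 'chr1\t155880573\tsynonymous SNV\tRIT1:NM_001256821:exon2:c.31G>C:p.Glu110Glu;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110Glu\n' >> "$tmp"
printf 'chr1\t155880573\tnonsynonymous SNV\tRIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln\n' >> "$tmp"

out=$(awk 'BEGIN { FS = OFS = "\t" }
NR > 1 {
    head = ""; tail = $4
    # Find each p.<3 letters><digits><3 letters> in turn.
    while (match(tail, /p\.[A-Za-z][A-Za-z][A-Za-z][0-9]+[A-Za-z][A-Za-z][A-Za-z]/)) {
        m = substr(tail, RSTART, RLENGTH)
        # Same residue before and after the digits: replace the last three letters with "=".
        if (substr(m, 3, 3) == substr(m, RLENGTH - 2))
            m = substr(m, 1, RLENGTH - 3) "="
        head = head substr(tail, 1, RSTART - 1) m
        tail = substr(tail, RSTART + RLENGTH)
    }
    $4 = head tail
}
{ print }' "$tmp")
rm -f "$tmp"
printf '%s\n' "$out"
```

On the sample this should give the same three lines as the gawk version above.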
I have this file:
$ head -n 4 badRegionFromHWE.merged
seqnames start end width strand
chr1 144118070 145868461 1750392 *
chr7 100049516 101110026 1060511 *
chr7 141508887 142999071 1490185 *
$
I want to not print out the header line and print column 1,2,3 separated by tabs. So I wrote this:
awk 'OFS="\t";NR>1{print$1,$2,$3}' badRegionFromHWE.merged | head
seqnames start end width strand
chr1 144118070 145868461 1750392 *
chr1 144118070 145868461
chr7 100049516 101110026 1060511 *
chr7 100049516 101110026
chr7 141508887 142999071 1490185 *
chr7 141508887 142999071
It doesn't do what I wanted it to do!
The assignment OFS="\t" is being used as a pattern with no action. It evaluates to true (a non-empty string) on every line, so awk applies the default action and prints every line. You should put the assignment in a BEGIN block instead:
awk 'BEGIN { OFS="\t" } NR > 1 { print$1, $2, $3 }' badRegionFromHWE.merged
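To see the difference on a small scale, here is a sketch with made-up two-line input (the variable names bad and good are only for the illustration). It also shows -v OFS='\t' as an alternative to the BEGIN block.

```shell
# An assignment used as a pattern is true when the assigned string is non-empty,
# and a pattern with no action prints the whole record.
bad=$(printf 'h1 h2 h3\na b c\n' | awk 'OFS="\t"; NR>1 {print $1, $2}')

# Setting OFS before any input is read avoids the stray pattern.
good=$(printf 'h1 h2 h3\na b c\n' | awk -v OFS='\t' 'NR>1 {print $1, $2}')

printf 'bad:\n%s\ngood:\n%s\n' "$bad" "$good"
```

The first command prints both input lines unchanged plus the tab-separated fields of line 2; the second prints only the tab-separated fields of line 2.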
In the out.txt below I am trying to use awk to update the contents of $9. The out.txt is created by the awk before the pipe |. If $9 contains a + or -, then $8 of out.txt is used as a key to look up in $2 of file2. When a match is found (there will always be one), the $3 value of that file2 line is appended to $9 of out.txt, separated by a :. So the original +6 in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors after the | that I cannot seem to fix. Thank you :).
out.txt tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
file2 space-delimited
2 ISG15 NM_005101.3 948846-948956 949363-949919
desired output tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3 . . .
Description
lines 1, 3, 5: `$9` updated with `:` and the value of `$3` in `file2`
lines 2 and 4 are skipped as these do not have a `+` or `-` in them
awk
awk -v extra=50 -v OFS='\t' '
NR == FNR {
count[$2] = $1
for(i = 1; i <= $1; i++) {
low[$2, i] = $(2 + 2 * i)
high[$2, i] = $(3 + 2 * i)
mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
}
next
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
for(i = 1; i <= count[$8]; i++)
if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
if($4 > mid[$8, i]) {
sign = "+"
value = high[$8, i]
}
else {
sign = "-"
value = low[$8, i]
}
diff = (value > $4) ? value - $4 : $4 - value
$9 = (diff > 50) ? ">50" : (sign diff)
break
}
if(i > count[$8]) {
$9 = ">50"
}
}
1
' FS='[- ]' file2 FS='\t' file1 | awk if($6 == "-" || $6 == "+") printf ":" ; 'FNR==NR {a[$2]=$3; next} a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
As far as I can tell, your awk code is OK and your bash usage is wrong.
FS='[- ]' file2 FS='\t' file1 |
awk if($6 == "-" || $6 == "+")
printf ":" ;
'FNR==NR {a[$2]=$3; next}
a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
I don't know what that's supposed to do. This for sure, though: on the second line, the awk code needs to be quoted (awk 'if(....). The bash error message stems from the fact that bash is interpreting the (unquoted) awk code, and ( is not a valid shell-script token after if.
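Going by the description in the question, the second awk may have been meant to do something like the following. This is only a guess at the intent, sketched on a cut-down, hypothetical copy of the samples (nine columns, two data rows, temporary files). It assumes that "contains a + or -" means $9 starts with a sign, which is what keeps the NM_005101.3:c.-84C>G style of value untouched in the desired output.

```shell
tmp=$(mktemp -d)
# Space-delimited lookup file: gene name in $2, transcript ID in $3.
printf '%s\n' '2 ISG15 NM_005101.3 948846-948956 949363-949919' > "$tmp/file2"
# Tab-delimited stand-in for out.txt; printf reuses the format for each group of nine arguments.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    R_Index Chr Start End Ref Alt Func Gene GeneDetail \
    4 chr1 949925 949925 C T downstream ISG15 '+6' \
    5 chr1 207646923 207646923 G A intronic CR2 '>50' > "$tmp/out.txt"

result=$(awk '
    # First file: remember the transcript ID for each gene.
    NR == FNR { id[$2] = $3; next }
    # Second file: append ":" and the ID when $9 starts with + or -.
    FNR > 1 && $9 ~ /^[+-]/ && ($8 in id) { $9 = $9 ":" id[$8] }
    { print }
' FS=' ' "$tmp/file2" FS='\t' OFS='\t' "$tmp/out.txt")
rm -r "$tmp"
printf '%s\n' "$result"
```

The whole program sits inside one pair of single quotes, so bash never sees the parentheses.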
I am trying to match all the file 1 names in file 2 and average them if there is a match. The field where the match will be is $5 before the | symbol and the average is the sum of $7 that matches $4. Thank you :).
file 1
AGRN
CYP2J2
file 2
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 2
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 2 2
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 3 2
chr1 957571 957852 chr1:957571 AGRN-7|gc=61.2 1 148
chr1 957571 957852 chr1:957571 AGRN-7|gc=61.2 2 149
chr1 957571 957852 chr1:957571 AGRN-7|gc=61.2 3 151
chr1 60381600 60381782 chr1:60381600 CYP2J2-1596|gc=40.7 153 274
chr1 60381600 60381782 chr1:60381600 CYP2J2-1596|gc=40.7 154 273
Desired output (tab-delimited)
chr1:955543 AGRN-6 2
chr1:957571 AGRN 149.3
chr1:60381600 CYP2J2-1596 153.5
I have tried so far:
awk '
FNR==NR{d[$0]; next;}
{
for(k in d){
pat="(^|;)"k":";
if($5 ~ pat){
print;
break;
}
}
}' file 1 file2 > output.bed
The awk does run but the output file, as of now, is 0 bytes. Thank you :).
The script should look like this:
test.awk
BEGIN {
FS="[ \t|]+"
}
# Read search terms from file1 into 's'
FNR==NR {
s[$0]
next
}
{
# Check if $5 matches one of the search terms
for(i in s) {
if($5 ~ i) {
# Store first two fields for later usage
a[$5]=$1
b[$5]=$2
# Add $8 to the running total for this $5
t[$5]+=$8
# Increment count of occurrences of $5
c[$5]++
next
}
}
}
END {
# Calculate the average and print output for all search terms
# that have been found
for( i in t ) {
avg = t[i] / c[i]
printf("%s:%s\t%s\t%s\n", a[i], b[i], i, avg)
}
}
Call it like:
awk -f test.awk file1 file2
Btw, the third avg in your expected output is wrong. The output should look like this:
chr1:955543 AGRN-6 2
chr1:957571 AGRN-7 149.333
chr1:60381600 CYP2J2-1596 273.5
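If the averages should be rounded to one decimal place, as the 149.3 in the question suggests, a %.1f conversion in the printf is one option. A minimal sketch using the three AGRN-7 values from file 2 (the variable name out is only for the illustration):

```shell
# %.1f rounds the average to one decimal place; %s would print awk's default number format.
out=$(awk 'BEGIN { printf("%s\t%.1f\n", "AGRN-7", (148 + 149 + 151) / 3) }')
printf '%s\n' "$out"
```

In test.awk this would mean changing the last %s in the END block's format string to %.1f.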