changing the appearance of awk output - awk

I used the following code to extract protein residues from text files.
awk '{
if (FNR == 1 ) print ">" FILENAME
if ($5 == 1 && $4 > 30) {
printf $3
}
}
END { printf "\n"}' protein/*.txt > seq.txt
I got the following output when I used the above code.
>1abd
MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR>1axc
RQTSMTDFYHSKRRLIFS>1bxc
RQTSMTDFYHSKRRLIFSPRR>1axF
RQTSMTDFYHSKRR>1qqt
ARPYQGVRVKEPVKELLRRKRG
I would like to get the output as shown below.How do I change the above code to get the following output?
>1abd
MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR
>1axc
RQTSMTDFYHSKRRLIFS
>1bxc
RQTSMTDFYHSKRRLIFSPRR
>1axF
RQTSMTDFYHSKRR
>1qqt
ARPYQGVRVKEPVKELLRRKRG

This might work for you:
awk '{
if (FNR == 1 ) print newline ">" FILENAME
if ($5 == 1 && $4 > 30) {
newline="\n";
printf $3
}
}
END { printf "\n"}' protein/*.txt > seq.txt

With gawk version 4, you can write:
gawk '
BEGINFILE {print ">" FILENAME}
($5 == 1 && $4 > 30) {printf "%s", $3}
ENDFILE {print ""}
' filename ...
http://www.gnu.org/software/gawk/manual/html_node/BEGINFILE_002fENDFILE.html#BEGINFILE_002fENDFILE

Related

Print rows where one column is the same but another is different

Given files test1 and test2:
$ cat test*
alert_name,id,severity
bar,2,1
foo,1,0
alert_name,id,severity
foo,1,9
bar,2,1
I want to find rows where name is the same but severity has changed (ie foo) and print the change. I have got this far using awk:
awk 'BEGIN{FS=","} FNR>1 && NR==FNR { a[$1]; next } ($1 in a) && ($3 != a[$3]) {printf "Alert %s severity from %s to %s\n", $1, a[$3], $3}' test1 test2
which prints:
Alert foo severity from to 9
Alert bar severity from to 1
So the match is wrong, and I can't print a[$3].
You may try this awk:
awk -F, '$1 in sev && $3 != sev[$1] {
printf "Alert %s severity from %s to %s\n", $1, sev[$1], $3
}
{sev[$1] = $3}' test*
Alert foo severity from 0 to 9
mawk 'BEGIN { _+=(_^=FS=OFS=",")+_ }
FNR == NR || +$_<=+___[__=$!!_] ? !_*(___[$!!_]=$_) : \
$!_ = "Alert "__ " severity from "___[__]" to " $_' files*.txt
Alert foo severity from 0 to 9

Find "complete cases" with awk

Using awk, how can I output the lines of a file that have all fields non-null without manually specifying each column?
foo.dat
A||B|X
A|A|1|
|1|2|W
A|A|A|B
Should return:
A|A|A|B
In this case we can do:
awk -F"|" -v OFS="|" '$1 != "" && $2 != "" && $3 != "" && $4 != "" { print }' foo.dat
But is there a way to do this without specifying each column?
You can loop over all fields and skip the record if any of the fields are empty:
$ awk -F'|' '{ for (i=1; i<=NF; ++i) { if (!$i) next } }1' foo.dat
A|A|A|B
if (!$i) is "if field i is not non-empty", and 1 is short for "print the line", but it is only hit if next was not executed for any of the fields of the current line.
Another in awk:
$ awk -F\| 'gsub(/[^|]+(\||$)/,"&")==NF' file
A|A|A|B
print record if there are NF times | terminating (non-empty, |-excluding) strings.
awk '!/\|\|/&&!/\|$/&&!/^\|/' file
A|A|A|B

awk Print Skipping a field

In the case where type is "" print the 3rd field out of sequence and then print the whole line with the exception of the 3rd field.
Given a tab separated line a b c d e the idea is to print ab<tab>c<tab>a<tab>b<tab>d<tab>e
Setting $3="" seems to cause the subsequent print statement to lose the tab field separators and so is no good.
# $1 = year $2 = movie
BEGIN {FS = "\t"}
type=="" {printf "%s\t%s\t", $2 $1,$3; $3=""; print}
type!="" {printf "%s\t<%s>\t", $2 $1,type; print}
END {print ""}
Sticking in a for loop which I like a lot less as a solution results in a blank file.
# $1 = year $2 = movie
BEGIN {FS = "\t"}
type=="" {printf "%s\t%s\t%s\t%s\t", $2 $1,$3,$1,$2; for (i=4; i<=NF;i++) printf "%s\t",$i}
type!="" {printf "%s\t<%s>\t", $2 $1,type; print}
END {print ""}
You need to set the OFS to a tab instead of it's default single blank char and you don't want to just set $3 to a bank char as then you'll get 2 tabs between $2 and $4.
$ cat tst.awk
BEGIN {FS = OFS = "\t"}
{
if (type == "") {
val = $3
for (i=3; i<NF; i++) {
$i = $(i+1)
}
NF--
}
else {
val = "<" type ">"
}
print $2 $1, val, $0
}
$
$ awk -f tst.awk file | tr '\t' '-'
ba-c-a-b-d-e
$
$ awk -v type="foo" -f tst.awk file | tr '\t' '-'
ba-<foo>-a-b-c-d-e
The |tr '\t' '-' is obviously just added to make visible where the tabs are.
If decrementing NF doesn't work in your awk to delete the last field in the record, replace it with sub(/\t[^\t]+$/,"").
One way
awk '{$3=""}1' OFS="\t" infile|column -t
explanation
{$3=""} set column to nil
1 same as print, print the line.
OFS="\t"set Output Field Separator Variable to tab, maybe you needn't it, next commandcolumn -t` make the format again.
column -t columnate lists with tabs.

awk to improve command print Match and Non-Match case:

Would like to read and compare first field from two files then print
Match Lines from Both the files - ( Available in f11.txt and f22.txt) -> Op_Match.txt
Non- Match Lines from f11.txt ( Available in f11.txt Not-Available in f22.txt)-> Op_NonMatch_f11.txt
Non- Match Lines from f22.txt ( Available in f22.txt Not-Available in f11.txt)-> Op_NonMatch_f22.txt
Using below 3 separate commands to achieve the above scenario's .
f11.txt
10,03-APR-14,abc
20,02-JUL-13,def
10,19-FEB-14,abc
20,02-AUG-13,def
10,22-JAN-07,abc
10,29-JUN-07,abc
40,11-SEP-13,ghi
f22.txt
50,DL,3000~4332,ABC~XYZ
10,DL,5000~2503,ABC~XYZ
30,AL,2000~2800,DEF~PQZ
To Match Lines from Both the files:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1] = $0; next} ($1 in a) {print $0,a[$1]}' f22.txt f11.txt> Op_Match.txt
10,03-APR-14,abc,10,DL,5000~2503,ABC~XYZ
10,19-FEB-14,abc,10,DL,5000~2503,ABC~XYZ
10,22-JAN-07,abc,10,DL,5000~2503,ABC~XYZ
10,29-JUN-07,abc,10,DL,5000~2503,ABC~XYZ
To Non- Match Lines from f11.txt:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1] = $0; next} !($1 in a) {print $0}' f22.txt f11.txt > Op_NonMatch_f11.txt
20,02-JUL-13,def
20,02-AUG-13,def
40,11-SEP-13,ghi
To Non- Match Lines from f22.txt:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1] = $0; next} !($1 in a) {print $0}' f11.txt f22.txt > Op_NonMatch_f22.txt
50,DL,3000~4332,ABC~XYZ
30,AL,2000~2800,DEF~PQZ
Using above 3 separate commands to achieve the mentioned scenario’s. Is there any simplest way to avoid 3 different commands? Any Suggestions ...!!!
Something like this, untested:
awk '
BEGIN{ FS=OFS="," }
NR==FNR {
fname1 = FILENAME
keys[NR] = $1
recs[NR] = $0
key2nrs[$1] = ($1 in key2nrs ? key2nrs[$1] RS : "") NR
next
}
{
if ($1 in key2nrs) {
split (key2nrs[$1],nrs,RS)
for (i=1; i in nrs; i++) {
print recs[nrs[i]], $0 > "Op_Match.txt"
}
matched[$1]
}
else {
print > ("Op_NonMatch_" FILENAME ".txt")
}
}
END {
for (i=1; i in recs; i++) {
if (! (keys[i] in matched) ) {
print recs[i] > ("Op_NonMatch_" fname1 ".txt")
}
}
}
' f11.txt f22.txt
The main difference between this and Kent and Etans answers is that theirs assume that the $1 in f22.txt can only appear once within that file while the above would work if, say, 10 occurred as the first field on multiple lines of f22.txt.
The other difference is that the above will output lines in the same order that they occurred in the input files while the other answers will output some of them in random order based on how they're stored internally in a hash table.
I haven't checked #EdMorton's answer but he will quite likely have gotten it right.
My solution (which looks slightly less generic than his at first glance) is:
awk -F, '
FNR==NR {
a[$1]=$0;
next
}
($1 in a){
print $0,a[$1] > "Op_Match.txt"
am[$1]++
}
!($1 in a) {
print $0 > "Op_NonMatch_f11.txt"
}
END {
for (i in a) {
if (!(i in am)) {
print a[i] > "Op_NonMatch_f22.txt"
}
}
}
' f22.txt f11.txt
here is one:
awk -F, -v OFS="," 'NR==FNR{a[$1]=$0;next}
$1 in a{print $0,a[$1]>("common.txt");c[$1];next}
{print $0>("NonMatchFromFile1.txt")}
END{for(x in a)
if(!(x in c))
print a[x]>("NonMatchFromFile2.txt")}' f2 f1
with this, you will get 3 files: common.txt, nonmatchfromFile1.txt and nonMatchfromfile2.txt

If statement in GAWK gives an error

I have this piece of code:
gawk '{if (match($5,/hola/,a) && $6=="hola") {print $2"\t"$1"\t"$2"\t"$1"\t"$3} else if `(match($5,/(_[joxT]+\.[0-9]*)/,a) && match($6,/(_[joxG]+\.[0-9]*)/,b)) {print $2""a[1]"\t"$1""b[1]} else (match($5,/(_[joxT]+\.[0-9]*)/,a) && $6=="hola") {print "hola"}}' pasted`
I'm getting this error:
gawk: cmd. line:1: {if (match($5,/hola/,a) && $6=="hola") {print $2"\t"$1"\t"$2"\t"$1"\t"$3} else if (match($5,/(_[joxT]+\.[0-9]*)/,a) && match($6,/(_[joxG]+\.[0-9]*)/,b)) {print $2""a[1]"\t"$1""b[1]} else (match($5,/(_[joxT]+\.[0-9]*)/,a)) {print $1}}
gawk: cmd. line:1: ^ syntax error
Do you know where the error is?
Thanks.
Take pity on the next guy to maintain your code and indent. Not every program needs to be expressed on one line.
gawk '
BEGIN {OFS = '\t'}
{
if ($5 ~ /hola/ && $6 == "hola") {
print $2, $1, $2, $1, $3
}
else if (match($5, /(_[joxT]+\.[0-9]*)/, a) && match($6, /(_[joxG]+\.[0-9]*)/, b)) {
print $2 a[1], $1 b[1]
}
else if ($5 ~ /(_[joxT]+\.[0-9]*)/ && $6 == "hola") {
print "hola"
}
}
' pasted
Here, only using match() when you need to capture part of the match.
gawk '{if (match($5,/hola/,a) && $6=="hola") {print $2"\t"$1"\t"$2"\t"$1"\t"$3} else if `(match($5,/(_[joxT]+\.[0-9]*)/,a) && match($6,/(_[joxG]+\.[0-9]*)/,b)) {print $2""a[1]"\t"$1""b[1]} else if (match($5,/(_[joxT]+\.[0-9]*)/,a) && $6=="hola") {print "hola"}}' pasted`