Can awk replace fields based on a separate specification file?

I have an input file like this:
SomeSection.Foo
OtherSection.Foo
OtherSection.Goo
...and there is another file describing which object(s) belong to each section:
[SomeSection]
Blah
Foo
[OtherSection]
Foo
Goo
The desired output would be:
SomeSection.2 // that's because Foo appears 2nd in SomeSection
OtherSection.1 // that's because Foo appears 1st in OtherSection
OtherSection.2 // that's because Goo appears 2nd in OtherSection
(The numbers and names of sections and objects are variable)
How would you do such a thing in awk?
Thanks in advance,
Adrian.

One possibility:
Content of script.awk (with comments):
## When 'FNR == NR', the first input file is in process.
## If line begins with '[', get the section string and reset the position
## of its objects.
FNR == NR && $0 ~ /^\[/ {
object = substr( $0, 2, length($0) - 2 )
pos = 0
next
}
## This section processes the objects of each section. It saves them in
## an array. Variable 'pos' increments with each object processed.
FNR == NR {
arr_obj[object, $0] = ++pos
next
}
## This section processes the second file. It splits each line on '.' to look
## up the second part in the array, then prints the section and the position.
FNR < NR {
ret = split( $0, obj, /\./ )
if ( ret != 2 ) {
next
}
printf "%s.%d\n", obj[1], arr_obj[ obj[1] SUBSEP obj[2] ]
}
Run the script (the order of the input files matters: object.txt holds the sections with their objects, and input.txt holds the references to look up):
awk -f script.awk object.txt input.txt
Result:
SomeSection.2
OtherSection.1
OtherSection.2
EDIT to a question in comments:
I'm not an expert but I will try to explain how I understand it:
SUBSEP is the character awk uses to join the parts of an array index when you use several values as one key. Its default is \034, and you can change it just like RS or FS.
In the statement arr_obj[object, $0] = ++pos, the comma joins the values with the value of SUBSEP, so in this case the result is:
arr_obj[SomeSection\034Blah] = 1
At the end of the script I access the element by building the key explicitly with that variable, arr_obj[ obj[1] SUBSEP obj[2] ], which means the same as arr_obj[object, $0] in the previous section.
You can also get each part of the index back by splitting it on SUBSEP, like this:
for (key in arr_obj) { ## Assign 'string\034string' to 'key' variable
split( key, key_parts, SUBSEP ) ## Split 'key' with the content of SUBSEP variable.
...
}
with a result of:
key_parts[1] -> SomeSection
key_parts[2] -> Blah
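To make the SUBSEP explanation concrete, here is a minimal sketch that can be run on its own. It reuses the names from the answer (arr_obj, key_parts); the shell variable result is just a name picked for this example. It shows that the comma form and the explicit SUBSEP concatenation address the same element, and that the key can be split back into its parts.

```shell
result=$(awk 'BEGIN {
    arr_obj["SomeSection", "Blah"] = 1   # comma form: the key is "SomeSection" SUBSEP "Blah"

    # Building the key by hand addresses the same element.
    if (("SomeSection" SUBSEP "Blah") in arr_obj)
        print "same key"

    # The (i, j) in array form tests membership without building the key.
    if (("SomeSection", "Blah") in arr_obj)
        print "found with comma form"

    # Recover the parts of each key by splitting on SUBSEP.
    for (key in arr_obj) {
        n = split(key, key_parts, SUBSEP)
        print n, key_parts[1], key_parts[2]
    }
}')
printf '%s\n' "$result"
```

This should print "same key", "found with comma form" and "2 SomeSection Blah". Note that this only works as long as the data itself never contains the SUBSEP character.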

This awk line should do the job:
awk 'BEGIN{FS="[\\.\\]\\[]"}
NR==FNR{ if(NF>1){ i=1; idx=$2; }else{ s[idx"."$1]=i; i++; } next; }
{ if($0 in s) print $1"."s[$0] } ' f2 input
see test below:
kent$ head input f2
==> input <==
SomeSection.Foo
OtherSection.Foo
OtherSection.Goo
==> f2 <==
[SomeSection]
Blah
Foo
[OtherSection]
Foo
Goo
kent$ awk 'BEGIN{FS="[\\.\\]\\[]"}
NR==FNR{ if(NF>1){ i=1; idx=$2; }else{ s[idx"."$1]=i; i++; } next; }
{ if($0 in s) print $1"."s[$0] } ' f2 input
SomeSection.2
OtherSection.1
OtherSection.2

Related

extract info from a tag using awk

I have a multi-column file and I want to extract some info from column 71.
I want to extract by tag, whatever the value is. For example, I want to extract just AC=* and AF=*, where the value can be anything.
I found a similar question and gave it a try, but it didn't work:
Extract columns with values matching a specific pattern
Column 71 looks like this:
AC=14511;AC_AFR=382;AC_AMR=1177;AC_Adj=14343;AC_EAS=5;AC_FIN=427;AC_Het=11813;AC_Hom=1265;AC_NFE=11027;AC_OTH=97;AC_SAS=1228;AF=0.137;AN=106198;AN_AFR=8190;AN_AMR=10424;AN_Adj=99264;AN_EAS=7068;AN_FIN=6414;AN_NFE=51090;AN_OTH=658;AN_SAS=15420;BaseQRankSum=1.73;ClippingRankSum=-1.460e-01;DB;DP=1268322;FS=0.000;GQ_MEAN=190.24;GQ_STDDEV=319.67;Het_AFR=358;Het_AMR=1049;Het_EAS=5;Het_FIN=399;Het_NFE=8799;Het_OTH=83;Het_SAS=1120;Hom_AFR=12;Hom_AMR=64;Hom_EAS=0;Hom_FIN=14;Hom_NFE=1114;Hom_OTH=7;Hom_SAS=54;InbreedingCoeff=0.0478;MQ=60.00;MQ0=0;MQRankSum=0.037;NCC=270;POSITIVE_TRAIN_SITE;QD=21.41;ReadPosRankSum=0.212;VQSLOD=4.79;culprit=MQ;DP_HIST=30|3209|1539|1494|30007|7938|4130|2038|1310|612|334|185|97|60|31|25|9|11|7|33,0|66|339|1048|2096|2665|2626|1832|1210|584|323|179|89|54|31|22|7|9|4|15;GQ_HIST=84|66|56|82|3299|568|617|403|250|319|436|310|28566|2937|827|834|451|186|217|12591,15|15|13|16|25|11|22|28|18|38|52|31|65|76|39|83|93|65|97|12397;CSQ=T|ENSG00000186868|ENST00000334239|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11502.1|ENSP00000334886|TAU_HUMAN|B4DSE3_HUMAN|UPI0000000C16||||2/8||ENST00000334239.8:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000570299|Transcript|intron_variant&non_coding_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|processed_transcript||||||||||2/6||ENST00000570299.1:n.262-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000340799|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000340438|TAU_HUMAN||UPI000004EEE6||||3/10||ENST00000340799.5:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000262410|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000262410|TAU_HUMAN||UPI0000EE80B7||||4/13||ENST00000262410.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000446361|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protei
n_coding||CCDS11500.1|ENSP00000408975|TAU_HUMAN||UPI000004EEE5||||2/9||ENST00000446361.3:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000574436|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000460965|TAU_HUMAN||UPI000002D754||||3/10||ENST00000574436.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571987|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000458742|TAU_HUMAN||UPI0000EE80B7||||3/12||ENST00000571987.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000415613|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45715.1|ENSP00000410838|TAU_HUMAN||UPI0001AE66E9||||3/13||ENST00000415613.2:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571311|Transcript|intron_variant&NMD_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|nonsense_mediated_decay|||ENSP00000460048||I3L2Z2_HUMAN|UPI00025A2E6E||||4/4||ENST00000571311.1:c.*176-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000535772|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000443028|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||4/10||ENST00000535772.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000576518|Transcript|stop_gained|5499|7|3|K/*|Aag/Tag|rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000458621||I3L170_HUMAN&B4DSE3_HUMAN|UPI0001639A7C|||1/7|||ENST00000576518.1:c.7A>T|ENSP00000458621.1:p.Lys3Ter|T:0.1171|||||||||15792962|||||POSITION:0.00682261208576998&ANN_ORF:-255.6993&MAX_ORF:-255.6993|PHYLOCSF_WEAK|ANC_ALLELE|LC,T|ENSG00000186868|ENST00000420682|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000413056|TAU_HUMAN||UPI000004EEE6||||2/9||ENST00000420682.2:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000572440|Transcri
pt|non_coding_transcript_exon_variant&non_coding_transcript_variant|2790|||||rs754512|1||1|MAPT|HGNC|6893|retained_intron|||||||||1/1|||ENST00000572440.1:n.2790A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000351559|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000303214|TAU_HUMAN||UPI000002D754||||4/11||ENST00000351559.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000344290|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|YES|CCDS45715.1|ENSP00000340820|TAU_HUMAN||UPI0001AE66E9||||4/14||ENST00000344290.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000347967|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000302706|TAU_HUMAN|B4DSE3_HUMAN|UPI0000173D91||||4/10||ENST00000347967.5:c.32-100A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000431008|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000389250|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||3/9||ENST00000431008.3:c.308-94A>T||T:0.1171|||||||||15792962||||||||
The code that I tried:
awk '{
for (i = 1; i <= NF; i++) {
if ($i ~ /AC|AF/) {
printf "%s %s ", $i, $(i + 1)
}
}
print ""
}'
I keep getting a syntax error.
Output wanted:
AC=14511;AF=0.137
Whenever you have name=value pairs, it's usually simplest to first create an array that maps names to values (n2v[] below) and then you can just access the values by their names.
$ cat file
AC=1;AC_AFR=2;AF=3 AC=4;AC_AFR=5;AF=6
$ cat tst.awk
{
delete n2v
split($2,tmp,/[;=]/)
for (i=1; i in tmp; i+=2) {
n2v[tmp[i]] = tmp[i+1]
}
prt("AC")
prt("AF")
}
function prt(name) { print name, "=", n2v[name] }
$ awk -f tst.awk file
AC = 4
AF = 6
Just change $2 to $71 for your real input.
Something like this should do it (the second version further down needs GNU awk because of switch):
$ awk '{split($71,a,";");for(i in a )if(a[i]~/^AF/) print a[i]}' foo
AF=0.137
You split field $71 on semicolons, then loop through the resulting array looking for the desired match. For multiple matches use switch:
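$ awk '{
    b = "";                        # reset the buffer for each input line
    n = split($71, a, ";");
    for (i = 1; i <= n; i++)       # walk the pairs in input order
        switch (a[i]) {
        case /^AF=/:
            b = b a[i] OFS;
            break;
        case /^AC=/:
            b = b a[i] OFS;
            break
        }
    sub(/.$/, "\n", b);            # replace the trailing OFS with a newline
    printf "%s", b                 # never use data as the printf format
}' foo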
AC=14511 AF=0.137
EDIT: Now it buffers the output in a variable and prints it at the end. You can control the separator with OFS.
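For what it's worth, the same idea can be sketched without switch, so it should run in any POSIX awk. The sample line, the col variable (set to 1 here, it would be 71 for the real file) and the tags list are assumptions made up for this example. Each pair is split at its first "=" so that flags without a value (like DB in the real data) are skipped instead of shifting the name/value pairing.

```shell
# Hypothetical sample line; in the real file these pairs sit in column 71.
line='AC=14511;AC_AFR=382;AF=0.137;AN=106198;DB'
result=$(printf '%s\n' "$line" | awk -v col=1 -v tags='AC AF' '
{
    split("", n2v)                          # portable way to clear the array
    n = split($col, pair, ";")
    for (i = 1; i <= n; i++) {
        eq = index(pair[i], "=")
        if (eq > 0)                         # skip flags that have no value
            n2v[substr(pair[i], 1, eq - 1)] = substr(pair[i], eq + 1)
    }
    out = ""
    m = split(tags, want, " ")
    for (i = 1; i <= m; i++)                # emit in the order the tags were requested
        if (want[i] in n2v)
            out = out (out == "" ? "" : ";") want[i] "=" n2v[want[i]]
    print out
}')
printf '%s\n' "$result"
```

With the sample line this prints AC=14511;AF=0.137, which is the format asked for in the question. Because the tag names are looked up exactly, AC does not accidentally pick up AC_AFR.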

gsub for substituting translations not working

I have a dictionary dict with records separated by ":" and data fields by new lines, for example:
:one
1
:two
2
:three
3
:four
4
Now I want awk to substitute all occurrences of each record in the input
file, eg
onetwotwotwoone
two
threetwoone
four
My first awk script looked like this and works just fine:
BEGIN { RS = ":" ; FS = "\n"}
NR == FNR {
rep[$1] = $2
next
}
{
for (key in rep)
gsub(key,rep[key])
print
}
giving me:
12221
2
321
4
Unfortunately, another dict file contains some characters used by regular expressions, so I have to escape them in my script. After moving key and rep[key] into variables (which can then be escaped), the script only substitutes the second record in the dict. Why? And how can I solve it?
Here's the current second part of the script:
{
for (key in rep)
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
print
}
All scripts are run by awk -f translate.awk dict input
Thanks in advance!
Your fundamental problem is using strings in regexp and backreference contexts when you don't want them and then trying to escape the metacharacters in your strings to disable the characters that you're enabling by using them in those contexts. If you want strings, use them in string contexts, that's all.
You don't want this:
gsub(regexp,backreference-enabled-string)
You want something more like this:
index(...,string) substr(string)
I think this is what you're trying to do:
$ cat tst.awk
BEGIN { FS = ":" }
NR == FNR {
if ( NR%2 ) {
key = $2
}
else {
rep[key] = $0
}
next
}
{
for ( key in rep ) {
head = ""
tail = $0
while ( start = index(tail,key) ) {
head = head substr(tail,1,start-1) rep[key]
tail = substr(tail,start+length(key))
}
$0 = head tail
}
print
}
$ awk -f tst.awk dict file
12221
2
321
4
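To illustrate why the string functions avoid the escaping problem altogether, here is a small sketch of the same index/substr loop wrapped in a function. The function name replace_all and the variables from and to are made up for this example. The key contains a regex metacharacter (.) and the replacement contains &, which gsub would treat specially.

```shell
result=$(printf '%s\n' 'a.b axb a.b' | awk -v from='a.b' -v to='[&]' '
# Replace every literal occurrence of key in str with val; no regex involved.
function replace_all(str, key, val,    head, start) {
    head = ""
    while ((start = index(str, key)) > 0) {
        head = head substr(str, 1, start - 1) val
        str = substr(str, start + length(key))
    }
    return head str
}
{ print replace_all($0, from, to) }')
printf '%s\n' "$result"
```

This prints "[&] axb [&]": the axb in the middle is left alone because the dot is matched literally, and the & in the replacement is copied as-is. The sketch assumes the key is never empty.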
Never mind my asking...
It was just some missing braces:
{
for (key in rep)
{
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
}
print
}
works like a charm.

Extract specific data from a file with awk

I have a text file as shown below. I would like to extract the .pdb IDs and its corresponding chains. How is this possible with awk?
>4HSU:A|PDBID|CHAIN|SEQUENCE
PLGSRKCEKAGCTATCPVCFASASERCAKNGY
PKAFMADQQL
>4HSU:B|PDBID|CHAIN|SEQUENCE
PLGSPEFSERGSKSPLKRAQETE
>4HSU:C|PDBID|CHAIN|SEQUENCE
ARTMQTARKSTGGKAPRKQLATKAARKSAP
>4HT3:A|PDBID|CHAIN|SEQUENCE
MERYENLFAQLNDRREGAF
>4HT3:B|PDBID|CHAIN|SEQUENCE
MTTLLNPYFGEFGGMYVPQ
>4I0W:A|PDBID|CHAIN|SEQUENCE
MENKAKVGIDFINTIPKQILTSLIEQYSPNNGEIELVVLYGDNFLRFKNSVDVIGAKVEDLGYGFGILII
>4I0W:B|PDBID|CHAIN|SEQUENCE
AYDSNRASCIPSVWNNYNLTGEGILVGFLDT
>4I0W:D|PDBID|CHAIN|SEQUENCE
AYDSNRASCIPSVWNNYNLTGEGILVGFLLPLGDTITSGGWRIIVRKLNNYEGYFDIWLPIAEGLN
ERTRFLQPSVYNTLGIPATVEGVIS
Desired output:
4HSU A B C
4HT3 A B
4I0W A B D
kent$ awk -F'[>:|]' '/^>/{a[$2]=a[$2] OFS $3}END{for(x in a)print x,a[x]}' file
4I0W A B D
4HSU A B C
4HT3 A B
I am satisfied with my FS value: >:| like a cute face!
It looks as though you want the output in the original order, so it takes some indirection to take care of that. Everything below works in POSIX awk as requested (or at least in gawk with LINT = 1) and has the additional feature of keeping track of what has been seen, to eliminate duplicates.
#! /usr/bin/awk -f
BEGIN {
FS="[>:|]"
split("", t) # table of output
split("", r) # row number in table for a ID
split("", seen) # keeps track of duplicates
row=0
}
/^>/ && !($2 SUBSEP $3 in seen) {
if ($2 in r) {
i=r[$2]
t[i] = t[i] OFS $3
} else {
r[$2] = row
t[row++] = $2 OFS $3
}
seen[$2, $3] = 1
}
END {
for (i=0; i<row; i++)
print t[i]
}

awk | update field number after comparing field from other file

Input File1: file1.txt
MH=919767,918975
DL=919922
HR=919891,919394,919812
KR=919999,918888
Input File2: file2.txt
aec,919922783456,a5,b3,,,asf
abc,918975583456,a1,b1,,,abf
aeci,919998546783,a2,b4,,,wsf
Output File
aec,919922783456,a5,b3,DL,,asf
abc,918975583456,a1,b1,MH,,abf
aeci,919998546783,a2,b4,NOMATCH,,wsf
Notes
I need to compare the phone number (file2.txt, 2nd field, first 6 digits only) against the comma-separated values after the "=" in file1.txt. If the first 6 digits of the phone number match one of them, the output should contain the 2-letter code from file1.txt in the 5th field.
In file1.txt a single code (for example MH) can cover multiple phone number prefixes.
If you have GNU awk, try the following. Run like:
awk -f script.awk file1.txt file2.txt
Contents of script.awk:
BEGIN {
FS="[=,]"
OFS=","
}
FNR==NR {
for(i=2;i<=NF;i++) {
a[$1][$i]
}
next
}
{
$5 = "NOMATCH"
for(j in a) {
for (k in a[j]) {
if (substr($2,1,6) == k) {
$5 = j
}
}
}
}1
Alternatively, here's the one-liner:
awk -F "[=,]" 'FNR==NR { for(i=2;i<=NF;i++) a[$1][$i]; next } { $5 = "NOMATCH"; for(j in a) for (k in a[j]) if (substr($2,1,6) == k) $5 = j }1' OFS=, file1.txt file2.txt
Results:
aec,919922783456,a5,b3,DL,,asf
abc,918975583456,a1,b1,MH,,abf
aeci,919998546783,a2,b4,NOMATCH,,wsf
If you have an 'old' awk, try the following. Run like:
awk -f script.awk file1.txt file2.txt
Contents of script.awk:
BEGIN {
# set the field separator to either an equals sign or a comma
FS="[=,]"
# set the output field separator to a comma
OFS=","
}
# for the first file in the arguments list
FNR==NR {
# loop through all the fields, starting at field two
for(i=2;i<=NF;i++) {
# add field one and each field to a pseudo-multidimensional array
a[$1,$i]
}
# skip processing the rest of the code
next
}
# for the second file in the arguments list
{
# set the default value for field 5
$5 = "NOMATCH"
# loop though the array
for(j in a) {
# split the array keys into another array
split(j,b,SUBSEP)
# if the first six digits of field two equal the value stored in this array
if (substr($2,1,6) == b[2]) {
# assign field five
$5 = b[1]
}
}
# return true, therefore print by default
}1
Alternatively, here's the one-liner:
awk -F "[=,]" 'FNR==NR { for(i=2;i<=NF;i++) a[$1,$i]; next } { $5 = "NOMATCH"; for(j in a) { split(j,b,SUBSEP); if (substr($2,1,6) == b[2]) $5 = b[1] } }1' OFS=, file1.txt file2.txt
Results:
aec,919922783456,a5,b3,DL,,asf
abc,918975583456,a1,b1,MH,,abf
aeci,919998546783,a2,b4,NOMATCH,,wsf
Try something like:
awk '
NR==FNR{
for(i=2; i<=NF; i++) A[$i]=$1
next
}
{
$5="NOMATCH"
for(i in A) if ($2~"^" i) $5=A[i]
}
1
' FS='[=,]' file1 FS=, OFS=, file2
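If every prefix in file1.txt is exactly six digits long (an assumption, but it holds for the sample data), the loop over the array is not needed at all: the first six characters of the phone number can be used directly as the lookup key. A sketch along those lines, with the array name code and the temporary directory chosen just for this example:

```shell
dir=$(mktemp -d)
printf '%s\n' 'MH=919767,918975' 'DL=919922' 'HR=919891,919394,919812' 'KR=919999,918888' > "$dir/file1.txt"
printf '%s\n' 'aec,919922783456,a5,b3,,,asf' 'abc,918975583456,a1,b1,,,abf' 'aeci,919998546783,a2,b4,,,wsf' > "$dir/file2.txt"
result=$(awk '
FNR == NR {                          # first file: map each prefix to its code
    for (i = 2; i <= NF; i++)
        code[$i] = $1
    next
}
{
    prefix = substr($2, 1, 6)        # substr positions start at 1
    $5 = (prefix in code) ? code[prefix] : "NOMATCH"
    print
}' FS='[=,]' "$dir/file1.txt" FS=, OFS=, "$dir/file2.txt")
rm -rf "$dir"
printf '%s\n' "$result"
```

On the sample files this gives the same three lines as the results shown above. The lookup is a single hash access per input line, which matters if file1.txt grows large.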

using awk to replace a string by another string (mapping table contained in a file)

I would like to know if it's possible to do the following thing only using
awk:
I am searching for some regex in a file F and I want to replace the string (S1) that matches
the regex by another string (S2). Of course, it's easy to do that with awk. But ... my
problem is that the value of S2 has to be obtained from another file that maps
S1 to S2.
Example :
file F:
abcd 168.0.0.1 qsqsjsdfjsjdf
sdfsdffsd
168.0.0.2 sqqsfjqsfsdf
my associative table in another file
168.0.0.1 foo
168.0.0.2 bar
I want to get:
this result:
abcd foo qsqsjsdfjsjdf
sdfsdffsd
bar sqqsfjqsfsdf
Thanks for help !
edit: if my input file is a bit different, like this (no space before IP address):
file F:
abcd168.0.0.1 qsqsjsdfjsjdf
sdfsdffsd
168.0.0.2 sqqsfjqsfsdf
I can't use the $1, $2 variables to search in the associative array.
I tried something like that (based on birei proposition) but it did not work :
FNR < NR {
sub(/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/, assoc [ & ] );
print
}
Is there a way to search the matched string in the associative array (assoc[ & ] seems to
be not valid) ?
One way. It's self-explanatory. Save data from the associative table in an array and in second file check for each field if it matches any key of the array:
awk '
FNR == NR {
assoc[ $1 ] = $2;
next;
}
FNR < NR {
for ( i = 1; i <= NF; i++ ) {
if ( $i in assoc ) {
$i = assoc[ $i ]
}
}
print
}
' associative_file F
Output:
abcd foo qsqsjsdfjsjdf
sdfsdffsd
bar sqqsfjqsfsdf
EDIT: Try the following awk script for IPs that have no space between them and the surrounding words. It's similar to the previous one, but now it loops over the array, looks for each IP anywhere in the line (gsub works on $0 by default) and substitutes it.
awk '
FNR == NR {
assoc[ $1 ] = $2;
next;
}
FNR < NR {
for ( key in assoc ) {
gsub( key, assoc[ key ] )
}
print
}
' associative_file F
Assuming F has the content of your second example, the output would be:
abcdfoo qsqsjsdfjsjdf
sdfsdffsd
bar sqqsfjqsfsdf
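One caveat with gsub(key, ...): the key is used as a regex, so the dots in 168.0.0.1 match any character, and the pattern also matches inside a longer address such as 168.0.0.10. As a possible alternative that is closer to what the question tried with assoc[ & ], here is a sketch using match(), which sets RSTART and RLENGTH so the matched text can be extracted and looked up in the array. The file name map and the variables out, rest and ip are assumptions for this example.

```shell
dir=$(mktemp -d)
printf '%s\n' '168.0.0.1 foo' '168.0.0.2 bar' > "$dir/map"
printf '%s\n' 'abcd168.0.0.1 qsqsjsdfjsjdf' 'sdfsdffsd' '168.0.0.2 sqqsfjqsfsdf' > "$dir/F"
result=$(awk '
FNR == NR { assoc[$1] = $2; next }
{
    out = ""
    rest = $0
    # Find each IP-like string, look it up, and copy the line piece by piece.
    while (match(rest, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)) {
        ip = substr(rest, RSTART, RLENGTH)
        out = out substr(rest, 1, RSTART - 1) ((ip in assoc) ? assoc[ip] : ip)
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}' "$dir/map" "$dir/F")
rm -rf "$dir"
printf '%s\n' "$result"
```

For the second example of file F this prints the same three lines as above. Addresses that are not in the mapping table are left unchanged.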