awk Joining n fields with delimiter - awk

How can I use awk to join various fields, given that I don't know how many of them I have? For example, given the input string
aaa/bbb/ccc/ddd/eee
I use -F'/' as delimiter, do some manipulation on aaa, bbb, ccc, ddd, eee (altering, removing...) and I want to join it back to print something like
AAA/bbb/ddd/e
Thanks

... given that I don't know how many of them I have?
Ah, but you do know how many you have. Or you will soon, if you keep reading :-)
Before giving you a record to process, awk will set the NF variable to the number of fields in that record, and you can use for loops to process them (comments aren't part of the script, I've just put them there to explain):
$ echo pax/is/a/love/god | awk -F/ '{
gsub (/god/,"dog",$5); # pax,is,a,love,dog
$4 = ""; # pax,is,a,,dog
$6 = $5; # pax,is,a,,dog,dog
$5 = "rabid"; # pax,is,a,,rabid,dog
printf "%s", $1; # output "pax"
for (i = 2; i <= NF; i++) { # output ".<field>"
if ($i != "") { # but only for non-blank fields (skip $4)
printf ".%s", $i;
}
}
printf "\n"; # finish line
}'
pax.is.a.rabid.dog
This shows manipulation of the values, as well as insertion and deletion.

The following will show you how to process each field and do some example manipulations on them.
The only caveat of using the output field separator OFS is that "deleted" fields will still have delimiters as shown in the output below; however it makes the code much simpler if you can live with that.
awk '
BEGIN{FS=OFS="/"}
{
for(i=1;i<=NF;i++){
if($i == "aaa")
$i=toupper($i)
else if($i ~ /c/)
$i=""
else if($i ~ /^eee$/)
$i="e"
}
}1' <<<'aaa/bbb/ccc/ddd/eee'
Output
AAA/bbb//ddd/e
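If the leftover empty slot (the // in the output above) is not acceptable, one option is to skip OFS and build the output string by hand, adding a separator only in front of non-empty fields. This is only a minimal sketch: it assumes the same example rules as above and that an empty field always means "deleted", and the shell function name join_nonempty is made up for illustration.

```shell
join_nonempty() {
  awk -F/ '{
    # same example manipulations as in the answer above
    for (i = 1; i <= NF; i++) {
      if ($i == "aaa")      $i = toupper($i)
      else if ($i ~ /c/)    $i = ""
      else if ($i == "eee") $i = "e"
    }
    # join by hand, skipping fields that were blanked out
    out = ""; sep = ""
    for (i = 1; i <= NF; i++)
      if ($i != "") { out = out sep $i; sep = "/" }
    print out
  }' "$@"
}

printf '%s\n' 'aaa/bbb/ccc/ddd/eee' | join_nonempty
```

For the sample input this should print AAA/bbb/ddd/e, which is the form asked for in the question. The trade-off is that a field that was legitimately empty in the input is dropped as well.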

This might work for you:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'BEGIN{FS=OFS="/"}{sub(/../,"",$4);NF=4;print}'
aaa/bbb/ccc/d
To delete fields not at the end use a function to shuffle the values:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'func d(n){for(x=n;x<=NF-1;x++){y=x+1;$x=$y}NF--};BEGIN{FS=OFS="/"}{d(2);print}'
aaa/ccc/ddd/eee
Deletes the second field.

awk -F'/' '{
# I would suggest copying the fields into an array first:
n = NF
for (i = 1; i <= n; i++) { a[i] = $i }
# Now manipulate your elements in the array (for example: delete a[3]),
# then finally join whatever is left and print it:
output = ""; sep = ""
for (i = 1; i <= n; i++) {
if (i in a) { output = output sep a[i]; sep = "/" }
}
print output
}' INPUTFILE
Doing it this way you can delete elements as well. Note deleting an item can be done like delete array[index].

Related

Using awk to analyze log file to identify blocks and to extract information

I am trying to figure out a way to use awk to analyze my log files from an old application. The log file contains processing information from the application but the structure is a bit messy. But it has a structure like this:
some random text
...
BLOCK-BEGIN bla bla INFO1:VAL1
variable lines of text
INFO2:VAL2
variable lines of text
POSSIBLE-BLOCK-END-PHRASE1
...
some random text
INFO3:not-desired-val5
...
BLOCK-BEGIN bla bla INFO1:VAL3
variable lines of text
INFO2:VAL4
variable lines of text
POSSIBLE-BLOCK-END-PHRASE2
...
What I want to do is to first identify the blocks. In the example above, there are two blocks with the same beginning but different endings. Within each block, I then want to extract a few pieces of information, i.e. INFO1 and INFO2 in the example. The desired output in this case would be:
VAL1,VAL2
VAL3,VAL4
I know some basics of awk, so any solutions or hints are highly welcome. Thanks
Update: my first attempt
awk '/BLOCK-BEGIN/{printf substr($4,7)",";for (i = 0 ; i < NF; i++) getline; if($0 ~ '/^INFO2/') print substr($0,7)}'
The output is:
VAL1,VAL2
VAL3,VAL4
But is there a better way to do it? Any suggestions?
$ awk -v OFS=',' '
(split($NF,a,/:/) == 2) && sub(/^INFO/,"",a[1]) {
info[a[1]] = a[2]
if ( a[1] == 2 ) {
print info[1], info[2]
}
}
' file
VAL1,VAL2
VAL3,VAL4
Regarding the code you posted in your question:
printf substr($4,7)"," - never do printf <input data> as it'll fail when your input contains printf formatting characters; always do printf "%s", <input data> instead, so that code should be written printf "%s,", substr($4,7).
getline - there are only a few specific situations where getline is the right approach, and when it is you have to write it securely. This isn't the right situation and it's not written securely. See awk.freeshell.org/AllAboutGetline.
for (i = 0 ; i < NF; i++) - all field numbers, array indices, and string character positions in awk start at 1, not 0, so write your code to match so you don't trip over thinking arrays or anything else start at zero: for (i = 1 ; i <= NF; i++).
'foo... $0 ~ '/^INFO2/' ...bar' - those inner 's are terminating the awk script body and so exposing what's between them to the shell for interpretation. Never do that. In this case idk why you thought you needed them as your code should just be 'foo... $0 ~ /^INFO2/ ...bar'.
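Putting those points together, the attempt from the question could be rewritten without getline by remembering the INFO1 value when a block starts and printing once INFO2 shows up inside that block. This is a sketch rather than a drop-in solution: it assumes INFO1 is always the last field of the BLOCK-BEGIN line and INFO2 always starts its own line, as in the sample, and the names extract_info, inblock and v1 are made up for illustration.

```shell
extract_info() {
  awk '
    /BLOCK-BEGIN/ {              # a block starts: remember the INFO1 value
      v1 = $NF
      sub(/^INFO1:/, "", v1)
      inblock = 1
      next
    }
    inblock && /^INFO2:/ {       # INFO2 inside a block: emit the pair
      printf "%s,%s\n", v1, substr($0, 7)
      inblock = 0
    }
  ' "$@"
}

printf '%s\n' 'BLOCK-BEGIN bla bla INFO1:VAL1' 'text' 'INFO2:VAL2' \
              'INFO3:not-desired-val5' \
              'BLOCK-BEGIN bla bla INFO1:VAL3' 'INFO2:VAL4' | extract_info
```

Because the inblock flag is cleared after the first INFO2, INFO lines that appear between blocks are ignored. To run it on a real log, pass the file name as an argument instead of piping the input in.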
With your shown samples only, please try following awk code.
awk -F'INFO[0-9]+:' '
/BLOCK-BEGIN/{
if(val2 && val1){
print val1","val2
}
val1=val2=""
val1=$NF
next
}
/^INFO[0-9]+:/{
val2=(val2?val2 ",":"") $NF
}
END{
if(val2 && val1){
print val1","val2
}
}
' Input_file

extract info from a tag using awk

I have a multi-column file and I want to extract some info from column 71.
I want to extract by tag, where the value can be anything; for example, I want to extract just AC=* and AF=*, whatever their values are.
I found a similar question and gave it a try, but it didn't work:
Extract columns with values matching a specific pattern
Column 71 looks like this:
AC=14511;AC_AFR=382;AC_AMR=1177;AC_Adj=14343;AC_EAS=5;AC_FIN=427;AC_Het=11813;AC_Hom=1265;AC_NFE=11027;AC_OTH=97;AC_SAS=1228;AF=0.137;AN=106198;AN_AFR=8190;AN_AMR=10424;AN_Adj=99264;AN_EAS=7068;AN_FIN=6414;AN_NFE=51090;AN_OTH=658;AN_SAS=15420;BaseQRankSum=1.73;ClippingRankSum=-1.460e-01;DB;DP=1268322;FS=0.000;GQ_MEAN=190.24;GQ_STDDEV=319.67;Het_AFR=358;Het_AMR=1049;Het_EAS=5;Het_FIN=399;Het_NFE=8799;Het_OTH=83;Het_SAS=1120;Hom_AFR=12;Hom_AMR=64;Hom_EAS=0;Hom_FIN=14;Hom_NFE=1114;Hom_OTH=7;Hom_SAS=54;InbreedingCoeff=0.0478;MQ=60.00;MQ0=0;MQRankSum=0.037;NCC=270;POSITIVE_TRAIN_SITE;QD=21.41;ReadPosRankSum=0.212;VQSLOD=4.79;culprit=MQ;DP_HIST=30|3209|1539|1494|30007|7938|4130|2038|1310|612|334|185|97|60|31|25|9|11|7|33,0|66|339|1048|2096|2665|2626|1832|1210|584|323|179|89|54|31|22|7|9|4|15;GQ_HIST=84|66|56|82|3299|568|617|403|250|319|436|310|28566|2937|827|834|451|186|217|12591,15|15|13|16|25|11|22|28|18|38|52|31|65|76|39|83|93|65|97|12397;CSQ=T|ENSG00000186868|ENST00000334239|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11502.1|ENSP00000334886|TAU_HUMAN|B4DSE3_HUMAN|UPI0000000C16||||2/8||ENST00000334239.8:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000570299|Transcript|intron_variant&non_coding_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|processed_transcript||||||||||2/6||ENST00000570299.1:n.262-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000340799|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000340438|TAU_HUMAN||UPI000004EEE6||||3/10||ENST00000340799.5:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000262410|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000262410|TAU_HUMAN||UPI0000EE80B7||||4/13||ENST00000262410.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000446361|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protei
n_coding||CCDS11500.1|ENSP00000408975|TAU_HUMAN||UPI000004EEE5||||2/9||ENST00000446361.3:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000574436|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000460965|TAU_HUMAN||UPI000002D754||||3/10||ENST00000574436.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571987|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000458742|TAU_HUMAN||UPI0000EE80B7||||3/12||ENST00000571987.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000415613|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45715.1|ENSP00000410838|TAU_HUMAN||UPI0001AE66E9||||3/13||ENST00000415613.2:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571311|Transcript|intron_variant&NMD_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|nonsense_mediated_decay|||ENSP00000460048||I3L2Z2_HUMAN|UPI00025A2E6E||||4/4||ENST00000571311.1:c.*176-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000535772|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000443028|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||4/10||ENST00000535772.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000576518|Transcript|stop_gained|5499|7|3|K/*|Aag/Tag|rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000458621||I3L170_HUMAN&B4DSE3_HUMAN|UPI0001639A7C|||1/7|||ENST00000576518.1:c.7A>T|ENSP00000458621.1:p.Lys3Ter|T:0.1171|||||||||15792962|||||POSITION:0.00682261208576998&ANN_ORF:-255.6993&MAX_ORF:-255.6993|PHYLOCSF_WEAK|ANC_ALLELE|LC,T|ENSG00000186868|ENST00000420682|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000413056|TAU_HUMAN||UPI000004EEE6||||2/9||ENST00000420682.2:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000572440|Transcri
pt|non_coding_transcript_exon_variant&non_coding_transcript_variant|2790|||||rs754512|1||1|MAPT|HGNC|6893|retained_intron|||||||||1/1|||ENST00000572440.1:n.2790A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000351559|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000303214|TAU_HUMAN||UPI000002D754||||4/11||ENST00000351559.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000344290|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|YES|CCDS45715.1|ENSP00000340820|TAU_HUMAN||UPI0001AE66E9||||4/14||ENST00000344290.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000347967|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000302706|TAU_HUMAN|B4DSE3_HUMAN|UPI0000173D91||||4/10||ENST00000347967.5:c.32-100A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000431008|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000389250|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||3/9||ENST00000431008.3:c.308-94A>T||T:0.1171|||||||||15792962||||||||
The code that i tried:
awk '{
for (i = 1; i <= NF; i++) {
if ($i ~ /AC|AF/) {
printf "%s %s ", $i, $(i + 1)
}
}
print ""
}'
I keep getting a syntax error.
Output wanted:
AC=14511;AF=0.137
Whenever you have name=value pairs, it's usually simplest to first create an array that maps names to values (n2v[] below) and then you can just access the values by their names.
$ cat file
AC=1;AC_AFR=2;AF=3 AC=4;AC_AFR=5;AF=6
$ cat tst.awk
{
delete n2v
split($2,tmp,/[;=]/)
for (i=1; i in tmp; i+=2) {
n2v[tmp[i]] = tmp[i+1]
}
prt("AC")
prt("AF")
}
function prt(name) { print name, "=", n2v[name] }
$ awk -f tst.awk file
AC = 4
AF = 6
Just change $2 to $71 for your real input.
Something like this should do it (the switch version further down needs GNU awk):
$ awk '{split($71,a,";");for(i in a )if(a[i]~/^AF/) print a[i]}' foo
AF=0.137
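You split the field $71 on ;, then loop through the resulting array looking for the desired match. For multiple matches use switch:
$ awk '{
b="";                    # reset the buffer for every record
n=split($0,a,";");       # use $71 instead of $0 for the real input
for(i=1;i<=n;i++)        # indexed loop keeps the original order
switch(a[i]) {
case /^AF=/:
b=b a[i] OFS;
break;
case /^AC=/:
b=b a[i] OFS;
break
}
sub(/.$/,"\n",b);
printf "%s", b
}' foo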
AC=14511 AF=0.137
EDIT: Now it buffers output to a variable and prints it in the end. You can control the separator with OFS.
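The attempt in the question loops over whitespace-separated fields and tests them with the unanchored regex /AC|AF/, which also matches AC_AFR, AC_AMR and so on. A portable alternative (no switch needed) is to split the one column on ; and keep only entries whose name is exactly AC or AF. This is a minimal sketch; the function name pick_tags and its column argument are made up for illustration, and it assumes whitespace-separated columns.

```shell
pick_tags() {
  # $1: 1-based number of the column that holds the name=value list
  awk -v col="$1" '{
    n = split($col, kv, ";")
    out = ""
    for (i = 1; i <= n; i++)
      if (kv[i] ~ /^(AC|AF)=/)               # anchored: AC_AFR= does not match
        out = out (out == "" ? "" : ";") kv[i]
    print out
  }'
}

printf '%s\n' 'x AC=14511;AC_AFR=382;AF=0.137;AN=106198' | pick_tags 2
```

For the shortened sample line this should print AC=14511;AF=0.137. For the real file the call would be pick_tags 71 with the file on standard input.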

awk append lines if certain field matches

I'm an absolute beginner to awk and would like some help with this.
I have this data:
FOO|BAR|1234|A|B|C|D|
FOO|BAR|1234|E|F|G|H|
FOO|BAR|5678|I|J|K|L|
FOO|BAR|5678|M|N|O|P|
FOO|BAR|5678|Q|R|S|T|
Desired output:
FOO|BAR|1234|A|B|C|D|E|F|G|H|
FOO|BAR|5678|I|J|K|L|M|N|O|P|Q|R|S|T|
Basically I have to append some fields to the lines where column 3 matches.
Appreciate any responses, thanks a lot!
Another way:
awk -F"|" '$3 in a{
a[$3]=a[$3] $4"|"$5"|"$6"|"$7"|";
next
}
{ a[$3]=$0
}
END {
for ( i in a) {
print a[i]
}
}'
$ awk -f chain.awk < data
FOO|BAR|1234|A|B|C|D|E|F|G|H|
FOO|BAR|5678|I|J|K|L|M|N|O|P|Q|R|S|T|
$ cat chain.awk
BEGIN {FS = "|"}
$3==old {for(i = 4; i <= NF; i++) saved = saved (i>4?"|":"") $i}
$3!=old {if(old) print saved ; saved = $0 ; old = $3}
END {print saved}
$
BEGIN we set the field separator
$3==old we append the fields $4 ... $NF to the saved data, joining the fields with | except for the first one (note that there is a last, null field)
$3!=old we print the saved data (except for the first record, when old is false) and we restart the mechanism
END we still have saved data in our belly, we have to print it
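Both answers above have a limitation: the array version prints in whatever order for (i in a) happens to produce, and chain.awk only merges lines whose keys are adjacent. If the lines for one key can be scattered through the file, one option is to remember the order in which keys first appear. This is a sketch under the assumption that every line ends with a trailing |, as in the sample; the names merge_by_key, row and order are made up for illustration.

```shell
merge_by_key() {
  awk -F'|' '
    !($3 in row) {                 # first time this key is seen
      order[++n] = $3
      row[$3] = $0
      next
    }
    {
      rest = $0
      sub(/^[^|]*\|[^|]*\|[^|]*\|/, "", rest)   # drop the first three fields
      row[$3] = row[$3] rest                    # stored line already ends in |
    }
    END { for (i = 1; i <= n; i++) print row[order[i]] }
  ' "$@"
}

printf '%s\n' 'FOO|BAR|1234|A|B|C|D|' 'FOO|BAR|5678|I|J|K|L|' \
              'FOO|BAR|1234|E|F|G|H|' | merge_by_key
```

This keeps everything in memory until END, so it trades memory for not needing sorted input.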

Why does awk "not in" array work just like awk "in" array?

Here's an awk script that attempts to compute the set difference of two files based on their first column:
BEGIN{
OFS=FS="\t"
file = ARGV[1]
while (getline < file)
Contained[$1] = $1
delete ARGV[1]
}
$1 not in Contained{
print $0
}
Here is TestFileA:
cat
dog
frog
Here is TestFileB:
ee
cat
dog
frog
However, when I run the following command:
gawk -f Diff.awk TestFileA TestFileB
I get the output just as if the script had contained "in":
cat
dog
frog
While I am uncertain about whether "not in" is correct syntax for my intent, I'm very curious about why it behaves exactly the same way as when I wrote "in".
I cannot find any doc about element not in array.
Try !(element in array).
I guess: awk sees not as an uninitialized variable, so not is evaluated as an empty string.
$1 not == $1 "" == $1
I figured this one out. The ( x in array ) returns a value, so to do "not in array", you have to do this:
if ( x in array == 0 )
print "x is not in the array"
or in your example:
($1 in Contained == 0){
print $0
}
In my solution for this problem I use the following if-else statement:
if($1 in contained);else{print "Here goes your code for \"not in\""}
Not sure if this is anything like you were trying to do.
#!/bin/awk -f
# Reads the first file argument and makes a hash of the tokens found in
# column one. Then it reads the input files and prints the column-one token
# of any line whose token is not already in the hash.
BEGIN{
OFS=FS="\t"
file = ARGV[1]
while ((getline < file) > 0)
Contained[$1] = $1
# delete ARGV[1] # I don't know what you were thinking here
# for(i in Contained) {print Contained[i]} # debugging, not just for sadists
close (ARGV[1])
}
{
if ($1 in Contained){} else { print $1 }
}
On the awk command line I use:
! ($1 in a)
where $1 is the value to test and a is the array.
Example:
awk 'NR==FNR{a[$1];next}! ($1 in a) {print $1}' file1 file2
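To see why "not in" behaved like "in" in the question, it can help to compare the two forms side by side. In the sketch below, not is just an uninitialized variable, so k not in seen is parsed as the concatenation (k not) followed by an ordinary in test. The function name not_in_demo and the sample keys are made up for illustration.

```shell
not_in_demo() {
  awk 'BEGIN {
    seen["cat"]
    split("cat eel", keys, " ")
    for (i = 1; i <= 2; i++) {
      k = keys[i]
      concat  = (k not in seen) ? "yes" : "no"   # really ((k "") in seen)
      negated = !(k in seen)    ? "yes" : "no"   # the actual negation
      printf "%s: not-in=%s negated=%s\n", k, concat, negated
    }
  }'
}

not_in_demo
```

With an awk that follows the POSIX grammar this should print "cat: not-in=yes negated=no" and then "eel: not-in=no negated=yes", i.e. the "not in" form gives exactly the same answer as a plain in.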

Use Awk to Print every character as its own column?

I am in need of reorganizing a large CSV file. The first column, which is currently a 6 digit number needs to be split up, using commas as the field separator.
For example, I need this:
022250,10:50 AM,274,22,50
022255,11:55 AM,275,22,55
turned into this:
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Let me know what you think!
Thanks!
It's a lot shorter in perl:
perl -F, -ane '$,=","; print split("",$F[0]), @F[1..$#F]' <file>
Since you don't know perl, a quick explanation. -F, indicates the input field separator is the comma (like awk). -a activates auto-split (into the array @F), -n implicitly wraps the code in a while (<>) { ... } loop, which reads input line-by-line. -e indicates the next argument is the script to run. $, is the output field separator (it gets set on every iteration of the loop this way, but oh well). split has an obvious purpose, and you can see how the array is indexed/sliced. print, when given lists as arguments like this, uses the output field separator and prints all their elements.
In awk:
awk -F, '{n=split($1,a,""); for (i=1;i<=n;i++) {printf("%s,",a[i])}; for (i=2;i<NF;i++) {printf("%s,",$i)}; print $NF}' <file>
I think this might work. The split function (at least in the version I am running) splits the value into individual characters if the third parameter is an empty string.
BEGIN{ FS="," }
{
n = split( $1, a, "" );
for ( i = 1; i <= n; i++ )
printf("%s,", a[i] );
sep = "";
for ( i = 2; i <= NF; i++ )
{
printf( "%s%s", sep, $i );
sep = ",";
}
printf("\n");
}
Here's another way in awk:
$ awk -F"," '{gsub(".",",&",$1);sub("^,","",$1)}1' OFS="," file
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Here's a variation on a theme. One thing to note is it prints the remaining fields without using a loop. Another is that since you're looping over the characters in the first field anyway, why not just do it without using the null-delimiter feature of split() (which may not be present in some versions of AWK):
awk -F, 'BEGIN{OFS=","} {len=length($1); for (i=1;i<len; i++) {printf "%s,", substr($1,i,1)}; printf "%s", substr($1,len,1);$1=""; print $0}' filename
As a script:
BEGIN {FS = OFS = ","}
{
len = length($1);
for (i=1; i<len; i++)
{printf "%s,", substr($1, i, 1)};
printf "%s", substr($1, len, 1)
$1 = "";
print $0
}