Printing blank lines when looping over an associative array in AWK

I am not clear on why blank lines are being printed instead of the correct values from the day[] array in AWK.
BEGIN{
  day[1]="Sunday"
  day["first"]="Sunday"
  day[2]="Monday"
  day["second"]="Monday"
  day[4]="Wednesday"
  day["fourth"]="Wednesday"
  day[3]="Tuesday"
  day["third"]="Tuesday"
  for (i in day)
  {
    print $i
    print day[$i]
  }
}
Explicitly printing out individual array elements yields the expected values, as follows:
BEGIN{
  day[1]="Sunday"
  day["first"]="Sunday"
  day[2]="Monday"
  day["second"]="Monday"
  day[4]="Wednesday"
  day["fourth"]="Wednesday"
  day[3]="Tuesday"
  day["third"]="Tuesday"
  print day[1]
  print day["first"]
  print day[2]
  print day["second"]
  print day[3]
  print day["third"]
  print day[4]
  print day["fourth"]
}
I am running Linux Fedora 5.12.11-300.
Many thanks in advance,
Mary

You shouldn't use $ when printing i or the array value: in awk, $ refers to a field of the current record. Use the following instead. Also, you don't need two separate print statements; you can use a single print with a newline (ORS) in it.
awk '
BEGIN{
  day[1]="Sunday"
  day["first"]="Sunday"
  day[2]="Monday"
  day["second"]="Monday"
  day[4]="Wednesday"
  day["fourth"]="Wednesday"
  day[3]="Tuesday"
  day["third"]="Tuesday"
  for (i in day)
  {
    print i ORS day[i]
  }
}'
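To see concretely why $i printed blanks: $i means "field number i of the current record", and in a BEGIN block no record has been read yet, so every field is empty (and day[$i] becomes day[""], another empty value). A quick illustration against a one-line input (my own example, not from the question):
echo "alpha beta" | awk '{ i = 2; print i; print $i }'
2
beta
In BEGIN, the same print $i produces only an empty line.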
Improved version: you also need not use two statements for the same value; you can chain them into a single assignment. Even with different indexes sharing the same value this works, which saves a few lines of code :)
awk '
BEGIN{
  day[1]=day["first"]="Sunday"
  day[2]=day["second"]="Monday"
  day[4]=day["fourth"]="Wednesday"
  day[3]=day["third"]="Tuesday"
  for (i in day)
  {
    print i OFS day[i]
  }
}'

A less verbose way of doing this is by splitting input strings, e.g.:
awk -v days='Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday' \
    -v cardinal='First,Second,Third,Fourth,Fifth,Sixth,Seventh' '
BEGIN {
  split(days, days_ar, /,/)
  split(cardinal, cardinal_ar, /,/)
  for (i=1; i<=7; i++)
    print cardinal_ar[i] " = " days_ar[i]
}' | column -t
Output:
First = Monday
Second = Tuesday
Third = Wednesday
Fourth = Thursday
Fifth = Friday
Sixth = Saturday
Seventh = Sunday

Thank you so much to all those who contributed to answering my call for help.
I chose the first answer because I am interested in finding out why the array values weren't being printed. RavinderSingh13 correctly identified the reason, which was the use of $ with the variable i. I had mistakenly treated Awk like shell scripting. Below is the code that printed the array values, without the $:
awk '
BEGIN{
  day[1]=day["first"]="Sunday"
  day[2]=day["second"]="Monday"
  day[3]=day["third"]="Tuesday"
  day[4]=day["fourth"]="Wednesday"
  for (i in day)
  {
    print i,day[i]
  }
}'
One interesting point I have learnt about Awk scripting is that the BEGIN section always runs, with or without input file(s). The main body and the END section, on the other hand, do not run unless the Awk invocation is given at least one input file (or input on stdin).
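A quick demonstration of that behaviour (my own illustration, not from the thread):
awk 'BEGIN{print "from BEGIN"}'      # exits immediately, no input needed
from BEGIN
awk 'BEGIN{print "begin"} {print "body"} END{print "end"}' /dev/null
begin
end
The BEGIN block fires before any input is read, so a BEGIN-only program never waits for input; the main body runs once per record (here, zero times), and END runs once the input phase finishes.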

Related

How to find and match an exact string in a column using AWK?

I'm having trouble on matching an exact string that I want to find in a file using awk.
I have the file called "sup_groups.txt" that contains:
(the structure is: "group_name:pw:group_id:user1<,user2>...")
adm:x:4:syslog,adm1
admins:x:1006:adm2,adm12,manuel
ssl-cert:x:122:postgres
ala2:x:1009:aceto,salvemini
conda:x:1011:giovannelli,galise,aceto,caputo,haymele,salvemini,scala,adm2,adm12
adm1Group:x:1022:adm2,adm1,adm3
docker:x:998:manuel
Now, I want to extract the records that have the user "adm1" in the user list and print the first column (the group name). But you can see that there is a user called "adm12", so when I do this:
awk -F: '$4 ~ "adm1" {print $1}' sup_groups.txt
the output is:
adm
admins
conda
adm1Group
The command of course also prints those records that contain the string "adm12", but I don't want those lines because I'm interested only in the user "adm1".
So, how can I change this command so that it prints just lines 1 and 6 (excluding 2 and 5)?
Thank you so much, and sorry for my bad English.
EDIT: Thank you for the answers; you gave me inspiration for the solution. I think this might work as well as your solutions, but is simpler:
awk -F: '$4 ~ "adm,|adm1$|:adm1," {print $1}' sup_groups.txt
Basically I'm using ORs to cover all the cases while excluding "adm12".
Let me know if you think this is correct.
1st solution: Using split function of awk. With your shown samples, please try following awk code.
awk -F':' '
{
  num=split($4,arr,",")
  for(i=1;i<=num;i++){
    if(arr[i]=="adm1"){
      print
    }
  }
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':' '               ##Starting awk program from here, setting field separator as : here.
{
  num=split($4,arr,",")   ##Using split to split 4th field into array arr with delimiter of ,
  for(i=1;i<=num;i++){    ##Running for loop till value of num (total elements of array arr).
    if(arr[i]=="adm1"){   ##Checking condition if arr[i] value is equal to adm1, then do following.
      print               ##Printing current line here.
    }
  }
}
' Input_file              ##Mentioning Input_file name here.
2nd solution: Using regex and conditions in awk.
awk -F':' '$4~/^adm1,/ || $4~/,adm1,/ || $4~/,adm1$/' Input_file
OR, if the 4th field may not contain a comma at all, then try the following:
awk -F':' '$4~/^adm1,/ || $4~/,adm1,/ || $4~/,adm1$/ || $4=="adm1"' Input_file
Explanation: set the field separator to : and print the line if the 4th field matches ^adm1, (starts with adm1,), contains ,adm1,, or matches ,adm1$ (ends with ,adm1).
This should do the trick:
$ awk -F: '"," $4 "," ~ ",adm1," { print $1 }' file
The idea behind this is to wrap the user-list field in commas so that each entry is surrounded by commas on both sides. Then, instead of searching for adm1, you search for ,adm1,.
So if your list looks like:
adm2,adm12,manuel
and, by adding commas, you convert it to:
,adm2,adm12,manuel,
you can always search for ,adm1, and find an exact match.
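If regex metacharacters in the user name are ever a concern, the same comma-wrapping trick works with a literal index() test instead of a regex match (a sketch; the user variable is my own addition):
awk -F: -v user='adm1' 'index("," $4 ",", "," user ",") { print $1 }' sup_groups.txt
adm
adm1Group
index() returns the (truthy) position of the exact substring ,adm1,, so ,adm12, can never match.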
Once you set up FS per the task requirements, the main body becomes barely more than:
NF = !_ < NF
or, even more straightforward:
{m,n,g}awk -- --NF
=
{m,g}awk 'NF=!_<NF' OFS= FS=':[^:]*:[^:]*:[^:]*[^[:alpha:]]?adm[0-9]+.*$'
adm
admins
conda
adm1Group
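In case the golfed line is opaque: _ is never assigned, so !_ evaluates to 1, and NF = !_ < NF is effectively NF = (1 < NF). The FS regex swallows everything after the group name on matching lines, leaving NF == 2 there, so NF drops to 1 and the truthy assignment triggers the default print of the lone surviving field; non-matching lines have NF == 1, get NF = 0, and stay silent. A long-hand equivalent (my reading of the answer, not the author's wording):
awk 'NF = (1 < NF)' OFS= FS=':[^:]*:[^:]*:[^:]*[^[:alpha:]]?adm[0-9]+.*$' sup_groups.txt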

AWK FPAT not working as expected for string parsing

I have to parse a very long string (from stdin). It is basically a .sql file from which I have to get data. I am parsing the data so that I can convert it into csv. For this, I am using awk. For my case, a sample snippet (of two records) is as follows:
b="(abc#xyz.com,www.example.com,'field2,(2)'),(dfr#xyz.com,www.example.com,'field0'),"
echo $b|awk 'BEGIN {FPAT = "([^\\)]+)|('\''[^'\'']+'\'')"}{print $1}'
In my regex, I am saying: split on the ")" bracket, or, if a single quote is found, ignore all text until the closing quote. But my output is as follows:
(abc#xyz.com,www.example.com,'field2,(2
I am expecting this output
(abc#xyz.com,www.example.com,'field2,(2)'
Where is the problem in my code? I have searched a lot and checked the awk manual, but without success.
My first answer below was wrong; there is an ERE for what you're trying to do:
$ echo "$b" | awk -v FPAT="[(]([^)]|'[^']*')*)" '{for (i=1; i<=NF; i++) print $i}'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
Original answer, left as a different approach:
You need a 2-pass approach: first replace all )s within quoted fields with something that can't already exist in the input (e.g. RS), then identify the (...) fields and put the RSs back to )s before printing them:
$ echo "$b" |
awk -F"'" -v OFS= '
{
for (i=2; i<=NF; i+=2) {
gsub(/)/,RS,$i)
$i = FS $i FS
}
FPAT = "[(][^)]*)"
$0 = $0
for (i=1; i<=NF; i++) {
gsub(RS,")",$i)
print $i
}
FS = FS
}
'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
The above is gawk-only due to FPAT (or we could have used gawk's patsplit()); with other awks you'd use a while-match()-substr() loop:
$ echo "$b" |
awk -F"'" -v OFS= '
{
for (i=2; i<=NF; i+=2) {
gsub(/)/,RS,$i)
$i = FS $i FS
}
while ( match($0,/[(][^)]*)/) ) {
field = substr($0,RSTART,RLENGTH)
gsub(RS,")",field)
print field
$0 = substr($0,RSTART+RLENGTH)
}
}
'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
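For completeness, the patsplit() variant mentioned above might look like this (a sketch, gawk 4+ only; \047 stands for the single quote to dodge shell quoting):
$ echo "$b" | gawk '{
  n = patsplit($0, f, "[(]([^)]|\047[^\047]*\047)*[)]")
  for (i = 1; i <= n; i++) print f[i]
}'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')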
Written and tested with your shown samples in GNU awk. This can be done with a simple field separator setting; try the following, where b is your shell variable holding the value shown.
echo "$b" | awk -F'\\),\\(' '{print $1}'
(abc#xyz.com,www.example.com,'field2,(2)'
Explanation: simply set the field separator of the awk program to \\),\\( for your input and print its first field.
Similar regex approach as Ed has suggested but I usually prefer using RS and RT over FPAT:
b="(abc#xyz.com,www.example.com,'field2,(2)'),(dfr#xyz.com,www.example.com,'field0'),"
awk -v RS="[(]('[^']*'|[^)])*[)]" 'RT {print RT}' <<< "$b"
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
If you want to do it in close to one pass, maybe try this:
{mawk/mawk2/gawk} 'BEGIN { OFS = FS = "\047"; ORS = RS = "\n";
                           XFS = "\376\004\377";
                           XRS = "\051" ORS; }
! /[\051]/ { print; next; }
{ for (x=1; x <= NF; x += 2) {
    gsub(/[\051][^\050]*/, XFS, $(x)); } }
gsub(XFS, XRS) || 1'
I did it this way, with 2 gsubs, just in case it starts sending rows downstream with unintended consequences. \051 = ")", \050 is the open one.
I further enhanced it by telling it to instantly print and move on if no close bracket is found at all (so there is nothing to split).
It only loops over odd-numbered fields once I split by the single quote \047 (because the even-numbered ones are precisely the ones inside a pair of single quotes, which you want to avoid chopping at).
As for XFS, just pick any combination of bytes that are almost impossible to encounter. If you want to play it safe, you can test whether XFS already exists in that row and use some alternative combo. It basically inserts a delimiter into the middle of the row that won't collide with actual input data. It's not foolproof per se, but the likelihood of running into that combination of a UTF-16 byte-order mark and ASCII control characters is reasonably low.
(And if you do encounter XFS, you likely have corrupted data to begin with, since a 300-series octal byte must be followed by 200-series ones to be valid UTF-8.)
This way, I don't need FPAT at all.
*Updated with " || 1" towards the end as a safety catch-all, but it shouldn't really be needed.
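A minimal sketch of that XFS safety test, written as an extra rule you could place ahead of the main block above (the warning text is mine):
index($0, XFS) { print "XFS sentinel already present in record " NR > "/dev/stderr"; next }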

awk to extract and print first occurrence of patterns

I am trying to use awk to extract and print the first occurrence of NM_ and the portion after NP_ starting with p.. A : should be printed instead of the "|" for each. The input file is tab-delimited, but the output does not need to be. The code below does execute, but prints all the lines in the file, not just the patterns. There may be multiple NM or NP entries in my actual data of over 5000 lines; however, only the first occurrence of each should be extracted and printed. I am still a little unclear on the RSTART and RLENGTH concepts but, using line 1 of the input as an example:
The NM variable would be NM_020469.2
The NP variable would be :p.Gly268Arg
I have included comments as well. Thank you :).
input
Input Variant HGVS description(s) Errors and warnings
rs41302905 NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg
rs8176745 NC_000009.11:g.136131347G>A|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=
desired output
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
awk
awk -F'[\t|]' 'NR>1{                 # define FS as tab and `|`, and skip the header line
  r=$1; nm=np="";                    # save $1 in r and create 2 variables (one for nm, the other for np), set to null
  for(i=2;i<=NF;i++) {               # loop over the fields, starting from field 2
    if ($i~/^NM_/) nm=$i;            # capture the first NM_ field into nm
    else if ($i~/^NP_/) np=substr($i,index($i,":"));  # capture the NP_ portion after : (including :)
    if (nm && np) { print r,nm np; break }            # print desired output and stop
  }
}' input
Awk solution:
awk -F'[\t|]' 'NR>1{
  r=$1; nm=np="";
  for(i=2;i<=NF;i++) {
    if ($i~/^NM_/) nm=$i;
    else if ($i~/^NP_/) np=substr($i,index($i,":"));
    if (nm && np) { print r,nm np; break }
  }
}' input
NR>1 - start processing from the 2nd record
r=$1; nm=np="" - initialization of the needed variables
for(i=2;i<=NF;i++) - iterating through the fields (starting from the 2nd)
if ($i~/^NM_/) nm=$i - capturing the NM_... item into variable nm
else if ($i~/^NP_/) np=substr($i,index($i,":")) - capturing the NP_... item into variable np (from : till the end)
if (nm && np) { print r,nm np; break } - if both items have been captured, print them and break the loop to avoid further processing
The output:
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
Could you please try following and let me know if this helps too.
awk '{
  match($0,/NM_[^|]*/);
  nm=substr($0,RSTART,RLENGTH);
  match($0,/NP_([^|]|[^$])*/);
  np=substr($0,RSTART,RLENGTH);
  split(np, a,":");
  if(nm && np){
    print $1,nm ":" a[2]
  }
}
' Input_file
Output will be as follows.
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
PS: Since your sample Input_file doesn't contain TABs, no separator is set above; you could add -F'\t' after awk in case your actual Input_file is TAB delimited, and if you want the output TAB delimited too, add OFS="\t" before Input_file.
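For instance, a sketch of that TAB-delimited variant (same body as above, only FS/OFS added; assumes the real file is tab-separated):
awk 'BEGIN{ FS=OFS="\t" }
{
  match($0,/NM_[^|]*/);
  nm=substr($0,RSTART,RLENGTH);
  match($0,/NP_([^|]|[^$])*/);
  np=substr($0,RSTART,RLENGTH);
  split(np, a,":");
  if(nm && np){
    print $1,nm ":" a[2]
  }
}
' Input_file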
Short GNU awk solution (with match function):
awk 'match($0,/(NM_[^|]+).*NP_[^:]+([^[:space:]|]+)/,a){ print $1,a[1] a[2] }' input
The output:
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
Given your posted sample input, this is all you need to produce your desired output:
$ awk -F'[\t|]+' 'NR>1{sub(/[^:]+/,"",$4); print $1, $3 $4}' file
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
If that's not all you need then provide more truly representative input/output.
Another alternative awk proposal.
awk 'NR>1{sub(/\|/," ")sub(/\|NP_065202.2/,"");print $1,$3,$4}' file
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

How to use multiple passes with gawk?

I'm trying to use GAWK from CYGWIN to process a csv file. Pass 1 finds the max value, and pass 2 prints the records that match the max value. I'm using a .awk file as input. When I use the text in the manual, it matches on both passes. I can use the IF form as a workaround, but that forces me to use IF inside every pattern match, which is kind of a pain. Any idea what I'm doing wrong?
Here's my .awk file:
pass == 1
{
print "pass1 is", pass;
}
pass == 2
{
if(pass == 2)
print "pass2 is", pass;
}
Here's my output (the input file is just "hello"):
hello
pass1 is 1
pass1 is 2
hello
pass2 is 2
Here's my command line:
gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt
I'd appreciate any help.
A (g)awk solution might look like this:
awk 'FNR == NR{print "1st pass"; next}
{print "second pass"}' x.txt x.txt
(Please replace awk by gawk if necessary.)
Let's say you wanted to search for the maximum value in the first column of file x.txt and then print all lines which have this value in the first column; your program might look like this (thanks to Ed Morton for the tip, see comment):
awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
$1==max' x.txt x.txt
The output for x.txt:
6,5
2,6
5,7
6,9
is
6,5
6,9
How does this work? The variable NR keeps increasing with every record, whereas FNR is reset to 1 when reading a new file. Therefore, FNR==NR is only true for the first file processed.
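A quick way to watch those two counters diverge (my own illustration):
$ printf 'a\nb\n' > t.txt
$ awk '{ print FILENAME, FNR, NR }' t.txt t.txt
t.txt 1 1
t.txt 2 2
t.txt 1 3
t.txt 2 4
FNR==NR holds only for the first two records, i.e. the first pass over the file.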
So... F.Knorr answered your question accurately and concisely, and he deserves a big green checkmark. NR==FNR is exactly the secret sauce you're looking for.
But here is a different approach, just in case the multi-pass thing proves to be problematic. (Perhaps you're reading the file from a slow drive, a USB stick, across a network, DAT tape, etc.)
awk -F, '$1>m{delete l;n=0;m=$1}m==$1{l[++n]=$0}END{for(i=1;i<=n;i++)print l[i]}' inputfile
Or, spaced out for easier reading:
BEGIN {
  FS=","
}
$1 > max {
  delete list   # empty the array
  n=0           # reset the array counter
  max=$1        # set a new max
}
max==$1 {
  list[++n]=$0  # record the line in our array
}
END {
  for(i=1;i<=n;i++) {  # print the array in the order the lines were found
    print list[i]
  }
}
With the same input data that F.Knorr tested with, I get the same results.
The idea here is to go through the file in ONE pass. We record every line that matches our max in an array, and if we come across a value that exceeds the max, we clear the array and start collecting lines afresh.
This approach is heavier on CPU and memory (depending on the size of your dataset), but being single-pass, it is likely to be lighter on IO.
The issue here is that newlines matter to awk.
# This does what I should have done:
pass==1 {print "pass1 is", pass;}
pass==2 {if (pass==2) print "pass2 is", pass;}
# This is the code in my question:
# When pass == 1, print the record (a bare pattern's default action is print)
pass==1
# On every record, regardless of pass, do this
{print "pass1 is", pass;}
# When pass == 2, print the record
pass==2
# On every record, do this (the inner if makes it conditional)
{if (pass==2) print "pass2 is", pass;}
Using pass==1, pass==2 isn't as elegant, but it works.

How to use grep via awk?

I am trying to apply grep to just a few strings from a huge file, but I'd like to pass each such line to the grep command from within the awk script, and I want the output returned to the script.
I have an awk script that reads in records from a file. I want grep to be applied to only a few of the records. The current record, $0, will be the text on which grep is to be used.
How do I do that? Currently, I'm trying this:
system("grep --count -w 'GOOD' \n" $0)
But, it doesn't seem to work. What should I be using?
In Gnu Awk you could use \< and \> to match beginning and end of a word, so
gawk '/\<GOOD\>/{++i} END{print i}'
will do the same as
grep -wc 'GOOD' file
If you want to count the total number of occurrences (not only the number of lines, but also multiple occurrences within a given line/record) of the word GOOD, you could use FPAT in Gnu Awk version 4, as
gawk 'BEGIN { FPAT="\\<GOOD\\>"; RS="^$" } { print NF }' file
If you want to count the number of exact matches of the phrase GOOD DI in a given record, for instance record number 3, you could use
gawk 'NR==3 { print patsplit($0,a,/GOOD DI/) }' file
Your question is not very clear, and it would help if you showed some of your input file, your entire script that you have so far and also the output you want to achieve.
In the meantime, as there is nothing in your question to suggest anything to the contrary, you could do the following:
awk 'somescript' somefile | grep --count -w 'GOOD DI'
You cannot apply grep to a text string, which is what you are doing. If you really need to use grep/system, something like the following would be needed:
system("echo '"$0"' | grep --count -w 'foo'")
But this is no good either, as --count only counts the lines on which the word occurs, not the number of times it occurs on a line, which is what you seem to be after.
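If you really must shell out, one way to count occurrences rather than matching lines is grep -o, which prints every match on its own line so wc -l can count them (a sketch in the same spirit as the snippet above, with the same quoting caveats):
system("echo '" $0 "' | grep -o -w 'GOOD' | wc -l")
Still, counting natively in awk, as the other answers do, avoids spawning a shell for every record.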
If you use the regex as the split separator, the number of pieces you get back is the number of occurrences + 1.
So the following will work:
awk '{printf FNR; a=split($0,myarray,/.OOD/); print " "a-1}' file.txt
This prints each line number with the number of times your regex occurred (in this case ".OOD", representing GOOD, FOOD, MOOD etc.).
You can also do it the old-fashioned way:
awk 'BEGIN{count=0} {
  for( i=1;i<=NF; i++) {
    if( $i == "GOOD" ){
      ++count
    }
  }
} END {
  print count
}' file