Awk: field number of matched pattern

I was wondering if there's a built-in command in awk to get the field number of the phrase you just matched.
Banana is yellow.
awk '/yellow/{ for (i=1;i<=NF;i++) if($i ~ /yellow/) print $i }'
Is there a way to avoid writing the loop?

Your command doesn't work when I test it. Here's my version:
echo "banana is yellow" | awk '{for (i=1;i<=NF;i++) if($i ~/yellow/) print i}'
The output is:
3
As far as I know, there's no such built-in feature. To improve your command: the pattern match /yellow/ at the beginning is not necessary, and $i prints the matching field itself rather than the field number you need.
Alternatively, you can use an array to store each field and its corresponding index number, and then look the index up with arr["yellow"].
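For instance, a minimal sketch of that idea (note that an array keyed on the field text only finds exact, whole-field matches, unlike the regex test in the loop):
echo "banana is yellow" | awk '{for (i=1;i<=NF;i++) arr[$i]=i; print arr["yellow"]}'
3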

If the input is a one-line string, you can set the record separator to the field separator. Doing so lets you use NR to print the position:
awk 'BEGIN{RS=FS}/yellow/{print NR}' <<< 'banana is yellow'
3

Related

awk: counting fields in a variable

Given a string like {running_db_nodes,[ejabberd#host002,ejabberd#host001]}, (note the trailing comma), how can the number of comma-delimited strings inside the square brackets be counted?
The useful substring can be extracted with gensub:
awk '/running_db_nodes/ {print gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1)}'
A naive approach with NF gets fields from the original input string:
awk -F, '/running_db_nodes/ {nodes=gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1); print NF}'
How could the number of fields in a variable like nodes in the last example be extracted?
You can set your FS to the characters [ and ], then split $2 into an array and capture the count of elements returned by split():
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk -F"[][]" '{print split($2,a,",")}'
2
With the samples and attempts you have shown, please try the following awk code.
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk '
{
gsub(/.*\[|\].*$/,"")
print gsub(/,/,"&")+1
}
'
Explanation:
gsub(/.*\[|\].*$/,""): globally substitutes everything from the start of the line up to [, and everything from ] to the end of the line, with the empty string, leaving only the bracketed list.
print gsub(/,/,"&")+1: globally substitutes , with itself (just to count the commas, since gsub returns the number of substitutions), adds 1, and prints the result.
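As a standalone illustration of the counting trick, gsub's return value counts the commas, and adding 1 gives the number of list elements:
echo "ejabberd#host002,ejabberd#host001" | awk '{print gsub(/,/,"&")+1}'
2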
Regarding "A naive approach with NF gets fields from the original input string": gensub does not change the string it operates on. You might use sub (or gsub) instead, which does alter the string it works on, and thereby updates the values of the relevant built-in variables, that is:
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}" | awk 'BEGIN{FS=","}{sub(/^.*\[/,"");sub(/].*$/,"");print NF}'
gives output
2
Explanation: use sub to delete everything up to and including [, then delete ] and everything after it, and print the number of fields.
(tested in GNU Awk 5.0.1)

Countif-like function in AWK with field headers

I am looking for a way to count the number of times a value in a field appears in a range of fields in a csv file, much the same as COUNTIF in Excel, although I would like to use an awk command if possible.
So column 6 holds the range of values, and column 7 should hold the number of times that value appears in column 6, as per below:
>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ
What I want is:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
# adds the field name count that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"}
NR>1{ print $0}' file3
How do I get the output I want?
I have tried this from a previous, similar question, but no joy:
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,
I would harness GNU AWK for this task in the following way. Let file.txt content be
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
then
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Explanation: this is a two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done with the header. Then for the first pass, i.e. where the global line number (NR) equals the line number within the current file (FNR), I count the occurrences of the values in the 6th field, storing them in the array arr, and instruct GNU AWK to go to the next line, so nothing else is done in this pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from the array arr.
(tested in GNU Awk 5.0.1)
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6], since it's the sixth field that has the key you're counting (the loop below also starts at the second stored line, since the header line is never stored in l):
awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}'
You can also output the header with the new field name:
awk -F, -vOFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}'
One awk idea:
awk '
BEGIN { FS=OFS="," } # define input/output field delimiters as comma
{ lines[NR]=$0
if (NR==1) next
col6[NR]=$6 # copy field 6 so we do not have to parse the contents of lines[] in the END block
cnt[$6]++
}
END { for (i=1;i<=NR;i++)
print lines[i], (i==1 ? "count" : cnt[col6[i]] )
}
' file3
This generates:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

Replace case of 2nd column of dataset within awk?

I'm trying the command
awk 'BEGIN{FS=","}NR>1{tolower(substr($2,2))} {print $0}' emp.txt
on the data below, but it's not working:
M_ID,M_NAME,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,AJIT,D003,8-Mar-07,8-Sep-07,70000
M004,SHARVARI,D004,28-Mar-07,28-Mar-08,120000
M005,ADITYA,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999
Expected output
M_ID,M_NAME,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,Ajit,D003,8-Mar-07,8-Sep-07,70000
M004,Sharvari,D004,28-Mar-07,28-Mar-08,120000
M005,Aditya,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999
With your shown samples, in GNU awk, please try the following awk code. It uses GNU awk's match function with the regex (^[^,]*,.)([^,]*)(.*), which creates 3 capturing groups and stores their values in an array named arr (whose indexes are 1, 2, 3 and so on, depending on the number of capturing groups). When the match succeeds, the array elements are printed, applying the tolower function to the 2nd element of arr to lowercase it and produce the expected output.
awk '
FNR==1{
print
next
}
match($0,/(^[^,]*,.)([^,]*)(.*)/,arr){
print arr[1] tolower(arr[2]) arr[3]
}
' Input_file
You need to assign the result of tolower() to something; it doesn't operate in place. And in this case, you need to concatenate it with the first character of the field and assign that back to the field:
$2 = substr($2, 1, 1) tolower(substr($2, 2));
To get comma separators in the output file, you need to set OFS. So you need:
BEGIN {OFS=FS=","}
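Putting both pieces together, a complete command might look like this sketch (it leaves the header line untouched):
awk 'BEGIN{OFS=FS=","} NR>1{$2 = substr($2,1,1) tolower(substr($2,2))} {print}' emp.txt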
mawk, gawk, or nawk:
awk 'BEGIN { _+=_^=FS=OFS="," } NR<_ || $_ = substr( toupper($(+_)=\
tolower($_)), --_,_) substr($++_,_)'
M_ID,M_name,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,Ajit,D003,8-Mar-07,8-Sep-07,70000
M004,Sharvari,D004,28-Mar-07,28-Mar-08,120000
M005,Aditya,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999

awk / gawk printf when variable format string, changing zero to dash

I have a table of numbers I am printing in awk using printf.
The printf accomplishes some truncation for the numbers.
(cat <<E\OF
Name,Where,Grade
Bob,Sydney,75.12
Sue,Sydney,65.2475
George,Sydney,84.6
Jack,Sydney,35
Amy,Sydney,
EOF
)|gawk 'BEGIN{FS=","}
FNR==1 {print("Name","Where","Grade");next}
{if ($3<50) {$3=0}
printf("%s,%s,%d \n",$1,$2,$3)}'
This produces:
Name Where Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,0
Amy,Sydney,0
What I want is to display scores which are less than 50, or missing, as a dash ("-").
Name Where Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,-
Amy,Sydney,-
This requires that the 3rd format specifier in the printf change from %d to %s.
So in some rows the third column should be a value, and in some rows the third column should be a string. How can I tell this to GAWK? Or should I just pipe through another awk to re-format?
$ gawk 'BEGIN{FS=","}
FNR==1 {print("Name","Where","Grade");next}
{if ($3<50) {$3="-"} else {$3=sprintf("%d", $3)}
printf("%s,%s,%s \n",$1,$2,$3)}' ip.txt
Name Where Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,-
Amy,Sydney,-
use if-else to assign the value to $3 as needed
sprintf allows you to assign the result of formatting to a variable
for this case, you could use the int function as well
now printf has %s for $3 as well
Assuming you missed the commas for the header and the space after the third column is not needed, you can do this with a simple one-liner:
$ awk -F, -v OFS=, 'NR>1{$3 = $3 < 50 ? "-" : int($3)} 1' ip.txt
Name,Where,Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,-
Amy,Sydney,-
the ?: ternary operator is an alternative to if-else
1 is an awk idiom to print the contents of $0

Print every nth column of a file

I have a rather big file with 255 comma-separated columns and I need to print out every third column only.
I was trying something like this
awk '{ for (i=0;i<=NF;i+=3) print $i }' file
but that doesn't seem to be the solution, since it prints only one long column. Can anybody help? Thanks
Here is one way to do this.
The script prog.awk:
BEGIN {FS = ","} # field separator
{for (i = 1; i <= NF; i += 3) printf ("%s%c", $i, i + 3 <= NF ? "," : "\n");}
Invocation:
awk -f prog.awk <input.csv >output.csv
Example input.csv:
1,2,3,4,5,6,7,8,9,10
11,12,13,14,15,16,17,18,19,20
Example output.csv:
1,4,7,10
11,14,17,20
It behaves like that because by default awk splits fields on whitespace. You have to tell it to split them on commas, which is done using the FS variable or the -F switch. Besides that, the first field is number one; $0 is the whole line, so also change the initial value of the for loop:
awk -F',' '{ for (i=1;i<=NF;i+=3) print $i }' file
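Note that print $i still writes each selected field on its own line. If you want one comma-separated line per input record, as prog.awk above produces, a variant sketch:
awk -F',' '{ sep=""; for (i=1;i<=NF;i+=3) { printf "%s%s", sep, $i; sep="," }; print "" }' file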