Looping through fields in awk

Suppose I have the following three lines in a text file:
I have a dog
The dog is goood
The cat runs well
Now I need to go through the file and print the lines where the word dog occurs, along with the field number in which it occurs. I need to accomplish this with awk.
Is there any way, while processing a line, to sequentially step through the field numbers, something like the following:
more abc.txt | awk ' j = $NF for (i =1 ; i<= j ; ++i) if ( $N$i == "dog") n= $N0" "$i '
How to loop through the fields of a line in awk?

awk '{for(i=1; i<=NF; i++) {if($i=="dog") print $0,i}}' file
Output:
I have a dog 4
The dog is goood 2
I assume that each line contains the searched string only once.
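If a line might contain "dog" in more than one field, a small variation (just a sketch, not part of the original answer) collects every matching field number and prints the line once:
awk '{out=""; for(i=1; i<=NF; i++) if($i=="dog") out=out OFS i; if(out!="") print $0 out}' file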

$NF holds the value of the last field; i is a number and $i refers to the value of the field with that number. $N$i means field number 0 (which is the whole line, since N isn't initialized) concatenated with the value of field number i. You are doing almost everything wrong. Try:
more abc.txt | awk '{for (i=1; i<=NF; i++) if ($i == "dog") print $0, i}'

Solution:
awk '/dog/ {for(i=NF;i>=1;i--) {if($i~/dog/) {$0=i":"$0}} print}' file
Input file:
I have a dog
The dog is a good doggie
The cat runs well
Output:
4:I have a dog
2:6:The dog is a good doggie
Features:
First checks whether the line contains the desired text before cycling through the fields (although I don't think this provides much of a speedup)
Not only finds fields that are identical to the desired text, but also fields that contain it
Prints the field number of all fields in the line that match the desired text
Cycles through the fields from last to first, so the prepended field numbers come out in ascending order; prepending i":" just extends the first field and does not change how the rest of the line splits

Related

Print lines containing the same second field more than 3 times in a text file

Here is what I am doing.
The text file is comma separated and has three fields,
and I want to extract all the lines whose second field occurs
more than three times.
Text file (filename is "text"):
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
6,keyword2,content5
6,keyword2,content5
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
My command is below: it cats the whole text file inside awk, greps for the second field of each line, and counts the matching lines.
If the count is greater than 2, it prints the whole line.
The command:
awk -F "," '{ "cat text | grep "$2 " | wc -l" | getline var; if ( 2 < var ) print $0}' text
However, the command output contains only the first three consecutive lines,
instead of also printing the last three lines containing "keyword1", which occurs six times in the text.
Result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
My expected result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
Can somebody tell me what I am doing wrong?
What you are doing wrong: awk keeps the pipe for each distinct command string open, so the first getline on "cat text | grep keyword1 | wc -l" reads its single line of output, and every later getline on the same command string finds the pipe at end-of-file and leaves var unchanged (you would need close() on the command after each read). It is simpler, though, to make just two passes over the file. In the first pass, you count the number of occurrences of each value in column 2. In the second pass, you print out the rows where the value in column 2 occurs more than your threshold value of 3 times.
awk -F, 'FNR == NR { count[$2]++ }
FNR != NR { if (count[$2] > 3) print }' text text
The first line of code handles the first pass; it counts the occurrences of each different value of the second column.
The second line of code handles the second pass; if the value in column 2 was counted more than 3 times, print the whole line.
This doesn't work if the input is only available on a pipe rather than as a file (so you can't make two passes over the data). Then you have to work much harder.
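If the data only arrives on a pipe, one workaround (a sketch that buffers the whole stream in memory, so it only suits inputs that fit in RAM) is a single pass that records every line and its key, then prints the qualifying lines at the end:
cat text | awk -F, '{count[$2]++; line[NR]=$0; key[NR]=$2}
    END {for (i=1; i<=NR; i++) if (count[key[i]] > 3) print line[i]}'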

awk: split a column of delimited text in a row into lines

I have a file with five columns, and the second column has delimited text. I want to split that delimited text, dedupe it, and print the pieces as separate lines. I'm able to do it with the command below, but I want to turn it into a single awk script. Can anyone help me?
awk -F"\t" 'NR>1{print $2}' <input file> | awk -F\| '{for (i = 0; ++i <= NF;) print $i}' | awk '!x[$0]++'
Input file:
test hello|good|this|will|be 23421 test 4543
test2 good|would|may|can 43234 test2 3421
Output:
hello
good
this
will
be
would
may
can
You could use this single awk one-liner:
$ awk '{split($2,a,"|");for(i in a)if(!seen[a[i]]++)print a[i]}' file
will
be
hello
good
this
can
would
may
The second field is split into the array a on the | character. Each element of a is printed if it isn't already in seen, which will only be true on the first occurrence.
Note that the order of the keys is undefined.
To preserve the order, you can use this:
$ awk '{n=split($2,a,"|");for(i=1;i<=n;++i)if(!seen[a[i]]++)print a[i]}' file
split returns the number of elements in the array a, which you can use to loop through them in the order they appeared.
I wrote exactly Tom's answer before I saw it. If you want to maintain the order of the words as they are seen, it's a little more work:
awk '
{
n = split($2, a, "|")
for (i=1; i<=n; i++)
if (!(a[i] in seen)) {
# the hash to store the unique keys
seen[a[i]] = 1
# the array to store the keys in order
words[++count] = a[i]
}
}
END {for (i=1; i<=count; i++) print words[i]}
' file
hello
good
this
will
be
would
may
can
Here is how I would have done it:
awk '{n=split($2,a,"|");for (i=1;i<=n;i++) print a[i]}' file
hello
good
this
will
be
good
would
may
can
Or this way (this may change the order of the output but, for some reason I am not sure about, it works fine here):
awk '{split($2,a,"|");for(i in a) print a[i]}' file
hello
good
this
will
be
good
would
may
can
Or if you do not like duplicate output:
awk '{split($2,a,"|");for(i in a) if (!f[a[i]]++) print a[i]}' file
hello
good
this
will
be
would
may
can

Selecting a field after a string using awk

I'm very new to awk having just been introduced to it over the weekend.
I have a question that I'm hoping someone may be able to help me with.
How would one select a field that follows a specific string?
How would I expand this code to select more than one field following a specific string?
As an example, for any given line in my text file I have something like
2 of 10 19/4/2014 school name random text distance 800m more random text time 2:20:22 winner someonefast.
Some attributes are very consistent so I can easily extract these fields. For example 2, 10 and the date. However, there is often a lot of variable text before the next field that I wish to extract. Hence the question. Using awk can I extract the next field following a string? For example I'm interested in the fields following the /distance/ or /time/ string in combination with $1, $3, $4, $5.
Your help will be greatly appreciated.
Andy
Using awk you can select the field following a string. Here is an example:
echo '2 of 10 19/4/2014 school name random text distance 800m more random text time 2:20:22 winner someonefast.' |
awk '{
for(i=1; i<=NF; i++) {
if ( i ~ /^[1345]$/ ) {
extract = (extract ? extract FS $i : $i)
}
if ( $i ~ /distance|time/ ) {
extract = (extract ? extract FS $(i+1): $(i+1))
}
}
print extract
}'
2 10 19/4/2014 school 800m 2:20:22
What we are doing here is basically letting awk split on the default delimiter. We create a for loop to iterate over all fields. NF stores the number of fields for a given line, so we start from 1 and go all the way to the end.
In the first conditional block, we just inspect the field number. If it is 1, 3, 4 or 5, we append the field's value to a variable called extract, separated by the field separator.
In the second conditional block, we check whether the value of the field is distance or time. If it is, we again append to our variable, but this time we take $(i+1), the value of the next field, i.e. the field that follows the matched string.
When you have name = value situations like you do here, it's best to create an array that maps the names to the values and then just print the values for the names you're interested in, e.g.:
$ awk '{for (i=1;i<=NF;i++) v[$i]=$(i+1); print $1, $3, $4, $5, v["distance"], v["time"]}' file
2 10 19/4/2014 school 800m 2:20:22
Basic:
awk '{
for (i = 6; i <= NF; ++i) {
if ($i == "distance") distance = $(i + 1)
if ($i == "time") time = $(i + 1)
}
print $1, $3, $4, $5, distance, time
}' file
Output:
2 10 19/4/2014 school 800m 2:20:22
But this isn't enough to capture the rest of the school name, which may continue past $5; you would need another condition for that.
A better solution is to use another delimiter besides spaces, such as tabs, and set FS to \t.
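For example (a made-up tab-separated line, just to illustrate setting FS to a tab), a multi-word school name then stays in a single field:
printf '2 of 10\t19/4/2014\tschool name\t800m\t2:20:22\n' | awk -F'\t' '{print $3, $4, $5}'
school name 800m 2:20:22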

awk associative array with pattern as index

What would be the best solution to have awk store a searched pattern, along with the lines where it was found, in an array? Do I need a shell script for that, or can it be done using only awk?
So, for example, if I search for the word 'guitar', it builds an array recording that the word was found on lines 13, 18 and 89?
awk '/home/ {
array[$0] = NR
}
END {
for(i in array) print i, array[i] }' 1-1000.txt
For example, this would print the lines that matched along with the number where they were found. But I need not $0 but that 'home' pattern as the index of the associative array, which would hold the lines as values. Then again, there is the problem of how to have multiple values for that one index.
It is important to know that array keys are unique. So, if you intend to store the search pattern as the key and the line number as the value, the value will get overwritten by the last line on which the pattern was seen.
So a good way to do this would be:
awk '{a[NR]=$1} END {for (k in a) if(a[k]=="monkey") print k}' textfile
Output:
[jaypal:~] cat textfile
monkey
donkey
apple
monkey
dog
cat
mat
horse
monkey
[jaypal:~] awk '{a[NR]=$1} END {for (k in a) if(a[k]=="monkey") print k}' textfile
4
9
1
If you need to iterate over a line to look for a particular pattern and store it, you can use a for loop to inspect each word of the line and, once your word is found, store it in an array.
awk '{ for (i=1;i<=NF;i++) if($i=="pattern") arry[NR]=$i } END {. . .}' inputfile
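If you really do want the word itself as the array index and still keep all the line numbers, one workaround (a sketch using the 'guitar' example from the question) is to concatenate the matching line numbers into a single value:
awk '/guitar/ {hits["guitar"] = (hits["guitar"] ? hits["guitar"] " " : "") NR}
     END {for (p in hits) print p ": " hits[p]}' inputfile
On a file where guitar appears on lines 13, 18 and 89 this prints: guitar: 13 18 89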
Update based on comments:
To iterate over two files (where one is used as a lookup and the second is searched for lines matching the lookups):
awk 'NR==FNR{a[NR]=$1; next} {for (x in a) if ($0 ~ a[x]) print $0 " found because of --> " a[x]}' textfile text2
Test:
[jaypal:~] cat text2
monkeydeal
nodeal
apple is a good fruit
[jaypal:~] awk 'NR==FNR{a[NR]=$1; next} { for (x in a) if ($0 ~ a[x]) print $0 " found on line number " FNR " because of --> " a[x]}' textfile text2
it is a good monkeydeal found on line number 1 because of --> monkey
it is a good monkeydeal found on line number 1 because of --> monkey
it is a good monkeydeal found on line number 1 because of --> monkey
apple is a good fruit found on line number 3 because of --> apple

Awk: printing undetermined number of columns

I have a file that contains a number of fields separated by tabs. I am trying to print all columns except the first one, but I want to print them all in a single column with awk. The format of the file is
col 1 col 2 ... col n
Each row has at least 2 columns.
Sample
2012029754 901749095
2012028240 901744459 258789
2012024782 901735922
2012026032 901738573 257784
2012027260 901742004
2003062290 901738925 257813 257822
2012026806 901741040
2012024252 901733947 257493
2012024365 901733700
2012030848 901751693 260720 260956 264843 264844
So I want to tell awk to print columns 2 through n, without printing blank lines when a row has nothing in a given column, all in one column like the following.
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
This is the first time I am using awk, so bear with me. I wrote this on the command line and it works:
awk '{i=2;
while ($i ~ /[0-9]+/)
{
printf "%s\n", $i
i++
}
}' bth.data
This is more of a request for approval than a question: is this the right way of doing something like this in awk, or is there a better/shorter way of doing it?
Note that the actual input file could be millions of lines.
Thanks
Is this what you want as output?
awk '{for(i=2; i<=NF; i++) print $i}' bth.data
gives
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
NF is one of several pre-defined awk variables. It indicates the number of fields on a given input line. For instance, it is useful if you want to always print out the last field of a line (print $NF), or if you want to iterate through all or part of the fields on a line.
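A quick illustration with one of the sample lines:
echo '2012028240 901744459 258789' | awk '{print NF, $NF}'
3 258789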
Seems like awk is the wrong tool. I would do:
cut -f 2- < bth.data | tr -s '\t' '\n'
Note that with -s, this avoids printing blank lines as stated in the original problem.