awk associative array with pattern as index - awk

What would be the best way to have awk store a searched pattern, along with the lines where it was found, in an array? Do I need a shell script for that, or can it be done using only awk?
For example, if I search for the word 'guitar', it would build an array recording that the word was found on lines 13, 18 and 89.
awk '/home/ {
array[$0] = NR
}
END {
for(i in array) print i, array[i] }' 1-1000.txt
For example, this prints the lines that matched along with the line number where each was found. But I need the 'home' pattern, not $0, as the index of the associative array, with the matching lines as values. And then there is the problem of how to store multiple values under that one index.

It is important to know that keys are unique. So, if you store the search pattern as the key and the line number as the value, the value will be overwritten on every match, leaving only the last line on which the pattern was seen.
A better approach is to use the line number as the key:
awk '{a[NR]=$1} END {for (k in a) if(a[k]=="monkey") print k}' textfile
Output:
[jaypal:~] cat textfile
monkey
donkey
apple
monkey
dog
cat
mat
horse
monkey
[jaypal:~] awk '{a[NR]=$1} END {for (k in a) if(a[k]=="monkey") print k}' textfile
4
9
1
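If you do want the pattern itself as the key, one option is to append each matching line number to a single string value, so nothing is overwritten. This is only a minimal sketch: the sample input and the names pat and lines are made up for illustration.

```shell
# Hypothetical sample input: "guitar" appears on lines 1, 3 and 4.
hits=$(printf 'guitar solo\ndrums\nbass guitar\nguitar\n' |
awk -v pat="guitar" '
    # Keep one comma-separated string of line numbers per pattern.
    $0 ~ pat {
        if (pat in lines) lines[pat] = lines[pat] "," NR
        else lines[pat] = NR
    }
    END { for (p in lines) print p ": " lines[p] }
')
echo "$hits"
```

The stored string can be turned back into an array later with split(lines[pat], parts, ",") if the individual line numbers are needed.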
If you need to look for a particular word anywhere on a line and store it, you can use a for loop to inspect each field of the line and, once the word is found, store it in an array keyed by line number.
awk '{ for (i=1;i<=NF;i++) if($i=="pattern") arry[NR]=$i } END {. . .}' inputfile
Update based on comments:
To iterate over two files (where the first is used as a lookup list and the second is searched for lines matching the lookups):
awk 'NR==FNR{a[NR]=$1; next} {for (x in a) if ($0 ~ a[x]) print $0 " found because of --> " a[x]}' textfile text2
Test:
[jaypal:~] cat text2
monkeydeal
nodeal
apple is a good fruit
[jaypal:~] awk 'NR==FNR{a[NR]=$1; next} { for (x in a) if ($0 ~ a[x]) print $0 " found on line number " FNR " because of --> " a[x]}' textfile text2
it is a good monkeydeal found on line number 1 because of --> monkey
it is a good monkeydeal found on line number 1 because of --> monkey
it is a good monkeydeal found on line number 1 because of --> monkey
apple is a good fruit found on line number 3 because of --> apple
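The repeated output lines come from "monkey" being stored three times under different line numbers. A possible variation, sketched here under my own assumptions, stores the lookup words as keys so repeats collapse, and uses index() for a literal substring test instead of the regex match ~. The temporary file names are hypothetical.

```shell
# Hypothetical sample files created in a temporary directory.
dir=$(mktemp -d)
printf 'monkey\ndonkey\napple\nmonkey\n' > "$dir/words"
printf 'monkeydeal\nnodeal\napple is a good fruit\n' > "$dir/text"
matches=$(awk '
    # First file: each lookup word becomes a key, so duplicates collapse.
    NR==FNR { words[$1]; next }
    # Second file: index() does a literal substring search, not a regex match.
    { for (w in words) if (index($0, w)) print FNR ": " $0 " --> " w }
' "$dir/words" "$dir/text")
echo "$matches"
rm -r "$dir"
```

Each matching line is now reported once per distinct lookup word.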

Related

bash - compare two columns of one file with one column of second file and print matches

I have two different files with around 1000 lines each that are structured like this:
file1: (First Name; Last Name; Address)
Mike;Tyson;First Street 2
Tom;Boyden;Second Street 6
Tom;Cruise;Third Street 9
Mike;Myers;Second Street 4
file2: (First Name Last Name; E-Mail; ID) OR (Last Name First Name;E-Mail; ID)
Mike Tyson;mike#tyson.com;45753
Cruise Tom;tom#cruise.com;23562
Jennifer Lopez;jennifer#lopez.com;92746
Brady Tom;tom#brady.com;27583
I would like to compare the first two columns of file1 with the ENTIRE first column of file2. If both entries of file1 are present in the first column of file2 (in either order) I want to print the matched line of file1. Then search for the second line of file1 and again compare it to the entire column of file2 and so on.
In file2 the order can be (First Name Last Name) OR (Last Name First Name) and I want to print the matched line in both cases.
Expected Output:
Mike;Tyson;First Street 2
Tom;Cruise;Third Street 9
I'm happy with a solution using awk, grep or anything else.
I've tried the solution of a similar question but the output is empty:
awk -F';' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' file1 file2
Thanks
$ awk -F'[ ;]' '
{ key=($1 > $2 ? $1 FS $2 : $2 FS $1) }
NR==FNR { a[key]; next }
key in a
' file1 file2
Mike Tyson;mike#tyson.com;45753
Cruise Tom;tom#cruise.com;23562
The above uses the common, idiomatic approach to generating a consistent key regardless of the order in which the key components appear: sort the components before concatenating them to create the key value. When there are only 2 components, as in this case, a simple comparison is the only sorting required.
Here's why sorting the components of the key is the right approach. Imagine you have 3 components, $1, $2, and $3, instead of just 2. With the approach of testing every combination you need this:
NR==FNR { a[$1,$2,$3]; next }
($1,$2,$3) in a || ($1,$3,$2) in a || ($2,$1,$3) in a ||
($2,$3,$1) in a || ($3,$1,$2) in a || ($3,$2,$1) in a
Try writing that condition for $1 through $4 :-).
By contrast if you use the approach of sorting the components you need this (using GNU awk for built in sort functions for convenience) which is MUCH harder to get wrong (e.g. by forgetting a combination in the comparison):
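# Build the sorted key for every record of both files (GNU awk for asort()).
# Assumes FS is a single literal character, so split() undoes the join.
{
split($1 FS $2 FS $3,flds)
asort(flds)
key = flds[1]
for (i=2; i in flds; i++) {
key = key FS flds[i]
}
}
NR==FNR { a[key]; next }
key in a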
Now imagine if you wanted to use $1 through $10 in any order. The "test every combination of components approach" becomes an untenable nightmare while the "sort the components to create the key" approach just means trivially adding fields to the list in the first split() argument.
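If GNU awk's asort() is not available, the same "sort the components" idea can be sketched in portable awk with a small insertion sort. The helper name sorted_key and the sample input are hypothetical, and the key is joined with SUBSEP rather than FS.

```shell
result=$(printf 'Tom;Cruise\nCruise;Tom\nMike;Tyson\n' |
awk -F';' '
    # sorted_key is a hypothetical helper: copy fields 1..n, insertion-sort
    # them as strings, then join them with SUBSEP into one key.
    function sorted_key(n,   i, j, tmp, f, key) {
        for (i = 1; i <= n; i++) f[i] = $i
        for (i = 2; i <= n; i++)
            for (j = i; j > 1 && f[j-1] > f[j]; j--) {
                tmp = f[j]; f[j] = f[j-1]; f[j-1] = tmp
            }
        key = f[1]
        for (i = 2; i <= n; i++) key = key SUBSEP f[i]
        return key
    }
    { k = sorted_key(2) }
    k in seen    { print $0 " matches line " seen[k] }
    !(k in seen) { seen[k] = NR }
')
echo "$result"
```

Using more fields in any order only means changing the argument passed to sorted_key.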
Could you please try following.
awk '
FNR==NR{
array[$1,$2]
next
}
(($1,$2) in array) || (($2,$1) in array)
' FS="[ ;]" Input_file2 FS=";" Input_file1
Explanation: Adding detailed explanation for above solution.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be true when file2 is being read.
array[$1,$2] ##Creating array with index $1,$2 here.
next ##next will skip all further statement from here.
}
(($1,$2) in array) || (($2,$1) in array) ##Checking condition if $1,$2 OR $2,$1 is present in array then it will print the line from Input_file1.
' FS="[ ;]" file2 FS=";" file1 ##Set field separator space or semi-colon for file2 AND set field separator as ; for file1 here.

Looping through fields in awk

Suppose I have the following three lines in a text file:
I have a dog
The dog is goood
The cat runs well
Now I need to go through the file and print the lines where the word dog occurs, along with the number of the field in which it occurs. I need to accomplish this with awk.
Is there any way by which while processing a line I can sequentially increase the field number, something like the following:
more abc.txt | awk ' j = $NF for (i =1 ; i<= j ; ++i) if ( $N$i == "dog") n= $N0" "$i '
How to loop through the fields of a line in awk?
awk '{for(i=1; i<=NF; i++) {if($i=="dog") print $0,i}}' file
Output:
I have a dog 4
The dog is goood 2
I assume that each line contains the searched string only once.
$NF holds the value of the last field, i is a number, and $i refers to the value of field number i. $N$i means field number 0 (the whole line, since N is uninitialized) concatenated with the value of field number i. You are doing almost everything wrong. Try:
more abc.txt | awk '{for (i =1; i<=NF ; i++) if ($i == "dog") print $0, i}'
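To see the difference between NF, $NF and $i concretely, here is a small sketch run against a single sample line (the variable out is only there to capture the result):

```shell
# NF is the field count, $NF the last field, $i the field whose number is in i.
out=$(echo 'I have a dog' | awk '{ i = 2; print NF; print $NF; print $i }')
echo "$out"
```

This prints 4, then dog, then have, each on its own line.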
Solution:
awk '/dog/ {for(i=NF;i>=1;i--) {if($i~/dog/) {$0=i":"$0}} print}' file
Input file:
I have a dog
The dog is a good doggie
The cat runs well
Output:
4:I have a dog
2:6:The dog is a good doggie
Features:
First checks whether the line contains the desired text before cycling through the fields (although I don't think this provides much of a speedup)
Not only finds fields that are identical to the desired text, but also fields that contain it
Prints the field number of all fields in the line that match the desired text

subtracting values in one column based on another column

I have input file as follows
100A 2000
100B 150
100C 800
100A 1000
100B 100
100C 300
I want to subtract the values in column 2 for each unique value in column 1,
so the output should look like:
100A 1000
100B 50
100C 500
I have tried
awk '{if(!a[$1])a[$1]=$2; else a[$1]=$2-a[$1]}END{ for(i in a)print i" " a[i]}' file
but the output is:
100A 0
100B 0
100C 0
please advise
So many (slight) variations on the same theme.
awk '
!($1 in a) {a[$1]=$2; next}
{a[$1]-=$2}
END {for (i in a) printf "%s %d\n",i,a[i]}
' input.txt
Stack it up as a one-liner if you like.
Remember that awk structure consists of multiple condition { statement } pairs, so you can sometimes express your requirements more elegantly than using an if..else. (Not saying that this is the case here - this is a simple enough awk script that it probably doesn't matter, unless you're a purist. :] )
Also, beware of testing for values the way you've done in the condition in your if in the question. Note that a[$1] both tests whether the value at that array index is non-zero and causes the index to exist with a null value if it didn't previously exist. If you want to check for index existence, use $1 in a.
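A quick way to convince yourself of that side effect is a sketch like the following, where the array a and the key "x" are arbitrary:

```shell
out=$(awk 'BEGIN {
    print (("x" in a) ? "present" : "absent")   # nothing stored yet
    if (a["x"]) print "never printed"           # reading a["x"] creates the index
    print (("x" in a) ? "present" : "absent")   # the index now exists
}')
echo "$out"
```

The first test reports absent and the second reports present, even though nothing was ever assigned to a["x"].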
Update based on a comment on your question...
If you want to subtract the last from the first entry, ignoring the ones in between, then you need to keep a record of both your firsts and your lasts. Something like this might suffice.
awk '
!($1 in a){a[$1]=$2;next}
{b[$1]=$2}
END {for(i in b)if(i in a)print i,a[i]-b[i]}
' input.txt
Note that, as Ed mentioned, this produces output in random order. If you want the output ordered, you'll need an additional array to keep track of the order. For example, this uses the order in which items are first seen:
awk '
!($1 in a) {
a[$1]=$2;
o[++n]=$1;
next
}
{
b[$1]=$2
}
END {
for (n=1;n<=length(o);n++)
print o[n],a[o[n]]-b[o[n]]
}
' input.txt
Note that the length() function being used to determine the number of elements in an array is not universal amongst dialects of awk, but it does work in both gawk and one-true-awk (used in FreeBSD and others).
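If portability matters, a possible variation is to skip length() on the array and loop up to the counter that was incremented while reading. This sketch uses the sample input from the question; the array names first, last and order and the counter count are my own.

```shell
out=$(printf '100A 2000\n100B 150\n100C 800\n100A 1000\n100B 100\n100C 300\n' |
awk '
    # First sighting of a key: remember its value and its position in the order.
    !($1 in first) { first[$1] = $2; order[++count] = $1; next }
    # Later sightings: keep only the most recent value.
    { last[$1] = $2 }
    END {
        # count already holds the number of keys, so length() is not needed.
        for (i = 1; i <= count; i++)
            if (order[i] in last)
                print order[i], first[order[i]] - last[order[i]]
    }
')
echo "$out"
```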
This awk one-liner does the job:
awk '{if($1 in a)a[$1]=a[$1]-$2;else a[$1]=$2}
END{for(x in a) print x, a[x]}' file
In awk. Using conditional operator for value placing/subtraction to keep it tight:
$ awk '{ a[$1]+=($1 in a?-$2:$2) } END{ for(i in a)print i, a[i] }' file
100A 1000
100B 50
100C 500
Explained:
{
a[$1]+=($1 in a?-$2:$2) # if $1 in a already, subtract from it
# otherwise add value to it
}
END {
for(i in a) # go thru all a
print i, a[i] # and print keys and values
}
Given the sample input you provided, all you need is:
$ awk '$1 in a{print $1, a[$1]-$2} {a[$1]=$2}' file
100A 1000
100B 50
100C 500
If that's not all you need then provide more truly representative sample input/output that includes the cases where that's not good enough.
You can use this awk:
awk 'a[$1]{a[$1]=a[$1]-$2; next} {a[$1]=$2} END{for(v in a){print v, a[v]}}' file

Awk Field number of matched pattern

I was wondering if there's a built in command in awk to get the field number of the phrase that you just matched.
Banana is yellow.
awk {
/yellow/{ for (i=1;i<=NF;i++) if($i ~/yellow/) print $i}'
Is there a way to avoid writing the loop?
Your command doesn't work when I test it. Here's my version:
echo "banana is yellow" | awk '{for (i=1;i<=NF;i++) if($i ~/yellow/) print i}'
The output is :
3
As far as I know, there's no such built-in feature. To improve your command: the pattern match /yellow/ at the beginning is not necessary, and $i prints the matching field rather than the field number that you need.
Alternatively, you can use an array to store each field and its corresponding index number, and then look up the field number with arr["yellow"].
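A minimal sketch of that array idea might look like this. The array name pos is hypothetical. Note that it still loops once to fill the array, and it only finds fields that are exactly equal to the word, not fields that merely contain it.

```shell
out=$(echo 'banana is yellow' | awk '{
    split("", pos)                            # clear the map for each record
    for (i = 1; i <= NF; i++) pos[$i] = i     # a repeated word keeps its last position
    print pos["yellow"]
}')
echo "$out"
```

The benefit only shows when several words are looked up per line, since each lookup after the first costs no extra loop.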
If the input is a one-line string, you can set the record separator to the field separator. Then you can use NR to print the position:
awk 'BEGIN{RS=FS}/yellow/{print NR}' <<< 'banana is yellow'
3

awk: split a column of delimited text in a row into lines

I have a file with five columns, and the second column contains delimited text. I want to split that delimited text, dedupe it, and print it one item per line. I'm able to do it with the commands below, but I want to make it a single awk script. Can anyone help me?
awk -F"\t" 'NR>1{print $2}' <input file> | awk -F\| '{for (i = 0; ++i <= NF;) print $i}' | awk '!x[$0]++'
Input file:
test hello|good|this|will|be 23421 test 4543
test2 good|would|may|can 43234 test2 3421
Output:
hello
good
this
will
be
would
may
can
You could use this single awk one-liner:
$ awk '{split($2,a,"|");for(i in a)if(!seen[a[i]]++)print a[i]}' file
will
be
hello
good
this
can
would
may
The second field is split into the array a on the | character. Each element of a is printed if it isn't already in seen, which will only be true on the first occurrence.
Note that the order of the keys is undefined.
To preserve the order, you can use this:
$ awk '{n=split($2,a,"|");for(i=1;i<=n;++i)if(!seen[a[i]]++)print a[i]}' file
split returns the number of elements in the array a, which you can use to loop through them in the order they appeared.
I wrote exactly Tom's answer before I saw it. If you want to maintain the order of the words as they are seen, it's a little more work:
awk '
{
n = split($2, a, "|")
for (i=1; i<=n; i++)
if (!(a[i] in seen)) {
# the hash to store the unique keys
seen[a[i]] = 1
# the array to store the keys in order
words[++count] = a[i]
}
}
END {for (i=1; i<=count; i++) print words[i]}
' file
hello
good
this
will
be
would
may
can
Here is how I would have done it:
awk '{n=split($2,a,"|");for (i=1;i<=n;i++) print a[i]}' file
hello
good
this
will
be
good
would
may
can
Or this way (this may change the order of the output, but for some reason I am not sure about, it works fine here):
awk '{split($2,a,"|");for(i in a) print a[i]}' file
hello
good
this
will
be
good
would
may
can
Or if you do not like duplicate output:
awk '{split($2,a,"|");for(i in a) if (!f[a[i]]++) print a[i]}' file
hello
good
this
will
be
would
may
can