Using AWK to check column in file1 against file2 - awk

I'm having some difficulty using AWK to compare the contents of one file with another.
File1.txt
142317216-|--|-tree-|-apple-|-|--
150232802-|--|-plant-|-sugar-|-granular|--
153947334-|--|-flower-|-daisy-|-single|--
153188646-|--|-soil-|-earth-|-|--
File2.txt
apple,99817
sugar,75844
daisy,34566
earth,75544
Using "-" as the separator I can pull the information from column 7.
awk 'BEGIN { FS="-";} {print $7;}' file1.txt
Output
apple
sugar
daisy
earth
My full command to check whether column 7 of file1.txt exists in file2.txt:
awk 'BEGIN {FS="-";} NR==FR{a[$1]=$7;next} {FS=",";} $1 in a ' file1.txt file2.txt
Get column7, then change separator to "," and check $1 against variable a.
This shows no results and I'm struggling to get my head around the syntax to understand why. Could anyone perhaps give me some pointers?

You don't show the output you expect and you didn't include non-matching (or duplicate) values in your files so it's a guess but this MAY be what you want:
$ awk 'NR==FNR{file2[$1];next} {print ($7 in file2 ? "present:" : "absent:"), $7}' FS=',' file2 FS='-' file1
present: apple
present: sugar
present: daisy
present: earth
This situation is one reason why it's possible to set variables in the file list - to change their value between files.
Since you're just starting to learn awk - get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
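For reference, the original attempt most likely prints nothing for three reasons: NR==FR compares NR against an unset variable FR (the built-in is FNR), so the first block never runs; the array is keyed on $1 rather than $7; and assigning FS inside an action only affects records read afterwards. Below is a minimal sketch of the command-line FS approach. The rock/basalt row is a hypothetical addition to the sample data so that both outcomes show up, and the array name seen is arbitrary:

```shell
# Build small sample files in a temporary directory (contents are illustrative).
tmp=$(mktemp -d)
cat > "$tmp/file1.txt" <<'EOF'
142317216-|--|-tree-|-apple-|-|--
150232802-|--|-plant-|-sugar-|-granular|--
153188646-|--|-rock-|-basalt-|-|--
EOF
cat > "$tmp/file2.txt" <<'EOF'
apple,99817
sugar,75844
EOF

# FS=',' applies while file2.txt is read, FS='-' while file1.txt is read.
# Pass 1 stores the names from file2.txt as array keys; pass 2 tests $7 against them.
report=$(awk 'NR==FNR { seen[$1]; next }
              { print ($7 in seen ? "present:" : "absent:"), $7 }' \
         FS=',' "$tmp/file2.txt" FS='-' "$tmp/file1.txt")
printf '%s\n' "$report"
rm -rf "$tmp"
```

With this sample data the sketch should report apple and sugar as present and basalt as absent.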

Related

I need to sum all the values in a column across multiple files

I have a directory with multiple csv text files, each with a single line in the format:
field1,field2,field3,560
I need to output the sum of the fourth field across all files in a directory (can be hundreds or thousands of files). So for an example of:
file1.txt
field1,field2,field3,560
file2.txt
field1,field2,field3,415
file3.txt
field1,field2,field3,672
The output would simply be:
1647
I've been trying a few different things, with the most promising being an awk command that I found here in response to another user's question. It doesn't quite do what I need it to do, and I am an awk newb so I'm unsure how to modify it to work for my purpose:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt
This correctly outputs 975.
However, if I try to pass it a third file, rather than adding field 4 from all three files, it adds file1 to file2, then file1 to file3:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt file3.txt
975
1232
Can anyone show me how I can modify this awk statement to accept more than two files or, ideally because there are thousands of files to sum up, an * to output the sum of the fourth field of all files in the directory?
Thank you for your time and assistance.
A couple of issues with the current code:
NR==FNR is used to indicate special processing for the 1st file; in this case there is no processing that is 'special' for just the 1st file (ie, all files are to be processed the same)
an array (eg, a[NR]) is used to maintain a set of values; in this case you only have one global value to maintain so there is no need for an array
Since you're only looking for one global sum, somewhat simpler code should suffice:
$ awk -F',' '{sum+=$4} END {print sum+0}' file{1..3}.txt
1647
NOTES:
in the (unlikely?) case all files are empty, sum will be undefined so print sum will display a blank line; sum+0 ensures we print 0 if sum remains undefined (ie, all files are empty)
for a variable number of files file{1..3}.txt can be replaced with whatever pattern will match on the desired set of files, eg, file*.txt, *.txt, etc
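The notes above can be sketched as follows. The temporary directory and the empty.txt file are assumptions made for illustration, and the find variant is one possible way to avoid an over-long argument list when a glob expands to thousands of files:

```shell
# Sample files matching the question, plus one empty file (illustrative).
tmp=$(mktemp -d)
printf 'field1,field2,field3,560\n' > "$tmp/file1.txt"
printf 'field1,field2,field3,415\n' > "$tmp/file2.txt"
printf 'field1,field2,field3,672\n' > "$tmp/file3.txt"
: > "$tmp/empty.txt"

# Sum field 4 over every file matched by the glob.
total=$(awk -F',' '{ sum += $4 } END { print sum + 0 }' "$tmp"/file*.txt)

# With only empty input, sum is never assigned; sum + 0 still prints 0.
empty_total=$(awk -F',' '{ sum += $4 } END { print sum + 0 }' "$tmp/empty.txt")

# Let find batch the file names and feed one stream to a single awk.
found_total=$(find "$tmp" -name 'file*.txt' -exec cat {} + |
              awk -F',' '{ sum += $4 } END { print sum + 0 }')

printf '%s %s %s\n' "$total" "$empty_total" "$found_total"
rm -rf "$tmp"
```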
Here we go (no need to test NR==FNR in a concatenation):
$ cat file{1,2,3}.txt | awk -F, '{count+=$4}END{print count}'
1647
Or same-same 🇹🇭 (without wasting some pipe(s)):
$ awk -F, '{count+=$4}END{print count}' file{1,2,3}.txt
1647
$ perl -MList::Util=sum0 -F, -lane'push @a,$F[3];END{print sum0 @a}' file{1..3}.txt
1647
$ perl -F, -lane'push @a,$F[3];END{foreach(@a){ $sum +=$_ };print "$sum"}' file{1..3}.txt
1647
$ cut -d, -f4 file{1..3}.txt | paste -sd+ - | bc
1647

What does this Awk expression mean

I am working with bash script that has this command in it.
awk -F ‘‘ ‘/abc/{print $3}’|xargs
What is the meaning of this command? Assume input is provided to awk.
The quick answer is it'll do different things depending on the version of awk you're running and how many fields of output the awk script produces.
I assume you meant to write:
awk -F '' '/abc/{print $3}'|xargs
not the syntactically invalid (due to "smart quotes"):
awk -F ‘’’/abc/{print $3}’|xargs
-F '' is undefined behavior per POSIX so what it will do depends on the version of awk you're running. In some awks it'll split the current line into 1 character per field. In others it'll be ignored and the line will be split into fields at every sequence of white space. In other awks still it could do anything else.
/abc/ looks for a string matching the regexp abc on the current line and if found invokes the subsequent action, in this case {print $3}.
However the line ends up being split into fields, print $3 will print the 3rd such field.
xargs as used will just print chunks of the multi-line input it receives all on 1 line, so you could get 1 line of all-fields output if there aren't many fields being output, or several lines of multi-field output if there are.
I suspect the intent of that code was to do what this code actually will do in any awk alone:
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
e.g.:
$ printf 'foo\nxabc\nyzabc\nbar\n' |
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
b a
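A small sketch of the xargs step in isolation, assuming the default behaviour of xargs when no command is given (it runs echo on the collected words). The awk part prints the third character of each matching line on its own line, which avoids relying on -F '':

```shell
# awk emits one character per matching line; xargs joins those lines with spaces.
joined=$(printf 'foo\nxabc\nyzabc\nbar\n' |
         awk '/abc/ { print substr($0, 3, 1) }' |
         xargs)
printf '%s\n' "$joined"
```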

Compare two big files with awk

I used the following link as a reference for comparing two files:
Compare files with awk
awk 'NR==FNR{a[$1];next}$1 in a{print $2}' file1 file2
It prints the 2nd column of file2 if the 1st column of file2 is found in file1.
But my requirement is a little different. How can I print the 2nd column of file1 if the 1st column of file2 is found in the associative array (built from the 1st column of file1)?
With this:
awk 'NR==FNR{a[$1]=$2;next}$1 in a{print a[$1]}' file1 file2
This way you assign a value to each element of array a.
For a line with fields foo bar, you actually create a[foo]=bar.
If you later give a command {print a[foo]} it will print bar (its assigned value).
The previous {a[$1];next} creates an array named a with index $1, but the value is null; it is a shortcut for a[$1]="".
The whole thing works in awk because awk has an easy way to look up indexes in an array using $1 in a{print something}. This is awk's shorthand for if-then.
It is the same as {if ($1 in a) {print something}}. The great thing about this is that the part $1 in a refers to the indexes of array a and not to its values.
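To illustrate that in tests indexes rather than values, here is a minimal sketch. The file contents are made up for the example: bar appears in file1 only as a value, so the bar line in file2 should not match.

```shell
# file1 maps key -> value; file2 supplies the keys to look up (illustrative data).
tmp=$(mktemp -d)
printf 'foo bar\nbaz qux\n' > "$tmp/file1"
printf 'foo 1\nbar 2\nbaz 3\n' > "$tmp/file2"

# Pass 1 stores $2 under index $1; pass 2 prints the stored value
# only when file2's first field exists as an index of a.
out=$(awk 'NR==FNR { a[$1] = $2; next } $1 in a { print a[$1] }' \
      "$tmp/file1" "$tmp/file2")
printf '%s\n' "$out"
rm -rf "$tmp"
```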

Using file redirects to input a variable search pattern to awk

I'm attempting to write a small script in bash. The script's purpose is to pull out a search pattern from file1.txt and to print the line number of the matching search from file2.txt. I know the exact place of the pattern that I want in file1.txt, and I can pull that out quite easily with sed and awk e.g.
sed -n 3p file1.txt | awk '{print $4}'
The part that I'm having trouble with is passing that information again to awk to use as a search pattern in file2.txt. Something along the lines of:
awk '/search_pattern/{print NR}' file2.txt
I was able to get this code working in two lines of code by storing the output of the first line as a variable, and passing that variable to awk in the second line,
myVariable=`sed -n 3p file1.txt | awk '{print $4}'`
awk '/'"$myVariable"'/{print NR}' file2.txt
but this seems "inelegant". I was hoping there was a way to do this in one line of code using file redirects (or something similar?). Any help is greatly appreciated!
You can avoid sed | awk with
awk 'NR==3{print $4; exit 0}' file1.txt
You can do your search with:
search=$(awk 'NR==3{print $4; exit 0}' file1.txt)
awk -v search="$search" '$0 ~ search { print NR }' file2.txt
You could even write that all on one line, but I don't recommend that; clarity is more important than brevity.
In principle, you could use:
awk 'NR==3{search = $4; next} FNR!=NR && $0 ~ search {print FNR}' file1.txt file2.txt
This scans file1.txt and finds the search pattern; then it scans file2.txt and finds the lines that match. One line — even moderately clear. There'll be lots of matches if there isn't a column 4 on line 3 of file1.txt.
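Both approaches can be tried side by side with a sketch like the one below. The sample file contents and the word needle are hypothetical. The single-pass variant prints FNR so that line numbers are counted from the start of file2.txt rather than across both files:

```shell
# Illustrative inputs: the pattern is field 4 on line 3 of file1.txt.
tmp=$(mktemp -d)
printf 'l1\nl2\na b c needle\nl4\n' > "$tmp/file1.txt"
printf 'hay\nneedle here\nhay\nanother needle\n' > "$tmp/file2.txt"

# Two-step version: extract the pattern, then pass it in with -v.
search=$(awk 'NR==3 { print $4; exit }' "$tmp/file1.txt")
two_step=$(awk -v search="$search" '$0 ~ search { print NR }' "$tmp/file2.txt")

# Single-pass version: FNR != NR is true only while reading the second file.
one_pass=$(awk 'NR==3 { search = $4; next }
                FNR != NR && $0 ~ search { print FNR }' \
           "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$two_step"
rm -rf "$tmp"
```

Note that $0 ~ search treats the value as a regular expression; if the text should be matched literally, index($0, search) is an alternative.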

Using each line of awk output as grep pattern

I want to find every line of a file that contains any of the strings held in a column of a different file.
I have tried
grep "$(awk '{ print $1 }' file1.txt)" file2.txt
but that just outputs file2.txt in its entirety.
I know I've done this before with a pattern I found on this site, but I can't find that question anymore.
I see in the OP's comment that maybe the question is no longer a question. However, the following slight modification will handle the blank line situation. Just add a check to make sure the line has at least one field:
grep "$(awk '{if (NF > 0) print $1}' file1)" file2
And if the file with the patterns is simply a set of patterns per line, then a much simpler version of it is:
grep -f file1 file2
That causes grep to use the lines in file1 as the patterns.
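The blank-line problem can be reproduced with a short sketch, assuming hypothetical file contents. A blank line in file1.txt makes awk print an empty string, and an empty grep pattern matches every line:

```shell
# file1.txt has a blank line between its two entries (illustrative data).
tmp=$(mktemp -d)
printf 'apple x\n\nsugar y\n' > "$tmp/file1.txt"
printf 'apple pie\nplain bread\nsugar cube\n' > "$tmp/file2.txt"

# Without the NF check, the empty pattern matches all of file2.txt.
unfiltered=$(grep "$(awk '{ print $1 }' "$tmp/file1.txt")" "$tmp/file2.txt")

# NF is 0 on blank lines, so they contribute no pattern.
filtered=$(grep "$(awk 'NF { print $1 }' "$tmp/file1.txt")" "$tmp/file2.txt")

printf '%s\n' "$filtered"
rm -rf "$tmp"
```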
There is no need to use grep when you have awk:
awk 'FNR==NR&&NF{a[$0];next}($1 in a)' file2 file1
$(awk '{ print $1 }' file1.txt) | grep text > file.txt