Extract values from a text file with awk

I would like to extract column1 from the text files based on the values of column2. I need to print column1 only if column2 is greater than 20. I also need to print the name of the file with the output. How can I do this with awk?
file1.txt
alias 23
samson 10
george 24
file2.txt
andrew 12
susan 16
david 25
desired output
file1
alias
george
file2
david

awk '{ if($2 > 20) { print FILENAME " " $1 } }' <files>

This might work for you:
awk '$2>20{print $1}' file1 file2
if you want file names and prettier printing:
awk 'FNR==1{print FILENAME} $2>20{print " ",$1}' file1 file2

awk '$2>20{if(file!=FILENAME){print FILENAME;file=FILENAME}print}' file1 file2
see below:
> awk '$2>20{if(file!=FILENAME){print FILENAME;file=FILENAME}print}' file1 file2
file1
alias 23
george 24
file2
david 25
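The command above prints the whole matching line and the file name exactly as passed on the command line. Here is a minimal sketch that gets closer to the desired output in the question (name only, file name without its extension). It assumes the files end in .txt; the variables seen and name and the temporary directory are only illustrative.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf 'alias 23\nsamson 10\ngeorge 24\n' > "$tmp/file1.txt"
printf 'andrew 12\nsusan 16\ndavid 25\n' > "$tmp/file2.txt"

out=$(cd "$tmp" && awk '
  $2 > 20 {
    # Print the file header once, the first time this file has a match.
    if (FILENAME != seen) {
      name = FILENAME
      sub(/\.txt$/, "", name)   # assumption: drop a trailing .txt
      print name
      seen = FILENAME
    }
    print $1                    # column 1 only
  }' file1.txt file2.txt)

printf '%s\n' "$out"
rm -rf "$tmp"
```

Because the header is printed from inside the $2 > 20 rule, a file with no matching rows produces no header at all, unlike the FNR==1 variant.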

Related

How to get the number of the file that is being processed by an awk script?

Suppose I have 2 or more files being processed by an awk script.
$ cat file1
a
b
c
$ cat file2
d
e
How do I get the number of the file being processed? Is there a built-in awk variable for that?
I want a script that behaves like the one below. What could I use as my
SOMEVARIABLE?
$ awk '{print FILENAME, NR, FNR, SOMEVARIABLE, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e
EDIT: Since the OP needs output in a specific format, not just a count of files, here is a solution that also numbers empty files. (Written and tested in GNU awk; ENDFILE is a gawk extension.)
awk '
FNR==1{
FNUM++
}
{
print FILENAME, NR, FNR, FNUM, $0
}
ENDFILE{
if(FNUM==prev){
FNUM++
print FILENAME, 0, 0, FNUM, "Empty file"
}
prev=FNUM
}' file1 file2
Output for a non-empty file1 and an empty file2 is as follows.
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 0 0 2 Empty file
Solutions for when you want the total number of files processed by the awk command:
1st solution: Try the following, which uses nextfile (written for GNU awk) and does not count empty files.
awk 'NF{count++;nextfile} END{print count}' Input_file1 Input_file2
2nd solution: If you only want the number of files passed to the awk command, try the following.
awk 'END {print ARGC-1}' Input_file1 Input_file2
Explanation of the commands above, with examples: say these are the input files, where Input_file1 has contents and Input_file2 is empty:
cat Input_file1
a
b
c
> Input_file2
Running the ARGC command reports 2 files:
awk 'END {print ARGC-1}' Input_file1 Input_file2
2
Running the 1st command reports 1 file, since it does not count the empty file:
awk 'NF{count++;nextfile} END{print count}' Input_file1 Input_file2
1
Well... I managed to do it as follows:
$ awk 'BEGIN{FNUM=0} FNR==1{FNUM++} {print FILENAME, NR, FNR, FNUM, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e
I guess there is no built-in variable to help with that, so I created the variable FNUM (for file number). If there is a solution with a built-in variable, please give me a better answer.
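One limitation of counting on FNR==1 is that an empty file never triggers the rule, so it gets no number. If the number should instead be the file's position on the command line, a portable sketch is to walk ARGV until it matches FILENAME. The variable argi and the file named empty are made up for this example, and the sketch assumes the same file name is not listed twice in a row.

```shell
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/file1"
: > "$tmp/empty"                 # an empty file between the two others
printf 'd\ne\n' > "$tmp/file2"

out=$(cd "$tmp" && awk '
  FNR == 1 {
    # Advance to the ARGV slot that holds the file now being read.
    argi++
    while (argi < ARGC && ARGV[argi] != FILENAME) argi++
  }
  { print FILENAME, NR, FNR, argi, $0 }' file1 empty file2)

printf '%s\n' "$out"
rm -rf "$tmp"
```

Here file2 is reported as number 3 because the empty file occupies slot 2 of the argument list, even though awk never reads a record from it.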

awk: how do you ensure consistent column numbering when there are blank columns?

my_file is like this:
SELECTED   NAME    AGE
*          adam    30
           bob     70
I'd like to output:
adam
bob
however, if I try: cat my_file|awk '{print $2}' it outputs
NAME
adam
70
Any suggestions on how you get awk to account for a blank column?
With gawk, using FIELDWIDTHS (a GNU awk extension):
$ awk -v FIELDWIDTHS='11 8 3' '{print $2}' file
NAME
adam
bob
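Alternatively, print the first field when the line starts with spaces (that is, the first column is blank) and the second field otherwise: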
awk '{printf("%s%s",/^ +/?$1:$2,ORS)}' Input_file
Output will be as follows.
NAME
adam
bob
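If gawk is not available, the same fixed-width idea can be sketched in any POSIX awk with substr. This assumes the layout implied by the field widths above (11 characters for the first column, 8 for the second), so the name starts at character 12; the sample file is rebuilt with printf so the spacing is exact.

```shell
tmp=$(mktemp)
# Build fixed-width rows: 11-character first column, 8-character second column.
printf '%-11s%-8s%s\n' SELECTED NAME AGE '*' adam 30 '' bob 70 > "$tmp"

out=$(awk '{
  name = substr($0, 12, 8)   # characters 12..19 hold the NAME column
  sub(/ +$/, "", name)       # strip the padding
  print name
}' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

Cutting by position does not depend on whether the first column is blank, which is why it avoids the field renumbering that whitespace splitting causes.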

Comparing two lists and printing select columns from each list

I want to compare two lists and print some columns from one, and some from the other if two match. I suspect I'm close but I suppose it's better to check..
1st file: Data.txt
101 0.123
145 0.119
242 0.4
500 0.88
2nd File: Map.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
So, if I want to compare the 3rd column of file2 against the 1st of file1 and print the 1st column of file2 and all of file1, I tried awk 'NR==FNR {a[$3];next}$1 in a{print$0}' file2 file1
but that only prints matches in file1. I tried adding x=$1 in the awk, i.e. awk 'NR==FNR {x=$1;a[$3];next}$1 in a{print x $0}' file2 file1, but that saves only one value of $1 and outputs that value on every line. I also tried adding $1 into a[$3], which is obviously wrong and gives zero output.
Ideally I'd like to get this output:
blue 101 0.123
ted 500 0.88
which is the 1st column of file2 and the 3rd column of file2 matched to 1st column of file1, and the rest of file1.
You had it almost exactly in your second attempt. Just instead of assigning the value of $1 to a scalar you can stash it in the array for later use.
awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
$ cat file1.txt
101 0.123
145 0.119
242 0.4
500 0.88
$ cat file2.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
$ awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
blue 101 0.123
ted 500 0.88
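For reference, here is the same lookup written out as a commented script, using the Data.txt and Map.txt names from the question. It is only a sketch of the idea above; the array name is arbitrary.

```shell
tmp=$(mktemp -d)
printf '101 0.123\n145 0.119\n242 0.4\n500 0.88\n' > "$tmp/Data.txt"
printf 'red 1 99\nblue 3 101\nrob 3 240\nted 7 500\n' > "$tmp/Map.txt"

out=$(cd "$tmp" && awk '
  # First file (Map.txt): remember each name under its numeric key.
  NR == FNR { name[$3] = $1; next }
  # Second file (Data.txt): on a key match, print the stored name and the row.
  $1 in name { print name[$1], $0 }
' Map.txt Data.txt)

printf '%s\n' "$out"
rm -rf "$tmp"
```

Storing the name in the array keeps one value per key, which is what the scalar x could not do: x held only the last $1 read from the map file.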

What are NR and FNR and what does "NR==FNR" imply?

I am learning file comparison using awk.
I found syntax like below,
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
I couldn't understand what is the significance of NR==FNR in this?
If I try with FNR==NR then also I get the same output?
What exactly does it do?
In Awk:
FNR refers to the record number (typically the line number) in the current file.
NR refers to the total record number.
The operator == is a comparison operator, which returns true when the two surrounding operands are equal.
This means that the condition NR==FNR is normally only true for the first file, as FNR resets back to 1 for the first line of each file but NR keeps on increasing.
This pattern is typically used to perform actions on only the first file. It works assuming that the first file is not empty, otherwise the two variables would continue to be equal while Awk was processing the second file.
The next inside the block means any further commands are skipped, so they are only run on files other than the first.
The condition FNR==NR compares the same two operands as NR==FNR, so it behaves in the same way.
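To make the empty-first-file caveat concrete, here is a small sketch. It compares NR==FNR with a test on FILENAME==ARGV[1], which is one common way to guard against the problem (it assumes the first argument is a file name and not a variable assignment). The file names keys and data are invented for the example.

```shell
tmp=$(mktemp -d)
: > "$tmp/keys"                  # empty first file
printf 'a\nb\n' > "$tmp/data"

# Goal: print lines of data that are NOT listed in keys (so: a and b).
naive=$(cd "$tmp" && awk 'NR==FNR {seen[$1]; next} !($1 in seen)' keys data)
guarded=$(cd "$tmp" && awk 'FILENAME==ARGV[1] {seen[$1]; next} !($1 in seen)' keys data)

printf 'naive:   [%s]\n' "$naive"
printf 'guarded: [%s]\n' "$guarded"
rm -rf "$tmp"
```

With an empty keys file, NR==FNR stays true while data is read, so the naive version swallows every line and prints nothing, whereas the guarded version prints a and b.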
Look for keys (first word of line) in file2 that are also in file1.
Step 1: fill array a with the first words of file 1:
awk '{a[$1];}' file1
Step 2: Fill array a and ignore file 2 in the same command. To do this, compare the total number of records read so far (NR) with the record number within the current input file (FNR).
awk 'NR==FNR{a[$1]}' file1 file2
Step 3: Ignore actions that might come after } when parsing file 1
awk 'NR==FNR{a[$1];next}' file1 file2
Step 4: print key of file2 when found in the array a
awk 'NR==FNR{a[$1];next} $1 in a{print $1}' file1 file2
Look up NR and FNR in the awk manual and then ask yourself what is the condition under which NR==FNR in the following example:
$ cat file1
a
b
c
$ cat file2
d
e
$ awk '{print FILENAME, NR, FNR, $0}' file1 file2
file1 1 1 a
file1 2 2 b
file1 3 3 c
file2 4 1 d
file2 5 2 e
NR and FNR are awk built-in variables.
NR - the total number of records read so far, across all input files.
FNR - the number of records read so far from the current input file; it restarts at 1 for each new file.
Assuming you have Files a.txt and b.txt with
cat a.txt
a
b
c
d
1
3
5
cat b.txt
a
1
2
6
7
Keep in mind, for this example:
NR counts records across both a.txt and b.txt, so it keeps increasing (1 to 12).
FNR counts records within the current file only, so it runs 1 to 7 for a.txt and starts again at 1 for b.txt.
awk 'NR==FNR{a[$0];}{if($0 in a)print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
a.txt 1 1 a
a.txt 2 2 b
a.txt 3 3 c
a.txt 4 4 d
a.txt 5 5 1
a.txt 6 6 3
a.txt 7 7 5
b.txt 8 1 a
b.txt 9 2 1
Let's add "next" so that lines from the first file (where NR==FNR) are only stored and never reach the printing block.
Lines that are in b.txt and also in a.txt:
awk 'NR==FNR{a[$0];next}{if($0 in a)print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
b.txt 8 1 a
b.txt 9 2 1
in b.txt but not in a.txt
awk 'NR==FNR{a[$0];next}{if(!($0 in a))print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
b.txt 10 3 2
b.txt 11 4 6
b.txt 12 5 7
awk 'NR==FNR{a[$0];next}!($0 in a)' a.txt b.txt
2
6
7
Here is the pseudo code for your interest.
NR = 1
for (i=1; i<=files.length; ++i) {
    line = read line from files[i]
    FNR = 1
    while (not EOF) {
        columns = getColumns(line)
        if (NR equals FNR) { // processing first file
            add columns[1] to a
        } else { // processing remaining files
            if (columns[1] exists in a) {
                print columns[1]
            }
        }
        NR = NR + 1
        FNR = FNR + 1
        line = read line from files[i]
    }
}

awk + How do I find duplicates in a column?

How do I find duplicates in a column?
$ head countries_lat_long_int_code3.csv | cat -n
1 country,latitude,longitude,name,code
2 AD,42.546245,1.601554,Andorra,376
3 AE,23.424076,53.847818,United Arab Emirates,971
4 AF,33.93911,67.709953,Afghanistan,93
5 AG,17.060816,-61.796428,Antigua and Barbuda,1
6 AI,18.220554,-63.068615,Anguilla,1
7 AL,41.153332,20.168331,Albania,355
8 AM,40.069099,45.038189,Armenia,374
9 AN,12.226079,-69.060087,Netherlands Antilles,599
10 AO,-11.202692,17.873887,Angola,244
For instance this has duplicates in the 5th column.
5 AG,17.060816,-61.796428,Antigua and Barbuda,1
6 AI,18.220554,-63.068615,Anguilla,1
How do I view all the others in this file?
I know I can do this:
awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort
And I can eyeball it to see if there are any duplicates, but is there a better way?
Or I can do this:
Find out how many values there are in total
$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | wc -l
210
Find out how many unique values there are
$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | uniq | wc -l
183
Therefore there are at most 27 (210-183) duplicates.
EDIT1
My desired output would be something as follows, basically all the columns but just showing the rows that are duplicates:
5 AG,17.060816,-61.796428,Antigua and Barbuda,1
6 AI,18.220554,-63.068615,Anguilla,1
This will give you the duplicated codes
awk -F, 'a[$5]++{print $5}'
if you're only interested in count of duplicate codes
awk -F, 'a[$5]++{count++} END{print count}'
To print duplicated rows try this
awk -F, '$5 in a{print a[$5]; print} {a[$5]=$0}'
This will print the whole row for each repeat of a value in col $5 (the second and later occurrences; the first occurrence is not printed):
awk -F, 'a[$5]++{print $0}'
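If the first occurrence should be shown too, with rows kept in file order, one option is to read the file twice: count each code on the first pass and print rows whose count is above 1 on the second. This is a sketch using a shortened copy of the sample data; the array name seen is arbitrary.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
country,latitude,longitude,name,code
AD,42.546245,1.601554,Andorra,376
AE,23.424076,53.847818,United Arab Emirates,971
AF,33.93911,67.709953,Afghanistan,93
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
EOF

# Pass 1 (NR==FNR): count each code. Pass 2: print rows whose code repeats.
out=$(awk -F, 'NR == FNR { seen[$5]++; next } seen[$5] > 1' "$tmp" "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

The trade-off is that the file is read twice, but only the counts are held in memory rather than whole rows.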
This is the least memory-hungry approach I can think of:
$ cat infile
country,latitude,longitude,name,code
AD,42.546245,1.601554,Andorra,376
AE,23.424076,53.847818,United Arab Emirates,971
AF,33.93911,67.709953,Afghanistan,93
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
AM,40.069099,45.038189,Armenia,374
AN,12.226079,-69.060087,Netherlands Antilles,599
AO,-11.202692,17.873887,Angola,355
$ awk -F\, '$NF in a{if (a[$NF]!=0){print a[$NF];a[$NF]=0}print;next}{a[$NF]=$0}' infile
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
AO,-11.202692,17.873887,Angola,355
NOTE: I have included another duplicate for testing purposes.
If you just want each repeated value printed once, pipe the awk output through sort -u:
awk ... ... | sort -u
That prints each value a single time, in sorted order. Note that uniq -u would do the opposite: it keeps only the values that are not repeated.
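Since the sort and uniq options are easy to mix up, here is a small sketch of what each one keeps. The input values are made up for illustration.

```shell
vals=$(printf '1\n355\n1\n7\n')

once_each=$(printf '%s\n' "$vals" | sort -u)              # every value, one copy each
only_repeated=$(printf '%s\n' "$vals" | sort | uniq -d)   # values seen more than once
never_repeated=$(printf '%s\n' "$vals" | sort | uniq -u)  # values seen exactly once

printf 'sort -u: %s\n' "$(echo $once_each)"
printf 'uniq -d: %s\n' "$(echo $only_repeated)"
printf 'uniq -u: %s\n' "$(echo $never_repeated)"
```

Both uniq variants need sorted input, because uniq only compares adjacent lines.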