How do I find duplicates in a column?
$ head countries_lat_long_int_code3.csv | cat -n
1 country,latitude,longitude,name,code
2 AD,42.546245,1.601554,Andorra,376
3 AE,23.424076,53.847818,United Arab Emirates,971
4 AF,33.93911,67.709953,Afghanistan,93
5 AG,17.060816,-61.796428,Antigua and Barbuda,1
6 AI,18.220554,-63.068615,Anguilla,1
7 AL,41.153332,20.168331,Albania,355
8 AM,40.069099,45.038189,Armenia,374
9 AN,12.226079,-69.060087,Netherlands Antilles,599
10 AO,-11.202692,17.873887,Angola,244
For instance, this file has duplicates in the 5th column:
5 AG,17.060816,-61.796428,Antigua and Barbuda,1
6 AI,18.220554,-63.068615,Anguilla,1
How do I view all the other duplicated rows in this file?
I know I can do this:
awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort
And I can eyeball it to see if there are any duplicates, but is there a better way?
Or I can do this:
Find out how many values there are in total:
$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | wc -l
210
Find out how many unique values there are:
$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | uniq | wc -l
183
Therefore there are at most 27 (210-183) duplicates.
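Or I could list the duplicated values themselves (uniq -d prints only the repeated lines), though that still doesn't show the full rows:
$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | uniq -d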
EDIT1
My desired output would be something like the following: all the columns, but showing only the rows that are duplicates:
5 AG,17.060816,-61.796428,Antigua and Barbuda,1
6 AI,18.220554,-63.068615,Anguilla,1

This will give you the duplicated codes
awk -F, 'a[$5]++{print $5}'
If you're only interested in the count of duplicate codes:
awk -F, 'a[$5]++{count++} END{print count}'
To print the duplicated rows, try this:
awk -F, '$5 in a{print a[$5]; print} {a[$5]=$0}'
This will print the whole row for every duplicate found in column $5 (second and later occurrences only):
awk -F, 'a[$5]++{print $0}'
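If you also want the first occurrence of each duplicated code included (as in the desired output above), one option is a two-pass sketch (not from the original answer) that counts the codes on the first read and prints every row whose code occurs more than once on the second:
awk -F, 'NR==FNR{seen[$5]++; next} seen[$5]>1' countries_lat_long_int_code3.csv countries_lat_long_int_code3.csv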

This is the least memory-hungry approach I can think of:
$ cat infile
country,latitude,longitude,name,code
AD,42.546245,1.601554,Andorra,376
AE,23.424076,53.847818,United Arab Emirates,971
AF,33.93911,67.709953,Afghanistan,93
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
AM,40.069099,45.038189,Armenia,374
AN,12.226079,-69.060087,Netherlands Antilles,599
AO,-11.202692,17.873887,Angola,355
$ awk -F\, '$NF in a{if (a[$NF]!=0){print a[$NF];a[$NF]=0}print;next}{a[$NF]=$0}' infile
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
AO,-11.202692,17.873887,Angola,355
NOTE: I have included another duplicate for testing purposes.
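For readability, here is the same one-liner spelled out with comments (identical logic, just reformatted):
awk -F, '
$NF in a {              # this code has been seen before
    if (a[$NF] != 0) {  # first repeat: print the stored first occurrence once
        print a[$NF]
        a[$NF] = 0      # mark it as already printed
    }
    print               # print the current duplicate row
    next
}
{ a[$NF] = $0 }         # remember the row for each new code
' infile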

If, on the other hand, you just want the values that do not repeat anywhere in the file, add this at the end of the awk command:
awk ... ... | sort | uniq -u
That will print only the non-repeated values, in alphabetical order.

Related

awk cannot select column with empty value

I am trying to select a column that has missing values.
Here is my input file, separated by tabs:
1 2 3
4 5
6
7 8
9
I am trying to select the first column, so that the output will look like:
1
4
7
and the length of my column would still be 5 in this case (counting the blank entries).
I have tried
awk '$1!=""{print $1}' ./demo.txt
but it returns
1
4
6
7
9
Can anybody help with this? I am new to AWK.
You can use cut:
$ cut -f 1 file # the default delimiter is a tab
Or with sed:
$ sed 's/[[:blank:]].*$//' file
Or awk:
$ awk '{sub(/[[:blank:]].*$/,"")}1' file
Or:
$ awk 'BEGIN{FS=OFS="\t"} {print $1}' file
All those print the first column and all five lines (blank or not)
Prints:
1
4

7

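And for the column length the question mentions, counting those lines (blank or not) should give 5 for the sample input; for example:
$ cut -f 1 demo.txt | wc -l
5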
Tell awk to use a tab (\t) as the input field delimiter (-F):
$ awk -F'\t' '{ print $1 }' demo.txt
1
4
7
If you want to print multiple columns, maintaining the same delimiter for output, another approach using the FS and OFS variables:
$ awk 'BEGIN { FS=OFS="\t" } { print $1,$3 }' demo.txt
1 3
4 5
7
9
With sed something like:
sed 's/^\([^[:blank:]]*\).*/\1/' demo.txt
Using FIELDWIDTHS in gnu-awk, you can do this for fixed-width data:
awk 'BEGIN {FIELDWIDTHS = "4 4 *"} {print $1}' file
1
4
7
For demo purpose:
awk 'BEGIN {FIELDWIDTHS = "4 4 *"} {print NR ":", $1}' file
1: 1
2: 4
3:
4: 7
5:
If they're all single digits in the 1st column:
echo \
'1 2 3
4 5
6
7 8
9' |
mawk NF=1 FS= |
gcat -n
1 1
2 4
3
4 7
5
That's literally all you need. To play it safe, do:
nawk NF=1 FS='[[:space:]]' # overly-verbose so-called
# "proper" posix form
gawk NF=1 FS='[ \t]' # suffices unless the input
# happens to have uncommon bytes
# like \013 \v or \014 \f
or a very fringe way of fudging NF (the FS assignment evaluates to a string that is numerically 0, so NF ^= 0 sets NF to NF^0, i.e. 1, and the non-zero result triggers the default print):
mawk 'NF ^= FS="[ \t]"'

Need to rearrange and sum a column in a Solaris command

I have below data named atp.csv file
Date_Time,M_ID,N_ID,Status,Desc,AMount,Type
2015-01-05 00:00:00 076,1941321748,BD9010423590206,200,Transaction Successful,2000,PRETOP
2015-01-05 00:00:00 077,1941323504,BD9010423590207,351,Transaction Successful,5000,PRETOP
2015-01-05 00:00:00 078,1941321743,BD9010423590205,200,Transaction Successful,1500,PRETOP
2015-01-05 00:00:00 391,1941323498,BD9010500000003,200,Transaction Successful,1000,PRETOP
I want to count by status using the command below:
cat atp.csv|awk -F',' '{print $4}'|sort|uniq -c
The output is like below:
3 200
1 351
But I want output like the one below, and I also want to sum the amount column per status:
200,3,4500
351,1,5000
That is, the status comes first and then the count value. Please help.
AWK has associative arrays.
% cat atp.csv | awk -F, 'NR>1 {n[$4]+=1;s[$4]+=$6;} END {for (k in n) { print k "," n[k] "," s[k]; }}' | sort
200,3,4500
351,1,5000
In the above:
The first line (record) is skipped with NR>1.
n[k] is the number of occurrences of key k (so we add 1), and s[k] is the running sum of the values in field 6 (so we add $6).
Finally, after all records are processed (END), you can iterate over the associative arrays by key (for (k in n) { ... }) and print each key together with the values in arrays n and s associated with it.
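For readability, the same logic written out as a short awk script (the one-liner above reformatted with comments, reading the file directly instead of through cat):
awk -F, '
NR > 1 {            # skip the header line
    n[$4] += 1      # count of rows per status
    s[$4] += $6     # running sum of the amount column per status
}
END {
    for (k in n)
        print k "," n[k] "," s[k]
}
' atp.csv | sort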
You can also try this awk version:
awk -F',' '{print $4,",", a[$4]+=$6}' FileName | sort -r | uniq -cw 6 | sort -r
Output :
3 200 , 4500
1 351 , 5000
Another Way:
awk -F',' '{print $4,",", a[$4]+=$6}' FileName | sort -r | uniq -cw 6 |sort -r | sed 's/\([^ ]\+\).\([^ ]\+\).../\2,\1,/'
All in (g)awk
awk -F, 'NR>1{a[$4]++;b[$4]+=$6}
END{n=asorti(a,c);for(i=1;i<=n;i++)print c[i]","a[c[i]]","b[c[i]]}' file
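Against the sample atp.csv above, this should print the requested output, already sorted by status:
200,3,4500
351,1,5000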

Bash, Postfix, AWK: Error in filtering deferred mail output

This is what I have tried so far:
cat /var/spool/postfix/deferred/D3B921090 | awk -F"/" '{print $6}' |awk '{$1="" print $0}' | sort | uniq -c | sort -n
and
awk -F"/" '{print $6}' < /var/spool/postfix/deferred/D3B921090 | awk '{$1="" print $0}' | sort | uniq -c | sort -n
I get the following error message when trying to run either command:
awk: line 1: syntax error at or near print
What am I doing wrong?
awk '{$1="" print $0}'
is not a syntactically valid expression, did you mean
awk '{$1=""; print $0}'
which is equal to
awk '{$1=""}1'
?
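With that fix, the full pipeline from the question would presumably become (the original command, only the missing semicolon added):
awk -F"/" '{print $6}' /var/spool/postfix/deferred/D3B921090 | awk '{$1=""; print $0}' | sort | uniq -c | sort -n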

linux/ubuntu awk match unique values (instead of bash "sort unique grep" unique values)

My command looks like this:
cut -f 1 dummy_FILE | sort | uniq -c | awk '{print $2}' | for i in $(cat -); do grep -w $i dummy_FILE |
awk -v VAR="$i" '{distance+=$3-$2} END {print VAR, distance}'; done
cat dummy_FILE
Red 13 14
Red 39 46
Blue 45 23
Blue 34 27
Green 31 73
I want to:
For every word in $1 of dummy_FILE (Red, Blue, Green), calculate the sum of the differences between $3 and $2.
To get the output like this:
Red 8
Blue -29
Green 42
My questions are:
Is it possible to replace cut -f 1 dummy_FILE | sort | uniq -c | awk '{print $2}'?
I am using sort | uniq -c to extract every word from the dataset - is it possible to do it with awk?
How can I avoid the useless cat in for i in $(cat -)?
grep -w $i dummy_FILE works fine, but I want to replace it with awk (should I?); If so how can I do this?
When I try awk -v VAR="$i" '/^VAR/ '{distance+=$3-$2} END {print VAR, distance}' I get "fatal: division by zero attempted".
I got it using:
awk '{a[$1] = a[$1] + $3 - $2;} END{for (x in a) {print x" "a[x];}}' dummy_FILE
Output:
Blue -29
Green 42
Red 8
If you want to sort the output, just append sort after the AWK command.
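As for the last attempt in the question: an awk variable cannot be referenced inside a /.../ regex literal (there, VAR is just the three letters). A sketch of the intended per-word command using a plain comparison instead:
awk -v VAR="$i" '$1 == VAR {distance += $3 - $2} END {print VAR, distance}' dummy_FILE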
Here's one way using awk:
awk '{ a[$1]=a[$1] + $3 - $2 } END { for(i in a) print i, a[i] }' dummy
Results:
Red 8
Blue -29
Green 42
If you require sorted output, you could simply pipe into sort like arutaku suggests:
awk '{ a[$1]=a[$1] + $3 - $2 } END { for(i in a) print i, a[i] }' dummy | sort
You can however, print into sort (within the awk statement), like this:
awk '{ a[$1]=a[$1] + $3 - $2 } END { for(i in a) print i, a[i] | "sort" }' dummy

Reading a file from line 4 to the end

I want to read a file from line 4 to the very end. Is there any way to do this with awk or something?
This sed command will do:
sed -n '4,$p' file.txt
Or using awk:
awk 'NR>=4' file.txt
Or using tail (the portable form is -n +4; GNU tail rejects a bare +4):
tail -n +4 file.txt
awk 'NR >= 4 {print $0}'
For example
$> seq 101 110 | awk 'NR >= 4 {print $0}'
104
105
106
107
108
109
110
tail +4 filename (or tail -n +4 filename with GNU tail) will serve your purpose.
See the tail man page for more.
Here's a method (that can depend on the type of shell you use; bash should work); note it needs to drop only the first 3 lines:
tmpvar=`cat a_file | wc -l `; tail -$((tmpvar-3)) a_file
Here's another method that should work in more shells (the comparison must be >=4 so that line 4 itself is kept):
cat -n a_file | awk '{if($1>=4) print $2}'
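For example, reusing the seq demo from the earlier answer:
$ seq 101 110 | cat -n | awk '{if($1>=4) print $2}'
104
105
106
107
108
109
110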