AWK - minimum value for each item

I have the file data.txt with the following:
Claude:41:166:Paris
John:37:185:Miami
Lina:16:170:Miami
Maurice:58:172:Paris
Phoebe:21:179:Paris
Anthony:69:165:Brasilia
The number after the first colon is each person's age. I tried to get the name of the youngest person in every city with:
sort -t ":" -k4,4 -k2,2n data.txt | awk -F ':' '!a[$4]++' | sort -t ":" -k4,4
My question is: is there a more efficient way? And can I get just the name of the person and the city? Thanks

You can do it entirely in awk. Use an array to hold the youngest age for each city, and a second array to hold the name of the person with that age.
awk -F: '!($4 in age) || $2 < age[$4] { age[$4] = $2; name[$4] = $1 }
END {for (city in name) print city, name[city] }' data.txt
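With the sample data this prints, for example (the order of a for (city in name) loop is unspecified in awk, so the lines may come out in any order):
Brasilia Anthony
Miami Lina
Paris Phoebe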

To get just the name and the city, you can split on ':' and print only fields 1 and 4, like this: | awk -F: '{print $1 " " $4}'
Giving:
sort -t ":" -k4,1 -k2,2 datos.txt | awk -F ':' '!a[$4]++' | sort -t ":" -k4 | awk -F: '{print $1 " " $4}'

Without awk:
$ sort -t: -k4,4 -k2,2n file | # sort by city, then age
tr ':' ' ' | # replace delimiter
uniq -f3 | # take the min for each city
tr ' ' ':' # replace delimiter back
Anthony:69:165:Brasilia
Lina:16:170:Miami
Phoebe:21:179:Paris

You could print only the required fields within awk; further sorting is not needed, as the first sort already establishes the order.
$ # printing required field with space between (default OFS)
$ sort -t: -k4,4 -k2,2n ip.txt | awk -F: '!a[$4]++{print $1, $4}'
Anthony Brasilia
Lina Miami
Phoebe Paris
$ # printing with : between fields
$ sort -t: -k4,4 -k2,2n ip.txt | awk -F: '!a[$4]++{print $1 ":" $4}'
Anthony:Brasilia
Lina:Miami
Phoebe:Paris
With GNU datamash
$ datamash -t: -s -g4 min 2 < ip.txt
Brasilia:69
Miami:16
Paris:21
However, as far as I understand from the manual, it doesn't allow printing only specific fields.
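If you do need the names as well, one possible workaround (an untested sketch) is to join the datamash result back against the file with awk, printing ties more than once:
$ datamash -t: -s -g4 min 2 < ip.txt |
  awk -F: 'NR==FNR{min[$1]=$2; next} min[$4]==$2{print $1, $4}' - ip.txt
Lina Miami
Phoebe Paris
Anthony Brasilia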

Related

Counting the number of unique values based on two columns in bash

I have a tab-separated file looking like this:
A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234
Output:
A 3
B 2
C 1
Basically I need counts of unique second-column values for each value in the first column, all in one command with pipes. As you can see, there can be duplicates like "A 1234". I had some ideas with awk and cut, but neither of them seems to work: they just print all unique pairs, while I need the count of unique values from the second column grouped by the value in the first one.
awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci
I'd really appreciate your help! Thank you in advance.
Could you please try the following complete awk solution.
awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file
Explanation: adding a detailed explanation of the above.
awk '                ##Starting awk program from here.
BEGIN{
  FS=OFS="\t"        ##Setting tab as the input and output field separator.
}
!found[$0]++{        ##Checking condition: if the whole line is NOT present in found array then do following.
  val[$1]++          ##Creating val with 1st column as index and keep increasing its value here.
}
END{                 ##Starting END block of this program from here.
  for(i in val){     ##Traversing through array val here.
    print i,val[i]   ##Printing i and value of val with index i here.
  }
}
' Input_file         ##Mentioning Input_file name here.
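Running this against the sample file should print the expected counts, tab-separated (the for(i in val) order is unspecified):
A	3
B	2
C	1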
Using GNU awk:
$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file
Output:
A 3
B 2
C 1
Explained:
$ gawk -F\\t '{          # using GNU awk and tab as delimiter
  a[$1][$2]              # hash into a true 2D array (arrays of arrays, gawk 4.0+)
}
END {
  for(i in a)            # for all values in the first field
    print i,length(a[i]) # output value and the size of the related subarray
}' file
$ sort -u file | cut -f1 | uniq -c
3 A
2 B
1 C
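If you want the letter first, as in your desired output, swap the columns at the end:
$ sort -u file | cut -f1 | uniq -c | awk '{print $2, $1}'
A 3
B 2
C 1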
Another way, using the handy GNU datamash utility:
$ datamash -g1 countunique 2 < input.txt
A 3
B 2
C 1
This requires the input file to be sorted on the first column, like your sample. If the real file isn't, add -s to the options.
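That is, for an unsorted file:
$ datamash -s -g1 countunique 2 < input.txt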
You could try this:
sort -u file.tsv | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'
It works for your example, but I'm not sure whether it works for every case. Let me know if it doesn't!

AWK: how to count patterns in the first column?

I was trying to get the total number of "??", "M", "A" and "D" entries in the first column of this:
?? this is a sentence
M this is another one
A more text here
D more and more text
I have this sample line of code, but it doesn't work:
awk -v pattern="\?\?" '{$1 == pattern} END{print " "FNR}'
$ awk '{ print $1 }' file | sort | uniq -c
1 ??
1 A
1 D
1 M
If for some reason you want an awk-only solution:
awk '{ ++cnt[$1] } END { for (i in cnt) print cnt[i], i }' file
but I think that's needlessly complicated compared to using the built-in unix tools that already do most of the work.
If you just want to count one particular value:
awk -v value='??' '$1 == value' file | wc -l
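If you'd rather stay entirely within awk instead of piping to wc, the same count can be kept in a variable; a minimal equivalent sketch:
awk -v value='??' '$1 == value {n++} END {print n+0}' file
The n+0 forces numeric output, so you get 0 rather than an empty line when nothing matches.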
If you want to count only a subset of values, you can use a regex:
$ awk -v pattern='A|D|(\\?\\?)' '$1 ~ pattern { print $1 }' file | sort | uniq -c
1 ??
1 A
1 D
Here you do need to pass a \ so that the ?s are escaped within the regular expression. And because \ is itself a special character within the string being passed to awk, it needs to be escaped as well (hence the double backslash).

multiple field separator in awk

I'm trying to process an input which has two field separators, ; and space. I'm able to parse input with one separator using:
echo "10.23;7.15;6.23" | awk -v OFMF="%0.2f" 'BEGIN{FS=OFS=";"} {print $1,$2,$3}'
10.23;7.15;6.23
For an input with two separators I tried this, but it doesn't handle both separators:
echo "10.23;7.15 6.23" | awk -v OFMF="%0.2f" 'BEGIN{FS=OFS=";" || " "} {print $1,$2,$3}'
You want to set FS to a character list:
awk -F'[; ]' 'script' file
and the other builtin variable you're trying to set is named OFMT, not OFMF:
$ echo "10.23;7.15 6.23" | awk -F'[; ]' -v OFMT="%0.2f" '{print $1,$2,$3}'
10.23 7.15 6.23
$ echo "10.23;7.15 6.23" | awk 'BEGIN{FS="[; ]"; OFS=";"; OFMT="%0.2f"} {print $1,$2,$3}'
10.23;7.15;6.23
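One caveat: [; ] matches exactly one ; or one space, so runs of separators (e.g. a double space) would produce empty fields. If your input can contain such runs, use a regex FS with +:
$ echo "10.23;7.15  6.23" | awk 'BEGIN{FS="[; ]+"; OFS=";"} {print $1,$2,$3}'
10.23;7.15;6.23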

Sort and print in line

Input:
54578787 -58 1
6578999 -658- 3
1352413 -541- 11
4564564 -23- 11
654564 -65- 3
6543564 -65- 1
Desired output:
column3 = 1,3,11
Using:
a=$(awk '{print $3}' text | sort -u | paste -s -d,) && paste <(echo "column3 =") <(echo $a)
I only get:
column3 = [large blank] 1,11,3
Another issue: if I remove all hyphens from the second column, I get
column3 = [large blank] ,1,11,3
I think it's a paste command issue.
Last but not least: why do I have 1,11,3 instead of 1,3,11?
I would just use awk:
$ awk '{a[$3]} END {printf "column3 = "; for (i in a) {printf "%d%s", i, (++v==length(a)?"\n":",")}}' file
column3 = 1,3,11
Explanation
a[$3] populates the a[] array, using the 3rd column as the index. This way, any new value creates a new index.
END {} performs commands after processing the whole file.
printf "column3 = " prints "column3 =".
for (i in a) {printf "%d%s", i, (++v==length(a)?"\n":",")} loops through the stored indices and prints them comma-separated, with a newline after the last one. Two caveats: length() on an array is an extension (it works in gawk but not in every awk), and the for (i in a) iteration order is unspecified, so the values are not guaranteed to come out numerically sorted.
Your current solution would work like this:
$ paste -d" " <(echo "column3 =") <(awk '{print $3}' file | sort -u | paste -s -d,)
column3 = 1,11,3
Note there is no need to store the result in a variable first. And to have just one space, use paste -d" ".
And to have it sorted numerically? Just add -n to your sort:
$ paste -d" " <(echo "column3 =") <(awk '{print $3}' file | sort -nu | paste -s -d,)
column3 = 1,3,11
With this command you get the same output, no matter the hyphens.
You can do something like
echo "column3 = $(awk '{print $3}' test.txt |sort -nu | paste -s -d, )"
gives me
column3 = 1,3,11
One key element is to sort with the -n option to do numerical sorting.
It also works with the hyphens deleted:
echo "column3 = $(tr -d - < test.txt| awk '{print $3}' |sort -nu | paste -s -d, )"
also outputs
column3 = 1,3,11
If perl is acceptable:
perl -lanE '
$c3{$F[2]} = 1;
END {say "column3 = ", join(",", sort {$a <=> $b} keys %c3)}
' file
My gawk line looks like:
awk '{a[$3]} END{c=asorti(a,d,"@ind_num_asc"); printf "column3 = ";
for(x=1;x<=c;x++)printf "%d%s", d[x],(c==x?"\n":",")}' file
output:
column3 = 1,3,11
Note:
you need gawk to run this (for the asorti function)
it sorts ascending, as numbers
the output is on a single line.
Assuming you truly want the numbers sorted and not just reproduced in the order they are first seen:
$ awk '{print $3}' file | sort -nu | awk '{s=(NR>1?s",":"")$0} END{print "column3 =",s}'
column3 = 1,3,11
You were getting 1,11,3 because without the -n arg for sort you were sorting alphabetically instead of numerically, and the first character of 11 (i.e. 1) sorts before the first character of 3.

Identifying usernames having duplicate user-id in /etc/passwd file

I am trying to find all users in my /etc/passwd which have a user-id of 0. It should display the username as well as the user-id. I tried the following:
awk -F: '{
count[$3]++;}END {
for (i in count)
print i, count[i];
}' passwd
It gives the duplicate user-ids and how many times they occur. I actually want the usernames as well, along with the duplicate user-ids, like this:
zama 0
root 0
bin 100
nologin 100
It would be great if the solution used awk associative arrays, but other methods are also fine.
Does this do what you want?
awk -F: '{
count[$3]++; names[$3 "," count[$3]] = $1}END {
for (i in count) {
for (j = 1; j <= count[i]; j++) {
print names[i "," j], i, count[i];
}
}
}' passwd
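Note that this prints every uid group, singletons included. For the duplicated ids in your example it would produce lines like the following (the for (i in count) order is unspecified):
zama 0 2
root 0 2
bin 100 2
nologin 100 2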
This might work for you:
awk -F: '$3~/0/{if($3 in a){d[$3];a[$3]=a[$3]"\n"$1" "$3}else{a[$3]=$1" "$3}};END{for(x in d)print a[x]}' /etc/passwd
or this non-awk solution, which relies on GNU uniq's -D (print all duplicates) and field skipping:
cut -d: -f1,3 /etc/passwd |
sort -st: -k2,2n |
tr ':' ' ' |
uniq -Df1
myid=$(awk -F: '{print $3}' passwd | sort | uniq -d)
for i in $myid; do
    awk -F: -v id="$i" '$3 == id {print $1, $3}' passwd
done
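For reference, the same thing can also be done in a single pass with awk associative arrays (a sketch; the order in which duplicate groups are printed is unspecified):
awk -F: '{cnt[$3]++; names[$3] = names[$3] sep[$3] $1 " " $3; sep[$3] = "\n"}
END {for (id in cnt) if (cnt[id] > 1) print names[id]}' /etc/passwd
Each "name uid" line is accumulated per uid, and the END block prints only the uids that occurred more than once.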