I am trying to find all users in my /etc/passwd that have duplicate user-ids. It should display the username as well as the user-id. I tried the following:
awk -F: '{ count[$3]++ }
END {
    for (i in count)
        print i, count[i]
}' passwd
It gives the duplicate user-ids and how many times they occur. I actually want the usernames along with the duplicate user-ids, similar to:
zama 0
root 0
bin 100
nologin 100
It would be great if the solution used awk associative arrays, but other methods are also fine.
Does this do what you want?
awk -F: '{
    count[$3]++
    names[$3 "," count[$3]] = $1
}
END {
    for (i in count)
        for (j = 1; j <= count[i]; j++)
            print names[i "," j], i, count[i]
}' passwd
This might work for you:
awk -F: '$3~/0/{if($3 in a){d[$3];a[$3]=a[$3]"\n"$3" "$1}else{a[$3]=$3" "$1}};END{for(x in d)print a[x]}' /etc/passwd
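For readability, here is the same idea written out with comments (a sketch that drops the $3~/0/ filter, so every duplicated user-id is reported):
awk -F: '
{
    if ($3 in a) {                  # this uid has been seen before: remember it as a duplicate
        d[$3]
        a[$3] = a[$3] "\n" $3 " " $1
    } else {                        # first occurrence of this uid
        a[$3] = $3 " " $1
    }
}
END {
    for (x in d)                    # print only the uids that occurred more than once
        print a[x]
}' /etc/passwd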
or this non-awk solution:
cut -d: -f1,3 /etc/passwd |
sort -st: -k2,2n |
tr ':' ' ' |
uniq -Df1 |
sed 's/\(.*\) \(.*\)/\2 \1/p;d'
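Stage by stage, that pipeline does the following (the same commands, only annotated):
cut -d: -f1,3 /etc/passwd |        # keep username and user-id
sort -st: -k2,2n |                 # stable numeric sort on the user-id
tr ':' ' ' |                       # make the fields space-separated
uniq -Df1 |                        # keep only lines whose user-id (the second field) repeats
sed 's/\(.*\) \(.*\)/\2 \1/p;d'    # swap the two fields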
myid=$(awk -F: '{print $3}' passwd | sort | uniq -d)
for i in $myid; do
    egrep "^[^:]*:[^:]*:$i:" passwd | awk -F: '{print $1, $3}'
done
I have a large file with about 6 million records. I need to chunk this file into smaller files based on the first 17 characters, so that records whose first 17 characters are the same are grouped into a file of the same name.
The command I use for this is:
awk -v FIELDWIDTHS="17" '{print > $1".txt"}' $file_name
The problem is that this is painfully slow. For a file with 800K records it took about an hour to complete.
Sample input would be:
AAAAAAAAAAAAAAAAAAAAAAAAAAAA75838458
AAAAAAAAAAAAAAAAAAAAAAAAAAAA48234283
BBBBBBBBBBBBBBBBBBBBBBBBBBBB34723643
AAAAAAAAAAAAAAAAAAAAAAAAAAAA64734987
BBBBBBBBBBBBBBBBBBBBBBBBBBBB18741274
CCCCCCCCCCCCCCCCCCCCCCCCCCCC38123922
Is there a faster solution to this problem?
I read that Perl can also be used to split files, but I couldn't find an option like FIELDWIDTHS in Perl.
Any help will be greatly appreciated.
uname : Linux
bash-4.1$ ulimit -n
1024
sort file |
awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
Performance improvements included:
By not referring to any field, it lets awk skip doing field splitting
By sorting first and changing output file names only when the key part of the input changes, it lets awk only use 1 output file at a time instead of having to manage opening/closing potentially thousands of output files
And it's portable to all awks since it's not using gawk-specific extensions like FIELDWIDTHS.
If the lines in each output file have to retain their original relative order after sorting then it'd be something like this (assuming no white space in the input just like in the example you provided):
awk '{print substr($0,1,17)".txt", NR, $0}' file |
sort -k1,1 -k2,2n |
awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
After borrowing @dawg's script (perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}'| sort -n | cut -c8- > /tmp/test/file - thanks!) to generate the same type of sample input file he has, here are the timings for the above:
$ time sort ../file | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
real 0m45.709s
user 0m15.124s
sys 0m34.090s
$ time awk '{print substr($0,1,17)".txt", NR, $0}' ../file | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
real 0m49.190s
user 0m11.170s
sys 0m34.046s
and, for comparison, @dawg's approach running on the same machine as the above with the same input ... I killed it after it had been running for 14+ minutes:
$ time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' ../file
real 14m23.473s
user 0m7.328s
sys 1m0.296s
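As a quick sanity check, running either of the variants above on just the 6-line sample from the question should produce three output files, one per 17-character prefix (shown here just to illustrate the naming):
$ ls *.txt
AAAAAAAAAAAAAAAAA.txt  BBBBBBBBBBBBBBBBB.txt  CCCCCCCCCCCCCCCCC.txt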
I created a test file of this form:
% head file
SXXYTTLDCNKRTDIHE00004
QAMKKMCOUHJFSGFFA00001
XGHCCGLVASMIUMVHS00002
MICMHWQSJOKDVGJEO00005
AIDKSTWRVGNMQWCMQ00001
OZQDJAXYWTLXSKAUS00003
XBAUOLWLFVVQSBKKC00005
ULRVFNKZIOWBUGGVL00004
NIXDTLKKNBSUMITOA00003
WVEEALFWNCNLWRAYR00001
% wc -l file
600000 file
i.e., 120,000 different 17-letter prefixes, each with 00001 - 00005 appended, in random order.
If you want a version for yourself, here is that test script:
perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}'| sort -n | cut -c8- > /tmp/test/file
If I run this:
% time awk -v FIELDWIDTHS="17" '{print > $1".txt"}' file
Well I gave up after about 15 minutes.
You can do this instead; it truncates each output file the first time its prefix is seen, appends on later writes, and closes the file after every write so only one file is open at a time:
% time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file
You asked about Perl, and here is a similar program in Perl that is quite fast:
perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); print fh $_;} else { open(fh, ">", "$p.txt"); print fh $_; $seen{$p}++; } close fh' file
Here is a little script that compares Ed's awk to these:
#!/bin/bash
# run this in a clean directory Luke!
perl -le 'for (1..12000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' |
awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' |
awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' |
sort -n |
cut -c8- > file.txt
wc -l file.txt
#awk -v FIELDWIDTHS="17" '{cnt[$1]++} END{for (e in cnt) print e, cnt[e]}' file
echo "abd awk"
time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file.txt
echo "abd Perl"
time perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); print fh $_;} else { open(fh, ">", "$p.txt"); print fh $_; $seen{$p}++; } close fh' file.txt
echo "Ed 1"
time sort file.txt |
awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 2"
time sort file.txt | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 3"
time awk '{print substr($0,1,17)".txt", NR, $0}' file.txt | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
Which prints:
60000 file.txt
abd awk
real 0m3.058s
user 0m0.329s
sys 0m2.658s
abd Perl
real 0m3.091s
user 0m0.332s
sys 0m2.600s
Ed 1
real 0m1.158s
user 0m0.174s
sys 0m0.992s
Ed 2
real 0m1.069s
user 0m0.175s
sys 0m0.932s
Ed 3
real 0m1.174s
user 0m0.275s
sys 0m0.946s
I have the file datos.txt with the following:
Claude:41:166:Paris
John:37:185:Miami
Lina:16:170:Miami
Maurice:58:172:Paris
Phoebe:21:179:Paris
Anthony:69:165:Brasilia
The number after the first colon being the age of each person, I tried to get the name of the youngest person of every city with:
sort -t ":" -k4,1 -k2,2 datos.txt | awk -F ':' '!a[$4]++' | sort -t ":" -k4
My question is: is there a more efficient way? Can I just have the name of the person and the city? Thanks
You can do it entirely in awk. Use an array to hold the youngest age for each city, and a second array to hold the name of the person with that age.
awk -F: '!age[$4] || $2 < age[$4] { age[$4] = $2; name[$4] = $1; }
END {for (city in name) print city, name[city] }' datos.txt
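With the datos.txt sample above this prints, for example (the order of the for (city in name) loop is unspecified):
Paris Phoebe
Miami Lina
Brasilia Anthony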
To get just the name of the person and the city, you can split on ':' and print only fields 1 and 4, like this: | awk -F: '{print $1 " " $4}'
Giving:
sort -t ":" -k4,1 -k2,2 datos.txt | awk -F ':' '!a[$4]++' | sort -t ":" -k4 | awk -F: '{print $1 " " $4}'
Without awk:
$ sort -t: -k4,4 -k2,2 file | # sort by city, age
tr ':' ' ' | # replace delimiter
uniq -f3 | # take the min for each city
tr ' ' ':' # replace delimiter back
Anthony:69:165:Brasilia
Lina:16:170:Miami
Phoebe:21:179:Paris
You could print only the required fields within awk; further sorting is not needed as the first sort already establishes the order.
$ # printing required field with space between (default OFS)
$ sort -t: -k4,4 -k2,2 ip.txt | awk -F: '!a[$4]++{print $1, $4}'
Anthony Brasilia
Lina Miami
Phoebe Paris
$ # printing with : between fields
$ sort -t: -k4,4 -k2,2 ip.txt | awk -F: '!a[$4]++{print $1 ":" $4}'
Anthony:Brasilia
Lina:Miami
Phoebe:Paris
With GNU datamash
$ datamash -t: -s -g4 min 2 < ip.txt
Brasilia:69
Miami:16
Paris:21
However, as far as I've understood from the manual, it doesn't allow printing only specific fields.
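One possible workaround (a sketch, not a datamash feature) is to join the minimums back to the original file with awk to recover the names; if two people tie for youngest in a city, both are printed:
$ datamash -t: -s -g4 min 2 < ip.txt |
  awk -F: 'NR==FNR { min[$1]=$2; next } $2 == min[$4] { print $1 ":" $4 }' - ip.txt
Lina:Miami
Phoebe:Paris
Anthony:Brasilia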
I was trying to get the total number of "??", " M", "A" and "D" from this:
?? this is a sentence
M this is another one
A more text here
D more and more text
I have this sample line of code but it doesn't work:
awk -v pattern="\?\?" '{$1 == pattern} END{print " "FNR}'
$ awk '{ print $1 }' file | sort | uniq -c
1 ??
1 A
1 D
1 M
If for some reason you want an awk-only solution:
awk '{ ++cnt[$1] } END { for (i in cnt) print cnt[i], i }' file
but I think that's needlessly complicated compared to using the built-in unix tools that already do most of the work.
If you just want to count one particular value:
awk -v value='??' '$1 == value' file | wc -l
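Or, to keep the count inside awk instead of piping to wc -l, a small sketch of the same check:
awk -v value='??' '$1 == value { n++ } END { print n+0 }' file
The n+0 makes it print 0 rather than an empty line when there are no matches.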
If you want to count only a subset of values, you can use a regex:
$ awk -v pattern='A|D|(\\?\\?)' '$1 ~ pattern { print $1 }' file | sort | uniq -c
1 ??
1 A
1 D
Here you do need to pass a \ so that the ?s are escaped within the regular expression. And because the \ is itself a special character within the string being passed to awk, you need to escape it as well (hence the double backslash).
I want to sum up the occurrence counts output by the "uniq -c" command.
How can I do that on the command line?
For example, if I get the following output, I would need 250.
45 a4
55 a3
1 a1
149 a5
awk '{sum+=$1} END{ print sum}'
This should do the trick:
awk '{s+=$1} END {print s}' file
Or just pipe it into awk with
uniq -c whatever | awk '{s+=$1} END {print s}'
For each line, add the value of the first column to SUM, then print out the value of SUM.
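As a quick check against the sample from the question:
$ printf '45 a4\n55 a3\n1 a1\n149 a5\n' | awk '{s+=$1} END {print s}'
250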
awk is a better choice
uniq -c somefile | awk '{SUM+=$1}END{print SUM}'
but you can also implement the logic using bash
SUM=0
while read num other
do
    let SUM+=num
done < <(uniq -c somefile)   # process substitution: the loop runs in the current shell, so SUM survives
echo $SUM
uniq -c is slow compared to awk. Like REALLY slow.
{mawk/mawk2/gawk} 'BEGIN { OFS = "\t" }      # modify FS for the column you want to uniq -c upon
    { freqL[$1]++ }
    END { for (x in freqL) printf("%8s %s\n", freqL[x], x) }'
If your input isn't large (like 100MB+), then gawk suffices after adding in
PROCINFO["sorted_in"] = "#ind_num_asc"   # gawk-specific; just use gawk -b mode
If it's really large, it's far faster to use mawk2 and then pipe the result to GNU sort:
{ mawk/mawk2 stuff... } | gnusort -t'\t' -k 2,2
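Putting those pieces together, an end-to-end invocation might look like this (a sketch; note the shell needs $'\t', not '\t', to pass a literal tab to sort, and any awk works in place of mawk):
mawk '{ freqL[$1]++ } END { for (x in freqL) printf("%8s\t%s\n", freqL[x], x) }' file |
sort -t $'\t' -k2,2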
While the aforementioned answer uniq -c example-file | awk '{SUM+=$1}END{print SUM}' does sum the left-hand column of the uniq -c output, that total is simply the number of lines in the file, so wc -l somefile gives the same result, as mentioned in the comments.
If what you are looking for is the number of unique lines in your file, then you can use this command:
sort -h example-file | uniq | wc -l
I have a file (user.csv) like this:
ip,hostname,user,group,encryption,aduser,adattr
I want to print all columns sorted by user.
I tried awk -F ":" '{print|"$3 sort -n"}' user.csv, but it doesn't work.
How about just sort.
sort -t, -nk3 user.csv
where
-t, - defines your delimiter as ,.
-n - gives you a numerical sort. Added since you had it in your
attempt. If your user field is text only then you don't need it.
-k3 - defines the field (key). user is the third field.
Use awk to put the user ID in front.
Sort.
Use sed to remove the duplicate user ID, assuming user IDs do not contain any spaces.
awk -F, '{ print $3, $0 }' user.csv | sort | sed 's/^.* //'
Seeing as the original question was about how to use awk, every single one of the first 7 answers uses sort instead, and this is the top hit on Google, here is how to use awk.
Sample net.csv file with headers:
ip,hostname,user,group,encryption,aduser,adattr
192.168.0.1,gw,router,router,-,-,-
192.168.0.2,server,admin,admin,-,-,-
192.168.0.3,ws-03,user,user,-,-,-
192.168.0.4,ws-04,user,user,-,-,-
And sort.awk:
#!/usr/bin/awk -f
# usage: ./sort.awk -v f=FIELD FILE
BEGIN {
FS=","
}
# each line
{
a[NR]=$0 ""
s[NR]=$f ""
}
END {
isort(s,a,NR);
for(i=1; i<=NR; i++) print a[i]
}
#insertion sort of A[1..n]
function isort(S, A, n, i, j) {
for( i=2; i<=n; i++) {
hs = S[j=i]
ha = A[j=i]
while (S[j-1] > hs) {
j--;
S[j+1] = S[j]
A[j+1] = A[j]
}
S[j] = hs
A[j] = ha
}
}
To use it:
awk -f sort.awk f=3 < net.csv # OR
chmod +x sort.awk
./sort.awk f=3 net.csv
You can choose a delimiter; in this case I chose a colon, printed column number one, and sorted in alphabetical order:
awk -F\: '{print $1|"sort -u"}' /etc/passwd
awk -F, '{ print $3, $0 }' user.csv | sort -nk2
and for reverse order
awk -F, '{ print $3, $0 }' user.csv | sort -nrk2
Try this:
awk '{print $0|"sort -t',' -nk3 "}' user.csv
OR
sort -t',' -nk3 user.csv
awk -F "," '{print $0}' user.csv | sort -nk3 -t ','
This should work
To exclude the first line (header) from sorting, I split it out into two buffers.
df | awk 'BEGIN{header=""; body=""} { if(NR==1){header=$0}else{body=body"\n"$0}} END{print header; print body|"sort -nk3"}'
With GNU awk:
awk -F ',' '{ a[$3]=$0 } END{ PROCINFO["sorted_in"]="#ind_str_asc"; for(i in a) print a[i] }' file
See 8.1.6 Using Predefined Array Scanning Orders with gawk for more sorting algorithms.
I'm running Linux (Ubuntu) with mawk:
tmp$ awk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan
random-funcs: srandom/random
regex-funcs: internal
compiled limits:
sprintf buffer 8192
maximum-integer 2147483647
mawk (and gawk) can redirect the output of print to a command. From man awk, chapter 9 (Input and output):
The output of print and printf can be redirected to a file or command by appending > file, >> file or | command to the end of the print statement. Redirection opens file or command only once, subsequent redirections append to the already open stream.
Below you'll find a simplified example of how | can be used to pass the wanted records to an external program that does the hard work. This also nicely encapsulates everything in a single awk file and reduces command-line clutter:
tmp$ cat input.csv
alpha,num
D,4
B,2
A,1
E,5
F,10
C,3
tmp$ cat sort.awk
# print header line
/^alpha,num/ {
print
}
# all other lines are data lines that should be sorted
!/^alpha,num/ {
print | "sort --field-separator=, --key=2 --numeric-sort"
}
tmp$ awk -f sort.awk input.csv
alpha,num
A,1
B,2
C,3
D,4
E,5
F,10
See man sort for the details of the sort options:
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
-k, --key=KEYDEF
sort via a key; KEYDEF gives location and type
-n, --numeric-sort
compare according to string numerical value