How to do unique count with grep and wc together? - cut

cut -f1 test.csv | wc -l
I want to get the first column of the file, and do a unique count.
Where can I add a 'sort -u' to make it count unique lines only?
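One way (a sketch, assuming test.csv is tab-separated, since cut is used without -d) is to put the deduplication between cut and wc, so duplicates are removed before the lines are counted:
cut -f1 test.csv | sort -u | wc -l
Using sort | uniq in place of sort -u gives the same count.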

Related

Sed/Awk: how to find and remove two lines if a pattern in the first line is being repeated; bash

I am processing text file(s) with thousands of records per file. Each record is made up of two lines: a header that starts with ">", followed by a line with a long string of the characters "-AGTCNR". The header has 10 fields separated by "|", the first of which is a unique identifier for each record, e.g. ">KEN096-15"; a record is termed a duplicate if it has the same identifier. Here is what a simple record looks like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----
Now I am trying to delete the repeats, i.e. the duplicate records of "ACRJP458-10" and "PMANL2431-12".
Using a bash script I have extracted the unique identifiers and stored the repeated ones in a variable "$duplicate_headers". Currently, I am trying to find any repeated instances of their two-line records and delete them as follows:
for i in "$@"
do
    unset duplicate_headers
    duplicate_headers=`grep ">" $1 | awk 'BEGIN { FS="|"}; {print $1 "\n"; }' | sort | uniq -d`
    for header in `echo -e "${duplicate_headers}"`
    do
        sed -i "/^.*\b${header}\b.*$/,+1 2d" $i
        #sed -i "s/^.*\b${header}\b.*$//,+1 2g" $i
        #sed -i "/^.*\b${header}\b.*$/{$!N; s/.*//2g; }" $i
    done
done
The final result (with thousands of records in mind) will look like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
$ awk -F'[|]' 'NR%2{f=seen[$1]++} !f' file
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
To run it on multiple files at once, use this to remove duplicates across all files:
awk -F'[|]' 'FNR%2{f=seen[$1]++} !f' *
or this to only remove duplicates within each file:
awk -F'[|]' 'FNR==1{delete seen} FNR%2{f=seen[$1]++} !f' *
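Since the one-liner is terse, here is a commented, functionally equivalent expansion of the single-file version (a readability sketch of the same logic, not a different approach):
awk -F'[|]' '
NR % 2 {           # odd-numbered lines are the ">" headers
  f = seen[$1]++   # f is non-zero if this identifier (field 1) was seen before
}
!f                 # print the header and its following sequence line only while f is 0
' file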

Finding top directories sending out mail per day

I am trying to find out the top directories (cPanel accounts) sending out mail per day.
I have tried using the following code, which works, but it doesn't limit the results to a specific day/date.
grep cwd /var/log/exim_mainlog | grep -v /var/spool | awk -F"cwd=" '{print $2}' | awk '{print $1}' | sort | uniq -c | sort -n
Is there any way I can amend this code to show only the results for a specific date?
If you'd like to filter by a specific date, perhaps include the date string in your grep statement?
If you want to group by day, then use sed to strip the time part of the date and use uniq -c to count the occurrences of each day.
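For example (a sketch, assuming the usual exim_mainlog format where every line starts with a YYYY-MM-DD HH:MM:SS timestamp; 2017-06-01 is only a placeholder date):
grep '^2017-06-01' /var/log/exim_mainlog | grep cwd | grep -v /var/spool | awk -F"cwd=" '{print $2}' | awk '{print $1}' | sort | uniq -c | sort -n
And to see per-day totals instead, strip everything after the date before counting:
grep cwd /var/log/exim_mainlog | grep -v /var/spool | sed 's/^\([0-9-]\{10\}\).*/\1/' | sort | uniq -c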

What did I do wrong? Not sorting properly with awk

Hi, so basically I have a 'temp' text file that contains a long list of various email addresses (some repeated). What I'm trying to output is the email addresses ordered by highest frequency, followed by the total number of unique email addresses at the end.
awk '{printf "%s %s\n", $2, $1} END {print "total "NR}' temp | sort -n | uniq -c -i
So far I get the output I want, except that it's not ordered by frequency; instead, it's in alphabetical order.
I've been stuck on this for a few hours now and have no idea why. I know I probably did something wrong but I'm not sure what. Please let me know if you need more information, or if the code I provided is not the problem. Thank you in advance.
edit: I've also tried doing sort -nk1 (output has frequency in first column) and even -nk2
edit2: Here is a sample of my 'temp' file
aol.com
netscape.net
yahoo.com
yahoo.com
adelphia.net
twcny.rr.com
charter.net
yahoo.com
edit 3:
expected output:
33 aol.com
24 netscape.net
18 yahoo.com
5 adelphia.net
4 twcny.rr.com
3 charter.net
total 6
(no repeat emails, 6 total unique email addresses)
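Why the posted pipeline comes out alphabetical: sort runs before uniq -c, so the counts produced by uniq -c are never themselves sorted numerically. A minimal correction is to count first and then sort numerically in reverse, for example:
sort temp | uniq -c -i | sort -rn ; echo "total $(sort -u temp | wc -l)"
The answers below give more complete variants.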
Sample input modified to include an email with two instances
$ cat ip.txt
aol.com
netscape.net
yahoo.com
yahoo.com
adelphia.net
twcny.rr.com
netscape.net
charter.net
yahoo.com
Using perl
$ perl -lne '
    $c++ if !$h{$_}++;
    END
    {
        @k = sort { $h{$b} <=> $h{$a} } keys %h;
        print "$h{$_} $_" foreach (@k);
        print "total ", $c;
    }' ip.txt
3 yahoo.com
2 netscape.net
1 adelphia.net
1 charter.net
1 aol.com
1 twcny.rr.com
total 6
$c++ if !$h{$_}++ increments the unique-line counter only the first time a line is seen, and increments the hash value keyed by the input line. The default initial value is 0 for both.
After processing all input lines:
@k = sort { $h{$b} <=> $h{$a} } keys %h gets the keys sorted by descending numeric hash value
print "$h{$_} $_" foreach (@k) prints each hash value and key, following the sorted keys @k
print "total ", $c prints the total number of unique lines
Can be written in single line if preferred:
perl -lne '$c++ if !$h{$_}++; END{@k = sort { $h{$b} <=> $h{$a} } keys %h; print "$h{$_} $_" foreach (@k); print "total ", $c}' ip.txt
Reference: How to sort perl hash on values and order the keys correspondingly
In GNU awk, using @Sundeep's data:
$ cat program.awk
{ a[$0]++ }                                 # count domains
END {
    PROCINFO["sorted_in"]="@val_num_desc"   # sort in desc order for the for loop
    for(i in a) {                           # this for loops in desc order
        print a[i], i
        j++                                 # count total
    }
    print "total", j
}
Run it:
$ awk -f program.awk ip.txt
3 yahoo.com
2 netscape.net
1 twcny.rr.com
1 aol.com
1 adelphia.net
1 charter.net
total 6
Updated / Summary
Summarising a few tested approaches to this handy sorting task:
Using bash (In my case v4.3.46)
sortedfile="$(sort temp)" ; countedfile="$(uniq -c <<< "$sortedfile")" ; uniquefile="$(sort -rn <<< "$countedfile")" ; totalunique="$(wc -l <<< "$uniquefile")" ; echo -e "$uniquefile\nTotal: $totalunique"
Using sh/ash/busybox (Though they aren't all the same, they all worked the same for these tests)
time (sort temp > /tmp/sortedfile ; uniq -c /tmp/sortedfile > /tmp/countedfile ; sort -rn /tmp/countedfile > /tmp/uniquefile ; totalunique="$(cat /tmp/uniquefile | wc -l)" ; cat /tmp/uniquefile ; echo "Total: $totalunique")
Using perl (see this answer https://stackoverflow.com/a/40145395/3544399)
perl -lne '$c++ if !$h{$_}++; END{@k = sort { $h{$b} <=> $h{$a} } keys %h; print "$h{$_} $_" foreach (@k); print "Total: ", $c}' temp
What was tested
A file temp was created using a random generator:
The @domain.com part was different in the unique addresses
Duplicated addresses were scattered
The file had 55304 total addresses
The file had 17012 duplicate addresses
A small sample of the file looks like this:
24187@9674.com
29397@13000.com
18398@27118.com
23889@7053.com
24501@7413.com
9102@4788.com
16218@20729.com
991@21800.com
4718@19033.com
22504@28021.com
Performance:
For the sake of completeness it's worth mentioning the performance:

          perl:         sh:           bash:
Total:    17012         17012         17012
real      0m0.119s      0m0.838s      0m0.973s
user      0m0.061s      0m0.772s      0m0.894s
sys       0m0.027s      0m0.025s      0m0.056s
Original Answer (Counted total addresses and not unique addresses):
tcount="$(cat temp | wc -l)" ; sort temp | uniq -c -i | sort -rn ; echo "Total: $tcount"
tcount="$(cat temp | wc -l)": Make a variable with the total line count
sort temp: Group email addresses ready for uniq
uniq -c -i: Count occurrences allowing for case variation
sort -rn: Sort according to numerical occurrences and reverse the order (highest on top)
echo "Total: $tcount": Show the total addresses at the bottom
Sample temp file:
john@domain.com
john@domain.com
donald@domain.com
john@domain.com
sam@domain.com
sam@domain.com
bill@domain.com
john@domain.com
larry@domain.com
sam@domain.com
larry@domain.com
larry@domain.com
john@domain.com
Sample Output:
5 john@domain.com
3 sam@domain.com
3 larry@domain.com
1 donald@domain.com
1 bill@domain.com
Total: 13
Edit: See comments below regarding use of sort

cut -f command for selecting multiple fields

I have a file as below:
A,B
1,hi there
2, Heloo there
I am trying to print column B first and then column A:
cat file.txt | cut -d "," -f2,1
However, this does not work. Any ideas on how to do it?
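cut cannot reorder fields: with -f2,1 the fields are still printed in their original order, exactly as with -f1,2. A sketch of one common alternative, using awk with the same comma delimiter, to print column B before column A:
awk -F, '{print $2 "," $1}' file.txt
For the sample above this prints "B,A", then "hi there,1", then " Heloo there,2" (the leading space is carried over from the input).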

Count files that have unique prefixes

I have a set of files that looks like the following. I'm looking for a good way to count all files that have unique prefixes, where "prefix" is defined as all characters before the second hyphen.
0406-0357-9.jpg 0591-0349-9.jpg 0603-3887-27.jpg 59762-1540-40.jpg 68180-517-6.jpg
0406-0357-90.jpg 0591-0349-90.jpg 0603-3887-28.jpg 59762-1540-41.jpg 68180-517-7.jpg
0406-0357-91.jpg 0591-0349-91.jpg 0603-3887-29.jpg 59762-1540-42.jpg 68180-517-8.jpg
0406-0357-92.jpg 0591-0349-92.jpg 0603-3887-3.jpg 59762-1540-5.jpg 68180-517-9.jpg
0406-0357-93.jpg 0591-0349-93.jpg 0603-3887-30.jpg 59762-1540-6.jpg
Depending on what you actually want output, either of these might be what you want:
ls | awk -F'-' '{c[$1"-"$2]++} END{for (p in c) print p, c[p]}'
or
ls | awk -F'-' '!seen[$1,$2]++{count++} END{print count+0}'
If it's something else, update your question to show the output you're looking for.
This should do it:
ls *.jpg | cut -d- -s -f1,2 | uniq | wc -l
Or if your prefixes are always 4 digits, one dash, 4 digits, you don't need cut:
ls *.jpg | uniq -w9 | wc -l
Parses ls (bad, but it doesn't look like it will cause a problem with these filenames),
uses awk to set the field separator as -.
!seen[$1,$2]++ uses an associative array with $1,$2 as the key and increments it, then checks whether the pre-increment value was 0 to ensure each prefix is only counted once (based on $1 and $2).
print prints on screen :)
ls | awk 'BEGIN{FS="-" ; printf("%-20s%-10s\n","Prefix","Count")} {seen[$1"-"$2]++} END{ for (k in seen){printf("%-20s%-10i\n",k,seen[k])}}'
Will now count based on prefix with headers :)