First of all, thank you for your help. I have the file letter.txt:
A
B
C
And I have the file number.txt
B 10
D 20
A 15
C 18
E 23
A 12
B 14
I want to count how many times each letter in letter.txt appears in number.txt, so the output will be:
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5
I know I can do it using this code, but I want to do it generally with any file.
cat number.txt | awk 'BEGIN {A=0;B=0;C=0;count=0} {count++} {if ($1 == "A") A++} {if ($1 == "B") B++} {if ($1 == "C") C++} END {print "We have found " A " A\n" "We have found " B " B\n" "We have found " C " C"}'
You basically want to do an inner join (easy enough to google) and group by the join key and return the count for each group.
awk 'NR==FNR { count[$1] = 0; next }
     $1 in count { ++count[$1]; ++total }
     END { for (k in count)
               print "We have found", count[k], k
           print "Total", total+0, "letters" }' letter.txt number.txt
All of this should be easy to find in a basic Awk tutorial, but in brief, the line number within the file FNR is equal to the overall line number NR when you are reading the first input file. We initialize count to contain the keys we want to look for. If we fall through, we are reading the second file; if we see a key we want, we increase its count. When we are done, report what we found.
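Run against the sample files above, it should print something like this (the for (k in count) loop doesn't guarantee any particular order):
We have found 2 A
We have found 2 B
We have found 1 C
Total 5 letters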
Consider starting with:
$ join letter.txt <(cut -d' ' -f1 number.txt | sort) | uniq -c
2 A
2 B
1 C
Then:
$ join letter.txt <(cut -d' ' -f1 number.txt | sort) | uniq -c |
awk '
{ print "We have found", $1, $2; tot+=$1 }
END { print "Total letter found:", tot+0 }
'
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5
although in reality I'd probably just do it all in awk; I just wanted to show an alternative.
I don't know if you need awk; to me it's easier (though slower in execution, as noted in the comments) to use grep -c:
while IFS= read -r line; do
    c=$(grep -c "^$line " number.txt)
    echo "We have found $c $line"
done < letter.txt
It's a loop where $c is the count taken from grep -c for each letter; quoting the variable and anchoring the pattern with ^ (and a trailing space) ensures only the first field is matched.
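If you also want the "Total letter found" line from your expected output, here is a sketch along the same lines that keeps a running sum (using the letter.txt and number.txt names from the question):
total=0
while IFS= read -r line; do
    c=$(grep -c "^$line " number.txt)
    echo "We have found $c $line"
    total=$((total + c))
done < letter.txt
echo "Total letter found: $total"
Because the loop reads from a redirection rather than a pipe, $total keeps its value after done.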
grep and coreutils can also do this:
grep -f letter.txt number.txt | cut -d' ' -f1 | sort | uniq -c
Output:
2 A
2 B
1 C
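If you want the exact wording from the question, you can feed that into the same kind of awk as shown above (adding -w so a letter can't match inside a longer field), for example:
grep -wf letter.txt number.txt | cut -d' ' -f1 | sort | uniq -c |
awk '{ print "We have found", $1, $2; tot+=$1 } END { print "Total letter found:", tot+0 }'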
First of all, thank you for your help. I have a problem trying to use bash conditionals with two files. I have the file letters.txt
A
B
C
And I have the file number.txt
B 10
D 20
A 15
C 18
E 23
A 12
B 14
And I want to use conditionals so that if a letter in letters.txt is also in number.txt, it generates the files a.txt, b.txt, c.txt, which will look like this:
a.txt
A 12
A 15
b.txt
B 10
B 14
c.txt
C 18
I know I can do it using this code:
cat number.txt | awk '{if($1=="A")print $0}' > a.txt
But I want to do it using two files.
The efficient way to approach this type of problem is to sort the input on the key field(s) first. That way you don't need to have multiple output files open simultaneously (which runs into open-file limits and/or slows things down managing them), and you don't have to open and close an output file with every line read (which is always slow).
Using GNU sort's -s (stable sort) to retain the input order of the non-key fields, and having only one output file open at a time, keeping it open for the whole time it's being populated:
$ sort -k1,1 -s number.txt |
awk '
NR==FNR { lets[$1]; next }
!($1 in lets) { next }
$1 != prev { close(out); out=tolower($1) ".txt"; prev=$1 }
{ print > out }
' letters.txt -
$ head ?.txt
==> a.txt <==
A 15
A 12
==> b.txt <==
B 10
B 14
==> c.txt <==
C 18
If you don't have GNU sort for -s to retain input order of the lines for each key field, you can replace it with awk | sort | cut, e.g.:
$ sort -k1,1 -s number.txt
A 15
A 12
B 10
B 14
C 18
D 20
E 23
$ awk '{print NR, $0}' number.txt | sort -k2,2 -k1,1n | cut -d' ' -f2-
A 15
A 12
B 10
B 14
C 18
D 20
E 23
Note the change in the order of the 2nd fields for A compared to the input order: without one of the above approaches, sort by default doesn't guarantee to retain the relative line order of the input for each key it sorts on:
$ sort -k1,1 number.txt
A 12
A 15
B 10
B 14
C 18
D 20
E 23
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$0]
next
}
($1 in arr){
outputFile=(tolower($1)".txt")
print >> (outputFile)
close(outputFile)
}
' letters.txt number.txt
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when letters.txt is being read.
arr[$0] ##Creating array arr with index of current line.
next ##next will skip all further statements from here.
}
($1 in arr){ ##checking condition if 1st field is present in arr.
outputFile=(tolower($1)".txt") ##Creating outputFile to print output.
print >> (outputFile) ##Printing current line into output file.
close(outputFile) ##Closing output file in backend.
}
' letters.txt number.txt ##Mentioning Input_file names here.
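Running that on the shown samples should leave files like these (input order is preserved since nothing is sorted; note that >> appends, so remove any old a.txt/b.txt/c.txt before re-running):
$ head ?.txt
==> a.txt <==
A 15
A 12
==> b.txt <==
B 10
B 14
==> c.txt <==
C 18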
I have a tab-separated file looking like this:
A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234
Output:
A 3
B 2
C 1
Basically I need counts of unique values in the second column for each value in the first column, all in one command with pipelines. As you may see, there can be some duplicates like "A 1234". I had some ideas with awk or cut, but neither of them seems to work. They just print out all unique pairs, while I need the count of unique second-column values for each value in the first one.
awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci
I'd really appreciate your help! Thank you in advance.
With a complete awk solution, could you please try the following.
awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{
FS=OFS="\t"
}
!found[$0]++{ ##Checking condition if 1st and 2nd column is NOT present in found array then do following.
val[$1]++ ##Creating val with 1st column as index and keep increasing its value here.
}
END{ ##Starting END block of this program from here.
for(i in val){ ##Traversing through array val here.
print i,val[i] ##Printing i and value of val with index i here.
}
}
' Input_file ##Mentioning Input_file name here.
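Note that for(i in val) does not guarantee any particular output order; if you want the keys in A, B, C order you can simply pipe the result through sort, e.g.:
awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file | sort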
Using GNU awk:
$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file
Output:
A 3
B 2
C 1
Explained:
$ gawk -F\\t '{ # using GNU awk and tab as delimiter
a[$1][$2] # hash to 2D array
}
END {
for(i in a) # for all values in first field
print i,length(a[i]) # output value and the size of related array
}' file
$ sort -u file | cut -f1 | uniq -c
3 A
2 B
1 C
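uniq -c puts the count first; if you want the key first, as in your expected output, one option is a small extra awk step that swaps the two columns:
$ sort -u file | cut -f1 | uniq -c | awk -v OFS='\t' '{ print $2, $1 }'
A 3
B 2
C 1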
Another way, using the handy GNU datamash utility:
$ datamash -g1 countunique 2 < input.txt
A 3
B 2
C 1
This requires the input file to be sorted on the first column, like your sample. If the real file isn't, add -s to the options.
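That is, for unsorted input the call would just become:
$ datamash -s -g1 countunique 2 < input.txt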
You could try this:
cat file.tsv | sort | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'
It works for your example. (But I'm not sure if it works for other cases. Let me know if it doesn't work!)
When sorting a file, I am not preserving the header in its position:
file_1.tsv
Gene Number
a 3
u 7
b 9
sort -k1,1 file_1.tsv
Result:
a 3
b 9
Gene Number
u 7
So I am trying this code:
sed '1d' file_1.tsv | sort -k1,1 > file_1_sorted.tsv
first='head -1 file_1.tsv'
sed '1 "$first"' file_1_sorted.tsv
What I did was remove the header, sort the rest of the file, and then try to add the header back. But I am not able to do this last part, so I would like to know how I can copy the header of the original file and insert it as the first row of the new file without overwriting its actual first row.
You can do this as well:
{ head -1; sort; } < file_1.tsv
** Update **
For macOS, where head may read a whole buffer from the shared input and not seek back (so sort would miss lines), read the header with read instead:
{ IFS= read -r header; printf '%s\n' "$header" ; sort; } < file_1.tsv
a simpler awk
$ awk 'NR==1{print; next} {print | "sort"}' file
$ head -1 file; tail -n +2 file | sort
Output:
Gene Number
a 3
b 9
u 7
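With the awk version you can also pass options inside the command string that awk pipes to; for instance, to sort on the first field only (just an illustrative variant):
$ awk 'NR==1{print; next} {print | "sort -k1,1"}' file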
Could you please try the following.
awk '
FNR==1{
first=$0
next
}
{
val=(val?val ORS:"")$0
}
END{
print first
print val | "sort"
}
' Input_file
Logical explanation:
Check the condition FNR==1 to see if it's the first line; if so, save its value in a variable and move on to the next line with next.
Then keep appending each line's value to another variable, separated by newlines, until the last line.
Then, in the END block, which executes once Input_file has been read completely, print the first line's value and pipe the rest of the lines to the sort command.
This will work using any awk, sort, and cut, in any shell on every UNIX box. It works whether the input is coming from a pipe (when you can't read it twice) or from a file (when you can), and it doesn't involve awk spawning a subshell:
awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
The above uses awk to stick a 0 at the front of the header line and a 1 in front of the rest, so you can sort by that number first and then by whatever other field(s) you want, and then remove the added field again with cut. Here it is in stages:
$ awk -v OFS='\t' '{print (NR>1), $0}' file
0 Gene Number
1 a 3
1 u 7
1 b 9
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2
0 Gene Number
1 a 3
1 b 9
1 u 7
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
Gene Number
a 3
b 9
u 7
I was trying to get the total number of "??", "M", "A" and "D" from this:
?? this is a sentence
M this is another one
A more text here
D more and more text
I have this sample line of code, but it doesn't work:
awk -v pattern="\?\?" '{$1 == pattern} END{print " "FNR}'
$ awk '{ print $1 }' file | sort | uniq -c
1 ??
1 A
1 D
1 M
If for some reason you want an awk-only solution:
awk '{ ++cnt[$1] } END { for (i in cnt) print cnt[i], i }' file
but I think that's needlessly complicated compared to using the built-in unix tools that already do most of the work.
If you just want to count one particular value:
awk -v value='??' '$1 == value' file | wc -l
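or, if you'd rather keep it in a single awk process, the same count can be done with an END block (a minimal sketch; n+0 prints 0 when nothing matches):
awk -v value='??' '$1 == value { n++ } END { print n+0 }' file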
If you want to count only a subset of values, you can use a regex:
$ awk -v pattern='A|D|(\\?\\?)' '$1 ~ pattern { print $1 }' file | sort | uniq -c
1 ??
1 A
1 D
Here you do need to pass a \ so that the ?s are escaped within the regular expression. And because \ is itself a special character within the string being passed to awk, you need to escape it as well (hence the double backslash).
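If the backslash juggling bothers you, one way to sidestep it is to put each ? inside a bracket expression, where it is literal; this should give the same output as above:
$ awk -v pattern='A|D|[?][?]' '$1 ~ pattern { print $1 }' file | sort | uniq -c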
I want to pretty-print the output of a find-like script that would take input like this:
- 2015-10-02 19:45 102 /My Directory/some file.txt
and produce something like this:
f 102 /My Directory/some file.txt
In other words: "f" (for "file"), the file size (right-justified), then the pathname (which may itself contain an arbitrary number of spaces).
This would be easy in awk if I could write a script that takes $1, $4, and "everything from $5 through the end of the line".
I tried using the awk construct substr($0, index($0, $8)), which I thought meant "everything starting with field $8 to the end of $0".
Using index() in this way is offered as a solution on linuxquestions.org and was upvoted 29 times in a stackoverflow.com thread.
On closer inspection, however, I found that index() does not achieve this effect if the starting field happens to match an earlier point in the string. For example, given:
-rw-r--r-- 1 tbaker staff 3024 2015-10-01 14:39 calendar
-rw-r--r-- 1 tbaker staff 4062 2015-10-01 14:39 b
-rw-r--r-- 1 tbaker staff 2374 2015-10-01 14:39 now or later
Gawk (and awk) get the following results:
$ gawk '{ print index($0, $8) }' test.txt
49
15
49
In other words, on the second line the value of $8 ('b') matches at index 15 instead of 49, where the filename actually starts (as it does on the other lines).
My issue, then, is how to specify "everything from field X to the end of the string".
I have re-written this question in order to make this clear.
Looks to me like you should just be using the "stat" command rather than "ls", for the reasons already commented upon:
stat -c "f%15s %n" *
But you should double-check how your "stat" operates; its options differ between implementations (GNU stat and BSD stat, for example, take different format syntax).
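For instance, with the BSD/macOS stat the rough equivalent would presumably be something like this (an untested sketch; there -f takes the format, %z is the size and %N the name):
stat -f "f%15z %N" *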
The built-in awk function index() is sometimes recommended as a way
to print "from field 5 through the end of the string" [1, 2, 3].
In awk, index($0, $8) does not mean "the index of the first character of
field 8 in string $0". Rather, it means "the index of the first occurrence in
string $0 of the string value of field 8". In many cases, that first
occurrence will indeed be the first character in field 8 but this is not the
case in the example above.
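For the record, one awk idiom that really does mean "everything from field N to the end" is to strip the leading fields off a copy of the line one at a time, e.g. (a sketch assuming blank-separated fields, run against the test.txt sample above):
$ gawk '{ rest = $0; for (i = 1; i < 8; i++) sub(/^[^ ]+[ ]+/, "", rest); print rest }' test.txt
calendar
b
now or later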
It has been pointed out that parsing the output of ls is generally a bad
idea [4], in part because implementations of ls significantly differ in output.
Since the author of that note recommends find as a replacement for ls for some uses,
here is a script using find:
find "$@" -ls |
sed -e 's/^ *//' -e 's/ */ /g' -e 's/ /|/2' -e 's/ /|/2' -e 's/ /|/4' -e 's/ /|/4' -e 's/ /|/6' |
gawk -F'|' '{ $2 = substr($2, 1, 1) ; gsub(/^-/, "f", $2) }
{ printf("%s %15s %s\n", $2, $4, $6) }'
...which yields the required output:
f            4639 /Users/foobar/uu/a
f            3024 /Users/foobar/uu/calendar
f            2374 /Users/foobar/uu/xpect
This approach recursively walks through a file tree. However, there may of course be implementation differences between versions of find as well.
[1] http://www.linuxquestions.org/questions/linux-newbie-8/awk-print-field-to-end-and-character-count-179078/
[2] How to print third column to last column?
[3] Print Field 'N' to End of Line
[4] http://mywiki.wooledge.org/ParsingLs
Maybe some variation of find -printf | awk is what you're looking for?
$ ls -l tmp
total 2
-rw-r--r-- 1 Ed None 7 Oct 2 14:35 bar
-rw-r--r-- 1 Ed None 2 Oct 2 14:35 foo
-rw-r--r-- 1 Ed None 0 May 3 09:55 foo bar
$ find tmp -type f -printf "f %s %p\n" | awk '{sub(/^[^ ]+ +[^ ]+/,sprintf("%s %10d",$1,$2))}1'
f          7 tmp/bar
f          2 tmp/foo
f          0 tmp/foo bar
or
$ find tmp -type f -printf "%s %p\n" | awk '{sub(/^[^ ]+/,sprintf("f %10d",$1))}1'
f          7 tmp/bar
f          2 tmp/foo
f          0 tmp/foo bar
It won't work with file names that contain newlines.