Count number of different lines - awk

I have a file that has a lot of its lines repeated; it looks like this:
a
a
.
.
.
a
b
b
c
.
.
c
d
.
.
d
e
.
.
.
e
I need to count each line value only once, so for example if the only possible line values are a, b, c, d, and e, the number I'm interested in is 5.
Here's how I've been counting all of the lines in the file:
wc -l file
but that counts every repetition (n times a, m times b, etc.) and doesn't give me the number I'm after.
I sense this can be done using awk; any ideas?

Does it have to be awk? One way using shell commands is:
$ sort input.txt | uniq -c
10 .
3 a
2 b
2 c
2 d
2 e
Using awk:
$ awk '{a[$0]++}END{for(i in a){print i, a[i]}}' input.txt
a 3
b 2
. 10
c 2
d 2
e 2

You don't really need to do any programming for this, e.g.
$ sort -u input.txt | wc -l
sort -u sorts the input file removing any duplicates and the output is then piped to wc -l to generate a count of these unique lines.
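Note that for the sample input above this would print 6, not 5, because the "." filler lines count as one more unique value; if those dots are just placeholders, filter them out first, e.g.:
$ grep -v '^\.$' input.txt | sort -u | wc -l
5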

Given this file:
$ cat /tmp/lines.txt
a
a
.
.
.
a
b
b
c
.
.
c
d
.
.
d
e
.
.
.
e
You can also use Perl to filter the type of lines to count; in this case, only letters:
$ perl -lane '$c{$1}++ if /^(\w+)/; END {print "$_: $c{$_}" foreach (sort keys%c); $s = keys %c; print "total uniques: $s"}' /tmp/lines.txt
a: 3
b: 2
c: 2
d: 2
e: 2
total uniques: 5
The total number of unique values is given by the number of key/value pairs in the hash %c.
Similarly in awk, you can do:
$ awk '/\w+/{ a[$0]++}END{for(i in a){print i, a[i]; c++} print "unique lines:", c}' /tmp/lines.txt
a 3
b 2
c 2
d 2
e 2
unique lines: 5
Or, cobble together a grep/uniq/wc solution:
$ grep -E '\w+' /tmp/lines.txt | uniq | wc -l
5

The idiomatic way to do this in awk:
awk '!seen[$0]++' file
That prints a line only the first time it is seen. To get just the count of unique lines:
awk '!seen[$0]++{cnt++} END{print cnt+0}' file
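If you also want to skip the "." filler lines, the same idiom can carry a filter; like the /\w+/ examples above, this assumes an awk (e.g. GNU awk) that understands \w:
$ awk '/\w/ && !seen[$0]++{cnt++} END{print cnt+0}' file
5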

Count number of elements that match one file with another using AWK

First of all, thank you for your help. I have the file letter.txt:
A
B
C
And I have the file number.txt
B 10
D 20
A 15
C 18
E 23
A 12
B 14
I want to count how many times each letter in letter.txt appears in number.txt, so the output will be:
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5
I know I can do it using this code, but I want to do it generally with any file.
cat number.txt | awk 'BEGIN {A=0;B=0;C=0;count=0}; {count++}; {if ($1 == "A") A++}; {if ($1 == "B") B++}; {if ($1 == "C") C++} END {print "We have found " A " A\n" "We have found " B " B\n" "We have found " C " C"}'
You basically want to do an inner join (easy enough to google) and group by the join key and return the count for each group.
awk 'NR==FNR { count[$1] = 0; next }
     $1 in count { ++count[$1]; ++total }
     END { for (k in count)
             print "We have found", count[k], k
           print "Total", total, "letters" }' letter.txt number.txt
All of this should be easy to find in a basic Awk tutorial, but in brief, the line number within the file FNR is equal to the overall line number NR when you are reading the first input file. We initialize count to contain the keys we want to look for. If we fall through, we are reading the second file; if we see a key we want, we increase its count. When we are done, report what we found.
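For the sample letter.txt and number.txt above, that prints something like the following; the order of a for (k in count) loop is unspecified in awk, so the letters may come out in any order:
We have found 2 A
We have found 2 B
We have found 1 C
Total 5 letters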
Consider starting with:
$ join letter.txt <(cut -d' ' -f1 number.txt | sort) | uniq -c
2 A
2 B
1 C
Then:
$ join letter.txt <(cut -d' ' -f1 number.txt | sort) | uniq -c |
awk '
{ print "We have found", $1, $2; tot+=$1 }
END { print "Total letter found:", tot+0 }
'
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5
although in reality I'd probably just do it all in awk; I just wanted to show an alternative.
I don't know if you need awk; to me it's easier (but slower to execute, as you can read in the comments) to use grep -c:
cat file1 | while read line; do
    c=`grep -c "$line" file2 | sed 's/ //g'`
    echo We have found $c $line
done
It's a loop where $c is the count taken with grep -c, and sed removes any spaces from the grep -c output.
grep and coreutils can also do this:
grep -f letter.txt number.txt | cut -d' ' -f1 | sort | uniq -c
Output:
2 A
2 B
1 C
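If you also want the question's exact wording and total, that uniq -c output can be fed through the same kind of awk reformatting used with the join variant above:
grep -f letter.txt number.txt | cut -d' ' -f1 | sort | uniq -c |
awk '{ print "We have found", $1, $2; tot+=$1 } END { print "Total letter found:", tot+0 }'
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5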

Breakline after matching pattern with awk

I have this kind of file
A 1,2,3,4
B 1
C 1,2
I would like to get this output with awk:
A 1
A 2
A 3
A 4
B 1
C 1
C 2
C 3
Code I tried:
sed 's/,/\n&/g' file
Any idea with awk?
Could you please try the following, which uses a multiple-field-separator approach; it was written and tested with the shown samples in GNU awk.
awk 'BEGIN{FS="[ ,]"} {for(i=2;i<=NF;i++){print $1,$i}}' Input_file
2nd solution: splitting 2nd field value.
awk '{num=split($2,arr,",");for(i=1;i<=num;i++){print $1,arr[i]}}' Input_file
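For the sample input, both of those print the following (the extra "C 3" line in the desired output cannot be derived from the shown input, as noted below):
A 1
A 2
A 3
A 4
B 1
C 1
C 2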
Hmm:
$ awk '{gsub(/,/,ORS $1 OFS)}1' file
Output:
A 1
A 2
A 3
A 4
B 1
C 1
C 2
And if you really want THAT output from THAT input, you need to add END{print "C 3"} at the end...
Edit: Please see @EdMorton's comment for a pitfall.

Split large single column into two columns

I need to split a single column of data in a large file into two columns as follows:
A
B          B A
C  ---->   D C
D          F E
E          H G
F
G
H
Is there an easy way of doing it with unix shell commands and/or small shell script? awk?
$ awk 'NR%2{s=$0;next} {print $0,s}' file
B A
D C
F E
H G
You can use the following awk script:
awk 'NR % 2 != 0 {cache=$0}; NR % 2 == 0 {print $0 cache}' data.txt
Output:
BA
DC
FE
HG
It caches the value of odd lines and outputs even lines + appends the cache to them.
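If you want a space between the two columns, as in the question's diagram, print the cached line as a separate argument so OFS is inserted between them:
awk 'NR % 2 != 0 {cache=$0}; NR % 2 == 0 {print $0, cache}' data.txt
B A
D C
F E
H G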
I know this is tagged awk, but I just can't stop myself from posting a sed solution, since the question left it open for "easy way . . . with unix shell commands":
$ sed -n 'h;n;G;s/\n/ /g;p' data.txt
B A
D C
F E
H G

How to take a single record in a file and display it vertically with the field number shown before each field

I have a file with a single line in it that holds a record delimited by semicolons. So far I have figured out that I can use tr by issuing:
tr ';' '\n' < t
However, since the record has 140 fields, I'd like to show the field number when displaying it, like this:
1 23
2 324234
3 AAA
.
.
140 Blah
Help is greatly appreciated!
tr \; '\n' <t|nl
or
awk -v RS=';' '$1=++i" "$1' file
test:
kent$ echo "a;b;c;d"|awk -v RS=';' '$1=++i" "$1'
1 a
2 b
3 c
4 d
You could just run it through cat -n.
tr \; '\n' < t | cat -n
Since this is tagged awk, you could do it that way, too; it's just a little wordier:
awk -F\; '{for (i=1;i<=NF;++i) { print i" "$i }}' t
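For example, on a short three-field record:
$ printf '23;324234;AAA\n' | awk -F\; '{for (i=1;i<=NF;++i) print i" "$i}'
1 23
2 324234
3 AAA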
In the shell you can use IFS to specify a field separator, like so:
IFS=";"
i=0
for s in $(<file)
do
((i++))
echo $i $s
done

In AWK, is it possible to specify "ranges" of fields?

In AWK, is it possible to specify "ranges" of fields?
Example. Given a tab-separated file "foo" with 100 fields per line, I want to print only the fields 32 to 57 for each line, and save the result in a file "bar". What I do now:
awk 'BEGIN{OFS="\t"}{print $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57}' foo > bar
The problem with this is that it is tedious to type and prone to errors.
Is there some syntactic form which allows me to say the same in a more concise and less error prone fashion (like "$32..$57") ?
Besides the awk answer by @Jerry, there are other alternatives:
Using cut (assumes tab delimiter by default):
cut -f32-57 foo >bar
Using perl:
perl -nle '@a=split;print join "\t", @a[31..56]' foo >bar
Mildly revised version:
BEGIN { s = 32; e = 57; }
{ for (i=s; i<=e; i++) printf("%s%s", $(i), i<e ? OFS : "\n"); }
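A quick way to sanity-check that loop is to run it with a smaller range on a one-line input:
$ printf '1 2 3 4 5\n' | awk 'BEGIN{s=2; e=4} {for (i=s; i<=e; i++) printf("%s%s", $i, i<e ? OFS : "\n")}'
2 3 4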
You can do it in awk by using RE intervals. For example, to print fields 3-6 of the records in this file:
$ cat file
1 2 3 4 5 6 7 8 9
a b c d e f g h i
would be:
$ gawk 'BEGIN{f="([^ ]+ )"} {print gensub("("f"{2})("f"{4}).*","\\3","")}' file
3 4 5 6
c d e f
I'm creating an RE segment f to represent every field plus its succeeding field separator (for convenience), then I'm using that in the gensub to delete 2 of those (i.e. the first 2 fields), remember the next 4 for reference later using \3, and then delete what comes after them. For your tab-separated file where you want to print fields 32-57 (i.e. the 26 fields after the first 31) you'd use:
gawk 'BEGIN{f="([^\t]+\t)"} {print gensub("("f"{31})("f"{26}).*","\\3","")}' file
The above uses GNU awk for its gensub() function. With other awks you'd use sub() or match() and substr().
EDIT: Here's how to write a function to do the job:
gawk '
function subflds(s,e, f) {
f="([^" FS "]+" FS ")"
return gensub( "(" f "{" s-1 "})(" f "{" e-s+1 "}).*","\\3","")
}
{ print subflds(3,6) }
' file
3 4 5 6
c d e f
Just set FS as appropriate. Note that this will need a tweak for the default FS if your input file can start with spaces and/or have multiple spaces between fields and will only work if your FS is a single character.
I'm late, but this is quick and to the point so I'll leave it here. In cases like this I normally just remove the fields I don't need with gsub and print. Quick and dirty example: since you know your file is delimited by tabs, you can remove the first 31 fields:
awk '{gsub(/^(\w\t){31}/,"");print}'
example of removing 4 fields because lazy:
printf "a\tb\tc\td\te\tf\n" | awk '{gsub(/^(\w\t){4}/,"");print}'
Output:
e f
This is shorter to write, easier to remember and uses fewer CPU cycles than horrendous loops.
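Note that \w matches a single word character, so the (\w\t){31} pattern assumes one-character fields; for fields of arbitrary length you would repeat a whole-field pattern instead, along the lines of:
$ printf 'foo\tbar\tbaz\tqux\n' | awk '{gsub(/^([^\t]+\t){2}/,"");print}'
baz	qux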
You can use a combination of loops and printf for that in awk:
#!/bin/bash
start_field=32
end_field=57
awk -v start=$start_field -v end=$end_field 'BEGIN{OFS="\t"}
{for (i=start; i<=end; i++) {
printf "%s", $i;
if (i < end) {
printf "%s", OFS;
} else {
printf "\n";
}
}}'
This looks a bit hacky, however:
it properly delimits your output based on the specified OFS, and
it makes sure to print a new line at the end for each input line in the file.
I do not know a way to do field range selection in awk. I know how to drop fields at the end of the input (see below), but not easily at the beginning. Below is the hard way to drop fields at the beginning.
If you know a character c that is not included in your input, you could use the following awk script:
BEGIN { s = 32; e = 57; c = "#"; }
{ NF = e # Drop the fields after e.
$s = c $s # Put a c in front of the s field.
sub(".*"c, "") # Drop the chars before c.
print # Print the edited line.
}
EDIT:
And I just thought that you can always find a character that is not in the input: use \n.
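As a quick check of the idea with a smaller range (s=3, e=6) on a whitespace-separated line, assuming an awk such as gawk where assigning a smaller NF truncates the record:
$ printf '1 2 3 4 5 6 7 8 9\n' | awk 'BEGIN{s=3; e=6; c="#"} {NF=e; $s=c $s; sub(".*"c,""); print}'
3 4 5 6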
Unfortunately I don't seem to have access to my account anymore, but I also don't have 50 rep to add a comment anyway.
Bob's answer can be simplified a lot using 'seq':
echo $(seq -s ,\$ 5 9| cut -d, -f2-)
$6,$7,$8,$9
The minor disadvantage is you have to specify your first field number as one lower.
So to get fields 3 through 7, I specify 2 as the first argument.
seq -s ,\$ 2 7 sets the field separator for seq to ',$' and yields 2,$3,$4,$5,$6,$7
cut -d, -f2- sets the field delimiter to ',' and basically cuts off everything before the first comma, by showing everything from the second field on, thus resulting in $3,$4,$5,$6,$7
When combined with Bob's answer, we get:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(seq -s ,\$ 2 7| cut -d, -f2-)}" awk.txt
3 4 5 6 7
c d e f g
$
I use this simple function, which does not check that the field range exists in the line.
function subby(f,l, s) {
s = $f
for(i=f+1;i<=l;i++)
s = sprintf("%s %s",s,$i)
return s
}
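A possible way to exercise it, reusing the awk.txt sample shown earlier:
$ awk 'function subby(f,l, s){s=$f; for(i=f+1;i<=l;i++) s=sprintf("%s %s",s,$i); return s} {print subby(3,6)}' awk.txt
3 4 5 6
c d e f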
(I know OP requested "in AWK" but ... )
Using bash expansion on the command line to generate the argument list:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(c="" ;for i in {3..7}; do c=$c\$$i, ; done ; c=${c%%,} ; echo $c ;)}" awk.txt
3 4 5 6 7
c d e f g
Explanation:
c="" # var to hold args list
for i in {3..7} # the required variable range 3 - 7
do
# replace c's value with concatenation of existing value, literal $, i value and a comma
c=$c\$$i,
done
c=${c%%,} # remove trailing/final comma
echo $c #return the list string
Placed on a single line using semicolons, inside $() so it evaluates/expands in place.