I'm putting together a script that will count the occurrences of words in text documents.
BEGIN { printf "%-20s %-6s\n", "Word", "Count" }
{
    $0 = tolower($0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%-20s %-6s\n", word, freq[word] | sort
    close(sort)
}
It works fine so far, but I'd like to make a couple of tweaks/additions:
I'm having a hard time displaying the array index number; I tried freq[$i], which just spat 0's back at me.
Is there any way to eliminate the whitespace (spaces) from the word count?
You do not need to write your own loop to scan the fields; just set RS to make each word its own record. For example, RS="[^A-Za-z]" will treat every string not built entirely from uppercase and lowercase letters as the record separator.
$ echo 'Hello world! I am happy123...' | awk 'BEGIN{RS="[^A-Za-z]+"}$0'
Hello
world
I
am
happy
The single $0 pattern matches nonempty records (and so skips any empty ones).
Maybe you want to allow digits in words; just adapt RS to your needs.
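For instance, to also allow ASCII digits inside words, only the character class changes:
$ echo 'Hello world! I am happy123...' | awk 'BEGIN{RS="[^A-Za-z0-9]+"}$0'
Hello
world
I
am
happy123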
So what's left?
Transform to lowercase, count, print sorted results.
File wfreq.awk:
BEGIN { RS="[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END{
printf "%-20s %6s\n", "Word", "Count"
sort = "sort -k 2nr"
for(word in counts)
printf "%-20s %6s\n",word,counts[word] | sort
close(sort)
}
Example run (only the top 10 lines of output, to avoid spamming the answer):
$ awk -f wfreq.awk /etc/motd | head
Word Count
the 5
debian 3
linux 3
are 2
bpo 2
gnu 2
in 2
with 2
absolutely 1
But now for something not really completely different...
To sort by a different field, just adapt the sort = "sort ..." options.
I don't use asort() because not every awk has this extension.
File wfreq2.awk:
BEGIN { RS="[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END{
printf "%-20s %6s\n", "Word", "Count"
sort = "sort -k 1"
for(word in counts)
printf "%-20s %6s\n",word,counts[word] | sort
close(sort)
}
Example run (only the top 10 lines of output, to avoid spamming the answer):
$ awk -f wfreq2.awk /etc/motd | head
Word Count
absolutely 1
amd 1
applicable 1
are 2
bpo 2
by 1
comes 1
copyright 1
darkstar 1
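That said, if you do have GNU awk 4.0+, here is a sketch of wfreq.awk that drops the external sort entirely by using the gawk-only PROCINFO["sorted_in"] control (this assumes a gawk dependency is acceptable):
BEGIN { RS = "[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END {
    printf "%-20s %6s\n", "Word", "Count"
    PROCINFO["sorted_in"] = "@val_num_desc"   # iterate by value, highest count first
    for (word in counts)
        printf "%-20s %6d\n", word, counts[word]
}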
I have input in the following format:
#1655636921
cd
#1655636926
history
#1655637510
history
#1655637934
ls
#1655637934
ls
#1655638524
cd
#1655638927
ls
#1655638928
history
and I would like to search for duplicates (in lines that do not start with '#', or equivalently, detect duplicates only in even lines), delete all earlier duplicates (keeping only the latest one), and for each deleted duplicate also delete the one line preceding it, so the output would look like this:
#1655638524
cd
#1655638927
ls
#1655638928
history
I am new to awk and couldn't find any solution that preserves the latest duplicates. The only related solution I have found is:
awk '!visited[$0]++'
which deletes the later duplicates, preserving only the oldest one.
Thank you very much in advance for any kind of help.
$ tac file | awk '!/^#/{f = !seen[$0]++} f' | tac
#1655638524
cd
#1655638927
ls
#1655638928
history
This works because the first tac puts the most recent occurrence of each command first: !seen[$0]++ is then true only for that first (i.e. most recent) occurrence, f latches that decision so the #timestamp line following the command is kept or dropped along with it, and the final tac restores the original order.
If you don't have the tac command on your system, you can create a tac function to do the same thing the command does, i.e. reverse the order of input lines, using just the mandatory POSIX tools awk, sort, and cut:
tac() { awk -v OFS='\t' '{print NR, $0}' "${@:--}" | sort -k1,1rn | cut -f2-; }
or if your cat has a -n argument (non-POSIX) or you have nl (POSIX but not mandatory):
tac() { nl "${@:--}" | sort -k1,1rn | cut -f2-; }
tac() { cat -n "${@:--}" | sort -k1,1rn | cut -f2-; }
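A quick sanity check of any of these definitions:
$ printf '%s\n' a b c | tac
c
b
a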
Somehow a strange duplicate was lingering and had to be trimmed out by brute force:
{m,g}awk '
BEGIN {
    RS  = "(\r?\n)?[#]"
    FS  = (_ = "[ \t]*") "\n+" (_)
    OFS = _ = ""
    ___ = "\21#"
} {
    ____[+__[$NF]]++
    __[$NF] = NR ___ $+_
} END {
    FS  = "[0-9]+\21"
    OFS = ORS
    _   = ""
    $+_ = _
    delete ____[_]
    delete ____[+_]
    for (_ in __) {
        if (!(+(___ = __[_]) in ____)) {
            $+___ = ___
            sub("^[^\021]+\21[#]?", "#", $+___)
        }
    }
    sub("^.+\n\n", ""); print }'
Output:
#1655638524
cd
#1655638927
ls
#1655638928
history
Assumptions:
OP mentions processing by 'line', so ls and ls *.txt are to be treated as two distinct commands (i.e., both will show up in the final output)
OP mentions detecting duplicates only in 'even lines', which implies we do not need to worry about embedded linefeeds (in either the #comment or the command) nor multi-line #comments
One awk idea that eliminates the need for any other programs:
awk '
/^#/ { comment=$0; next }
{ comments[$0]=comment # associate previous line/comment with current command
delete lineno2cmd[cmd2lineno[$0]] # delete previous line number associated with this command
lineno2cmd[FNR]=$0 # associate the current line number with this command; this array used to generate output in line number order (ie, maintain ordering of lines)
cmd2lineno[$0]=FNR # maintain reverse link from command to line number; this array used solely to make sure only one entry in lineno2cmd[] is associated with the current command
}
END { for (i=1;i<=FNR;i++) # loop through list of line numbers and ...
if (i in lineno2cmd) { # if line number is an index in the lineno2cmd[] array then ...
printf "%s\n%s\n", comments[lineno2cmd[i]], lineno2cmd[i]
}
}
' history.dat
If OP has access to GNU awk (v 4.0+) (for PROCINFO["sorted_in"] support) we can streamline this a bit:
awk '
/^#/ { comment=$0; next }
{ comments[$0]=comment
cmd2lineno[$0]=FNR
}
END { PROCINFO["sorted_in"]="@val_num_asc" # sort array by the numerical values (ascending)
for (i in cmd2lineno) {
printf "%s\n%s\n", comments[i], i
}
}
' history.dat
These both generate:
#1655638524
cd
#1655638927
ls
#1655638928
history
I am trying to read a CSV text file and find the average of weekly hours (columns 3 through 7) spent by all user IDs (column 2) ending with an even number (2, 4, 6, ...).
The input sample is as below:
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer6,User7,4,7,8,2,5
Computer7,User6,8,8,8,0,0
Computer8,User9,5,2,0,6,8
Computer9,User8,2,5,7,3,6
Computer10,User10,8,9,9,9,10
I have written the following script:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' user-list.txt > superuser.txt
The output of this script is:
User4 4
User2 5.4
User6 4.8
User8 4.6
User10 9
However, I want to change the script to print only one average across all user IDs ending with an even number.
The desired output for this would be as below (which is technically the average of all hours for the IDs ending with even numbers):
5.56
Any help would be appreciated.
TIA
Trying to fix the OP's attempt here, adding logic to compute the average of averages once the whole file has been read. Written on mobile, so I couldn't test it, but it should work if I have understood the OP's description correctly.
awk -F, '
$2~/[24680]$/{
count++
for(i=3;i<=7;i++){
sum+=$i
}
tot+=sum/5
sum=0
}
END{
print "Average of averages is: " (count?tot/count:"NaN")
}
' user-list.txt > superuser.txt
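As a sanity check: with the sample input, the even user IDs average 4, 5.4, 4.8, 4.6 and 9 hours per day respectively, so superuser.txt should end up containing:
Average of averages is: 5.56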
You may try:
awk -F, '$2 ~ /[02468]$/ {
for(i=3; i<=7; i++) {
s += $i
++n
}
}
END {
if (n)
printf "%.2f\n", s/n
}' cust.csv
5.56
awk -F, 'NR == 1 { next } { match($2,/[[:digit:]]+/); num=substr($2,RSTART,RLENGTH); if (num%2==0) { av+=($3+$4+$5+$6+$7)/5; cnt++ } } END { printf "%.2f\n", av/cnt }' user-list.txt
Ignore the header line. Pick the number out of the user ID with awk's match function and set the num variable to it. Check whether the number is even with num%2. If it is, add that user's average to av and count the user in cnt. At the end, divide by the number of even users and print the result to 2 decimal places.
Print the daily average for all even-numbered user IDs:
#!/bin/sh
awk -F , '
(NR>1) &&
($2 ~ /[02468]$/) {
hours += ($3 + $4 + $5 + $6 + $7)
(users++)
}
END {
print (hours/users/5)
}' \
"$1"
Usage example:
$ script user-list
5.56
One way to test evenness or oddness of an integer is to use the modulus operator (%), as in N % 2. For even values of N, this expression evaluates to zero, and for odd values it evaluates to 1.
However, in this case a string operation would be required to extract the number anyway, so we may as well just use a single string match to decide odd or even.
Also, IMO, for 5 fields which are not going to change (days of the week), it's more succinct to just add them directly instead of using a loop. (NR>1) skips the title line too, in case there's a conflict.
Finally, you can of course swap /[02468]$/ for /[13579]$/ to get the same data for odd-numbered users.
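For illustration, a quick sketch of both evenness tests on a sample ID (User4 is just an example value):
$ echo 'User4' | awk '{ n = $0; sub(/^[^0-9]*/, "", n); print ((n % 2 == 0) ? "even" : "odd") }'
even
$ echo 'User4' | awk '/[02468]$/ { print "even" }'
even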
Let's say I have this line:
foo|bar|foobar
I want to split it at | and then use those 3 new lines as the input for further processing (let's say, replace bar with xxx).
Sure, I can pipe two awk instances, like this:
echo "foo|bar|foobar" | awk '{gsub(/\|/, "\n"); print}' | awk '/bar/ {gsub(/bar/, "xxx"); print}'
But how can I achieve this in one script? First do one operation on some input, and then treat the result as the new input for a second operation?
I tried something like this:
echo "foo|bar|foobar" | awk -v c=0 '{
{
gsub(/\|/, "\n");
sprintf("%s", $0);
}
{
if ($0 ~ /bar/) {
c+=1;
gsub(/bar/, "xxx");
print c;
print
}
}
}'
Which results in this:
1
foo
xxx
fooxxx
And thanks to the counter c, it's absolutely obvious that the subsequent if doesn't treat the multi-line input it receives as several new records, but instead as just one multi-line record.
Thus, my question is: how to tell awk to treat this new multi-line record it receives as many single-line records?
The desired output in this very example should be something like this if I'm correct:
1
xxx
2
fooxxx
But this is just an example, the question is more about the mechanics of such a transition.
I would suggest an alternative approach using split(), where you split the elements on the delimiter into an array and iterate over its entries, instead of working on a single multi-line string.
echo "foo|bar|foobar" |\
awk '{
count = 0
n = split($0, arr, "|")
for ( i = 1; i <= n; i++ )
{
if ( arr[i] ~ /bar/ )
{
count += sub(/bar/, "xxx", arr[i])
print count
print arr[i]
}
}
}'
Also, you don't need an explicit increment of the count variable: sub() returns the number of substitutions made (0 or 1 here), so you can just add its return value to count.
As one more level of optimization, you can get rid of the ~ match in the if condition and use the sub() function there directly:
if ( sub(/bar/, "xxx", arr[i]) )
{
count++
print count
print arr[i]
}
If you set the record separator (RS) to the pipe character, you almost get the desired effect, e.g.:
echo 'foo|bar|foobar' | awk -v RS='|' 1
Output:
foo
bar
foobar
(empty line)
Except that a newline character becomes part of the last record, so there is an extra line at the end of the output. You can work around this either by including a newline in the RS variable, making it less portable, or by avoiding sending newlines to awk at all.
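The second option stays within POSIX; for example, a sketch using printf to drop the trailing newline:
$ printf '%s' 'foo|bar|foobar' | awk -v RS='|' '{ sub(/bar/, "baz") } 1'
foo
baz
foobaz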
For example using the less portable way:
echo 'foo|bar|foobar' | awk -v RS='\\||\n' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz
Note that the empty record at the end is ignored.
With GNU awk:
$ awk -v RS='[|\n]' 'gsub(/bar/,"xxx"){print ++c ORS $0}' file
1
xxx
2
fooxxx
With any awk:
$ awk -F'|' '{c=0; for (i=1;i<=NF;i++) if ( gsub(/bar/,"xxx",$i) ) print ++c ORS $i }' file
1
xxx
2
fooxxx
I have a file with a binary sequence [010101...], and I would like to find the most frequent sequence of 5 bits in the file.
Example of my file:
00010111000100100100100100100101110101010100011001010111011100010
I've started by generating all the possible sequences; for example, if I take the first 7 bits, I get the following sequences:
00010 00101 01011
Now I'm looking for a way to count the occurrences of each sequence in the file.
Any help, please?
If you're using perl, you could go for something like this:
use strict;
use warnings;
my $str = '00010111000100100100100100100101110101010100011001010111011100010';
# create list of all substrings of length 5
my @sequences = map { substr $str, $_, 5 } 0..length($str) - 5;
# build hash of counts for each substring
my %counts;
++$counts{$_} for @sequences;
# take key corresponding to the maximum value in counts
my ($max) = sort { $counts{$b} <=> $counts{$a} } keys %counts;
print "$max\n";
Output:
10010
With awk and UNIX utils:
count.awk
{
  # substr() is 1-based in awk, so scan positions 1 .. length-4
  for(i=1; i<=length($0)-4; i++) {
    a[substr($0,i,5)]++
  }
}
END{for(i in a){print i, a[i]}}
Call it:
awk -f count.awk input.file | sort -k2,2n
This gives you a list of all 5-bit sequences, sorted numerically by count. If you want just the most frequent one, use:
awk -f count.awk input.file | sort -k2,2n | tail -n1 | cut -d' ' -f1
Btw, you can also use a single awk script but imo the combination of the above tools gives you more flexibility.
Just for completeness:
count.awk:
{
  # same counting as above, with 1-based substr() indexing
  for(i=1; i<=length($0)-4; i++){
    a[substr($0,i,5)]++
  }
}
END {
  for(i in a) {
    # track the sequence with the highest count seen so far
    if(m == "" || a[i] > a[m]) {
      m=i
    }
  }
  print m
}
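With the indexing fixed, running it on the sample string from the question agrees with the perl answer:
$ echo '00010111000100100100100100100101110101010100011001010111011100010' | awk -f count.awk
10010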
I would like to print, grouped by the 2nd column: the count of line items, the sum of the 3rd column, and the number of unique values of the first column. I have around 100 InputTest files and they are not sorted.
I am using the 3 commands below to achieve the desired output; I would like to know the simplest way to do it.
InputTest*.txt
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
ghi,zz,10,sss
Step#1:
cat InputTest*.txt | awk -F, '{key=$2;++a[key];b[key]=b[key]+$3} END {for(i in a) print i","a[i]","b[i]}'
Op#1
xx,4,40
yy,4,60
zz,1,10
Step#2
awk -F ',' '{print $1,$2}' InputTest*.txt | sort | uniq >Op_UniqTest2.txt
Op#2
abc xx
abc yy
def xx
def yy
ghi zz
Step#3
awk '{print $2}' Op_UniqTest2.txt | sort | uniq -c
Op#3
2 xx
2 yy
1 zz
Desired Output:
xx,4,40,2
yy,4,60,2
zz,1,10,1
Looking for suggestions !!!
BEGIN { FS = OFS = "," }
{ ++lines[$2]; if (!seen[$2,$1]++) ++diff[$2]; count[$2]+=$3 }
END { for(i in lines) print i, lines[i], count[i], diff[i] }
lines tracks the number of occurrences of each value in column 2
seen records unique combinations of the second and first columns, incrementing diff[$2] whenever a new combination is found. The ++ after seen[$2,$1] means the condition is true only the first time a combination is encountered, as the value of seen[$2,$1] is then incremented to 1 and !seen[$2,$1] becomes false.
count keeps a total of the third column
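The !seen[...]++ idiom is worth a standalone demonstration, since it carries the whole uniqueness logic:
$ printf '%s\n' a a b a | awk '!seen[$0]++'
a
b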
$ awk -f avn.awk file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Using awk:
$ awk '
BEGIN { FS = OFS = "," }
{ keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ }
END { for(key in keys) print key, keys[key], sum[key], count[key] }
' file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Set the input and output field separators to , in the BEGIN block. The keys array identifies and counts the column-2 keys, the sum array keeps the running total of column 3 for each key, and count tracks the number of unique column-1 values seen for each column-2 value.