I have a file with a binary sequence [010101...], and I would like to find the most frequently occurring sequence of 5 bits in the file.
Example of my file:
00010111000100100100100100100101110101010100011001010111011100010
I've started by generating all the possible sequences, meaning that if I take the first 7 bits, I get the following sequences:
00010 00101 01011
Now I'm looking for a way to count the occurrences of each sequence in the file.
Any help, please?
If you're using perl, you could go for something like this:
use strict;
use warnings;
my $str = '00010111000100100100100100100101110101010100011001010111011100010';
# create list of all substrings of length 5
my @sequences = map { substr $str, $_, 5 } 0..length($str) - 5;
# build hash of counts for each substring
my %counts;
++$counts{$_} for @sequences;
# take key corresponding to the maximum value in counts
my ($max) = sort { $counts{$b} <=> $counts{$a} } keys %counts;
print "$max\n";
Output:
10010
With awk and UNIX utils:
count.awk
{
for(i=1;i<=length($0)-4;i++) {    # awk's substr() is 1-indexed
a[substr($0,i,5)]++
}
}
END{for(i in a){print i, a[i]}}
Call it:
awk -f count.awk input.file | sort -k2,2n
This gives you a list of all 5-bit sequences sorted by count. If you want just the most frequent one, use:
awk -f count.awk input.file | sort -k2,2n | tail -n1 | cut -d' ' -f1
Btw, you can also use a single awk script but imo the combination of the above tools gives you more flexibility.
Just for completeness:
count.awk:
{
for(i=1;i<=length($0)-4;i++){
a[substr($0,i,5)]++
}
}
END {
for(i in a) {
if(!m || a[i]>=a[m]) {    # test !m first so a[""] is never created
m=i
}
}
print m
}
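Call it the same way; it prints only the most frequent sequence (for the sample input this should match the Perl answer's output):
awk -f count.awk input.file
Output:
10010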
I have input in the following format:
#1655636921
cd
#1655636926
history
#1655637510
history
#1655637934
ls
#1655637934
ls
#1655638524
cd
#1655638927
ls
#1655638928
history
and I would like to search for duplicates (in lines that do not start with '#', or equivalently only in even lines), delete all earlier duplicates (keeping only the latest one), and for each deleted duplicate also delete the preceding line, so the output would look like this:
#1655638524
cd
#1655638927
ls
#1655638928
history
I am new to awk and I couldn't find any solution, even one preserving the latest duplicates; the only related solution that I have found is:
awk '!visited[$0]++'
which deletes the later duplicates, preserving the oldest one.
Thank you very much in advance for any kind of help.
$ tac file | awk '!/^#/{f = !seen[$0]++} f' | tac
#1655638524
cd
#1655638927
ls
#1655638928
history
This works by reversing the file so that !seen[$0]++ is true only at the latest occurrence of each command; the flag f set on a command line stays set for the timestamp line that follows it in the reversed stream, so the pair is printed (or skipped) together. If you don't have the tac command on your system you can create a tac function to do the same thing the command does, i.e. reverse the order of input lines, using just the mandatory POSIX tools awk, sort, and cut:
tac() { awk -v OFS='\t' '{print NR, $0}' "${@:--}" | sort -k1,1rn | cut -f2-; }
or if you have nl (POSIX but not mandatory) or your cat has a -n argument (non-POSIX):
tac() { nl -ba "${@:--}" | sort -k1,1rn | cut -f2-; }
tac() { cat -n "${@:--}" | sort -k1,1rn | cut -f2-; }
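A quick sanity check for any of these definitions:
$ printf '1\n2\n3\n' | tac
3
2
1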
Somehow there was a strange duplicate lingering and it had to be trimmed out by brute force:
{m,g}awk '
BEGIN {
     RS = "(\r?\n)?[#]"
     FS = (_="[ \t]*")"\n+"(_)
    OFS = _=""
    ___ = "\21#"
} {
    ____[+__[$NF]]++
      __[$NF] = NR ___ $+_
} END {
     FS = "[0-9]+\21"
    OFS = ORS
      _ = ""
    $+_ = _
    delete ____[_]
    delete ____[+_]
    for(_ in __) { if(!(+(___=__[_]) in ____)) {
        $+___=___
        sub("^[^\021]+\21[#]?","#",$+___)
    } }
    sub("^.+\n\n", ""); print }'
Output:
#1655638524
cd
#1655638927
ls
#1655638928
history
Assumptions:
OP mentions processing by 'line' so this means ls and ls *.txt are to be treated as two distinct commands (ie, both will show up in the final output)
OP mentions detecting duplicates only in 'even lines' which implies we do not need to worry about nested linefeeds (in either the #comment or the command), nor multi-line #comments
One awk idea that eliminates the need for any other programs:
awk '
/^#/ { comment=$0; next }
{ comments[$0]=comment # associate previous line/comment with current command
delete lineno2cmd[cmd2lineno[$0]] # delete previous line number associated with this command
lineno2cmd[FNR]=$0 # associate the current line number with this command; this array used to generate output in line number order (ie, maintain ordering of lines)
cmd2lineno[$0]=FNR # maintain reverse link from command to line number; this array used solely to make sure only one entry in lineno2cmd[] is associated with the current command
}
END { for (i=1;i<=FNR;i++) # loop through list of line numbers and ...
if (i in lineno2cmd) { # if line number is an index in the lineno2cmd[] array then ...
printf "%s\n%s\n", comments[lineno2cmd[i]], lineno2cmd[i]
}
}
' history.dat
If OP has access to GNU awk (v 4.0+) (for PROCINFO["sorted_in"] support) we can streamline this a bit:
awk '
/^#/ { comment=$0; next }
{ comments[$0]=comment
cmd2lineno[$0]=FNR
}
END { PROCINFO["sorted_in"]="#val_num_asc" # sort array by the numerical values (ascending)
for (i in cmd2lineno) {
printf "%s\n%s\n", comments[i], i
}
}
' history.dat
These both generate:
#1655638524
cd
#1655638927
ls
#1655638928
history
Let's say I have this line:
foo|bar|foobar
I want to split it at | and then use those 3 new lines as the input for further processing (let's say, replacing bar with xxx).
Sure, I can pipe two awk instances, like this:
echo "foo|bar|foobar" | awk '{gsub(/\|/, "\n"); print}' | awk '/bar/ {gsub(/bar/, "xxx"); print}'
But how can I achieve this in one script? First do one operation on some input, and then treat the result as the new input for a second operation?
I tried something like this:
echo "foo|bar|foobar" | awk -v c=0 '{
{
gsub(/\|/, "\n");
sprintf("%s", $0);
}
{
if ($0 ~ /bar/) {
c+=1;
gsub(/bar/, "xxx");
print c;
print
}
}
}'
Which results in this:
1
foo
xxx
fooxxx
And thanks to the counter c, it's absolutely obvious that the subsequent if doesn't treat the multi-line input it receives as several new records but instead as just one multi-line record.
Thus, my question is: how to tell awk to treat this new multi-line record it receives as many single-line records?
The desired output in this very example should be something like this if I'm correct:
1
xxx
2
fooxxx
But this is just an example, the question is more about the mechanics of such a transition.
I would suggest an alternative approach using split(), where you split the line on the delimiter into an array and iterate over its elements, instead of working on a single multi-line string.
echo "foo|bar|foobar" |\
awk '{
count = 0
n = split($0, arr, "|")
for ( i = 1; i <= n; i++ )
{
if ( arr[i] ~ /bar/ )
{
count += sub(/bar/, "xxx", arr[i])
print count
print arr[i]
}
}
}'
Also, you don't need an explicit increment of the count variable: sub() returns the number of substitutions made on the source string, so you can just add that to the existing value of count.
As one more level of optimization, you can get rid of the ~ match in the if condition and directly use the sub() function there:
if ( sub(/bar/, "xxx", arr[i]) )
{
count++
print count
print arr[i]
}
If you set the record separator (RS) to the pipe character, you almost get the desired effect, e.g.:
echo 'foo|bar|foobar' | awk -v RS='|' 1
Output:
foo
bar
foobar
(an empty line)
Except that a newline character becomes part of the last field, so there is an extra line at the end of the output. You can work around this either by including a newline in the RS variable, making it less portable, or by avoiding sending newlines to awk in the first place.
For example, using the less portable way:
echo 'foo|bar|foobar' | awk -v RS='\\||\n' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz
Note that the empty record at the end is ignored.
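A sketch of the other, more portable workaround: keep RS='|' but don't send a trailing newline to awk at all, e.g. via printf:
printf '%s' 'foo|bar|foobar' | awk -v RS='|' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz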
With GNU awk:
$ awk -v RS='[|\n]' 'gsub(/bar/,"xxx"){print ++c ORS $0}' file
1
xxx
2
fooxxx
With any awk:
$ awk -F'|' '{c=0; for (i=1;i<=NF;i++) if ( gsub(/bar/,"xxx",$i) ) print ++c ORS $i }' file
1
xxx
2
fooxxx
I have a file test.txt with multiple lines sharing the same pattern:
a:1;qty=2;px=3;d=4;
a:5;qty=6;px=7;d=8;
a:9;qty=10;px=11;d=12;
And I would like to write a simple terminal linux cmd using sed/awk to calculate (2*3+6*7+10*11)/(2+6+10), which is sum(qty*px)/sum(qty).
May I ask how to retrieve the values of qty and px in each line, and then use awk to accumulate the values and do the final calculation?
Thanks,
One way if no empty lines:
awk -F"[=;]" '{x+=$3;y+=$3*$5}END{print y/x}' file
If empty lines present,
awk -F"[=;]" '!/^$/{x+=$3;y+=$3*$5}END{print y/x}' file
If that's the most general pattern, then the following one-liner should suffice:
sed 's/[a-zA-Z]*[:=]//g' test.txt | awk -F';' '{ s1 += $2*$3; s2 += $2 } END { print s1/s2 }'
In case the keys are not always in the same order, you can do
awk -F "[=: ]*" '{ for( i=2; i<=NF;i+=2) a[$i]=$(i+1) }
{ num += a["px"]*a["qty"]; den+=a["qty"]}
END { print num/den }' file
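As a quick check that the key order really doesn't matter, here is a hypothetical input with px and qty swapped on the first line; the result, (2*3+6*7+10*11)/(2+6+10), is unchanged:
$ printf 'a:1;px=3;qty=2;d=4;\na:5;qty=6;px=7;d=8;\na:9;qty=10;px=11;d=12;\n' | awk -F "[=:;]+" '{ for (i=1; i<NF; i+=2) a[$i]=$(i+1) } { num += a["px"]*a["qty"]; den += a["qty"] } END { print num/den }'
8.77778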
I'm putting together a script that will count the occurrences of words in text documents.
{
$0 = tolower($0)
for ( i = 1; i <= NF; i++ )
freq[$i]++
}
BEGIN { printf "%-20s %-6s\n", "Word", "Count"}
END {
sort = "sort -k 2nr"
for (word in freq)
printf "%-20s %-6s\n", word, freq[word] | sort
close(sort)
}
It works fine so far, but I'd like to make a couple of tweaks/additions:
I'm having a hard time displaying the array index number; I tried freq[$i], which just spat 0's back at me
Is there any way to eliminate the whitespace (spaces) from the word count?
You do not need to code your own loop to scan the fields; just set RS to make each word its own record. E.g. RS="[^A-Za-z]+" will treat every run of characters other than uppercase and lowercase letters as the record separator.
$ echo 'Hello world! I am happy123...' | awk 'BEGIN{RS="[^A-Za-z]+"}$0'
Hello
world
I
am
happy
The lone $0 pattern selects only nonempty records.
Maybe you want to allow digits in words; just adapt RS to your needs.
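For instance, to also treat digits as word characters:
$ echo 'Hello world! I am happy123...' | awk 'BEGIN{RS="[^A-Za-z0-9]+"}$0'
Hello
world
I
am
happy123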
So what's left?
Transform to lowercase, count, print sorted results.
File wfreq.awk:
BEGIN { RS="[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END{
printf "%-20s %6s\n", "Word", "Count"
sort = "sort -k 2nr"
for(word in counts)
printf "%-20s %6s\n",word,counts[word] | sort
close(sort)
}
Example run (only the top 10 lines of output, to avoid spamming the answer):
$ awk -f wfreq.awk /etc/motd | head
Word Count
the 5
debian 3
linux 3
are 2
bpo 2
gnu 2
in 2
with 2
absolutely 1
But now for something not really completely different...
To sort by a different field, just adapt the sort = "sort ..." options.
I don't use asort() because not every awk has this extension.
File wfreq2.awk:
BEGIN { RS="[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END{
printf "%-20s %6s\n", "Word", "Count"
sort = "sort -k 1"
for(word in counts)
printf "%-20s %6s\n",word,counts[word] | sort
close(sort)
}
Example run (only the top 10 lines of output, to avoid spamming the answer):
$ awk -f wfreq2.awk /etc/motd | head
Word Count
absolutely 1
amd 1
applicable 1
are 2
bpo 2
by 1
comes 1
copyright 1
darkstar 1
How to evaluate arithmetic expression passed as argument in awk?
I have this in a file.
1*2*3
1+2*3
awk should output 6 and 7 when this file is passed in.
awk(1) is the wrong tool as it doesn't have an eval() function. I suggest modifying the file into input for bc(1) or using shell arithmetic expansion:
while IFS= read -r expr; do
echo "$((expr))"
done < file
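For these particular expressions the file is already valid bc(1) input, so the bc route needs no modification at all:
$ bc < file
6
7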
awk doesn't have an eval() function.
Use bc or shell arithmetic expansion; both can do it.
But if you use this in Hadoop scripts, consider the cost of spawning subprocesses.
Besides, you can try these ways:
Consider writing an expression evaluator in AWK (on archive.org, search for calc3)
Use eval
Use Python's eval function
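A minimal sketch of the Python route (note that eval() executes arbitrary expressions, so only feed it trusted input):
$ python3 -c 'import sys
for line in sys.stdin:
    print(eval(line))' < file
6
7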
I know this is awful but we can:
awk '{system(sprintf("awk \"BEGIN {print " $0 "}\""))}'
As pointed out, bc, Python's eval, and bash's $(( )) are better solutions.
One last suggestion, Perl:
perl -nE 'say eval'
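Run against the sample file, it evaluates each input line as a Perl expression and prints the result:
$ perl -nE 'say eval' file
6
7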
Here's another hack inspired in part by @JJoao's answer, and feedback from @DracoMetallium of Twitter...
#!/usr/bin/env bash
calc() {
awk 'BEGIN { print '"${@//inf/(2 ** 1024)}"'; }'
}
calc '1/2'
#> 0.5
... this also handles instances of inf being passed, thanks to Bash's built-in search-and-replace, e.g....
calc 'inf + inf'
#> inf
calc '-inf + -inf'
#> -inf
calc '-inf + inf'
#> -nan
This may be useful within one's .bashrc file for quick command-line calculations.
And for completeness, here's an example of how to perform the above in (mostly) pure Awk...
calc.awk
#!/usr/bin/awk -f
# replace "inf" with a value big enough to overflow to IEEE infinity,
# then shell out to a fresh awk to evaluate the expression; the inner
# printf writes the value, and the outer print (of calc's empty return
# value) supplies the trailing newline
function calc(expression) {
    gsub("inf", "(2 ** 1024)", expression)
    system(sprintf("awk \"BEGIN {printf(" expression ")}\""))
}

{
    print calc($0)
}
... as well as examples of usage...
calc.awk <<<'1 /2'
#> 0.5
printf '2*2\nsqrt(9)\n' | calc.awk
#> 4
#> 3
calc.awk <<'EOF'
22 / 7
(1 + sqrt(5)) / 2
EOF
#> 3.14286
#> 1.61803
tee -a 'input-file.txt' 1>'/dev/null' <<'EOF'
1*2*3
1+2*3
EOF
calc.awk input-file.txt
#> 6
#> 7
awk code self-eval:
echo '1*2*3
1+2*3' |
mawk '
function eval(_,__,___) {
return substr("",
(___=RS)*(RS="^$")*((__="mawk \47BEGIN { "\
"printf(\"%.f\","(_)") }\47")|getline _),
close(__)^(RS=___)*__)_
}
$++NF = eval($!_)'
1*2*3 6
1+2*3 7
And non-GMP-enabled variants of awk can handle bigints by delegating to a GMP-enabled gawk (gawk -M):
echo '9^357' | mawk2 '
function eval(__,_,___) {
return substr("",(___=RS) * (RS="^$") * ((_="gawk -Mbe"\
" \47BEGIN { printf("(__)") }\47")|getline __), close(_)^(RS=___)*_)__
} $++NF = eval($!_)'
9^357 46192968246584020379055552051071189505164865440669900464
39030285864012137741835863345354556100224446056419891013
64348709024164571890111337972631022968123699490725498380
48619487796915547325757427881925121757649463471671577403
93732287476951829673979533419257547784348206387576562750
0451665854873600139914343339972692154903156749530623670508969
As an example, consider what iftop gives you:
Host name last 2s last 10s last 40s cumulative
1 10.150.1.1 => 650B 533B 533B 2.08KB
85.239.108.20 <= 16.0KB 12.9KB 12.9KB 51.5KB
Let's say you need to merge the two up/down lines into one line and convert the KB/B values into plain byte values (*1024). You could use this:
iftop -i eth1 -ts 10 -Bn|egrep "<|>"| sed 's/^ //g;s/^[1-9]/x/g;s/KB/ 1024/g;s/B/ 1/g' | tr -d '\n'|tr "x" '\n'| grep .| awk '{print $1" "$11" - "$9*$10+$19*$20" "$9*$10" "$19*$20 }'
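The same pipeline broken apart for readability (functionally equivalent, assuming the iftop output format shown above):
iftop -i eth1 -ts 10 -Bn |
  egrep "<|>" |                  # keep only the up/down traffic lines
  sed 's/^ //g;s/^[1-9]/x/g;s/KB/ 1024/g;s/B/ 1/g' |  # mark each pair's first line with x, turn unit suffixes into multipliers
  tr -d '\n' | tr "x" '\n' |     # join each host's two lines into one
  grep . |                       # drop empty lines
  awk '{print $1" "$11" - "$9*$10+$19*$20" "$9*$10" "$19*$20}'  # up*mult + down*mult, plus the individual totals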