How can I count the frequency of letters - awk

I have data like this:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
I want to count how many of each letter there is, so if I had just one sequence I would count like this:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
cat input.txt | grep -v ">" | fold -w1 | sort | uniq -c
6 A
9 C
10 D
1 E
7 F
18 G
5 H
4 I
7 K
21 L
15 N
7 P
6 Q
11 R
16 S
18 T
7 V
8 W
7 Y
However, I want to calculate this for all sequences in a better and more efficient way, especially when the data is huge.

Counting characters in strings can easily be done with awk. To do this, you make use of the function gsub:
gsub(ere, repl[, in])
Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument when specified.
sub(ere, repl[, in ])
Substitute the string repl in place of the first instance of the extended regular expression ERE in string in and return the number of substitutions. <snip> If in is omitted, awk shall use the current record ($0) in its place.
Source: the POSIX awk specification.
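As a quick illustration of the return value (a minimal sketch; "banana" is just a stand-in string):
$ echo "banana" | awk '{ n = gsub(/a/, ""); print n, $0 }'
3 bnn
gsub strips all three a's, returns the number of substitutions made, and leaves the stripped string in $0.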
The following two functions perform the counting in this way:
function countCharacters(str) {
    while (str != "") { c = substr(str,1,1); a[toupper(c)] += gsub(c, "", str) }
}
Or, if there might be many runs of identical consecutive characters, the following variant might shave off a couple of seconds:
function countCharacters2(str) {
    n = length(str)
    while (str != "") {
        c = substr(str,1,1); gsub(c"+", "", str)
        m = length(str); a[toupper(c)] += n-m; n = m
    }
}
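As a minimal sketch of how the first function could be dropped into a complete program (input.txt stands in for your FASTA file):
awk '
function countCharacters(str,   c) {   # c is a local scratch variable
    while (str != "") { c = substr(str,1,1); a[toupper(c)] += gsub(c, "", str) }
}
!/^>/ { countCharacters($0) }          # count only sequence lines, skip headers
END   { for (c in a) print c, a[c] }
' input.txt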
Below you find four implementations based on the first function. The first two run on a standard awk, the latter two on bioawk, a version optimized for FASTA files:
1. Read the sequences and process them line by line:
awk '!/^>/{s=$0; while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) } }
END {for(c in a) print c,a[c]}' file
2. Concatenate all sequences and process them at the end:
awk '!/^>/{s=s $0 }
END {while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
for(c in a) print c,a[c]}' file
3. Same as 1, but using bioawk:
bioawk -c fastx '{while ($seq!=""){ c=substr($seq,1,1);a[c]+=gsub(c,"",$seq) } }
END{ for(c in a) print c,a[c] }' file
4. Same as 2, but using bioawk:
bioawk -c fastx '{s=s $seq}
END { while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
for(c in a) print c,a[c]}' file
Here are some timing results based on this FASTA file:
OP : grep,sort,uniq : 47.548 s
EdMorton 1 : awk : 39.992 s
EdMorton 2 : awk,sort,uniq : 53.965 s
kvantour 1 : awk : 18.661 s
kvantour 2 : awk : 9.309 s
kvantour 3 : bioawk : 1.838 s
kvantour 4 : bioawk : 1.838 s
karafka : awk : 38.139 s
stack0114106 1: perl : 22.754 s
stack0114106 2: perl : 13.648 s
stack0114106 3: perl (zdim) : 7.759 s
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is POSIX compatible.
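For context, bioawk's -c fastx mode parses FASTA/FASTQ records and exposes named fields such as $name and $seq; for instance, a small sketch printing each record's id and sequence length:
$ bioawk -c fastx '{ print $name, length($seq) }' file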

With any awk in any shell on any UNIX box:
$ cat tst.awk
!/^>/ {
    for (i=1; i<=length($0); i++) {
        cnt[substr($0,i,1)]++
    }
}
END {
    for (char in cnt) {
        print char, cnt[char]
    }
}
$ awk -f tst.awk file
A 107
N 67
P 107
C 41
Q 88
D 102
E 132
R 101
F 65
S 168
G 140
T 115
H 52
I 84
V 101
W 27
K 114
Y 30
L 174
M 39
Or, if you prefer (gsub(/./,"&\n") puts each character on its own line so sort | uniq -c can count them):
$ awk -v ORS= '!/^>/{gsub(/./,"&\n"); print}' file | sort | uniq -c
107 A
41 C
102 D
132 E
65 F
140 G
52 H
84 I
114 K
174 L
39 M
67 N
107 P
88 Q
101 R
168 S
115 T
101 V
27 W
30 Y

Try this Perl solution for better performance.
$ perl -lne '
if( ! /^>/ ) { while(/./g) { $kv{$&}++} }
END { for ( sort keys %kv) { print "$_ $kv{$_}" }}
' learner.txt
A 107
C 41
D 102
E 132
F 65
G 140
H 52
I 84
K 114
L 174
M 39
N 67
P 107
Q 88
R 101
S 168
T 115
V 101
W 27
Y 30
$
One more solution using Perl, optimized for performance.
$ time perl -lne '
if( ! /^>/ ) { for($i=0;$i<length($_);$i++)
{ $x=substr($_,$i,1); $kv{$x}++ } }
END { for ( sort keys %kv) { print "$_ $kv{$_}" }}
' chrY.fa
A 2994088
C 1876822
G 1889305
N 30812232
T 3002884
a 4892104
c 3408967
g 3397589
n 140
t 4953284
real 0m15.922s
user 0m15.750s
sys 0m0.108s
$
Edit, with further performance optimizations:
All timings reported below are averages over 3-5 runs on a desktop, done at around the same time but swapped around to avoid pronounced caching effects.
Changing the C-style for loop to for my $i (0..length($_)) speeds the second solution up from 9.2 seconds to 6.8 seconds.
Then, also removing a scalar ($x) at each operation, with
if (not /^>/) { for $i (0..length($_)) { ++$kv{ substr($_,$i,1) } } }
speeds this up to 5.3 seconds.
Further reducing variable use, by copying $_ and thus freeing up the loop to use $_
if (not /^>/) { $l=$_; ++$kv{ substr($l,$_,1) } for 0..length($l) }
only helps a little, running at 5.2 seconds.
This compares with the awk solution, given as kvantour 2 in the nice comparisons in kvantour's answer, which runs at 6.5 seconds (on this system).
Of course none of this can be compared to the optimized bioawk (C-code?) program. For that we'd need to write this in C (which is not very hard using Inline C).
Note that removing a sub call (to substr) for every character by using
if (not /^>/) { ++$kv{$_} for split //; }
results in "only" a 6.4-second average, not as good as the above tweaks; this was a surprise.
These times are on a desktop with v5.16. On v5.24, on the same machine, the best-case (substr with no extra variables in the loop) time is 4.8 seconds while the one without the substr (but with split) is 5.8 seconds. It's nice to see that newer versions of Perl perform better, at least in these cases.
For reference, and for easy timing by others, here is the complete code for the best run:
time perl -lne'
if (not /^>/) { $l=$_; ++$kv{ substr($l,$_,1) } for 0..length($l) }
END { for ( sort keys %kv) { print "$_ $kv{$_}" }}
' chrY.fa

Not sure how much faster this would be, but if you try it please post your timings:
$ awk '!/^>/ {n=split($0,a,""); for(i=1;i<=n;i++) c[a[i]]++}
END {for(k in c) print k,c[k]}' file | sort
A 6
C 9
D 10
E 1
F 7
G 18
H 5
I 4
K 7
L 21
N 15
P 7
Q 6
R 11
S 16
T 18
V 7
W 8
Y 7
This reports counts for the whole file, not line by line. Note that not all awks support splitting on the empty string.
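For example, GNU awk splits a record into individual characters when the field separator is the empty string, while a strictly POSIX awk may leave the record as a single field (ABC is just a stand-in record):
$ echo ABC | gawk '{ n = split($0, a, ""); print n, a[1], a[2], a[3] }'
3 A B C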
Here are the timings of the three approaches:
$ time grep -v ">" filey | fold -w1 | sort | uniq -c >/dev/null
real 0m11.470s
user 0m11.746s
sys 0m0.260s
$ time awk '{n=split($0,a,""); for(i=1;i<=n;i++) c[a[i]]++} END{for(k in c) print k,c[k]}' filey >/dev/null
real 0m7.441s
user 0m7.334s
sys 0m0.060s
$ time awk '{n=length($0); for(i=1;i<=n;i++) c[substr($0,i,1)]++} END{for(k in c) print k,c[k]}' filey >/dev/null
real 0m5.055s
user 0m4.979s
sys 0m0.047s
For the test file:
$ wc filey
118098 649539 16828965 filey
It surprised me that substr is faster than split; perhaps it's due to array allocation.

Related

Sum column and count lines

I am trying to sum certain numbers in column 2; that part works with my code. But I also want to count how many times the same value in column 2 is repeated, and print that in the last column.
file1
36 2605 1 2
36 2605 1 2
36 2603 1 2
36 2605 1 2
36 2605 1 2
36 2605 1 2
36 2606 1 2
Desired output:
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
I tried
awk '{a[$2]+=$1}{b[$2]+=$3}{c[$2]+=$4;count[$2]+=$2}END{for(i in a)print i,a[i],b[i],c[i],count[i]}' file1
Thanks in advance
Renamed the vars and added pretty print:
awk '
{
    sum1[$2]+=$1
    sum3[$2]+=$3
    sum4[$2]+=$4
    count[$2]++
    len2=((l=length($2))>len2?l:len2)
    len1=((l=length(sum1[$2]))>len1?l:len1)
    len3=((l=length(sum3[$2]))>len3?l:len3)
    len4=((l=length(sum4[$2]))>len4?l:len4)
    len5=((l=length(count[$2]))>len5?l:len5)
}
END {
    for(i in count) {
        printf "%*d %*d %*d %*d %*d\n",
            len2,i,len1,sum1[i],len3,sum3[i],len4,sum4[i],len5,count[i]
    }
}' file
Output:
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
Space chars are relatively inexpensive these days; you should really consider getting some for your code, especially if you want other people to read it to help you debug it! Here's the code you posted:
awk '{a[$2]+=$1}{b[$2]+=$3}{c[$2]+=$4;count[$2]+=$2}END{for(i in a)print i,a[i],b[i],c[i],count[i]}' file1
and here it is after having been run through a code beautifier (I used gawk -o):
{
    a[$2] += $1
}
{
    b[$2] += $3
}
{
    c[$2] += $4
    count[$2] += $2
}
END {
    for (i in a) {
        print i, a[i], b[i], c[i], count[i]
    }
}
See how just by adding some white space it's now vastly easier to understand and so the bug in how count[$2] is being populated is glaringly obvious? Some meaningful variable names are always extremely useful too and I hear alphanumeric chars are on special right now!
FWIW here's how I'd do this:
$ cat tst.awk
BEGIN { keyFldNr = 2 }
{
    numOutFlds = 0
    for (i=1; i<=NF; i++) {
        if (i != keyFldNr) {
            sum[$keyFldNr,++numOutFlds] += $i
        }
    }
    cnt[$keyFldNr]++
}
END {
    for (key in cnt) {
        printf "%s%s", key, OFS
        for (i=1; i<=numOutFlds; i++) {
            printf "%s%s", sum[key,i], OFS
        }
        print cnt[key]
    }
}
$ awk -f tst.awk file
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
$ awk -f tst.awk file | column -t
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
Notice that it'll work as-is no matter how many fields you have on each line. If you need to use a different field for the key that you count and sum on, just change the value of keyFldNr in the BEGIN section from 2 to whatever you want it to be.
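If you'd rather pass the key field on the command line than edit the script, one small tweak (a sketch of the idea) is to have the BEGIN section supply only a default, so a -v assignment can override it:
BEGIN { if (keyFldNr == "") keyFldNr = 2 }
$ awk -v keyFldNr=3 -f tst.awk file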
A non-awk approach, using the very useful GNU datamash, which is designed for tasks like this one:
$ datamash -Ws groupby 2 sum 1,3,4 count 2 < input.txt
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
Read as: For each group of rows with the same value in column 2, display that value, the sums of columns 1, 3 and 4, and the number of rows in the group.
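If your real input carried a header row, GNU datamash can skip it; a sketch, assuming a reasonably current datamash with the --header-in option:
$ datamash --header-in -Ws groupby 2 sum 1,3,4 count 2 < input.txt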
You've almost nailed it; you're just not incrementing count[$2] properly:
$ awk '{a[$2]+=$1;b[$2]+=$3;c[$2]+=$4;count[$2]++}
END{for(i in a) print i,a[i],b[i],c[i],count[i]}' file
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
No external program needed, and it's fast (~21 ms); tried on pure GNU awk:
awk '{if($0~/^[A-Za-z0-9]/)a[NR]=$2" "$1" "$3" "$4}END{asort(a);$0="";for(;i++<NR;){split(a[i],b);if($1==""||b[1]==$1){$2+=b[2];$3+=b[3];$4+=b[4];$5++} else {print;$2=b[2];$3=b[3];$4=b[4];$5=1} $1=b[1]} print}' file1
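For readability, here is the same program expanded (behavior unchanged; note that asort makes it GNU-awk-only):
awk '
{ if ($0 ~ /^[A-Za-z0-9]/) a[NR] = $2 " " $1 " " $3 " " $4 }   # reorder: key field first
END {
    asort(a)                       # sort rows so equal keys end up adjacent
    $0 = ""
    for (; i++ < NR; ) {
        split(a[i], b)
        if ($1 == "" || b[1] == $1) {      # same key as the previous row: accumulate
            $2 += b[2]; $3 += b[3]; $4 += b[4]; $5++
        } else {                           # new key: print the previous totals
            print
            $2 = b[2]; $3 = b[3]; $4 = b[4]; $5 = 1
        }
        $1 = b[1]
    }
    print                                  # flush the final group
}' file1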

count, groupby with sed, or awk

I want to perform two different sort-and-count operations on a file, based on each line's content.
1. I need to take the first column of a .tsv file.
I would like to group the lines that start with three digits, keeping only those first three digits; for everything else, just sort and count occurrences of the whole value in the first column.
Sample data:
687/878 9
890987 4
01a 55
1b 8743917
890a 34
abcdee 987
dfeqfe fkdjald
890897 34213
6878853 834
32fasd 53891
abcdee 8794371
abd 873
result:
687 2
890 3
01a 1
1b 1
32fasd 1
abd 1
dfeqfe 1
abcdee 2
I would also appreciate a solution that takes into account a sample input like
687/878 9
890987 4
01a 55
1b 8743917
890a 34
abcdee 987
dfeqfe 545
890897 34213
6878853 834
(632)fasd 53891
(88)abcdee 8794371
abd 873
So the first column may have values containing all kinds of characters, like (, ), #, or '.
The output will have two columns: the first with the extracted values, and the second with their counts.
Again, the preferred output format is TSV.
So I need to extract all values that start with ^\d\d\d and sort and count unique values by those first three digits; then, in a second pass, do the same for each line that does not start with three digits, but this time keep the whole column value and sort and count by it.
What I have tried: | sort | uniq -c | sort -nr for the lines that do start with ^\d\d\d, and the same for those that do not match the regex. Is there a more elegant way using either sed or awk?
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ cnt[/^[0-9]{3}/ ? substr($1,1,3) : $1]++ }
END {
    for (key in cnt) {
        # decorate with a group flag and the count so sort can order the groups
        print (key !~ /^[0-9]{3}/), cnt[key], key, cnt[key]
    }
}
$ awk -f tst.awk file | sort -k1,2n | cut -f3-
687 1
890 2
abcdee 1
You can try Perl:
$ cat nefijaka.txt
687 878 9
890987 4
890a 34
abcdee 987
$ perl -lne ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt
687 1
890 2
abcdee 1
$
You can pipe it to sort and get the values sorted:
$ perl -lne ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt | sort -k2 -nr
890 2
abcdee 1
687 1
EDIT1:
$ cat nefijaka.txt2
687 878 9
890987 4
890a 34
abcdee 987
a word and then 23
$ perl -lne ' /^(\d{3})|(.+?\t)/; $x=$1?$1:$2; $x=~s/\t//g; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt2
687 1
890 2
a word and then 1
abcdee 1
$

Print Distinct Values from Field AWK

I'm looking for a way to print the distinct values in a field while in the command-prompt environment using AWK.
ID Title Promotion_ID Flag
12 Purse 7 Y
24 Wallet 7 Y
709 iPhone 1117 Y
74 Satchel 7 Y
283 Xbox 84 N
Ideally I'd like to return the promotion_ids: 7, 1117, 84.
I've researched the question on Google and have found some examples such as:
`cut -f 3 | uniq *filename.ext*` (returned error)
`awk cut -f 3| uniq *filename.ext*` (returned error)
`awk cut -d, -f3 *filename.ext* |sort| uniq` (returned error)
awk 'NR>1{a[$3]++} END{for(b in a) print b}' file
Output:
7
84
1117
Solution 1st: Simple awk may help (the following removes the header of Input_file):
awk 'FNR>1 && !a[$3]++{print $3}' Input_file
Solution 2nd: In case you need to keep the header of the Input_file, the following may help:
awk 'FNR==1{print;next} !a[$3]++{print $3}' Input_file
With a pipeline:
$ sed 1d file | # remove header
tr -s ' ' '\t' | # normalize space delimiters to tabs
cut -f3 | # isolate the field
sort -nu # sort numerically and report unique entries
7
84
1117
[root@test ~]# cat test
ID Title Promotion_ID Flag
12 Purse 7 Y
24 Wallet 7 Y
709 iPhone 1117 Y
74 Satchel 7 Y
283 Xbox 84 N
Output:
[root@test ~]# awk -F" " '!s[$3]++' test
ID Title Promotion_ID Flag
12 Purse 7 Y
709 iPhone 1117 Y
283 Xbox 84 N
[root@test ~]#
mawk '!__[$!NF=$--NF]--^(!_<NR)'
or
gawk '!__[$!--NF=$NF]--^(!_<NR)'
or perhaps
gawk '!__[$!--NF=$NF]++^(NF<NR)'
or even
mawk '!__[$!--NF=$NF]++^(NR-!_)' # mawk-only
gawk '!__[$!--NF=$--NF]--^(NR-NF)' # gawk-equiv of similar idea
7
1117
84

AWK print next line of match between matches

Let's presume I have a file test.txt with the following data:
.0
41
0.0
42
0.0
43
0.0
44
0.0
45
0.0
46
0.0
START
90
34
17
34
10
100
20
2056
30
0.0
10
53
20
2345
30
0.0
10
45
20
875
30
0.0
END
0.0
48
0.0
49
0.0
140
0.0
With awk, how would I print the lines after the 10 and 20 lines between START and END?
So the output would be:
100
2056
53
2345
45
875
I was able to get the lines with 10 and 20 with
awk '/START/,/END/ {if($0==10 || $0==20) print $0}' test.txt
but how would I get the next lines?
I actually got what I wanted with
awk '/^START/,/^END/ {if($0==10 || $0==20) {getline; print} }' test.txt
Ranges in awk work fine, but they're less flexible than using flags.
awk '/^START/ {f=1} /^END/ {f=0} f && /^(1|2)0$/ {getline;print}' file
100
2056
53
2345
45
875
Don't use ranges as they make trivial things slightly briefer but require a complete rewrite or duplicate conditions when things get even slightly more complicated.
Don't use getline unless it's an appropriate application and you have read and fully understand http://awk.info/?tip/getline.
Just let awk read your lines as designed:
$ cat tst.awk
/START/ { inBlock=1 }
/END/ { inBlock=0 }
foundTgt { print; foundTgt=0 }
inBlock && /^[12]0$/ { foundTgt=1 }
$ awk -f tst.awk file
100
2056
53
2345
45
875
Feel free to use single-character variable names and cram it all onto one line if you find that useful:
awk '/START/{b=1} /END/{b=0} f{print;f=0} b&&/^[12]0$/{f=1}' file

awk Repeat the values of a column n times

I have a column and I want to repeat it multiple times.
For example, I want the following column
1
2
3
4
5
to repeat n times and become
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
.
.
.
etc.
Is this possible?
Replace 3 with any value of n and $1 with the column you want to print:
$ for i in {1..3}; do awk '{print $1}' file; done
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
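Since brace expansion like {1..3} does not expand shell variables, a variable repeat count needs seq (a small sketch; n is whatever count you want):
$ n=4; for i in $(seq 1 "$n"); do awk '{print $1}' file; done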
You could do it just with awk but you would have to store the whole file in memory:
$ awk '{a[NR]=$1}END{for(i=1;i<=3;i++)for(j=1;j<=NR;j++)print a[j]}' file
This is one of those occasions where using the shell constructs actually makes sense.
awk -v n=7 '{s = s $1 ORS} END{for (i=1;i<=n;i++) printf "%s", s}' file
Set n to whatever number you like.
Alternatively, with GNU awk:
awk -v n=7 -v RS='\0' -v ORS= 'END{for (i=1;i<=n;i++) print}' file
Here's a portable awk-based solution using only built-in variables, with no looping at all. FS and RS are set to a regex that never matches, so the whole file becomes a single record with a single field; OFS is set to that record itself, and bumping NF up to __ makes awk rebuild $0 as __ copies of the input:
{m,n,g}awk 'NF += __-(OFS = $-_)^_' RS='^$' FS='^$' ORS= __=3
1 <<<<
2
3
4
1 <<<<
2
3
4
1 <<<<
2
3
4
To repeat each row N times consecutively before moving to the next row, do:
{m,n,g}awk 'NF += __-(OFS = "\n" $-_)^_' FS='^$' __=3
1 <<<<
1 <<<<
1 <<<<
2
2
2
3
3
3
4
4
4
If you wanna duplicate each row horizontally N times, it's the same concept, just slightly different:
{m,n,g}awk 'NF = __ *(OFS = $-_)^_' FS='^$' __=7
10101010101010
11111111111111
12121212121212
13131313131313
14141414141414
15151515151515
16161616161616
17171717171717
18181818181818
19191919191919
You can even make cute numeric pyramids by basing the repeat count on NR:
{m,n,g}awk 'NF = NR * (OFS=$-_)^_' FS='^$'
10
1111
121212
13131313
1414141414
151515151515
16161616161616
1717171717171717
181818181818181818
19191919191919191919
2020202020202020202020
So if you ever need 2^25 (aka 32^5) copies of the ASCII string of 2^25 itself, here's a pretty quick (0.45 s) and easy way to get it done with no loops at all:
out9: 256MiB 0:00:00 [ 591MiB/s] [ 591MiB/s] [ <=> ]
( echo 33554432 | mawk2 'NF=(OFS=$-_)^_*$-_' FS='^$'; )
0.30s user 0.10s system 89% cpu 0.454 total
1 268435457
Even if you wanna repeat pretty sizable files, it's reasonably fast: e.g. less than 13 seconds for 9 copies of a 1.85 GB file:
{m,g}awk 'NF += __-(OFS = $-_)^_' RS='^$' FS='^$' ORS= __=9
out9: 16.6GiB 0:00:12 [1.30GiB/s] [1.30GiB/s] [ <=> ]
in0: 1.85GiB 0:00:00 [3.42GiB/s] [3.42GiB/s] [========>] 100%
5.48s user 4.44s system 77% cpu 12.809 total
112448475 17851902237 17851902237
If that's too slow for your needs, perhaps this way is a bit faster. It builds a string of __ literal & characters (sprintf produces __ spaces, and gsub rewrites each one as \&, i.e. a literal ampersand), then a single sub() replaces the whole record with that string; each & in the replacement expands to the matched record, yielding __ copies:
{m,g}awk '
BEGIN { FS = RS = "^$" }{__=sprintf("%*s",__, OFS = ORS = _)
gsub(".", "\\&", __) } sub(".+",__)' __=9
in0: 1.85GiB 0:00:00 [3.48GiB/s] [3.48GiB/s] [========>] 100%
out9: 16.6GiB 0:00:07 [2.12GiB/s] [2.12GiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}" | LC_ALL=C mawk2 -v __=9 ; )
0.72s user 4.34s system 64% cpu 7.855 total
112448475 17851902237 17851902237
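For comparison, the plain-shell baseline for repeating a whole file n times is just a cat loop (a sketch; throughput will be dominated by I/O):
$ for i in $(seq 1 9); do cat file; done > file.x9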