How to print symbols instead of numbers using awk in bash

I have an input file:
AD;12;qwe
EL;66;qwe
AD;90;oop
AD;8;lpp
EL;56;oop
GR;55;qwe
If I use echo "$file" | awk -F';' '{a[$1]+= 1} END{for(i in a){printf "%-5s: %s\n", i, a[i]}}' | sort
My output will be
AD : 3
EL : 2
GR : 1
But I want to get something like this, and I have no idea how to do it:
AD : ###
EL : ##
GR : #
Could anyone help me?

A little function
awk -F';' '
{a[$1] += 1}
function repeat(char, num, s) {
s = sprintf("%*s", num, "")
gsub(/ /, char, s)
return s
}
END {
for (i in a)
printf "%-5s: %s\n", i, repeat("#", a[i])
}
' file | sort
AD : ###
EL : ##
GR : #
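The heavy lifting in repeat() is printf's dynamic field width: "%*s" takes the width from an argument. A minimal sketch of just that trick, separated from the counting code:

```shell
# Build a bar of n '#' characters without a loop:
# "%*s" pads an empty string to a field of width n (all spaces),
# then gsub() replaces every space with '#'.
awk 'BEGIN {
    n = 5
    s = sprintf("%*s", n, "")
    gsub(/ /, "#", s)
    print s                     # #####
}'
```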

yet another awk
$ awk -F';' 'function repeat(n,c)
{return (n<=0)?"":(c repeat(n-1,c))}
{a[$1]++}
END {for(k in a) printf "%-5s: %s\n",k,repeat(a[k],"#")}' file | sort
AD : ###
EL : ##
GR : #
or with memoization
awk -F';' 'function repeat(n,c)
{return (n<=0)?"":(c memoize(n-1,c))}
function memoize(n,c)
{if(!(n in mem)) mem[n]=repeat(n,c); return mem[n]}
{a[$1]++}
END {for(k in a) printf "%-5s: %s\n",k,memoize(a[k],"#")}' file | sort
At the cost of additional complexity, this halving variant should be much faster for large counts:
awk -F';' 'function repeat(n,c, _t_)
{if(n<=0) return "";
else if(n%2) return c memoize(n-1,c);
else {_t_=memoize(n/2,c); return _t_ _t_}}
function memoize(n,c)
{if(!(n in mem)) mem[n]=repeat(n,c); return mem[n]}
{a[$1]++}
END {for(k in a) printf "%-5s: %s\n",k,memoize(a[k],"#")}' file | sort
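As a quick sanity check of the halving variant (the same repeat()/memoize() pair as above, run standalone), the length of the generated bar must equal the requested count:

```shell
# Recursive-doubling repeat(): even counts concatenate two half-bars,
# odd counts prepend one char; mem[] caches intermediate results.
awk 'function repeat(n,c, _t_) {
         if (n <= 0) return ""
         else if (n % 2) return c memoize(n-1, c)
         else { _t_ = memoize(n/2, c); return _t_ _t_ }
     }
     function memoize(n,c) {
         if (!(n in mem)) mem[n] = repeat(n, c)
         return mem[n]
     }
     BEGIN { print length(memoize(1000, "#")) }'   # 1000
```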

echo "$file" | awk -F';' '{a[$1]+= 1} END{ for(i in a){ printf "%s : ",i; for (j=1;j<=a[i];j++) printf "#"; printf "\n" } }' | sort
Print each index of the array, then run an inner loop from 1 up to that index's value, printing one "#" per iteration.
Using GNU awk and its controllable array traversal order to avoid the pipe through sort:
echo "$file" | awk -F';' '{a[$1]+= 1} END{ PROCINFO["sorted_in"]="#val_num_desc"; for(i in a){ printf "%s : ",i; for (j=1;j<=a[i];j++) printf "#"; printf "\n" } }'

This is similar to @glennjackman's answer, but here the field-width modifier builds a string of spaces (b) and gsub() then converts the spaces to '#', rather than using repeat(). You can do:
awk -F';' '
{ a[$1] += 1 }
END {
for (i in a) {
b = sprintf ("%*s",a[i]," ")
gsub (/ /,"#",b)
printf "%-5s: %s\n", i, b
}
}
' file | sort
Example Use/Output
$ awk -F';' '
> { a[$1] += 1 }
> END {
> for (i in a) {
> b = sprintf ("%*s",a[i]," ")
> gsub (/ /,"#",b)
> printf "%-5s: %s\n", i, b
> }
> }
> ' file | sort
AD : ###
EL : ##
GR : #
They do close to the same thing (but if I'd snapped to the use of repeat() earlier, I'd probably have gone that route :). Let me know if you have questions.

With GNU awk for gensub():
$ cut -d';' -f1 file | sort | uniq -c |
awk '{printf "%-5s: %s\n", $2, gensub(/ /,"#","g",sprintf("%*s",$1,""))}'
AD : ###
EL : ##
GR : #
or with any awk:
$ cut -d';' -f1 file | sort | uniq -c |
awk '{str=sprintf("%*s",$1,""); gsub(/ /,"#",str); printf "%-5s: %s\n", $2, str}'
AD : ###
EL : ##
GR : #
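The pipeline works because uniq -c puts the count in $1 and the key in $2. A self-contained run of the any-awk version on the sample data (fed via printf instead of a file):

```shell
# cut isolates the key column, sort|uniq -c counts duplicates,
# and awk turns each count into a bar of '#' characters.
printf '%s\n' 'AD;12;qwe' 'EL;66;qwe' 'AD;90;oop' \
              'AD;8;lpp' 'EL;56;oop' 'GR;55;qwe' |
cut -d';' -f1 | sort | uniq -c |
awk '{str = sprintf("%*s", $1, ""); gsub(/ /, "#", str)
      printf "%-5s: %s\n", $2, str}'
```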

I want to propose a little change to the original code to get the desired result: simply append # instead of keeping a count. Let file.txt's content be
AD;12;qwe
EL;66;qwe
AD;90;oop
AD;8;lpp
EL;56;oop
GR;55;qwe
then
awk 'BEGIN{FS=";"}{a[$1]=a[$1] "#"}END{for(i in a){printf "%-5s: %s\n", i, a[i]}}' file.txt
output
EL : ##
AD : ###
GR : #
For simplicity's sake I left out the echo and sort parts, as these should remain unchanged.
(tested in gawk 4.2.1)
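A self-contained run of that one-liner, piped through sort since for (i in a) visits keys in unspecified order (input fed via printf rather than file.txt):

```shell
# Append one '#' per occurrence instead of incrementing a counter;
# the array value IS the bar by the time END runs.
printf '%s\n' 'AD;12;qwe' 'EL;66;qwe' 'AD;90;oop' \
              'AD;8;lpp' 'EL;56;oop' 'GR;55;qwe' |
awk 'BEGIN{FS=";"} {a[$1] = a[$1] "#"}
     END{for (i in a) printf "%-5s: %s\n", i, a[i]}' | sort
```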

Related

AWK: How to number auto-increment?

I have a file. The file content is:
20210126000880000003|3|33.00|20210126|15:30
1|20210126000000000000000000002207|1220210126080109|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT07|621067000000123645|收款方户名|2021-01-26|2021-01-26|10.00|TN|NCS|12|875466
2|20210126000000000000000000002208|1220210126080110|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT06|621067000000123645|收款方户名|2021-01-26|2021-01-26|20.00|TN|NCS|12|875466
3|20210126000000000000000000002209|1220210126080111|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT08|621067000000123645|收款方户名|2021-01-26|2021-01-26|3.00|TN|NCS|12|875466
I use this awk command:
awk -F"|" 'NR==1{print $1};FNR==2{print $2,$3}' testfile
Get the following result:
20210126000880000003
20210126000000000000000000002207 1220210126080109
I want the number to auto-increase:
awk -F"|" 'NR==1{print $1+1};FNR==2{print $2+1,$3+1}' testfile
But I get the following result:
20210126000880001024
20210126000000000944237587726336 1220210126080110
My question: I want the number to auto-increment; the hoped-for result is:
20210126000880000003
20210126000000000000000000002207|1220210126080109
-------------------------------------------------
20210126000880000004
20210126000000000000000000002208|1220210126080110
--------------------------------------------------
20210126000880000005
20210126000000000000000000002209|1220210126080111
How can I auto-increment it? Thanks!
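The garbled results come from awk storing numbers as IEEE-754 doubles: only integers up to 2^53 are exact, and these 20-digit values are far beyond that, so adding 1 snaps to the nearest representable double. A minimal sketch of the precision loss:

```shell
# 20210126000880000003 exceeds 2^53 (about 9e15), so the +1
# is lost in rounding and a nearby representable value is printed.
awk 'BEGIN { printf "%.0f\n", 20210126000880000003 + 1 }'
```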
You may try this GNU awk command; the -M option enables arbitrary-precision arithmetic, which is needed because these 20-digit values overflow a double:
awk -M 'BEGIN {FS=OFS="|"} NR == 1 {hdr = $1; next} NF>2 {print ++hdr; print $2, $3; print "-------------------"}' file
20210126000880000004
20210126000000000000000000002207|1220210126080109
-------------------
20210126000880000005
20210126000000000000000000002208|1220210126080110
-------------------
20210126000880000006
20210126000000000000000000002209|1220210126080111
-------------------
A more readable version:
awk -M 'BEGIN {
FS=OFS="|"
}
NR == 1 {
hdr = $1
next
}
NF > 2 {
print ++hdr
print $2, $3
print "-------------------"
}' file
Here is a POSIX awk solution that doesn't need -M: it pipes the increment through bc via getline to avoid floating-point precision loss:
awk 'BEGIN {FS=OFS="|"} NR == 1 {hdr = $1; next} NF>2 {"echo " hdr " + 1 | bc" | getline hdr; print hdr; print $2, $3; print "-------------------"}' file
20210126000880000004
20210126000000000000000000002207|1220210126080109
-------------------
20210126000880000005
20210126000000000000000000002208|1220210126080110
-------------------
20210126000880000006
20210126000000000000000000002209|1220210126080111
-------------------
Anubhava has the best solution but for older versions of GNU awk that don't support -M (big numbers) you can try the following:
awk -F\| 'NR==1 { print $1;hed=$1;hed1=substr($1,(length($1)-1));next; } !/^$/ {print $2" "$3 } /^$/ { print "--------------------------------------------------";printf "%s%s\n",substr(hed,1,((length(hed))-(length(hed1)+1))),++hed1 }' testfile
Explanation:
awk -F\| 'NR==1 { # Set field delimiter to | and process the first line
print $1; # Print the first field
hed=$1; # Set the variable hed to the first field
hed1=substr($1,(length($1)-1)); # Set a counter variable hed1 to the last two digits of hed ($1)
next;
}
!/^$/ {
print $2" "$3 # Where there is no blank line, print the second field, a space and the third field
}
/^$/ {
print "--------------------------------------------------"; # Where there is a blank line, process
printf "%s%s\n",substr(hed,1,((length(hed))-(length(hed1)+1))),++hed1 # print the header extract before the counter, followed by the incremented counter
}' testfile

Awk column with pattern array

Is it possible to do this but use an actual array of strings where it says "array"?
array=(cat
dog
mouse
fish
...)
awk -F "," '{ if ( $5!="array" ) { print $0; } }' file
I would like to use spaces in some of the strings in my array.
I would also like to be able to match partial matches, so "snow" in my array would match "snowman"
It should be case sensitive.
Example csv
s,dog,34
3,cat,4
1,african elephant,gd
A,African Elephant,33
H,snowman,8
8,indian elephant,3k
7,Fish,94
...
Example array
snow
dog
african elephant
Expected output
s,dog,34
H,snowman,8
1,african elephant,gd
Cyrus posted this, which works well, but it doesn't allow spaces in the array strings and won't match partial matches.
echo "${array[@]}" | awk 'FNR==NR{len=split($0,a," "); next} {for(i=1;i<=len;i++) {if(a[i]==$2){next}} print}' FS=',' - file
The brief approach using a single regexp for all array contents:
$ array=('snow' 'dog' 'african elephant')
$ printf '%s\n' "${array[@]}" | awk -F, 'NR==FNR{r=r s $0; s="|"; next} $2~r' - example.csv
s,dog,34
1,african elephant,gd
H,snowman,8
Or if you prefer string comparisons:
$ cat tst.sh
#!/bin/env bash
array=('snow' 'dog' 'african elephant')
printf '%s\n' "${array[@]}" |
awk -F',' '
NR==FNR {
array[$0]
next
}
{
for (val in array) {
if ( index($2,val) ) { # or $2 ~ val for a regexp match
print
next
}
}
}
' - example.csv
$ ./tst.sh
s,dog,34
1,african elephant,gd
H,snowman,8
This skips every line of the csv file whose column 5 exactly matches an element of the array, and prints the rest:
echo "${array[@]}" | awk 'FNR==NR{len=split($0,a," "); next} {for(i=1;i<=len;i++) {if(a[i]==$5){next}} print}' FS=',' - file
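A self-contained run (under bash, for the array) of the index() approach on the sample csv. It shows the embedded space in 'african elephant' surviving, 'snow' partially matching 'snowman', and the case-sensitive misses on 'African Elephant' and 'Fish'; example.csv is created inline for the demo:

```shell
array=('snow' 'dog' 'african elephant')
printf '%s\n' 's,dog,34' '3,cat,4' '1,african elephant,gd' \
       'A,African Elephant,33' 'H,snowman,8' \
       '8,indian elephant,3k' '7,Fish,94' > example.csv
# Patterns arrive on stdin one per line, so spaces inside them survive;
# index() does a case-sensitive substring match against column 2.
printf '%s\n' "${array[@]}" |
awk -F',' '
    NR==FNR { pats[$0]; next }
    { for (p in pats) if (index($2, p)) { print; next } }
' - example.csv
```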

Print columns from two files

How to print columns from various files?
I tried according to Awk: extract different columns from many different files
paste <(awk '{printf "%.4f %.5f ", $1, $2}' FILE.R ) <(awk '{printf "%.6f %.0f.\n", $3, $4}' FILE_R )
FILE.R == ARGV[1] { one[FNR]=$1 }
FILE.R == ARGV[2] { two[FNR]=$2 }
FILE_R == ARGV[3] { three[FNR]=$3 }
FILE_R == ARGV[4] { four[FNR]=$4 }
END {
for (i=1; i<=length(one); i++) {
print one[i], two[i], three[i], four[i]
}
}
but I don't understand how to use this script.
FILE.R
56604.6017 2.3893 2.2926 2.2033
56605.1562 2.3138 2.2172 2.2033
FILE_R
56604.6017 2.29259 0.006699 42.
56605.1562 2.21716 0.007504 40.
Output desired
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.
Thank you
This is one way:
$ awk -v OFS="\t" 'NR==FNR{a[$1]=$2;next}{print $1,a[$1],$3,$4}' file1 file2
Output:
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.
Explained:
$ awk -v OFS="\t" ' # set the output field separator to a tab
NR==FNR { # process the first file
a[$1]=$2 # hash the second field, use first as key
next
}
{
print $1,a[$1],$3,$4 # output
}' file1 file2
If the field spacing with tabs is not enough, use printf with modifiers like in your sample.
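A self-contained run of the NR==FNR join on the sample data, with both files created inline and the default space as output separator (drop -v OFS="\t" if you want the spacing shown in "Output desired"):

```shell
printf '%s\n' '56604.6017 2.3893 2.2926 2.2033' \
              '56605.1562 2.3138 2.2172 2.2033' > FILE.R
printf '%s\n' '56604.6017 2.29259 0.006699 42.' \
              '56605.1562 2.21716 0.007504 40.' > FILE_R
# First pass (NR==FNR): remember $2 of FILE.R keyed by $1.
# Second pass: print the key, the remembered value, then $3 and $4 of FILE_R.
awk 'NR==FNR {a[$1]=$2; next} {print $1, a[$1], $3, $4}' FILE.R FILE_R
```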

How to get cardinality of fields with AWK?

I am trying to count the unique occurrences for each field in a txt file.
Sample:
2008,12,13,6,1007,847,1149,1010,DL,1631,N909DA,162,143,122,99,80,ATL,IAH,689,8,32,0,,0,1,0,19,0,79
2008,12,13,6,638,640,808,753,DL,1632,N604DL,90,73,50,15,-2,JAX,ATL,270,14,26,0,,0,0,0,15,0,0
2008,12,13,6,756,800,1032,1026,DL,1633,N642DL,96,86,56,6,-4,MSY,ATL,425,23,17,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,612,615,923,907,DL,1635,N907DA,131,112,103,16,-3,GEG,SLC,546,5,23,0,,0,0,0,16,0,0
2008,12,13,6,749,750,901,859,DL,1636,N646DL,72,69,41,2,-1,SAV,ATL,215,20,11,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1002,959,1204,1150,DL,1636,N646DL,122,111,71,14,3,ATL,IAD,533,6,45,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,834,835,1021,1023,DL,1637,N908DL,167,168,139,-2,-1,ATL,SAT,874,5,23,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,655,700,856,856,DL,1638,N671DN,121,116,85,0,-5,PBI,ATL,545,24,12,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1251,1240,1446,1437,DL,1639,N646DL,115,117,89,9,11,IAD,ATL,533,13,13,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1110,1103,1413,1418,DL,1641,N908DL,123,135,104,-5,7,SAT,ATL,874,8,11,0,,0,NA,NA,NA,NA,NA
Full dataset here: https://github.com/markgrover/cloudcon-hive (Flight delay dataset from 2008.)
For a single column we can do:
for i in $(seq 1 28); do cut -d',' -f$i 2008.csv | head |sort | uniq | wc -l ; done |tr '\n' ':' ; echo
Is there a way to do it in one go for all the columns?
I think the expected output looks like this:
1:1:1:1:10:10:10:10:1:10:9:9:6:9:9:9:2:5:5:5:6:1:1:1:3:2:2:2:
For the entire dataset:
1:12:31:7:1441:1217:1441:1378:20:7539:5374:690:526:664:1154:1135:303:304:1435:191:343:2:5:2:985:600:575:157:
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
for (i=1; i<=NF; i++) {
vals[i][$i]
}
}
END {
for (i=1; i<=NF; i++) {
printf "%s%s", length(vals[i]), (i<NF?OFS:ORS)
}
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
and with any awk:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
for (i=1; i<=NF; i++) {
if ( !seen[i,$i]++ ) {
cnt[i]++
}
}
}
END {
for (i=1; i<=NF; i++) {
printf "%s%s", cnt[i], (i<NF?OFS:ORS)
}
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
In GNU awk:
$ awk '
BEGIN { FS=OFS="," } # delimiters to ,
{
for(i=1;i<=NF;i++) # iterate over every field
a[i][$i] # store unique values to 2d hash
}
END { # after all the records
for(i=1;i<=NF;i++) # iterate the unique values for each field
for(j in a[i])
c[i]++ # count them and
for(i=1;i<=NF;i++)
printf "%s%s",c[i], (i==NF?ORS:OFS) # output the values
}' file
1,1,1,1,10,10,10,10,1,9,7,10,10,10,10,9,8,5,8,8,8,1,1,1,3,2,4,2,3
The output is not exactly the same; I'm not sure if the mistake is yours or mine. Well, the last column has the values 79, 0 and NA, so mine is more accurate on that one.
another awk
This will give you rolling counts; pipe to tail -1 to get the last line, the overall counts:
$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++)
printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file
1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1
1:1:1:1:2:2:2:2:1:2:2:2:2:2:2:2:2:2:2:2:2:1:1:1:2:1:2:1:2
1:1:1:1:3:3:3:3:1:3:3:3:3:3:3:3:3:2:3:3:3:1:1:1:3:2:3:2:3
1:1:1:1:4:4:4:4:1:4:4:4:4:4:4:4:4:3:4:4:4:1:1:1:3:2:4:2:3
1:1:1:1:5:5:5:5:1:5:5:5:5:5:5:5:5:3:5:5:5:1:1:1:3:2:4:2:3
1:1:1:1:6:6:6:6:1:5:5:6:6:6:6:6:5:4:6:6:6:1:1:1:3:2:4:2:3
1:1:1:1:7:7:7:7:1:6:6:7:7:7:7:6:5:5:7:6:6:1:1:1:3:2:4:2:3
1:1:1:1:8:8:8:8:1:7:7:8:8:8:8:7:6:5:8:7:7:1:1:1:3:2:4:2:3
1:1:1:1:9:9:9:9:1:8:7:9:9:9:9:8:7:5:8:8:8:1:1:1:3:2:4:2:3
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
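The portable seen[i,$i] version is easy to verify on a tiny input; a sketch with three rows where column 1 has 2 distinct values and column 2 has 3:

```shell
# seen[i,$i] marks each (column, value) pair on first sight;
# cnt[i] therefore ends up as the number of distinct values per column.
printf '%s\n' 'a,x' 'a,y' 'b,z' |
awk 'BEGIN { FS=","; OFS=":" }
     { for (i=1; i<=NF; i++) if (!seen[i,$i]++) cnt[i]++ }
     END { for (i=1; i<=NF; i++) printf "%s%s", cnt[i], (i<NF?OFS:ORS) }'
# prints 2:3
```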

awk: gsub /pattern1/, but not /pattern1pattern2/

In my work, I have to solve such a simple problem: change pattern1 to newpattern, but only if it is not followed by pattern2 or pattern3:
"pattern1 pattern1pattern2 pattern1pattern3 pattern1pattern4" → "newpattern pattern1pattern2 pattern1pattern3 newpatternpattern4"
Here is my solution, but I don't like it and I suppose there should be a more elegant and easy way to do that?
$ echo 'pattern1 pattern1pattern2 pattern1pattern3 pattern1pattern4' | awk '
{gsub(/pattern1pattern2/, "###", $0)
gsub(/pattern1pattern3/, "%%%", $0)
gsub(/pattern1/, "newpattern", $0)
gsub(/###/, "pattern1pattern2", $0)
gsub(/%%%/, "pattern1pattern3", $0)
print}'
newpattern pattern1pattern2 pattern1pattern3 newpatternpattern4
So, the sample input file:
pattern1 pattern1pattern2 aaa_pattern1pattern3 pattern1pattern4 pattern1pattern2pattern1
The sample output file should be:
newpattern pattern1pattern2 aaa_pattern1pattern3 newpatternpattern4 pattern1pattern2newpattern
This is trivial in perl, using a negative lookahead:
perl -pe 's/pattern1(?!pattern[23])/newpattern/g' file
Substitute all matches of pattern1 that are not followed by pattern2 or pattern3.
If for some reason you need to do it in awk, then here's one way you could go about it:
{
out = ""
replacement = "newpattern"
while (match($0, /pattern1/)) {
if (substr($0, RSTART + RLENGTH) ~ /^pattern[23]/) {
out = out substr($0, 1, RSTART + RLENGTH - 1)
}
else {
out = out substr($0, 1, RSTART - 1) replacement
}
$0 = substr($0, RSTART + RLENGTH)
}
print out $0
}
Consume the input while pattern1 matches and build the string out, inserting the replacement when the part after each match isn't pattern2 or pattern3. Once there are no more matches, print the string that has been built so far, followed by whatever is left in the input.
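Wrapped in an invocation, the match()/substr() loop above can be exercised directly on the sample line:

```shell
echo 'pattern1 pattern1pattern2 aaa_pattern1pattern3 pattern1pattern4 pattern1pattern2pattern1' |
awk '{
    out = ""
    replacement = "newpattern"
    while (match($0, /pattern1/)) {
        # If the text right after the match is pattern2/pattern3, keep the match...
        if (substr($0, RSTART + RLENGTH) ~ /^pattern[23]/)
            out = out substr($0, 1, RSTART + RLENGTH - 1)
        else  # ...otherwise emit the text before the match plus the replacement.
            out = out substr($0, 1, RSTART - 1) replacement
        $0 = substr($0, RSTART + RLENGTH)
    }
    print out $0
}'
```

This prints the expected "newpattern pattern1pattern2 aaa_pattern1pattern3 newpatternpattern4 pattern1pattern2newpattern".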
With GNU awk for the 4th arg to split():
$ cat tst.awk
{
split($0,flds,/pattern1(pattern2|pattern3)/,seps)
for (i=1; i in flds; i++) {
printf "%s%s", gensub(/pattern1/,"newpattern","g",flds[i]), seps[i]
}
print ""
}
$ awk -f tst.awk file
newpattern pattern1pattern2 aaa_pattern1pattern3 newpatternpattern4 pattern1pattern2newpattern
With other awks you can do the same with a while(match()) loop:
$ cat tst.awk
{
while ( match($0,/pattern1(pattern2|pattern3)/) ) {
tgt = substr($0,1,RSTART-1)
gsub(/pattern1/,"newpattern",tgt)
printf "%s%s", tgt, substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+RLENGTH)
}
gsub(/pattern1/,"newpattern",$0)
print
}
$ awk -f tst.awk file
newpattern pattern1pattern2 aaa_pattern1pattern3 newpatternpattern4 pattern1pattern2newpattern
but obviously the gawk solution is simpler and more concise so, as always, get gawk!
awk solution. Nice question. Basically it's doing 2 gensubs:
$ cat tst.awk
{ for (i=1; i<=NF; i++){
s=gensub(/pattern1/, "newpattern", "g", $i);
t=gensub(/(newpattern)(pattern(2|3))/, "pattern1\\2", "g", s);
$i=t
}
}1
Testing:
echo "pattern1 pattern1pattern2 aaa_pattern1pattern3 pattern1pattern4 pattern1pattern2pattern1" | awk -f tst.awk
newpattern pattern1pattern2 aaa_pattern1pattern3 newpatternpattern4 pattern1pattern2newpattern
However, this will fail whenever you already have something like newpatternpattern2 in your input. But that's not what OP suggests with his input examples, I guess.