How to print symbols instead of numbers using awk in bash - awk
I have an input file:
AD;12;qwe
EL;66;qwe
AD;90;oop
AD;8;lpp
EL;56;oop
GR;55;qwe
If I use echo "$file" | awk -F';' '{a[$1]+= 1} END{for(i in a){printf "%-5s: %s\n", i, a[i]}}' | sort
My output will be
AD   : 3
EL   : 2
GR   : 1
But I want to get something like this, and I have no idea how to do it:
AD : ###
EL : ##
GR : #
Could anyone help me?
A little function
awk -F';' '
{a[$1] += 1}
function repeat(char, num, s) {
s = sprintf("%*s", num, "")
gsub(/ /, char, s)
return s
}
END {
for (i in a)
printf "%-5s: %s\n", i, repeat("#", a[i])
}
' file | sort
AD   : ###
EL   : ##
GR   : #
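To see why that trick works in isolation: `%*s` takes the field width from the next printf argument, so sprintf("%*s", n, "") yields n spaces, and gsub() then rewrites each space as "#". A minimal sketch (note that the `*` dynamic width is supported by gawk, mawk and BWK awk, though not strictly guaranteed by POSIX awk):

```shell
# Pad an empty string to width n, then turn each space into '#'.
awk 'BEGIN {
    n = 5
    s = sprintf("%*s", n, "")   # "     " (five spaces)
    gsub(/ /, "#", s)           # "#####"
    print s
}'
```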
yet another awk
$ awk -F';' 'function repeat(n,c)
{return (n<=0)?"":(c repeat(n-1,c))}
{a[$1]++}
END {for(k in a) printf "%-5s: %s\n",k,repeat(a[k],"#")}' file | sort
AD   : ###
EL   : ##
GR   : #
or with memoization
awk -F';' 'function repeat(n,c)
{return (n<=0)?"":(c memoize(n-1,c))}
function memoize(n,c)
{if(!(n in mem)) mem[n]=repeat(n,c); return mem[n]}
{a[$1]++}
END {for(k in a) printf "%-5s: %s\n",k,memoize(a[k],"#")}' file | sort
At the cost of additional complexity, this should be much faster for large counts:
awk -F';' 'function repeat(n,c, _t_)
{if(n<=0) return "";
else if(n%2) return c memoize(n-1,c);
else {_t_=memoize(n/2,c); return _t_ _t_}}
function memoize(n,c)
{if(!(n in mem)) mem[n]=repeat(n,c); return mem[n]}
{a[$1]++}
END {for(k in a) printf "%-5s: %s\n",k,memoize(a[k],"#")}' file | sort
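To illustrate the speed claim, here is the same halving idea on its own, without the memoization wrapper: every even n is built by doubling repeat(n/2), so a bar of length n costs O(log n) concatenations instead of n. This is a hedged sketch, not part of the answer above:

```shell
# Recursive repeat() using halving: an even n doubles repeat(n/2),
# so only O(log n) concatenations are needed even for huge bars.
awk '
function repeat(n, c,   t) {
    if (n <= 0) return ""
    if (n % 2)  return c repeat(n - 1, c)
    t = repeat(n / 2, c)
    return t t
}
BEGIN { print length(repeat(100000, "#")) }'
```

This prints 100000, confirming the bar has the right length.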
echo "$file" | awk -F';' '{a[$1] += 1} END{for(i in a){printf "%s : ", i; for(j=1; j<=a[i]; j++) printf "#"; printf "\n"}}' | sort
Print the index of the array, then run a second loop from 1 to the count stored at that index, printing a "#" each time.
Using GNU awk and utilising array sorting to obviate the need to pipe through sort:

echo "$file" | awk -F';' '{a[$1] += 1} END{PROCINFO["sorted_in"]="@val_num_desc"; for(i in a){printf "%s : ", i; for(j=1; j<=a[i]; j++) printf "#"; printf "\n"}}'
This is similar to @glennjackman's answer, but the field-width modifier is used to build a string of spaces (b) which is then converted to '#' with gsub() rather than using repeat(). You can do:
awk -F';' '
{ a[$1] += 1 }
END {
for (i in a) {
b = sprintf ("%*s",a[i]," ")
gsub (/ /,"#",b)
printf "%-5s: %s\n", i, b
}
}
' file | sort
Example Use/Output
$ awk -F';' '
> { a[$1] += 1 }
> END {
> for (i in a) {
> b = sprintf ("%*s",a[i]," ")
> gsub (/ /,"#",b)
> printf "%-5s: %s\n", i, b
> }
> }
> ' file | sort
AD   : ###
EL   : ##
GR   : #
They do close to the same thing (but if I'd snapped to the use of repeat() earlier, I'd probably have gone that route :). Let me know if you have questions.
With GNU awk for gensub():
$ cut -d';' -f1 file | sort | uniq -c |
awk '{printf "%-5s: %s\n", $2, gensub(/ /,"#","g",sprintf("%*s",$1,""))}'
AD   : ###
EL   : ##
GR   : #
or with any awk:
$ cut -d';' -f1 file | sort | uniq -c |
awk '{str=sprintf("%*s",$1,""); gsub(/ /,"#",str); printf "%-5s: %s\n", $2, str}'
AD   : ###
EL   : ##
GR   : #
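The same pipeline can be tried end to end with the sample records fed inline instead of from a file; this is just the any-awk variant above with printf supplying the input:

```shell
# Count the first field with sort | uniq -c, then format each count
# as a bar of '#' via the %*s width trick.
printf '%s\n' 'AD;12;qwe' 'EL;66;qwe' 'AD;90;oop' 'AD;8;lpp' 'EL;56;oop' 'GR;55;qwe' |
cut -d';' -f1 | sort | uniq -c |
awk '{str = sprintf("%*s", $1, ""); gsub(/ /, "#", str)
      printf "%-5s: %s\n", $2, str}'
```

which prints:

AD   : ###
EL   : ##
GR   : #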
I want to propose a little change to the original code to get the desired result: simply append a "#" instead of keeping a count. Let the content of file.txt be
AD;12;qwe
EL;66;qwe
AD;90;oop
AD;8;lpp
EL;56;oop
GR;55;qwe
then
awk 'BEGIN{FS=";"}{a[$1]=a[$1] "#"}END{for(i in a){printf "%-5s: %s\n", i, a[i]}}' file.txt
output
EL   : ##
AD   : ###
GR   : #
For simplicity's sake I left out the echo and sort parts, as these remain unchanged.
(tested in gawk 4.2.1)
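For reference, here is that one-liner run against the sample records through a pipe rather than from file.txt, with sort appended to fix the order (the iteration order of for (i in a) is otherwise unspecified):

```shell
# Append one '#' per occurrence instead of counting, then sort.
printf '%s\n' 'AD;12;qwe' 'EL;66;qwe' 'AD;90;oop' 'AD;8;lpp' 'EL;56;oop' 'GR;55;qwe' |
awk -F';' '{a[$1] = a[$1] "#"}
           END {for (i in a) printf "%-5s: %s\n", i, a[i]}' |
sort
```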
Related
AWK: How to number auto-increment?
I have a file whose content is:

20210126000880000003|3|33.00|20210126|15:30
1|20210126000000000000000000002207|1220210126080109|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT07|621067000000123645|收款方户名|2021-01-26|2021-01-26|10.00|TN|NCS|12|875466
2|20210126000000000000000000002208|1220210126080110|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT06|621067000000123645|收款方户名|2021-01-26|2021-01-26|20.00|TN|NCS|12|875466
3|20210126000000000000000000002209|1220210126080111|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT08|621067000000123645|收款方户名|2021-01-26|2021-01-26|3.00|TN|NCS|12|875466

I use this awk command:

awk -F"|" 'NR==1{print $1};FNR==2{print $2,$3}' testfile

and get the following result:

20210126000880000003
20210126000000000000000000002207 1220210126080109

I want the number to auto-increment:

awk -F"|" 'NR==1{print $1+1};FNR==2{print $2+1,$3+1}' testfile

but get this result instead:

20210126000880001024
20210126000000000944237587726336 1220210126080110

The result I hope for is:

20210126000880000003
20210126000000000000000000002207|1220210126080109
-------------------------------------------------
20210126000880000004
20210126000000000000000000002208|1220210126080110
-------------------------------------------------
20210126000880000005
20210126000000000000000000002209|1220210126080111

How can I auto-increment here? Thanks!
You may try this gnu awk command:

awk -M 'BEGIN {FS=OFS="|"} NR == 1 {hdr = $1; next} NF>2 {print ++hdr; print $2, $3; print "-------------------"}' file

20210126000880000004
20210126000000000000000000002207|1220210126080109
-------------------
20210126000880000005
20210126000000000000000000002208|1220210126080110
-------------------
20210126000880000006
20210126000000000000000000002209|1220210126080111
-------------------

A more readable version:

awk -M 'BEGIN {
  FS = OFS = "|"
}
NR == 1 {
  hdr = $1
  next
}
NF > 2 {
  print ++hdr
  print $2, $3
  print "-------------------"
}' file

Here is a POSIX awk solution that doesn't need -M:

awk 'BEGIN {FS=OFS="|"} NR == 1 {hdr = $1; next} NF>2 {"echo " hdr " + 1 | bc" | getline hdr; print hdr; print $2, $3; print "-------------------"}' file

20210126000880000004
20210126000000000000000000002207|1220210126080109
-------------------
20210126000880000005
20210126000000000000000000002208|1220210126080110
-------------------
20210126000880000006
20210126000000000000000000002209|1220210126080111
-------------------
Anubhava has the best solution, but for older versions of GNU awk that don't support -M (big numbers) you can try the following:

awk -F\| 'NR==1 { print $1;hed=$1;hed1=substr($1,(length($1)-1));next; } !/^$/ {print $2" "$3 } /^$/ { print "--------------------------------------------------";printf "%s%s\n",substr(hed,1,((length(hed))-(length(hed1)+1))),++hed1 }' testfile

Explanation:

awk -F\| 'NR==1 {                      # Set the field delimiter to | and process the first line
  print $1;                            # Print the first field
  hed=$1;                              # Set the variable hed to the first field
  hed1=substr($1,(length($1)-1));      # Set a counter variable hed1 to the trailing digits of hed ($1)
  next;
}
!/^$/ {
  print $2" "$3                        # Where there is no blank line, print the second field, a space and the third field
}
/^$/ {
  print "--------------------------------------------------";   # Where there is a blank line, print the separator
  printf "%s%s\n",substr(hed,1,((length(hed))-(length(hed1)+1))),++hed1   # Print the header prefix before the counter, followed by the incremented counter
}' testfile
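If neither -M nor bc is available, another option is to do the increment on the digit string itself. The inc() helper below is my own illustration, not part of either answer above: it adds 1 with a manual carry, digit by digit, so 20-digit values never pass through a floating-point double and no precision is lost:

```shell
# inc(s): add 1 to a decimal integer held as a string.
# Walks from the last digit leftward, propagating the carry.
awk '
function inc(s,   i, d, out, carry) {
    carry = 1
    for (i = length(s); i >= 1 && carry; i--) {
        d = substr(s, i, 1) + carry     # add carry to this digit
        carry = (d > 9)                 # carry on if it overflowed
        out = (d % 10) out              # prepend the resulting digit
    }
    if (carry) out = "1" out            # all digits were 9
    return substr(s, 1, i) out          # untouched prefix + new suffix
}
BEGIN { print inc("20210126000880000003"); print inc("999") }'
```

This prints 20210126000880000004 and 1000, exactly the arbitrary-precision results.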
Awk column with pattern array
Is it possible to do this but use an actual array of strings where it says "array"?

array=(cat dog mouse fish ...)
awk -F "," '{ if ( $5!="array" ) { print $0; } }' file

I would like to use spaces in some of the strings in my array. I would also like to be able to match partial matches, so "snow" in my array would match "snowman". It should be case sensitive.

Example csv:

s,dog,34
3,cat,4
1,african elephant,gd
A,African Elephant,33
H,snowman,8
8,indian elephant,3k
7,Fish,94
...

Example array:

snow
dog
african elephant

Expected output:

s,dog,34
H,snowman,8
1,african elephant,gd

Cyrus posted this, which works well, but it doesn't allow spaces in the array strings and won't match partial matches:

echo "${array[@]}" | awk 'FNR==NR{len=split($0,a," "); next} {for(i=1;i<=len;i++) {if(a[i]==$2){next}} print}' FS=',' - file
The brief approach using a single regexp for all array contents:

$ array=('snow' 'dog' 'african elephant')
$ printf '%s\n' "${array[@]}" | awk -F, 'NR==FNR{r=r s $0; s="|"; next} $2~r' - example.csv
s,dog,34
1,african elephant,gd
H,snowman,8

Or if you prefer string comparisons:

$ cat tst.sh
#!/bin/env bash
array=('snow' 'dog' 'african elephant')

printf '%s\n' "${array[@]}" | awk -F',' '
NR==FNR {
    array[$0]
    next
}
{
    for (val in array) {
        if ( index($2,val) ) {  # or $2 ~ val for a regexp match
            print
            next
        }
    }
}
' - example.csv

$ ./tst.sh
s,dog,34
1,african elephant,gd
H,snowman,8
This prints no line from the csv file which contains an element from the array in column 5:

echo "${array[@]}" | awk 'FNR==NR{len=split($0,a," "); next} {for(i=1;i<=len;i++) {if(a[i]==$5){next}} print}' FS=',' - file
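A small runnable sketch of the index()-based matching from the answer above (the pats.txt file name is just for illustration): index($2, p) is a plain substring test, so "snow" matches "snowman" and multi-word patterns like "african elephant" work unchanged:

```shell
# One pattern per line in pats.txt; print csv lines whose 2nd field
# contains any pattern as a substring (case sensitive).
printf '%s\n' 'snow' 'dog' 'african elephant' > pats.txt
printf '%s\n' 's,dog,34' '3,cat,4' '1,african elephant,gd' 'H,snowman,8' |
awk -F',' 'NR==FNR {pats[$0]; next}
           {for (p in pats) if (index($2, p)) {print; next}}' pats.txt -
```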
Print columns from two files
How can I print columns from various files? I tried, following Awk: extract different columns from many different files:

paste <(awk '{printf "%.4f %.5f ", $1, $2}' FILE.R ) <(awk '{printf "%.6f %.0f.\n", $3, $4}' FILE_R )

FILE.R == ARGV[1] { one[FNR]=$1 }
FILE.R == ARGV[2] { two[FNR]=$2 }
FILE_R == ARGV[3] { three[FNR]=$3 }
FILE_R == ARGV[4] { four[FNR]=$4 }
END {
  for (i=1; i<=length(one); i++) {
    print one[i], two[i], three[i], four[i]
  }
}

but I don't understand how to use this script.

FILE.R

56604.6017 2.3893 2.2926 2.2033
56605.1562 2.3138 2.2172 2.2033

FILE_R

56604.6017 2.29259 0.006699 42.
56605.1562 2.21716 0.007504 40.

Desired output:

56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.

Thank you
This is one way:

$ awk -v OFS="\t" 'NR==FNR{a[$1]=$2;next}{print $1,a[$1],$3,$4}' file1 file2

Output:

56604.6017  2.3893  0.006699  42.
56605.1562  2.3138  0.007504  40.

Explained:

$ awk -v OFS="\t" '         # set the output field separator to a tab
NR==FNR {                   # process the first file
    a[$1]=$2                # hash the second field, use first as key
    next
}
{
    print $1,a[$1],$3,$4    # output
}' file1 file2

If the field spacing with tabs is not enough, use printf with modifiers like in your sample.
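The NR==FNR idiom above can be checked end to end with throwaway files (f1.txt and f2.txt are illustrative names): the first pass fills the hash, the second splices the remembered column into each line. Default OFS (a space) is used here so the result matches the question's desired output:

```shell
# Remember column 2 of the first file keyed by column 1, then
# substitute it into each matching line of the second file.
printf '%s\n' '56604.6017 2.3893 2.2926 2.2033' '56605.1562 2.3138 2.2172 2.2033' > f1.txt
printf '%s\n' '56604.6017 2.29259 0.006699 42.' '56605.1562 2.21716 0.007504 40.' > f2.txt
awk 'NR==FNR {a[$1]=$2; next} {print $1, a[$1], $3, $4}' f1.txt f2.txt
```

which prints:

56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.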
How to get cardinality of fields with AWK?
I am trying to count the unique occurrences for each field in a txt file.

Sample:

2008,12,13,6,1007,847,1149,1010,DL,1631,N909DA,162,143,122,99,80,ATL,IAH,689,8,32,0,,0,1,0,19,0,79
2008,12,13,6,638,640,808,753,DL,1632,N604DL,90,73,50,15,-2,JAX,ATL,270,14,26,0,,0,0,0,15,0,0
2008,12,13,6,756,800,1032,1026,DL,1633,N642DL,96,86,56,6,-4,MSY,ATL,425,23,17,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,612,615,923,907,DL,1635,N907DA,131,112,103,16,-3,GEG,SLC,546,5,23,0,,0,0,0,16,0,0
2008,12,13,6,749,750,901,859,DL,1636,N646DL,72,69,41,2,-1,SAV,ATL,215,20,11,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1002,959,1204,1150,DL,1636,N646DL,122,111,71,14,3,ATL,IAD,533,6,45,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,834,835,1021,1023,DL,1637,N908DL,167,168,139,-2,-1,ATL,SAT,874,5,23,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,655,700,856,856,DL,1638,N671DN,121,116,85,0,-5,PBI,ATL,545,24,12,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1251,1240,1446,1437,DL,1639,N646DL,115,117,89,9,11,IAD,ATL,533,13,13,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1110,1103,1413,1418,DL,1641,N908DL,123,135,104,-5,7,SAT,ATL,874,8,11,0,,0,NA,NA,NA,NA,NA

Full dataset here: https://github.com/markgrover/cloudcon-hive (flight delay dataset from 2008).

For a single column we can do:

for i in $(seq 1 28); do cut -d',' -f$i 2008.csv | head | sort | uniq | wc -l ; done | tr '\n' ':' ; echo

Is there a way to do it in one go for all the columns? I think the expected output looks like this:

1:1:1:1:10:10:10:10:1:10:9:9:6:9:9:9:2:5:5:5:6:1:1:1:3:2:2:2:

For the entire dataset:

1:12:31:7:1441:1217:1441:1378:20:7539:5374:690:526:664:1154:1135:303:304:1435:191:343:2:5:2:985:600:575:157:
With GNU awk for true multi-dimensional arrays:

$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{ for (i=1; i<=NF; i++) { vals[i][$i] } }
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", length(vals[i]), (i<NF?OFS:ORS)
    }
}

$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3

and with any awk:

$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{ for (i=1; i<=NF; i++) { if ( !seen[i,$i]++ ) { cnt[i]++ } } }
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", cnt[i], (i<NF?OFS:ORS)
    }
}

$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
In GNU awk:

$ awk '
BEGIN { FS=OFS="," }           # set delimiters to ,
{
    for(i=1;i<=NF;i++)         # iterate over every field
        a[i][$i]               # store unique values to 2d hash
}
END {                          # after all the records
    for(i=1;i<=NF;i++)         # iterate the unique values for each field
        for(j in a[i])
            c[i]++             # count them and
    for(i=1;i<=NF;i++)
        printf "%s%s",c[i], (i==NF?ORS:OFS)   # output the values
}' file
1,1,1,1,10,10,10,10,1,9,7,10,10,10,10,9,8,5,8,8,8,1,1,1,3,2,4,2,3

The output is not exactly the same; I'm not sure whether the mistake is yours or mine. Well, the last column has the values 79, 0 and NA, so mine is more accurate on that one.
another awk

This will give you rolling counts; pipe to tail -1 to get the last line for the overall counts:

$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++) printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file
1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1
1:1:1:1:2:2:2:2:1:2:2:2:2:2:2:2:2:2:2:2:2:1:1:1:2:1:2:1:2
1:1:1:1:3:3:3:3:1:3:3:3:3:3:3:3:3:2:3:3:3:1:1:1:3:2:3:2:3
1:1:1:1:4:4:4:4:1:4:4:4:4:4:4:4:4:3:4:4:4:1:1:1:3:2:4:2:3
1:1:1:1:5:5:5:5:1:5:5:5:5:5:5:5:5:3:5:5:5:1:1:1:3:2:4:2:3
1:1:1:1:6:6:6:6:1:5:5:6:6:6:6:6:5:4:6:6:6:1:1:1:3:2:4:2:3
1:1:1:1:7:7:7:7:1:6:6:7:7:7:7:6:5:5:7:6:6:1:1:1:3:2:4:2:3
1:1:1:1:8:8:8:8:1:7:7:8:8:8:8:7:6:5:8:7:7:1:1:1:3:2:4:2:3
1:1:1:1:9:9:9:9:1:8:7:9:9:9:9:8:7:5:8:8:8:1:1:1:3:2:4:2:3
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
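The portable composite-key version is easy to verify on a toy input: seen[i,$i] joins the column number and value with SUBSEP, so no gawk true 2-D arrays are needed, and cnt[i] grows only the first time a (column, value) pair appears:

```shell
# Count distinct values per column with composite keys (any awk).
printf '%s\n' 'a,x,1' 'a,y,1' 'b,y,2' |
awk -F',' '{for (i=1; i<=NF; i++) if (!seen[i,$i]++) cnt[i]++}
           END {for (i=1; i<=NF; i++) printf "%s%s", cnt[i], (i<NF ? ":" : "\n")}'
```

Column 1 holds {a, b}, column 2 {x, y} and column 3 {1, 2}, so this prints 2:2:2.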
awk: gsub /pattern1/, but not /pattern1pattern2/
In my work, I have to solve such a simple problem: change pattern1 to newpattern, but only if it is not followed by pattern2 or pattern3:

"pattern1 pattern1pattern2 pattern1pattern3 pattern1pattern4"
→ "newpattern pattern1pattern2 pattern1pattern3 newpatternpattern4"

Here is my solution, but I don't like it and I suppose there should be a more elegant and easy way to do that:

$ echo 'pattern1 pattern1pattern2 pattern1pattern3 pattern1pattern4' | awk '
{gsub(/pattern1pattern2/, "###", $0)
 gsub(/pattern1pattern3/, "%%%", $0)
 gsub(/pattern1/, "newpattern", $0)
 gsub(/###/, "pattern1pattern2", $0)
 gsub(/%%%/, "pattern1pattern3", $0)
 print}'
newpattern pattern1pattern2 pattern1pattern3 newpatternpattern4

So, the sample input file:

pattern1 pattern1pattern2 aaa_pattern1pattern3 pattern1pattern4 pattern1pattern2pattern1

The sample output file should be:

newpattern pattern1pattern2 aaa_pattern1pattern3 newpatternpattern4 pattern1pattern2newpattern
This is trivial in perl, using a negative lookahead:

perl -pe 's/pattern1(?!pattern[23])/newpattern/g' file

Substitute all matches of pattern1 that are not followed by pattern2 or pattern3.

If for some reason you need to do it in awk, then here's one way you could go about it:

{
    out = ""
    replacement = "newpattern"

    while (match($0, /pattern1/)) {
        if (substr($0, RSTART + RLENGTH) ~ /^pattern[23]/) {
            out = out substr($0, 1, RSTART + RLENGTH - 1)
        } else {
            out = out substr($0, 1, RSTART - 1) replacement
        }
        $0 = substr($0, RSTART + RLENGTH)
    }
    print out $0
}

Consume the input while pattern1 matches and build the string out, inserting the replacement when the part after each match isn't pattern2 or pattern3. Once there are no more matches, print the string that has been built so far, followed by whatever is left in the input.
With GNU awk for the 4th arg to split():

$ cat tst.awk
{
    split($0,flds,/pattern1(pattern2|pattern3)/,seps)
    for (i=1; i in flds; i++) {
        printf "%s%s", gensub(/pattern1/,"newpattern","g",flds[i]), seps[i]
    }
    print ""
}

$ awk -f tst.awk file
newpattern pattern1pattern2 aaa_pattern1pattern3 newpatternpattern4 pattern1pattern2newpattern

With other awks you can do the same with a while(match()) loop:

$ cat tst.awk
{
    while ( match($0,/pattern1(pattern2|pattern3)/) ) {
        tgt = substr($0,1,RSTART-1)
        gsub(/pattern1/,"newpattern",tgt)
        printf "%s%s", tgt, substr($0,RSTART,RLENGTH)
        $0 = substr($0,RSTART+RLENGTH)
    }
    gsub(/pattern1/,"newpattern",$0)
    print
}

$ awk -f tst.awk file
newpattern pattern1pattern2 aaa_pattern1pattern3 newpatternpattern4 pattern1pattern2newpattern

but obviously the gawk solution is simpler and more concise so, as always, get gawk!
awk solution. Nice question. Basically it's doing 2 gensubs:

$ cat tst.awk
{
    for (i=1; i<=NF; i++){
        s=gensub(/pattern1/, "newpattern", "g", $i);
        t=gensub(/(newpattern)(pattern(2|3))/, "pattern1\\2", "g", s);
        $i=t
    }
}1

Testing:

echo "pattern1 pattern1pattern2 aaa_pattern1pattern3 pattern1pattern4 pattern1pattern2pattern1" | awk -f tst.awk
newpattern pattern1pattern2 aaa_pattern1pattern3 newpatternpattern4 pattern1pattern2newpattern

However, this will fail whenever you already have something like newpatternpattern2 in your input. But that's not what the OP suggests with his input examples, I guess.
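As a portable end-to-end check of the match()/substr() idea (this sketch restates the loop from the earlier answers rather than adding anything new): each pattern1 hit is copied through unchanged when the remainder starts with pattern2 or pattern3, and replaced otherwise:

```shell
# Replace pattern1 only when NOT followed by pattern2/pattern3,
# emulating a negative lookahead in plain POSIX awk.
echo 'pattern1 pattern1pattern2 pattern1pattern3 pattern1pattern4' |
awk '{
    out = ""
    while (match($0, /pattern1/)) {
        head = substr($0, 1, RSTART - 1)          # text before the hit
        rest = substr($0, RSTART + RLENGTH)       # text after the hit
        if (rest ~ /^pattern[23]/)
            out = out head "pattern1"             # keep it
        else
            out = out head "newpattern"           # replace it
        $0 = rest
    }
    print out $0
}'
```

This prints: newpattern pattern1pattern2 pattern1pattern3 newpatternpattern4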