About awk and integer to ASCII character conversion - awk

Just to make sure: is it really the case that using awk (GNU awk at least) I can convert:
from octal to ASCII by:
print "\101" # or a="\101"
A
from hex to ASCII:
print "\x41" # or b="\x41"
B
but from decimal to ASCII I have to:
printf "%c\n", 67 # or c=sprintf("%c", 67)
C
Is there no secret print "\?67" in the manual that I missed?
I'm trying to get character frequencies from $0="aabccc" like:
for(i=141; i<=143; i++) a=a gsub("\\"i, ""); print a
213
but using decimals (instead of the octals in the above example). The decimal approach seems awfully long:
$ cat foo
aabccc
$ awk '{for(i=97;i<=99;i++){c=sprintf("%c",i);a=a gsub(c,"")} print a}' foo
213

No - \nnn is octal and \xnn is hex, and that's all there is for including characters you cannot include as-is in strings. You should always use the octal, not the hex, representation for robustness (see, for example, http://awk.freeshell.org/PrintASingleQuote).
I don't understand the last part of your question where you state what you're trying to do with this - provide concise, testable sample input and expected output and I'm sure someone can help you do it the right way, whatever it is.
Is this what you're trying to do?
$ awk 'BEGIN{for (i=0141; i<0143; i++) print i}'
97
98
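And combining those octal source constants with %c gives a decimal-free version of your loop (a sketch; it assumes gawk, since octal constants in source code are a gawk extension):
$ awk 'BEGIN{for (i=0141; i<=0143; i++) printf "%c\n", i}'
a
b
c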

Within awk alone, a lookup table is the only way to convert a character directly to its decimal (ASCII) code; the opposite direction, decimal to character, is simply sprintf("%c", n).
You can create the lookup table by iterating through each of the known character codes and storing them in an array where the key is the character and the value is its code: use sprintf() to get the character for each decimal value, then index the array with a character to get the decimal back.
In this awk example we cycle through all 256 characters, printing out each one; split the resulting stream into one character per line; then build a 256-entry table (in BEGIN) and feed each input character into it, printing the code for each character on the input.
awk 'BEGIN{
  for (n=0; n<256; n++)
    print sprintf("%c", n)
}' | awk '{
  for (i=0; ++i <= length($0);)
    printf "%s\n", substr($0, i, 1)
}' | awk 'BEGIN{
  for (n=0; n<256; n++)
    ord[sprintf("%c", n)] = n
}{
  print ord[$1]
}'
The reverse can also be done, where we look up a list of character codes.
awk 'BEGIN{
  for (n=0; n<256; n++)
    print n
}' | awk 'BEGIN{
  for (n=0; n<256; n++)
    char[n] = sprintf("%c", n)
}{
  print char[$1]
}'
Note: The second example may print out a lot of garbage in the high ascii range (> 128) depending on the character set you are using.
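If you only care about printable ASCII, a sketch that sidesteps the high-range garbage is to build the table over the printable range only (assuming the usual 32-126):
awk 'BEGIN{
  for (n=32; n<127; n++)
    ord[sprintf("%c",n)] = n   # printable characters only
}{
  print ord[$1]
}'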

If as you say at the end of your question you're simply looking to count the frequency of characters, I'd just assemble an array.
$ awk '{for(i=1;i<=length($0);i++) a[substr($0,i,1)]++} END{for(i in a) printf "%d %s\n",a[i],i}' <<<$'aabccc\ndaae'
1 d
1 e
4 a
1 b
3 c
Note that this also supports multi-line input.
We're stepping through each line of input, incrementing a counter that is an array subscript keyed with the character in question.
I would expect this approach to be more performant than applying a regex to count the replacements for every interesting character, but I haven't done any speed comparison tests (and of course it would depend on how large a set you're interested in).
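(If you do want to measure, a rough harness along these lines would do - bigfile here is a placeholder for any large sample you generate yourself:)
$ time awk '{for(i=1;i<=length($0);i++) a[substr($0,i,1)]++} END{for(i in a) printf "%d %s\n",a[i],i}' bigfile >/dev/null
$ time awk '{for(i=97;i<=122;i++){c=sprintf("%c",i); n[c]+=gsub(c,"")}} END{for(i in n) printf "%d %s\n",n[i],i}' bigfile >/dev/null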
While this answer doesn't address your initial question, I hope it'll provide a better way to approach the problem.
(Thanks for including the final details in your question. XY problems are all too frequent here.)

If you need to encode bytes to octal escapes in awk, here's a fully self-contained, recursive, cross-awk-compatible octal encoder I came up with previously.
Verified on gawk, mawk-1, mawk-2, and nawk; benchmarked at a throughput rate of 39.2 MByte/sec:
out9: 1.82GiB 0:00:47 [39.2MiB/s] [39.2MiB/s] [ <=> ]
in0: 466MiB 0:00:00 [1.78GiB/s] [1.78GiB/s] [>] 100%
( pvE 0.1 in0 < "${m3l}" | mawk2x ; )
39.91s user 6.94s system 98% cpu 47.656 total
78b4c27659ae66e4c98796a60043f1fe stdin
echo "${data}" | awk '{
print octencode_v7($0)
}
function octencode_v7(______,_,__,___,____,_____,_______) {
if ( ( (_+=_+=_^=_<_\
)^_*(_+_)*(_^_)^(!(("\1"~"\0")+\
index(-+log(_<_),"+") ) ) )<=(___=\
(_^=_<_)<length("\333\222")\
? length(______) : match(______,"$")-_)) {
return \
octencode_v7(substr(______,_^=_<_,_=int(___/(_+_)))) \
octencode_v7(substr(______,++_))
}
_______=___
___="\36\6\2\24"
gsub(/\\/,___,______)
_______-=gsub("["(!_)"-"(_+(_+=++_+_))"]", "\\"(!_)(_)"&",______)
_--;
if (!+(_______-=gsub(___, "\\"(_--^--_+_*_),______) \
- gsub("[[]","\\" ((_^!_)(_)_),______) \
- gsub(/\^/, "\\" ((_^!_)(_)(_+_)),______))) {
return ______
}
____=___=_+=_^=_<_
_____=(___^=++____)-_^(____=!_)
do { ___=_____
do { __=_____
if (+____ || (_____-___)!=_^(_<_)) {
do { _=(____)(___)__
if (+____!=_^(_<_) || ! index(___^___,_____) ||
+__!~"^["(_____^___+___)"]$") {
_="\\"(_)
_______-=gsub(((!+____ && +_____<(___+___)) ||
(+____==_^(_<_) &&
( +___==+_____ ||
(___+____+___)==+_____))) \
? "["(_)"]" : (_), _,______)
} } while(__--)
} } while(___--)
if (!_______) {
return ______
} } while((++____+____)<_____)
return ______
}'
It's basically a triple-nested do-while loop combo that cycles through all the octal codes without needing any previously built lookup strings or arrays.
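For comparison, a far simpler sketch does the same job with a 256-entry lookup table built up front; it assumes a byte-oriented awk (e.g. mawk, or gawk -b) and input free of NUL bytes:
awk 'BEGIN{
  for (n=1; n<256; n++)                        # skip NUL; see assumption above
    oct[sprintf("%c",n)] = sprintf("\\%03o",n)
}{
  out = ""
  for (i=1; i<=length($0); i++)
    out = out oct[substr($0,i,1)]
  print out
}'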

Re: "The second example may print out a lot of garbage in the high ascii range (> 128) depending on the character set you are using" - this can be circumvented by using the octal codes \200-\377 for 128-255.
IIRC the bytes C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF shouldn't exist within properly encoded UTF-8 documents (or aren't yet spec'ed to). FE and FF may overlap with the UTF-16 byte order mark, but that should hardly be a concern today, since the world has standardized on UTF-8.


Automatically round off - awk command

I have a csv file and I want to add a column that takes some values from other columns and makes some calculations. As a simplified version I'm trying this:
awk -F"," '{print $0, $1+1}' myFile.csv |head -1
The output is:
29.325172701023977,...other columns..., 30
The column added should be 30.325172701023977 but the output is rounded off.
I tried some options using printf, CONVFMT and OFMT but nothing worked.
How can I avoid the round off?
Assumptions:
the number of decimal places is not known beforehand
the number of decimal places can vary from line to line
Setup:
$ cat myfile.csv
29.325172701023977,...other columns...
15.12345,...other columns...
120.666777888,...other columns...
46,...other columns...
One awk idea where we use the number of decimal places to dynamically generate the printf "%.?f" format:
awk '
BEGIN { FS=OFS="," }
{ split($1,arr,".") # split $1 on period
numdigits=length(arr[2]) # count number of decimal places
newNF=sprintf("%.*f",numdigits,$1+1) # calculate $1+1 and format with "numdigits" decimal places
print $0,newNF # print new line
}
' myfile.csv
NOTE: this assumes the OP's locale uses a period as the decimal separator. For a locale that uses a comma to separate integer from fraction it gets more complicated, since without some change to the file's format it is impossible to distinguish a comma used as a decimal separator from a comma used as the field delimiter.
This generates:
29.325172701023977,...other columns...,30.325172701023977
15.12345,...other columns...,16.12345
120.666777888,...other columns...,121.666777888
46,...other columns...,47
As long as you aren't dealing with numbers greater than 9E15, there's no need to fudge any of CONVFMT, OFMT, or s/printf() at all:
{m,g}awk '$++NF = int((_=$!__) + sub("^[^.]*",__,_))_' FS=',' OFS=','
29.325172701023977,...other columns...,30.325172701023977
15.12345,...other columns...,16.12345
120.666777888,...other columns...,121.666777888
46,...other columns...,47
If mawk-1 is sending your numbers to scientific notation, do:
mawk '$++NF=int((_=$!!NF)+sub("^[^.]*",__,_))_' FS=',' OFS=',' CONVFMT='%.f'
When you scroll right you'll notice all input digits beyond the decimal point are fully preserved:
2929292929.32323232325151515151727272727270707070701010101010232323232397979797977,...other columns...,2929292930.32323232325151515151727272727270707070701010101010232323232397979797977
1515151515.121212121234343434345,...other columns...,1515151516.121212121234343434345
12121212120.66666666666767676767777777777788888888888,...other columns...,12121212121.66666666666767676767777777777788888888888
4646464646,...other columns...,4646464647
2929.32325151727270701010232397977,...other columns...,2930.32325151727270701010232397977
1515.121234345,...other columns...,1516.121234345
12120.66666767777788888,...other columns...,12121.66666767777788888
4646,...other columns...,4647
Change it to CONVFMT='%\47.f' and you can even get mawk-1 to comma-format them nicely for you:
29292929292929.323232323232325151515151515172727272727272707070707070701010101010101023232323232323979797979797977,...other columns...,29,292,929,292,930.323232323232325151515151515172727272727272707070707070701010101010101023232323232323979797979797977
15151515151515.12121212121212343434343434345,...other columns...,15,151,515,151,516.12121212121212343434343434345
121212121212120.666666666666666767676767676777777777777777888888888888888,...other columns...,121,212,121,212,121.666666666666666767676767676777777777777777888888888888888
46464646464646,...other columns...,46,464,646,464,647
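If you'd rather see the trick spelled out, here's a readable sketch of the same carry-the-fraction idea: split off the fractional part as a string, increment the integer part numerically, then glue the untouched fraction back on:
awk 'BEGIN{FS=OFS=","} {frac=$1; sub(/^[^.]*/,"",frac); $(NF+1)=(int($1)+1) frac} 1' myfile.csv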

Extract two different types of values from a file and print it to an output file

I have a file where the data looks like:
sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcription-activation domain
A=9 C=2 D=3 E=4 F=2 G=15 I=3 K=3 L=9 M=3 N=5 P=2 Q=11 R=8 S=12 T=6 V=8 W=1 Y=5
Amino acid alphabet = 19
Sequence length = 115
sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extrusion sbmate
A=10 C=2 D=4 E=4 F=2 G=15 H=1 I=2 K=4 L=7 M=2 N=5 P=3 Q=6 R=4 S=18 T=7 V=10 W=5 Y=10
Amino acid alphabet = 20
Sequence length = 126
sp_0021_LgLlamabotulinum neurotoxin BoNT serotype F
A=14 C=2 D=4 E=5 F=4 G=15 I=2 K=3 L=6 M=2 N=6 P=4 Q=7 R=8 S=13 T=10 V=8 W=3 Y=10
Amino acid alphabet = 19
Sequence length = 131
I want to extract the values of 'Amino acid alphabet' and 'Sequence length' into an output file, and it should look like:
19 115
20 126
19 131
As I am new to bash, all I could try so far is:
grep -i "Amino acid alphabet = $i" test.txt >>out.txt
But, I don't want the word "Amino acid alphabet" in the output. I only want the values of "Amino acid alphabet" and "Sequence length" as two columns.
Can I get any help how to do that? Thanks in advance.
$ awk -v RS= '{print $(NF-4), $NF}' file
19 115
20 126
19 131
Assuming both fields exist for all your records:
awk '/^Amino acid alphabet/{printf $NF FS} /^Sequence length/{print $NF}' file
19 115
20 126
19 131
You may also want to read the introduction to awk on the awk wiki.
This code: grep -i "Amino acid alphabet = $i" test.txt >>out.txt includes the shell expansion of $i. If you have not given a value to i then the search pattern resolves to Amino acid alphabet = , and thus will find each line that contains that. The $i would change the search pattern if $i had a value.
There are many ways to get what you want with bash. One is to use grep with PCRE (Perl-style) regex enabled:
grep -Po "(?<=Amino acid alphabet = )\d+" test.txt >> out.txt
#yields:
19
20
19
(?<=string) tells grep that for the rest to match, it must be preceded by string, but string is not part of the match. In -Po, -P enables PCRE (Perl-style) regex and -o prints only the match rather than the whole line in which the match was found.
Note the output redirection: >> appends to a file that already contains lines, while > overwrites an existing file (without asking for confirmation!).
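If you want both columns straight from grep, one sketch is to match both labels at once (PCRE permits alternated fixed-length lookbehinds) and pair up the resulting lines, tab-separated, with paste:
grep -Po "(?<=Amino acid alphabet = |Sequence length = )\d+" test.txt | paste - -
19	115
20	126
19	131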
sed can do this too.
sed -En '/^Amino acid alphabet =/h; /^Sequence length =/{ H; x; s/[^0-9]+/ /g; s/^ //; p; }' infile > outfile
/^Amino acid alphabet =/h stores the first line in the save buffer.
/^Sequence length =/{ triggers all the steps inside the curlies.
H adds the current line to the save buffer.
x swaps the save buffer back to the workspace.
s/[^0-9]+/ /g; changes every sequence of non-digits, including the newline, to a single space.
s/^ //; removes the leading space.
p prints the output line for this data set.

awk command or sed command

000Bxxxxx111118064085vxas - header
10000000001000000000053009-000000000053009-
10000000005000000000000000+000000000000000+
10000000030000000004025404-000000004025404-
10000000039000000000004930-000000000004930-
10000000088000005417665901-000005417665901-
90000060883328364801913 - trailer
In the above file we have a header and a trailer; the records that start with 1 are the detail records.
In each detail record I want to sum the values starting from position 28 through 44, including the sign, using an awk/sed command.
Here is sed, with help from bc to do the arithmetic:
sed -rn '
/header|trailer/! {
s/[[:digit:]]*[+-]([[:digit:]]+)([+-])$/\2\1/
H
}
$ {
x
s/\n//gp
}
' file | bc
I assume the +/- sign follows the number.
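For the sample file the hold space collapses to the single expression
-000000000053009+000000000000000-000000004025404-000000000004930-000005417665901
which bc evaluates to -5421749244, the same total the awk answers below arrive at.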
Using awk, we can solve this problem with substr:
substr(s, m[, n ]):
Return the at most n-character substring of s that begins at position m, numbering from 1. If n is omitted, or if n specifies more characters than are left in the string, the length of the substring shall be limited by the length of the string s.
This allows us to take the string which represents the number. Here I assume that the sign before and after the number is the same, and is the sign of the number:
$ echo "10000000001000000000053009-000000000053009-" \
| awk '{print length($0); print substr($0,27,43-27)}'
43
-000000000053009
Since awk implicitly converts strings to numbers when you do numeric operations on them, we can write the following awk code to compute the requested sum:
$ awk '/header|trailer/{next}
{s+=substr($0,27,43-27)}
END{print s}' file.dat
-5421749244
Or in a single line:
$ awk '/header|trailer/{next}{s+=substr($0,27,43-27)} END{print s}' file.dat
-5421749244
The above examples work on the example file given by the OP. If, however, you have a file containing multiple blocks with header and trailer, and you only want to sum the lines inside those blocks (excluding everything outside them), you should handle it a bit differently:
$ awk '/header/{s=0;c=1;next}
/trailer/{S+=s;c=0;next}
c{s+=substr($0,27,43-27)}
END{print S}' file.dat
Here we do the following:
If a line with header is found, reset the block sum s to zero and set c=1, indicating that the following lines are taken into account.
If a line with trailer is found, add the block sum s to the overall sum S and set c=0, indicating that the following lines are ignored.
If c != 0, add the current line's value to the block sum s.
At the END, print the total sum S.
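If gawk specifically is available, its FIELDWIDTHS feature expresses the same fixed-width slicing without the substr() arithmetic - a sketch, with the widths 26/16/1 matching the sample layout:
gawk 'BEGIN{FIELDWIDTHS="26 16 1"}
/header|trailer/{next}
{s+=$2}                 # $2 is the signed 16-character amount
END{print s}' file.dat
-5421749244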

Comparing hexadecimal values with awk

I'm having trouble with awk and comparing values. Here's a minimal example:
echo "0000e149 0000e152" | awk '{print($1==$2)}'
This outputs 1 instead of 0. What am I doing wrong? And how should I compare such values?
Thanks,
To convert a string representing a hex number to a numerical value, you need 2 things: prefix the string with "0x" and use the strtonum() function.
To demonstrate:
echo "0000e149 0000e152" | gawk '{
print $1, $1+0
print $2, $2+0
n1 = strtonum("0x" $1)
n2 = strtonum("0x" $2)
print $1, n1
print $2, n2
}'
0000e149 0
0000e152 0
0000e149 57673
0000e152 57682
We can see that when the strings are naively treated as numbers, awk thinks their value is 0. That's because each parses as scientific notation with a zero mantissa: 0000e149 is 0 x 10^149, i.e. 0.
Ref: https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
Note that strtonum() is a GNU awk extension.
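On awks without strtonum (mawk, the BWK one-true-awk), a portable sketch of a hand-rolled converter does the same job:
echo "0000e149 0000e152" | awk '
function hex2dec(s,    i, n, c) {          # i, n, c are locals
  s = tolower(s); n = 0
  for (i = 1; i <= length(s); i++) {
    c = index("0123456789abcdef", substr(s, i, 1))
    if (c) n = n*16 + (c - 1)              # silently skips any non-hex char
  }
  return n
}
{ print hex2dec($1), hex2dec($2), (hex2dec($1) == hex2dec($2)) }'
57673 57682 0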
You need to convert $1 and $2 to strings in order to force an alphanumeric comparison. This can be done by simply appending an empty string to each:
echo "0000e149 0000e152" | awk '{print($1""==$2"")}'
Otherwise awk performs a numeric comparison. To do so, awk converts both values to numbers, and both convert to 0: each string parses as scientific notation with a zero mantissa (0000e149 is 0 x 10^149), so the two compare equal. You can verify that using the following command:
echo "0000e149 0000e152" | awk '{print $1+0; print $2+0)}'
0
0
When using non-decimal data you just need to tell gawk that's what you're doing and specify what base you're using in each number:
$ echo "0xe152 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
1
$ echo "0xE152 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
1
$ echo "0xe149 0x0000e152" | awk --non-decimal-data '{print($1==$2)}'
0
See http://www.gnu.org/software/gawk/manual/gawk.html#Nondecimal-Data
Many forget that the hex digits 0-9, A-F, a-f rank in order in ASCII - so instead of spending time on the conversion, or risking a numeric-precision shortfall, simply:
trim the leading-edge zeros, including the optional 0x / 0X
depending on the input source, also trim delimiters such as ":" (e.g. IPv6, MAC address), "-" (e.g. UUID), "_" (e.g. "0xffff_ffff_ffff_ffff"), "%" (e.g. URL-encoding) etc.
- be mindful of the need to pad back missing leading zeros for formats that are very flexible with delimiters, such as IPv6
compare their respective string length()s:
if those differ, then one is already distinctly larger
- otherwise
prefix both with something meaningless like "\1" to guarantee a string-compare operation, without risking either awk being too smart or extreme edge cases like locale-specific peculiarities in the collating order:
(("\1") toupper(hex_str_1)) == (("\1") toupper(hex_str_2))

setting default numeric format in awk

I wanted to do a simple parsing of two files with ids and some corresponding numerical values. I didn't want awk to print numbers in scientific notation.
File looks like this:
someid-1 860025 50.0401 4.00022
someid-2 384319 22.3614 1.78758
someid-3 52096 3.03118 0.242314
someid-4 43770 2.54674 0.203587
someid-5 33747 1.96355 0.156967
someid-6 20281 1.18004 0.0943328
someid-7 12231 0.711655 0.0568899
someid-8 10936 0.636306 0.0508665
someid-9 10224.8 0.594925 0.0475585
someid-10 10188.8 0.59283 0.047391
When I use print instead of printf:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); print $1,k[2],k[3],k[4],$2,$3,$4}' OSCAo.txt dme_miRNA_PIWI_OSC.txt | sort -n -r -k 7 | head
I get this result:
dme-miR-iab-4-5p 0.333333 0.000016 0.000001 0.25 0.000605606 9.36543e-07
dme-miR-9c-5p 10987.300000 0.525413 0.048798 160.2 0.388072 0.000600137
dme-miR-9c-3p 731.986000 0.035003 0.003251 2.10714 0.00510439 7.89372e-06
dme-miR-9b-5p 30322.500000 1.450020 0.134670 595.067 1.4415 0.00222922
dme-miR-9b-3p 2628.280000 0.125684 0.011673 48 0.116276 0.000179816
dme-miR-9a-3p 10.365000 0.000496 0.000046 0.25 0.000605606 9.36543e-07
dme-miR-999-5p 103.433000 0.004946 0.000459 0.0769231 0.00018634 2.88167e-07
dme-miR-999-3p 1513.790000 0.072389 0.006723 28 0.0678278 0.000104893
dme-miR-998-5p 514.000000 0.024579 0.002283 73 0.176837 0.000273471
dme-miR-998-3p 3529.000000 0.168756 0.015673 42 0.101742 0.000157339
Notice the scientific notation in the last column.
I understand that printf with an appropriate format modifier can do the job, but the code becomes very lengthy. I have to write something like this:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\t%3.6f\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt
This becomes clumsy when I have to parse fileout with another similarly structured file.
Is there any way to specify a default numeric output format, such that strings are printed as strings but all numbers follow a particular format?
I think you misinterpreted the meaning of %3.6f. The first number is the field width, not the "number of digits before the decimal point" (see printf(3)).
So you should use %10.6f instead. It's easy to test in bash:
$ printf "%3.6f\n%3.6f\n%3.6f" 123.456 12.345 1.234
123.456000
12.345000
1.234000
$ printf "%10.6f\n%10.6f\n%10.6f" 123.456 12.345 1.234
123.456000
 12.345000
  1.234000
You can see that the latter aligns on the decimal point properly.
As sidharth c nadhan mentioned, you can use awk's internal OFMT variable (see awk(1)). An example:
$ awk 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456
12.345
1.234
$ awk -vOFMT=%10.6f 'BEGIN{print 123.456; print 12.345; print 1.234}'
123.456000
 12.345000
  1.234000
In your example the widest number would be something like 123456.1234567, so the format %15.7f covers everything and gives a nice-looking table.
But unfortunately OFMT does not apply when the value is integral - even a literal 123.0 prints as 123:
$ awk -vOFMT=%15.7f 'BEGIN{print 123.456;print 123;print 123.0;print 0.0+123.0}'
    123.4560000
123
123
123
I even tried gawk's strtonum() function, but integral values are still printed without the OFMT formatting. See:
awk -vOFMT=%15.7f -vCONVFMT=%15.7f 'BEGIN{print 123.456; print strtonum(123); print strtonum(123.0)}'
It has the same output as before.
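If every value, integral or not, must come out in one fixed format, one way out is a tiny wrapper that forces everything through sprintf (a sketch):
$ awk 'function fmt(x) { return sprintf("%15.7f", x) }
  BEGIN { print fmt(123.456); print fmt(123); print fmt(123.0) }'
    123.4560000
    123.0000000
    123.0000000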
So I think you have to use printf anyway. The script can be a little shorter and a bit more configurable:
awk -vf='\t'%15.7f 'NR==FNR{x[$1]=sprintf("%s"f f f,$1,$2,$3,$4);next}$1 in x{printf("%s"f f f"\n",x[$1],$2,$3,$4)}' file1.txt file2.txt
The script will not work properly if there are duplicate IDs in the first file. If that cannot happen, the two conditions can be swapped and the ;next can be left off.
awk 'NR==FNR{x[$1]=$0;next} ($1 in x){split(x[$1],k,FS); printf "%s\t%9s\t%9s\t%9s\t%9s\t%9s\t%9s\n", $1,k[2],k[3],k[4],$2,$3,$4}' file1.txt file2.txt > fileout.txt