Updated
Let's suppose that you got octal escape sequences in a stream:
backslash \134 is escaped as \134134
single quote ' and double quote \042
linefeed `\012` and carriage return `\015`
%s &
etc...
note: The escaped characters are limited to 0x01-0x1F 0x22 0x5C 0x7F
How can you revert those escape sequences back to their corresponding character with awk?
While awk is able to understand them out-of-box when used in a literal string or as a parameter argument, I can't find the way to leverage this capability when the escape sequence is part of the data. For now I'm using one gsub per escape sequence but it doesn't feel efficient.
Here's the expected output for the given sample:
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
PS: While I have the additional constraint of unescaping each line into an awk variable before printing the result, it doesn't really matter.
Using GNU awk for strtonum() and lots of meaningfully-named variables to show what each step does:
$ cat tst.awk
function octs2chars(str, head,tail,oct,dec,char) {
head = ""
tail = str
while ( match(tail,/\\[0-7]{3}/) ) {
oct = substr(tail,RSTART+1,RLENGTH-1)
dec = strtonum(0 oct)
char = sprintf("%c", dec)
head = head substr(tail,1,RSTART-1) char
tail = substr(tail,RSTART+RLENGTH)
}
return head tail
}
{ print octs2chars($0) }
$ awk -f tst.awk file
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
If you don't have GNU awk then write a small function to convert octal to decimal, e.g. oct2dec() below, and then call that instead of strtonum():
$ cat tst2.awk
function oct2dec(oct, dec) {
dec = substr(oct,1,1) * 8 * 8
dec += substr(oct,2,1) * 8
dec += substr(oct,3,1)
return dec
}
function octs2chars(str, head,tail,oct,dec,char) {
head = ""
tail = str
while ( match(tail,/\\[0-7]{3}/) ) {
oct = substr(tail,RSTART+1,RLENGTH-1)
dec = oct2dec(oct) # replaced "strtonum(0 oct)"
char = sprintf("%c", dec)
head = head substr(tail,1,RSTART-1) char
tail = substr(tail,RSTART+RLENGTH)
}
return head tail
}
{ print octs2chars($0) }
$ awk -f tst2.awk file
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
The above assumes that, as discussed in comments, the only backslashes in the input will be in the context of the start of octal numbers as shown in the provided sample input.
With GNU awk which supports strtonum() function, would you
please try:
awk '{
while (match($0, /\\[0-7]{1,3}/)) {
printf("%s", substr($0, 1, RSTART - 1)) # print the substring before the match
printf("%c", strtonum("0" substr($0, RSTART + 1, RLENGTH))) # convert the octal string to character
$0 = substr($0, RSTART + RLENGTH) # update $0 with remaining substring
}
print
}' input_file
It processes the matched substring (octal presentation)
in the while loop one by one.
substr($0, RSTART + 1, RLENGTH) skips the leading backslash.
"0" prepended to substr makes an octal string.
strtonum() converts the octal string to the numeric value.
The final print outputs the remaining substring.
UPDATE :: about gawk's strtonum() in unicode mode :
echo '\666' |
LC_ALL='en_US.UTF-8' gawk -e '
$++NF = "<( "(sprintf("%c", strtonum((_=_<_) substr($++_, ++_))))" )>"'
0000000 909522524 539507744 690009798 2622
\ 6 6 6 < ( ƶ ** ) > \n
134 066 066 066 040 074 050 040 306 266 040 051 076 012
\ 6 6 6 sp < ( sp ? ? sp ) > nl
92 54 54 54 32 60 40 32 198 182 32 41 62 10
5c 36 36 36 20 3c 28 20 c6 b6 20 29 3e 0a
0000016
By default, gawk in unicode mode would decode out a multi-byte character instead of byte \266 | 0xB6. If you wanna ensure consistency of always decoding out a single-byte out, even in gawk unicode mode, this should do the trick :
echo '\666' |
LC_ALL='en_US.UTF-8' gawk -e '$++NF = sprintf("<( %c )>",
strtonum((_=_<_) substr($++_, ++_)) + _*++_^_++*_^++_)'
0000000 909522524 539507744 1042882742 10
\ 6 6 6 < ( 266 ) > \n
134 066 066 066 040 074 050 040 266 040 051 076 012
\ 6 6 6 sp < ( sp ? sp ) > nl
92 54 54 54 32 60 40 32 182 32 41 62 10
5c 36 36 36 20 3c 28 20 b6 20 29 3e 0a
0000015
long story short : add 4^5 * 54 to output of strtonum(), which happens to be 0xD800, the starting point of UTF-16 surrogates
=================== =================== ===================
one quick note about #Gene's proposed perl-based solution :
echo 'abc \555 456' | perl -p -e 's/\\([0-7]{3})/chr(oct($1))/ge'
Wide character in print at -e line 1, <> line 1.
abc ŭ 456
octal codes wrap around, meaning \4xx = \0xx ; \6xx = \2xx etc :
printf '\n %s\n' $'\555'
m
so perl is incorrectly decoding these as multi-byte characters, when in fact \555, as confirmed by printf, is merely lowercase "m" (0x6D)
ps : my perl is version 5.34
I got my own POSIX awk solution, so I post it here for reference.
The main idea is to build a hash that translates an octal escape sequence to its corresponding character. You can then use it while splitting the line during the search for escape sequences:
LANG=C awk '
BEGIN {
for ( i = 1; i <= 255; i++ )
tr[ sprintf("\\%03o",i) ] = sprintf("%c",i)
}
{
remainder = $0
while ( match(remainder, /\\[0-7]{3}/) ) {
printf("%s%s", \
substr(remainder, 1, RSTART-1), \
tr[ substr(remainder, RSTART, RLENGTH) ] \
)
remainder = substr(remainder, RSTART + RLENGTH)
}
print remainder
}
' input.txt
backslash `\`
single quote `'` and double quote `"`
linefeed `
` and carriage return `
%s &
etc...
this separate post is made specifically to showcase how to extend the octal lookup reference tables in gawk unicode-mode to all 256 bytes without external dependencies or warning messages:
ASCII bytes reside in table o2bL
8-bit bytes reside in table o2bH
.
# gawk profile, created Fri Sep 16 09:53:26 2022
'BEGIN {
1 makeOctalRefTables(PROCINFO["sorted_in"] = "#val_str_asc" \
(ORS = ""))
128 for (_ in o2bL) {
128 print o2bL[_]
}
128 for (_ in o2bH) {
128 print o2bH[_]
}
}
function makeOctalRefTables(_,__,___,____)
{
1 _=__=___=____=""
for (_ in o2bL) {
break
}
1 if (!(_ in o2bL)) {
1 ____=_+=((_+=_^=_<_)-+-++_)^_--
128 do { o2bL[sprintf("\\%o",_)] = \
sprintf("""%c",_)
} while (_--)
1 o2bL["\\" ((_+=(_+=_^=_<_)+_)*_--+_+_)] = "\\&"
1 ___=--_*_^_--*--_*++_^_*(_^=++_)^(! —_)
128 do { o2bH[sprintf("\\%o", +_)] = \
sprintf("%c",___+_)
} while (____<--_)
}
1 return length(o2bL) ":" length(o2bH)
}'
|
\0 \1 \2 \3 \4 \5 \6 \7 \10\11 \12
\13
\14
\16 \17
\20 \21 \22 \23 \24 \25 \26 \27 \30 \31 \32 \33 34 \35 \36 \37
\40 \41 !\42 "\43 #\44 $\45 %\47 '\50 (\51 )\52 *\53 +\54 ,\55 -\56 .\57 /
\60 0\61 1\62 2\63 3\64 4\65 5\66 6\67 7\70 8\71 9\72 :\73 ;\74 <\75 =\76 >\77 ?
\100 #\101 A\102 B\103 C\104 D\105 E\106 F\107 G\110 H\111 I\112 J\113 K\114 L\115 M\116 N\117 O
\120 P\121 Q\122 R\123 S\124 T\125 U\126 V\127 W\130 X\131 Y\132 Z\133 [\134 \\46 \&\135 ]\136 ^\137 _
\140 `\141 a\142 b\143 c\144 d\145 e\146 f\147 g\150 h\151 i\152 j\153 k\154 l\155 m\156 n\157 o
\160 p\161 q\162 r\163 s\164 t\165 u\166 v\167 w\170 x\171 y\172 z\173 {\174 |\175 }\176 ~\177
\200 ?\201 ?\202 ?\203 ?\204 ?\205 ?\206 ?\207 ?\210 ?\211 ?\212 ?\213 ?\214 ?\215 ?\216 ?\217 ?
\220 ?\221 ?\222 ?\223 ?\224 ?\225 ?\226 ?\227 ?\230 ?\231 ?\232 ?\233 ?\234 ?\235 ?\236 ?\237 ?
\240 ?\241 ?\242 ?\243 ?\244 ?\245 ?\246 ?\247 ?\250 ?\251 ?\252 ?\253 ?\254 ?\255 ?\256 ?\257 ?
\260 ?\261 ?\262 ?\263 ?\264 ?\265 ?\266 ?\267 ?\270 ?\271 ?\272 ?\273 ?\274 ?\275 ?\276 ?\277 ?
\300 ?\301 ?\302 ?\303 ?\304 ?\305 ?\306 ?\307 ?\310 ?\311 ?\312 ?\313 ?\314 ?\315 ?\316 ?\317 ?
\320 ?\321 ?\322 ?\323 ?\324 ?\325 ?\326 ?\327 ?\330 ?\331 ?\332 ?\333 ?\334 ?\335 ?\336 ?\337 ?
\340 ?\341 ?\342 ?\343 ?\344 ?\345 ?\346 ?\347 ?\350 ?\351 ?\352 ?\353 ?\354 ?\355 ?\356 ?\357 ?
\360 ?\361 ?\362 ?\363 ?\364 ?\365 ?\366 ?\367 ?\370 ?\371 ?\372 ?\373 ?\374 ?\375 ?\376 ?\377 ?
Related
I am trying to parse read only comm logs from a radio. Some entries are 2-3 lines others might be more than 8 lines. The good news is I can find unique static start and stop strings. The bad news is I have tried copying SED code from the nearest example I could find for the past 11 hours with zero success.
The log looks like this:
M: 2019-06-08 18:15:24.927 DMR Slot 2, received network voice header from KS3X to TG 91
M: 2019-06-08 18:15:25.402 DMR Talker Alias (Data Format 1, Received 6/20 char): 'KS3X D'
M: 2019-06-08 18:15:25.410 DMR Slot 2, Embedded Talker Alias Header
M: 2019-06-08 18:15:25.412 0000: 04 00 68 4B 53 33 58 20 44 *..hKS3XD*
M: 2019-06-08 18:15:26.111 DMR Talker Alias (Data Format 1, Received 13/20 char): 'KS3X DMR ID: '
M: 2019-06-08 18:15:26.120 DMR Slot 2, Embedded Talker Alias Block 1
M: 2019-06-08 18:15:26.121 0000: 05 00 4D 52 20 49 44 3A 20 *..MR ID: *
M: 2019-06-08 18:15:26.824 DMR Talker Alias (Data Format 1, Received 20/20 char): 'KS3X DMR ID: 1142129'
M: 2019-06-08 18:15:26.824 DMR Slot 2, Embedded Talker Alias Block 2
M: 2019-06-08 18:15:26.824 0000: 06 00 31 31 34 32 31 32 39 *..1142129*
M: 2019-06-08 18:16:15.921 DMR Slot 2, received network end of voice transmission, 51.2 seconds, 0% packet loss, BER: 0.0%
The data variables I would want from that would be the following:
The call sign between 'from' and 'to' on line 1.
The channel located between 'to' and the end of line 1.
The DMR ID at the end of line 8(1142129).
The duration of 51.2 seconds on line 11.
The percentage of packet loss on line 11.
The percentage of BER at the end of line 11.
All records, no matter their length of lines start with "received voice header from" and end with the "%" percent symbol. Additionally could anyone point me toward a low level overview of when to use SED, GREP or AWK, they all seem very similar to me? Just a link to a good tutorial would be great.
What I am trying to is run a bash script to monitor the log using the terminal using something like this:
tail -fn0 /var/log/pi-star/MMDVM-2019-06-08.log
But with only the 6 variables named above. Super grateful!!!
#!/bin/bash
ACCESS_TOKEN="o.WOgpVaaEBjoVLGKS3VzFnsO4xGClTRiF"
tail -fn0 /var/log/pi-star/MMDVM-2019-06-08.log | \
while read line ; do
echo "$line" | gawk '
match($0, /received.*voice header from ([[:alnum:]]+) to ([[:alnum:]]+ [0-9]+)/, a) {
in_record = 1
call_sign = a[1]
channel = a[2]
}
in_record && match($0, /DMR ID: ([0-9]+)/, a) {
dmr_id = a[1]
}
in_record && match($0, /([0-9.]+) seconds, ([0-9]+)% packet loss, BER: ([0-9.]+)%/, a) {
in_record = 0
print call_sign, channel, dmr_id, a[1], a[2], a[3]
}
'
done
gawk '
match($0, /received.*voice header from ([[:alnum:]]+) to ([[:alnum:]]+ [0-9]+)/, a) {
in_record = 1
call_sign = a[1]
channel = a[2]
}
in_record && match($0, /DMR ID: ([0-9]+)/, a) {
dmr_id = a[1]
}
in_record && match($0, /([0-9.]+) seconds, ([0-9]+)% packet loss, BER: ([0-9.]+)%/, a) {
in_record = 0
print call_sign, channel, dmr_id, a[1], a[2], a[3]
}
' OFS=, radio.log
KS3X,TG 91,1142129,51.2,0,0.0
This is specific to GNU awk (for the 3-argument form of the match() function)
Here is another awk (standard Linux gawk) script doing the same trick. with less code and better capture patterns.
script.awk
/received network voice header from/,/#$/{
if (match($0, "from ([^ ]+) to (.*$)", a)) {
output[1] = a[1];
output[2] = a[2];
}
if (match($0, "DMR ID: ([^']+)'", a)) {
output[3] = a[1];
}
if (match($0, "voice transmission, ([^ ]+) seconds, ([^%]+)% packet loss, BER: ([^%]+)%", a)) {
output[4] = a[1];
output[5] = a[2];
output[6] = a[3];
outputStr = output[1];
for (i = 2; i <= 6; i++) outputStr = outputStr","output[i];
print outputStr;
}
}
run script
awk -f script.awk input.log
output:
KS3X,TG 91,1142129,51.2,0,0.0
With GNU awk for multi-char RS:
$ awk -v RS='%\n' -v OFS=, '{print $12, $14" "$15, $112+0, $150, $152+0, $156}' file
KS3X,TG 91,1142129,51.2,0,0.0
I have an older script that has been bugging me for a while now, which has a small bug in it that I haven't really gotten around to fixing, but I think it's about time. The script basically appends the columns of different files based on the ID of the rows. For example...
test1.txt:
a 3
b 2
test2.txt:
a 5
b 9
... should yield a result of:
a 3 5
b 2 9
The script itself looks like this:
#!/bin/bash
gawk 'BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$2; keys[$1] }
END {
for (key in keys) {
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
} printf "\n"
}' $1 $2
... called as $ script.sh test1.txt test2.txt. The problem is that the result I get is not exactly what I should get:
a 3 5
b 2 9
NA NA NA
... i.e. I get a row with NA's at the very end of the file. So far I've just been deleting this row manually, but it'd be nice to not have to do that. I don't really see where this weird functionality is coming from, though... Anybody got any ideas? I'm using GAWK on OSX, if that matters.
Here's some actual input (that's what I get for trying to make the question simple and to the point! =P)
target_id length eff_length est_counts tpm
ENST00000574176 596 282 6 0.825408
ENST00000575242 103 718 105 5.19804
ENST00000573052 291 291 21 2.61356
ENST00000312051 223 192 2559 46.8843
I'm interested in the target_id and tpm columns, the others are unimportant. My full script:
FILES=$(find . -name 'data.txt' | xargs)
# Get replicate names for column header
printf "%s" 'ENSTID'
for file in $FILES; do
file2="${file/.results\/data.txt/}"
file3="${file2/.\/*\//}"
printf "\t%s" $file3
done
printf "\n"
gawk 'BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$5; keys[$1] }
END {
for (key in keys) {
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
} printf "\n"
}' $FILES
(i.e. all the files are named data.txt, but are located in differently named subfolders.)
A simpler idiomatic way to do it would be
$ cat test1.txt
a 3
b 2
$ cat test2.txt
a 5
b 9
$ awk -v OFS="\t" 'NR==FNR{rec[$1]=$0;next}$1 in rec{print rec[$1],$2}' test1.txt test2.txt
a 3 5
b 2 9
For the actual input
$ cat test1.txt
target_id length eff_length est_counts tpm
ENST00000574176 596 282 6 0.825408
ENST00000575242 103 718 105 5.19804
ENST00000573052 291 291 21 2.61356
ENST00000312051 223 192 2559 46.8843
$ cat test2.txt
target_id length eff_length est_counts tpm
ENST00000574176 996 122 6 0.3634
ENST00000575242 213 618 105 7.277
ENST00000573052 329 291 89 2.0356
ENST00000312051 21 00 45 0.123
$ awk 'NR==FNR{rec1[$1]=$1;rec2[$1]=$5;next}$1 in rec1{printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5}' test1.txt test2.txt
target_id tpm tpm
ENST00000574176 0.825408 0.3634
ENST00000575242 5.19804 7.277
ENST00000573052 2.61356 2.0356
ENST00000312051 46.8843 0.123
Notes :
-v OFS="\t" is for tab separated fields in output, order of passed files is important (Matters to first solution).
Hard-coding newlines as in
printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5
is not a good idea as it renders the script less portable.You may well replace it with
printf "%-20s %-15s %-15s", rec1[$1],rec2[$1],$5;print # same effect
Edit : for more than two files
$ shopt -s globstar
$ awk 'NR==FNR{rec1[$1]=$1" "$5;next}{if($1 in rec1){rec1[$1]=rec1[$1]" "$5}else{rec1[$1]=$1" "$5}}END{for(i in rec1){if(i != "target_id"){print rec1[i];}}}' **/test*.txt
ENST00000312051 46.8843 46.8843 0.123 46.8843 0.123
ENST00000573052 2.61356 2.61356 2.0356 2.61356 2.0356
ENST00000575242 5.19804 5.19804 7.277 5.19804 7.277
ENST00000574176 0.825408 0.825408 0.3634 0.825408 0.3634
ENST77777777777 01245
ENST66666666666 7.277 7.277
$ shopt -u globstar
As far as I can see, the only reason you would get an empty line at the end of the output (which is what I get with gawk on OS X) is that you have a printf "\n" at the end of the script, which will add a newline even though you've just printed ORS.
Since your bash script is essentially an awk script, I would make a proper awk script out of it. That would additionally save you the problem of having incorrect quoting of $1 and $2 in the shell script (would break on exotic filenames). This also gives you proper syntax highlighting in your favourite text editor, if it understands Awk:
#!/usr/bin/gawk -f
BEGIN { OFS = "\t" }
{
vals[$1,ARGIND] = $2;
keys[$1] = 1;
}
END {
for (key in keys) {
printf("%s%s", key, OFS);
for (colNr = 1; colNr <= ARGIND; colNr++) {
printf("%s%s", vals[key,colNr], (colNr < ARGIND ? OFS : ORS));
}
}
}
The same can be done with more complex sed editing scripts.
Working native bash code :
while read line
do
a=${line:112:7}
b=${line:123:7}
if [[ $a != "0000000" || $b != "0000000" ]]
then
echo "$line" >> FILE_OT_YHAV
else
echo "$line" >> FILE_OT_NHAV
fi
done <$FILE_IN
I have the following file (its a dummy), the substrings being checked are both on the 4th field, so nm the exact numbers.
AAAAAAAAAAAAAA XXXXXX BB CCCCCCC 12312312443430000000
BBBBBBB AXXXXXX CC DDDDDDD 10101010000000000000
CCCCCCCCCC C C QWEQWEE DDD AAAAAAA A12312312312312310000
I m trying to write an awk script that compares two specific substrings, if either one is not 000000 it outputs the line into File A, if both of them are 000000 it outputs the line into File B, this is the code i have so far :
# Before first line.
BEGIN {
print "Awk Started"
FILE_OT_YHAV="FILE_OT_YHAV.test"
FILE_OT_NHAV="FILE_OT_NHAV.test"
FS=""
}
# For each line of input.
{
fline=$0
# print "length = #" length($0) "#"
print "length = #" length(fline) "#"
print "##" substr($0,112,7) "##" substr($0,123,7) "##"
if ( (substr($0,112,7) != "0000000") || (substr($0,123,7) != "0000000") )
print $0 > FILE_OT_YHAV;
else
print $0 > FILE_OT_NHAV;
}
# After last line.
END {
print "Awk Ended"
}
The problem is that when i run it, it :
a) Treats every line as having a different length
b) Therefore the substrings are applied to different parts of it (that is why i added the print length stuff before the if, to check on it.
This is a sample output of the line length awk reads and the different substrings :
Awk Started
length = #130#
## ## ##
length = #136#
##0000000##22016 ##
length = #133#
##0000001##16 ##
length = #129#
##0010220## ##
length = #138#
##0000000##1022016##
length = #136#
##0000000##22016 ##
length = #134#
##0000000##016 ##
length = #137#
##0000000##022016 ##
Is there a reason why awk treats lines of the same length as having a different length? Does it have something to do with the spacing of the input file?
Thanks in advance for any help.
After the comments about cleaning the file up with sed, i got this output (and yes now the lines have a different size) :
1 0M-DM-EM-G M-A.M-E. #DEH M-SM-TM-OM-IM-WM-EM-IM-A M-DM-V/M-DM-T/M-TM-AM-P 01022016 $
2 110000080103M-CM-EM-QM-OM-MM-TM-A M-A. 6M-AM-HM-GM-MM-A 1055801001102 0000120000012001001142 19500000120 0100M-D000000000000000000000001022016 $
3 110000106302M-TM-AM-QM-EM-KM-KM-A 5M-AM-AM-HM-GM-MM-A 1043801001101 0000100000010001001361 19500000100M-IM-SM-O0100M-D000000000000000000000001022016 $
4 110000178902M-JM-AM-QM-AM-CM-IM-AM-MM-MM-G M-KM-EM-KM-AM-S 71M-AM-HM-GM-MM-A 1136101001101 0000130000013001006061 19500000130 0100M-D000000000000000000000001022016 $
I have a file whose lines I want to split using either space or "_".
Its format is
f 5.287102213 _10_ RTR --- 312 cbr 120 [13a a 6 800] ------- [6:0 20:0 29 20] [15] 1 0
s 5.288000000 _0_ AGT --- 322 cbr 100 [0 0 0 0] ------- [0:0 2:0 32 0] [18]
My awk script is as follows:
`#!/usr/bin/awk -f
BEGIN {FS="[[:space:]]|_"} # use posix space or underscore for FS
{
action = $1;
time = $2;
sta = $4 ; # shifted here because underscores are delimiters
dest = $6;
app = $10;
pkt_size = $11;
#print $1
#print $2
print $5
#print $4
#print $5
#print $6
#print $7
#print $8
#print $9
#print $10
if( action == "s" && dest == "MAC" && app == "cbr"){
startTime+=time ;
count++;
}
if( action == "r" && dest == "MAC" && app == "cbr"){
endTime+=time ;
receivedSize+=pkt_size ;
}
}`
As seen in the above script, from the above script I was expecting RTR to be in $4.
But I find that the output of $3 is as follows:
RTR --- 312 cbr 120 [13a a 6 800] ------- [6:0 20:0 29 20] [15] 1 0
AGT --- 322 cbr 100 [0 0 0 0] ------- [0:0 2:0 32 0] [18] 0 0
RTR --- 322 cbr 100 [0 0 0 0] ------- [0:0 2:0 32 0] [18] 0 0
What am I doing wrong? Am new to awk.
Change your FS value to [[:space:]_]+ to get the tokenization (splitting into fields) you want.
Test it with this statement to see the fields recognized:
awk -F'[[:space:]_]+' '{for(i=1;i<=NF;++i){print i ": " $i}}' \
<<<'f 5.287102213 _10_ RTR --- 312 cbr 120 [13a a 6 800] ------- [6:0 20:0 29 20] [15] 1 0'
The problem with your FS value, [[:space:]]|_, is that
it only recognizes 1 character at a time as the separator
it only recognizes either whitespace or _ as the separator.
Note that specifying an explicit FS value other than ' ' (a single space) causes awk to look for a single instance of that separator, and interprets multiple adjacent instances as separating multiple - and thus empty - fields.
Thus, in your case, the spans <space>_ and _<space> each represent not a single separator, but two separators abutting an empty field.
If you want spans (runs) of a given character or characters from a set to be interpreted as a single separator instance, use duplication symbol +.
However, the proposed FS value, [[:space:]_]+, may be too permissive, as it would recognize a run of any mix of whitespace and _ chars. as a separator.
To be more restrictive, you could use the following FS value:
[[:space:]]+_?|_?[[:space:]]+
That said, if the _ chars in your input function more like delimiters enclosing only one field, a better solution may be:
to use the DEFAULT value of FS, which recognizes runs of whitespace as delimiters
to strip the _ delimiters from field $3: gsub("^_|_$", "", $3)
To retrieve the ascii code of all charterers of column 13th of a file I write this script
awk -v ch="'" '{
for (i=1;i<=length(substr($13,6,length($13)));i++)
{cmd = printf \"%d\\n\" \"" ch substr(substr($13,6,length($13)),i,1) "\"" cmd | getline output close(cmd) ;
Number= Number " " output
}
print Number ; Number=""
}' ~/a.test
but it doesn't work in the right way! I mean it works fine a while then produces the weird results!?
As an example , for this input (assume it's column 13th)
CQ:Z:%8%%%%0%%%%9%%%%:%%%%%%%%%%%%%%%%%%
I have to get this
37 56 37 37 37 37 48 37 37 37 37 57 37 37 37 37 58 37 37 37 37 ...............
But I have this
37 56 37 37 37 37 48 48 48 48 48 57 57 57 57 57 58 58 58 58 58 ...............
As you can see first miss-computation appear after character "0" (48 in result).
Do you know which part of my code is responsible for this error ?!
Try this:
awk '{
str = substr($13, 6)
for (i=1; i<=length(str); i++) {
cmd = "printf %d \42\47" substr(str, i, 1) "\42"
cmd | getline output
close(cmd)
Number= Number " " output
}
print Number
Number=""
}' ~/a.test
\42 is " and \47 is ', so this runs printf %d "'${char}" in the shell for each ${char}, which triggers evaluation as a C constant with the POSIX extension dictating a numeric value as noted in the final bullet of the POSIX printf definition's §Extended Description.
N.B. The formatting matters!
Don't try to squeeze the code unless you know exactly what you're doing!
And a pure awk solution (I took the ord/chr functions directly from the manual):
printf '%s\n' 'CQ:Z:%8%%%%0%%%%9%%%%:%%%%%%%%%%%%%%%%%%'|
awk 'BEGIN { _ord_init() }
{
str = substr($0, 6)
for (i = 0; ++i <= length(str);)
printf "%s", (ord(substr(str, i, 1)) (i < length(str) ? OFS : ORS))
}
func _ord_init( low, high, i, t) {
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
}
else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
}
else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
func ord(str, c) {
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}
func chr(c) {
# force c to be numeric by adding 0
return sprintf("%c", c + 0)
}'
This might work for you:
awk -vSQ="'" -vDQ='"' '{args=space="";n=split($13,a,"");for(i=1;i<=n;i++){args=args space DQ SQ a[i] DQ;format=format space "%d";space=" "};format=DQ format "\\n" DQ;system("printf " format " " args)}'