awk how to split and change blank by NA - awk

I'm having trouble with awk. I want to split a file into two files. It mostly works, but I have one last issue.
This is one of my input files:
samplexxx EH Tred GangSTR
dijen006 nofile nofile nofile
dijen006_100 22,30 22,27 19,25
dijen006_75 25,27 29 NA
dijen017 nofile nofile nofile
dijen017_100 75,121 54 24,24
dijen017_75 74,131 72 19,19
dijen081 63,84 32 40,40
dijen081_100 70,115 78 25,41
dijen081_75 79,143 95 24,104
dijen082 47,51 38 15,34
dijen082_100 46,61 52 6,32
dijen082_75 NA 55 17,17
dijen083 30,53 30,40 38,38
dijen083_100 43,53 30,59 23,32
dijen083_75 43,60 18,74 23,71
dijen1013 30 30 20,30
dijen1013_100 30 30 9,19
dijen1013_75 21 33 20,20
dijen1014 9,30 9,30 9,30
dijen1014_100 9,28 9,43 9,11
dijen1014_75 9,28 9,36 9,29
dijen1015 23,30 23,30 23,29
dijen1015_100 23,30 NA 13,22
dijen1015_75 25,27 21,42 22,39
dijen402 25,31 25,31 25,31
dijen402_100 30 29,36 14,30
dijen402_75 25,26 22,39 22,39
I am using this code:
#!/bin/awk -f
#USAGE = awk -v my_var=$(basename $i .tsv) -f split_file_allelle.awk $i
BEGIN { FS=OFS="\t" }
NR == 1 {
str1 = str2 = $0
}
NR > 1 {
str1 = str2 = $1
for (i=2; i<=NF; i++) {
split($i,a,/,/)
str1 = str1 OFS a[1]
str2 = str2 OFS a[2]
}
}
{
print str1 > my_var"_all1.tsv"
print str2 > my_var"_all2.tsv"
}
I get two files, split on the ",". The second one looks like the one below. Is there a way to get something like 'NA' instead of a blank in the second file wherever there is no number?
samplexxx EH Tred GangSTR
dijen006
dijen006_100 30 27 25
dijen006_75 27
dijen017
dijen017_100 121 24
dijen017_75 131 19
dijen081 84 40
dijen081_100 115 41
dijen081_75 143 104
dijen082 51 34
dijen082_100 61 32
dijen082_75 17
dijen083 53 40 38
dijen083_100 53 59 32
dijen083_75 60 74 71
dijen1013 30
dijen1013_100 19
dijen1013_75 20
dijen1014 30 30 30
dijen1014_100 28 43 11
dijen1014_75 28 36 29
dijen1015 30 30 29
dijen1015_100 30 22
dijen1015_75 27 42 39
dijen402 31 31 31
dijen402_100 36 30
dijen402_75 26 39 39
This is what I have, but I would like to have something like this:
samplexxx EH Tred GangSTR
dijen006 NA NA NA
dijen006_100 30 27 25
dijen006_75 27 NA NA
dijen017 NA NA NA
dijen017_100 121 NA 24
....
Thanks for your help!

BEGIN {
FS = OFS = "\t"
all1 = my_var "_all1.tsv"
all2 = my_var "_all2.tsv"
}
NR == 1 {
str1 = str2 = $0
}
NR > 1 {
str1 = str2 = $1
for (i=2; i<=NF; i++) {
n = split($i,a,",")
str1 = str1 OFS a[1]
str2 = str2 OFS (n == 1 ? "NA" : a[2])
}
}
{
print str1 > all1
print str2 > all2
}
Changing print str1 > my_var"_all1.tsv" to print str1 > all1 wasn't necessary to solve the specific problem you asked about; the ternary that tests split()'s return value does that. However, print str1 > my_var"_all1.tsv" is undefined behavior per POSIX, so it will fail in some awks. It needs to be written either with a variable, as I have, or with parentheses around the expression that generates the file name: print str1 > (my_var"_all1.tsv"). Using a variable also does the concatenation once in total instead of once per line, which is more efficient.
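As a minimal sketch of both points (the split() return value and the parenthesized output file name), here is a run on a made-up three-line sample. The temporary directory, the prefix variable and the sample rows are assumptions for illustration, not part of the question:

```shell
# Hypothetical sample data; "demo" stands in for the value of my_var.
tmp=$(mktemp -d)
printf 'sample\tEH\tTred\nid1\t22,30\t29\nid2\tnofile\tNA\n' > "$tmp/in.tsv"

awk -v prefix="$tmp/demo" '
BEGIN { FS = OFS = "\t" }
# Header line: copy it unchanged to both output files.
NR == 1 { print > (prefix "_all1.tsv"); print > (prefix "_all2.tsv"); next }
{
    str1 = str2 = $1
    for (i = 2; i <= NF; i++) {
        n = split($i, a, ",")                    # n is the number of pieces
        str1 = str1 OFS a[1]
        str2 = str2 OFS (n == 1 ? "NA" : a[2])   # no second piece: write NA
    }
    print str1 > (prefix "_all1.tsv")            # parentheses make the target unambiguous
    print str2 > (prefix "_all2.tsv")
}' "$tmp/in.tsv"

cat "$tmp/demo_all2.tsv"
```

The second file should then hold NA in every column whose input had no comma, including the cells that were "nofile" or "NA" in the input.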


decoding octal escape sequences in input with awk

Updated
Let's suppose that you got octal escape sequences in a stream:
backslash \134 is escaped as \134134
single quote ' and double quote \042
linefeed `\012` and carriage return `\015`
%s &
etc...
note: The escaped characters are limited to 0x01-0x1F 0x22 0x5C 0x7F
How can you revert those escape sequences back to their corresponding character with awk?
While awk is able to understand them out-of-box when used in a literal string or as a parameter argument, I can't find the way to leverage this capability when the escape sequence is part of the data. For now I'm using one gsub per escape sequence but it doesn't feel efficient.
Here's the expected output for the given sample:
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
PS: While I have the additional constraint of unescaping each line into an awk variable before printing the result, it doesn't really matter.
Using GNU awk for strtonum() and lots of meaningfully-named variables to show what each step does:
$ cat tst.awk
function octs2chars(str, head,tail,oct,dec,char) {
head = ""
tail = str
while ( match(tail,/\\[0-7]{3}/) ) {
oct = substr(tail,RSTART+1,RLENGTH-1)
dec = strtonum(0 oct)
char = sprintf("%c", dec)
head = head substr(tail,1,RSTART-1) char
tail = substr(tail,RSTART+RLENGTH)
}
return head tail
}
{ print octs2chars($0) }
$ awk -f tst.awk file
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
If you don't have GNU awk then write a small function to convert octal to decimal, e.g. oct2dec() below, and then call that instead of strtonum():
$ cat tst2.awk
function oct2dec(oct, dec) {
dec = substr(oct,1,1) * 8 * 8
dec += substr(oct,2,1) * 8
dec += substr(oct,3,1)
return dec
}
function octs2chars(str, head,tail,oct,dec,char) {
head = ""
tail = str
while ( match(tail,/\\[0-7]{3}/) ) {
oct = substr(tail,RSTART+1,RLENGTH-1)
dec = oct2dec(oct) # replaced "strtonum(0 oct)"
char = sprintf("%c", dec)
head = head substr(tail,1,RSTART-1) char
tail = substr(tail,RSTART+RLENGTH)
}
return head tail
}
{ print octs2chars($0) }
$ awk -f tst2.awk file
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
The above assumes that, as discussed in comments, the only backslashes in the input will be in the context of the start of octal numbers as shown in the provided sample input.
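For a quick check of the approach above, here is a compact sketch of the same loop wrapped in a shell function. The function name decode_octal is made up for illustration, and the regexp spells out [0-7][0-7][0-7] instead of {3} on the assumption that some older awk implementations do not support interval expressions:

```shell
# Decode \NNN octal escapes on stdin using only POSIX awk features.
decode_octal() {
    awk '
    # Convert a 3-digit octal string to its decimal value.
    function oct2dec(oct) {
        return substr(oct, 1, 1) * 64 + substr(oct, 2, 1) * 8 + substr(oct, 3, 1)
    }
    {
        out = ""
        rest = $0
        while (match(rest, /\\[0-7][0-7][0-7]/)) {
            # Text before the escape, then the decoded character.
            out = out substr(rest, 1, RSTART - 1) sprintf("%c", oct2dec(substr(rest, RSTART + 1, 3)))
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}

printf '%s\n' 'backslash \134 is escaped as \134134' 'double quote \042' | decode_octal
```

Because the decoded character is appended to out and never rescanned, the \134134 in the sample ends up as the literal text \134 rather than being decoded twice.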
With GNU awk, which supports the strtonum() function, would you please try:
awk '{
while (match($0, /\\[0-7]{1,3}/)) {
printf("%s", substr($0, 1, RSTART - 1)) # print the substring before the match
printf("%c", strtonum("0" substr($0, RSTART + 1, RLENGTH - 1))) # convert the octal digits to a character
$0 = substr($0, RSTART + RLENGTH) # update $0 with remaining substring
}
print
}' input_file
It processes the matched substrings (octal representations) one by one in the while loop.
substr($0, RSTART + 1, RLENGTH - 1) skips the leading backslash and keeps only the matched octal digits.
"0" prepended to the substring makes an octal string.
strtonum() converts the octal string to the numeric value.
The final print outputs the remaining substring.
UPDATE :: about gawk's strtonum() in unicode mode :
echo '\666' |
LC_ALL='en_US.UTF-8' gawk -e '
$++NF = "<( "(sprintf("%c", strtonum((_=_<_) substr($++_, ++_))))" )>"'
0000000 909522524 539507744 690009798 2622
\ 6 6 6 < ( ƶ ** ) > \n
134 066 066 066 040 074 050 040 306 266 040 051 076 012
\ 6 6 6 sp < ( sp ? ? sp ) > nl
92 54 54 54 32 60 40 32 198 182 32 41 62 10
5c 36 36 36 20 3c 28 20 c6 b6 20 29 3e 0a
0000016
By default, gawk in unicode mode decodes this to a multi-byte character instead of the single byte \266 (0xB6). If you want to ensure that a single byte is always decoded, even in gawk unicode mode, this should do the trick:
echo '\666' |
LC_ALL='en_US.UTF-8' gawk -e '$++NF = sprintf("<( %c )>",
strtonum((_=_<_) substr($++_, ++_)) + _*++_^_++*_^++_)'
0000000 909522524 539507744 1042882742 10
\ 6 6 6 < ( 266 ) > \n
134 066 066 066 040 074 050 040 266 040 051 076 012
\ 6 6 6 sp < ( sp ? sp ) > nl
92 54 54 54 32 60 40 32 182 32 41 62 10
5c 36 36 36 20 3c 28 20 b6 20 29 3e 0a
0000015
long story short : add 4^5 * 54 to output of strtonum(), which happens to be 0xD800, the starting point of UTF-16 surrogates
=================== =================== ===================
One quick note about @Gene's proposed perl-based solution:
echo 'abc \555 456' | perl -p -e 's/\\([0-7]{3})/chr(oct($1))/ge'
Wide character in print at -e line 1, <> line 1.
abc ŭ 456
octal codes wrap around, meaning \4xx = \0xx ; \6xx = \2xx etc :
printf '\n %s\n' $'\555'
m
so perl is incorrectly decoding these as multi-byte characters, when in fact \555, as confirmed by printf, is merely lowercase "m" (0x6D)
ps : my perl is version 5.34
I came up with my own POSIX awk solution, so I'm posting it here for reference.
The main idea is to build a hash that translates an octal escape sequence to its corresponding character. You can then use it while splitting the line during the search for escape sequences:
LANG=C awk '
BEGIN {
for ( i = 1; i <= 255; i++ )
tr[ sprintf("\\%03o",i) ] = sprintf("%c",i)
}
{
remainder = $0
while ( match(remainder, /\\[0-7]{3}/) ) {
printf("%s%s", \
substr(remainder, 1, RSTART-1), \
tr[ substr(remainder, RSTART, RLENGTH) ] \
)
remainder = substr(remainder, RSTART + RLENGTH)
}
print remainder
}
' input.txt
backslash `\`
single quote `'` and double quote `"`
linefeed `
` and carriage return `
%s &
etc...
This separate post is made specifically to showcase how to extend the octal lookup reference tables in gawk unicode mode to all 256 bytes without external dependencies or warning messages:
ASCII bytes reside in table o2bL
8-bit bytes reside in table o2bH
# gawk profile, created Fri Sep 16 09:53:26 2022
'BEGIN {
1 makeOctalRefTables(PROCINFO["sorted_in"] = "@val_str_asc" \
(ORS = ""))
128 for (_ in o2bL) {
128 print o2bL[_]
}
128 for (_ in o2bH) {
128 print o2bH[_]
}
}
function makeOctalRefTables(_,__,___,____)
{
1 _=__=___=____=""
for (_ in o2bL) {
break
}
1 if (!(_ in o2bL)) {
1 ____=_+=((_+=_^=_<_)-+-++_)^_--
128 do { o2bL[sprintf("\\%o",_)] = \
sprintf("""%c",_)
} while (_--)
1 o2bL["\\" ((_+=(_+=_^=_<_)+_)*_--+_+_)] = "\\&"
1 ___=--_*_^_--*--_*++_^_*(_^=++_)^(! --_)
128 do { o2bH[sprintf("\\%o", +_)] = \
sprintf("%c",___+_)
} while (____<--_)
}
1 return length(o2bL) ":" length(o2bH)
}'
|
\0 \1 \2 \3 \4 \5 \6 \7 \10\11 \12
\13
\14
\16 \17
\20 \21 \22 \23 \24 \25 \26 \27 \30 \31 \32 \33 34 \35 \36 \37
\40 \41 !\42 "\43 #\44 $\45 %\47 '\50 (\51 )\52 *\53 +\54 ,\55 -\56 .\57 /
\60 0\61 1\62 2\63 3\64 4\65 5\66 6\67 7\70 8\71 9\72 :\73 ;\74 <\75 =\76 >\77 ?
\100 @\101 A\102 B\103 C\104 D\105 E\106 F\107 G\110 H\111 I\112 J\113 K\114 L\115 M\116 N\117 O
\120 P\121 Q\122 R\123 S\124 T\125 U\126 V\127 W\130 X\131 Y\132 Z\133 [\134 \\46 \&\135 ]\136 ^\137 _
\140 `\141 a\142 b\143 c\144 d\145 e\146 f\147 g\150 h\151 i\152 j\153 k\154 l\155 m\156 n\157 o
\160 p\161 q\162 r\163 s\164 t\165 u\166 v\167 w\170 x\171 y\172 z\173 {\174 |\175 }\176 ~\177
\200 ?\201 ?\202 ?\203 ?\204 ?\205 ?\206 ?\207 ?\210 ?\211 ?\212 ?\213 ?\214 ?\215 ?\216 ?\217 ?
\220 ?\221 ?\222 ?\223 ?\224 ?\225 ?\226 ?\227 ?\230 ?\231 ?\232 ?\233 ?\234 ?\235 ?\236 ?\237 ?
\240 ?\241 ?\242 ?\243 ?\244 ?\245 ?\246 ?\247 ?\250 ?\251 ?\252 ?\253 ?\254 ?\255 ?\256 ?\257 ?
\260 ?\261 ?\262 ?\263 ?\264 ?\265 ?\266 ?\267 ?\270 ?\271 ?\272 ?\273 ?\274 ?\275 ?\276 ?\277 ?
\300 ?\301 ?\302 ?\303 ?\304 ?\305 ?\306 ?\307 ?\310 ?\311 ?\312 ?\313 ?\314 ?\315 ?\316 ?\317 ?
\320 ?\321 ?\322 ?\323 ?\324 ?\325 ?\326 ?\327 ?\330 ?\331 ?\332 ?\333 ?\334 ?\335 ?\336 ?\337 ?
\340 ?\341 ?\342 ?\343 ?\344 ?\345 ?\346 ?\347 ?\350 ?\351 ?\352 ?\353 ?\354 ?\355 ?\356 ?\357 ?
\360 ?\361 ?\362 ?\363 ?\364 ?\365 ?\366 ?\367 ?\370 ?\371 ?\372 ?\373 ?\374 ?\375 ?\376 ?\377 ?

Compare two numerical ranges in two distinct files with awk and print ALL lines from file1 and the matching ones from file2

This new question is a follow-up to a recent question: Compare two numerical ranges in two distinct files with awk. The proposed solution worked perfectly but was not practical for downstream analysis (a flaw in how I framed my question, not in the solution).
I have a file1 with 3 columns. Columns 2 and 3 define a numerical range. The data are sorted in ascending order by column 2. Numerical ranges never overlap.
file1
S 24 96
S 126 352
S 385 465
S 548 600
S 621 707
S 724 736
I have a second file2 (test) structured similarly.
file2
S 27 93
S 123 348
S 542 584
S 726 740
S 1014 2540
S 12652 12987
Desired output: print ALL lines from file1 and, next to them, the lines of file2 whose numerical ranges overlap (even partially) those of file1. If no range from file2 overlaps a range of file1, print zero next to the range of file1.
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736
Based on @EdMorton's answer to the previous question, I modified the print command of the tst.awk script to add these new features. In addition, I changed the order of the files from file1/file2 to file2/file1 so that all the lines from file1 are printed (whether or not there is a match in the second file).
'NR == FNR {
begs2ends[$2] = $3
next
}
{
for (beg in begs2ends) {
end = begs2ends[beg] + 0
beg += 0
if ( ( ($2 >= beg) && ($2 <= end) ) ||
( ($3 >= beg) && ($3 <= end) ) ||
( ($2 <= beg) && ($3 >= end) ) ) {
print $0,"\t",$1,"\t",beg,"\t",end
else
print $0,"\t","0"
next
}
}
}
Note: $1 is identical in file1 and file2, which is why I used print ... $1 to make it appear. I have no idea how to print it from file2 rather than file1 (if I understand correctly, this $1 refers to file1).
I launch the analysis with awk -f tst.awk file2 file1
The script does not accept the else and I don't understand why. I assume it is linked to the loop, but I tried several changes without any success.
Thanks if you can help me with this.
Assumptions:
a range from file1 can only overlap with one range from file2
The current code is almost correct; it just needs some work on the placement of the braces (consistent indentation helps):
awk '
BEGIN { OFS="\t" } # output field delimiter is "\t"
NR == FNR { begs2ends[$2] = $3; next }
{
# $1=$1 # uncomment to have current line ($0) reformatted with "\t" delimiters during print
for (beg in begs2ends) {
end = begs2ends[beg] + 0
beg += 0
if ( ( ($2 >= beg) && ($2 <= end) ) ||
( ($3 >= beg) && ($3 <= end) ) ||
( ($2 <= beg) && ($3 >= end) ) ) {
print $0,$1,beg,end # spacing within $0 unchanged, 3 new fields prefaced with "\t"
next
}
}
# if we get this far it is because we have exhausted the "for" loop
# (ie, found no overlaps) so print current line + "0"
print $0,"0" # spacing within $0 unchanged, 1 new field prefaced with "\t"
}
' file2 file1
This generates:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
With the $1=$1 line uncommented the output becomes:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
S 900 1000 S 901 905
A slight variation on @markp-fuso's answer.
Works with GNU awk; saved as overlaps.awk:
BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR {
line[FNR] = $0
lo[FNR] = $2
hi[FNR] = $3
next
}
{
overlap = "0"
for (i in line) {
if (in_range(lo[i], $2, $3) || in_range(hi[i], $2, $3)) {
overlap = line[i]
delete line[i]
break
}
}
print $0, overlap
}
Then
gawk -f overlaps.awk file2 file1 | column -t
outputs
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
$ cat tst.awk
BEGIN { OFS="\t" }
NR == FNR {
ranges[++numRanges] = $0
next
}
{
overlapped = 0
for ( i=1; i<=numRanges; i++ ) {
range = ranges[i]
split(range,vals)
beg = vals[2]+0
end = vals[3]+0
if ( ( ($2 >= beg) && ($2 <= end) ) ||
( ($3 >= beg) && ($3 <= end) ) ||
( ($2 <= beg) && ($3 >= end) ) ) {
overlapped = 1
break
}
}
if ( overlapped ) {
print $0, range, sprintf("* %d-%d overlaps with %d-%d", beg, end, $2, $3)
}
else {
print $0, 0, sprintf("* nothing in %s overlaps with this range", ARGV[1])
}
}
$ awk -f tst.awk file2 file1 | column -s$'\t' -t
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736
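The three-clause overlap test used in the scripts above can be collapsed into one comparison: two closed ranges overlap exactly when each one starts no later than the other ends. A minimal sketch under that observation follows. The function name overlaps, the array names and the sample rows are made up for illustration, and it assumes, as the answers above do, that at most one file2 range overlaps each file1 range:

```shell
# usage: overlaps file2 file1
overlaps() {
    awk '
    BEGIN { OFS = "\t" }
    # First file: remember every range and its original line.
    NR == FNR { beg[++n] = $2 + 0; end[n] = $3 + 0; line[n] = $0; next }
    # Second file: report the first stored range that overlaps, else 0.
    {
        hit = 0
        for (i = 1; i <= n; i++)
            if ($2 <= end[i] && $3 >= beg[i]) { hit = line[i]; break }
        print $0, hit
    }' "$1" "$2"
}

# Hypothetical sample; the second file2 range fully contains a file1 range.
tmp=$(mktemp -d)
printf '%s\n' 'S 24 96' 'S 385 465' 'S 548 600' > "$tmp/file1"
printf '%s\n' 'S 27 93' 'S 500 700' > "$tmp/file2"
overlaps "$tmp/file2" "$tmp/file1"
```

The single comparison also covers the case where the file2 range fully contains the file1 range, as with 500-700 and 548-600 in the sample.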

AWK/SED/getline - How to simplify/improve this example?

I'm trying to take a 3 column input file and separate it based on a condition in column 3. I think it'll be easier to show you than explain:
Input File:
outputfile1.txt
26 NCC 1 # First Start
38 NME 2
44 NSC 1 # Start2
56 NME 2
62 NCC 1 # Start3
...
314 NCC 1 # Start17
326 NME 2
332 NSC 1 # Start18
344 NME 2
349 NME 2 # Final End
(The hashed comments aren't part of the file, I've added to make things clearer).
Column 3 is used to determine a new "START" entry
"START/END" values are from Column 1
"TITLE" I would like to be all values from Column 2 between consecutive "STARTS"
Desired Output
outputfile2.txt
START=26 ; END=43 ; TITLE=NCC_NME
START=44 ; END=61 ; TITLE=NSC_NME
START=62 ; END=79 ; TITLE=NCC_...
...
START=314 ; END=331 ; TITLE=NCC_NME
START=332 ; END=349 ; TITLE=NSC_NME
Crude script that 'almost' does this but makes 5 single column temporary files in the process.
awk '{ print $1 }' outputfile1.txt | sed '$d' > tempfile1.txt
awk '{ print $1-1 }' outputfile1.txt | sed '$d' > tempfile2.txt
sed '$d' outputfile1.txt | awk 'NR{print $3-p}{p=$3}' > tempfile3.txt
awk ' { getline value < "tempfile1.txt" }
{ if (NR==1)
print value ;
else if( $1 != 1 )
print value }' tempfile3.txt > tempfile4.txt
awk ' { getline value < "tempfile2.txt" }
{ if (NR==1)
print value ;
else if ( $1 != 1 )
print value }' tempfile3.txt | sed '1d' > tempfile5.txt
awk 'END{print $1}' outputfile1.txt >> tempfile5.txt
awk ' { getline value < "tempfile5.txt" }
{print "START="$0 " ; END="value}' tempfile4.txt > outputfile2.txt
Contents of temp files
| temp1 temp2 temp3
NR=1 | 26 25 1
NR=2 | 38 37 1
NR=3 | 44 43 -1
NR=4 | 56 55 1
NR=5 | 62 61 -1
... | ... ... ...
NR=33 | 314 313 -1
NR=34 | 326 325 1
NR=35 | 332 331 -1
NR=36 | 344 343 1
----------------------------------
| temp4 temp5
NR=1 | 26 43
NR=2 | 44 61
NR=3 | 62 79
... | ... ...
NR=17 | 314 331
NR=18 | 332 349
Current output
outputfile2.txt
START=26 ; END=43
START=44 ; END=61
START=62 ; END=79
...
START=314 ; END=331
START=332 ; END=349
Try:
awk '
function print_range() {
printf "START=%s ; END=%s ; TITLE=%s\n", start, end-1, title
}
{
end=$1
}
# if column 3 is equal to 1, then there is a new start
$3==1 {
if(title) print_range()
start=$1
title=$2
next
}
# if the label in field 2 is not part of the title then add it
title!~"(^|_)" $2 "(_|$)" {
title=title"_"$2
}
END {
end++
print_range()
}
' file
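To see how the script above behaves, here is a sketch that wraps the same logic in a shell function and feeds it a short made-up sample in which the last block repeats the NME label. The function name ranges and the sample rows are assumptions for illustration:

```shell
# Read "position label flag" lines on stdin and print one line per block.
ranges() {
    awk '
    function print_range() {
        printf "START=%s ; END=%s ; TITLE=%s\n", start, end - 1, title
    }
    { end = $1 }                      # remember the latest position
    $3 == 1 {                         # flag 1 opens a new block
        if (title) print_range()      # close the previous block, if any
        start = $1; title = $2
        next
    }
    # Append the label only if it is not already one of the _-separated parts.
    title !~ ("(^|_)" $2 "(_|$)") { title = title "_" $2 }
    END { end++; print_range() }      # the last block ends at the last position
    '
}

printf '%s\n' '26 NCC 1' '38 NME 2' '44 NSC 1' '56 NME 2' '60 NME 2' | ranges
```

On this sample it should print START=26 ; END=43 ; TITLE=NCC_NME followed by START=44 ; END=60 ; TITLE=NSC_NME. The dynamic regexp is what keeps the repeated NME from being appended twice.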
You can do everything in one go using:
awk '{
if(NR==1){
# if we are the first record we initialize our variables
PREVIOUS_ONE=$1
TITLE=$2
PREVIOUS_THIRD=$3
} else {
# as long as the new third column is larger we update our variables
if(PREVIOUS_THIRD < $3) {
TITLE=TITLE"_"$2
PREVIOUS_THIRD=$3
} else {
# this means the third column was smaller
# we print out the data and reinitialize our variables
print "START="PREVIOUS_ONE" ; END="$1-1" ; TITLE= "TITLE;
PREVIOUS_ONE=$1
TITLE=$2
PREVIOUS_THIRD=$3
}
}
}' outputfile1.txt

compare file and print class

I have
file1:
id position
a1 21
a1 39
a1 77
b1 88
b1 122
c1 22
file 2
id class position1 position2
a1 Xfact 1 40
a1 Xred 41 66
a1 xbreak 69 89
b1 Xbreak 77 133
b1 Xred 140 199
c1 Xfact 1 15
c1 Xbreak 19 35
I want something like this
output:
id position class
a1 21 Xfact
a1 39 Xfact
a1 77 Xbreak
b1 88 Xbreak
b1 122 Xbreak
c1 22 Xbreak
I need a simple awk script that prints id and position from file1 and compares each position with the positions in file2. If the position from file1 lies within the range from position1 to position2 in file2, it should print the corresponding class.
One way using awk. It's not a simple script. The process, in short: the key is the variable 'all_ranges'. While it is unset, the script reads from the file of ranges and saves its data; once it is set, the script stops reading ranges and processes the current line from the 'id-position' file, checking its position against the data in the array and printing when it falls within a range. I've tried to avoid processing the file of ranges many times and to read it in chunks instead, which made the script more complex.
EDIT to add that I assume the id field is sorted in both files. Otherwise this script will fail miserably and you will need another approach.
Content of script.awk:
BEGIN {
## Arguments:
## ARGV[0] = awk
## ARGV[1] = <first_input_argument>
## ARGV[2] = <second_input_argument>
## ARGC = 3
f2 = ARGV[ --ARGC ];
all_ranges = 0
## Read first line from file with ranges to get 'class' header.
getline line <f2
split( line, fields )
class_header = fields[2];
}
## Special case for the header.
FNR == 1 {
printf "%s\t%s\n", $0, class_header;
next;
}
## Data.
FNR > 1 {
while ( 1 ) {
if ( ! all_ranges ) {
## Read line from file with range positions.
ret = getline line <f2
## Check error.
if ( ret == -1 ) {
printf "%s\n", "ERROR: " ERRNO
close( f2 );
exit 1;
}
## Check end of file.
if ( ret == 0 ) {
break;
}
## Split line in spaces.
num = split( line, fields )
if ( num != 4 ) {
printf "%s\n", "ERROR: Bad format of file " f2;
exit 2;
}
range_id = fields[1];
if ( $1 == fields[1] ) {
ranges[ fields[3], fields[4] ] = fields[2];
continue;
}
else {
all_ranges = 1
}
}
if ( range_id == $1 ) {
delete ranges;
ranges[ fields[3], fields[4] ] = fields[2];
all_ranges = 0;
continue;
}
for ( range in ranges ) {
split( range, pos, SUBSEP )
if ( $2 >= pos[1] && $2 <= pos[2] ) {
printf "%s\t%s\n", $0, ranges[ range ];
break;
}
}
break;
}
}
END {
for ( range in ranges ) {
split( range, pos, SUBSEP )
if ( $2 >= pos[1] && $2 <= pos[2] ) {
printf "%s\t%s\n", $0, ranges[ range ];
break;
}
}
}
Run it like:
awk -f script.awk file1 file2 | column -t
With following result:
id position class
a1 21 Xfact
a1 39 Xfact
a1 77 xbreak
b1 88 Xbreak
b1 122 Xbreak
c1 22 Xbreak
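If the range file is small enough to hold in memory, a simpler sketch is possible that does not depend on the ids being sorted. This is an alternative to the script above, not a rewrite of it. The function name classify and the array names are made up for illustration, and note that the range file is passed first:

```shell
# usage: classify file2 file1   (range file first, position file second)
classify() {
    awk '
    # Header lines: skip the one in the range file, extend the other one.
    FNR == 1 { if (NR != FNR) print $1, $2, "class"; next }
    # Range file: store every range under its id.
    NR == FNR {
        n = ++cnt[$1]
        lo[$1, n] = $3; hi[$1, n] = $4; cls[$1, n] = $2
        next
    }
    # Position file: print the class of the first range containing the position.
    {
        for (i = 1; i <= cnt[$1]; i++)
            if ($2 + 0 >= lo[$1, i] + 0 && $2 + 0 <= hi[$1, i] + 0) {
                print $1, $2, cls[$1, i]
                break
            }
    }' "$1" "$2"
}

# Hypothetical subset of the data from the question.
tmp=$(mktemp -d)
printf '%s\n' 'id position' 'a1 21' 'a1 77' 'c1 22' > "$tmp/file1"
printf '%s\n' 'id class position1 position2' 'a1 Xfact 1 40' 'a1 xbreak 69 89' 'c1 Xbreak 19 35' > "$tmp/file2"
classify "$tmp/file2" "$tmp/file1"
```

In this sketch, positions that fall in no range for their id are silently dropped. Whether that is the desired behavior is not stated in the question.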

Get Ascii Code?

To retrieve the ASCII codes of all characters in the 13th column of a file, I wrote this script:
awk -v ch="'" '{
for (i=1;i<=length(substr($13,6,length($13)));i++)
{cmd = printf \"%d\\n\" \"" ch substr(substr($13,6,length($13)),i,1) "\"" cmd | getline output close(cmd) ;
Number= Number " " output
}
print Number ; Number=""
}' ~/a.test
but it doesn't work properly! It works fine for a while and then produces weird results.
As an example, for this input (assume it's the 13th column)
CQ:Z:%8%%%%0%%%%9%%%%:%%%%%%%%%%%%%%%%%%
I should get this
37 56 37 37 37 37 48 37 37 37 37 57 37 37 37 37 58 37 37 37 37 ...............
But I get this
37 56 37 37 37 37 48 48 48 48 48 57 57 57 57 57 58 58 58 58 58 ...............
As you can see, the first miscomputation appears after the character "0" (48 in the result).
Do you know which part of my code is responsible for this error?
Try this:
awk '{
str = substr($13, 6)
for (i=1; i<=length(str); i++) {
cmd = "printf %d \42\47" substr(str, i, 1) "\42"
cmd | getline output
close(cmd)
Number= Number " " output
}
print Number
Number=""
}' ~/a.test
\42 is " and \47 is ', so this runs printf %d "'${char}" in the shell for each ${char}, which triggers evaluation as a C constant with the POSIX extension dictating a numeric value as noted in the final bullet of the POSIX printf definition's §Extended Description.
N.B. The formatting matters!
Don't try to squeeze the code unless you know exactly what you're doing!
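The shell-side conversion that the awk script above delegates to can be checked on its own. A small sketch, using three characters from the sample input:

```shell
# A leading quote in the argument makes printf %d yield the character's code.
printf '%d\n' "'%"    # prints 37
printf '%d\n' "'8"    # prints 56
printf '%d\n' "'0"    # prints 48
```

Each call here corresponds to one cmd string built inside the awk loop, which is why that approach spawns one shell per character.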
And a pure awk solution (I took the ord/chr functions directly from the manual):
printf '%s\n' 'CQ:Z:%8%%%%0%%%%9%%%%:%%%%%%%%%%%%%%%%%%'|
awk 'BEGIN { _ord_init() }
{
str = substr($0, 6)
for (i = 0; ++i <= length(str);)
printf "%s", (ord(substr(str, i, 1)) (i < length(str) ? OFS : ORS))
}
func _ord_init( low, high, i, t) {
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
}
else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
}
else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
func ord(str, c) {
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}
func chr(c) {
# force c to be numeric by adding 0
return sprintf("%c", c + 0)
}'
This might work for you:
awk -vSQ="'" -vDQ='"' '{args=space="";n=split($13,a,"");for(i=1;i<=n;i++){args=args space DQ SQ a[i] DQ;format=format space "%d";space=" "};format=DQ format "\\n" DQ;system("printf " format " " args)}'