unknown comparison failure occurs with xtext formatter - formatting

I'm customizing the Xtext formatter of my DSL, and whilst testing I get a weird comparison failure I don't understand.
Here's the relevant part of my grammar :
EisModel:
'project' '=' project_name=STRING ';'
'plcname' '=' plc_name=STRING ';'
'author' '=' author_name=STRING ';'
testcases+=Testcase*;
Testcase:
"testcase" testcase_name=ID '{'
testblock=Testblock?
'}';
Testblock:
'testActive' '=' testActive=BoolConstant ';'
'blockType' '=' blockType=BlockConstant ';'
'description' '=' description=STRING ';'
define=DefineBlock?;
BoolConstant:
value=('true' | 'false');
BlockConstant:
value=('FC' | 'FB');
I assume the comparison failure I get has something to do with a terminal rule, since I am not doing anything extraordinary in the formatter.
This is the expected code of the JUnit failure trace:
74 4 S "true" BoolConstant:value='true'
78 0 H
78 1 S ";" Testblock:(';' )
79 2 H "\n\t" Whitespace:TerminalRule'WS'
81 9 S "blockType" Testblock:'blockType'
90 1 H " " Whitespace:TerminalRule'WS'
91 1 S "=" Testblock:( '=' )
92 1 H " " Whitespace:TerminalRule'WS'
93 2 S "FC" BlockConstant:value='FC'
And this the actual code:
B BoolConstant Testblock:testActive=BoolConstant path:Testblock/testActive=Testcase/testblock=EisModel/testcases[0]
74 4 S "true" BoolConstant:value='true'
E BoolConstant Testblock:testActive=BoolConstant path:Testblock/testActive=Testcase/testblock=EisModel/testcases[0]
78 0 H
78 1 S ";" Testblock:(';' )
79 2 H "\n\t" Whitespace:TerminalRule'WS'
81 9 S "blockType" Testblock:'blockType'
90 1 H " " Whitespace:TerminalRule'WS'
91 1 S "=" Testblock:( '=' )
92 1 H " " Whitespace:TerminalRule'WS'
B BlockConstant Testblock:blockType=BlockConstant path:Testblock/blockType=Testcase/testblock=EisModel/testcases[0]
93 2 S "FC" BlockConstant:value='FC'
E BlockConstant Testblock:blockType=BlockConstant path:Testblock/blockType=Testcase/testblock=EisModel/testcases[0]
The difference revolves around the lines 74 and 93.
And I don't know what is going wrong or even where I could tweak anything.
Could anyone please help?
Here's the test:
@Test def void testTestblock() {
assertFormatted[
toBeFormatted = '''
project="proj";plcname="name";author="Bob";
testcase One {testActive = true ; blockType = FC ;
description = "string" ; }
'''
expectation = '''
project = "proj";
plcname = "name";
author = "Bob";
testcase One {
testActive = true;
blockType = FC;
description = "string";
}
'''
]
}
The bug even occurs if I comment out my code in the formatter class which extends AbstractFormatter2, so I'll omit that here.

This sounds like a bug to me. Please report it at https://github.com/eclipse/xtext-core
workaround:
BoolConstant:
value=BooleanValue;
BlockConstant:
value=BlockValue;
BlockValue:"FC"|"FB";
BooleanValue: "true"|"false";

Here's a little bit more of the grammar:
EisModel:
'project' '=' project_name=STRING ';'
'plcname' '=' plc_name=STRING ';'
'author' '=' author_name=STRING ';'
testcases+=Testcase*;
Testcase:
"testcase" testcase_name=ID '{'
testblock=Testblock?
'}';
Testblock:
'testActive' '=' testActive=BoolConstant ';'
'blockType' '=' blockType=BlockConstant ';'
'description' '=' description=STRING ';'
define=DefineBlock?;
BoolConstant:
value=BooleanValue;
BlockConstant:
value=BlockValue;
BlockValue:
'FC' | 'FB';
BooleanValue:
'true' | 'false';
DefineBlock:
'define' '{' direction=DirectionBlock '}' teststeps+=TeststepBlock*;
DirectionBlock:
input=Input & inout=InOut? & output=Output;
Input:
name='input' '[' inputVariables+=Variables* ']';
Output:
name='output' '[' outputVariables+=Variables* ']';
InOut:
name='inout' '[' inoutVariables+=Variables* ']';
And here another comparison failure.
expected:
123 5 S "input" Input:name='input'
128 0 H
128 1 S "[" Input:'['
129 0 H
129 1 S "]" Input:']'
130 0 H
130 6 S "output" Output:name='output'
136 0 H
136 1 S "[" Output:'['
137 0 H
137 1 S "]" Output:']'
actual:
B Input'input' DirectionBlock:input=Input path:DirectionBlock/input=DefineBlock/direction=Testblock/define=Testcase/testblock=EisModel/testcases[0]
123 5 S "input" Input:name='input'
128 0 H
128 1 S "[" Input:'['
129 0 H
129 1 S "]" Input:']'
E Input'input' DirectionBlock:input=Input path:DirectionBlock/input=DefineBlock/direction=Testblock/define=Testcase/testblock=EisModel/testcases[0]
130 0 H
B Output'output' DirectionBlock:output=Output path:DirectionBlock/output=DefineBlock/direction=Testblock/define=Testcase/testblock=EisModel/testcases[0]
130 6 S "output" Output:name='output'
136 0 H
136 1 S "[" Output:'['
137 0 H
137 1 S "]" Output:']'
E Output'output' DirectionBlock:output=Output path:DirectionBlock/output=DefineBlock/direction=Testblock/define=Testcase/testblock=EisModel/testcases[0]
The differences are now around the lines 123-129 and 130-137.
Edit
After activating the formatter within the editor, I received an error message which I don't understand. Maybe someone else does:
Message:
Unhandled event loop exception
Exception Stack Trace:
java.lang.StackOverflowError
at com.google.common.collect.RegularImmutableMap.get(RegularImmutableMap.java:123)
at com.google.common.collect.RegularImmutableMap.get(RegularImmutableMap.java:115)
at org.eclipse.xtext.formatting2.regionaccess.internal.NodeModelBasedRegionAccess.regionForEObject(NodeModelBasedRegionAccess.java:49)
at org.eclipse.xtext.formatting2.regionaccess.internal.NodeModelBasedRegionAccess.regionForEObject(NodeModelBasedRegionAccess.java:22)
at org.eclipse.xtext.formatting2.AbstractFormatter2.isInRequestedRange(AbstractFormatter2.java:358)
at org.eclipse.xtext.formatting2.AbstractFormatter2.shouldFormat(AbstractFormatter2.java:423)
at org.eclipse.xtext.formatting2.internal.FormattableDocument.format(FormattableDocument.java:186)
at org.example.eis.formatting2.EisFormatter._format(EisFormatter.java:224)
at org.example.eis.formatting2.EisFormatter.format(EisFormatter.java:346)
A look at the java classes:
223 protected void _format(final DirectionBlock directionblock, @Extension final IFormattableDocument document) {
224 document.<DirectionBlock>format(directionblock);
225 }


Replacing sequences of space-delimited numbers in huge input with awk

How can I replace sequences of space-delimited numbers when those sequences span multiple lines and the input is too big to fit in RAM.
A sample input would be:
edit: I re-worked the sample input and input parameters for introducing border cases (excluding ones that have to do with the length of the matched sequence or replacement priorities)
3 12 3 4
0 6 7 10
8 9 12 3
4 6 7 8
10 6 6 7
9 199 10 11
11
note: the number of fields per line is homogeneous but not known in advance; the last line might contain fewer fields
From that input I would like to:
replace 3 4 with &
replace 6 7 8 with 9 9
replace 6 7 9 with 8 8
replace 7 10 with 11 12
replace 0 with nothing
replace 10 with 13 10
replace 8 9 12 3 5 with #
The expected output would have one number or replacement per line:
3
12
&
6
11 12
8
9
12
&
9 9
13 10
6
8 8
199
13 10
11
11
I'm trying to do the task with awk but I'm having a hard time implementing a dynamic state machine with a pseudo B-Tree:
tr -s '[:space:]' '\n' < input.txt |
awk '
BEGIN {
for (i = 2; i < ARGC; i += 2) {
n = split(ARGV[i], arr)
k = ""
for (j = 1; j <= n; j++) {
k = j SUBSEP k SUBSEP arr[j]
Tree[k]
}
Tree[k] = "$" ARGV[i+1] #=> now can test "if (Tree[k])"
delete ARGV[i]
delete ARGV[i+1]
}
}
{
Key = (int(Key) + 1) SUBSEP Key SUBSEP $1
if ( Key in Tree ) {
if (Tree[Key]) {
print substr(Tree[Key],2)
Buffer = ""
Key = ""
}
else
Buffer = Buffer $1 "\n"
} else {
print Buffer $1
Buffer = ""
Key = ""
}
}
END { if (Buffer != "") printf ("%s", Buffer) }
' - \
'3 4' '&' \
'6 7 8' '9 9' \
'6 7 9' '8 8' \
'7 10' '11 12' \
'0' '' \
'10' '13 10' \
'8 9 12 3 5' '#'
edit: I realised that the code doesn't backtrack after failing to find a complete match in the B-tree, so it's wrong...
How I'm planning to tackle the problem
I'm emulating a B-tree with an array and keys in the following format:
from the middle to the left of the key are the consecutive depths
from the middle to the right of the key are the consecutive values
When a key exists in Tree:
if it doesn't have an associated value then it's a node
if there's a value then it's a leaf
So, for the current input parameters, the content of the Tree array will be:
# from param: "3 4" => "&"
Tree[ 1,"",3 ]
Tree[2,1,"",3,4] = "$&"
# from param: "6 7 8" => "9 9"
Tree[ 1,"",6 ]
Tree[ 2,1,"",6,7 ]
Tree[3,2,1,"",6,7,8] = "$9 9"
# from param: "6 7 9" => "8 8"
Tree[ 1,"",6 ]
Tree[ 2,1,"",6,7 ]
Tree[3,2,1,"",6,7,9] = "$8 8"
# from param: "7 10" => "11 12"
Tree[ 1,"",7 ]
Tree[2,1,"",7,10] = "$11 12"
# from param: "0" => ""
Tree[1,"",0] = "$"
# from param: "10" => "13 10"
Tree[1,"",10] = "$13 10"
# from param: "8 9 12 3 5" => "#"
Tree[ 1,"",8 ]
Tree[ 2,1,"",8,9 ]
Tree[ 3,2,1,"",8,9,12 ]
Tree[ 4,3,2,1,"",8,9,12,3 ]
Tree[5,4,3,2,1,"",8,9,12,3,5] = "$#"
FWIW I'd approach this by figuring out the max number of records that you might need to search in based on the mappings you want, keep a rolling buffer of that number of records, and then do the comparison part on each buffer, e.g.:
$ cat tst.awk
BEGIN {
RS = "[[:space:]]+"
map("3,4" , "&")
map("6,7,8" , "9")
map("9" , "")
map("0" , "\\000")
map("13,10" , "10")
}
{ buf[((NR-1) % maxRecs) + 1] = $0 }
NR >= maxRecs { prt() }
END { prt() }
function prt( nr,sep,str) {
for ( nr=NR-maxRecs+1; nr<=NR; nr++ ) {
str = str sep buf[((nr-1) % maxRecs) + 1]
sep = ORS
}
print ">>>>" ORS str ORS "<<<<"
# Replace the above with something that loops through the
# strings you want replaced, e.g.
#
# for ( mapNr=1; mapNr<=numMaps; mapNr++ ) {
# old = olds[mapNr]
# if ( str ~ old ) { # add something to avoid partial matches
# new = news[mapNr]
# replace old with new in the output
# }
# }
}
function map(old,new, numRecs) {
++numMaps
numRecs = gsub(/,/,ORS,old) + 1
maxRecs = ( numRecs > maxRecs ? numRecs : maxRecs )
olds[numMaps] = old
news[numMaps] = new
}
$ awk -f tst.awk file
>>>>
112
3
4
<<<<
>>>>
3
4
6
<<<<
>>>>
4
6
7
<<<<
>>>>
6
7
8
<<<<
>>>>
7
8
9
<<<<
>>>>
8
9
12
<<<<
>>>>
9
12
0
<<<<
>>>>
12
0
3
<<<<
>>>>
0
3
4
<<<<
>>>>
3
4
15
<<<<
>>>>
4
15
255
<<<<
>>>>
15
255
13
<<<<
>>>>
255
13
10
<<<<
>>>>
13
10
6
<<<<
>>>>
10
6
7
<<<<
>>>>
6
7
8
<<<<
>>>>
7
8
199
<<<<
>>>>
8
199
9
<<<<
>>>>
199
9
0
<<<<
>>>>
9
0
13
<<<<
>>>>
9
0
13
<<<<
The above just prints the buffer-sized strings. The part still to be added is replacing the target strings with the new ones in such a way that the next target doesn't match the replaced part. That is a common problem with, I expect, lots of solutions online, so it's left as an exercise.
You'll also need to tweak it to make sure it doesn't revisit lines at the end of the input.
The above uses GNU awk for multi-char RS, if you don't have GNU awk then just pipe the input from tr -s '[:space:]' '\n' as shown in the question.
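One possible way to fill in that exercise, offered as a sketch rather than a definitive solution: keep no more tokens in memory than the longest pattern needs, try the longest candidate first at the current position, and print replacements directly so they are never rescanned. The function name replace_seqs and the "longest pattern wins" rule are assumptions made for illustration (the question leaves replacement priorities open). It sticks to POSIX awk features and takes the pattern/replacement pairs as arguments, as in the question.

```sh
# Sketch: stream tokens through a bounded buffer and replace sequences.
# Usage: replace_seqs 'pattern tokens' 'replacement' ... < input
replace_seqs() {
    tr -s '[:space:]' '\n' |
    awk '
    BEGIN {
        # arguments come in pairs: pattern, replacement
        for (i = 1; i + 1 < ARGC; i += 2) {
            n = split(ARGV[i], t, " ")
            if (n == 0) continue
            key = t[1]
            for (j = 2; j <= n; j++) key = key SUBSEP t[j]
            repl[key] = ARGV[i + 1]
            if (n > maxLen) maxLen = n
        }
        if (maxLen < 1) maxLen = 1
        ARGC = 1    # the arguments are not input files; read stdin
    }
    NF {
        buf[++tail] = $1
        # only decide once the longest pattern could fit in the buffer
        while (tail - head >= maxLen) emit()
    }
    END { while (head < tail) emit() }

    # Emit output for the token at the front of the buffer.
    function emit(    len, j, key) {
        for (len = maxLen; len >= 1; len--) {
            if (head + len > tail) continue
            key = buf[head + 1]
            for (j = 2; j <= len; j++) key = key SUBSEP buf[head + j]
            if (key in repl) {
                if (repl[key] != "") print repl[key]
                drop(len)
                return
            }
        }
        print buf[head + 1]
        drop(1)
    }
    # Forget the first n buffered tokens so memory stays bounded.
    function drop(n) { while (n-- > 0) delete buf[++head] }
    ' "$@"
}
```

Called as replace_seqs '3 4' '&' '6 7 8' '9 9' ... < input.txt, it prints one token or replacement per line. A pattern mapped to the empty string prints nothing at all.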
UPDATE:
The previous answer (see edit revisions) was woefully slow (several minutes) when run against a ramped-up input (7K mappings in map.txt; 25M tokens in input.txt).
The new answer (below) is a complete rewrite and processes the 7K mappings/25M tokens in ~45 seconds.
The main component of this design centers around a tree-like node structure used to manage the series of tokens (lines of input from map.txt):
tree [ParentNodeNbr] [token] [NodeType] = value
Where:
ParentNodeNbr == 0 for the root
token from map.txt
NodeType has one of two values 'node' or 'leaf'
for NodeType = 'node' the value stored in the array is a numeric node number (implemented as a counter that's incremented each time a new node is added to the tree); this node number becomes the ParentNodeNbr for the next token in the series
for NodeType = 'leaf' this designates the 'end' of a series of tokens (line of input from map.txt) and the value stored in the array is the line number (aka FNR) from map.txt; this line number (FNR) is used as an index into a couple of other arrays and to determine precedence when an input sequence (from input.txt) has multiple matches from map.txt
when processing a series of tokens from a map.txt line of input we start at ParentNodeNbr == 0 looking for a series of matching nodes, adding new nodes as needed
Setup: storing replacements in a comma-delimited file (map.txt), and adding one additional line to input.txt:
$ head map.txt input.txt
==> map.txt <==
2 3 4,X # "2 3 4" has precedence over ...
2 3,Y # "2 3"
3 4,&
6 7 8,9
9,
0,\000
13 10,10
==> input.txt <==
2 3 4 # keep an eye on "2 3" vs "2 3 4" precedence
112 3
4 6 7
8 9 12 0 3
4 15 255 13
10 6
7 8 199 9
0 13
NOTE: here's what tree[][][] looks like when populated from map.txt:
tree [Parent] [Token] [NodeType] = NodeVal
Parent Token NodeType NodeVal MapTo ** MapTo only applies to NodeType = leaf
====== ===== ======== ======= =====
0 0 leaf 6 "\000"
0 2 node 1
0 3 node 3
0 6 node 4
0 9 leaf 5 ""
0 13 node 6
1 3 node 2
1 3 leaf 1 "Y"
2 4 leaf 2 "X"
3 4 leaf 3 "&"
4 7 node 5
5 8 leaf 4 "9"
6 10 leaf 7 "10"
One GNU awk idea (GNU awk is needed for multidimensional arrays):
awk '
function replace(op) {
while ( ((maxToken - minToken + 1) >= maxlen) || op == "flush" ) {
NodeNbr=root
minOrd=maxOrd
for (j=0 ; j<maxlen; j++) { # loop through tokens in buffer[]
token=buffer[ ((minToken + j - 1) % maxlen) + 1 ]
# if we find a matching "leaf" node then keep track of the ordering (ie, FNR from map.txt; lower order == higher precedence)
if ( token in tree [NodeNbr] && "leaf" in tree[NodeNbr][token] )
minOrd= ( tree[NodeNbr][token]["leaf"] < minOrd ) ? tree[NodeNbr][token]["leaf"] : minOrd
# if we find a matching "node" node then grab the next node to compare against the next token from buffer[]
if ( token in tree[NodeNbr] && "node" in tree[NodeNbr][token] ) {
NodeNbr=tree[NodeNbr][token]["node"]
continue
}
break # if we get here we have a token from buffer[] that does not match any of our replacement mappings so abort checking rest of buffer[]
}
if (minOrd < maxOrd) { # if we found at least one complete match (ie, hit a "leaf" node) then ...
print map[minOrd] # use the associated "ord"er to print the associated replacement string and ...
minToken=minToken + len[minOrd] # update the pointer into the buffer[] array
}
else { # otherwise we did not find a match so ...
print buffer[ ((minToken - 1) % maxlen) + 1 ] # print the first token from buffer[] and ...
minToken++ # update the pointer into the buffer[] array
}
if (minToken > maxToken)
break
}
}
BEGIN { root=maxNodeNbr=maxToken=0
minToken=1
maxOrd=9999999999
}
FNR==NR { split($0,a,",")
map[FNR]=a[2] # save replacement string for this input line from map.txt
n=split(a[1],b) # break our matching pattern into tokens
len[FNR]=n # make note of number of tokens in this line of input
maxlen=(n > maxlen) ? n : maxlen # keep track of longest series of tokens
NodeNbr=root # initiate our tree search
for (i=1 ; i<=n ; i++) { # loop through our list of tokens
token=b[i]
if (i==n) # if the last token for this line then create a "leaf" node and store the line number (aka "order")
tree[NodeNbr][token]["leaf"]=FNR
else
if ( tree[NodeNbr][token]["node"] ) # else if we already have a node at this point in the tree then grab its associated node number for the next level in the tree
NodeNbr=tree[NodeNbr][token]["node"]
else { # else create a new "node" node and populate with the next available node number
tree[NodeNbr][token]["node"]=++maxNodeNbr
NodeNbr=maxNodeNbr # use this as the next level in our tree traversal
}
}
maxrec=FNR # keep track of total number of replacement sets from map.txt (only used if we decide to print the contents of map[] to stdout)
next
}
FNR==1 {
# Uncomment following to display the contents of the map[] array:
# for (i=1;i<=maxrec;i++)
# print "map:" i ":" map[i] ":"
#
# Uncomment following to display the contents of the tree[][][] array:
# fmt="%6s%8s%10s%10s%10s\n"
# printf "tree [Parent] [Token] [NodeType]\n\n"
# printf fmt, "Parent", "Token", "NodeType", "NodeVal", "MapTo"
# printf fmt, "======", "=====", "========", "=======", "====="
#
# for (NodeNbr=root ; NodeNbr<=maxNodeNbr ; NodeNbr++)
# for (token in tree[NodeNbr])
# for (NodeType in tree[NodeNbr][token]) { # ??
# NodeVal=tree[NodeNbr][token][NodeType]
# printf fmt, NodeNbr, token, NodeType, NodeVal, (NodeType=="leaf") ? "\"" map[NodeVal] "\"" : ""
# }
}
{ for (i=1 ; i<=NF ; i++) { # loop through tokens in current line from input.txt
maxToken++
buffer[ ((maxToken - 1) % maxlen) + 1 ] = $i
if ( (maxToken - minToken + 1) >= maxlen ) # if we have a "full" buffer then ...
replace() # look for replacement match
}
}
END { replace("flush") } # flush the rest of buffer[]
' map.txt input.txt
This generates:
X # "2 3 4" has precedence over "2 3"
112
&
9
12
\000
&
15
255
10
9
199
\000
13
If we switch the first 2 lines of map.txt like such:
==> map.txt <==
2 3,Y # "2 3" has precedence over ...
2 3 4,X # "2 3 4"
We now generate:
Y # "2 3" has precedence over "2 3 4" thus ...
4 # leaving "4" by itself
112
&
9
12
\000
&
15
255
10
9
199
\000
13

decoding octal escape sequences in input with awk

Updated
Let's suppose that you got octal escape sequences in a stream:
backslash \134 is escaped as \134134
single quote ' and double quote \042
linefeed `\012` and carriage return `\015`
%s &
etc...
note: The escaped characters are limited to 0x01-0x1F 0x22 0x5C 0x7F
How can you revert those escape sequences back to their corresponding character with awk?
While awk understands them out of the box when they appear in a literal string or in an argument, I can't find a way to leverage this capability when the escape sequence is part of the data. For now I'm using one gsub per escape sequence, but that doesn't feel efficient.
Here's the expected output for the given sample:
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
PS: While I have the additional constraint of unescaping each line into an awk variable before printing the result, it doesn't really matter.
Using GNU awk for strtonum() and lots of meaningfully-named variables to show what each step does:
$ cat tst.awk
function octs2chars(str, head,tail,oct,dec,char) {
head = ""
tail = str
while ( match(tail,/\\[0-7]{3}/) ) {
oct = substr(tail,RSTART+1,RLENGTH-1)
dec = strtonum(0 oct)
char = sprintf("%c", dec)
head = head substr(tail,1,RSTART-1) char
tail = substr(tail,RSTART+RLENGTH)
}
return head tail
}
{ print octs2chars($0) }
$ awk -f tst.awk file
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
If you don't have GNU awk then write a small function to convert octal to decimal, e.g. oct2dec() below, and then call that instead of strtonum():
$ cat tst2.awk
function oct2dec(oct, dec) {
dec = substr(oct,1,1) * 8 * 8
dec += substr(oct,2,1) * 8
dec += substr(oct,3,1)
return dec
}
function octs2chars(str, head,tail,oct,dec,char) {
head = ""
tail = str
while ( match(tail,/\\[0-7]{3}/) ) {
oct = substr(tail,RSTART+1,RLENGTH-1)
dec = oct2dec(oct) # replaced "strtonum(0 oct)"
char = sprintf("%c", dec)
head = head substr(tail,1,RSTART-1) char
tail = substr(tail,RSTART+RLENGTH)
}
return head tail
}
{ print octs2chars($0) }
$ awk -f tst2.awk file
backslash \ is escaped as \134
single quote ' and double quote "
linefeed `
` and carriage return `
%s &
etc...
The above assumes that, as discussed in comments, the only backslashes in the input will be in the context of the start of octal numbers as shown in the provided sample input.
With GNU awk, which supports the strtonum() function, would you please try:
awk '{
while (match($0, /\\[0-7]{1,3}/)) {
printf("%s", substr($0, 1, RSTART - 1)) # print the substring before the match
printf("%c", strtonum("0" substr($0, RSTART + 1, RLENGTH - 1))) # convert the octal string to character
$0 = substr($0, RSTART + RLENGTH) # update $0 with remaining substring
}
print
}' input_file
It processes the matched substrings (octal representations) one by one in the while loop.
substr($0, RSTART + 1, RLENGTH - 1) skips the leading backslash; the length is RLENGTH - 1 because RLENGTH counts the backslash too, and without the - 1 a character following the escape (as in \134134) would be swallowed into the number.
"0" prepended to the substring makes an octal string.
strtonum() converts the octal string to its numeric value.
The final print outputs the remaining substring.
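For awks that have neither strtonum() nor interval expressions such as {3} (older mawk builds, for instance), the same steps can be sketched with POSIX-only features. This is an illustrative variant, not part of the answers above; the function name unescape_octal is made up, and it assumes, as the question states, that every backslash starts a three-digit octal escape for an ASCII character.

```sh
# Sketch: decode \NNN octal escapes using only POSIX awk features.
# Assumes every backslash in the input starts a 3-digit octal escape.
unescape_octal() {
    awk '
    {
        out = ""
        rest = $0
        # three explicit digit classes instead of the {3} interval
        while (match(rest, /\\[0-7][0-7][0-7]/)) {
            oct = substr(rest, RSTART + 1, 3)
            # octal to decimal: first digit * 64, second * 8, third * 1
            dec = substr(oct, 1, 1) * 64 + substr(oct, 2, 1) * 8 + substr(oct, 3, 1)
            out = out substr(rest, 1, RSTART - 1) sprintf("%c", dec)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}

printf '%s\n' 'backslash \134 is escaped as \134134' | unescape_octal
```

For \134 the arithmetic is 1*64 + 3*8 + 4 = 92, the code of the backslash, so the sample line comes out as "backslash \ is escaped as \134".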
UPDATE :: about gawk's strtonum() in unicode mode :
echo '\666' |
LC_ALL='en_US.UTF-8' gawk -e '
$++NF = "<( "(sprintf("%c", strtonum((_=_<_) substr($++_, ++_))))" )>"'
0000000 909522524 539507744 690009798 2622
\ 6 6 6 < ( ƶ ** ) > \n
134 066 066 066 040 074 050 040 306 266 040 051 076 012
\ 6 6 6 sp < ( sp ? ? sp ) > nl
92 54 54 54 32 60 40 32 198 182 32 41 62 10
5c 36 36 36 20 3c 28 20 c6 b6 20 29 3e 0a
0000016
By default, gawk in Unicode mode decodes this to a multi-byte character instead of the single byte \266 (0xB6). If you want to make sure a single byte is always produced, even in gawk's Unicode mode, this should do the trick:
echo '\666' |
LC_ALL='en_US.UTF-8' gawk -e '$++NF = sprintf("<( %c )>",
strtonum((_=_<_) substr($++_, ++_)) + _*++_^_++*_^++_)'
0000000 909522524 539507744 1042882742 10
\ 6 6 6 < ( 266 ) > \n
134 066 066 066 040 074 050 040 266 040 051 076 012
\ 6 6 6 sp < ( sp ? sp ) > nl
92 54 54 54 32 60 40 32 182 32 41 62 10
5c 36 36 36 20 3c 28 20 b6 20 29 3e 0a
0000015
long story short : add 4^5 * 54 to output of strtonum(), which happens to be 0xD800, the starting point of UTF-16 surrogates
=================== =================== ===================
one quick note about @Gene's proposed perl-based solution :
echo 'abc \555 456' | perl -p -e 's/\\([0-7]{3})/chr(oct($1))/ge'
Wide character in print at -e line 1, <> line 1.
abc ŭ 456
octal codes wrap around, meaning \4xx = \0xx ; \6xx = \2xx etc :
printf '\n %s\n' $'\555'
m
so perl is incorrectly decoding these as multi-byte characters, when in fact \555, as confirmed by printf, is merely lowercase "m" (0x6D)
ps : my perl is version 5.34
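The wrap-around arithmetic can be checked with plain shell arithmetic. A minimal sketch, with a made-up helper name (wrap_octal), that only shows what truncating a three-digit octal code to one byte yields. It does not claim anything about how a particular tool decodes such codes.

```sh
# Hypothetical helper: reduce a 3-digit octal code to the byte value it
# wraps to when only the low 8 bits are kept.
wrap_octal() {
    # the leading 0 makes the shell read the digits as an octal constant
    printf '%d\n' "$(( 0$1 % 256 ))"
}

wrap_octal 555   # octal 555 is 365; 365 mod 256 = 109 (0x6D, "m")
wrap_octal 155   # octal 155 is 109 as well
```

Both calls print 109, which matches the \4xx = \0xx observation above.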
I got my own POSIX awk solution, so I post it here for reference.
The main idea is to build a hash that translates an octal escape sequence to its corresponding character. You can then use it while splitting the line during the search for escape sequences:
LANG=C awk '
BEGIN {
for ( i = 1; i <= 255; i++ )
tr[ sprintf("\\%03o",i) ] = sprintf("%c",i)
}
{
remainder = $0
while ( match(remainder, /\\[0-7]{3}/) ) {
printf("%s%s", \
substr(remainder, 1, RSTART-1), \
tr[ substr(remainder, RSTART, RLENGTH) ] \
)
remainder = substr(remainder, RSTART + RLENGTH)
}
print remainder
}
' input.txt
backslash `\`
single quote `'` and double quote `"`
linefeed `
` and carriage return `
%s &
etc...
this separate post is made specifically to showcase how to extend the octal lookup reference tables in gawk unicode-mode to all 256 bytes without external dependencies or warning messages:
ASCII bytes reside in table o2bL
8-bit bytes reside in table o2bH
.
# gawk profile, created Fri Sep 16 09:53:26 2022
'BEGIN {
1 makeOctalRefTables(PROCINFO["sorted_in"] = "@val_str_asc" \
(ORS = ""))
128 for (_ in o2bL) {
128 print o2bL[_]
}
128 for (_ in o2bH) {
128 print o2bH[_]
}
}
function makeOctalRefTables(_,__,___,____)
{
1 _=__=___=____=""
for (_ in o2bL) {
break
}
1 if (!(_ in o2bL)) {
1 ____=_+=((_+=_^=_<_)-+-++_)^_--
128 do { o2bL[sprintf("\\%o",_)] = \
sprintf("""%c",_)
} while (_--)
1 o2bL["\\" ((_+=(_+=_^=_<_)+_)*_--+_+_)] = "\\&"
1 ___=--_*_^_--*--_*++_^_*(_^=++_)^(! --_)
128 do { o2bH[sprintf("\\%o", +_)] = \
sprintf("%c",___+_)
} while (____<--_)
}
1 return length(o2bL) ":" length(o2bH)
}'
|
\0 \1 \2 \3 \4 \5 \6 \7 \10\11 \12
\13
\14
\16 \17
\20 \21 \22 \23 \24 \25 \26 \27 \30 \31 \32 \33 34 \35 \36 \37
\40 \41 !\42 "\43 #\44 $\45 %\47 '\50 (\51 )\52 *\53 +\54 ,\55 -\56 .\57 /
\60 0\61 1\62 2\63 3\64 4\65 5\66 6\67 7\70 8\71 9\72 :\73 ;\74 <\75 =\76 >\77 ?
\100 @\101 A\102 B\103 C\104 D\105 E\106 F\107 G\110 H\111 I\112 J\113 K\114 L\115 M\116 N\117 O
\120 P\121 Q\122 R\123 S\124 T\125 U\126 V\127 W\130 X\131 Y\132 Z\133 [\134 \\46 \&\135 ]\136 ^\137 _
\140 `\141 a\142 b\143 c\144 d\145 e\146 f\147 g\150 h\151 i\152 j\153 k\154 l\155 m\156 n\157 o
\160 p\161 q\162 r\163 s\164 t\165 u\166 v\167 w\170 x\171 y\172 z\173 {\174 |\175 }\176 ~\177
\200 ?\201 ?\202 ?\203 ?\204 ?\205 ?\206 ?\207 ?\210 ?\211 ?\212 ?\213 ?\214 ?\215 ?\216 ?\217 ?
\220 ?\221 ?\222 ?\223 ?\224 ?\225 ?\226 ?\227 ?\230 ?\231 ?\232 ?\233 ?\234 ?\235 ?\236 ?\237 ?
\240 ?\241 ?\242 ?\243 ?\244 ?\245 ?\246 ?\247 ?\250 ?\251 ?\252 ?\253 ?\254 ?\255 ?\256 ?\257 ?
\260 ?\261 ?\262 ?\263 ?\264 ?\265 ?\266 ?\267 ?\270 ?\271 ?\272 ?\273 ?\274 ?\275 ?\276 ?\277 ?
\300 ?\301 ?\302 ?\303 ?\304 ?\305 ?\306 ?\307 ?\310 ?\311 ?\312 ?\313 ?\314 ?\315 ?\316 ?\317 ?
\320 ?\321 ?\322 ?\323 ?\324 ?\325 ?\326 ?\327 ?\330 ?\331 ?\332 ?\333 ?\334 ?\335 ?\336 ?\337 ?
\340 ?\341 ?\342 ?\343 ?\344 ?\345 ?\346 ?\347 ?\350 ?\351 ?\352 ?\353 ?\354 ?\355 ?\356 ?\357 ?
\360 ?\361 ?\362 ?\363 ?\364 ?\365 ?\366 ?\367 ?\370 ?\371 ?\372 ?\373 ?\374 ?\375 ?\376 ?\377 ?

Output the result of each loop in different columns

price.txt file has two columns: (name and value)
Mary 134
Lucy 56
Jack 88
range.txt file has three columns: (fruit and min_value and max_value)
apple 57 136
banana 62 258
orange 88 99
blueberry 98 121
My aim is to test whether each value in price.txt is between the min_value and max_value in range.txt. If yes, output 1; if not, output "x".
I tried:
awk 'FNR == NR { name=$1; price[name]=$2; next} {
for (name in price) {
if ($2<=price[name] && $3>=price[name]) {print 1} else {print "x"}
}
}' price.txt range.txt
But my results are all in one column, just like follows:
1
1
x
x
x
x
x
x
1
1
1
x
Actually, I want my result to be like: (Each name has one column)
1 x 1
1 x 1
x x 1
x x x
Because I need to use paste to add the output file and range.txt file together. The final result should be like:
apple 57 136 1 x 1
banana 62 258 1 x 1
orange 88 99 x x 1
blueberry 98 121 x x x
So, how can I get the result of each loop in different columns? And is there any way to output the final result without paste, based on my current code? Thank you.
This builds on what you provided,
# load prices by index to maintain read order
FNR == NR {
price[names++]=$2
next
}
# save max index to avoid using non-standard length(array)
END {
names=NR
}
{
l = $1 " " $2 " " $3
for (i=0; i < names; i++) {
if ($2 <= price[i] && $3 >= price[i]) {
l = l " 1"
} else {
l = l " x"
}
}
print l
}
and generates output,
apple 57 136 1 x 1
banana 62 258 1 x 1
orange 88 99 x x 1
blueberry 98 121 x x x
However, you don't have the person name for the score (anonymous results) - maybe that's intentional?
The change here is to explicitly index array populated in first block to maintain order.
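If the columns should not stay anonymous, the names can be remembered alongside the prices and printed as a header row. A sketch under that assumption; the function name flag_ranges and the header labels are illustrative and not part of the original request.

```sh
# Sketch: one output column per name from the price file, with a header.
# Usage: flag_ranges price.txt range.txt
flag_ranges() {
    awk '
    # first file: remember names and prices in input order
    FNR == NR { name[++n] = $1; price[n] = $2 + 0; next }
    # second file: print a header once, naming each result column
    FNR == 1 {
        line = "fruit min_value max_value"
        for (i = 1; i <= n; i++) line = line " " name[i]
        print line
    }
    {
        line = $1 " " $2 " " $3
        for (i = 1; i <= n; i++)
            line = line " " ((price[i] >= $2 + 0 && price[i] <= $3 + 0) ? 1 : "x")
        print line
    }' "$1" "$2"
}
```

For the sample files the header is "fruit min_value max_value Mary Lucy Jack", followed by the same four data rows as above. Adding 0 forces numeric comparison.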

OCaml: Print a long int list 10 elements per row

I'm working with really long lists of integers and need a way of printing them 10 to a row. This is what I've got so far and now I'm stuck:
open Printf
let print_list list = List.iter (printf "%d ") list;;
(* Remove first n elements from list *)
let rec remove n list =
if n== 0 then list
else match list with
| [] -> []
| hd::tl -> remove (n-1) tl;;
(* Remove and return first n elements from a list *)
let rec take n list =
match n with
| 0 -> []
| _ -> List.hd list :: take (n-1) (List.tl list);;
let rec print_rows list =
if List.length list > 10 then
begin
let l = take 10 list;
print_list l;
print_endline " ";
print_rows (remove 5 list)
end else print_list list;;
I'm sure there is a better way recursively with matching patterns, but I can't figure this out. Help!
Here's a function that does something close to what you want. It doesn't do anything fancy, it just counts the number of ints printed so far and inserts endlines at the right times.
let printby10 intlist =
let iprint count n =
Printf.printf "%d " n;
if count mod 10 = 9 then Printf.printf "\n";
count + 1
in
ignore (List.fold_left iprint 0 intlist)
This code leaves an incomplete line if the number of ints isn't a multiple of 10. Maybe you would want to fix that up.
Another approach (very close to that of @Jeffrey Scofield) would be to use the standard function List.iteri, which provides the current element's index:
let print_by_rows n_per_row =
List.iteri (fun i x ->
print_int x;
if (i + 1) mod n_per_row <> 0 then print_string " "
else print_newline ())
A test:
μ> print_by_rows 10 (Array.to_list (Array.make 20 42));;
42 42 42 42 42 42 42 42 42 42
42 42 42 42 42 42 42 42 42 42
- : unit = ()
And one more:
μ> print_by_rows 5 (Array.to_list (Array.make 20 42));;
42 42 42 42 42
42 42 42 42 42
42 42 42 42 42
42 42 42 42 42
- : unit = ()

Get Ascii Code?

To retrieve the ASCII code of every character in the 13th column of a file, I wrote this script:
awk -v ch="'" '{
for (i=1;i<=length(substr($13,6,length($13)));i++)
{cmd = printf \"%d\\n\" \"" ch substr(substr($13,6,length($13)),i,1) "\"" cmd | getline output close(cmd) ;
Number= Number " " output
}
print Number ; Number=""
}' ~/a.test
But it doesn't work correctly: it works fine for a while and then produces weird results.
As an example, for this input (assume it's the 13th column):
CQ:Z:%8%%%%0%%%%9%%%%:%%%%%%%%%%%%%%%%%%
I have to get this
37 56 37 37 37 37 48 37 37 37 37 57 37 37 37 37 58 37 37 37 37 ...............
But I have this
37 56 37 37 37 37 48 48 48 48 48 57 57 57 57 57 58 58 58 58 58 ...............
As you can see, the first miscomputation appears after the character "0" (48 in the result).
Do you know which part of my code is responsible for this error?
Try this:
awk '{
str = substr($13, 6)
for (i=1; i<=length(str); i++) {
cmd = "printf %d \42\47" substr(str, i, 1) "\42"
cmd | getline output
close(cmd)
Number= Number " " output
}
print Number
Number=""
}' ~/a.test
\42 is " and \47 is ', so this runs printf %d "'${char}" in the shell for each ${char}, which triggers evaluation as a C constant with the POSIX extension dictating a numeric value as noted in the final bullet of the POSIX printf definition's §Extended Description.
N.B. The formatting matters!
Don't try to squeeze the code unless you know exactly what you're doing!
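The printf form described above can be tried on its own in the shell before wiring it into awk. A minimal sketch; the helper name ord is an assumption for illustration, and it is meant to be called with a single character.

```sh
# Hypothetical helper: print the numeric code of the character in $1.
ord() {
    # a leading single quote makes printf use the code of the next character
    printf '%d\n' "'$1"
}

ord '%'   # prints 37
ord '8'   # prints 56
ord '0'   # prints 48
```

These are the same values as in the expected output of the question (37, 56, 48).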
And a pure awk solution (I took the ord/chr functions directly from the manual):
printf '%s\n' 'CQ:Z:%8%%%%0%%%%9%%%%:%%%%%%%%%%%%%%%%%%'|
awk 'BEGIN { _ord_init() }
{
str = substr($0, 6)
for (i = 0; ++i <= length(str);)
printf "%s", (ord(substr(str, i, 1)) (i < length(str) ? OFS : ORS))
}
func _ord_init( low, high, i, t) {
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
}
else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
}
else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
func ord(str, c) {
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}
func chr(c) {
# force c to be numeric by adding 0
return sprintf("%c", c + 0)
}'
This might work for you:
awk -vSQ="'" -vDQ='"' '{args=space="";n=split($13,a,"");for(i=1;i<=n;i++){args=args space DQ SQ a[i] DQ;format=format space "%d";space=" "};format=DQ format "\\n" DQ;system("printf " format " " args)}'