Let's say I have some Rebol/Red code. If I load the source text, I get a block, but how can I get the source text back from the block? I tried FORM on the block, but it doesn't give back the source text.
text: {
Red [Title: "Red Pretty Printer"]
out: none      ; output text
spaced: off    ; add extra bracket spacing
indent: ""     ; holds indentation tabs
emit-line: func [] [append out newline]
emit-space: func [pos] [
    append out either newline = last out [indent] [
        pick [#" " ""] found? any [
            spaced
            not any [find "[(" last out find ")]" first pos]
        ]
    ]
]
emit: func [from to] [emit-space from append out copy/part from to]
clean-script: func [
    "Returns new script text with standard spacing."
    script "Original Script text"
    /spacey "Optional spaces near brackets and parens"
    /local str new
] [
    spaced: found? spacey
    clear indent
    out: append clear copy script newline
    parse script blk-rule: [
        some [
            str:
            newline (emit-line) |
            #";" [thru newline | to end] new: (emit str new) |
            [#"[" | #"("] (emit str 1 append indent tab) blk-rule |
            [#"]" | #")"] (remove indent emit str 1) break |
            skip (set [value new] load/next str emit str new) :new
        ]
    ]
    remove out ; remove first char
]
print clean-script read %clean-script.r
}
block: load text
LOAD is a higher-level operation with complex behaviors, e.g. it can take a FILE!, a STRING!, or a BLOCK!. Because it does a lot of different things, it's hard to speak of its exact complement as an operation. (For instance, there is SAVE which might appear to be the "inverse" of when you LOAD from a FILE!)
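For instance, the STRING! case alone:

>> load "1 2 3"
== [1 2 3]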
But your example is specifically dealing with a STRING!:
If I load the source text, I get a block, but how can I get the source text back from the block?
As a general and very relevant point: you can't "get back" source text.
In your example above, your source text contains comments, and after LOAD they are gone. Also, only a very limited amount of whitespace information is preserved, in the form of the NEW-LINE flag that each value carries. Which specific indentation style you used--or whether you used tabs or spaces--is not preserved.
On a more subtle note, small amounts of notational distinction are lost. STRING! literals which are loaded will lose the knowledge of whether you wrote them "with quotes" or {with curly braces}...neither Rebol nor Red preserves that bit. (And even if they did, that wouldn't answer the question of what to do after mutations, or with new strings.) There are variations of DATE! input formats, and it doesn't remember which specific one you used. Etc.
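You can see both losses in the console:

>> mold load "x: 1 ; a comment"
== "[x: 1]"

>> mold load "s: {braces}"
== {[s: "braces"]}

The comment is gone, and the braced string comes back in quotes.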
But when it comes to talking about code round-tripping as text, the formatting is minor compared to what happens with binding. Consider that you can build structures like:
>> o1: make object! [a: 1]
>> o2: make object! [a: 2]
>> o3: make object! [a: 3]
>> b: compose [(in o1 'a) (in o2 'a) (in o3 'a)]
== [a a a]
>> reduce b
== [1 2 3]
>> mold b
"[a a a]"
You cannot simply serialize b to a string as "[a a a]" and have enough information to get equivalent source. Red obscures the impact of this a bit more than Rebol does, since even operations like TO BLOCK! on a STRING! and system/lexer/transcode appear to do binding into the user context. But it's a problem you will face on anything but the most trivial examples.
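A quick way to see the loss (the error text varies between Rebol and Red, and this assumes a is not otherwise defined in the user context):

>> b2: load mold b
== [a a a]

>> reduce b2
** Script Error: a has no value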
There are some binary formats for Rebol2 and Red that attempt to address this. For instance in "RedBin" a WORD! saves its context (and index into that context). But then you have to think about how much of your loaded environment you want dragged into the file to preserve context. So it's certainly opening a can of worms.
This isn't to say that the ability to MOLD things out isn't helpful. But there's no free lunch...so Rebol and Red programs wind up having to think about serialization as much as anyone else. If you're thinking of doing processing on any source code--for the reasons of comment preservation if nothing else--then PARSE should probably be the first thing you reach for.
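As a minimal sketch of that direction, a PARSE rule can collect the comments that LOAD throws away. (This naive version will also match semicolons inside strings, which a real scanner has to skip.)

comments: copy []
parse text [
    some [
        s: #";" [thru newline | to end] e:
        (append comments copy/part s e)
        | skip
    ]
]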
I'm trying to describe the SSH protocol in the Kaitai language (a .ksy file).
At the beginning, there is a protocol version exchange in the following format:
SSH-protoversion-softwareversion SP comments CR LF
where SP comments is optional. AFAIK, there is no way of describing an attribute as fully optional, only via an if condition. Does anybody know how to describe this relation in Kaitai, so that the parser also accepts the format SSH-protoversion-softwareversion CR LF?
Thanks
Kaitai Struct is not designed to be what you would call a grammar in its traditional meaning (i.e. something mapping to a regular language, a context-free grammar, BNF, or something similar). Traditional grammars have a notion of "this element is optional" or "this element can be repeated multiple times", but KS works the other way around: it's not even attempting to solve the ambiguity problem, but rather builds on the fact that all binary formats are designed to be non-ambiguous.
So, whenever you encounter something like an "optional element" or a "repeated element" without any further context, please take a pause and consider whether Kaitai Struct is the right tool for the task, and whether it is really a binary format you're trying to parse. For example, parsing something like JSON or XML or YAML might be theoretically possible with KS, but the result will not be of much use.
That said, in this particular case it's perfectly possible to use Kaitai Struct; you'll just need to think about how a real-life binary parser would handle this. From my understanding, a real-life parser will read the whole line up to the CR byte, and then do a second pass to interpret the contents of that line. You can model that in KS with something like this:
seq:
  - id: line
    terminator: 0xd # CR
    type: version_line
    # ^^^ this creates a substream with all bytes up to the CR byte
  - id: reserved_lf
    contents: [0xa]
types:
  version_line:
    seq:
      - id: magic
        contents: 'SSH-'
      - id: proto_version
        type: str
        terminator: 0x2d # '-'
      - id: software_version
        type: str
        terminator: 0x20 # ' '
        eos-error: false
        # ^^^ if we don't find that space and just hit end of stream, that's fine
      - id: comments
        type: str
        size-eos: true
        # ^^^ if we still have some data in the stream, it's all comment
If you want to get null instead of an empty string for comments when they're not included, just add an extra if: not _io.eof to the comments attribute, as shown below.
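That is, per the suggestion above, the comments attribute would become:

      - id: comments
        type: str
        size-eos: true
        if: not _io.eof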
I can't find NEW-TAB, whereas there is NEW-LINE, so how do you preserve tabs in blocks?
help new-line
USAGE:
    NEW-LINE position value

DESCRIPTION:
    Sets or clears the new-line marker within a block or paren.
    NEW-LINE is a native! value.

ARGUMENTS:
    position [block! paren!] "Position to change marker (modified)".
    value "Set TRUE for newline".

REFINEMENTS:
    /all => Set/clear marker to end of series.
    /skip => Set/clear marker periodically to the end of the series.
        size [integer!]

RETURNS:
    [block! paren!]
There is one newline flag per-cell in arrays ("any-block!"s), which indicates whether or not the molding process should put out a newline before that value.
Indentation is driven only from these flags. Indentation starts at the first newline flag, and further newlines will each align to that level, with an outdent at the end of the block if any newlines/indents occurred.
>> data: [a b c]
>> new-line next data true
>> data
== [a
    b c
]
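The /all refinement from the help text above sets the marker on every value:

>> new-line/all data true
== [
    a
    b
    c
]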
Note there are 4 "candidate positions" for newlines inside the block [a b c] (e.g. the positions are [* a * b * c *]). Yet there are only three value cells, with a newline marker indicating a desire to output a newline before that cell. Lacking anywhere to put the fourth newline signal, the decision in Rebol2 and Red is to implicitly put the closing bracket on its own line if there were any newline markers processed.
I've previously mentioned that it's non-obvious exactly how "out-of-band" information like this gets managed in the face of series modifications. It helps to work through your expectations. Even with just one bit there is a lot of nuance, such as when you say:
compose [
    1 + (block1)
    (block2)
]
How should newline markers be merged, between what's in the COMPOSE and what's in the spliced data itself? That's just the logic related to one bit. Putting in some "indentation count" would introduce many more questions. Plus, there aren't a lot of bits to spare for that count: one of the "rules of the game" is to keep things down to just 4 platform pointers per value cell.
Expanding the formatting features isn't too likely. One feature request that the tail get its own newline marker was accepted for open source Rebol3, but rejected by Red. I wouldn't expect to see much more done in this area.
In PostScript if you have
[4 5 6]
you have the following tokens:
mark integer integer integer mark
The stack goes like this:
| mark |
| mark | integer |
| mark | integer | integer |
| mark | integer | integer | integer |
| array |
Now my question:
Is the ]-mark operator a literal object or an executable object?
Am I correct that the [-mark is a literal object (just data) and that the ]-mark is an executable object (because you always need to create an array when you see this ]-mark operator)?
PostScript Language Reference Manual section 3.3.2 gives me:
The [ and ] operators, when executed, produce a literal array object with the enclosed objects as elements. Likewise, << and >> (LanguageLevel 2) produce a literal dictionary object.
It is not clear to me whether both the [ and ] operators are executable, or only the ] operator.
Summary.
All of these special tokens, [, ], <<, >>, come out of the scanner as executable names. [ and << are defined to yield a marktype object (so they are not operators per se, but they are executable names defined in systemdict where all the operators live). ] and >> are defined as procedures or operators which are executed just like any other procedure or operator. These use the counttomark operator to find the opening bracket. But all of these tokens are treated specially by the scanner, which recognizes them without surrounding whitespace since they are part of its delimiter set.
Details.
It all depends on when you look at it. Let's trace through what the interpreter does with these tokens. I'm going to illustrate this with a string, but it works just the same with a file.
So if you have an input string
([4 5 6]) cvx exec
cvx makes a literal object executable. The program stream is a file object also labeled executable. exec pushes an object on the Execution Stack, where it is encountered by the interpreter on the next iteration of the inner interpreter processing loop. When executing the program stream, the executable file object is topmost on the Execution Stack.
The interpreter uses token to call the scanner. The scanner skips initial whitespace, then reads all non-whitespace characters up to the next delimiter, then attempts to interpret the string as a number, and failing that it becomes an executable name. The brackets are part of the set of delimiters, and so are termed 'self-delimiting'. So the scanner reads the one bracket character, stops reading because it's a delimiter, discovers it cannot be a number, so it yields an executable name.
Top of Exec Stack | Operand Stack
(4 5 6]) [ |
Next, the interpreter loop executes anything executable (unless it's an array). Executing a token means loading it from the dictionary, and then executing the definition if it's executable. [ is defined as a -mark- object, same as the name mark is defined. It's not technically an operator or a procedure, it's just a definition. Automatic loading happens because the name comes out of the scanner with the executable flag set.
(4 5 6]) | -mark-
The scanner then yields 4, 5, and 6 which are numbers and get pushed straight to the operand stack. 6 is delimited by the ] which is pushed back on the stream.
(]) | -mark- 4 5 6
The interpreter doesn't execute the numbers since they are not executable, but it would be just the same if it did. The action for executing a number is simply to push it on the stack.
Then, finally, the scanner encounters the right bracket ]. And that's where the magic happens. Self-delimiting, it doesn't need to be followed by any whitespace. The scanner yields the executable name ], and the interpreter executes it by loading it, where it finds ...
{ counttomark array astore exch pop }
Or maybe an actual operator that does this. But, yeah. counttomark yields the number of elements. array creates an array of that size. astore fills an array with elements from the stack. And exch pop to discard that pesky mark once and for all.
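So, typed out by hand, the desugared equivalent of [4 5 6] looks like this (operand stack states shown in the comments):

mark 4 5 6        % | -mark- 4 5 6
counttomark       % | -mark- 4 5 6 3
array astore      % | -mark- [4 5 6]
exch pop          % | [4 5 6]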
For dictionaries, << is exactly the same as [. It drops a mark. Then you line up some key-value pairs, and >> is a procedure that does something to the effect of ...
{ counttomark dup dict begin 2 idiv { def } repeat pop currentdict end }
Make a dictionary. Define all the pairs. Pop the mark. Yield the dictionary. This version of the procedure tries to create a fast dictionary by making it double-sized. Move the 2 idiv to before dup to make a small dictionary.
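So a dictionary literal desugars the same way:

<< /a 1 /b 2 >>    % equivalent to: mark /a 1 /b 2, then the procedure above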
So, to get philosophical, counttomark is the operator you're really using. And it requires a special object type that isn't used for anything else, the marktype object, -mark-. The rest is just syntactic sugar to let you access this stack-counting ability to create linear data structures.
Appendix
Here's a procedure that models the interpreter loop reading from currentfile.
{ currentfile token not {exit} if dup type /arraytype ne {exec} if } loop
exec is responsible for loading (and further executing) any executable names. You can see from this that token really is the name of the scanner; and that procedures (arrays) directly encountered by the interpreter loop are not executed (type /arraytype ne {exec} if).
Using token on strings lets you do really cool stuff, however. For example, you can dynamically construct procedure bodies with substituted names. This is very much like a lisp macro.
/makeadder { % n . { n add }
    1 dict begin
    /n exch def
    ({//n add}) token % () {n add} true
    pop exch pop      % {n add}
    end
} def
token reads the entire procedure from the string, substituting the immediately-evaluated name //n with its currently defined value. Notice here that the scanner reads an executable array all at once, effectively executing [ ... ] cvx internally before returning (In certain interpreters, like my own xpost, this allows you to bypass the stack-size limits to build an array, because the array is built in separate memory. But Level 2 garbage collection makes this largely irrelevant).
There is also the bind operator which modifies a procedure by replacing operator names with the operator objects themselves. These tricks help you to factor-out name lookups in speed-critical procedures (like inner loops).
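A small sketch of the effect:

% without bind, each call of double looks up the name mul at run time;
% after bind, the mul operator object itself is stored in the procedure body
/double { 2 mul } bind def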
Both [ and ] are executable tokens. [ produces a mark object; ] creates an array of the objects back to the last mark.
I want to verify that a given file in a path is a text file, i.e. not binary, i.e. readable by a human. I guess I could read the first characters and check each one with:
isAlphaNumeric
isSpecial
isSeparator
isOctetCharacter ???
but joining all those testing methods with and: [ ... and: [ ... and: [ ] ] ] seems not to be very Smalltalkish. Any suggestion for a more elegant way?
(There is a Python version here: How to identify binary and text files using Python?, which could be useful, but the syntax and implementation look like C.)
Only heuristics; you can never be really certain...
For ASCII, the following may do:
| isPlausibleAscii numChecked isPossiblyText |

isPlausibleAscii :=
    [:char |
        (char codePoint between: 32 and: 127)
            or: [char isSeparator]].

numChecked := text size min: 1024.
isPossiblyText := text from: 1 to: numChecked conform: isPlausibleAscii.
For Unicode (UTF-8?) things become more difficult; you could then try to convert. If there is a conversion error, assume binary.
PS: if you don't have from:to:conform:, replace it with (copyFrom:to:) conform:.
PPS: if you don't have conform:, try allSatisfy:.
All text contains more space than you'd expect to see in a binary file, and some encodings (UTF16/32) will contain lots of 0's for common languages.
A Smalltalky solution would be to hide the gory details in a method on Standard/MultiByte-FileStream; #isProbablyText would probably be a good choice.
It would essentially do the following:
- Store the current state if you intend to use it later, and reset to the start (set a Latin1 converter if you use a MultiByteStream).
- Iterate over the next N characters (where N is an appropriate number).
- Encounter a non-printable ASCII char? It's probably binary, so return false. (There is no special selector for this; use a map, implement a new method on Character, or something similar.)
- Increase two counters where appropriate: one for space characters, and another for zero characters.
- If the loop finishes, return whether either counter has reached a statistically significant amount. (A sketch follows below.)
TL;DR: use a method to hide the gory details; otherwise it's pretty much the same.
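A minimal sketch of those steps (the selector name is from the suggestion above; the whitespace threshold is illustrative, and the zero-byte counter for UTF-16/32 detection is omitted for brevity):

isProbablyText
    "Answer whether the first characters of the stream look like text.
     Bail out on unprintable ASCII; ask for at least a little whitespace."
    | sample spaces |
    sample := self next: 1024.
    spaces := 0.
    sample do: [:ch |
        (ch isSeparator or: [ch codePoint between: 32 and: 126])
            ifFalse: [^ false].
        ch isSeparator ifTrue: [spaces := spaces + 1]].
    ^ spaces >= (sample size // 100)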
Let's say I have
o: context [
    f: func [message /refine message2] [
        print [message]
        if refine [print message 2]
    ]
]
I can call it like this:
do get in o 'f "hello"
But how can I do the same for the refinement? I want something like this to work:
>> do get in o 'f/refine "hello" "world"
** Script Error: in expected word argument of type: any-word
** Near: do get in o 'f/refine
>>
I don't know if there's a way to directly tell the interpreter to use a refinement when invoking a function value. That would require some parameterization of DO when its argument is a FUNCTION!. Nothing like that seems to exist...but maybe it's hidden somewhere else.
The only way I know to use a refinement is with a path. To make it clear, I'll first use a temporary word:
>> fword: get in o 'f
>> do compose [(to-path [fword refine]) "hello" "world"]
hello
world
What that second statement evaluates to after the compose is:
do [fword/refine "hello" "world"]
You can actually put function values into paths too. It gets rid of the need for the intermediary:
>> do compose [(to-path compose [(get in o 'f) refine]) "hello" "world"]
hello
world
P.S. you have an extra space between message and 2 above, where it should just be message2
Do this:
o/('f)/refine "hello" "world"
Parens in a path expression are evaluated if they correspond to object field or series pick/poke index references. That makes the above code equivalent to this:
apply get in o 'f ["hello" true "world"]
Note that apply arguments are positional, so you need to know the order the arguments were declared in. You can't do that trick with the function refinements themselves, so you have to use apply or create path expressions to evaluate if you want to parameterize the refinements of the function call.
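That positional form is also what lets you parameterize the refinement itself: pass a LOGIC! value in the refinement's slot (use-refine is just an illustrative name here, and this assumes the message2 typo noted above is fixed):

>> use-refine: true
>> apply get in o 'f ["hello" use-refine "world"]
hello
world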
Use the simple path o/f/refine
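With the definitions above (and the message2 typo fixed), that looks like:

>> o/f/refine "hello" "world"
hello
world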