Can rebol trim remove blank lines without removing CRLF?

I wanted to use trim to remove blank lines:

line 1

line 2

to get:

line 1
line 2

but using trim/lines also removes the CRLF line terminators. So is there another way to use trim for that purpose?
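For reference, this is what trim/lines does to the example; it folds every whitespace run, newlines included, into a single space (^/ is Rebol's escape for a newline):
>> trim/lines "line 1^/^/line 2"
== "line 1 line 2"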

You could use PARSE (note that REMOVE as a PARSE keyword requires Rebol 3 or Red):
parse string-with-newlines [
    any [
        crlf remove some crlf
        | newline remove some newline
        | skip
    ]
]
It may be faster to use charsets though:
text: complement charset crlf
parse string-with-newlines [
    any [
        some text
        | crlf remove any crlf
        | newline remove any newline
    ]
]
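A quick check of the first rule in a Rebol 3 console should give:
>> s: "line 1^/^/line 2"
== "line 1^/^/line 2"
>> parse s [any [crlf remove some crlf | newline remove some newline | skip]]
== true
>> s
== "line 1^/line 2"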

replace/all {Line1^/^/Line2} {^/^/} {^/}
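In a console this gives:
>> replace/all {Line1^/^/Line2} {^/^/} {^/}
== "Line1^/Line2"
Note that a run of three or more newlines is only partially collapsed by a single pass.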

There's no way with trim alone, but here is a solution using remove-each and also, which removes leading LFs as well:
trim-emptyline: func [
    str [string!]
    /local lfb4 lfnow c
] [
    lfb4: true  ; pretend an LF precedes the string, so leading LFs get removed too
    remove-each c str [
        ; remove c when it is an LF and the previous character was one as well;
        ; ALSO returns the test result while updating lfb4 as a side effect
        also all [lfnow: lf = c lfb4] lfb4: lfnow
    ]
    str
]
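A quick check (the function only tests for lf, so it assumes LF line endings):
>> trim-emptyline "^/line 1^/^/^/line 2^/"
== "line 1^/line 2^/"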

Related

Antlr4 DM string lexer rules

I'm trying to represent the BYOND DM language strings in lexer form (See http://byond.com and http://byond.com/docs/ref). Here are the rules for strings:
The string starts and ends with double quotes. i.e. "hello world" evaluates to hello world
A backslash acts as an escape character, which can escape the end quote. i.e. "hello\"world" evaluates to hello"world
Newlines in the string can be ignored by ending the line with a backslash. i.e. "hello\
world" evaluates to helloworld
If the string opens with {" and closes with "}, newlines are allowed and entered into the final string. The sequence \\\n is still ignored
The string can contain embedded expressions inside square brackets which are formatted into the result. Backslashes can escape the opening bracket. i.e. "hello [ "world" ] \[" evaluates to hello world [ at run-time. Any expression can go in the brackets (calls, math, etc...)
If the starting quote/curly brace is prefixed with '#' escape sequences and embedded expressions are disabled for the string. i.e. #{"hello [worl\d"} and #"hello [worl\d" both evaluate to hello [worl\d
I am trying to construct ANTLR4 .g4 lexer rules to tokenize these strings. I figure there are 4 (or more) token types I'd need:
Normal string. i.e "hello world", #"hello world", #{"hello world"} or {"hello world"}
String start before embedded expression. i.e. "hello [ or {"hello [
String end after embedded expression. i.e. ] world" or ] world"}
String in between two embedded expressions. i.e. ] hello world [
Here are my (incomplete and unsuccessful) attempts:
LSTRING: '"' ('\\[' | ~[[\r\n])* '[';
RSTRING: ']' ('\\"' | ~["\r\n])* '"';
CSTRING: ']' ('\\[' | ~[[\r\n])* '[';
FSTRING: '"' ('\\"' | ~["\r\n])* '"';
If this can't be solved in the lexer, I can write the parser rules on my own with the tokens #, {", "}, [, ], \\, and ". But, I figure I'd give this a shot since it'd be more performant.
I solved it with the following lexer tidbits.
...
@lexer::members
{
ulong regularAccessLevel;
System.Collections.Generic.Stack<bool> multiString = new System.Collections.Generic.Stack<bool>();
}
...
VERBATIUM_STRING: '#"' (~["\r\n])* '"';
MULTILINE_VERBATIUM_STRING: '#{"' (~'"')* '"}';
MULTI_STRING_START: '{"' { multiString.Push(true); } -> pushMode(INTERPOLATION_STRING);
STRING_START: '"' { multiString.Push(false); } -> pushMode(INTERPOLATION_STRING);
...
LBRACE: '[' { ++regularAccessLevel; };
RBRACE: ']' { if(regularAccessLevel > 0) --regularAccessLevel; else if(multiString.Count > 0) { PopMode(); } };
...
mode INTERPOLATION_STRING;
CHAR_INSIDE: '\\\''
| '\\"'
| '\\['
| '\\\\'
| '\\0'
| '\\a'
| '\\b'
| '\\f'
| '\\n'
| '\\r'
| '\\t'
| '\\v'
;
EMBED_START: '[' -> pushMode(DEFAULT_MODE);
MULTI_STRING_CLOSE: {multiString.Peek()}? '"}' { multiString.Pop(); PopMode(); };
STRING_CLOSE: {!multiString.Peek()}? '"' { multiString.Pop(); PopMode(); };
STRING_INSIDE: {!multiString.Peek()}? ~('[' | '\\' | '"' | '\r' | '\n')+;
MULTI_STRING_INSIDE: {multiString.Peek()}? ~('[' | '\\' | '"')+;
Certain strings can cause it to emit multiple STRING_INSIDE/MULTI_STRING_INSIDE tokens in sequence, but this is acceptable since the parser will eat them all anyway. The {multiString.Peek()}? semantic predicates are what choose between the single-line and multi-line close/inside rules, depending on which opening token pushed the INTERPOLATION_STRING mode.
A lot of it came from reading the C# interpolated string lexer in the antlr4 grammar examples.

How do I define a default rule in EBNF/Tatsu?

I have a problem with my EBNF/Tatsu implementation.
Here is an extract of the EBNF grammar for Tatsu:
define = '#define' constantename [constante] ;
constante = CONSTANTE ;
CONSTANTE = ( Any | `true` ) ;
Any = /.*/ ;
constantename = (/[A-Z0-9_()]*/) ;
When I test with:
#define _TEST01_ "test01"
#define _TEST_
#define _TEST02_ "test02"
I get:
[
"#define",
"_TEST01_",
"\"test01\""
],
[
"#define",
"_TEST_",
"#define _TEST02_ \"test02\""
]
But I want this:
[
"#define",
"_TEST01_",
"\"test01\""
],
[
"#define",
"_TEST_",
"true"
],
[
"#define",
"_TEST02_",
"\"test02\""
]
Where is my mistake?
Thanks a lot...
The problem is that Tatsu skips white space, including newlines, between elements by default. So when you apply the rule '#define' constantename [constante] to the input:
#define _TEST_
#define _TEST02_ "test02"
It first matches #define with '#define', then skips the space, then matches _TEST_ with constantename, then skips the newline, and then matches #define _TEST02_ "test02" with Any (via constante).
Note that that's exactly the behaviour you'd want (I assume) if the newline weren't there:
#define _TEST_ #define _TEST02_ "test02"
Here you'd want the output ["#define", "_TEST_", "#define _TEST02_ \"test02\""], right? At least the C preprocessor would handle it the same way in that case.
So what that tells us is that the newline is significant. Therefore you can't ignore it. You can tell Tatsu to only ignore tabs and spaces (not newlines) either by passing whitespace = '\t ' as an option when creating the parser, or by adding this line to the grammar:
##whitespace :: /[\t ]+/
Now you'll need to explicitly mention newlines anywhere where newlines should go, so your rule becomes:
define = '#define' constantename [constante] '\n' ;
Now it's clear that the constant, if present, should appear before the line break, so for the line #define _TEST_, it would realize that there is no constant.
Note that you'll also want a rule to match empty lines, so empty lines aren't syntax errors.
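A minimal sketch of what that could look like (the start and emptyline rule names here are made up; adapt them to your grammar):
##whitespace :: /[\t ]+/

start = { define | emptyline } $ ;
define = '#define' constantename [constante] '\n' ;
emptyline = '\n' ;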

Antlr4: How can I match end of lines inside multiline comments?

I have to create a program that counts lines of code ignoring those inside a comment. I'm a newbie working with Antlr, and after trying a lot, the nearest I came to a solution is this erroneous grammar:
grammar Comments;
comment : startc content endc;
startc : '/*';
endc : '*/';
content : newline | contenttext;
contenttext : CONTENTCHARS+;
newline : '\r\n';
CONTENTCHARS
: ~'*' '/'
| ~'/' .
;
WS : [ \r\t]+ -> skip;
If I try with /*hello\r\nworld*/ the parser recognizes this, but produces an erroneous parse tree.
In order to count lines, the parser needs to detect newline characters, inside and outside multiline comments. I think my problem is that I don't know how to say "match everything inside /* and */ except \r\n.
Please, can you point me in the right direction? Any help will be appreciated.
Solution
Let's simplify your grammar. We will ignore whitespace characters and comments at the lexer stage, and that disposes of the unwanted newlines at the same time: the COMMENT rule matches single-line or multi-line comments and just skips them.
Next, we introduce a counter variable for counting NEWLINE tokens. Only newlines outside of comments produce NEWLINE tokens, because a newline inside a comment is consumed as part of the skipped COMMENT token.
Whenever we encounter a NEWLINE token we increment the counter.
grammar Comments;

@lexer::members {
    int counter = 0;
}

WS : [ \r\t]+ -> skip;
COMMENT : '/*' .*? '*/' NEWLINE? -> skip;
TEXT : [a-zA-Z0-9]+;
NEWLINE : '\r'? '\n' { System.out.println("Newlines so far: " + (++counter)); };

content: (TEXT | COMMENT | NEWLINE)* EOF;
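A minimal, hypothetical Java harness to try it (CommentsLexer and CommentsParser are the classes ANTLR generates from grammar Comments; requires the ANTLR 4 runtime):
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class CountLines {
    public static void main(String[] args) {
        // the newline inside the comment is consumed by the skipped COMMENT
        // token, so only the trailing newline is counted
        CommentsLexer lexer = new CommentsLexer(
            CharStreams.fromString("hello /*hello\r\nworld*/ world\r\n"));
        CommentsParser parser = new CommentsParser(new CommonTokenStream(lexer));
        parser.content(); // prints: Newlines so far: 1
    }
}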

I thought this parsing would be simple

... and I'm hitting a wall. I don't understand why this doesn't work (I need to be able to parse either the single-tag version, terminated with />, or the two-tag version, terminated with </pre:myTag>):
Rebol[]
content: {<pre:myTag attr1="helloworld" attr2="hello"/>
<pre:myTag attr1="helloworld" attr2="hello">
</pre:myTag>
<pre:myTag attr3="helloworld" attr4="hello"/>
}
spacer: charset reduce [#" " newline]
letter: charset reduce ["ABCDEFGHIJKLMNOPQRSTUabcdefghijklmnopqrstuvwxyz1234567890="]
rule: [
    any [
        {<pre:myTag}
        any [any letter {"} any letter {"}] mark:
        (print {clipboard... after any letter {"} any letter {"}} write clipboard:// mark input)
        any spacer mark: (print "clipboard..." write clipboard:// mark input) ["/>" | ">"
            any spacer </pre:myTag>
        ]
        any spacer
        (insert mark { Visible="false"})
    ]
    to end
]
parse content rule
write clipboard:// content
print "The end"
input
In this case, the problem isn't your rule - it's that your INSERT after each tag alters the position at the point where you do the insert.
To illustrate:
>> probe parse str: "abd" ["ab" mark: (insert mark "c") "d"] probe str
false
"abcd"
== "abcd"
The insert is correct, but after the insert the parse rule is still at the same position: where there was just "d", there is now "cd", and the rule fails. Three strategies:
1) Incorporate the new content:
>> probe parse str: "abd" ["ab" mark: (insert mark "c") "cd"] probe str
true
"abcd"
== "abcd"
2) Calculate the length of the new content and skip:
>> probe parse str: "abd" ["ab" mark: (insert mark "c") 1 skip "d"] probe str
true
"abcd"
== "abcd"
3) Change the position after the manipulation:
>> probe parse str: "abd" ["ab" mark: (mark: insert mark "c") :mark "d"] probe str
true
"abcd"
== "abcd"
Number 2 would be the quickest in your case, as you know your inserted string { Visible="false"} is 16 characters long:
rule: [
    any [
        {<pre:myTag} ; opens tag
        any [ ; eats through all attributes
            any letter {"} any letter {"}
        ]
        mark: ( ; mark after the last attribute, pause (input)
            print {clipboard... after any letter {"} any letter {"}}
            write clipboard:// mark
            input
        )
        any spacer mark: ; space, mark, print, pause
        (print "clipboard..." write clipboard:// mark input)
        [ ; close tag
            "/>"
            |
            ">" any spacer </pre:myTag>
        ]
        any spacer ; redundant without /all
        (insert mark { Visible="false"})
        16 skip ; adjust position based on the new content
    ]
    to end
]
Note: this is the same rule as yours with just [16 skip] added.

Is it possible to override rebol path operator?

It is possible to override rebol system words like print, make etc., so is it possible to do the same with the path operator? And if so, what's the syntax?
Another possible approach is to use REBOL's meta-programming capabilities and preprocess your own code to catch path accesses and add your handler code. Here's an example:
apply-my-rule: func [spec [block!] /local value][
    print [
        "-- path access --" newline
        "object:" mold spec/1 newline
        "member:" mold spec/2 newline
        "value:" mold set/any 'value get in get spec/1 spec/2 newline
        "--"
    ]
    :value
]
my-do: func [code [block!] /local rule pos][
    parse code rule: [
        any [
            pos: path! (
                pos: either object? get pos/1/1 [
                    change/part pos reduce ['apply-my-rule to-block pos/1] 1
                ][
                    next pos
                ]
            ) :pos
            | into rule ;-- dive into nested blocks
            | skip ;-- skip all other values
        ]
    ]
    do code
]
;-- example usage --
obj: make object! [
    a: 5
]
my-do [
    print mold obj/a
]
This will give you:
-- path access --
object: obj
member: a
value: 5
--
5
Another (slower but more flexible) approach is to pass your code as a string to the preprocessor, freeing yourself from any REBOL-specific syntax rules, as in:
my-alternative-do {
print mold obj..a
}
The preprocessor code would then spot all the .. accesses, change the code to insert the proper calls to 'apply-my-rule, and in the end run the code with:
do load code
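For illustration, a naive Rebol 2 sketch of such a string-mode preprocessor (hypothetical code: it blindly rewrites every word..word pair and would need more work to skip string literals, comments, and longer paths):
preprocess: func [code [string!] /local chars word1 word2 s e] [
    ; characters allowed in the words around ".." (a rough approximation)
    chars: charset [#"a" - #"z" #"A" - #"Z" #"0" - #"9" "-"]
    parse/all code [
        any [
            s: copy word1 some chars ".." copy word2 some chars e: (
                ; rewrite "obj..a" into "apply-my-rule [obj a]"
                e: change/part s rejoin ["apply-my-rule [" word1 " " word2 "]"] e
            ) :e
            | skip
        ]
    ]
    code
]
do load preprocess {print mold obj..a}  ; same output as the block-mode example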
There's no real limit on how far you can process and change your whole code at runtime (the so-called "block mode" of the first example being the most efficient way).
You mean replace (say)....
print mold system/options
with (say)....
print mold system..options
....where I've replaced REBOL's forward slash with dot dot syntax?
Short answer: no. Some things are hardwired into the parser.