Add to map inside of parse dialect - rebol

I want to create a map where a hash is associated to a url in order to check if some url is or is not in the map.
If it's not yet in the map, add it (and the url) to it.
parseContainer: func [cstr [string!]] [
parse cstr [
thru "<a href=" to {"}
thru "http://" copy quarto_url to {"}
(
quarto_hash: checksum/method to-binary quarto_url 'md5
old: find words-of checksums 'quarto_hash
if not old [append checksums [quarto_hash quarto_url ]]
)
]
]
But the words quarto_hash and quarto_url are not converted to their values.
This is maybe not the simplest approach for the problem, so I'll wait for your input.
One other question: is map able to address insertion and search of elements quickly for thousands of elements, or is there any other more appropriate type?
FYI, I'm using Rebol3, but included the Red tag as well because I'll also be using Red in a short future.
What is the best approach for this?

You need to reduce the words quarto_hash and quarto_url
if not old [append checksums reduce [quarto_hash quarto_url ]]
There is also no need to extract the words of the map, you should be faster with a select direct on the map
I would use
if not select checksums quarto_hash [
append checksums reduce [quarto_hash quarto_url ]
]

You need to do one of the following:
if not old [repend checksums [quarto_hash quarto_url]]
or
if not old [append checksums reduce [quarto_hash quarto_url]]
You will now find that quarto_hash and quarto_url are evaluated before adding into the checksums (hash)map.
With regards to map insertion/searching you will find examples below easier/faster:
>> checksums: map []
== make map! [
]
>> quarto-url: "www.draegtun.com"
== "www.draegtun.com"
>> quarto-hash: checksum/method to-binary quarto-url 'md5
== #{D9F71DD455C3E84E7C5C8F2C23FDF174}
>> checksums/:quarto-hash
== none
>> ;; returns NONE if key not present in map
>> checksums/:quarto-hash: quarto-url
== "www.draegtun.com"
>> checksums
== make map! [
#{D9F71DD455C3E84E7C5C8F2C23FDF174} "www.draegtun.com"
]

Related

Is it possible to interpolate Array values in token?

I'm working on homoglyphs module and I have to build regular expression that can find homoglyphed text corresponding to ASCII equivalent.
So for example I have character with no homoglyph alternatives:
my $f = 'f';
and character that can be obfuscated:
my #o = 'o', 'о', 'ο'; # ASCII o, Cyrillic o, Greek omicron
I can easily build regular expression that will detect homoglyphed phrase 'foo':
say 'Suspicious!' if $text ~~ / $f #o #o /;
But how should I compose such regular expression if I don't know the value to detect in compile time? Let's say I want to detect phishing that contains homoglyphed 'cash' word in messages. I can build sequence with all the alternatives:
my #lookup = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], 's', 'h'; # arbitrary runtime length
Now obviously following solution cannot "unpack" array elements into the regular expression:
/ #lookup / # doing LTM, not searching elements in sequence
I can workaround this by manually quoting each element and compose text representation of alternatives to get string that can be evaluated as regular expression. And build token from that using string interpolation:
my $regexp-ish = textualize( #lookup ); # string "[ 'c' | 'с' | 'ϲ' | 'ς' ] [ 'a' | 'а' | 'α' ] 's' 'h'"
my $token = token { <$regexp-ish> }
But that is quite error-prone.
Is there any cleaner solution to compose regular expression on the fly from arbitrary amount of elements not known at compile time?
The Unicode::Security module implements confusables by using the Unicode consortium tables. It's actually not using regular expressions, just looking up different characters in those tables.
I'm not sure this is the best approach to use.
I haven't implemented a confusables1 module yet in Intl::, though I do plan on getting around to it eventually, here's two different ways I could imagine a token looking.2
my token confusable($source) {
:my $i = 0; # create a counter var
[
<?{ # succeed only if
my $a = self.orig.substr: self.pos+$i, 1; # the test character A
my $b = $source.substr: $i++, 1; # the source character B and
so $a eq $b # are the same or
|| $a eq %*confusables{$b}.any; # the A is one of B's confusables
}>
. # because we succeeded, consume a char
] ** {$source.chars} # repeat for each grapheme in the source
}
Here I used the dynamic hash %*confusables which would be populated in some way — that will depend on your module and may not even necessarily be dynamic (for example, having the signature :($source, %confusables) or referencing a module variable, etc.
You can then have your code work as follows:
say $foo ~~ /<confusable: 'foo'>/
This is probably the best way to go about things as it will give you a lot more control — I took a peak at your module and it's clear you want to enable 2-to-1 glyph relationships and eventually you'll probably want to be running code directly over the characters.
If you are okay with just 1-to-1 relationships, you can go with a much simpler token:
my token confusable($source) {
:my #chars = $source.comb; # split the source
#( # match the array based on
|( # a slip of
%confusables{#chars.head} # the confusables
// Empty # (or nothing, if none)
), #
#a.shift # and the char itself
) #
** {$source.chars} # repeating for each source char
}
The #(…) structure lets you effectively create an adhoc array to be interpolated. In this case, we just slip in the confusables with the original, and that's that. You have to be careful though because a non-existent hash item will return the type object (Any) and that messes things up here (hence // Empty)
In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolations errors.
1Unicode calls homographs both "visually similar characters" and "confusables".
2The dynamic hash here %confusables could be populated any number of ways, and may not necessarily need to be dynamic, as it could be populated via the arguments (using a signature like :($source, %confusables) or referencing a module variable.

What's the inverse of block: load text in rebol / red

Let's say I have some rebol / red code. If I load the source text, I get a block, but how can get back the source text from block ? I tried form block but it doesn't give back the source text.
text: {
Red [Title: "Red Pretty Printer"]
out: none ; output text
spaced: off ; add extra bracket spacing
indent: "" ; holds indentation tabs
emit-line: func [] [append out newline]
emit-space: func [pos] [
append out either newline = last out [indent] [
pick [#" " ""] found? any [
spaced
not any [find "[(" last out find ")]" first pos]
]
]
]
emit: func [from to] [emit-space from append out copy/part from to]
clean-script: func [
"Returns new script text with standard spacing."
script "Original Script text"
/spacey "Optional spaces near brackets and parens"
/local str new
] [
spaced: found? spacey
clear indent
out: append clear copy script newline
parse script blk-rule: [
some [
str:
newline (emit-line) |
#";" [thru newline | to end] new: (emit str new) |
[#"[" | #"("] (emit str 1 append indent tab) blk-rule |
[#"]" | #")"] (remove indent emit str 1) break |
skip (set [value new] load/next str emit str new) :new
]
]
remove out ; remove first char
]
print clean-script read %clean-script.r
}
block: load text
LOAD is a higher-level operation with complex behaviors, e.g. it can take a FILE!, a STRING!, or a BLOCK!. Because it does a lot of different things, it's hard to speak of its exact complement as an operation. (For instance, there is SAVE which might appear to be the "inverse" of when you LOAD from a FILE!)
But your example is specifically dealing with a STRING!:
If I load the source text, I get a block, but how can get back the source text from block ?
As a general point, and very relevant matter: you can't "get back" source text.
In your example above, your source text contained comments, and after LOAD they will be gone. Also, a very limited amount of whitespace information is preserved, in the form of the NEW-LINE flag that each value carries. Yet what specific indentation style you used--or whether you used tabs or spaces--is not preserved.
On a more subtle note, small amounts of notational distinction are lost. STRING! literals which are loaded will lose knowledge of whether you wrote them "with quotes" or {with curly braces}...neither Rebol nor Red preserve that bit. (And even if they did, that wouldn't answer the question of what to do after mutations, or with new strings.) There are variations of DATE! input formats, and it doesn't remember which specific one you used. Etc.
But when it comes to talking about code round-tripping as text, the formatting is minor compared to what happens with binding. Consider that you can build structures like:
>> o1: make object! [a: 1]
>> o2: make object! [a: 2]
>> o3: make object! [a: 3]
>> b: compose [(in o1 'a) (in o2 'a) (in o3 'a)]
== [a a a]
>> reduce b
[1 2 3]
>> mold b
"[a a a]"
You cannot simply serialize b to a string as "[a a a]" and have enough information to get equivalent source. Red obscures the impacts of this a bit more than in Rebol--since even operations like to block! on STRING! and system/lexer/transcode appear to do binding into the user context. But it's a problem you will face on anything but the most trivial examples.
There are some binary formats for Rebol2 and Red that attempt to address this. For instance in "RedBin" a WORD! saves its context (and index into that context). But then you have to think about how much of your loaded environment you want dragged into the file to preserve context. So it's certainly opening a can of worms.
This isn't to say that the ability to MOLD things out isn't helpful. But there's no free lunch...so Rebol and Red programs wind up having to think about serialization as much as anyone else. If you're thinking of doing processing on any source code--for the reasons of comment preservation if nothing else--then PARSE should probably be the first thing you reach for.

How to create a clickable link in rebol VID?

Let's say I want to create a twitter client which list tweets. How would I create and detect a clickable link within a text zone ?
Update: I mean in rebol VID
This is a script that detects URLs in face/text and overlays hyperlinks: http://www.ross-gill.com/r/link-up.html
view layout [
my-text: text read %some.txt
do [link-up my-text]
]
It's based on the pattern in the article below, so you may need to adapt the recognition pattern to your specifications. The links are passed through a to-link function which by default is the same as to-url
http://www.ross-gill.com/page/Beyond_Regular_Expressions
In principle, you want to:
parse your string to identify URLs
replace each URL with an anchor tag
REBOL.org uses code very similar to the code below to do that. Note there are three elements to the implementation:
a set of parse definitions that define a URL and its components
a function that parses a string. Each time it finds a URL in the string, it calls an external function. It replaces the original string's URL with whatever that external function returns to it
an external function that simply wraps a URL in an anchor tag
;; ======================================
;; Definitions provided by
;; Andrew Martin, 15-June-2004
;; ....not all are needed for locating URLs ... so
;; feel free to remove unnecessary items
Octet: charset [#"^(00)" - #"^(FF)"]
Digit: charset "0123456789"
Digits: [some Digit]
Upper: charset [#"A" - #"Z"]
Lower: charset [#"a" - #"z"]
Alpha: union Upper Lower
Alphas: [some Alpha]
AlphaDigit: union Alpha Digit
AlphaDigits: [some AlphaDigit]
Hex: charset "0123456789ABCDEFabcdef"
Char: union AlphaDigit charset "-_~+*'"
Chars: [some [Char | Escape]]
Escape: [#"%" Hex Hex]
Path: union AlphaDigit charset "-_~+*'/.?=&;{}#"
Domain-Label: Chars
Domain: [Domain-Label any [#"." Domain-Label]]
IP-Address: [Digits #"." Digits #"." Digits #"." Digits]
User: [some [Char | Escape | #"."]]
Host: [Domain | IP-Address]
Email^: [User #"#" Host]
Url^: [["http://" | "ftp://" | "https://"] some Path]
;; function to locate URLs in a string
;; and call an action func when each is found
;; ==========================================
find-urls: func [
String [string!]
action-func [function!]
/local Start Stop
][
parse/all String [
any [
Start: copy url url^ Stop: (
Stop: change/part Start action-func url Stop
print start
)
thru </a> ;; this is dependent on the action-func setting </a> as an end marker
| skip
]
end
]
return String
]
;; example of usage with an action-func that
;; replaces url references with an anchor tag
;; ===========================================
target-string: {this string has this url http://www.test.com/path in it
and also this one: https://www.test.com/example.php}
find-urls target-string
func [url][print url return rejoin [{} url ]]
probe target-string
{this string has this url http://www.test.com/path in it
and also this one: https://www.test.com/example.php}
Notes
You should easily be able to see how to adapt find-urls into, say, find-email-addresses for obfucation and/or clickability; all the parse definitions for finding email addresses are in the sample above
You can see REBOL.org's version of this code in operation here, for example: http://www.rebol.org/aga-display-posts.r?offset=0&post=r3wp157x17091
I'll leave you the exercise of bypassing making it clickable if the URL is already in an anchor tag
Also left out: any need to escape chars in the URL (eg & ==> amp;)
Thanks to REBOL pioneer Andrew Martin for the original code that this is based on.

How to pick a letter in a charset?

Would like to do this:
letters: charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
pick letters 2
but pick doesn't work with charset so what should I use with charset to get the letter at position 2 ?
Charsets are "sets" and thus don't really have the concept of a "position". With them you can test for membership, do differencing, negation, etc. But they're just an optimization.
If you care about an "enumeration order" then it is your enumerator which enforces the order, not the set.
Note this code from
http://www.mail-archive.com/rebol-list#rebol.com/msg16432.html
bitset: charset "aaaaybcx"
chars: copy {}
for i 0 (subtract length? bitset 1) 1 [
if find bitset i [append chars to-char i]
]
?? chars
If you actually care about the order, consider keeping a series (e.g. string!) around. e.g. in your example above, nothing is stopping you from making:
letter-string: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
letter-set: charset letter-string
pick letter-string 2
Then you get the best of both worlds!

Parse and charset: why my script doesn't work

I want to extract attribute1 and attribute3 values only. I don't understand why charset doesn't seem to work in my case to "skip" any other attributes (attribute3 is not extracted as I would like):
content: {<tag attribute1="valueattribute1" attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>
}
attribute1: [{attribute1="} copy valueattribute1 to {"} thru {"}]
attribute3: [{attribute3="} copy valueattribute3 to {"} thru {"}]
spacer: charset reduce [tab newline #" "]
letter: complement spacer
to-space: [some letter | end]
attributes-rule: [(valueattribute1: none valueattribute3: none) [attribute1 | none] any letter [attribute3 | none] (print valueattribute1 print valueattribute3)
| [attribute3 | none] any letter [attribute1 | none] (print valueattribute3 print valueattribute1
valueattribute1: none valueattribute3: none
)
| none
]
rule: [any [to {<tag } thru {<tag } attributes-rule {>} to {</tag>} thru {</tag>}] to end]
parse content rule
output is
>> parse content rule
valueattribute1
none
== true
>>
Firstly you're not using parse/all. In Rebol 2 that means that whitespace has been effectively stripped out before the parse runs. That's not true in Rebol 3: if your parse rules are in block format (as you are doing here) then /all is implied.
(Note: There seemed to be consensus that Rebol 3 would throw out the non-block form of parse rules, in favor of the split function for those "minimal" parse scenarios. That would get rid of /all entirely. No action has yet been taken on this, unfortunately.)
Secondly your code has bugs, which I'm not going to spend time sorting out. (That's mostly because I think using Rebol's parse to process XML/HTML is a fairly silly idea :P)
But don't forget you have an important tool. If you use a set-word in the parse rule, then that will capture the parse position into a variable. You can then print it out and see where you're at. Change the part of attribute-rule where you first say any letter to pos: (print pos) any letter and you'll see this:
>> parse/all content rule
attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>
valueattribute1
none
== true
See the leading space? Your rules right before the any letter put you at a space... and since you said any letter was ok, no letters are fine, and everything's thrown off.
(Note: Rebol 3 has an even better debugging tool...the word ??. When you put it in the parse block it tells you what token/rule you're currently processing as well as the state of the input. With this tool you can more easily find out what's going on:
>> parse "hello world" ["hello" ?? space ?? "world"]
space: " world"
"world": "world"
== true
...though it's really buggy on r3 mac intel right now.)
Additionally, if you're not using copy then your pattern of to X thru X is unnecessary, you can achieve that with just thru X. If you want to do a copy you can also do that with the briefer copy Y to X X or if it's just a single symbol you could write the clearer copy Y to X skip
In places where you see yourself writing repetitive code, remember that Rebol can go a step above by using compose etc:
>> temp: [thru (rejoin [{attribute} num {=}])
copy (to-word rejoin [{valueattribute} num]) to {"} thru {"}]
>> num: 1
>> attribute1: compose temp
== [thru "attribute1=" copy valueattribute1 to {"} thru {"}]
>> num: 2
>> attribute2: compose temp
== [thru "attribute2=" copy valueattribute2 to {"} thru {"}]
Short answer, [any letter] eats your attribute3="..." as the #"^"" character is by your definition a 'letter. Additionally, you may have problems where there is no attribute2, then your generic second attribute rule will eat attribute3 and your attribute3 rule will not have anything to match - better to either be explicit that there is an optional attribute2 or an optional anything-but-attribute3
attribute1="foo" attribute2="bar" attribute3="foobar"
<- attribute1="..." -> <- any letter -> <- attibute3="..." ->
Also, 'parse without the /all refinement ignores spaces (or at least is very unwieldy where spaces are concerned) - /all is highly recommended for this type of parsing.
When adding parse/all it didn't seem to change anything. Finally this seems to work (using set-word has been indeed a great help for debugging !!!), what do you think ?
content: {<tag attribute1="valueattribute1" attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>
}
attribute1: [to {attribute1="} thru {attribute1="} copy valueattribute1 to {"} thru {"}]
attribute3: [to {attribute3="} thru {attribute3="} copy valueattribute3 to {"} thru {"}]
letter: charset reduce ["ABCDEFGHIJKLMNOPQRSTUabcdefghijklmnopqrstuvwxyz1234567890="]
attributes-rule: [(valueattribute1: none valueattribute3: none)
[attribute1 | none] any letter pos:
[attribute3 | none] (print valueattribute1 print valueattribute3)
| [attribute3 | none] any letter [attribute1 | none] (print valueattribute3 print valueattribute1
valueattribute1: none valueattribute3: none
)
| none
]
rule: [any [to {<tag } thru {<tag } attributes-rule {>} to {</tag>} thru {</tag>}] to end]
parse content rule
which outputs:
>> parse/all content rule
valueattribute1
valueattribute3
valueattribute11
none
== true
>>