spaCy 'IS_SPACE' flag doesn't work

I've been trying to match something like '$125.00/share' in spaCy using its rule-based matching, as mentioned here: https://github.com/explosion/spaCy/issues/882. However, when trying out
nlp = en_core_web_sm.load()
matcher = Matcher(nlp.vocab)
doc = nlp(u'$125.00/share, $ 125 / share, $ 125.00 / share, $ 125 . 00 / share')

token_pattern = [{'NORM': '$'}, {'IS_DIGIT': True}, {'ORTH': '.', 'OP': '?'},
                 {'IS_DIGIT': True, 'OP': '?'}, {'ORTH': '/'}, {'LOWER': 'share'}]

def matched_pattern(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]
    print('matched!', span)

matcher.add('SharePrice', matched_pattern, token_pattern)
matches = matcher(doc)
I get back,
('matched!', $ 125 / share)
('matched!', $ 125 . 00 / share)
Instead, I want to match patterns like '$125.00/share' without the spaces in between. When I try
token_pattern = [{'NORM': '$'}, {'IS_SPACE': False}, {'IS_DIGIT': True}, {'IS_SPACE': False},
                 {'ORTH': '.', 'OP': '?'}, {'IS_SPACE': False}, {'IS_DIGIT': True, 'OP': '?'},
                 {'IS_SPACE': False}, {'ORTH': '/'}, {'IS_SPACE': False}, {'LOWER': 'share'}]
my expression doesn't match any pattern. Please help!

The problem here is that each dictionary in the match pattern describes an actual, existing token – so {'IS_SPACE': False} will match any token that is not a whitespace character (for example, a token with the text "dog" or "123" or anything, really). There is no way for the matcher to match on the absence of a token.
I just tried your example and by default, spaCy's tokenizer splits "$125.00/share" into only two tokens: ['$', '125.00/share']. As the matcher steps through the tokens, it won't match, as it's looking for a currency symbol + a non-space character + a digit + a bunch of other tokens.
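You can check that yourself with just the tokenizer (a minimal sketch; spacy.blank('en') is used here so no model download is needed, and the English tokenizer rules should be the same as in en_core_web_sm):

import spacy

nlp = spacy.blank('en')   # only the tokenizer matters for this check
print([t.text for t in nlp(u'$125.00/share')])
# expected output: ['$', '125.00/share']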
So in order to match on more specific parts of the token "125.00/share" – like the number, the forward slash and "share" – you'll have to make sure that spaCy splits those into separate tokens. You can do this by customising the tokenization rules and adding a new infix rule that splits tokens on / characters. This will result in "$125.00/share" → ['$', '125.00', '/', 'share'], which your pattern can then step through token by token (note that "125.00" is not IS_DIGIT because of the dot, so you may want {'LIKE_NUM': True} for the number part).
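For example, here is a minimal sketch of such an infix rule, assuming spaCy v2 or later (compile_infix_regex and Defaults.infixes exist in both v2 and v3):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank('en')

# add "/" to the default infix patterns so the tokenizer splits on it
infixes = list(nlp.Defaults.infixes) + [r'/']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp(u'$125.00/share')])
# expected output: ['$', '125.00', '/', 'share']

With those four tokens in place, the matcher can walk over the currency symbol, the number, the slash and "share" individually.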
Btw, some background on whitespace tokens: During tokenization, spaCy splits tokens on single whitespace characters. Those characters won't be available as individual tokens (but to make sure that no information is lost, they can be accessed via the token's .text_with_ws attribute). However, if there is more than one whitespace character present, spaCy will preserve those as tokens, which will return True for IS_SPACE. All other tokens will return False for IS_SPACE.
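A quick way to see this behaviour (a minimal sketch; again only the tokenizer is needed, and the expected output is approximate):

import spacy

nlp = spacy.blank('en')
doc = nlp(u'$ 125.00 /  share')   # note the double space before "share"
for token in doc:
    print(repr(token.text), token.is_space, repr(token.text_with_ws))
# '$'      False '$ '
# '125.00' False '125.00 '
# '/'      False '/ '
# ' '      True  ' '
# 'share'  False 'share'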


How to commit to an alternation branch in a Raku grammar token?

Suppose I have a grammar with the following tokens
token paragraph {
    (
    || <header>
    || <regular>
    )
    \n
}
token header { ^^ '---' '+'**1..5 ' ' \N+ }
token regular { \N+ }
The problem is that a line starting with ---++Foo will be parsed as a regular paragraph because there is no space before "Foo". I'd like to fail the parse in this case, i.e. somehow "commit" to this branch of the alternation, e.g. after seeing --- I want to either parse the header successfully or fail the match completely.
How can I do this? The only way I see is to use a negative lookahead assertion before <regular> to check that it does not start with ---, but this looks rather ugly and impractical, considering that my actual grammar has many more than just these 2 branches. Is there some better way? Thanks in advance!
If I understood your question correctly, you could do something like this:
token header {
    ^^ '---' [
        || '+'**1..5 ' ' \N+
        || { die "match failed near position $/.pos()" }
    ]
}

Node depth encoded as number of stars

Documents in this language look like
* A top-level Headline
Some text about that headline.
** Sub-Topic 1
Text about the sub-topic 1.
*** Sub-sub-topic
More text here about the sub-sub-topic
** Sub-Topic 2
Extra text here about sub-topic 2
*** Other Sub-sub-topic
More text here about the other sub-sub-topic
The number of depth levels is unlimited. I'm wondering how to get a parser that'll build the nested trees appropriately. I've been looking at the indenter example for inspiration, but I haven't figured it out.
The problem would require a context-sensitive grammar, so we use the work-around from the indenter example you linked:
We write a custom postlex processor that keeps a stack of the observed indent levels. When a star token (*, **, ***, ...) is read, the stack is popped until the indent level on the stack is smaller, then the new level is pushed on the stack. For each push/pop, corresponding INDENT/DEDENT helper tokens are injected into the token stream. These helper tokens can then be used in the grammar to obtain a parse tree that reflects the nesting level.
from lark import Lark, Token

tree_grammar = r"""
    start: NEWLINE* item*
    item: STARS nest
    nest: _INDENT (nest | LINE+ item*) _DEDENT

    STARS.2: /\*+/
    LINE.1: /.*/ NEWLINE

    %declare _INDENT _DEDENT
    %import common.NEWLINE
"""

class StarIndenter():
    STARS_type = 'STARS'
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'

    def dedent(self, level, token):
        """ When the given level leaves the current nesting of the stack,
            inject corresponding number of DEDENT tokens into the stream.
        """
        while level <= self.indent[-1]:
            pop_level = self.indent.pop()
            pop_diff = pop_level - self.indent[-1]
            for _ in range(pop_diff):
                yield token

    def handle_stars(self, token):
        """ Handle tokens of the form '*', '**', '***', ...
        """
        level = len(token.value)

        dedent_token = Token.new_borrow_pos(self.DEDENT_type, '', token)
        yield from self.dedent(level, dedent_token)

        diff = level - self.indent[-1]
        self.indent.append(level)

        # Put star token into stream
        yield token

        indent_token = Token.new_borrow_pos(self.INDENT_type, '', token)
        for _ in range(diff):
            yield indent_token

    def process(self, stream):
        self.indent = [0]

        # Process token stream
        for token in stream:
            if token.type == self.STARS_type:
                yield from self.handle_stars(token)
            else:
                yield token

        # Inject closing dedent tokens
        yield from self.dedent(1, Token(self.DEDENT_type, ''))

    # Lark's postlex interface expects an 'always_accept' attribute
    @property
    def always_accept(self):
        return ()

parser = Lark(tree_grammar, parser='lalr', postlex=StarIndenter())
Note the STARS terminal has been assigned higher priority than LINE (via .2 vs. .1), to prevent LINE+ from eating up lines starting with a star.
Using a stripped down version of your example:
test_tree = """
* A
** AA
*** AAA
** AB
*** ABA
"""
print(parser.parse(test_tree).pretty())
Results in:
start
  item
    *
    nest
      A
      item
        **
        nest
          AA
          item
            ***
            nest AAA
      item
        **
        nest
          AB
          item
            ***
            nest ABA

Lingo Code "[cc]" is coming up as a fault

The game is a word-search game from an advanced Lingo book, and the Lingo code uses [cc], which comes up as a code fault. What is wrong? Is this use of [cc] obsolete, and if so, how can it be corrected?
on getPropertyDescriptionList me
  list = [:]
  -- the text member with the words in it
  addProp list, #pWordSource,[cc]
    [#comment: "Word Source",[cc]
    #format: #text,[cc]
    #default: VOID]
  addProp list, #pEndGameFrame,[cc]
    [#comment: "End Game Frame",[cc]
    #format: #marker,[cc]
    #default: #next]
  return list
end
I guess this is code from here, right?
That seems like an older version of Lingo syntax. [cc], apparently, stands for "continuation character". It basically makes the compiler ignore the linebreak right after it, so that it sees everything from [#comment: to #default: VOID] as one long line, which is the syntactically correct way to write it.
If I remember correctly, once upon a time the people who made Lingo made one more crazy decision and made the continuation character look like this: ¬. Of course, this didn't print in lots of places, so some texts like your book used things like [cc] in its place.
In modern versions of Lingo, the continuation character is \, just like in C.
I programmed in early Director but have moved on to other languages in the many years since. I understand this code. The function generates a dictionary of dictionaries. In quasi-JSON:
{
'pWordSource': { ... } ,
'pEndGameFrame': { ... }
}
It is creating a string hash, then storing "pWordSource" as a new key pointing to a three-item hash of its own. The system then repeats the process with a new key "pEndGameFrame", providing yet another three-item hash. So just to expand the ellipses ... from the above code example:
{
'pWordSource': { 'comment': 'Word Source', 'format': 'text', 'default': null } ,
'pEndGameFrame': { 'comment': 'End Game Frame', 'format': 'marker', 'default': 'next' }
}
So I hope that explains the hash characters. It's Lingo's way of saying "this is not just a string, it's a special Director-specific thing we're calling a symbol". It can be described in more conventional programming terms as a constant. The Lingo compiler will replace your #string1 with an integer, and it's always going to be the same integer associated with #string1. Because the hash keys are actually integers rather than strings, we can change the JSON model to look something more like this:
{
0: { 2: 'Word Source', 3: 'text', 4: null } ,
1: { 2:'End Game Frame', 3: 'marker', 4: 'next' }
}
where:
0 -> pWordSource
1 -> pEndGameFrame
2 -> comment
3 -> format
4 -> default
So to mimic the same construction behavior in 2016 Lingo, we use the newer object-oriented dot syntax for calling addProp on property lists.
on getPropertyDescriptionList me
  list = [:]
  -- the text member with the words in it
  list.addProp(#pWordSource,[ \
    #comment: "Word Source", \
    #format: #text, \
    #default: void \
  ])
  list.addProp(#pEndGameFrame,[ \
    #comment: "End Game Frame", \
    #format: #marker, \
    #default: #next \
  ])
  return list
end
Likewise, the same reference shows examples of how to use square brackets to "access" properties, then initialize them by setting their first value.
on getPropertyDescriptionList me
  list = [:]
  -- the text member with the words in it
  list[#pWordSource] = [ \
    #comment: "Word Source", \
    #format: #text, \
    #default: void \
  ]
  list[#pEndGameFrame] = [ \
    #comment: "End Game Frame", \
    #format: #marker, \
    #default: #next \
  ]
  return list
end
And if you are still confused about what the backslashes are doing, there are other ways to make the code more vertical.
on getPropertyDescriptionList me
  list = [:]
  -- the text member with the words in it
  p = [:]
  p[#comment] = "Word Source"
  p[#format] = #text
  p[#default] = void
  list[#pWordSource] = p
  p = [:] -- allocate new dict to avoid pointer bug
  p[#comment] = "End Game Frame"
  p[#format] = #marker
  p[#default] = #next
  list[#pEndGameFrame] = p
  return list
end
This code runs and works in Director 12.0 on OS X Yosemite.

URL parameter values with special characters cause errors in the view

I have edited the URL manager to provide SEO-friendly URLs, but I get a problem when the URL has values with special characters such as . or () or - or any other special character.
http://localhost/nbnd/search/city/delhi
In the city action:
var_dump($_GET);
Output: array(1) { ["city"]=> string(6) "delhi" }
but when the URL contains a special character:
http://localhost/nbnd/search/city/north-delhi or
http://localhost/nbnd/search/city/north.delhi or
http://localhost/nbnd/search/city/(north)delhi
In the city action:
var_dump($_GET);
Output: array(1) { ["north-delhi"]=> string(0) "" }
and similarly for the other URLs. This change in the array values results in an error.
As you want all sorts of characters, change your rule from the related question/answer:
'<controller:\w+>/<action:\w+>/<city>'=>'<controller>/<action>',
// omit the pattern \w+ from city:\w+
Documentation:
In case when ParamPattern is omitted, it means the parameter should match any characters except the slash /.

ANTLR syntactic predicate not matching

I have the following grammar:
rule : (PATH)=> (PATH) SLASH WORD
{System.out.println("file: " + $WORD.text + " path: " + $PATH.text);};
WORD : ('a'..'z')+;
SLASH : '/';
PATH : (WORD SLASH)* WORD;
but it does not work for a string like "a/b/c/filename".
I thought I could solve this "path" problem with the syntactic predicate feature. Maybe I am doing something wrong here and have to redefine the grammar. Any suggestions for this problem?
You must understand that a syntactic predicate will not cause the parser to give the lexer some sort of direction w.r.t. what token the parser would "like" to retrieve. A syntactic predicate is used to force the parser to look ahead in an existing token stream to resolve ambiguities (emphasis on 'existing': the parser has no control over what tokens are created!).
The lexer operates independently from the parser, creating tokens in a systematic way:
it tries to match as many characters as possible;
whenever 2 (or more) rules match the same number of characters, the rule defined first gets precedence over the rule(s) defined later.
So in your case, given the input "a/b/c/filename", the lexer will greedily match the entire input as a single PATH token.
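To make those two rules concrete, here is a toy Python sketch of that lexing strategy (an illustration only, not how ANTLR is actually implemented):

import re

# At each position, try every token rule; the longest match wins,
# and on a tie the rule defined first wins.
RULES = [('WORD', r'[a-z]+'),
         ('SLASH', r'/'),
         ('PATH', r'(?:[a-z]+/)*[a-z]+')]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None  # (match_length, rule_index, rule_name)
        for index, (name, pattern) in enumerate(RULES):
            m = re.match(pattern, text[pos:])
            if m and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), index, name)
        tokens.append((best[2], text[pos:pos + best[0]]))
        pos += best[0]
    return tokens

print(tokenize('a/b/c/filename'))
# PATH greedily matches the whole input: [('PATH', 'a/b/c/filename')]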
If you want to get the file name, either retrieve it from the PATH:
rule : PATH
       {
         String file = $PATH.text.substring($PATH.text.lastIndexOf('/') + 1);
         System.out.println("file: " + file + ", path: " + $PATH.text);
       }
     ;

WORD  : ('a'..'z')+;
SLASH : '/';
PATH  : (WORD SLASH)* WORD;
or create a parser rule that matches a path:
rule : dir WORD
       {
         System.out.println("file: " + $WORD.text + ", dir: " + $dir.text);
       }
     ;

dir   : (WORD SLASH)+;
WORD  : ('a'..'z')+;
SLASH : '/';