Add custom punctuation to spacy model - spacy

How do you add custom punctuation (e.g. an asterisk) to the infix list in a Tokenizer and have it recognized by nlp.explain as punctuation? I would like to take characters from my infix list that are not currently recognized as punctuation and add them to the punctuation list, so that the Matcher can use them when matching {'IS_PUNCT': True}.
An answer to a similar issue was provided here: "How can I add custom signs to spaCy's punctuation functionality?"
The only problem is I am unable to package the newly recognized punctuation with the model. A side note: the tokenizer already recognizes infixes with the desired punctuation, so all that is left is propagating this to the Matcher.

The lexeme attribute IS_PUNCT is completely separate from any of the tokenizer settings. In a packaged pipeline, you'd either create a custom language (https://spacy.io/usage/linguistic-features#language-subclass) or run the customization in a callback in [nlp.before_creation] (https://spacy.io/usage/training#custom-code-nlp-callbacks).
Be aware that modifying EnglishDefaults affects all English pipelines loaded in the same script, so the custom language option is cleaner (in particular if you're distributing this model for general use), but also slightly more work to implement.
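As a rough illustration of the custom language option, here is a minimal sketch assuming spaCy v3. The language code custom_en, the helper is_punct_custom, and the extra symbol "~" are hypothetical names picked for the example; overriding the IS_PUNCT entry in lex_attr_getters is the approach this sketch assumes, not something the docs linked above spell out for this exact case.

```python
import spacy
from spacy.attrs import IS_PUNCT
from spacy.lang.en import English
from spacy.lang.lex_attrs import is_punct
from spacy.matcher import Matcher

# Hypothetical set of extra symbols that should count as punctuation.
EXTRA_PUNCT = {"~"}


def is_punct_custom(text):
    # Fall back to spaCy's stock check for everything else.
    return text in EXTRA_PUNCT or is_punct(text)


class CustomEnglishDefaults(English.Defaults):
    # Copy the getters so the stock English defaults are left untouched.
    lex_attr_getters = dict(English.Defaults.lex_attr_getters)
    lex_attr_getters[IS_PUNCT] = is_punct_custom


@spacy.registry.languages("custom_en")
class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults


nlp = spacy.blank("custom_en")
matcher = Matcher(nlp.vocab)
matcher.add("PUNCT", [[{"IS_PUNCT": True}]])

doc = nlp("a ~ b")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
```

Because the getter lives on a separate Defaults class, a plain English() pipeline created in the same script should keep its original IS_PUNCT behaviour. For a packaged pipeline, the code that registers custom_en still has to be importable wherever the pipeline is loaded.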
On the other hand, if you're just using the Matcher, it might be easier to use a REGEX pattern to match the tokens you want instead of customizing IS_PUNCT.
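A minimal sketch of the REGEX route, assuming spaCy v3 and a blank English pipeline. The symbol "~" and the pattern name PUNCT_LIKE are hypothetical; substitute whichever characters your tokenizer already splits off.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two alternative single-token patterns under one key:
# tokens spaCy already flags as punctuation, plus tokens made up
# only of the extra symbol(s) we care about (here a tilde).
patterns = [
    [{"IS_PUNCT": True}],
    [{"TEXT": {"REGEX": r"^[~]+$"}}],
]
matcher.add("PUNCT_LIKE", patterns)

doc = nlp("a ~ b , c")
matched = {doc[start:end].text for _, start, end in matcher(doc)}
```

This leaves the vocabulary and the language defaults alone, so nothing extra needs to be packaged with the pipeline; the trade-off is that token.is_punct itself stays False for the extra symbols.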

Related

JetBrains Language Support Plugin

I am trying to implement a language support plugin for a basic language. I am following JetBrains's tutorial for simple language support (.properties file support, basically), and on the side I have the Rust plugin for reference. However, the complexity gap between them is huge, so some questions are hard to answer from either source.
Here is my question: What is the best way to allow spaces between tokens which DO NOT require spaces, however force spacing between tokens which do?
i.e. class Foo{ <- here the first space is mandatory, but the second one (before the '{' symbol) can be omitted.

ANTLR4 - Generate code from non-file inputs?

Where do we start to manually build a CST from scratch? Or does ANTLR4 always require the lex/parse process as our input step?
I have some visual elements in my program that represent code structures.
e.g. a square represents a class, while a circle embedded within that square represents a method.
Now I want to turn those into code. How do I use ANTLR4 to do this, at runtime (using ANTLR4.js)? Most of the ANTLR examples seem to rely on lexing and parsing existing code to get to a syntax tree. So rather than:
input code->lex->parse->syntax tree->output code (1)
I want
manually create syntax tree->output code (2)
(Later, as the user adds code to that class and its methods, then ANTLR will be used as in (1).)
EDIT Maybe I'm misunderstanding this. Do I create some custom data structure and then run the parser over it? i.e. write structures to some in-memory format->parse->output code (3)?
IIUC, you could use StringTemplate directly.
By way of background, Antlr itself builds an in-memory parse-tree and then walks it, incrementally calling StringTemplate to output code snippets qualified by corresponding parse-tree node data. That Antlr uses an internal parse-tree is just a convenience for simplifying walking (since Antlr is built using Antlr).
If you have your own data structure, regardless of its specific implementation, procedurally process it to progressively call ST templates to emit the corresponding code. And, you can directly use the same templates that Antlr uses (JavaScript.stg), if they meet your requirements.
Of course, if your data structure is of a nature that can be lex'd/parsed into a standard Antlr parse-tree, you can then use a standard Antlr visitor to call and populate node-specific templates.
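To make the "walk your own data structure and call templates" idea concrete, here is a small sketch. It uses Python's standard-library string.Template as a stand-in for StringTemplate, and the ClassShape/MethodShape types and the emitted syntax are hypothetical; with ANTLR4.js you would do the same walk in JavaScript against real ST templates or plain template strings.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class MethodShape:
    # Hypothetical model of a circle in the visual editor.
    name: str


@dataclass
class ClassShape:
    # Hypothetical model of a square in the visual editor.
    name: str
    methods: list = field(default_factory=list)


# One template per node kind, filled in from node data.
CLASS_TMPL = Template("class $name {\n$body}\n")
METHOD_TMPL = Template("  $name() {}\n")


def emit_class(shape):
    # Walk the children first, then wrap their output in the parent template.
    body = "".join(METHOD_TMPL.substitute(name=m.name) for m in shape.methods)
    return CLASS_TMPL.substitute(name=shape.name, body=body)


code = emit_class(ClassShape("Foo", [MethodShape("bar"), MethodShape("baz")]))
```

No lexer or parser is involved here: the tree comes straight from the visual model, and the templates only decide how each node is rendered.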

How can I access hidden tokens in ANTLR AST?

I am trying to write a manual tree walker in Java for an AST generated by ANTLR V3. The AST is built using island grammars similar to the one described in "ANTLR: call a rule from a different grammar".
In the AST, I have a node for an expression list, with each expression as a child node. Now I need to know the line numbers of the COMMAs which separated the expressions. The COMMAs were present during parsing but were removed by the AST rewrite.
I see some resources (here and here) pointing to the usage of CommonTokenStream.getTokens, but I am not sure how I can access the CommonTokenStream while processing the AST. Is there any way I can get the CommonTokenStream used to build the AST?
The complete list of tokens is accessible through CommonTokenStream.getTokens(), which you can call before you call the tree walker. The list of tokens would be an argument to the walker. There's no need to change CommonTree, unless you want the recovered information embedded in the tree.
I've used the token list to associate hidden tokens such as comments and explicit line numbers (think FORTRAN) with the closest visible token. This was done post-processing the AST and looking at the line, column, and char-index information which is available for both the tokens in the list and the nodes in the AST.
My attempts at doing that during AST construction resulted in hacky, unmaintainable code. The post-processing code, OTOH, is Programming-101 algorithmic.
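The post-processing idea can be sketched as follows. This is a language-neutral illustration in Python rather than Java; the Token record and the (start, stop) spans are hypothetical stand-ins for the entries of the list returned by getTokens() and for the token start/stop indexes that AST nodes carry.

```python
from dataclasses import dataclass


@dataclass
class Token:
    # Hypothetical stand-in for a token from the full token list.
    type: str
    text: str
    line: int


def separator_lines(tokens, child_spans, sep_type="COMMA"):
    """Return the line of the separator between each pair of adjacent children.

    child_spans holds (start, stop) token indexes for each child expression.
    Only tokens strictly between one child's stop and the next child's start
    are inspected, so separators nested inside a child are ignored.
    """
    lines = []
    for (_, left_stop), (right_start, _) in zip(child_spans, child_spans[1:]):
        between = tokens[left_stop + 1:right_start]
        lines.append(next((t.line for t in between if t.type == sep_type), None))
    return lines


# Token list for the input "a,\nb,\nc"; the commas were dropped from the AST.
tokens = [
    Token("ID", "a", 1),
    Token("COMMA", ",", 1),
    Token("ID", "b", 2),
    Token("COMMA", ",", 2),
    Token("ID", "c", 3),
]
result = separator_lines(tokens, [(0, 0), (2, 2), (4, 4)])
```

The walker receives the token list as an argument, as described above, and looks between the index ranges of neighbouring children, so the tree itself does not need to change.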

Convert simple Antlr grammar to Xtext

I want to convert a very simple Antlr grammar to Xtext, so no syntactic predicates, no fancy features of Antlr not provided by Xtext. Consider this grammar
grammar simple; // Antlr3
foo: number+;
number: NUMBER;
NUMBER: '0'..'9'+;
and its Xtext counterpart
grammar Simple; // Xtext
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate Simple "http://www.example.org/Simple"
Foo: dummy=Number+;
Number: NUMBER_TOKEN;
terminal NUMBER_TOKEN: '0'..'9'+;
Xtext uses Antlr behind the scenes, but the two formats are not exactly the same. There are quite a few annoying (and partly understandable) things I have to modify, including:
Prefix terminals with the terminal keyword
Include import "http://www.eclipse.org/emf/2002/Ecore" as ecore to make terminals work
Add a feature to the top-level rule, e.g. foo: dummy=number+
Keep in mind that rule and terminal names have to be unique even case-insensitive.
Optionally, capitalize the first letter of rule names to follow Java convention.
Is there a tool to make this conversion automatically at least for simple cases? If not, is there a more complete checklist of such required modifications?
It's basically not possible to do this conversion automatically, since the Antlr grammar lacks information that is required in the Xtext grammar. The rule names in Xtext are used to create classes, and the assignments in Xtext become getters and setters in those classes. However, these assignments should not be used for every rule call, since there are special patterns in Xtext that reduce the noise in the resulting AST. Details like these make it hard to do the transformation automatically. However, it's usually straightforward to copy the Antlr grammar into the Xtext editor and fix the issues manually.

Regex grammar for action mapping in Struts 1.x to map the URLs?

I know the documentation says that only the wildcard (*) and escape sequences are supported. I want to differentiate between /somepath/category/subcategory and /somepath/category/article.html
I need to route both to different action handlers. I want to use wildcard characters in path mappings.
What is the best approach to do this?