Making a complex Relax NG attribute without using pattern? - relaxng

I have an attribute called 'page'. It is made up of two to three doubles, separated by commas, not spaces, with an optional '!' at the end. All of the following are valid:
page="8.5,11,3!"
page="8.5,11.4,3.1"
page="8.5,11!"
page="8.5,2.1"
I know I could use patterns; the following would work:
attribute page { xsd:string { pattern="[0-9]+(\.[0-9]+)?,[0-9]+(\.[0-9]+)?(,[0-9]+(\.[0-9]+)?)?(!)?" } }
But if possible, I'd rather use something like this:
attribute page { xsd:double, ",", xsd:double, ( ",", xsd:double )?, ("!")? }
I can make the above sort-of work, using 'list':
attribute page { list { xsd:double, ",", xsd:double, ( ",", xsd:double )?, ("!")? } }
But then I end up with spaces between each of the pieces:
page="8.5 , 11 !"
Is there any way to do this without using pattern?

Relax NG has no particular rules for how simple types are defined; it is designed to be able to use simple type libraries which make such rules. So in principle, yes, you can do what you like in Relax NG: just use a simple type library that provides the functionality you seek.
In practice, you seem to be using the XSD library of simple types. And while XSD does allow the definition of list types whose values are sequences of other simple values, for the sake of simplicity in the definition and in the validator, XSD list values are split by the parser on white space; XSD does not allow arbitrary separators for the values. So, no, you cannot do what you would like to do with Relax NG's XSD-based library of simple types.
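Since XSD patterns are implicitly anchored and this one uses only syntax shared with ordinary regex engines, you can sanity-check the pattern against the sample values outside the schema. A throwaway Python check, purely for illustration:

import re

# fullmatch() approximates the implicit anchoring of XSD patterns
PAGE = re.compile(r"[0-9]+(\.[0-9]+)?,[0-9]+(\.[0-9]+)?(,[0-9]+(\.[0-9]+)?)?(!)?")

for value in ["8.5,11,3!", "8.5,11.4,3.1", "8.5,11!", "8.5,2.1", "8.5 , 11 !"]:
    print(value, bool(PAGE.fullmatch(value)))
# the four comma-separated forms print True; the space-separated form prints False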

Related

Spacy tokenizer rule for exceptions that contain whitespace?

When I create a pipeline with the default tokenizer for say English, I can then call the method for adding a special case:
tokenizer.add_special_case("don't", case)
The tokenizer will happily accept a special case that contains whitespace:
tokenizer.add_special_case("some odd case", case)
but it appears that this does not actually change the behavior of the tokenizer, or at least the special case never matches?
More generally, what is the best way of extending an existing tokenizer so that some patterns which would normally result in multiple tokens create only one token? For example, something like [A-Za-z]+\([A-Za-z0-9]+\)[A-Za-z]+ should not result in three tokens because of the parentheses but in a single token, e.g. for asdf(a33b)xyz, while the normal English rules should still apply if that pattern does not match.
Is this something that can be done somehow by augmenting an existing tokenizer or would I have to first tokenize, then find entities that match the corresponding token patterns and then merge the entity tokens?
As you found, Tokenizer.add_special_case() doesn't work for handling tokens that contain whitespace. That's for adding strings like "o'clock" and ":-)", or expanding e.g. "don't" to "do not".
Modifying the prefix, suffix and infix rules (either by setting them on an existing tokenizer or creating a new tokenizer with custom parameters) also doesn't work since those are applied after whitespace splitting.
To override the whitespace splitting behavior, you have four options:
Merge after tokenization. You can use Retokenizer.merge(), or possibly merge_entities or merge_noun_chunks. The relevant documentation is here:
https://spacy.io/usage/linguistic-features#retokenization and https://spacy.io/api/pipeline-functions#merge_entities and https://spacy.io/api/pipeline-functions#merge_noun_chunks
This is your best bet for keeping as much of the default behavior as possible.
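A minimal sketch of this option (the regex and helper function are illustrative, not spaCy API): find the pattern on the raw text, then merge the tokens it covers back into one.

import re
import spacy

nlp = spacy.blank("en")
PAREN_RE = re.compile(r"[A-Za-z]+\([A-Za-z0-9]+\)[A-Za-z]+")

def merge_paren_tokens(doc):
    # merge every character span matching the pattern back into one token
    with doc.retokenize() as retokenizer:
        for m in PAREN_RE.finditer(doc.text):
            span = doc.char_span(m.start(), m.end())
            if span is not None and len(span) > 1:
                retokenizer.merge(span)
    return doc

doc = merge_paren_tokens(nlp("see asdf(a33b)xyz here"))
print([t.text for t in doc])  # asdf(a33b)xyz comes out as a single token

The same function can be registered as a pipeline component (with @Language.component in spaCy v3) so it runs on every document.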
Subclass Tokenizer and override __call__. Sample code:
from spacy.tokenizer import Tokenizer

def custom_tokenizer(nlp):
    class MyTokenizer(Tokenizer):
        def __call__(self, string):
            # do something before
            doc = super().__call__(string)
            # do something after
            return doc

    return MyTokenizer(
        nlp.vocab,
        prefix_search=nlp.tokenizer.prefix_search,
        suffix_search=nlp.tokenizer.suffix_search,
        infix_finditer=nlp.tokenizer.infix_finditer,
        token_match=nlp.tokenizer.token_match,
    )

# usage:
nlp.tokenizer = custom_tokenizer(nlp)
Implement a completely new tokenizer (without subclassing Tokenizer). Relevant docs here: https://spacy.io/usage/linguistic-features#custom-tokenizer-example
Tokenize externally and instantiate Doc with words. Relevant docs here: https://spacy.io/usage/linguistic-features#own-annotations
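For the last option, a minimal sketch (the word list is of course illustrative): build the Doc from pre-tokenized words plus a flag saying whether each word is followed by a space.

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["see", "asdf(a33b)xyz", "here", "!"]
spaces = [True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print([t.text for t in doc], repr(doc.text))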
To answer the second part of your question, if you don't need to change whitespace splitting behavior, you have two other options:
Add to the default prefix, suffix and infix rules. The relevant documentation is here: https://spacy.io/usage/linguistic-features#native-tokenizer-additions
Note from https://stackoverflow.com/a/58112065/594211: "You can add new patterns without defining a custom tokenizer, but there's no way to remove a pattern without defining a custom tokenizer."
Instantiate Tokenizer with custom prefix, suffix and infix rules. The relevant documentation is here: https://spacy.io/usage/linguistic-features#native-tokenizers
To get the default rules, you can read the existing tokenizer's attributes (as shown above) or use the nlp object’s Defaults. There are code samples for the latter approach in https://stackoverflow.com/a/47502839/594211 and https://stackoverflow.com/a/58112065/594211.
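A minimal sketch of extending the defaults in place, along the lines of the linked docs (the added suffix pattern is just an example; prefixes and infixes work the same way via compile_prefix_regex / compile_infix_regex):

import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")
suffixes = list(nlp.Defaults.suffixes) + [r"-+$"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
print([t.text for t in nlp("hello world---")])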
Use token_match to combine what would otherwise be multiple tokens into a single one.
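A hedged sketch of that suggestion: token_match is applied to whitespace-delimited substrings, so a pattern that covers the whole substring keeps it from being split further. The exact ordering of token_match relative to the prefix/suffix/infix rules has changed between spaCy versions, so verify this on the version you use.

import re
import spacy

nlp = spacy.blank("en")
# keep letters(alphanumerics)letters together as one token
nlp.tokenizer.token_match = re.compile(r"[A-Za-z]+\([A-Za-z0-9]+\)[A-Za-z]+").match
print([t.text for t in nlp("see asdf(a33b)xyz here")])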

What are the various ways to fetch properties from the payload?

What is the difference between the ways of reading properties from the payload? For example, there is a property in the payload named con_id. When I read this property as #[payload.con_id] it comes back as null, whereas #[payload.'con_id'] returns the value.
A few other notations I know of are #[payload['con_id']] and #[json:con_id].
Which one should be used in which scenario? If there are any special cases that call for a specific notation, please describe those scenarios as well.
Also, what is the common notation to use from a MuleSoft platform support point of view?
In Mule 3 any of those syntaxes is valid, except that the json: evaluator is for querying JSON documents, whereas the others are for querying maps/objects. Also, the json: evaluator is deprecated in Mule 3 in favor of transforming to a map and using the MEL expressions below.
payload.property
payload.'property'
payload['property']
The reason the first fails in your case is the special character '_'. The underscore forces the field name to be wrapped in quotes.
Typically the . notation is preferred over [''] as it is shorter for accessing map fields; then simply wrap property names in '' for any fields with special characters.
Note that in Mule 4 you don't need to transform to a map/object first. DataWeave expressions replace MEL as the expression language and allow you to query JSON, or any other type of payload, directly without transforming it to a map first.

.bind vs string interpolation in Aurelia

In our code base we have a mixture of the following:
attribute="${something}", attribute="${something | converter}", etc.
attribute.bind="something", attribute.bind="something | converter"
I find the latter easier to read.
The examples I'm referring to are exactly like the above; i.e., they do not add any additional string content.
I think that it's easier on Aurelia too. Am I correct?
Also, for these specific cases where no actual interpolation is involved, is there any benefit to the first form (other than it being two characters less to type)?
Given the examples you have shown, I would recommend using option 2. It really isn't "easier on Aurelia," but it is more explicit that you are binding the value of that attribute to the property listed.
Original Answer Below
The benefit of the first option is when you have, for example, an attribute that accepts many values but as a single string. The most common example of this is the class attribute. The class attribute accepts multiple classes in a space-separated list:
<div class="foo bar baz"></div>
Imagine we only want to add or remove the class baz from this list based on a prop on our VM someProp while leaving the other classes. To do this using the .bind syntax, we would have to create a property on our VM that has the full list but adds or removes baz as determined by the value of someProp. But using the string interpolated binding, this becomes much simpler:
<div class="foo bar ${someProp ? 'baz' : ''}"></div>
You can imagine how this could be extended with multiple classes being added or removed. You could maybe create a value converter to do this using the .bind syntax, but it might end up with something that wasn't as readable.
I could imagine a value converter being created that might look something like this in use:
<div class.bind="someProp | toggleClass:'baz':'foo':'bar'"></div>
I really think this is much less readable than using the string interpolation syntax.
By the way, the value converter I imagined above would look like this:
export class ToggleClassValueConverter {
    toView(value, toggledClass, ...otherProps) {
        return `${otherProps.join(' ')} ${value ? toggledClass : ''}`;
    }
}
The best part is that I'm still using string interpolation in the value converter :-)
After wading through the tabs I'd already opened I found this. Although it's not quite the same thing, and it's a bit old, there's a similar thing talked about on https://github.com/aurelia/templating-binding/issues/24#issuecomment-168112829 by Mr Danyow (emphasis mine)
yep, the symbol for binding behaviors is & (as opposed to | for value converters).
<input type="text" data-original="${name & oneTime}" value.bind="name" />
Here's the standard way to write a one-time binding. This will be a bit more light-weight in terms of parsing and binding:
<input type="text" data-original.one-time="name" value.bind="name" />
I don't know if it applies to the .bind/${name} case as well as the oneTime one in the example, but perhaps if it comes to his attention he can say either way.
Given this isn't a cut-and-dried answer, I'll be marking Ashley's as the answer, as it confirms the legibility question and provides useful information on other use cases should anyone else search on similar terms.

Finding classes/files/symbols with umlauts (ä, ö, ü) by their transliteration (ae, oe, ue)

I am working with code which uses German for naming of classes, symbols and files. All German special characters like ä, ü, ö and ß are used in their transliterated form, i.e. "ae" for "ä", "oe" for "ö" etc.
Since there is no technical reason for doing that anymore, I am evaluating whether allowing umlauts and the like in their natural form is feasible. The biggest problem here is that there will be inconsistent naming once umlauts are allowed. I.e. a class may be named "ÖffentlicheAuftragsübernahme" (new form) or "OeffentlicheAuftragsuebernahme" (old form). This will make searching for classes, symbols and files more difficult.
Is there a way to extend the search (code navigation to be exact) of IntelliJ IDEA in a way that it will ignore whether a name is written using umlauts or their transliteration?
I suppose this would require modifying the way IDEA indexes files. Would that be possible with a plugin? Or is there a different way to accomplish the desired result?
Example
Given the classes "KlasseÄ", "KlasseAe", "KlasseOe", "KlasseÜ"
IDEA "navigate to class" (CTRL+N) --> find result
"KlasseÄ" --> ["KlasseÄ", "KlasseAe"]
"KlasseAe" --> ["KlasseÄ", "KlasseAe"]
"KlasseUe" --> ["KlasseÜ"]
"KlasseÖ" --> ["KlasseOe"]
This is actually fairly easy to implement and does not require indexing changes. All you need to do is implement the ChooseByNameContributor interface and register it as an extension for gotoClassContributor extension point. In your implementation, you can use IDEA's existing indices to find classes with the alternatively spelled names, and to return them from getItemsByName.
You could provide your own instance of com.intellij.navigation.GotoClassContributor and in getItemsByName() look for the different variants of the input name in the PSI class index.
For example, if you extend com.intellij.ide.util.gotoByName.DefaultClassNavigationContributor, you can implement the method like this:
@Override
@NotNull
public NavigationItem[] getItemsByName(String name, final String pattern, Project project, boolean includeNonProjectItems) {
    List<NavigationItem> result = new ArrayList<>();
    Processor<NavigationItem> processor = Processors.cancelableCollectProcessor(result);
    List<String> variants = substituteUmlauts(name);
    for (String variant : variants) {
        processElementsWithName(variant, processor, FindSymbolParameters.wrap(pattern, project, includeNonProjectItems));
    }
    return result.isEmpty() ? NavigationItem.EMPTY_NAVIGATION_ITEM_ARRAY
                            : result.toArray(new NavigationItem[result.size()]);
}
substituteUmlauts() will compute all the different versions of what you type in the search box, and all the results will be aggregated in result.
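substituteUmlauts() itself is left to the reader; the idea is simply to generate every spelling variant of the query. A rough sketch of that logic, written in Python here for brevity (the Java version is a straightforward translation; note this swaps each umlaut/transliteration pair wholesale rather than enumerating mixed spellings within one name):

PAIRS = [("ä", "ae"), ("ö", "oe"), ("ü", "ue"),
         ("Ä", "Ae"), ("Ö", "Oe"), ("Ü", "Ue"), ("ß", "ss")]

def substitute_umlauts(name):
    variants = {name}
    for umlaut, translit in PAIRS:
        next_variants = set()
        for v in variants:
            next_variants.add(v)
            next_variants.add(v.replace(umlaut, translit))
            next_variants.add(v.replace(translit, umlaut))
        variants = next_variants
    return sorted(variants)

print(substitute_umlauts("KlasseAe"))  # ['KlasseAe', 'KlasseÄ']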

Writing a TemplateLanguage/ViewEngine

Aside from getting any real work done, I have an itch. My itch is to write a view engine that closely mimics a template system from another language (Template Toolkit/Perl). This is one of those "if I had time" / "do it to learn something new" kinds of projects.
I've spent time looking at CoCo/R and ANTLR, and honestly, it makes my brain hurt, but some of CoCo/R is sinking in. Unfortunately, most of the examples are about creating a compiler that reads source code, but none seem to cover how to create a processor for templates.
Yes, those are the same thing, but I can't wrap my head around how to define the language for templates where most of the source is the html, rather than actual code being parsed and run.
Are there any good beginner resources out there for this kind of thing? I've taken a gander at Spark, which didn't appear to have the grammar in the repo.
Maybe that is overkill, and one could just test-replace template syntax with c# in the file and compile it. http://msdn.microsoft.com/en-us/magazine/cc136756.aspx#S2
If you were in my shoes and weren't a language creating expert, where would you start?
The Spark grammar is implemented with a kind-of-fluent domain specific language.
It's declared in a few layers. The rules which recognize the html syntax are declared in MarkupGrammar.cs - those are based on grammar rules copied directly from the xml spec.
The markup rules refer to a limited subset of csharp syntax rules declared in CodeGrammar.cs - those are a subset because Spark only needs to recognize enough csharp to adjust single-quotes around strings to double-quotes, match curly braces, etc.
The individual rules themselves are of type ParseAction<TValue> delegate which accept a Position and return a ParseResult. The ParseResult is a simple class which contains the TValue data item parsed by the action and a new Position instance which has been advanced past the content which produced the TValue.
That isn't very useful on its own until you introduce a small number of operators, as described in Parsing expression grammar, which can combine single parse actions to build very detailed and robust expressions about the shape of different syntax constructs.
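To make the combinator idea concrete, here is the shape of it sketched in Python (this is not Spark's actual API; the names and the "${" example are illustrative):

# a parse action takes (text, pos) and returns (value, new_pos), or None on failure
def ch(expected):
    def action(text, pos):
        if pos < len(text) and text[pos] == expected:
            return expected, pos + 1
        return None
    return action

def seq(*actions):
    # succeed only if every action succeeds in order, threading the position through
    def action(text, pos):
        values = []
        for a in actions:
            r = a(text, pos)
            if r is None:
                return None
            value, pos = r
            values.append(value)
        return values, pos
    return action

def alt(*actions):
    # return the result of the first action that succeeds
    def action(text, pos):
        for a in actions:
            r = a(text, pos)
            if r is not None:
                return r
        return None
    return action

dollar_open = seq(ch("$"), ch("{"))
print(dollar_open("${name}", 0))  # (['$', '{'], 2)

Spark's ParseAction<TValue> delegates follow the same position-in, result-out pattern.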
The technique of using a delegate as a parse action came from Luke H's blog post Monadic Parser Combinators using C# 3.0. I also wrote a post about Creating a Domain Specific Language for Parsing.
It's also entirely possible, if you like, to reference the Spark.dll assembly and inherit a class from the base CharGrammar to create an entirely new grammar for a particular syntax. It's probably the quickest way to start experimenting with this technique, and an example of that can be found in CharGrammarTester.cs.
Step 1. Use regular expressions (regexp substitution) to split your input template string into a token list (a Python sketch of this step follows the token list below). For example, split
hel<b>lo[if foo]bar is [bar].[else]baz[end]world</b>!
to
write('hel<b>lo')
if('foo')
write('bar is')
substitute('bar')
write('.')
else()
write('baz')
end()
write('world</b>!')
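A rough Python sketch of this step (the [if ...], [var], [else], [end] directive syntax follows the example above; a real tokenizer would need error handling):

import re

TOKEN_RE = re.compile(r"\[(if\s+(?P<cond>\w+)|else|end|(?P<var>\w+))\]")

def tokenize(template):
    tokens, pos = [], 0
    for m in TOKEN_RE.finditer(template):
        if m.start() > pos:
            tokens.append(("write", template[pos:m.start()]))
        if m.group("cond"):
            tokens.append(("if", m.group("cond")))
        elif m.group("var"):
            tokens.append(("substitute", m.group("var")))
        else:
            tokens.append((m.group(1), None))  # "else" or "end"
        pos = m.end()
    if pos < len(template):
        tokens.append(("write", template[pos:]))
    return tokens

print(tokenize("hel<b>lo[if foo]bar is [bar].[else]baz[end]world</b>!"))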
Step 2. Convert your token list to a syntax tree:
* Sequence
** Write
*** ('hel<b>lo')
** If
*** ('foo')
*** Sequence
**** Write
***** ('bar is')
**** Substitute
***** ('bar')
**** Write
***** ('.')
*** Write
**** ('baz')
** Write
*** ('world</b>!')
class Instruction {
}

class Write : Instruction {
    string text;
}

class Substitute : Instruction {
    string varname;
}

class Sequence : Instruction {
    Instruction[] items;
}

class If : Instruction {
    string condition;
    Instruction then;
    Instruction elseBranch;  // "else" is a reserved word in C#
}
Step 3. Write a recursive function (called the interpreter), which can walk your tree and execute the instructions there.
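A minimal sketch of such an interpreter in Python, modeling the nodes as tuples instead of the classes above: ("write", text), ("substitute", var), ("seq", [children]) and ("if", cond, then_node, else_node).

def run(node, context, out):
    kind = node[0]
    if kind == "write":
        out.append(node[1])
    elif kind == "substitute":
        out.append(str(context.get(node[1], "")))
    elif kind == "seq":
        for child in node[1]:
            run(child, context, out)
    elif kind == "if":
        branch = node[2] if context.get(node[1]) else node[3]
        run(branch, context, out)

tree = ("seq", [
    ("write", "hel<b>lo"),
    ("if", "foo",
        ("seq", [("write", "bar is "), ("substitute", "bar"), ("write", ".")]),
        ("write", "baz")),
    ("write", "world</b>!"),
])

out = []
run(tree, {"foo": True, "bar": 42}, out)
print("".join(out))  # hel<b>lobar is 42.world</b>!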
Another, alternative approach (instead of steps 1--3) if your language supports eval() (such as Perl, Python, Ruby): use a regexp substitution to convert the template to an eval()-able string in the host language, and run eval() to instantiate the template.
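For the eval() route, a toy Python example that handles only [var] substitutions (conditionals would need more work, and eval() should never see untrusted templates):

import re

def to_expr(template):
    # escape quoting, then turn [var] into string concatenation with str(var)
    escaped = template.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + re.sub(r"\[(\w+)\]", r"' + str(\1) + '", escaped) + "'"

expr = to_expr("bar is [bar].")
print(expr)                      # 'bar is ' + str(bar) + '.'
print(eval(expr, {"bar": 42}))   # bar is 42.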
There are sooo many things to do. But it does work for one simple GET statement plus a test. That's a start.
http://github.com/claco/tt.net/
In the end, I already had too much time invested in ANTLR to give loudejs' method a go. I wanted to spend a little more time on the whole process rather than the parser/lexer. Maybe in version 2 I can have a go at the Spark way when my brain understands things a little more.
Vici Parser (formerly known as LazyParser.NET) is an open-source tokenizer/template parser/expression parser which can help you get started.
If it's not what you're looking for, then you may get some ideas by looking at the source code.