Mixing data(types) and elements in Relax NG

I am trying to define a Relax NG compact syntax model for the following content:
<coordinates class="blue">-132.976733 56.437924 <span class="red">-132.735747 56.459832 -132.631685 56.421493 -132.664547 56.273616</span> -132.878148 56.240754 -133.069841 56.333862 -132.976733 56.437924</coordinates>
Here the element can wrap one or more coordinate pairs, and it can contain nested elements with the same content model.
Is this possible? I read here that it may not be possible using, for example, xsd:double.

I believe this question is the same as this one, so I'm going to guess the answer is similar: "No, use the mixed pattern, then enforce the datatype constraints on the tokens with Schematron".
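For what it's worth, a minimal compact-syntax sketch along those lines (element and attribute names taken from the sample; a mixed pattern only allows plain text, so the numeric checks on the tokens would then live in Schematron):
start = coordinates
coordinates =
  element coordinates {
    attribute class { text }?,
    mixed { span* }
  }
span =
  element span {
    attribute class { text }?,
    mixed { span* }
  }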

Force 'parser' to not segment sentences?

Is there an easy way to tell the "parser" pipe not to change the value of Token.is_sent_start?
So, here is the story:
I am working with documents that are pre-sentencized (1 line = 1 sentence), and this segmentation is all I need. I realized the parser's segmentation is not always the same as the one in my documents, so I don't want to rely on it.
I can't change the segmentation after the parser has done it, so I cannot correct it when it makes mistakes (you get an error). And if I segment the text myself and then apply the parser, it overrules the segmentation I've just made, so it doesn't work.
So, to keep the original segmentation and still use a pretrained transformer model (fr_dep_news_trf), I either:
disable the parser,
add a custom Pipe to nlp to set Token.is_sent_start how I want,
and create the Doc with nlp("an example");
or I simply create a Doc directly with
doc = Doc(nlp.vocab, words=["an", "example"], sent_starts=[True, False])
and then apply every element of the pipeline except the parser.
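In code, that second route looks roughly like this:
import spacy
from spacy.tokens import Doc

nlp = spacy.load("fr_dep_news_trf")
# build the Doc with the segmentation fixed up front
doc = Doc(nlp.vocab, words=["an", "example"], sent_starts=[True, False])
# run every pipeline component except the parser
for name, pipe in nlp.pipeline:
    if name != "parser":
        doc = pipe(doc)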
However, I still need the parser at some point (because I need to know some subtrees), and if I simply apply it to my Doc, it overrules the segmentation already in place, so in some cases the segmentation is incorrect. So I use the following workaround (sketched in code below):
Keep the correct segmentation in a list sentences = list(doc.sents)
Apply the parser on the doc
Work with whatever syntactic information the parser computed
Retrieve whatever sentential information I need from the list I previously made, as I now cannot trust Token.is_sent_start.
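# 1. snapshot the trusted segmentation (the Span offsets stay valid after re-parsing)
sentences = list(doc.sents)
# 2. apply the parser; it may overwrite Token.is_sent_start
doc = nlp.get_pipe("parser")(doc)
# 3. use the syntactic annotations from the parse...
# 4. ...but read sentence boundaries from the snapshot, not from doc.sents
for sent in sentences:
    print(sent.start, sent.end, sent.text)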
It works, but it doesn't really feel right imho; it feels a bit messy. Is there an easier, cleaner way I missed?
Something else I am considering is setting a custom extension, so that I would, for instance, use Token._.is_sent_start instead of the default Token.is_sent_start, along with a custom Doc._.sents, but I fear it might be more confusing than helpful...
A user suggested span.merge() for a pretty similar topic, but that function doesn't seem to exist in recent releases of spaCy (Preventing spaCy splitting paragraph numbers into sentences).
The parser is supposed to respect sentence boundaries if they are set in advance. There is one outstanding bug where this doesn't happen, but that was only in the case where some tokens had their sentence boundaries left unset.
If you set all the token boundaries to True or False (not None) and then run the parser, does it overwrite your values? If so it'd be great to have a specific example of that, because that sounds like a bug.
Given that, if you use a custom component to set your true sentence boundaries before the parser, it should work.
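For example, something like this (the component name is made up, and it assumes newline tokens mark your line breaks):
import spacy
from spacy.language import Language

@Language.component("line_sentencizer")
def line_sentencizer(doc):
    for token in doc:
        # one line = one sentence: a token starts a sentence iff it is
        # the first token or directly follows a newline token
        token.is_sent_start = token.i == 0 or doc[token.i - 1].text == "\n"
    return doc

nlp = spacy.load("fr_dep_news_trf")
nlp.add_pipe("line_sentencizer", before="parser")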
Regarding some of your other points...
I don't think it makes any sense to keep your sentence boundaries separate from the parser's: if you do that, you can end up with subtrees that span multiple sentences, which will just be weird and unhelpful.
You didn't mention this in your question, but is treating each sentence/line as a separate doc an option? (It's not clear if you're combining multiple lines and the sentence boundaries are wrong, or if you're passing in a single line but it's turning into multiple sentences.)
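If treating each line as its own doc is workable, it is about as simple as it gets; a minimal sketch (lines stands for your list of one-sentence strings, the model name is from the question):
import spacy

nlp = spacy.load("fr_dep_news_trf")
lines = ["Une phrase par ligne.", "Encore une phrase."]  # hypothetical input
for doc in nlp.pipe(lines):
    # each line is its own Doc, so the parser cannot merge lines together
    for sent in doc.sents:
        print(sent.root.text, [t.text for t in sent.root.subtree])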

How to filter by tag in Jaeger

When trying to filter by tag, there is a small popup saying the values should be in the logfmt format.
I have been looking around for logfmt, but all I can find is the key=value format.
My questions are:
Is there a way to do something more sophisticated (starts_with, not equal, contains, etc.)?
I am trying to filter by URL using http.url="http://example.com?bla=bla&foo=bar". I am pretty sure the value exists because I am copy/pasting it from my trace, yet I get no results. Do I need to escape characters or do something else for this to work?
I did some research on logfmt as well. Based on the documentation of the original implementation and on the Python implementation of the parser (and its tests), I would say that it doesn't support anything more sophisticated (like starts_with, not equal, contains), because the output of the parser is a simple dictionary, with no regexes involved in the values.
As for the second question, using the same Python parser, I was able to double-check that your filter parses fine:
from logfmt import parse_line
parse_line('http.url="http://example.com?bla=bla&foo=bar"')
Output:
{'http.url': 'http://example.com?bla=bla&foo=bar'}
This makes me suspect an issue on the Jaeger side, but this is as far as I could go.

Is it acceptable to use `to` to create a `Pair`?

to is an infix function within the standard library. It can be used to create Pairs concisely:
0 to "hero"
in comparison with:
Pair(0, "hero")
Typically, it is used to initialize Maps concisely:
mapOf(0 to "hero", 1 to "one", 2 to "two")
However, there are other situations in which one needs to create a Pair. For instance:
"to be or not" to "be"
(0..10).map { it to it * it }
Is it acceptable, stylistically, to (ab)use to in this manner?
Just because a language feature is provided does not mean it is always the better choice. A Pair can be used instead of to and vice versa. The real issue is whether your code remains simple: would a reader need the previous story to understand the current one? Your last map example gives no hint of what it is doing. Imagine someone reading { it to it * it }; they would most likely be confused. I would say this is an abuse.
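For contrast, a quick sketch of the same computation with the relationship named (Square is an invented name):
// relying on `to` leaves the reader to infer what each side means:
val pairs = (0..10).map { it to it * it }

// naming the parts makes the intent explicit at the cost of a little ceremony:
data class Square(val n: Int, val square: Int)
val squares = (0..10).map { Square(it, it * it) }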
The to infix offers nice syntactic sugar; IMHO it should be used in conjunction with a nicely named variable that tells the reader what this something-to-something is. For example:
val heroPair = Ironman to Spiderman //including a 'pair' in the variable name tells the story what 'to' is doing.
Or you could use scoping functions
(Ironman to Spiderman).let { heroPair -> }
I don't think there's an authoritative answer to this.  The only examples in the Kotlin docs are for creating simple constant maps with mapOf(), but there's no hint that to shouldn't be used elsewhere.
So it'll come down to a matter of personal taste…
For me, I'd be happy to use it anywhere it represents a mapping of some kind, so in a map{…} expression would seem clear to me, just as much as in a mapOf(…) list.  Though (as mentioned elsewhere) it's not often used in complex expressions, so I might use parentheses to keep the precedence clear, and/or simplify the expression so they're not needed.
Where it doesn't indicate a mapping, I'd be much more hesitant to use it.  For example, if you have a method that returns two values, it'd probably be clearer to use an explicit Pair.  (Though in that case, it'd be clearer still to define a simple data class for the return value.)
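To illustrate that last point, a hedged sketch (all names invented):
// with a Pair, every call site has to remember what first and second mean:
fun boundsAsPair(xs: List<Int>): Pair<Int, Int> =
    xs.minOrNull()!! to xs.maxOrNull()!!

// a small data class keeps the meaning attached to the values:
data class Bounds(val min: Int, val max: Int)
fun bounds(xs: List<Int>): Bounds =
    Bounds(xs.minOrNull()!!, xs.maxOrNull()!!)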
You asked for personal perspectives, so here is mine.
I find this syntax a huge win for simple code, especially when reading it. Code with lots of parentheses causes mental strain; imagine having to review/read a thousand lines of code a day ;(

conditional component declaration and a following if equation

I am trying to build a model that will have slightly different equations based on whether or not certain components exist (in my case, fluid ports).
A code like the following will not work:
parameter Boolean use_component=false;
Component component if use_component;
equation
if use_component then
component.x = 0;
end if;
How can I work around this?
If you want to use conditional components, there are some restrictions you need to be aware of. Section 4.4.5 of the Modelica 3.3 specification sums it up nicely: "If the condition is false, the component, its modifiers, and any connect equations involving the component, are removed". I'll show you how to use this to solve your problem in just a second, but first I want to explain why your solution doesn't work.
The issue has to do with checking the model. In your case, it is obvious that the equation for component.x and the component component either both exist or neither exists, because you have tied them to the same Boolean variable. But what if you had done this:
parameter Real some_number;
Component component if some_number*some_number>4.0;
equation
if some_number < -2 or some_number > 2 then
component.x = 0;
end if;
We can see that this is logically identical to your case: there is no chance for the equation on component.x to be active when component is absent. But can we prove such things in general? No.
So, when conditional components were introduced, conservative semantics were implemented which can always trivially ensure that the sets of variables and equations involved never get "out of sync".
Let us return to what the specification says: "If the condition is false, the component, its modifiers, and any connect equations involving the component, are removed".
For your case, the solution could potentially be quite simple. Depending on how you declare "x", you could just add a modification to component, i.e.
parameter Boolean use_component=false;
Component component(x=0) if use_component;
The elegance of this is that the modification only applies to component and if component isn't present, neither is the modification (equation). So the variable x and its associated equation are "in sync". But this doesn't work for all cases (IIRC, x has to have an input qualifier for this to work...maybe that is possible in your case?).
There are two remaining alternatives. The first is to put the equation for component.x inside component itself. The second is to introduce a connector on component that, if connected, will generate the equation you want. As with the modification case (this is not a coincidence), you could associate x with an input connector of some kind and then do this:
parameter Boolean use_component;
Component component if use_component;
Constant zero(k=0);
equation
connect(zero.y, component.x);
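For concreteness, here is a fuller sketch of that variant; it assumes x is an input connector (say, a Modelica.Blocks.Interfaces.RealInput) on Component, and the wrapper name is invented:
model Wrapper "Sketch of the connect-based approach"
  parameter Boolean use_component = false;
  Component component if use_component;
  // conditional as well, so the source disappears together with the connect equation
  Modelica.Blocks.Sources.Constant zero(k=0) if use_component;
equation
  // per section 4.4.5, this connect equation is removed when use_component = false
  connect(zero.y, component.x);
end Wrapper;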
Now, I could imagine that after considering all three cases (modification, internalize equation and use connect), you come to the conclusion that none of them will work. If this is the case, then I would humbly suggest that you have an issue with how you have designed the component. The reason these restrictions arise is related to the necessity to check components by themselves for correctness. This requires that the component be complete ("balanced" in the terminology of the specification).
If you cannot solve the problem with approaches I mentioned above, then I suspect you really have a balancing issue and that you probably need to redefine the boundaries of your component somehow. If this is the case, I would suggest you open another question here with details of what you are trying to do.
I think the reason this does not work is that the parser will look for the declaration of the variable component.x, which, if the component is not active, does not exist. It does not work even if you add Evaluate=true in the annotation.
The cleanest solution, in my opinion, is to work at the equation level and enable different sets of equations in the same block. You can create a wrapper model with the correct connectors and parameters; then, if it is a causal model, you can use replaceable classes in order to parameterize the models as functions, or else, in the case of acausal models, put the equations inside if statements.
Another possible workaround is to place two different models inside one block, so you can use their variables in the equation section, and then build conditional connections that enable the block with the chosen behaviour. In other words, you can build a "wrap model" with two blocks inside, and then place the connection equations to the connectors of the wrap model inside if statements. Remember to build the model so that there is a consistent system of equations even for the blocks that are not used.
But this is not the best solution, because if the blocks are big you will have to wait longer for compilation, since everything gets compiled.
I hope this will help,
Marco
You can also make a dummy component that is not visible in the graphical layer:
connector DummyHeatPort
"Dummy heatport to facilitate optional heatport. Use this with a conditional heatport by connecting it to the heatport. Then use the -DummyHeatPort.Q_flow in the thermal energy balance."
Modelica.SIunits.Temperature T "Port temperature";
flow Modelica.SIunits.HeatFlowRate Q_flow
"Heat flow rate (positive if flowing from outside into the component)";
end DummyHeatPort;
Then, when this gets used in a two-port model:
Modelica.Thermal.HeatTransfer.Interfaces.HeatPort_a heatport if use_heat_port;
DummyHeatPort dummy_heatport;
...
equation
flowport_a.H_flow + flowport_b.H_flow - dummy_heatport.Q_flow = storage "thermal energy balance";
connect(dummy_heatport, heatport);
This way the heatport gets used if present but does not cause an error otherwise.

TSearch2 - dots explosion

The following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why doesn't the TSearch2 engine return something like this?
'google':2, 'com':1
Or how can I make the engine return the exploded string as I wrote above?
I just need "Google.com" to be findable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See the postgres documentation for details. E.g. something like:
ALTER TEXT SEARCH CONFIGURATION your_parser_config
  DROP MAPPING FOR url, url_path;
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
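As a minimal illustration of option 3 (hedged: the exact lexemes depend on your text search configuration and stemmer):
SELECT to_tsvector('english', replace('Google.com', '.', ' '));
-- e.g. 'com':2 'googl':1, which a query for 'google' then matches after stemming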
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier and quicker to write than a new parser. You still have to write C though :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1 for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (i.e. www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries, which would solve this problem.
First off, in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).