How to filter by tag in Jaeger

When trying to filter by tag, there is a small popup:
I have been looking around for logfmt, but all I can find is the key=value format.
My questions are:
Is there a way to do something more sophisticated (starts_with, not equal, contains, etc.)?
I am trying to filter by URL using http.url="http://example.com?bla=bla&foo=bar". I am pretty sure the value exists because I am copy/pasting it from my trace, yet I am getting no results. Do I need to escape characters or do something else for this to work?

I did some research around logfmt as well. Based on the documentation of the original implementation and on the Python implementation of the parser (and its tests), I would say that it doesn't support anything more sophisticated (like starts_with, not equal, or contains). This is because the output of the parser is a simple dictionary (with no regex involved in the values).
As for the second question, using the same Python parser mentioned above, I was able to double-check that your filter looks fine:
from logfmt import parse_line
parse_line('http.url="http://example.com?bla=bla&foo=bar"')
Output:
{'http.url': 'http://example.com?bla=bla&foo=bar'}
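For completeness, the same parse_line also accepts several space-separated tags, which is roughly what the Jaeger search box expects (a quick sketch; the extra http.method tag is just an example):
from logfmt import parse_line

# Two space-separated key=value pairs; the quoted one contains '=' and '&'.
query = 'http.url="http://example.com?bla=bla&foo=bar" http.method=GET'
print(parse_line(query))
# Expected: {'http.url': 'http://example.com?bla=bla&foo=bar', 'http.method': 'GET'}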
This makes me suspect an issue on the Jaeger side, but this is as far as I could go.


regex limitations within live template / can use live-template functions as replacement?

I'm removing the builder pattern in multiple places. The following example would help me with the task, but mainly I'd like to learn how to use live templates more.
Preexisting code:
Something s = Something.builder()
.a(...)
.b(bbb)
.build();
and I'd like to rewrite it to:
Something s = new Something();
s.setA(...);
s.setB(bbb);
Part of it can be done trivially with an IntelliJ regex replace, pattern \.(.*)$ and replacement .set\u$1. Well, it could be improved, but let's keep it simple.
I can create a surround live template using the variable:
regularExpression(SELECTION, "\\.(.*)", "\\u$1")
but \\u will be evaluated as u.
question 1: is it possible to get into here somehow \u functionality?
But I might get around it differently, so why not try to use this live template variable:
regularExpression(SELECTION, "\\.(.)(.*)", concat(capitalize($1), "$2"))
but this does not seem to work either: .abc is replaced with bc.
question 2: why? What would the correct template look like? And, even if it worked, it would probably behave incorrectly for multiline input. How can I make it work, also for multiline inputs?
Sorry for the questions; I didn't find any examples of live templates beyond trivial replacements.
No, there is no \u functionality in the regularExpression() Live Template macro. It is just a way to call String.replaceAll(), which doesn't support \u.
You can create a Live Template like this:
set$VAR$
And set the following expression for the $VAR$ variable:
capitalize(regularExpression(SELECTION, "\\.(.*)", "$1"))
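For comparison, outside of the live-template macros, the usual way to upper-case a captured group is to compute the replacement with a function rather than a \u escape; here is a minimal sketch of the same rewrite in plain Python regex, purely to illustrate the idea (it is not something regularExpression() supports):
import re

line = ".a(...)"
# Turn ".x(...)" into ".setX(...)" by upper-casing the first captured letter.
result = re.sub(r"\.(\w)(\w*)", lambda m: ".set" + m.group(1).upper() + m.group(2), line)
print(result)  # .setA(...)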

Force 'parser' to not segment sentences?

Is there an easy way to tell the "parser" pipe not to change the value of Token.is_sent_start?
So, here is the story:
I am working with documents that are pre-sentencized (1 line = 1 sentence), and this segmentation is all I need. I realized the parser's segmentation is not always the same as the one in my documents, so I don't want to rely on it.
I can't change the segmentation after the parser has run, so I cannot correct its mistakes (you get an error). And if I segment the text myself and then apply the parser, it overrules the segmentation I've just made, so that doesn't work.
So, to keep the original segmentation and still use a pretrained transformer model (fr_dep_news_trf), I either:
disable the parser, add a custom Pipe to nlp that sets Token.is_sent_start the way I want, and create the Doc with nlp("an example"),
or I simply create a Doc directly with
doc = Doc(nlp.vocab, words=["an", "example"], sent_starts=[True, False])
and then apply every element of the pipeline except the parser.
However, I do still need the parser at some point (because I need some subtrees), and if I simply apply it to my Doc, it overrules the segmentation already in place, so in some cases the segmentation is incorrect. So I use the following workaround:
Keep the correct segmentation in a list sentences = list(doc.sents)
Apply the parser on the doc
Work with whatever syntactic information the parser computed
Retrieve whatever sentential information I need from the list I previously made, as I can no longer trust Token.is_sent_start.
It works, but it doesn't really feel right imho; it feels a bit messy. Is there an easier, cleaner way I missed?
Something else I am considering is setting a custom extension, so that I would, for instance, use Token._.is_sent_start instead of the default Token.is_sent_start, and a custom Doc._.sents, but I fear it might be more confusing than helpful ...
A user suggested using span.merge() for a pretty similar topic, but that function doesn't seem to exist in recent releases of spaCy (Preventing spaCy splitting paragraph numbers into sentences).
The parser is supposed to respect sentence boundaries if they are set in advance. There is one outstanding bug where this doesn't happen, but that was only in the case where some tokens had their sentence boundaries left unset.
If you set all the token boundaries to True or False (not None) and then run the parser, does it overwrite your values? If so it'd be great to have a specific example of that, because that sounds like a bug.
Given that, if you use a custom component to set your true sentence boundaries before the parser, it should work.
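A minimal sketch of that approach, assuming spaCy 3.x and the fr_dep_news_trf pipeline from the question (the component name line_boundaries and the newline-based rule are just illustrative):
import spacy
from spacy.language import Language

@Language.component("line_boundaries")
def line_boundaries(doc):
    # Set every token explicitly to True or False (never None), so the
    # parser has nothing left to decide about sentence starts.
    for token in doc:
        token.is_sent_start = token.i == 0 or "\n" in doc[token.i - 1].text
    return doc

nlp = spacy.load("fr_dep_news_trf")
# Register the component before the parser so the boundaries are already
# fixed by the time the parser runs.
nlp.add_pipe("line_boundaries", before="parser")

doc = nlp("Première phrase.\nDeuxième phrase.")
print([sent.text for sent in doc.sents])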
Regarding some of your other points...
I don't think it makes any sense to keep your sentence boundaries separate from the parser's - if you do that you can end up with subtrees that span multiple sentences, which will just be weird and unhelpful.
You didn't mention this in your question, but is treating each sentence/line as a separate doc an option? (It's not clear if you're combining multiple lines and the sentence boundaries are wrong, or if you're passing in a single line but it's turning into multiple sentences.)

Difference between pandas methods and DataFrame methods, and how to distinguish between them

I have been confused about these for a while, and I would like to see if there is a way to distinguish between them in a practical and fast way.
Assuming df is a pandas DataFrame object, please see below.
While using pandas, this is what I noticed: to perform some operations, you have to use pd.method(df, *args); to perform some others, you need to use df.method(*args). Interestingly, there are some methods that work either way...
Let's clarify this a bit more with some examples (a small runnable sketch follows the list): while it totally makes sense to me to use pd.read_csv(), not df.read_csv, since there is no df created yet, I have a hard time making sense of the following:
1- correct: pd.get_dummies(df, *args) --- incorrect: df.get_dummies(*args)
2- correct: df.groupby(*args) --- incorrect: pd.groupby(df,*args)
3- correct: df.isnull() AND pd.isnull(df)
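Here is a small runnable sketch of the three cases above (a toy df, purely to illustrate; assuming a reasonably recent pandas):
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue"], "value": [1, None]})

# 1- module-level only: pd.get_dummies exists, DataFrame has no such method
print(pd.get_dummies(df["color"]))
print(hasattr(df, "get_dummies"))   # False

# 2- DataFrame-level only: df.groupby exists, pd.groupby does not
print(df.groupby("color").size())
print(hasattr(pd, "groupby"))       # False in current pandas

# 3- available both ways
print(df.isnull())
print(pd.isnull(df))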
I am pretty sure you can come up with many other examples like these. I personally find it challenging to keep in mind which one is which, and I found myself wasting a lot of time in the code development/analysis cycle trying to guess whether I should use pd.method(df) or df.method() for different things.
My main question is: how do you handle this? Did you also find this challenging? Is there any way to quickly tell which one to use? Am I missing something here?
Thanks

Doing multiple gsub in a Rails 3 app on an email body template

I have an email template where the user can enter text like this:
Hello {first_name}, how are you?
and when the email actually gets sent it replaces the placeholder text {first_name} with the actual value.
There will be several of these placeholders, though, and I wasn't sure that gsub is meant to be used like this:
body = @email.body.gsub("{first_name}", @person.first_name).gsub("{last_name}", @person.last_name).gsub("...", ...).gsub("...", ...).gsub("...", ...) # etc.
Is there a cleaner solution to achieving this functionality? Also, if anyone's done something similar to this, did they find that they eventually hit a point where using multiple gsubs on a few paragraphs for hundreds of emails was just too slow?
EDIT
I ran some tests comparing multiple gsubs vs. using a regex, and it came out that the regex was usually 3x FASTER than using multiple gsubs. However, I think the regex code is a little harder to read as-is, so I'm going to have to clean it up a bit, but it does indeed seem that using a regex is significantly faster than multiple gsubs. Since my use case involves multiple substitutions over a large number of documents, the faster solution is better for me, even though I'll have to add a little more documentation.
You put all the strings you want to catch in the regular expression, and in the hash you give the replacement for each catch:
"123456789".gsub(/123|456/, "123" => "ABC", "456" => "DEF")
This form of gsub (with a hash of replacements) only works in Ruby 1.9 and later.
If you can use a template library like erb or haml, they are the proper tool for this kind of task.

TSearch2 - dots explosion

The following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why doesn't the TSearch2 engine return something like this?
'google':2, 'com':1
Or how can I make the engine return the exploded string as I wrote above?
I just need "Google.com" to be findable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See the Postgres documentation for details, e.g. something like:
ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path;
You can write your own custom dictionary.
You can pre-parse your data in some manner before it's inserted into the database (maybe splitting all domains before they go in; a rough sketch of that idea follows this list).
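A minimal sketch of that pre-parsing idea (option 3), done application-side in Python under the assumption that you control the insert path; the function name explode_domain is made up here:
import re

def explode_domain(value):
    # Split "www.facebook.com" into its parts and keep the original string,
    # so both "facebook" and the full host can match at query time.
    parts = [p for p in re.split(r"\W+", value) if p]
    return " ".join([value] + parts)

print(explode_domain("Google.com"))
# Google.com Google com
The exploded text is then what you would feed to to_tsvector instead of the raw value.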
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1 for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off, in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).