How to get synonyms for a small sentence using Python/NodeJS/Java? - wordnet

I wish to get synonyms for small sentences. How can I do that?
My Python code using WordNet is like this:
from nltk.corpus import wordnet as wn
print(wn.synsets('work'))
Then I get some synonyms returned, like this:
employment, work, exercise, etc.
But can I get synonyms for some small sentences like "not working" or "not feeling well"?
For example, for "not working" I am expecting synonyms such as faulty, not functioning, etc.
Are there any libraries available to do that? I have tried SimpleNLG, but it doesn't support my case.

WordNet supports compound words, but they tend to be mostly nouns. E.g. the entry for faulty only has "defective" and "imperfect"; it does not have "not working" or "not functioning".
However, you might be able to use antonyms to find them, and then just put "not" before them. This answer https://stackoverflow.com/a/24199627/841830 shows how to use NLTK to get adjectival antonyms from WordNet.
It is a little risky, because you won't know if "not" is commonly used to negate that adjective, and you may end up with some awkward phrasing, e.g. "not nonabsorbent". (Some corpus analysis could help decide the popularity of the compounds you generate from antonyms.)
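For instance, here is a minimal sketch of that antonym trick with NLTK (the helper name negated_synonyms is made up for illustration):

from nltk.corpus import wordnet as wn

def negated_synonyms(word):
    # collect antonyms of the word's adjective senses and prefix them with "not"
    results = set()
    for syn in wn.synsets(word, pos=wn.ADJ):
        for lemma in syn.lemmas():
            for ant in lemma.antonyms():
                results.add("not " + ant.name().replace("_", " "))
    return results

print(negated_synonyms("absorbent"))  # might print something like {'not nonabsorbent'}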


How to filter by tag in Jaeger

When trying to filter by tag, there is a small popup mentioning the logfmt format.
I have been looking around for logfmt, but all I can find is the key=value format.
My questions are:
Is there a way to do something more sophisticated (starts_with, not equal, contains, etc.)?
I am trying to filter by URL using http.url="http://example.com?bla=bla&foo=bar". I am pretty sure the value exists because I am copy/pasting it from my trace, but I get no results. Do I need to escape characters or do something else for this to work?
I did some research around logfmt as well. Based on the documentation of the original implementation and on the Python implementation of the parser (and its tests), I would say that it doesn't support anything more sophisticated (like starts_with, not equal, or contains), because the output of the parser is a simple dictionary (with no regexes involved in the values).
As for the second question, using the same Python parser mentioned above, I was able to double-check that your filter looks fine:
from logfmt import parse_line
parse_line('http.url="http://example.com?bla=bla&foo=bar"')
Output:
{'http.url': 'http://example.com?bla=bla&foo=bar'}
This makes me suspect an issue on the Jaeger side, but this is as far as I could go.
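To illustrate why nothing fancier is possible at the format level: once parsed, a logfmt line is just a plain dict, so any richer matching would have to be implemented by the consumer on top of it (a hypothetical sketch, reusing the parse_line call shown above):

from logfmt import parse_line

parsed = parse_line('http.url="http://example.com?bla=bla&foo=bar"')
# logfmt itself only gives you key/value pairs; something like
# starts_with would have to live in the client code, e.g.:
matches = parsed.get('http.url', '').startswith('http://example.com')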

Filter naming convention

I have a MySQL function that takes a string as input and returns only the alphanumeric characters.
For instance, 101-FFKS 99S-5 would output 101FFKS99S5.
It was called alphanum, which isn't very descriptive.
I was considering something along the lines of alphanum_filter or punctuation_filter.
What is the agreed convention: the thing you're filtering out, or the thing you're left with?
A few Google searches didn't yield anything helpful.
I don't think there's any agreed convention in programming.
Note that there are three options, not two, for the thing to put next to "filter":
what you're filtering out,
what you're left with,
and the combination of both, that is, the substance that gets put into the filter.
Indeed, I think that in English, when you see a word next to "filter", you generally assume that word is what will be put into the filter.
Note that while "air filter", "water filter", "oil filter" and "fuel filter" might seem to refer to what is left after the filtering, I strongly believe there's an implicit "unfiltered" before all of them.
However, there are also nouns that really name "what you're left with", such as "high-pass filter" or "red filter", and ones that name "what you're filtering out", such as "spam filter".
So it will probably be equivocal in any case.
Personally, at first glance I'd assume the word is what is being filtered out, so I'd be more comfortable with punctuation_filter, but that's probably subjective.
So it would be better to find a less ambiguous name (although in some cases it's OK to use something ambiguous and let each programmer understand what it does by looking at its source; what's much more important is to be consistent: you surely don't want, in the same codebase, a punctuation_filter that filters out punctuation and a num_filter that lets only numbers through).
An idea for a clearer name might be LeaveOnlyAlphanums.
Yes, it's an assertion rather than a noun, but the only unambiguous noun phrases you could use are probably FilterThatLeavesOnlyAlphanums or ThingThatLeavesOnlyAlphanums.
An additional idea is to use FilterOut instead of simply Filter. That is unambiguous for sure, although it's an assertion as well.
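To make the contrast concrete, here is a small Python sketch (the function names are only illustrations of the competing styles; the original function is in MySQL):

import re

# ambiguous: does it remove the alphanumerics, or keep them?
def alphanum_filter(s):
    return re.sub(r'[^0-9A-Za-z]', '', s)

# assertion-style, unambiguous about the result
def leave_only_alphanums(s):
    return re.sub(r'[^0-9A-Za-z]', '', s)

# "filter out" makes the direction explicit
def filter_out_punctuation(s):
    return re.sub(r'[^0-9A-Za-z]', '', s)

print(leave_only_alphanums('101-FFKS 99S-5'))  # 101FFKS99S5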
If you're looking for something with some sort of authority in the field of programming, I found an explicit mention that filter is ambiguous in The Art of Readable Code.

How to index only words with a minimum length using Apache Lucene 5.3.1?

Can someone give me a hint on how to index only words with a minimum length using Apache Lucene 5.3.1?
I've searched through the API but didn't find anything that suits my needs, except this, but I couldn't figure out how to use it.
Thanks!
Edit:
I guess this is important info, so here's a copy of my explanation of what I want to achieve from my reply below:
"I don't intend to use queries. I want to create a source-code summarization tool, for which I created a doc-term matrix using Lucene. Right now it also shows single- or double-character words. I want to exclude them so they don't show up in the results, as they have little value for a summary. I know I could filter them when outputting the results, but that's not a clean solution, IMO. An even worse one would be to add all combinations of single- and double-character words to the stoplist. I am hoping there is a more elegant way than one of those."
You should use a custom Analyzer with a LengthFilter, e.g.:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

// keeps only tokens between 4 and 50 characters long
// (note: build() throws IOException)
Analyzer ana = CustomAnalyzer.builder()
    .withTokenizer("standard")
    .addTokenFilter("standard")
    .addTokenFilter("lowercase")
    .addTokenFilter("length", "min", "4", "max", "50")
    .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
    .build();

But it is better to use a stopword list (stopwords are words that occur in almost all documents, like articles in English); this gives a more accurate result.

Lucene.net GetFieldQuery vs TermQuery

Using Lucene's standard analyzer. The title field in question is non-stored and analyzed. The query is as follows:
title:"Some-Url-Friendly-Title"
In Luke, this query gets correctly rewritten as:
title:"some url friendly title" (hyphens replaced by whitespace, everything lowercased).
I thought the Lucene.net version would be:
new TermQuery(new Term("title","Some-Url-Friendly-Title"))
However, no results are returned.
Then I tried:
_parser.GetFieldQuery("title","Some-Url-Friendly-Title")
And it worked as expected!
Both queries were executed via:
_searcher.Search([query object], [sort object])
Can somebody point me in the right direction to see what the differences between TermQuery and _parser.GetFieldQuery() are?
A TermQuery is much simpler than running a query through a query parser. The term is not lowercased and hyphenated terms are not broken up; in fact, the text isn't even tokenized. It just searches for the exact term you tell it to look for. That means it is looking for "Some-Url-Friendly-Title" as a single untokenized keyword in your index. Since you are using an analyzer, chances are no such token exists.
To take it a step further, if you had searched for "Some Url Friendly Title" as the term text, you still wouldn't come up with anything, since it would look for "Some Url Friendly Title" as a single token, not as the four tokens (or rather, terms) in your index.
If you look at what the standard query parser generates when you parse your query, you'll see that TermQueries are only one of the building blocks it uses to assemble the complete query, along with BooleanQuery and possibly PhraseQuery, PrefixQuery, etc.
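For illustration, here is a hedged side-by-side sketch using PyLucene rather than Lucene.Net (assuming a Lucene 5.x-style API; the class names mirror the Java and Lucene.Net ones):

import lucene
lucene.initVM()

from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import Term
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import TermQuery

# TermQuery: matches only this exact, untokenized term
raw = TermQuery(Term("title", "Some-Url-Friendly-Title"))

# QueryParser: runs the text through the analyzer first
parsed = QueryParser("title", StandardAnalyzer()).parse('"Some-Url-Friendly-Title"')

print(raw)     # roughly: title:Some-Url-Friendly-Title
print(parsed)  # roughly: title:"some url friendly title"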
In Lucene.Net version 3.0.3, GetFieldQuery is inaccessible due to its protection modifier. Use
MultiFieldQueryParser.Parse(searchText, field)
instead.

How is exact phrase search performed by a search engine?

I am using Lucene to search in a data set, and I need to know how "" search (I mean exact phrase search) has been implemented.
I want it to return all "little cat" hits when the user enters "littlecat". I know that I should manipulate the indexing code, but at least I should know how the "" search works.
I want it to return all "little cat" hits when the user enters "littlecat"
This might sound easy, but it is very tough to implement. For a human being, little and cat are two different words, but a computer cannot tell little and cat apart within littlecat, unless you have a dictionary and your code checks those two words against it. On the other hand, searching for "little cat" can easily match "littlecat" as well. And I believe this goes beyond the concept of an exact phrase search: an exact phrase search will only return littlecat if you search for "littlecat", and vice versa. Even Google, seemingly (and expectedly), doesn't return "little cat" results for a littlecat search.
A way to implement this is dynamic programming: use a dictionary/corpus to compare your candidate words against (and also the leftover words after you have split the text into substrings).
Think of it as if you were writing a custom spell-checker or something similar. There is also the scenario where more than one combination of words may be left over, e.g. "walkingmydoginrain": here you could break off the first word as "walk" or as "walking", and this is the beauty of DP. Since you know (from your corpus) that you can't form legitimate words from "ingmydoginrain" (the rest of the string), you have just discovered that, in this context, you should pick the segmented word "walking" and NOT "walk".
Also, think of every failure to find a match as adding to a COST function that you define, so that you get optimal results: you can be sure that your text (unseparated by white space) will be broken into legitimate words, though there may be more than one possible word sequence for that line (and hence, possibly, more than one intent of the person searching).
You should be able to find pretty good base implementations on the web for your use case (read also: How does Google implement "Did you mean?").
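In the meantime, here is a minimal Python sketch of the DP idea (the tiny dictionary and the function name segment are purely illustrative):

def segment(text, words):
    # best[i] holds one dictionary-only segmentation of text[:i]
    best = {0: []}
    for i in range(1, len(text) + 1):
        for j in range(i):
            if j in best and text[j:i] in words:
                best[i] = best[j] + [text[j:i]]
                break
    return best.get(len(text))  # None if no full segmentation exists

print(segment("littlecat", {"little", "cat"}))
# ['little', 'cat']
print(segment("walkingmydoginrain", {"walk", "walking", "my", "dog", "in", "rain"}))
# ['walking', 'my', 'dog', 'in', 'rain'] ("walk" dead-ends, exactly as described above)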
For now, see also -
How to split text without spaces into list of words?