How to get noun phrases without "the", "a", or other adjectives? - spacy

spaCy shows "the adipose tissue" and "chronic energy imbalance". However, I'd like each noun phrase to consist of nouns only, without determiners like "the" and "a" or any adjectives. Is there a way to do so?
https://demos.explosion.ai/displacy/?text=Obesity%20or%20excessive%20expansion%20of%20the%20adipose%20tissue%20is%20a%20consequence%20of%20chronic%20energy%20imbalance%20between%20energy%20intake%20and%20dissipation.&model=en_core_web_sm&cpu=1&cph=1
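One way to do this is to filter the tokens inside each noun chunk by their part-of-speech tags. A minimal sketch (the sentence is the one from your example; assumes en_core_web_sm is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Obesity or excessive expansion of the adipose tissue is "
          "a consequence of chronic energy imbalance.")

# Keep only the noun tokens inside each noun chunk, dropping
# determiners ("the", "a"), adjectives, and other modifiers.
for chunk in doc.noun_chunks:
    nouns = [tok.text for tok in chunk if tok.pos_ in ("NOUN", "PROPN")]
    if nouns:
        print(" ".join(nouns))

If you only want the head noun of each chunk, chunk.root.text gives you that directly.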

Related

Does exprtk have a built-in operator to allow a search in a vector of strings

I was reading the exprtk documentation and could not find an operator to check whether a given string is an element of a vector of strings. Does exprtk support this,
e.g. an operator like "A" in ["A", "B", "C"]?

"about" prep doesn't exist in GF

"He was talking about his last night."
In the previous sentence, the preposition is "about", but this preposition exists neither under the prep syntax nor under the English morphology in the RGL.
Is there a reason for this, or does this preposition act differently in different languages?
I don't know of any particular reason. "About" is not among the most common English prepositions like "in", "on", and "for", but there are less frequent prepositions on that list, like "during", so it's not frequency-based. A list like that can never be complete; that's why we have mkPrep in the lexical paradigms for English:
mkPrep : Str -> Prep -- e.g. "in front of"
mkPost : Str -> Prep -- e.g. "ago"
noPrep : Prep -- no preposition
So whenever you want to use a preposition that isn't in the RGL API, just use mkPrep. In this case, mkPrep "about".

Is it correct to say "<language A> is written in <language B>?"

If we have a language, say "Foo", and its (only) interpreter is written in C, we could of course say:
"The Foo interpreter is written in C"
However, would it make sense to shorten this to:
"Foo is written in C"
I found this example on a Wikipedia page discussing a language and I feel like this second sentence is ambiguous and confusing enough to be incorrect.
Thoughts?

Comparison of Lucene Analyzers

Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception and I understand that I can avoid this by using a KeywordAnalyzer but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.
In general, any analyzer in Lucene is a tokenizer + stemmer + stop-words filter.
The tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and takes the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) use spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers, see their Javadocs.
Stemmers are used to get the base form of a word. This depends heavily on the language used. For example, for the previous English phrase something like ["i", "be", "veri", "happi"] will be produced, and for the French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer, initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use an analyzer for one language to stem text in another, rules from the other language will be applied and the stemmer may produce incorrect results. The whole system won't fail, but search results may then be less accurate.
KeywordAnalyzer doesn't use any stemmer; it passes the whole field through unmodified. So if you are going to search for words in English text, it isn't a good idea to use this analyzer.
Stop words are the most frequent and almost useless words. Again, this depends heavily on the language. For English these words are "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower the noise in search results, so our phrase "I am very happy" with StandardAnalyzer will finally be transformed into the list ["veri", "happi"].
And KeywordAnalyzer again does nothing. So KeywordAnalyzer is used for things like IDs or phone numbers, but not for ordinary text.
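To make the pipeline concrete, here is a rough Python sketch of what an analyzer does conceptually. This is not Lucene's API; the toy stemmer and stop-word list are stand-ins for the real implementations:

import re

# Illustrative only: tokenizer -> stop-word filter -> stemmer,
# the three stages described above.
STOP_WORDS = {"a", "the", "i", "am", "be", "have"}

def tokenize(text):
    # StandardAnalyzer-style: lowercase, split on spaces/punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    # Toy stand-in for a real stemmer such as Porter or Snowball.
    for suffix in ("ies", "es", "s", "y"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def analyze(text):
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(analyze("I am very happy"))  # ['ver', 'happ'] with this toy stemmer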
As for your maxClauseCount exception, I believe you get it while searching, most probably because the search query is too complex. Try splitting it into several queries, or use more low-level functions.
From my own experience: I have used StandardAnalyzer and SmartChineseAnalyzer, since I have to search Chinese text. Obviously, SmartChineseAnalyzer is better at handling Chinese. For different purposes, you have to choose the most appropriate analyzer.

How to match against subsets of a search string in SOLR/lucene

I've got an unusual situation. Normally when you search a text index you are searching for a small number of keywords against documents with a larger number of terms.
For example you might search for "quick brown" and expect to match "the quick brown fox jumps over the lazy dog".
I have the situation where I have lots of small phrases in my document store and I wish to match them against a larger query phrase.
For example if I have a query:
"the quick brown fox jumps over the lazy dog"
and the documents
"quick brown"
"fox over"
"lazy dog"
I'd like to find the documents that have a phrase that occurs in the query. In this case "quick brown" and "lazy dog" (but not "fox over" because although the tokens match it's not a phrase in the search string).
Is this sort of query possible with SOLR/lucene?
It sounds like you want to use ShingleFilter in your analysis so that you index word bigrams: add ShingleFilterFactory at both query and index time.
At index time your documents are then indexed like so:
"quick brown" -> quick_brown
"fox over" -> fox_over
"lazy dog" -> lazy_dog
At query time your query becomes:
"the quick brown fox jumps over the lazy dog" -> "the_quick quick_brown brown_fox fox_jumps jumps_over over_the the_lazy lazy_dog"
This is still no good; by default it will form a phrase query.
So, in your query analyzer only, add PositionFilterFactory after the ShingleFilterFactory. This "flattens" the positions in the query so that the query parser treats the output as synonyms, which will yield a BooleanQuery with these sub-clauses (all SHOULD clauses, so it's basically an OR query):
BooleanQuery:
the_quick OR
quick_brown OR
brown_fox OR
...
This should be the most performant way, as it's then really just a BooleanQuery of TermQuery clauses.
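To see the idea outside Solr, here is a small Python sketch of bigram shingling and the flattened OR-style matching described above (illustrative only, not the actual filter classes):

def shingles(text):
    # Word bigrams, joined with "_" the way ShingleFilter joins them.
    words = text.lower().split()
    return {a + "_" + b for a, b in zip(words, words[1:])}

query = shingles("the quick brown fox jumps over the lazy dog")

for doc in ["quick brown", "fox over", "lazy dog"]:
    # A document matches if any of its shingles occurs in the query,
    # i.e. the BooleanQuery of SHOULD clauses shown above.
    print(doc, "->", bool(shingles(doc) & query))
# quick brown -> True
# fox over -> False
# lazy dog -> True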
Sounds like you want the DisMax "minimum match" parameter. I wrote a blog article on the concept a little while ago: http://blog.websolr.com/post/1299174416. There's also the Solr wiki on minimum match.
The "minimum match" concept is applied against all the "optional" terms in your query -- terms that aren't explicitly specified, using +/-, whether they are "+mandatory" or "-prohibited". By default, the minimum match is 100%, meaning that 100% of the optional terms must be present. In other words, all of your terms are considered mandatory.
This is why your longer query isn't currently matching documents containing shorter fragments of that phrase. The other keywords in the longer search phrase are treated as mandatory.
If you drop the minimum match down to 1, then only one of your optional terms will be considered mandatory. In some ways this is the opposite of the default of 100%. It's like your query of quick brown fox… is turned into quick OR brown OR fox OR … and so on.
If you set your minimum match to 2, then your search phrase will get broken up into groups of two terms. A search for quick brown fox turns into (quick brown) OR (brown fox) OR (quick fox) … and so on. (Excuse my pseudo-query there; I trust you see the point.)
The minimum match parameter also supports percentages -- say, 20% -- and some even more complex expressions. So there's a fair amount of tweakability.
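As a rough model of the concept (a Python sketch of the semantics, not Solr's implementation; real scoring is more involved):

def matches(doc_terms, query_terms, mm):
    # A document matches if at least `mm` of the optional query
    # terms appear in it.
    return len(set(doc_terms) & set(query_terms)) >= mm

query = "quick brown fox".split()
doc = "quick brown".split()

print(matches(doc, query, mm=1))  # True: one shared term suffices
print(matches(doc, query, mm=3))  # False: all three terms required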
Only setting the mm parameter will not satisfy your needs, since
"the quick brown fox jumps over the lazy dog"
will match all three documents
"quick brown"
"fox over"
"lazy dog"
and as you said:
I'd like to find the documents that have a phrase that occurs in the query. In this case "quick brown" and "lazy dog" (but not "fox over" because although the tokens match it's not a phrase in the search string).