Using wildcard and required operator in an Elasticsearch search - lucene

We have various rows inside our Elasticsearch index that contain the text
"... 2% milk ...".
A user enters a query like "2% milk" into a search field and we transform it internally into the query
title:(+milk* +2%*)
because all terms should be required and we may also be interested in rows that contain "2% milkfat".
The query above returns zero hits. Changing the query to
title:(+milk* +2%)
returns reasonable results. So why does the '*' operator in the first query not work?

Unless you set a mapping, the "%" sign will get removed in the tokenization process. Basically "2% milk" will get turned into the tokens 2 and milk.
When you search for "2%*", it looks for tokens like 2%, 2%a, 2%b, etc. and does not match any indexed token, giving no hits.
When you search for "2%", it will go through the same tokenization process as at index-time (you can specify this, but the default tokenization is the same) and you will be looking for documents matching the token 2, which will give you a hit.
You can read more about the analysis/tokenization process in the Elasticsearch documentation, and you can set up the analysis you want by defining a custom mapping.
Good luck!

Prefix and Wildcard queries do not appear to apply the Analyzer to their content. To provide a few examples:
title:(+milk* +2%) --> +title:milk* +title:2
title:(+milk* +2%*) --> +title:milk* +title:2%*
title:(+milk* +2%3) --> +title:milk* +(title:2 title:3)
title:(+milk* +2%3*) --> +title:milk* +title:2%3*
+title:super\\-milk --> +title:super title:milk
+title:super\\-milk* --> +title:super-milk*
It does make some sense to prevent tokenization of wildcard queries, since wildcard phrase queries are not allowed. If tokenization were allowed, it would raise the question, especially with embedded wildcards, of just how many terms that wildcard can span.
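The behaviour described in both answers can be imitated in a few lines of Python. This is only a sketch: `analyze` is a crude stand-in for the default analyzer (an assumption, not Elasticsearch's actual implementation), and `matches` is a hypothetical helper that compares wildcard terms raw while analyzing ordinary terms first.

```python
import re

def analyze(text):
    """Crude stand-in for a default analyzer: lowercase, keep runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(query_term, indexed_tokens):
    """Wildcard terms (ending in '*') are compared as typed; other terms are analyzed first."""
    if query_term.endswith("*"):
        prefix = query_term[:-1]  # not analyzed, so '%' stays in the prefix
        return any(tok.startswith(prefix) for tok in indexed_tokens)
    # An analyzed term may yield several tokens; any of them may match (as in 2%3 above).
    return any(tok in indexed_tokens for tok in analyze(query_term))

indexed = analyze("Organic 2% milk")  # tokens: organic, 2, milk

print(matches("milk*", indexed))  # True
print(matches("2%", indexed))     # True: '2%' is analyzed to '2'
print(matches("2%*", indexed))    # False: no indexed token starts with '2%'
```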

user wants to apply a quite complex "User Search Filter" in his LDAP Configuration

A user has to apply a quite complex "User Search Filter" in his LDAP configuration.
The filter is too big and exceeds the 256 allowed characters. Because of the customer's business policy, it is not possible to modify the LDAP structure or data. How can we proceed?
Here is a sample of the filter:
(&
(|
(memberOf=CN=Applicazione_DocB_AmmApplicativo,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
(memberOf=CN=Applicazione_DocB_AmmPiattaforma,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
(memberOf=CN=Applicazione_DocB_ArchFIRead,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
(memberOf=CN=Applicazione_DocB_ArchFIWrite,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
(memberOf=CN=Applicazione_DocB_AreaFinanza,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
(memberOf=CN=Applicazione_DocB_Arm,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
(memberOf=CN=Applicazione_DocB_BoGestCanc,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
(memberOf=CN=Applicazione_DocB_BoUpdDocum,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
(memberOf=CN=Applicazione_DocB_Crif,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
(memberOf=CN=Applicazione_DocB_VisualBase,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
(memberOf=CN=Applicazione_DocB_VisualEsteso,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
)(|
(userAccountControl=512)
(userAccountControl=544)
(userAccountControl=66048)
)
)
Have the customer create one single group to control access to the application, then they can add all of those groups to that one group. Then you only need to look at that one group. However, you will need to use the LDAP_MATCHING_RULE_IN_CHAIN operator so that it will look at the members of nested groups.
If the name of that new group is Applicazione_DocB, that would look something like this:
(memberOf:1.2.840.113556.1.4.1941:=CN=Applicazione_DocB,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)
Your conditions on userAccountControl can also be simplified. That attribute is a bit flag, which means that each bit in the binary value is a flag that means something. Those values are listed in the documentation for userAccountControl. The three conditions you are using are:
512: ADS_UF_NORMAL_ACCOUNT
544: ADS_UF_NORMAL_ACCOUNT | ADS_UF_PASSWD_NOTREQD (password not required)
66048: ADS_UF_NORMAL_ACCOUNT | ADS_UF_DONT_EXPIRE_PASSWD (password does not expire)
If the intent is to exclude disabled accounts (514: ADS_UF_NORMAL_ACCOUNT | ADS_UF_ACCOUNTDISABLE), then you can do that by using the LDAP_MATCHING_RULE_BIT_AND operator to check if the second bit is not set (which indicates a disabled account), like this:
(!(userAccountControl:1.2.840.113556.1.4.803:=2))
Putting that all together, you get a query that is less than 256 characters:
(&(memberOf:1.2.840.113556.1.4.1941:=CN=Applicazione_DocB,OU=Intranet,OU=Gruppi,DC=CBMAIN,DC=CBDOM,DC=IT)(!(userAccountControl:1.2.840.113556.1.4.803:=2)))
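To see why a single bit test can replace the three userAccountControl conditions, here is a small Python sketch that decodes the values mentioned above. The flag values are the documented ones for these four flags; the helper names `decode_uac` and `is_enabled` are made up for illustration.

```python
# Flag bits for the four userAccountControl flags discussed above.
FLAGS = {
    "ACCOUNTDISABLE": 0x0002,
    "PASSWD_NOTREQD": 0x0020,
    "NORMAL_ACCOUNT": 0x0200,
    "DONT_EXPIRE_PASSWD": 0x10000,
}

def decode_uac(value):
    """Return the names of the known flags that are set in value."""
    return [name for name, bit in FLAGS.items() if value & bit]

def is_enabled(value):
    """Mirror of the bitwise-AND filter: enabled means the ACCOUNTDISABLE bit is clear."""
    return (value & FLAGS["ACCOUNTDISABLE"]) == 0

for uac in (512, 544, 66048, 514):
    print(uac, decode_uac(uac), "enabled" if is_enabled(uac) else "disabled")
```

The three values from the original filter all come out as enabled, and 514 comes out as disabled, which is what the single negated bit test checks.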

Why does CONTAINS find inequal text strings in JCR-SQL2?

Working with a JCR-SQL2 query I noticed that the CONTAINS operator finds nodes
which do not have exactly the same string that was in the condition.
Example
The following query:
SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, 'my/search-expression')
finds not only nodes that contain the my/search-expression string, but also nodes with strings like my/another/search/expression.
Why does the query not find only the exact string provided? How could it be changed to narrow down the results?
This question is intended to be answered by myself, for knowledge sharing - but feel free to add your own answer or improve an existing one.
An execution plan for the example query reveals the root cause of the problem:
[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) +:fulltext:my +:fulltext:search +:fulltext:expression ft:("my/search-expression") where contains([s].[*], 'my/search-expression') */
The CONTAINS operator triggers a full text search. Non-word characters, like "/" or "-", are used as word delimiters. As a result, the query looks for all nodes that contain the words: "my", "search" and "expression".
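The effect can be reproduced with a rough Python approximation. The regular expression below is an assumption standing in for the real tokenizer (it is not Lucene's implementation), and `contains_all_words` is a hypothetical helper.

```python
import re

def words(text):
    """Split on non-word characters, roughly as a full-text tokenizer would."""
    return set(re.findall(r"\w+", text.lower()))

def contains_all_words(document, expression):
    """True if every word of the search expression occurs somewhere in the document."""
    return words(expression) <= words(document)

print(contains_all_words("my/another/search/expression", "my/search-expression"))  # True
print(contains_all_words("my search", "my/search-expression"))                     # False
```

The first call returns True even though the exact string never occurs, which is the surprising behaviour described in the question.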
What can be done with it? There are several options.
1. Use double quotes
If you want to limit results to phrases with given words in exact order and without any other words between them, put the search expression inside double quotes:
SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, '"my/search-expression"')
Now, the execution plan is different:
[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) :fulltext:"my search expression" ft:("my/search-expression") where contains([s].[*], '"my/search-expression"') */
The query will now look for the whole phrase, not single words. However, it still ignores non-word characters, so such phrases would also be found: "my search expression" or "my-search-expression".
2. Use LIKE expression (not recommended)
If you want to find only the exact phrase, keeping non-word characters, you can use the LIKE expression:
SELECT * FROM [nt:base] AS s WHERE s.* LIKE '%my/search-expression%'
This is, however, much slower. I needed to add another condition to avoid a timeout while explaining the execution plan. For this query:
SELECT * FROM [nt:base] AS s WHERE s.* LIKE '%my/search-expression%' AND ISDESCENDANTNODE([/content/my/content])
the execution plan is:
[nt:base] as [s] /* traverse "/content/my/content//*" where ([s].[*] like '%my/search-expression%') and (isdescendantnode([s], [/content/my/content])) */
It would find only nodes with this phrase: "my/search-expression".
3. Use double quotes and refine the results
It would probably be better to use the first approach (CONTAINS with double quotes) and refine the results later, for example in application code if the query is run from an application.
4. Mix CONTAINS and LIKE
Another option is to mix full-text search and LIKE expression with AND:
SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, '"my/search-expression"') AND s.* LIKE '%my/search-expression%'
The execution plan is now:
[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) :fulltext:"my search expression" ft:("my/search-expression") where (contains([s].[*], '"my/search-expression"')) and ([s].[*] like '%my/search-expression%') */
Now, it should be fast and strict at the same time.
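Option 3 (quoted CONTAINS, then refine in application code) could look roughly like the following Python sketch. `phrase_match` only imitates the quoted full-text search under the assumption that non-word characters are ignored, and `refine` is a hypothetical post-filter; in a real application the candidates would come from the query result.

```python
import re

def phrase_match(text, phrase):
    """Imitate quoted CONTAINS: same words, same order, adjacent; punctuation ignored."""
    t = re.findall(r"\w+", text.lower())
    p = re.findall(r"\w+", phrase.lower())
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))

def refine(candidates, phrase):
    """Keep only values that contain the exact phrase, non-word characters included."""
    return [c for c in candidates if phrase in c]

values = [
    "see my/search-expression here",
    "my search expression",
    "my-search-expression",
    "my/another/search/expression",
]
candidates = [v for v in values if phrase_match(v, "my/search-expression")]
exact = refine(candidates, "my/search-expression")

print(candidates)  # the first three values
print(exact)       # only "see my/search-expression here"
```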
I had the same problem.
So basically you should define a different tokenizer for your Lucene index; in my case the "Whitespace" tokenizer was just fine.
With the Standard tokenizer, "my/search-expression" is split into 3 tokens: "my", "search", "expression".
The Standard tokenizer uses some special characters as delimiters.
That's the reason why you get 0 results for "my/search-expression".
Another example:
"some-other my search/expression" is split by the Whitespace tokenizer into:
"some-other", "my", "search/expression"
When you search for "some-other my", this should return results.
List of tokenizers
Lucene index example:
<yourLucene
    jcr:primaryType="oak:QueryIndexDefinition"
    type="lucene"
    async="async"
    evaluatePathRestrictions="{Boolean}true"
    includedPaths="[/somepath]"
    queryPaths="[/somepath]"
    compatVersion="{Long}2">
    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <tokenizer
                jcr:primaryType="nt:unstructured"
                name="Whitespace"/>
            <filters jcr:primaryType="nt:unstructured">
                <Standard jcr:primaryType="nt:unstructured"/>
                <LowerCase jcr:primaryType="nt:unstructured"/>
                <Stop jcr:primaryType="nt:unstructured"/>
            </filters>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured">
        <nt:unstructured jcr:primaryType="nt:unstructured">
            <properties jcr:primaryType="nt:unstructured">
                <someprop
                    jcr:primaryType="nt:unstructured"
                    name="someprop"
                    propertyIndex="{Boolean}true"
                    type="String"/>
            </properties>
        </nt:unstructured>
    </indexRules>
</yourLucene>

Amazon CloudSearch returns false results

I have a DB of articles, and I would like to search for all the articles that:
1. contain the word 'RIO' in either the title or the excerpt
2. contain the word 'BRAZIL' in the parent_post_content
3. were posted within a certain time range
The query I search with (structured) was:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (phrase field=title 'RIO') (phrase field=excerpt 'RIO')))
but for some reason I get results that contain 'RIO' in the title but do not contain 'BRAZIL' in the parent_post_content.
This is especially weird because I tried to filter only on the title (and not the excerpt) with this query:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (phrase field=name 'RIO'))
and the results seem OK.
I'm fairly new to CloudSearch, so I very likely have syntax errors, but I can't seem to find them. Help?
You're using the phrase operator but not actually searching for a phrase; it would be best to use the term operator (or no operator) instead. I can't see why it should matter but using something outside of how it was intended to be used can invite unintended consequences.
Here is how I'd re-write your queries:
Using term (mainly just used if you want to boost fields):
(and (term field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (term field=title 'RIO') (term field=excerpt 'RIO')))
Without an operator (I find this simplest):
(and parent_post_content:'BRAZIL' (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or title:'RIO' excerpt:'RIO'))
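If the query is assembled in application code, small helpers can keep the parentheses and quoting straight. The Python sketch below rebuilds the term version shown above; the helper names are made up, and the escaping rule (backslash before single quotes and backslashes) is my understanding of the structured query syntax, so check it against the docs before relying on it.

```python
def quote(value):
    """Wrap a string in single quotes, escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"

def term(field, value):
    return f"(term field={field} {quote(value)})"

def date_range(field, start, end):
    # '[' = inclusive lower bound, '}' = exclusive upper bound, as in the queries above.
    return f"(range field={field} [{quote(start)},{quote(end)}}})"

def combine(op, *clauses):
    """Join clauses under a boolean operator such as 'and' or 'or'."""
    return f"({op} {' '.join(clauses)})"

query = combine(
    "and",
    term("parent_post_content", "BRAZIL"),
    date_range("post_date", "2016-02-16T08:13:26Z", "2016-09-16T08:13:26Z"),
    combine("or", term("title", "RIO"), term("excerpt", "RIO")),
)
print(query)
```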
If that fails, can you post the complete query? I'd like to check that, for example, you're using the structured query parser since you mentioned you're new to CloudSearch.
Here are some relevant docs from Amazon:
Compound queries for more on the various operators
Searching text for specifics on the phrase operator
Apparently the problem was not with the query, but with the displayed content. I foolishly trusted that the content displayed in the CloudSearch site was complete, and so concluded that it did not contain Brazil. But alas, it was not the full content, and when I checked the full content, Brazil was there.
Sorry for the foolery.

How to make LIKE in SQL look for specific string instead of just a wildcard

My SQL Query:
SELECT
[content_id] AS [LinkID]
, dbo.usp_ClearHTMLTags(CONVERT(nvarchar(600), CAST([content_html] AS XML).query('root/Physicians/name'))) AS [Physician Name]
FROM
[DB].[dbo].[table1]
WHERE
[id] = '188'
AND
(content LIKE '%Urology%')
AND
(contentS = 'A')
ORDER BY
--[content_title]
dbo.usp_ClearHTMLTags(CONVERT(nvarchar(600), CAST([content_html] AS XML).query('root/Physicians/name')))
The issue I am having is that a row appears in the results whether its content is Neurology or Urology, since "Neurology" contains "urology".
Is there any way to make it so that a search for Urology returns only Urology results and a search for Neurology returns only Neurology results?
The search term can be Urology, Neurology, Internal Medicine, etc.; the two mentioned above are the ones causing the issue.
The content is an ntext column containing XML, for example:
<root><Location><location>Office</location>
<office>Office</office>
<Address><image><img src="Rd.jpg?n=7513" /></image>
<Address1>1 Road</Address1>
<Address2></Address2>
<City>Qns</City>
<State>NY</State>
<zip>14404</zip>
<phone>324-324-2342</phone>
<fax></fax>
<general></general>
<from_north></from_north>
<from_south></from_south>
<from_west></from_west>
<from_east></from_east>
<from_connecticut></from_connecticut>
<public_trans></public_trans>
</Address>
</Location>
</root>
With the update this content column has the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<Physicians>
<name>Doctor #1</name>
<picture>
<img src="phys_lab coat_gradation2.jpg?n=7529" />
</picture>
<gender>M</gender>
<langF1>
English
</langF1>
<specialty>
<a title="Neurology" href="neu.aspx">Neurology</a>
</specialty>
</Physicians>
</root>
If I search for Lab the result appears because there is the text lab in the column.
This is what I would do if you're not into making a CLR proc to use regexes (SQL Server doesn't have regex capabilities natively):
SELECT
[...]
WHERE
(content LIKE #strService OR
content LIKE '%[^a-z]' + #strService + '[^a-z]%' OR
content LIKE #strService + '[^a-z]%' OR
content LIKE '%[^a-z]' + #strService)
This way you check to see if content is equal to #strService OR if the word exists somewhere within content with non-letters around it OR if it's at the very beginning or very end of content with a non-letter either following or preceding respectively.
[^...] means "a character that is none of these". If there are other characters you don't want to accept before or after the search term, put them in all four of the square brackets (after the ^). For instance [^a-zA-Z_].
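The same "no letter directly before or after" rule can be tried out quickly outside the database. The Python sketch below is only an illustration of the logic, not T-SQL: it uses regex lookarounds in place of the four LIKE patterns, and it assumes case-insensitive matching (in SQL Server that depends on the collation). `whole_word` is a hypothetical helper.

```python
import re

def whole_word(term):
    """Match term only when it is not directly preceded or followed by a letter."""
    return re.compile(r"(?<![a-z])" + re.escape(term) + r"(?![a-z])", re.IGNORECASE)

urology = whole_word("Urology")

print(bool(urology.search('<a title="Urology">Urology</a>')))      # True
print(bool(urology.search('<a title="Neurology">Neurology</a>')))  # False
print(bool(urology.search("Urology")))                              # True: whole string
```

Note that a term such as Lab would still match inside phys_lab coat with this rule, because the underscore is not a letter; that is the case where extending the bracket to [^a-zA-Z_] helps.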
As I see it, your options are to either:
Create a function that processes a string and finds a whole match inside it
Create a CLR extension that allows you to call .NET code and leverage the REGEX capabilities of .NET
Aaron's suggestion is a good one IF you can know up front all the terms that could be used for searching. The problem I could see is if someone searches for a specific word combination.
Databases are notoriously bad at semantics (i.e. they don't understand the concept of neurology or urology - everything is just a string of characters).
The best solution would be to create a table which defines the terms (two columns, PK and the name of the term).
The query is then a join:
JOIN terms ON table1.term_id = terms.term_id AND terms.term = 'Urology'
That way, you can avoid the LIKE and search for specific results.
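A minimal, self-contained sketch of that design, using Python's built-in sqlite3 module purely for demonstration. The `terms` table and `term_id` column follow the description above; the `name` column and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE terms (term_id INTEGER PRIMARY KEY, term TEXT NOT NULL UNIQUE);
    CREATE TABLE table1 (
        content_id INTEGER PRIMARY KEY,
        name TEXT,
        term_id INTEGER REFERENCES terms(term_id)
    );
    INSERT INTO terms VALUES (1, 'Urology'), (2, 'Neurology');
    INSERT INTO table1 VALUES (10, 'Doctor #1', 2), (11, 'Doctor #2', 1);
""")

# Exact comparison on the term: 'Urology' can no longer match 'Neurology'.
rows = conn.execute(
    """
    SELECT t1.name
    FROM table1 AS t1
    JOIN terms ON t1.term_id = terms.term_id
    WHERE terms.term = ?
    """,
    ("Urology",),
).fetchall()

print(rows)  # [('Doctor #2',)]
```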
If you can't do this, then SQL is probably the wrong tool. Use LIKE to get a set of results which match and then, in an imperative programming language, clean those results from unwanted ones.
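For the clean-up step in application code, one possibility is to parse the XML and compare the specialty exactly instead of searching the raw text. The Python sketch below assumes the physician XML layout shown in the question; `specialties` and `has_specialty` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

def specialties(content_xml):
    """Collect the link texts found under root/Physicians/specialty."""
    root = ET.fromstring(content_xml)
    return {a.text.strip() for a in root.findall("./Physicians/specialty/a") if a.text}

def has_specialty(content_xml, wanted):
    """Exact comparison, so 'Urology' does not match 'Neurology'."""
    return wanted in specialties(content_xml)

sample = (
    "<root><Physicians><name>Doctor #1</name>"
    '<picture><img src="phys_lab coat_gradation2.jpg?n=7529" /></picture>'
    '<specialty><a title="Neurology" href="neu.aspx">Neurology</a></specialty>'
    "</Physicians></root>"
)

print(has_specialty(sample, "Neurology"))  # True
print(has_specialty(sample, "Urology"))    # False
print(has_specialty(sample, "lab"))        # False: the image file name is ignored
```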
Judging from your content, can you not leverage the fact that there are quotes in the string you're searching for?
SELECT
[...]
WHERE
(content LIKE '%""Urology""%')

Apache solr - more like this score

I have a small index with ~1000 documents with only two fields:
- id (string)
- content (text_general)
I noticed that when I do an MLT search by id for similar content, the original document (the one whose id I searched for) has a score of 5.241327.
There is an exact 1:1 duplicate of that document, yet the duplicate is returned with score = 1.5258181. Why is it not 5.241327 when the content is a 100% duplicate?
Another question: is there any way to get similar documents by content, by passing some text in the query?
Example:
/mlt/?q=content:Some encoded long text&mlt.fl=content
I am trying to check if there is similar content uploaded and the check must be performed at new content upload time.
It might be worth trying some different parameters. I also use MLT on only one field, with the following parameters:
'mlt.boost': 'true',
'mlt.fl': 'my_field_name',
'mlt.maxqt': 1000,
'mlt.mindf': '0',
'mlt.mintf': '0',
'qt': 'mlt',
'rows': '10'
See http://wiki.apache.org/solr/MoreLikeThis for an explanation of the parameters. I think with a small index mindf (minimum document frequency) might be important. I also see that the default mintf (minimum term frequency) is 2; I assume an ID is only one term, so it is probably ignored!
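As a sketch, the parameters above can be turned into a request URL with Python's standard library. The base URL, core name and document id are placeholders, and the /mlt handler path is taken from the question; adjust both to your setup.

```python
from urllib.parse import urlencode

def mlt_url(base_url, doc_id, field):
    """Build a More-Like-This request URL with relaxed frequency thresholds."""
    params = {
        "q": f"id:{doc_id}",
        "mlt.fl": field,
        "mlt.boost": "true",
        "mlt.mindf": 0,    # do not require a minimum document frequency
        "mlt.mintf": 0,    # do not require a minimum term frequency
        "mlt.maxqt": 1000,
        "rows": 10,
    }
    return f"{base_url}/mlt?{urlencode(params)}"

# Hypothetical host and core name.
url = mlt_url("http://localhost:8983/solr/mycore", "42", "content")
print(url)
```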
First, how does Solr More-Like-This work?
A regular Solr query is conducted (e.g. "?q=content:Some encoded long text&.....").
For each document returned by that query, More-Like-This conducts a "more like this" query.
So the first result set, "response", is just like any Solr query result set.
The More-Like-This results appear below it and start with something like this (JSON format):
"moreLikeThis":{
"57375":{"numFound":18155,"start":0,"docs":["
For an explanation of the More Like This algorithm, see:
http://blog.brattland.no/node/18
and: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
If you haven't solved the problem yet, please let me know and I will guide you through it.