Why does CONTAINS find inequal text strings in JCR-SQL2? - jcr

Working with a JCR-SQL2 query I noticed that the CONTAINS operator finds nodes
which do not have exactly the same string that was in the condition.
Example
The following query:
SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, 'my/search-expression')
would not find only nodes that contain the my/search-expression string, but also nodes with strings like my/another/search/expression.
Why does the query not find only the exact string provided? How could it be changed to narrow down the results?
This question is intended to be answered by myself, for knowledge sharing - but feel free to add your own answer or improve an existing one.

An execution plan for the example query reveals the root cause of the problem:
[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) +:fulltext:my +:fulltext:search +:fulltext:expression ft:("my/search-expression") where contains([s].[*], 'my/search-expression') */
The CONTAINS operator triggers a full text search. Non-word characters, like "/" or "-", are used as word delimiters. As a result, the query looks for all nodes that contain the words: "my", "search" and "expression".
What can be done with it? There are several options.
1. Use double quotes
If you want to limit results to phrases with given words in exact order and without any other words between them, put the search expression inside double quotes:
SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, '"my/search-expression"')
Now, the execution plan is different:
[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) :fulltext:"my search expression" ft:("my/search-expression") where contains([s].[*], '"my/search-expression"') */
The query will now look for the whole phrase, not single words. However, it still ignores non-word characters, so such phrases would also be found: "my search expression" or "my-search-expression".
2. Use LIKE expression (not recommended)
If you want to find only the exact phrase, keeping non-word characters, you can use the LIKE expression:
SELECT * FROM [nt:base] AS s WHERE s.* LIKE '%my/search-expression%'
This is, however, much slower. I needed to add another condition to avoid timeout during explaining the execution plan. For this query:
SELECT * FROM [nt:base] AS s WHERE s.* LIKE '%my/search-expression%' AND ISDESCENDANTNODE([/content/my/content])
the execution plan is:
[nt:base] as [s] /* traverse "/content/my/content//*" where ([s].[*] like '%my/search-expression%') and (isdescendantnode([s], [/content/my/content])) */
It would find only nodes with this phrase: "my/search-expression".
3. Use double quotes and refine the results
It would be probably better to use the first approach (CONTAINS with double quotes) and refine the results later, for example in application code if the query is run from an application.
4. Mix CONTAINS and LIKE
Another option is to mix full-text search and LIKE expression with AND:
SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, '"my/search-expression"') AND s.* LIKE '%my/search-expression%'
The execution plan is now:
[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) :fulltext:"my search expression" ft:("my/search-expression") where (contains([s].[*], '"my/search-expression"')) and ([s].[*] like '%my/search-expression%') */
Now, it should be fast and strict in the same time.

Had the same problem.
So basically you should define different tokenizer for your lucene index, in my case "Whitespace" tokenizer was just fine.
With Standard tokenizer "my/search-expression" is splitted in 3 tokens "my", "search", "expression".
Standard tokenizer use some special characters as delimiter.
Thats the reason why for "my/search-expression" you get 0 results.
Another example:
"some-other my search/expression" with Whitespace tokenizer this is splitted into:
"some-other", "my", "search/expression"
When you search for "some-other my" this should return results.
List of tokenizers
Lucene index example:
<yourLucene
jcr:primaryType="oak:QueryIndexDefinition"
type="lucene"
async="async"
evaluatePathRestrictions="{Boolean}true"
includedPaths="[/somepath]"
queryPaths="[/somepath]"
compatVersion="{Long}2">
<analyzers jcr:primaryType="nt:unstructured">
<default jcr:primaryType="nt:unstructured">
<tokenizer
jcr:primaryType="nt:unstructured"
name="Whitespace"/>
<filters jcr:primaryType="nt:unstructured">
<Standard jcr:primaryType="nt:unstructured"/>
<LowerCase jcr:primaryType="nt:unstructured"/>
<Stop jcr:primaryType="nt:unstructured"/>
</filters>
</default>
</analyzers>
<indexRules jcr:primaryType="nt:unstructured">
<nt:unstructured jcr:primaryType="nt:unstructured">
<properties jcr:primaryType="nt:unstructured">
<someprop
jcr:primaryType="nt:unstructured"
name="someprop"
propertyIndex="{Boolean}true"
type="String"/>
</properties>
</nt:unstructured>
</indexRules>

Related

Amazon CloudSearch returns false results

I have a DB of articles, and i would like to search for all the articles who:
1. contain the word 'RIO' in either the title or the excerpt
2. contain the word 'BRAZIL' in the parent_post_content
3. and in a certain time range
The query I search with (structured) was:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (phrase field=title 'RIO') (phrase field=excerpt 'RIO')))
but for some reason i get results that contain 'RIO' in the title, but do not contain 'BRAZIL' in the parent_post_content.
This is especially weird because i tried to condition only on the title (and not the excerpt) with this query:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (phrase field=name 'RIO'))
and the results seem OK.
I'm fairy new to CloudSearch, so i very likely have syntax errors, but i can't seem to find them. help?
You're using the phrase operator but not actually searching for a phrase; it would be best to use the term operator (or no operator) instead. I can't see why it should matter but using something outside of how it was intended to be used can invite unintended consequences.
Here is how I'd re-write your queries:
Using term (mainly just used if you want to boost fields):
(and (term field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (term field=title 'RIO') (term field=excerpt 'RIO')))
Without an operator (I find this simplest):
(and parent_post_content:'BRAZIL' (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or title:'RIO' excerpt:'RIO'))
If that fails, can you post the complete query? I'd like to check that, for example, you're using the structured query parser since you mentioned you're new to CloudSearch.
Here are some relevant docs from Amazon:
Compound queries for more on the various operators
Searching text for specifics on the phrase operator
Apparently the problem was not with the query, but with the displayed content. I foolishly trusted that the content displaying in the CloudSearch site was complete, and so concluded that it does not contain Brazil. But alas, it is not the full content, and when i check the full content, Brazil was there.
Sorry for the foolery.

lucene query special characters

I have trouble understandnig handling of special characters in lucene.
My analyzer has no stopwords, so that special chars are not removed:
CharArraySet stopwords = new CharArraySet(0, true);
return new GermanAnalyzer(stopwords);
than I create docs like:
doc.add(new TextField("tags", "23", Store.NO));
doc.add(new TextField("tags", "Brüder-Grimm-Weg", Store.NO));
Query tags:brüder\-g works fine, but fuzzy query tags:brüder\-g~ does not return anything. When the street name would be Eselgasse query tags:Esel~ would work fine.
I use lucene 5.3.1
Thanks for help!
Fuzzy Queries (as well as wildcard or regex queries) are not analyzed by the QueryParser.
If you are using StandardAnalyzer, for instance, "Brüder-Grimm-Weg" will be indexed as three terms, "brüder", "grimm", and "weg". So, after analysis you have:
"tags:brüder\-g" --> tags:brüder tags:g
This matches on tags:brüder
"tags:brüder\-g~" --> tags:brüder-g~2
Since this is not analyzed, it remains a single term, and you have no matches, since there is no single term in your index like "brüder-g"

How to make LIKE in SQL look for specific string instead of just a wildcard

My SQL Query:
SELECT
[content_id] AS [LinkID]
, dbo.usp_ClearHTMLTags(CONVERT(nvarchar(600), CAST([content_html] AS XML).query('root/Physicians/name'))) AS [Physician Name]
FROM
[DB].[dbo].[table1]
WHERE
[id] = '188'
AND
(content LIKE '%Urology%')
AND
(contentS = 'A')
ORDER BY
--[content_title]
dbo.usp_ClearHTMLTags(CONVERT(nvarchar(600), CAST([content_html] AS XML).query('root/Physicians/name')))
The issue I am having is, if the content is Neurology or Urology it appears in the result.
Is there any way to make it so that if it's Urology, it will only give Urology result and if it's Neurology, it will only give Neurology result.
It can be Urology, Neurology, Internal Medicine, etc. etc... So the two above used are what is causing the issue.
The content is a ntext column with XML tag inside, for example:
<root><Location><location>Office</location>
<office>Office</office>
<Address><image><img src="Rd.jpg?n=7513" /></image>
<Address1>1 Road</Address1>
<Address2></Address2>
<City>Qns</City>
<State>NY</State>
<zip>14404</zip>
<phone>324-324-2342</phone>
<fax></fax>
<general></general>
<from_north></from_north>
<from_south></from_south>
<from_west></from_west>
<from_east></from_east>
<from_connecticut></from_connecticut>
<public_trans></public_trans>
</Address>
</Location>
</root>
With the update this content column has the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<Physicians>
<name>Doctor #1</name>
<picture>
<img src="phys_lab coat_gradation2.jpg?n=7529" />
</picture>
<gender>M</gender>
<langF1>
English
</langF1>
<specialty>
<a title="Neurology" href="neu.aspx">Neurology</a>
</specialty>
</Physicians>
</root>
If I search for Lab the result appears because there is the text lab in the column.
This is what I would do if you're not into making a CLR proc to use Regexes (SQL Server doesn't have regex capabilities natively)
SELECT
[...]
WHERE
(content LIKE #strService OR
content LIKE '%[^a-z]' + #strService + '[^a-z]%' OR
content LIKE #strService + '[^a-z]%' OR
content LIKE '%[^a-z]' + #strService)
This way you check to see if content is equal to #strService OR if the word exists somewhere within content with non-letters around it OR if it's at the very beginning or very end of content with a non-letter either following or preceding respectively.
[^...] means "a character that is none of these". If there are other characters you don't want to accept before or after the search query, put them in every 4 of the square brackets (after the ^!). For instance [^a-zA-Z_].
As I see it, your options are to either:
Create a function that processes a string and finds a whole match inside it
Create a CLR extension that allows you to call .NET code and leverage the REGEX capabilities of .NET
Aaron's suggestion is a good one IF you can know up front all the terms that could be used for searching. The problem I could see is if someone searches for a specific word combination.
Databases are notoriously bad at semantics (i.e. they don't understand the concept of neurology or urology - everything is just a string of characters).
The best solution would be to create a table which defines the terms (two columns, PK and the name of the term).
The query is then a join:
join table1.term_id = terms.term_id and terms.term = 'Urology'
That way, you can avoid the LIKE and search for specific results.
If you can't do this, then SQL is probably the wrong tool. Use LIKE to get a set of results which match and then, in an imperative programming language, clean those results from unwanted ones.
Judging from your content, can you not leverage the fact that there are quotes in the string you're searching for?
SELECT
[...]
WHERE
(content LIKE '%""Urology""%')

SQL query to bring all results regardless of punctuation with JSF

So I have a database with articles in them and the user should be able to search for a keyword they input and the search should find any articles with that word in it.
So for example if someone were to search for the word Alzheimer's I would want it to return articles with the word spell in any way regardless of the apostrophe so;
Alzheimer's
Alzheimers
results should all be returned. At the minute it is search for the exact way the word is spell and wont bring results back if it has punctuation.
So what I have at the minute for the query is:
private static final String QUERY_FIND_BY_SEARCH_TEXT = "SELECT o FROM EmailArticle o where UPPER(o.headline) LIKE :headline OR UPPER(o.implication) LIKE :implication OR UPPER(o.summary) LIKE :summary";
And the user's input is called 'searchText' which comes from the input box.
public static List<EmailArticle> findAllEmailArticlesByHeadlineOrSummaryOrImplication(String searchText) {
Query query = entityManager().createQuery(QUERY_FIND_BY_SEARCH_TEXT, EmailArticle.class);
String searchTextUpperCase = "%" + searchText.toUpperCase() + "%";
query.setParameter("headline", searchTextUpperCase);
query.setParameter("implication", searchTextUpperCase);
query.setParameter("summary", searchTextUpperCase);
List<EmailArticle> emailArticles = query.getResultList();
return emailArticles;
}
So I would like to bring back all results for alzheimer's regardless of weather their is an apostrophe or not. I think I have given enough information but if you need more just say. Not really sure where to go with it or how to do it, is it possible to just replace/remove all punctuation or just apostrophes from a user search?
In my point of view, you should change your query,
you should add alter your table and add a FULLTEXT index to your columns (headline, implication, summary).
You should also use MATCH-AGAINST rather than using LIKE query and most important, read about SOUNDEX() syntax, very beautiful syntax.
All I can give you is a native query example:
SELECT o.* FROM email_article o WHERE MATCH(o.headline, o.implication, o.summary) AGAINST('your-text') OR SOUNDEX(o.headline) LIKE SOUNDEX('your-text') OR SOUNDEX(o.implication) LIKE SOUNDEX('your-text') OR SOUNDEX(o.summary) LIKE SOUNDEX('your-text') ;
Though it won't give you results like Google search but it works to some extent. Let me know what you think.

Using wildcard and required operator in an Elasticsearch search

We have various rows inside our Elasticsearch index that contain the text
"... 2% milk ...".
User enters a query like "2% milk" into a search field and we transform it internally to a query
title:(+milk* +2%*)
because all terms should be required and we are possibly interested into rows that contain "2% milkfat".
This query above return zero hits. Changing the query to
title:(+milk* +2%)
returns reasonable results. So why does the '*' operator in the first query not work?
Unless you set a mapping, the "%" sign will get removed in the tokenization process. Basically "2% milk" will get turned into the tokens 2 and milk.
When you search for "2%*" it looks for tokens like: 2%, 2%a, 2%b, etc... and not match any indexed tokens, giving no hits.
When you search for "2%", it will go through the same tokenization process as at index-time (you can specify this, but the default tokenization is the same) and you will be looking for documents matching the token 2, which will give you a hit.
You can read more about the analysis/tokenization process here and you can set up the analysis you want by defining a custom mapping
Good luck!
Prefix and Wildcard queries do not appear to apply the Analyzer to their content. To provide a few examples:
title:(+milk* +2%) --> +title:milk* +title:2
title:(+milk* +2%*) --> +title:milk* +title:2%*
title:(+milk* +2%3) --> +title:milk* +(title:2 title:3)
title:(+milk* +2%3*) --> +title:milk* +title:2%3*
+title:super\\-milk --> +title:super title:milk
+title:super\\-milk* --> +title:super-milk*
It does make some sense to prevent tokenization of wildcard queries, since wildcard phrase queries are not allowed. If tokenization were allowed, it would seem to beg the question, especially with embeded wildcards, of just how many terms that wildcard can span.