GraphDB Lucene Connector - uppercase problem when indexing URIs - lucene

There seems to be a strange behaviour with GraphDB Lucene connectors when a URI is indexed (either because of using the $self quasiproperty or because the property chain leads to a URI). I would summarize the issue as follows:
Uppercase letters in the URI must be escaped in the query text (i.e. query text must be "*\Merlo" instead of "*Merlo" in the wine example provided here)
No snippet can be extracted from URIs
Any idea how this could be overcome?

The Lucene connector treats URIs as non-analyzed fields, i.e. as a single opaque string that is not tokenized or analyzed into words. The logic is that URIs are identifiers and have meaning only in their entirety. URIs don't contain text, even if they sometimes make sense to people. This also means that normal full-text queries will not work on such fields, and no snippets can be extracted from them. They can be searched, but since it isn't full-text search they might behave unexpectedly.
In your particular example with a query like "*Merlo", Lucene will run the query through the analyzer in order to be able to match analyzed fields (i.e. the fields normally used for full-text search). By escaping the capital letter you're preventing the analyzer from normalizing it to a lowercase m and you get a match.
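The effect described above can be sketched in a few lines of Python (a toy model, not GraphDB or Lucene code; the helper names are made up): the analyzer lowercases the query term, while the non-analyzed URI field stores the raw string, so the lowercased term never matches unless the capital letter is escaped away from the analyzer.

```python
# Toy illustration of why "*Merlo" misses a non-analyzed URI field
# while the escaped form (which bypasses the analyzer) hits.

def analyze(term):
    """Minimal stand-in for a standard analyzer: lowercase the term."""
    return term.lower()

uri_field = "http://www.ontotext.com/example/wine#Merlo"  # stored verbatim

def wildcard_match(field_value, pattern):
    """'*X' matches if the raw field value ends with X (the field itself is never analyzed)."""
    return field_value.endswith(pattern.lstrip("*"))

# The analyzer rewrites the query term, so it no longer matches the raw URI:
analyzed = analyze("Merlo")                        # -> "merlo"
print(wildcard_match(uri_field, "*" + analyzed))   # False

# Escaping the capital letter keeps the term out of the analyzer's hands:
print(wildcard_match(uri_field, "*Merlo"))         # True
```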
If you need an exact match you can do this too (note there's no need to escape the capital letters):
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT * {
    ?search a inst:my_index ;
        # Surround with double quotes to force an exact match
        :query "\"http://www.ontotext.com/example/wine#Merlo\"" ;
        :entities ?entity .
}

Related

Search for part of the word in the phrase with full text search in SQL Server 2016

In Microsoft SQL Server, full-text searches are limited to matching from the start of a word. That is, we cannot search for text in the middle of a word the way the LIKE operator can.
I tried to execute this query, but the result is not what I expected.
I want to search for the middle of a term. For example, if my term is "Microsoft" and my query is:
SELECT *
FROM dbo.SMS_Outbox
WHERE CONTAINS(MessageText, N'"*soft*"')
There is no result returned!
The documentation is quite clear that wildcards are allowed only at the end of search terms:
The CONTAINS predicate supports the use of the asterisk (*) as a wildcard character to represent words and phrases. You can add the asterisk only at the end of the word or phrase. The presence of the asterisk enables the prefix-matching mode. In this mode, matches are returned if the column contains the specified search word followed by zero or more other characters.
You cannot do what you want easily. One simple option is to switch to LIKE and take the performance hit:
WHERE MessageText LIKE N'%soft%'
Another option might be to parse your text in such a way that soft is always at the beginning of a search term.
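That second option usually means indexing word suffixes: store every suffix of each word alongside the word itself, and a prefix-only search (the one CONTAINS supports) then behaves like an infix search, because any infix is a prefix of some suffix. A rough sketch of the idea in Python (the helper names are made up, and in SQL Server you would materialize these tokens in an indexed side table):

```python
# Sketch: emulate infix matching with prefix-only search by indexing
# every suffix of each word. Indexing "microsoft" also indexes
# "icrosoft", "crosoft", ..., "soft", so a prefix search for "soft" hits.

def suffix_tokens(word, min_len=3):
    """All suffixes of `word` of at least `min_len` characters."""
    return [word[i:] for i in range(len(word) - min_len + 1)]

def prefix_search(indexed_tokens, prefix):
    """Prefix match, the only wildcard mode CONTAINS supports."""
    return any(tok.startswith(prefix) for tok in indexed_tokens)

tokens = suffix_tokens("microsoft")
print(prefix_search(tokens, "soft"))   # True: "soft" is a stored suffix
print(prefix_search(tokens, "roso"))   # True: any infix is a prefix of some suffix
print(prefix_search(tokens, "xyz"))    # False
```

The trade-off is index size: a word of length n produces roughly n extra tokens.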

Lucene: problems when keeping punctuation

I need Lucene to keep some punctuation marks when indexing my texts, so I'm now using a WhitespaceAnalyzer which doesn't remove the symbols.
If there is a sentence like oranges, apples and bananas in the text, I want the phrase query "oranges, apples" to be a match (but not without the comma), and this is working fine.
However, I'd also want the simple query oranges to produce a hit, but it seems the indexed token contains the comma too (oranges,) so it won't be a match unless I write the comma in the query too, which is undesirable.
Is there any simple way to make this work the way I need?
Thanks in advance.
I know this is a very old question, but I'm bored and I'll reply anyway. I see two ways of doing this:
Create a TokenFilter that, whenever a token contains punctuation, injects the punctuation-free version of the word as a synonym (i.e. an extra token in the stream with a position increment of 0).
Add another field with the same content but analyzed with a standard tokenizer that removes all punctuation, and search both fields.
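The first option can be sketched as a plain-Python token stream (this mimics the behaviour, it is not Lucene API code): each token keeps its punctuation, and a cleaned copy is injected at the same position whenever the two differ, so both oranges, and oranges become searchable.

```python
import string

# Sketch of the synonym-filter idea: yield (term, position_increment)
# pairs; a punctuation-free synonym gets position increment 0, meaning
# it occupies the same position as the original token.

def punctuation_synonyms(tokens):
    for term in tokens:
        yield (term, 1)                        # the original token advances the position
        cleaned = term.strip(string.punctuation)
        if cleaned and cleaned != term:
            yield (cleaned, 0)                 # synonym at the same position

stream = list(punctuation_synonyms(["oranges,", "apples", "and", "bananas"]))
print(stream)
# [('oranges,', 1), ('oranges', 0), ('apples', 1), ('and', 1), ('bananas', 1)]
```

With both tokens indexed at one position, the phrase query "oranges, apples" and the simple query oranges both match.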

Rails 3 Sunspot Fulltext Search Usage

So I've implemented the sunspot_rails gem into my application to utilize the powerful Solr search engine. I recently checked out Ryan's railscast on full-text searching, and I noticed he was using additional parameters in his search queries, such as "-" to denote something that should NOT be included in the full-text search.
I had never heard about this until now, and I was wondering if there is a user-friendly usage guide somewhere that both I and my users can reference to take my search functionality to its maximum capability.
I think ideally I would like to make an abridged version similar to Github's markdown cheat-sheet for my search forms that users can quickly reference.
Sunspot uses Solr's DisMax Query Parser, which has a very simple query syntax. For the most part, it is intended to flexibly parse user-created queries.
DisMax recognizes three special characters: +, -, and ". From the documentation:
[DisMax] is designed to support raw input strings provided by users with no special escaping. '+' and '-' characters are treated as "mandatory" and "prohibited" modifiers for the subsequent terms. Text wrapped in balanced quote characters '"' is treated as a phrase; any query containing an odd number of quote characters is evaluated as if there were no quote characters at all.
There are a few other "behind the scenes" options to tune the relevancy of matched documents. For example, "minimum match" specifies the number or proportion of "optional" clauses (i.e., terms not prefixed with - or +) that must match, and there are options to boost term matches in specific fields, or term matches within close proximity to each other, and so on.
In Sunspot, these are all exposed in the options parameter to the fulltext method, or as methods within a block supplied to that method.
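The three special characters are easy to model. Here is a toy reading of them in Python (a sketch of the quoted rules, not Solr's actual parser):

```python
import re

# Toy classifier for DisMax's three special characters:
# '+' = mandatory, '-' = prohibited, balanced '"..."' = phrase.
# Per the docs, an odd number of quote characters means quotes are ignored.

def classify(query):
    if query.count('"') % 2 == 1:
        query = query.replace('"', '')
    parts = re.findall(r'[+-]?"[^"]*"|\S+', query)
    result = []
    for p in parts:
        mod = ''
        if p[0] in '+-':
            mod, p = p[0], p[1:]
        kind = 'phrase' if p.startswith('"') else 'term'
        result.append((mod, kind, p.strip('"')))
    return result

print(classify('solr +search -slow "query parser"'))
# [('', 'term', 'solr'), ('+', 'term', 'search'),
#  ('-', 'term', 'slow'), ('', 'phrase', 'query parser')]
```

A cheat-sheet for users then only needs three rows: +word (must appear), -word (must not appear), "some phrase" (exact phrase).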

Lucene character sequence search in a term

I want to find a character sequence (more than three characters) within a term. I have tried *character_sequence* (I know this is not recommended), but it does not return a result if the character sequence itself is equal to the term.
For example, if the terms are "testsomething", "somethingtest" and "sometestthing", I want all these Terms in my search result if the text "test" is searched.
Is there any way to do it?
Thanks!
Prefix queries are supported by default in Lucene, and to support suffix queries you will have to do a little work. You can check How to query lucene with "like" operator?
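The usual trick for suffix queries is to also index each term reversed: a suffix query then becomes a prefix query against the reversed field. A sketch of the idea using the example terms above (in Python rather than Lucene Java):

```python
# Sketch of suffix search via a reversed-term field: index each term
# reversed, and turn the suffix query "*test" into the prefix query
# "tset*" against the reversed index.

terms = ["testsomething", "somethingtest", "sometestthing"]
reversed_index = [t[::-1] for t in terms]

def suffix_search(query):
    rq = query[::-1]
    return [t[::-1] for t in reversed_index if t.startswith(rq)]

print(suffix_search("test"))   # only the term that ends in "test"
```

Combining an ordinary prefix query on the normal field with a prefix query on the reversed field covers both ends of a term; true infix matching ("sometestthing") still needs n-gram or infix indexing.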

MySQL query: it's not fetching the first result

I have below values in my database.
been Lorem Ipsum and scrambled ever
scrambledtexttextofandtooktooktypetexthastheunknownspecimenstandardsincetypesett
Here is my query:
SELECT
    nBusinessAdID,
    MATCH (`sHeadline`) AGAINST ("text" IN BOOLEAN MODE) AS score
FROM wiki_businessads
WHERE MATCH (`sHeadline`) AGAINST ("text" IN BOOLEAN MODE)
    AND bDeleted = "0" AND nAdStatus = "1"
ORDER BY score DESC, bPrimeListing DESC, dDateCreated DESC
It's not fetching the first result. Why? It should, because the value contains the word text in it. I have disabled stopword filtering.
This one is also not working:
SELECT
    nBusinessAdID,
    MATCH (`sHeadline`) AGAINST ('"text"' IN BOOLEAN MODE) AS score
FROM wiki_businessads
WHERE MATCH (`sHeadline`) AGAINST ('"text"' IN BOOLEAN MODE)
    AND bDeleted = "0" AND nAdStatus = "1"
ORDER BY score DESC, bPrimeListing DESC, dDateCreated DESC
The full-text search only matches words and word prefixes. Because your data does not contain word boundaries (spaces), the words are not indexed, so they are not found.
Some possible choices you could make are:
Fix your data so that it contains spaces between words.
Use LIKE '%text%' instead of a full text search.
Use an external full-text search engine.
I will expand on each of these in turn.
Fix your data so that it contains spaces between words.
Your data seems to have been corrupted somehow. It looks like words or sentences but with all the spaces removed. Do you know how that happened? Was it intentional? Perhaps there is a bug elsewhere in the system. Try to fix that. Find out where the data came from and see if it can be reimported correctly.
If the original source doesn't contain spaces, perhaps you could use some natural language toolkit to guess where the spaces should be and insert them. There most likely already exist libraries that can do this, although I don't happen to know any. A Google search might find something.
Use LIKE '%text%' instead of a full text search.
A workaround is to use LIKE '%text%' instead, but note that this will be much slower, as it will not be able to use the index. However, it will give the correct result.
Use an external full-text search engine.
You could also look at Lucene or Sphinx. For example I know that Sphinx supports finding text using *text*. Here is an extract from the documentation which explains how to enable infix searching, which is what you need.
9.2.16. min_infix_len
Minimum infix prefix length to index. Optional, default is 0 (do not index infixes).
Infix indexing allows to implement wildcard searching by 'start*', '*end', and 'middle' wildcards (refer to enable_star option for details on wildcard syntax). When minimum infix length is set to a positive number, indexer will index all the possible keyword infixes (ie. substrings) in addition to the keywords themselves. Too short infixes (below the minimum allowed length) will not be indexed.
For instance, indexing a keyword "test" with min_infix_len=2 will result in indexing "te", "es", "st", "tes", "est" infixes along with the word itself. Searches against such index for "es" will match documents that contain "test" word, even if they do not contain "es" on itself. However, indexing infixes will make the index grow significantly (because of many more indexed keywords), and will degrade both indexing and searching times.
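What the indexer does with min_infix_len can be modelled in a few lines of Python (a toy sketch of the quoted behaviour, not Sphinx source code):

```python
# Toy model of Sphinx's min_infix_len: index every contiguous substring
# of each keyword whose length is at least min_infix_len. The keyword
# itself is included, since it is its own longest infix.

def infixes(word, min_infix_len):
    return {word[i:j]
            for i in range(len(word))
            for j in range(i + min_infix_len, len(word) + 1)}

print(sorted(infixes("test", 2)))
# ['es', 'est', 'st', 'te', 'tes', 'test']
```

This reproduces the documentation's example: indexing "test" with min_infix_len=2 yields "te", "es", "st", "tes", "est" along with the word itself, and also shows why the index grows so quickly, since a word of length n contributes on the order of n^2 tokens.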