How to customise Lucene WhitespaceAnalyzer to index words without special characters?

I don't want special characters in the indexed words of a string. I understand that StandardAnalyzer removes special characters, but it also drops stopwords and single characters, and I want those indexed as well.
E.g.: list of hotel management organisation (hmo) site
Indexed words: list, of, hotel, management, organisation, hmo, site
Is there a filter for this? How can I build a custom Analyzer for this purpose?
Maybe a filter that replaces non-alphanumeric characters with ""?

StandardAnalyzer sounds like a good fit. Just construct it with an empty stopword set:
Analyzer analyzer = new StandardAnalyzer(CharArraySet.EMPTY_SET);
As for building your own analyzer, check the Analyzer docs; they include an example of what a custom implementation should look like. If StandardAnalyzer is close to what you need, you might copy its createComponents method as a starting point.
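As a minimal sketch of such a custom analyzer (assuming Lucene 7 or later; older versions use slightly different package names, and the class name is just for illustration), this copies the shape of StandardAnalyzer's createComponents but leaves out the StopFilter, so stopwords and single characters are kept while StandardTokenizer still strips the punctuation:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// StandardTokenizer drops punctuation; with no StopFilter in the chain,
// stopwords and single characters ("of", "a", ...) are kept in the index.
public final class KeepEverythingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}

For the example string above, this should produce exactly list, of, hotel, management, organisation, hmo, site.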

Related

GraphDB Lucene Connector - uppercase problem when indexing URIs

There seems to be a strange behaviour with GraphDB Lucene connectors when a URI is indexed (either because of using the $self quasiproperty or because the property chain leads to a URI). I would summarize the issue as follows:
Uppercase letters in the URI must be escaped in the query text (i.e. query text must be "*\Merlo" instead of "*Merlo" in the wine example provided here)
No snippet can be extracted from URIs
Any idea how this could be overcome?
The Lucene connector treats URIs as non-analyzed fields, i.e. as a single chunk of string that isn't tokenized or analyzed into words. The logic is that URIs are identifiers and have meaning only in their entirety; they don't contain text, even if they sometimes make sense to people. This also means that normal full-text queries will not work on such fields, nor can any snippets be extracted from them. They can be searched, but since it isn't full-text search they might behave unexpectedly.
In your particular example with a query like "*Merlo", Lucene will run the query through the analyzer in order to be able to match analyzed fields (i.e. the fields normally used for full-text search). By escaping the capital letter you're preventing the analyzer from normalizing it to a lowercase m and you get a match.
If you need an exact match you can do this too (note there's no need to escape the capital letters):
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT * {
    ?search a inst:my_index ;
        # Surround with double quotes to force an exact match
        :query "\"http://www.ontotext.com/example/wine#Merlo\"" ;
        :entities ?entity .
}

Lucene: combining ASCII folding and stemming for French

I am implementing a Lucene search for French text. The search must work regardless of whether the user has typed accents or not, and it must also support stemming. I am currently using the Snowball-based French stemmer in Lucene 3.
On the indexing side, I have added an ASCIIFoldingFilter into my analyzer, which runs after the stemmer.
However, on the search side the operation is not reversible: the stemmer only works correctly when the input still contains its accents. For example, it strips the ité from the end of université, but given the user input universite the stemmer produces universit during query analysis. Since the index contains the term univers, the search for universit returns no results.
A solution seems to be to change the order of stemming and folding in the analyzer: instead of stemming and then folding, do the folding before stemming. This effectively makes the operation reversible, but also significantly hobbles the stemmer since many words no longer match the stemming rules.
Alternatively, the stemmer could be modified to operate on folded input i.e. ignore accents, but could this result in over-stemming?
Is there a way to effectively do folded searches without changing the behavior of the stemming algorithm?
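For reference, the reordered (fold-before-stem) chain described above would look roughly like this on a recent Lucene; the 3.x class names differ slightly and the analyzer name is illustrative:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Folding first makes indexing and querying symmetric, at the cost of the
// Snowball rules seeing accent-free input.
public final class FoldThenStemFrenchAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new ASCIIFoldingFilter(result);        // strip accents first...
        result = new SnowballFilter(result, "French");  // ...then stem the folded form
        return new TokenStreamComponents(source, result);
    }
}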
Step 1.) Use an exhaustive lemma synonym mapping file
Step 2.) ASCII (ICU) fold after lemmatizing (a sketch of this chain is shown below).
You can get exhaustive French lemmas here:
http://www.lexiconista.com/datasets/lemmatization/
Also, because lemmatizers are not destructive the way stemmers are, you can apply the lemmatizer multiple times. If your lemma list also contains accent-free normalizations, just apply the lemmatizer again.
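A rough sketch of that chain on a recent Lucene, applying the lemma mapping as a synonym expansion and folding afterwards; french-lemmas.txt is a hypothetical Solr-format rules file built from the dataset above (e.g. universités => université):

import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;

public final class FrenchLemmaFoldingAnalyzer extends Analyzer {
    private final SynonymMap lemmas;

    public FrenchLemmaFoldingAnalyzer() throws Exception {
        // The parser's analyzer tokenizes the rules file; pick one that matches
        // how your terms look at this point in the chain.
        SolrSynonymParser parser = new SolrSynonymParser(true, true, new WhitespaceAnalyzer());
        try (Reader rules = Files.newBufferedReader(Paths.get("french-lemmas.txt"), StandardCharsets.UTF_8)) {
            parser.parse(rules);
        }
        lemmas = parser.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new SynonymGraphFilter(result, lemmas, true); // inflected form -> lemma
        result = new ASCIIFoldingFilter(result);               // fold accents after lemmatizing
        return new TokenStreamComponents(source, result);
    }
}

As the answer notes, if users may type unaccented inflected forms (e.g. universites), the lemma list also needs accent-free variants of its keys.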

Lucene: problems when keeping punctuation

I need Lucene to keep some punctuation marks when indexing my texts, so I'm now using a WhitespaceAnalyzer which doesn't remove the symbols.
If there is a sentence like oranges, apples and bananas in the text, I want the phrase query "oranges, apples" to be a match (but not the same phrase without the comma), and this is working fine.
However, I'd also want the simple query oranges to produce a hit, but it seems the indexed token contains the comma too (oranges,) so it won't be a match unless I write the comma in the query too, which is undesirable.
Is there any simple way to make this work the way I need?
Thanks in advance.
I know this is a very old question, but I'm bored and I'll reply anyway. I see two ways of doing this:
Creating a TokenFilter that, whenever a token contains punctuation, inserts the punctuation-free version into the stream as a synonym (i.e. a token with a position increment of 0); a sketch of such a filter follows this list.
Adding another field with the same content, analyzed with a standard tokenizer that removes punctuation, and querying both fields.
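A sketch of the first option (recent Lucene, hypothetical class name): the filter re-emits a punctuation-stripped copy of any token that contains punctuation at the same position, so both oranges, and oranges end up in the index while phrase queries over the original tokens still work.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Whenever a token contains punctuation, also emit the stripped form at the
// same position (like a synonym), so both "oranges," and "oranges" are indexed.
public final class PunctuationSynonymFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

    private State savedState;        // state of the original token
    private String strippedPending;  // stripped form still to be emitted

    public PunctuationSynonymFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (strippedPending != null) {
            restoreState(savedState);                 // copy offsets etc. from the original token
            termAtt.setEmpty().append(strippedPending);
            posIncAtt.setPositionIncrement(0);        // same position as the original token
            strippedPending = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String original = termAtt.toString();
        String stripped = original.replaceAll("\\p{Punct}+", "");
        if (!stripped.isEmpty() && !stripped.equals(original)) {
            savedState = captureState();              // remember the original for the synonym
            strippedPending = stripped;
        }
        return true;                                  // emit the original token unchanged
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        savedState = null;
        strippedPending = null;
    }
}

Wrap it around the whitespace tokenizer in a custom analyzer and use that same analyzer for both indexing and querying.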

SQL2008 fulltext index search without word breakers

I am trying to search an FTI using CONTAINS for Twitter-style usernames, e.g. @username, but word breakers will ignore the @ symbol. Is there any way to disable word breakers? From my research, there is a way to create a custom word breaker DLL, install it and assign it, but that all seems rather involved and, frankly, over my head. I disabled stop words so that dashes are not ignored, but I still need that @ symbol. Any ideas?
You're not going to like this answer, but full-text indexes only keep the characters _ and ` while indexing; all other characters are ignored and the words are split where those characters occur. This is mainly because full-text indexes are designed to index large documents, where only proper words are considered in order to give a more refined search.
We faced a similar problem. To solve it we used a translation table in which characters like @, - and / were replaced with special sequences such as '`at`', '`dash`', '`slash`', etc. When searching the full-text index, you have to apply the same replacements to the search string before running the query. This takes care of the special characters.
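A sketch of that translation-table idea in application code, assuming you can rewrite the text before storing it and apply the same rewrite to the search string before building the CONTAINS predicate; the sequences mirror the ones above and the class name is illustrative:

import java.util.LinkedHashMap;
import java.util.Map;

// Replace characters the word breaker would strip with sequences it keeps.
// The same mapping must be applied to the stored text and to the search string.
public final class FullTextEscaper {
    private static final Map<String, String> TRANSLATION = new LinkedHashMap<>();
    static {
        TRANSLATION.put("@", "`at`");
        TRANSLATION.put("-", "`dash`");
        TRANSLATION.put("/", "`slash`");
    }

    public static String escape(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : TRANSLATION.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}

At query time, put escape("@username") into the CONTAINS search string instead of the raw @username.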

Preserving dots of an acronym while indexing in Lucene

If I want Lucene to preserve the dots in acronyms (for example U.K., U.S.A., etc.), which analyzer do I need to use, and how?
I also want to supply a set of stop words to Lucene while doing this.
A WhiteSpaceAnalyzer will preserve the dots. A StopFilter removes a list of stop words. You should define exactly the analysis you need, and then combine analyzers and token filters to achieve it, or write your own analyzer.
StandardTokenizer preserves the dots occurring between letters. You can use StandardAnalyzer which uses StandardTokenizer. Or you could create your own analyzer with StandardTokenizer.
Correction: StandardAnalyzer will not help as it uses StandardFilter, which removes the dots from the acronym. You can construct your own analyzer with StandardTokenizer and additional filters (such as lower case filter) minus the StandardFilter.
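A sketch of that correction, assuming Lucene 7 or later (in older releases the equivalent classes are ClassicTokenizer and ClassicFilter): StandardTokenizer keeps the dots that occur between letters, and since no StandardFilter/ClassicFilter is added they are never stripped; the stop word set is supplied by the caller.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Keeps acronym dots ("U.S.A" stays one token) and removes caller-supplied stop words.
public final class AcronymPreservingAnalyzer extends Analyzer {
    private final CharArraySet stopWords;

    public AcronymPreservingAnalyzer(CharArraySet stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new StopFilter(result, stopWords);
        return new TokenStreamComponents(source, result);
    }
}

Pass your stop words as a CharArraySet, e.g. new CharArraySet(Arrays.asList("a", "an", "the"), true).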