I want to ignore special characters at query time in Solr.
For example:
Let's assume we have a doc in Solr with content:My name is A-B-C.
content:A-B-C returns the document,
but content:ABC doesn't return any document.
My requirement is that content:ABC should return that one document.
So basically I want to ignore that - at query time.
To get the tokens concatenated when they have a special character between them (i.e. A-B-C should match ABC and not just A), you can use a PatternReplaceCharFilter. This will allow you to replace all those characters with an empty string, effectively giving ABC to the next step of the analysis process instead.
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="[^a-zA-Z0-9 ]" replacement=""/>
<tokenizer ...>
[...]
</analyzer>
This will keep all regular ASCII letters, numbers and spaces, while replacing any other character with the empty string. You'll probably have to tweak that character group to include more, but that will depend on your raw content and how it should be processed.
This should be done both when indexing and when querying (as long as you want the user to be able to query for A-B-C as well). If you want to score these matches differently, use multiple fields with different analysis chains - for example keeping one field to only tokenize on whitespace, and then boosting it higher (with qf=text_ws^5 other_field) if you have a match on A-B-C.
This does not change what content is actually stored for the field, so the data returned will still be the same - only how a match is performed.
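For reference, a complete fieldType along these lines might look like the following. This is only a sketch: the whitespace tokenizer, the lowercase filter and the name text_nospecials are assumptions you should adapt to your schema. Since there are no separate index/query analyzers, the same chain applies at both index and query time, as recommended above.
<fieldType name="text_nospecials" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip everything except ASCII letters, digits and spaces before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^a-zA-Z0-9 ]" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
With this chain, A-B-C becomes ABC before tokenization, so both content:ABC and content:A-B-C (whose query-side analysis also yields ABC) match the document.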
You must have a fieldType defined for your field content.
A fieldType can have two separate analyzers: one for index time and one for query time.
At index time you can have the content "A-B-C" indexed as both ABC and A-B-C by using the Word Delimiter Graph Filter.
Use the catenateWords option, set as catenateWords="1".
It works as follows:
"hot-spot-sensor's" → "hotspotsensor". In your case, "A-B-C" will generate "ABC".
For reference, see the documentation: Word Delimiter Graph Filter.
Usage:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="true" catenateWords="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
This will put multiple token variants in the index, and you will be able to search with both ABC and A-B-C.
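For completeness, a full fieldType built this way might look like the following sketch (the name text_wdgf is illustrative). Note that when WordDelimiterGraphFilterFactory runs at index time, the Solr reference documentation recommends following it with FlattenGraphFilterFactory, because the index cannot store the token graph the filter produces:
<fieldType name="text_wdgf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
    <!-- flatten the token graph so it can be indexed -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
At index time "A-B-C" then yields the parts A, B and C, the catenated ABC, and the preserved original A-B-C, while the query-side analyzer leaves ABC and A-B-C intact as single tokens that match the catenated and preserved terms respectively.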
I tried the official Lucene demo by running IndexFiles with the arguments -index . -docs ., and the console printed that files including pom.xml, *.java and *.class were added to the index.
Then I tried SearchFiles with the arguments -index . -query "lucene AND main", and the console printed only IndexFiles.class, SearchFiles.class and IndexFiles.java, but not SearchFiles.java (which I think should be one of the search results).
Your search results are correct (for the .java files, at least).
The sample code uses the StandardAnalyzer which, in turn, uses the StandardTokenizer.
The StandardTokenizer splits input text into tokens using the word-break rules of the Unicode Text Segmentation standard (UAX#29). Under those rules, a period between letters is not a word boundary.
So when the source files contain text such as
org.apache.lucene.analysis.Analyzer
it is tokenized as a single token. There are no word boundaries.
Looking in the IndexFiles.java source file, there is the following text:
demonstrating simple Lucene indexing
This is tokenized into 4 separate tokens.
But in the SearchFiles.java source file, the text "lucene" only ever appears in text such as org.apache.lucene.analysis.Analyzer - and therefore the single token lucene is never created.
Your query therefore does not find any hits in the SearchFiles.java document, because the query matches exact tokens. Both source files contain the word "main", but only one contains the token "lucene".
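You can see this for yourself with a few lines of code. This is a minimal sketch: the printTokens helper is my own, and it assumes lucene-core (plus lucene-analyzers-common on older versions) is on the classpath.
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDemo {
  static void printTokens(Analyzer analyzer, String text) throws IOException {
    try (TokenStream ts = analyzer.tokenStream("contents", text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term);
      }
      ts.end();
    }
  }

  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new StandardAnalyzer();
    // dots between letters are not word boundaries: prints one (lowercased) token
    printTokens(analyzer, "org.apache.lucene.analysis.Analyzer");
    // plain prose: prints demonstrating, simple, lucene, indexing
    printTokens(analyzer, "demonstrating simple Lucene indexing");
  }
}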
For the .class files, because these are compiled bytecode files, I would say they should not be indexed in the first place. Lucene works with text files, not binary files. Yes, the class files will contain fragments of text, but they will also typically contain unprintable control characters, which are not suitable to be indexed. I think indexing results could be unpredictable because of this.
You can explore the indexed data using Luke, which is bundled in the binary releases.
How can I request Solr to search for special characters,
e.g. to search for strings containing the '#' character?
When I am querying
name_tsi:"#*" AND type_ssi:program
it is giving me all the available entries in the index,
the same ones I get through
type_ssi:program
I am getting the same results in both cases, but it should filter the results on the basis of name_tsi:"#*".
And the use of a backslash \ before the # is not working.
Is there anything I can do in solrconfig.xml or schema.xml?
I am using Lucene version 5.0.0.
In my search string, there is a minus character, as in "test-".
I read that the minus sign is a special character in Lucene, so I have to escape it, as described in the query parser documentation:
Escaping Special Characters:
Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
To escape these characters, use the \ before the character. For example, to search for (1+1):2, use the query:
\(1\+1\)\:2
To do that I use the QueryParser.escape method:
query = parser.parse(QueryParser.escape(searchString));
I use the ClassicAnalyzer because I noticed that the StandardAnalyzer has some problems with escaping special characters.
The problem is that the parser deletes the special characters, so the query contains the term
content:test
How can I set up the parser and searcher to search for the literal value "test-"?
I also created my own query with the content test-, but that didn't work either. I received 0 results, but my index has entries like:
Test-VRF
Test-IPLS
I am really confused about this problem.
While escaping special characters for the queryparser deals with part of the problem, it doesn't help with analysis.
Neither classic nor standard analyzer will keep punctuation in the indexed form of the field. For each of these examples, the indexed form will be in two terms:
test and vrf
test and ipls
This is why a manually constructed query for "test-" finds nothing. That term does not exist in the index.
The goal of these analyzers is to attempt to index words. As such, punctuation is mostly eliminated, and is not searchable. Phrase queries for "test vrf", "test-vrf" and "test_vrf" are all effectively identical. If that is not what you need, you'll need to look at other analyzers.
The fix for this issue is to index the content field un-analyzed. In Lucene 5, the NOT_ANALYZED option of the old Field constructor is expressed with a StringField, which indexes the value as a single term:
Field field = new StringField(key.toLowerCase(), value, Field.Store.YES);
Anyone who has the same problem has to take care of how the contents are stored in the index.
To build the query, escape the search string in this way
searchString = QueryParser.escape(searchString);
and use, for example, a WhitespaceAnalyzer.
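Putting the pieces together, a minimal end-to-end sketch against Lucene 5.x could look like this. The field name content, the RAMDirectory and the PrefixQuery are illustrative choices, not the only way to do it.
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ExactTermDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()));
    for (String value : new String[] {"Test-VRF", "Test-IPLS"}) {
      Document doc = new Document();
      // StringField indexes the value as one un-analyzed term, punctuation intact
      doc.add(new StringField("content", value, Field.Store.YES));
      writer.addDocument(doc);
    }
    writer.close();

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    // matches every term that starts with "Test-", so both documents here
    TopDocs hits = searcher.search(new PrefixQuery(new Term("content", "Test-")), 10);
    System.out.println(hits.totalHits); // 2
  }
}
Because the terms are not analyzed, the match is case-sensitive: the prefix has to be "Test-", not "test-".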
I'm using Solr with StandardTokenizerFactory on a text field, but I don't want it to split on the underscore.
Do I have to use another tokenizer like PatternTokenizerFactory, or can I do this with StandardTokenizerFactory? I need the same functionality as StandardTokenizerFactory, but without the split on underscore.
I don't think you can do it in StandardTokenizerFactory itself. One solution is to first replace underscores with something StandardTokenizerFactory won't process and that your documents won't otherwise contain. For example, you can first replace _ with QQ everywhere using PatternReplaceCharFilterFactory, pass the result through StandardTokenizerFactory, and then replace QQ with _ using PatternReplaceFilterFactory. Here is the fieldType definition to do it:
<fieldType name="text_std_prot" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="_"
replacement="QQ"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
pattern="QQ"
replacement="_"/>
...
</analyzer>
</fieldType>
Adding just the following seems to do the trick, as StandardTokenizerFactory splits at the hyphen "-":
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="_"
replacement="-"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
Is the forward slash "/" a reserved character in Solr field names?
I'm having trouble writing a Solr sort query that will parse fields containing a forward slash "/".
When making an HTTP query to my Solr server:
q=*&sort=normal+desc
Will work but
q=*&sort=with/slash+desc
q=*&sort=with%2Fslash+desc
Both fail with "can not use FieldCache on multivalued field: with".
Each Solr document contains two int fields, "normal" and "with/slash", indexed in my Solr schema as:
...
<field name="normal" type="int" indexed="true" stored="true" required="false" />
<field name="with/slash" type="int" indexed="true" stored="true" required="false" />
...
Is there any special way I need to encode forward slashes in Solr? Or are there any other delimiter characters I can use? I'm already using '-' and '.' for other purposes.
I just came across the same problem, and after some experimentation found that if you have a forward slash in the field name, you must escape it with a backslash in the Solr query (note that you do not have to do this in the field list parameter). So a search looking for /my/field/name containing my_value is entered in the "q" field as:
\/my\/field\/name:my_value
I haven't tried the sort field, but try this and let us know :)
This is on Solr 4.0.0 alpha.
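Note that if you send this query over HTTP, the backslash itself has to be percent-encoded as %5C (the slash may optionally be encoded as %2F, as in the question above). For example, the escaped query might be sent as:
q=%5C/my%5C/field%5C/name:my_value
This is an illustrative encoding; your HTTP client library may handle it for you.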
From the Solr wiki at https://wiki.apache.org/solr/SolrQuerySyntax:
Solr 4.0 added regular expression support, which means that '/' is now
a special character and must be escaped if searching for literal
forward slash.
In my case I needed to search for a forward slash / with the wildcard *, e.g.:
+(*/*)
+(*2016/17*)
I tried to escape it like so:
+(*2016\/*)
+(*2016\/17*)
but that didn't work either.
The solution was to wrap the text in double quotes ", like so:
+("*\/*")
+("*/*")
+("*2016\/17*")
+("*2016/17*")
Both returned the same result, with and without escaping the forward slash.