Don't split on underscore with solr.StandardTokenizerFactory - ruby-on-rails-3

I'm using Solr with StandardTokenizerFactory on a text field, but I don't want it to split on underscores.
Do I have to use another tokenizer such as PatternTokenizerFactory, or can I do this with StandardTokenizerFactory? I need the same functionality as StandardTokenizerFactory, just without the split on underscores.

I don't think you can do it in StandardTokenizerFactory alone. One solution is to first replace underscores with a placeholder that StandardTokenizerFactory won't split on and that your documents won't otherwise contain. For example, you can replace _ with QQ everywhere using PatternReplaceCharFilterFactory, pass the text through StandardTokenizerFactory, and then replace QQ with _ in each token using PatternReplaceFilterFactory. Here is a fieldType definition that does this:
<fieldType name="text_std_prot" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="_"
replacement="QQ"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
pattern="QQ"
replacement="_"/>
...
</analyzer>
</fieldType>
And here is a screen shot of what happens:

Adding just the following seems to do the trick for StandardTokenizerFactory, since StandardTokenizerFactory splits at the hyphen "-".
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="_"
replacement="-"/>
<tokenizer class="solr.StandardTokenizerFactory"/>

Related

Ignoring special characters during query time in SOLR

I want to ignore special characters at query time in Solr.
For example:
Let's assume we have a doc in Solr with content: My name is A-B-C.
content:A-B-C returns the document,
but content:ABC doesn't return any document.
My requirement is that content:ABC should return that one document.
So basically I want to ignore the - at query time.
To get the tokens concatenated when they have a special character between them (i.e. A-B-C should match ABC and not just A), you can use a PatternReplaceCharFilter. This will allow you to replace all those characters with an empty string, effectively giving ABC to the next step of the analysis process instead.
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="[^a-zA-Z0-9 ]" replacement=""/>
<tokenizer ...>
[...]
</analyzer>
This will keep all regular ascii letters, numbers and spaces, while replacing any other character with the empty string. You'll probably have to tweak that character group to include more, but that will depend on your raw content and how it should be processed.
This should be done both when indexing and when querying (as long as you want the user to be able to query for A-B-C as well). If you want to score these matches differently, use multiple fields with different analysis chains - for example keeping one field to only tokenize on whitespace, and then boosting it higher (with qf=text_ws^5 other_field) if you have a match on A-B-C.
This does not change what content is actually stored for the field, so the data returned will still be the same - only how a match is performed.
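The multi-field idea above could be sketched roughly as follows. The names content_ws and text_ws are hypothetical here (the qf example above uses text_ws as a field name; adjust to whatever your schema actually calls things), and this assumes the original field is named content:

```xml
<!-- Hypothetical field type: splits on whitespace only, so "A-B-C" stays a single token -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- "content" keeps its existing type with the PatternReplaceCharFilterFactory chain;
     "content_ws" is an assumed extra field that receives the same raw text -->
<field name="content_ws" type="text_ws" indexed="true" stored="false"/>
<copyField source="content" dest="content_ws"/>
```

With a dismax/edismax query you could then pass something like qf=content_ws^5 content, so an exact A-B-C match in the whitespace-only field scores higher than a match on the stripped form.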
You must have a field type defined for your content field.
A field type can have two separate analyzers: one for indexing and one for querying.
At index time you can emit both A-B-C and ABC for the content "A-B-C" by using the Word Delimiter filter.
Set catenateWords="1".
It works as follows:
"hot-spot-sensor’s" → "hotspotsensor". In your case, "A-B-C" will generate "ABC".
See the Word Delimiter Filter documentation for an example.
Usage:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
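<filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="true" catenateWords="1"/>
<!-- a graph filter used at index time should be followed by FlattenGraphFilterFactory -->
<filter class="solr.FlattenGraphFilterFactory"/>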
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
This will index multiple tokens for the same text, so you will be able to search with both ABC and A-B-C.

auto delete of documents is not working in Apache Solr

I am using Apache Solr 7 and trying to automate the deletion of documents. I've done the following steps as per Lucene's documentation.
step 1. in solrschema.xml
<updateRequestProcessorChain default="true">
<processor class="solr.processor.DocExpirationUpdateProcessorFactory">
<int name="autoDeletePeriodSeconds">30</int>
<str name="ttlFieldName">time_to_live_s</str>
<str name="expirationFieldName">expire_at_dt</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
step 2. in managed-schema.xml
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<field name="time_to_live_s" type="string" stored="true" multiValued="false" />
<field name="expire_at_dt" type="date" stored="true" multiValued="false" />
step 3. I created a core named sample1 and added the following document
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/sample1/update?commit=true' -d '[{ "id":"sample_doc_1","expire_at_dt":"NOW+10SECONDS"}]'
After 10 seconds, the document is still there. Am I missing a step here, or am I doing something wrong?
I think that when indexing you should set the time_to_live_s field, not expire_at_dt; a value of +10SECONDS (or whatever you want) will be fine.
As a reference:
ttlFieldName - Name of a field this process should look for in each
document processed, defaulting to _ttl_. If the specified field name
exists in a document, the document field value will be parsed as a
Date Math Expression relative to NOW and the result will be added to
the document using the expirationFieldName. Use <null name="ttlFieldName"/> to disable this feature.
If you want to set the expiration date directly, you should provide a proper date string, not a Date Math Expression.
I have a full working example of the auto-delete code here
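As a rough sketch of that suggestion, the document from the question could be sent in Solr's XML update format with the TTL field set instead of the expiration field. This assumes the core name sample1 and the field names from the question; it is not taken from the linked example:

```xml
<!-- POST to http://localhost:8983/solr/sample1/update?commit=true
     with Content-Type: text/xml -->
<add>
  <doc>
    <field name="id">sample_doc_1</field>
    <!-- Date Math Expression relative to NOW; the processor computes
         expire_at_dt from this value, so expire_at_dt is not sent -->
    <field name="time_to_live_s">+10SECONDS</field>
  </doc>
</add>
```

Keep in mind that with autoDeletePeriodSeconds set to 30 the cleanup only runs every 30 seconds, so the document may still be visible for a while after its 10 seconds are up.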

Lucene Tokenizer - Include Spaces

We have an application that tokenises certain data. The problem I have is that I have a comma delimited field I need to tokenize but not on spaces. For Example:
"Age 6, Age 7, Age 8"
Becomes
Age
6
Age
7
Age
8
I need
Age 6
Age 7
Age 8
Is there a way for me to change the default behaviour on certain fields only?
The config setting I have at present:
<field fieldName="SizeGroup" storageType="YES" indexType="TOKENIZED" vectorType="NO"
boost="1f" type="System.String"
settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration,
Sitecore.ContentSearch.LuceneProvider" />
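Unfortunately, I don't know C#, but I do know Lucene. For the behavior you need, use PatternAnalyzer, which lets you specify a regexp to be used for tokenizing. In your case, a pattern like \\, should work for splitting on commas. Note that splitting on the comma alone leaves the space after each comma at the start of the next token, so you may want a pattern such as ,\\s* instead.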

Solr Query not parsing forward slash

Is the forward slash "/" a reserved character in solr field names?
I'm having trouble writing a solr sort query which will parse for fields containing a forward slash "/"
When making an http query to my solr server:
q=*&sort=normal+desc
Will work but
q=*&sort=with/slash+desc
q=*&sort=with%2Fslash+desc
Both fail, saying "can not use FieldCache on multivalued field: with"
Each Solr document contains two int fields, "normal" and "with/slash". My Solr schema defines the fields like so:
...
<field name="normal" type="int" indexed="true" stored="true" required="false" />
<field name="with/slash" type="int" indexed="true" stored="true" required="false" />
...
Is there any special way I need to encode forward slashes in solr? Or are there any other delimiter characters I can use? I'm already using '-' and "." for other purposes.
I just came across the same problem, and after some experimentation found that if you have a forward slash in the field name, you must escape it with a backslash in the Solr query (note that you do not have to do this in the field list parameter). So a search for /my/field/name containing my_value is entered in the "q" field as:
\/my\/field\/name:my_value
I haven't tried the sort field, but try this and let us know :)
This is on Solr 4.0.0 alpha.
From the solr wiki at https://wiki.apache.org/solr/SolrQuerySyntax :
Solr 4.0 added regular expression support, which means that '/' is now
a special character and must be escaped if searching for literal
forward slash.
In my case I needed to search for a forward slash / together with the wildcard *, e.g.:
+(*/*)
+(*2016/17*)
I tried to escape it like so:
+(*2016\/*)
+(*2016\/17*)
but that didn't work either.
The solution was to wrap the text in double quotes ", like so:
+("*\/*")
+("*/*")
+("*2016\/17*")
+("*2016/17*")
Both forms returned the same result, with and without escaping the forward slash.

Solr Highlighting Problem

Hi all, I have a problem: when I query Solr it matches results, but when I enable highlighting on the results of this query, the highlighting does not work.
My query is
+Contents:"item 503"
Contents is of type text. One important detail: in the text, item 503 appears as "item 503(c)". Can the open parenthesis at the end cause a problem? Please help.
Here is the highlighting section in solrconfig.xml
<highlighting>
<!-- Configure the standard fragmenter -->
<!-- This could most likely be commented out in the "default" case -->
<fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
<lst name="defaults">
<int name="hl.fragsize">100</int>
</lst>
</fragmenter>
<!-- A regular-expression-based fragmenter (f.i., for sentence extraction) -->
<fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
<lst name="defaults">
<!-- slightly smaller fragsizes work better because of slop -->
<int name="hl.fragsize">70</int>
<!-- allow 50% slop on fragment sizes -->
<float name="hl.regex.slop">0.5</float>
<!-- a basic sentence pattern -->
<str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
</lst>
</fragmenter>
<!-- Configure the standard formatter -->
<formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
</highlighting>
and here is fieldtype definition in schema.xml
<fieldtype name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" luceneMatchVersion="LUCENE_29"/>
<filter class="solr.StandardFilterFactory"/>
<!-- <filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" luceneMatchVersion="LUCENE_29"/>
<filter class="solr.EnglishPorterFilterFactory"/>-->
</analyzer>
</fieldtype>
and here is field definition
<field name="Contents" type="text" indexed="true" stored="true" />
Regards
Ahsan.
Have you tried storing the term vectors too? If you're using the fast vector highlighter (which I think Solr might use by default), you'll need those.
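For reference, a minimal sketch of what storing term vectors could look like for the Contents field from the question. It uses the standard termVectors, termPositions and termOffsets field attributes; whether this actually fixes the highlighting in your setup is an assumption you would need to verify:

```xml
<!-- Same field as in the question, with term vectors, positions and offsets stored.
     Existing documents must be reindexed before the new data is available. -->
<field name="Contents" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true" />
```

If the fast vector highlighter is not already active, it can be requested per query with hl.useFastVectorHighlighter=true (newer versions use hl.method=fastVector).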