Solr stopwords not working with wildcard search - indexing

I am having a problem with Solr wildcard search and stopwords. I have added a few stopwords ("to", "for", "is") to stopwords.txt. When I am not doing a wildcard search, it works perfectly.
Query --> q=learningObjectTopic:to&rows=1
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3</int>
<lst name="params">
<str name="q">learningObjectTopic:to</str>
<str name="rows">1</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>
When I do a wildcard search, it returns data.
Query --> q=learningObjectTopic:*to*&rows=1
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">5</int>
<lst name="params">
<str name="q">learningObjectTopic:*to*</str>
<str name="rows">1</str>
</lst>
</lst>
<result name="response" numFound="75" start="0">
<doc>
<str name="id">56f4bc54b2de79277297dcab</str>
<str name="learningObjectId">LO1_SK1_18</str>
<str name="learningObjectTopic">Introduction to Web Development</str>
<str name="category">learningObject</str>
<long name="_version_">1537824533459763200</long>
</doc>
</result>
</response>
This is my analyzer
<fieldType name="text_general" class="solr.TextField" multiValued="false" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
What I require is that "to" should not match in the wildcard query either. What am I missing here?
Note: the learningObjectTopic:to search stopped returning results for "to" once I added "to" to the stopwords, so stopword filtering is working.

Solr's StopFilterFactory is not a multi-term-aware component, so it is not applied to wildcard queries.
Also, the requirement itself may not be valid: if the index contains a term such as "Tokyo", a search for "to*" should return it. Returning 0 results in that case would be incorrect.
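The behaviour described in this answer can be illustrated with a toy model. The sketch below is not Solr code: the analyzer, the sample text "Flights to Tokyo", and the function names are assumptions made for illustration. It only mimics the idea that plain terms pass through the stop filter while wildcard terms skip it.

```python
from fnmatch import fnmatchcase

# Stopwords listed in the question.
STOPWORDS = {"to", "for", "is"}


def analyze(text):
    """Toy analyzer: whitespace split, drop stopwords ignoring case, lowercase."""
    return [t.lower() for t in text.split() if t.lower() not in STOPWORDS]


def search(query, indexed_terms):
    """Plain terms are analyzed; wildcard patterns are only lowercased.

    This is a simplification of how multi-term queries bypass filters
    that are not multi-term aware, such as the stop filter.
    """
    if "*" in query or "?" in query:
        pattern = query.lower()
        return [t for t in indexed_terms if fnmatchcase(t, pattern)]
    wanted = analyze(query)
    return [t for t in indexed_terms if t in wanted]


# Hypothetical document text, chosen to mirror the "Tokyo" example.
indexed = analyze("Flights to Tokyo")
print(indexed)
print(search("to", indexed))
print(search("to*", indexed))
```

In this model the plain query "to" is removed by the stop filter and matches nothing, while "to*" still matches "tokyo", which is the result a user searching for a prefix would expect.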

Related

Solr: Exact query syntax for synonyms?

I modified the techproducts example to learn more about synonyms. The added field has the type text2_de
<fieldType name="text2_de" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="index_synonyms.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
The index_synonyms.txt was extended with hierarchical synonyms, each starting with a level prefix for faceting, according to https://wiki.apache.org/solr/HierarchicalFaceting
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa
Umwelt => 1/HS , 2/HS/Bereich , 3/HS/Bereich/Umwelt
Mensch => 1/HS , 2/HS/Bereich , 3/HS/Bereich/Mensch
...
The loaded term info shows, that the analyzer works very well and found 60x "2/hs/bereich" in a document set.
loaded term info for the testfield
I'm not able to make a solr-query to find these 60 documents. The auto generated hyperlink of the loaded term info
http://localhost:8983/solr/#/test/query?q=testfield:2%2Fhs%2Fbereich
found no matches (numFound="0"):
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="q">testfield:2/hs/bereich</str>
<str name="indent">on</str>
<str name="wt">xml</str>
<str name="_">1463321610566</str>
</lst>
</lst>
<result name="response" numFound="0" start="0">
</result>
</response>
Please give me a tip on the exact Solr query syntax needed to find these 60 documents!
Solution found: add wildcards around the auto-generated query from the loaded term info. In this example, testfield:2/hs/strahlung becomes testfield:*2/hs/strahlung*:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">24</int>
<lst name="params">
<str name="q">*:*</str>
<str name="facet.field">testfield</str>
<str name="indent">on</str>
<str name="facet.prefix">3/hs/strahlung</str>
<str name="fq">testfield:*2/hs/strahlung*</str>
<str name="rows">0</str>
<str name="facet">on</str>
<str name="wt">xml</str>
<str name="_">1463590654764</str>
</lst>
</lst>
<result name="response" numFound="68" start="0">
</result>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="testfield">
<int name="3/hs/strahlung/neutronen">44</int>
<int name="3/hs/strahlung/wirkung">37</int>
<int name="3/hs/strahlung/strahlensschutz">34</int>
<int name="3/hs/strahlung/exposition">22</int>
<int name="3/hs/strahlung/radioaktivitaet">22</int>
<int name="3/hs/strahlung/radiologisch">12</int>
<int name="3/hs/strahlung/strahlenart">7</int>
</lst>
</lst>
<lst name="facet_ranges"/>
<lst name="facet_intervals"/>
<lst name="facet_heatmaps"/>
</lst>
</response>
In combination with facet.prefix=3/hs/strahlung, this made it possible to drill down through the hierarchical synonyms.
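As a rough sketch, the request parameters for one drill-down step can be generated from the selected level and path. The function name and its arguments are assumptions for illustration; the parameter values reproduce the ones in the response above.

```python
from urllib.parse import urlencode


def drilldown_params(level, path):
    """Build select parameters for one step of the hierarchical drill-down.

    level: depth of the node already selected (2 for "2/hs/strahlung")
    path:  lowercase path without the depth prefix, e.g. "hs/strahlung"
    """
    return {
        "q": "*:*",
        "rows": 0,
        # Wildcards around the term, as in the solution above.
        "fq": f"testfield:*{level}/{path}*",
        "facet": "on",
        "facet.field": "testfield",
        # Children of the selected node live one level deeper.
        "facet.prefix": f"{level + 1}/{path}",
        "wt": "xml",
    }


params = drilldown_params(2, "hs/strahlung")
print(urlencode(params))
```

The printed query string can be appended to the core's select URL. `urlencode` takes care of escaping the slashes and asterisks.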

solr multiple pdf files indexing all at once.

Using this command
curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@maven_tutorial.pdf"
we can index a single PDF file in Solr by specifying our own id (doc1). However, I want to index many PDF files into Solr at once and let Solr assign the ids automatically.
Please help me.
You can use a UUID field as the unique key.
First define the UUID field type
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
Add your id field in the schema.xml
<field name="id" type="uuid" indexed="true" stored="true" multiValued="false"/>
Make this field the unique key
<uniqueKey>id</uniqueKey>
In solrconfig.xml, define an update chain that auto-generates the id
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Now attach this update chain to the request handler which is extracting the content from the pdf files that you are submitting to solr.
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
<str name="update.chain">uuid</str>
</lst>
</requestHandler>
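With the id generated by the update chain, the upload itself can be scripted. The following is a minimal sketch, assuming a local Solr at the URL used in the question and a folder of PDF files. The function name, the folder argument, and the choice to commit only with the last file are illustrative assumptions, not part of the original answer.

```python
from pathlib import Path
from urllib.parse import urlencode

# Assumption: same host and handler path as in the question.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def curl_commands(folder):
    """Yield one curl command per PDF in `folder`.

    No literal.id is sent, so the id is left to the uuid update chain.
    """
    pdfs = sorted(Path(folder).glob("*.pdf"))
    for i, pdf in enumerate(pdfs):
        # Commit only with the last file to avoid one commit per document.
        params = {"commit": "true"} if i == len(pdfs) - 1 else {}
        url = EXTRACT_URL + ("?" + urlencode(params) if params else "")
        yield ["curl", url, "-F", f"myfile=@{pdf}"]


# Print the commands for the PDFs in the current directory.
for cmd in curl_commands("."):
    print(" ".join(cmd))
```

Each yielded list can be passed to `subprocess.run(cmd, check=True)` to perform the upload instead of just printing the command.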

How can I boost a result in SOLR based on a parameter?

I am new to SOLR and I am trying to boost a result based on a parameter "country".
For example, I want to set the country to US and move all the results with US to the top.
This is how I am doing it right now, but it doesn't work:
sort=query({!qf=market v='US'}) desc
This is how the dismax request handler is set up:
<requestHandler name="dismax" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="pf">
text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9
</str>
<str name="bf">
popularity^0.5 recip(price,1,1000,1000)^0.3
</str>
<str name="fl">
id,name,price,score
</str>
<str name="mm">
2<-1 5<-2 6<90%
</str>
<int name="ps">100</int>
<str name="q.alt">*:*</str>
<!-- example highlighter config, enable per-query with hl=true -->
<str name="hl.fl">text features name</str>
<!-- for this field, we want no fragmenting, just highlighting -->
<str name="f.name.hl.fragsize">0</str>
<!-- instructs Solr to return the field itself if no query terms are
found -->
<str name="f.name.hl.alternateField">name</str>
<str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
</lst>
</requestHandler>
If you are using the dismax/edismax query parser, you can follow the approach suggested by Karthik.
But even with the standard query parser, you can add a boosted clause to the query, e.g. q=market:US^2. Documents matching market:US then get a higher score and rank first.
Note: "market" is a field and "US" is its value.
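Both variants can be sketched as request parameters. The answer referred to above is not part of this thread, so the use of dismax's `bq` (boost query) parameter here is an assumption about what was meant. The function names, the sample query "ipod", and the boost value are hypothetical.

```python
from urllib.parse import urlencode


def dismax_boost_params(user_query, country, boost=2.0):
    """dismax/edismax: keep q as typed and add a boost query (bq)."""
    return {
        "defType": "dismax",
        "q": user_query,
        "bq": f"market:{country}^{boost}",
    }


def standard_boost_params(user_query, country, boost=2.0):
    """Standard parser: append an optional boosted clause to the main query.

    Assumes the default operator is OR, so the market clause only affects
    scoring and does not filter out other countries.
    """
    return {"q": f"({user_query}) market:{country}^{boost}"}


print(urlencode(dismax_boost_params("ipod", "US")))
print(urlencode(standard_boost_params("name:ipod", "US")))
```

In both cases documents from other markets still match; they simply score lower than those where market is US.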

Solr Highlighting Problem

Hi all, I have a problem: when I query Solr it matches results, but when I enable highlighting on the results of this query, the highlighting does not work.
My Query is
+Contents:"item 503"
Contents is of type text. One important detail: in the text, item 503 appears as "item 503(c)". Can the open parenthesis at the end cause the problem? Please help.
Here is the highlighting section in solrconfig.xml
<highlighting>
<!-- Configure the standard fragmenter -->
<!-- This could most likely be commented out in the "default" case -->
<fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
<lst name="defaults">
<int name="hl.fragsize">100</int>
</lst>
</fragmenter>
<!-- A regular-expression-based fragmenter (f.i., for sentence extraction) -->
<fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
<lst name="defaults">
<!-- slightly smaller fragsizes work better because of slop -->
<int name="hl.fragsize">70</int>
<!-- allow 50% slop on fragment sizes -->
<float name="hl.regex.slop">0.5</float>
<!-- a basic sentence pattern -->
<str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
</lst>
</fragmenter>
<!-- Configure the standard formatter -->
<formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
</highlighting>
and here is fieldtype definition in schema.xml
<fieldtype name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" luceneMatchVersion="LUCENE_29"/>
<filter class="solr.StandardFilterFactory"/>
<!-- <filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" luceneMatchVersion="LUCENE_29"/>
<filter class="solr.EnglishPorterFilterFactory"/>-->
</analyzer>
</fieldtype>
and here is field definition
<field name="Contents" type="text" indexed="true" stored="true" />
Regards
Ahsan.
Have you tried storing the term vectors too? If you're using the fast vector highlighter (I am not sure whether your Solr uses it by default), the field needs termVectors, termPositions, and termOffsets enabled.

apache solr : sum of data resulted from group by

We have a requirement where we need to group our records by a particular field and take the sum of a corresponding numeric field
e.g. select userid, sum(click_count) from user_action group by userid;
We are trying to do this using Apache Solr and found 2 ways of doing it:
1. Using the field collapsing feature (http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/), but we found 2 problems with this:
1.1. This is not part of a release and is only available as a patch, so we are not sure if we can use it in production.
1.2. We do not get the sum back, only individual counts, so we would need to sum them on the client side.
2. Using the Stats Component along with faceted search (http://wiki.apache.org/solr/StatsComponent). This meets our requirement, but it is not fast enough for very large data sets.
I just wanted to know if anybody knows of any other way to achieve this.
Appreciate any help.
Thanks,
Terance.
Why not use the StatsComponent instead? It is available from Solr 1.4 onward.
$ curl 'http://search/select?q=*&rows=0&stats=on&stats.field=click_count' |
tidy -xml -indent -quiet -wrap 2000000
<?xml version="1.0" encoding="utf-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">17</int>
<lst name="params">
<str name="q">*</str>
<str name="stats">on</str>
<arr name="stats.field">
<str>click_count</str>
</arr>
<str name="rows">0</str>
</lst>
</lst>
<result name="response" numFound="577" start="0" />
<lst name="stats">
<lst name="stats_fields">
<lst name="click_count">
<double name="min">1.0</double>
<double name="max">3487.0</double>
<double name="sum">47912.0</double>
<long name="count">577</long>
<long name="missing">0</long>
<double name="sumOfSquares">4.0208702E7</double>
<double name="mean">83.0363951473137</double>
<double name="stddev">250.79824725438448</double>
</lst>
</lst>
</lst>
</response>
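On the client side, the sum can be read straight from this response. Below is a minimal sketch using Python's standard XML parser. The sample is a trimmed copy of the response above, and the function name is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the stats response shown above.
SAMPLE = """<response>
  <lst name="stats">
    <lst name="stats_fields">
      <lst name="click_count">
        <double name="min">1.0</double>
        <double name="max">3487.0</double>
        <double name="sum">47912.0</double>
        <long name="count">577</long>
      </lst>
    </lst>
  </lst>
</response>"""


def field_stats(xml_text, field):
    """Return the numeric stats of one field as a name -> float mapping."""
    root = ET.fromstring(xml_text)
    node = root.find(
        f"./lst[@name='stats']/lst[@name='stats_fields']/lst[@name='{field}']"
    )
    if node is None:
        return {}
    return {child.get("name"): float(child.text) for child in node}


stats = field_stats(SAMPLE, "click_count")
print(stats["sum"])
```

This gives one total over the whole result set. For a per-user breakdown, the StatsComponent also accepts a stats.facet parameter (for example stats.facet=userid), which is presumably the combination with faceting that the question found too slow on very large data sets.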