Apache Solr: sum of data resulting from a group by

We have a requirement to group our records by a particular field and take the sum of a corresponding numeric field, e.g.
select userid, sum(click_count) from user_action group by userid;
We are trying to do this with Apache Solr and found two ways of doing it:
1. Using the field collapsing feature (http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/), but we found two problems with this:
1.1. It is not part of a release and is only available as a patch, so we are not sure we can use it in production.
1.2. We do not get the sum back, only the individual counts, so we would need to sum them on the client side.
2. Using the StatsComponent along with faceted search (http://wiki.apache.org/solr/StatsComponent). This meets our requirement, but it is not fast enough for very large data sets.
Does anybody know of another way to achieve this?
I appreciate any help.
Thanks,
Terance.

Why not use the StatsComponent instead? It is available from Solr 1.4 onwards.
$ curl 'http://search/select?q=*&rows=0&stats=on&stats.field=click_count' |
tidy -xml -indent -quiet -wrap 2000000
<?xml version="1.0" encoding="utf-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">17</int>
<lst name="params">
<str name="q">*</str>
<str name="stats">on</str>
<arr name="stats.field">
<str>click_count</str>
</arr>
<str name="rows">0</str>
</lst>
</lst>
<result name="response" numFound="577" start="0" />
<lst name="stats">
<lst name="stats_fields">
<lst name="click_count">
<double name="min">1.0</double>
<double name="max">3487.0</double>
<double name="sum">47912.0</double>
<long name="count">577</long>
<long name="missing">0</long>
<double name="sumOfSquares">4.0208702E7</double>
<double name="mean">83.0363951473137</double>
<double name="stddev">250.79824725438448</double>
</lst>
</lst>
</lst>
</response>
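For a client that needs this sum programmatically, here is a minimal Python sketch. It builds the same kind of StatsComponent request and reads the overall "sum" back from an XML response shaped like the one above. The function names are made up for illustration, and the field names click_count and userid are taken from the SQL in the question. Adding stats.facet requests a separate stats block per value of the grouping field, which is the "stats plus faceting" approach the question already mentions; read_sum below only reads the overall total, not the per-group blocks.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def stats_params(sum_field, group_field=None):
    """Build the query string for a StatsComponent request (rows=0 skips the documents)."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("stats", "true"),
        ("stats.field", sum_field),
    ]
    if group_field is not None:
        # stats.facet asks for one stats block per distinct value of group_field
        params.append(("stats.facet", group_field))
    return urlencode(params)


def read_sum(response_xml, field):
    """Return the overall "sum" that the stats section reports for `field`."""
    root = ET.fromstring(response_xml)
    path = (
        "./lst[@name='stats']/lst[@name='stats_fields']"
        f"/lst[@name='{field}']/double[@name='sum']"
    )
    return float(root.find(path).text)
```

The query string would be appended to the select URL of your own Solr instance, for example 'http://search/select?' + stats_params('click_count', 'userid').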

Solr ExtractingRequestHandler giving empty content field

I'm using Solr 6.2.1 and the ExtractingRequestHandler (already included in Solr 6.2.1) to index PDF and Word documents. All documents are indexed with their metadata (title, date, cp_revision, compagny, ...), but the content field is always empty.
According to the documentation, I should have a non-empty content field: "Tika adds all the extracted text to the content field."
Does anybody know why the content field is empty? According to the answer to this post, it may be because I open my file in non-binary mode, but how do I open it in binary mode?
This is my solrconfig.xml file:
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
...
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="xpath">/xhtml:html/xhtml:body/descendant:node()</str>
<str name="capture">content</str>
<str name="fmap.meta">attr_meta_</str>
<str name="uprefix">attr_</str>
<str name="lowernames">true</str>
</lst>
</requestHandler>
Try indexing with the files example in examples/files, which is designed to parse rich-text formats. If that works, you can compare it with your definition to figure out what goes wrong. I suspect the xpath parameter is wrong and returns just empty content.
I was using the solr:alpine Docker image and had the same problem. It turns out the "content" field was getting mapped to Solr's "text" field, which is indexed but not stored by default. See if adding "fmap.content=doc_content" to the curl request does the trick.
I had a similar problem and fixed it by setting the /update/extract request handler to this:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">content</str>
<str name="update.chain">uuid</str>
</lst>
</requestHandler>
The key part is fmap.content, which maps the content extracted by Tika to your "content" field. That field must be defined in your schema, probably with stored="true".
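On the "binary mode" part of the question, here is a minimal Python sketch of posting a file to /update/extract as raw bytes, with fmap.content passed per request. The function name, the core URL, and the application/octet-stream content type are assumptions for illustration; literal.id, fmap.content and commit are the request parameters already used in this thread.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(base_url, doc_id, file_bytes,
                          content_field="content",
                          content_type="application/octet-stream"):
    """Prepare (but do not send) a POST of raw file bytes to /update/extract."""
    params = urlencode({
        "literal.id": doc_id,           # unique id for the new document
        "fmap.content": content_field,  # map Tika's extracted text to this field
        "commit": "true",
    })
    return Request(
        f"{base_url}/update/extract?{params}",
        data=file_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Usage (hypothetical core URL and file name):
#   with open("report.pdf", "rb") as fh:   # "rb" is what binary mode means here
#       req = build_extract_request("http://localhost:8983/solr/mycore", "doc1", fh.read())
#   urllib.request.urlopen(req)
```

Reading the file with "rb" keeps the bytes untouched, so Tika receives the same data that is on disk.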

Solr: Exact query syntax for synonyms?

I modified the techproducts example to learn more about synonyms. The added field has the type text2_de:
<fieldType name="text2_de" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="index_synonyms.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I extended index_synonyms.txt with hierarchical synonyms, each starting with a level number for faceting, following https://wiki.apache.org/solr/HierarchicalFaceting:
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa
Umwelt => 1/HS , 2/HS/Bereich , 3/HS/Bereich/Umwelt
Mensch => 1/HS , 2/HS/Bereich , 3/HS/Bereich/Mensch
...
The loaded term info shows that the analyzer works very well: it found "2/hs/bereich" 60 times in the document set.
[Screenshot: loaded term info for the testfield]
However, I am not able to write a Solr query that finds these 60 documents. The auto-generated hyperlink from the loaded term info
http://localhost:8983/solr/#/test/query?q=testfield:2%2Fhs%2Fbereich
found no matches (numFound="0"):
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="q">testfield:2/hs/bereich</str>
<str name="indent">on</str>
<str name="wt">xml</str>
<str name="_">1463321610566</str>
</lst>
</lst>
<result name="response" numFound="0" start="0">
</result>
</response>
Please give me a tip on the exact Solr query syntax for synonyms that finds these 60 documents!
Solution found: add wildcards to the auto-generated query from the loaded term info. In this example, the generated query testfield:2/hs/strahlung becomes testfield:*2/hs/strahlung*:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">24</int>
<lst name="params">
<str name="q">*:*</str>
<str name="facet.field">testfield</str>
<str name="indent">on</str>
<str name="facet.prefix">3/hs/strahlung</str>
<str name="fq">testfield:*2/hs/strahlung*</str>
<str name="rows">0</str>
<str name="facet">on</str>
<str name="wt">xml</str>
<str name="_">1463590654764</str>
</lst>
</lst>
<result name="response" numFound="68" start="0">
</result>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="testfield">
<int name="3/hs/strahlung/neutronen">44</int>
<int name="3/hs/strahlung/wirkung">37</int>
<int name="3/hs/strahlung/strahlensschutz">34</int>
<int name="3/hs/strahlung/exposition">22</int>
<int name="3/hs/strahlung/radioaktivitaet">22</int>
<int name="3/hs/strahlung/radiologisch">12</int>
<int name="3/hs/strahlung/strahlenart">7</int>
</lst>
</lst>
<lst name="facet_ranges"/>
<lst name="facet_intervals"/>
<lst name="facet_heatmaps"/>
</lst>
</response>
Combined with facet.prefix=3/hs/strahlung, this made it possible to drill down through the hierarchical synonyms.
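The drill-down request above can be assembled from its parts. This is a small Python sketch that only builds the parameters shown in the response header. The function name is hypothetical, and the values are the ones from this example.

```python
from urllib.parse import urlencode


def drill_down_params(field, parent_term, child_prefix):
    """Parameters for: filter on a parent level, facet on the next level down."""
    return {
        "q": "*:*",
        "rows": "0",
        # wildcards around the term, as in the working query above
        "fq": f"{field}:*{parent_term}*",
        "facet": "on",
        "facet.field": field,
        # only return facet values that start with the child-level prefix
        "facet.prefix": child_prefix,
        "wt": "xml",
    }


params = drill_down_params("testfield", "2/hs/strahlung", "3/hs/strahlung")
query_string = urlencode(params)  # append to .../solr/test/select?
```

urlencode percent-encodes the slashes and asterisks, so the string can be appended to the select URL as it is.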

What are the ways to store and search complex numeric data?

I have some numerical data that must be searchable from a web front-end with the following format:
Toy type: Dog
Toy subtype: Spotted
Toy maker: John
Color: White
Estimated spots: 10
Actual spots: 11
Toy type: Cat
Toy subtype: Striped
Toy maker: Jane
Color: White
Estimated stripes: 5
Actual stripes: [Not yet counted]
A search query might be something like "Type:Cat, Stripes:4-6", or "Type:Dog, Subtype:Spotted", or "Color:White", or "Color:White, Maker:John".
I'm not sure if the data is best suited for a relational database because there are several types and subtypes, each with their own properties. On top of that, there are estimated and actual values for each property.
I'd like some recommendations for how to store and search this data. Please help!
EDIT: I changed the search queries so they are no longer free-form.
I recommend using Apache Solr to index and search your data.
How you use Solr depends on your requirements. I use it as a searchable cache of my data, which is extremely useful when the raw master data must be kept as files. Lots of frameworks integrate Solr as their search backend.
For building front-ends to a Solr index, check out solr-ajax.
Example
Install Solr
Download Solr distribution:
wget http://www.apache.org/dist/lucene/solr/4.7.0/solr-4.7.0.tgz
tar zxvf solr-4.7.0.tgz
Start Solr using embedded Jetty container:
cd solr-4.7.0/example
java -jar start.jar
Solr should now be running locally
http://localhost:8983/solr
data.xml
You did not specify a data format so I used the native XML supported by Solr:
<add>
<doc>
<field name="id">1</field>
<field name="toy_type_s">Dog</field>
<field name="toy_subtype_s">Spotted</field>
<field name="toy_maker_s">John</field>
<field name="color_s">White</field>
<field name="estimated_spots_i">10</field>
<field name="actual_spots_i">11</field>
</doc>
<doc>
<field name="id">2</field>
<field name="toy_type_s">Cat</field>
<field name="toy_subtype_s">Striped</field>
<field name="toy_maker_s">Jane</field>
<field name="color_s">White</field>
<field name="estimated_spots_i">5</field>
</doc>
</add>
Notes:
Every document in Solr must have a unique id
The field names have a trailing "_s" and "_i" in their names to indicate field types. This is a cheat to take advantage of Solr's dynamic field feature.
Index XML file
There are lots of ways to get data into Solr. The simplest is the curl command:
curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary @data.xml
It's worth noting that Solr supports other data formats, such as JSON and CSV.
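If the toy records live in application code rather than in a hand-written file, the same XML can be generated. This is a minimal Python sketch, assuming each record is a dict keyed by the dynamic field names used above. The function name is made up, and skipping None values is my own convention for properties that are not yet counted.

```python
import xml.etree.ElementTree as ET


def build_add_xml(records):
    """Serialize a list of dicts into Solr's <add><doc>...</doc></add> XML."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            if value is None:
                continue  # leave out fields that have no value yet
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


xml_body = build_add_xml([
    {"id": 1, "toy_type_s": "Dog", "estimated_spots_i": 10, "actual_spots_i": 11},
    {"id": 2, "toy_type_s": "Cat", "estimated_spots_i": 5, "actual_spots_i": None},
])
```

The resulting string can be written to data.xml and posted with the curl command above.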
Search indexed file
Again, there are language libraries that support Solr searches; the following examples use curl. The Solr search syntax is close to the queries you described.
Here's a simple example:
$ curl "http://localhost:8983/solr/select/?q=toy_type_s:Cat"
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="q">toy_type_s:Cat</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="id">2</str>
<str name="toy_type_s">Cat</str>
<str name="toy_subtype_s">Striped</str>
<str name="toy_maker_s">Jane</str>
<str name="color_s">White</str>
<int name="estimated_spots_i">5</int>
<long name="_version_">1463999035283079168</long>
</doc>
</result>
</response>
A more complex search example:
$ curl "http://localhost:8983/solr/select/?q=toy_type_s:Cat%20AND%20estimated_spots_i:\[2%20TO%206\]"
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int>
<lst name="params">
<str name="q">toy_type_s:Cat AND estimated_spots_i:[2 TO 6]</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="id">2</str>
<str name="toy_type_s">Cat</str>
<str name="toy_subtype_s">Striped</str>
<str name="toy_maker_s">Jane</str>
<str name="color_s">White</str>
<int name="estimated_spots_i">5</int>
<long name="_version_">1463999035283079168</long>
</doc>
</result>
</response>
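A web front-end would typically turn structured search criteria into a query string like the ones above. Here is a minimal Python sketch, assuming exact-match terms and inclusive numeric ranges. The function name is hypothetical, and values containing spaces or query-syntax characters would need quoting or escaping, which this sketch does not do.

```python
def build_query(terms=None, ranges=None):
    """Combine field:value terms and field:[low TO high] ranges with AND."""
    clauses = [f"{field}:{value}" for field, value in (terms or {}).items()]
    clauses += [
        f"{field}:[{low} TO {high}]"
        for field, (low, high) in (ranges or {}).items()
    ]
    # fall back to the match-all query when no criteria are given
    return " AND ".join(clauses) if clauses else "*:*"


query = build_query({"toy_type_s": "Cat"}, {"estimated_spots_i": (2, 6)})
```

Here query holds the same string that Solr echoes back in the "q" parameter of the second response above.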
You have structured the problem in such a way as to make this very difficult to solve. Your data is structured data, with specific columns. Yet, you are trying to use free-form queries to search through it.
So, the normal way to do this is to allow search terms for each of the fields.
The next way to approach this is as a full-text problem. This definitely has its issues. For instance, numbers are typically stop words. And values in different fields would get confused with each other.
Of course, you can try to do free form search on structured data. This is, after all, something that Google and Microsoft are doing. If you search "airfare from New York to London" on Google, you will get lists of flights. But this is a hard problem to approach through understanding the query.

How can I boost a result in SOLR based on a parameter?

I am new to SOLR and I am trying to boost a result based on a parameter "country".
For example, I want to set the country to US and move all the results with US to the top.
This is how I am doing it right now, but it doesn't work:
sort=query({!qf=market v='US'}) desc
This is how the dismax request handler is set up:
<requestHandler name="dismax" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="pf">
text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9
</str>
<str name="bf">
popularity^0.5 recip(price,1,1000,1000)^0.3
</str>
<str name="fl">
id,name,price,score
</str>
<str name="mm">
2<-1 5<-2 6<90%
</str>
<int name="ps">100</int>
<str name="q.alt">*:*</str>
<!-- example highlighter config, enable per-query with hl=true -->
<str name="hl.fl">text features name</str>
<!-- for this field, we want no fragmenting, just highlighting -->
<str name="f.name.hl.fragsize">0</str>
<!-- instructs Solr to return the field itself if no query terms are
found -->
<str name="f.name.hl.alternateField">name</str>
<str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
</lst>
</requestHandler>
If you are using the dismax/edismax query parser, you can follow the approach suggested by Karthik.
But even with the standard query parser, you can query q=market:US^2. This should rank results where market is US higher.
Note: "market" is a field and "US" is its value.
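For the dismax case, one common option is the bq (boost query) parameter, which adds a boost for matching documents without filtering out the others. This may or may not be what the other answer proposed, so treat the following Python sketch as an assumption. The function name and the example query "ipod" are made up; "market" and "US" come from the question.

```python
from urllib.parse import urlencode


def boosted_params(user_query, field, value, boost=2.0):
    """Build dismax parameters that boost documents where field == value."""
    return urlencode({
        "defType": "dismax",
        "q": user_query,
        # bq adds to the score of matching docs; non-matching docs still appear
        "bq": f"{field}:{value}^{boost}",
    })


query_string = boosted_params("ipod", "market", "US")
```

The boost factor is only a starting point and usually needs tuning against the qf and bf weights already configured in the handler.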

How to get date strings from the content of a PDF with Apache Solr

Hi all, I am new to Apache Solr. I have a PDF that contains date information, such as "bla bla bla 2012-11-23 11:11:12 bla bla ...", and I want to get all dates from the content.
I read some documentation (http://wiki.apache.org/solr/ExtractingRequestHandler) and added date.formats to /update/extract:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
<lst name="date.formats">
<str>yyyy-MM-dd</str>
<str>yyyy-MM-dd'T'HH:mm:ss'Z'</str>
<str>yyyy-MM-dd'T'HH:mm:ss</str>
<str>yyyy-MM-dd</str>
<str>yyyy-MM-dd hh:mm:ss</str>
<str>yyyy-MM-dd HH:mm:ss</str>
</lst>
</requestHandler>
I am adding the PDF like this:
curl "http://localhost:8983/solr/update/extract?literal.id=sql.txt&uprefix=attr_&fmap.content=attr_content&commit=true"&stream.file="/home/example/example.pdf"
But the response contains nothing about the dates or the content.
Thanks
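As far as I know, date.formats controls how extracted field values are parsed as dates; it does not pull date strings out of the body text. One possible workaround is to get the extracted text back (for example from a stored content field) and scan it on the client. The Python sketch below assumes the text is already available as a string and only looks for the "yyyy-MM-dd HH:mm:ss" shape from the example. The names are made up for illustration.

```python
import re
from datetime import datetime

# matches timestamps shaped like 2012-11-23 11:11:12
TIMESTAMP = re.compile(r"\b\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\b")


def find_timestamps(text):
    """Return every valid 'YYYY-MM-DD HH:MM:SS' timestamp found in text."""
    found = []
    for match in TIMESTAMP.findall(text):
        try:
            found.append(datetime.strptime(match, "%Y-%m-%d %H:%M:%S"))
        except ValueError:
            pass  # right shape, but not a real date or time (e.g. month 13)
    return found


dates = find_timestamps("bla bla bla 2012-11-23 11:11:12 bla bla")
```

Other date shapes would need their own pattern and strptime format.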