Auto-delete of documents is not working in Apache Solr

I am using Apache Solr 7 and trying to automate deletion of documents. I've followed the steps below as per the Solr documentation.
step 1. in solrconfig.xml
<updateRequestProcessorChain default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">time_to_live_s</str>
    <str name="expirationFieldName">expire_at_dt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
step 2. in managed-schema
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<field name="time_to_live_s" type="string" stored="true" multiValued="false" />
<field name="expire_at_dt" type="date" stored="true" multiValued="false" />
step 3. I created a core named sample1 and added the following document:
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/sample1/update?commit=true' -d '[{ "id":"sample_doc_1","expire_at_dt":"NOW+10SECONDS"}]'
After 10 seconds, the document is still there. Am I missing a step here, or am I doing something wrong?

I think that at indexing time you should set the time_to_live_s field, not expire_at_dt; a value of +10SECONDS (or whatever interval you want) will work.
As a reference:
ttlFieldName - Name of a field this processor should look for in each
document processed, defaulting to ttl. If the specified field name
exists in a document, the document's field value will be parsed as a
Date Math Expression relative to NOW and the result will be added to
the document using the expirationFieldName. Use <null name="ttlFieldName"/> to disable this feature.
If you want to set the expiration date directly, you should set a proper date string in expire_at_dt, not a Date Math Expression.
I have a full working example of the auto-delete configuration here.
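For instance, a sketch assuming the chain above is configured as the default chain and the core is named sample1: re-post the document with the TTL field set instead of writing expire_at_dt directly.

```shell
# Set the TTL field; the processor computes expire_at_dt relative to NOW at index time.
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/sample1/update?commit=true' \
  -d '[{"id":"sample_doc_1","time_to_live_s":"+10SECONDS"}]'
```

Note that the document is not removed at the exact expiration instant: the background delete runs every autoDeletePeriodSeconds (30 seconds in the config above), so expect up to that much extra delay.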

Related

How do I import the content of .PDF files into a Solr index?

I have a directory of pdf files: document.01.pdf, document.02.pdf, and so on. I am running Solr 6.6.2. I have run
solr create -c documents
to create a core called documents. I want to upload the pdf files to Solr and have it index the text that they contain, not just their metadata.
I understand that it's Tika's job to do the extracting, and that it's the job of solr.extraction.ExtractingRequestHandler to call Tika. My solrconfig.xml (which is just the default created by solr create) contains the following section:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>
If I run
post -c documents path-to-pdf-directory
I end up with entries in the index that contain metadata about the PDF files and an id that's the full path to the file, but not the file content. What I want is these metadata fields plus an additional field called something like text or content to contain the text in the PDFs.
Following examples like the one here, I also tried commands like
curl 'http://localhost:8983/solr/documents/update/extract?literal.id=doc1&commit=true' -F "myfile=@document.01.pdf"
but this does the same thing.
I've been searching all over for documentation on how to do this, but everything I find makes it sound like I'm doing everything right.
How do I do this? This seems like such basic functionality that the fact it isn't obvious makes me think I'm misunderstanding something fundamental.
You are asking Solr to put all extracted text in a field named _text_ (with leading and trailing underscores) with this:
<str name="fmap.content">_text_</str>
If you don't see a field like this after indexing, check that you have such a field defined in schema.xml (with the right indexed/stored attributes). You don't necessarily need to define it in schema.xml; it can work via dynamicField rules too, but for a quick verification just define it explicitly.
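For a quick verification, a minimal explicit definition might look like this (the text_general field type is an assumption; use any text type from your schema):

```xml
<field name="_text_" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

With stored="true" the extracted text also appears in query results, which makes it easy to confirm that extraction worked.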
I changed the value of fmap.content for the ExtractingRequestHandler to text_en because text_en is listed as a field type in my managed schema and the text in my documents is in English.
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">text_en</str>
</lst>
</requestHandler>
Now when I run post, the contents of my documents are indexed in a text_en field along with all the other metadata.

Solr ExtractingRequestHandler giving empty content field

I'm using Solr 6.2.1 and the ExtractingRequestHandler (already included in Solr 6.2.1) to index pdf and word documents. All documents (pdf and word) are indexed with metadata (title, date, cp_revision, company, ...) but the content field is always empty.
According to the documentation I should have a non-empty content field : "Tika adds all the extracted text to the content field."
Does anybody know why the content field is empty? According to this answer it may be because I open my file in non-binary mode, but how do I do that in binary mode?
This is my solrconfig.xml file :
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
...
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="xpath">/xhtml:html/xhtml:body/descendant::node()</str>
<str name="capture">content</str>
<str name="fmap.meta">attr_meta_</str>
<str name="uprefix">attr_</str>
<str name="lowernames">true</str>
</lst>
</requestHandler>
Try indexing with the files example in examples/files; it is designed to parse rich-text formats. If that works, you can figure out what goes wrong in your definition. I suspect the xpath parameter may be wrong and returning empty content.
I was using the solr:alpine Docker image and had the same problem. It turns out the "content" field was getting mapped to Solr's "text" field, which is indexed but not stored by default. See if adding fmap.content=doc_content to your curl request does the trick.
I was having a similar problem and fixed it by setting the /update/extract request handler to this:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">content</str>
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>
The key part is fmap.content, which maps the Tika-extracted contents to your "content" field; that field must be defined in your schema, probably with stored="true".
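A minimal sketch of such a field definition (the field type is an assumption; any stored text type from your schema works):

```xml
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
```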

How do I index multiple pdf files in Solr all at once?

Using this command
curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@maven_tutorial.pdf"
we can index a single pdf file in Solr by specifying our own id (doc1). But I want to index many pdf files into Solr all at once, letting Solr keep track of ids automatically.
Please help me.
You can use a UUID-type field as the unique key.
First, define the UUID field type:
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
Add your id field in the schema.xml
<field name="id" type="uuid" indexed="true" stored="true" multiValued="false"/>
Make this field as the unique key
<uniqueKey>id</uniqueKey>
In solrconfig.xml, define an update chain that auto-generates the id:
<updateRequestProcessorChain name="uuid">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Now attach this update chain to the request handler that extracts content from the pdf files you submit to Solr.
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>
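With the uuid chain attached, you can then post a whole directory of PDFs in one go and let Solr generate the ids, e.g. with the bin/post tool shipped with Solr (the core name documents is an assumption; adjust to yours):

```shell
# Post every PDF found under the directory to the documents core in one run.
bin/post -c documents -filetypes pdf /path/to/pdf-directory
```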

What are the ways to store and search complex numeric data?

I have some numerical data that must be searchable from a web front-end with the following format:
Toy type: Dog
Toy subtype: Spotted
Toy maker: John
Color: White
Estimated spots: 10
Actual spots: 11
Toy type: Cat
Toy subtype: Striped
Toy maker: Jane
Color: White
Estimated stripes: 5
Actual stripes: [Not yet counted]
A search query might be something like "Type:Cat, Stripes:4-6", or "Type:Dog, Subtype:Spotted", or "Color:White", or "Color:White, Maker:John".
I'm not sure if the data is best suited for a relational database because there are several types and subtypes, each with their own properties. On top of that, there are estimated and actual values for each property.
I'd like some recommendations for how to store and search this data. Please help!
EDIT: I changed the search queries so they are no longer free-form.
I recommend using Apache Solr to index and search your data.
How you use Solr depends on your requirements. I use it as a searchable cache of my data, which is extremely useful when the raw master data must be kept as files. Lots of frameworks integrate Solr as their search backend.
For building front-ends to a Solr index, check out solr-ajax.
Example
Install Solr
Download Solr distribution:
wget http://www.apache.org/dist/lucene/solr/4.7.0/solr-4.7.0.tgz
tar zxvf solr-4.7.0.tgz
Start Solr using embedded Jetty container:
cd solr-4.7.0/example
java -jar start.jar
Solr should now be running locally
http://localhost:8983/solr
data.xml
You did not specify a data format so I used the native XML supported by Solr:
<add>
<doc>
<field name="id">1</field>
<field name="toy_type_s">Dog</field>
<field name="toy_subtype_s">Spotted</field>
<field name="toy_maker_s">John</field>
<field name="color_s">White</field>
<field name="estimated_spots_i">10</field>
<field name="actual_spots_i">11</field>
</doc>
<doc>
<field name="id">2</field>
<field name="toy_type_s">Cat</field>
<field name="toy_subtype_s">Striped</field>
<field name="toy_maker_s">Jane</field>
<field name="color_s">White</field>
<field name="estimated_spots_i">5</field>
</doc>
</add>
Notes:
Every document in Solr must have a unique id
The field names carry "_s" and "_i" suffixes to indicate field types. This is a cheat to take advantage of Solr's dynamic field feature.
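Those suffixes work because the example schema ships dynamic field rules along these lines (shown here for illustration):

```xml
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_i" type="int"    indexed="true" stored="true"/>
```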
Index XML file
There are lots of ways to get data into Solr; the simplest is curl:
curl http://localhost:8983/solr/update?commit=true -H "Content-Type: text/xml" --data-binary @data.xml
It's worth noting that Solr supports other data formats, such as JSON and CSV.
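For example, the first document above could be posted as JSON instead (a sketch; fields abbreviated):

```shell
curl 'http://localhost:8983/solr/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"1","toy_type_s":"Dog","color_s":"White","estimated_spots_i":10}]'
```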
Search indexed file
Again, there are language libraries to support Solr searches; the following examples use curl. The Solr search syntax handles queries along the lines you've described.
Here's a simple example:
$ curl http://localhost:8983/solr/select/?q=toy_type_s:Cat
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="q">toy_type_s:Cat</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="id">2</str>
<str name="toy_type_s">Cat</str>
<str name="toy_subtype_s">Striped</str>
<str name="toy_maker_s">Jane</str>
<str name="color_s">White</str>
<int name="estimated_spots_i">5</int>
<long name="_version_">1463999035283079168</long>
</doc>
</result>
</response>
A more complex search example:
$ curl "http://localhost:8983/solr/select/?q=toy_type_s:Cat%20AND%20estimated_spots_i:\[2%20TO%206\]"
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int>
<lst name="params">
<str name="q">toy_type_s:Cat AND estimated_spots_i:[2 TO 6]</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="id">2</str>
<str name="toy_type_s">Cat</str>
<str name="toy_subtype_s">Striped</str>
<str name="toy_maker_s">Jane</str>
<str name="color_s">White</str>
<int name="estimated_spots_i">5</int>
<long name="_version_">1463999035283079168</long>
</doc>
</result>
</response>
You have structured the problem in a way that makes it very difficult to solve. Your data is structured, with specific columns, yet you are trying to search it with free-form queries.
So, the normal way to do this is to allow search terms for each of the fields.
The next way to approach this is as a full-text problem. That definitely has its issues: for instance, numbers are typically stop words, and values from different fields would get confused with each other.
Of course, you can try free-form search over structured data; this is, after all, something that Google and Microsoft do. If you search "airfare from New York to London" on Google, you get lists of flights. But understanding such queries is a hard problem.

Solr Highlighting Problem

Hi all, I have a problem: when I query Solr it matches results, but when I enable highlighting on the results of this query, the highlighting does not work.
My query is
+Contents:"item 503"
Contents is of type text. One important thing: in the text, "item 503" appears as "item 503(c)". Can the open parenthesis at the end create a problem? Please help.
Here is the highlighting section in solrconfig.xml:
<highlighting>
<!-- Configure the standard fragmenter -->
<!-- This could most likely be commented out in the "default" case -->
<fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
<lst name="defaults">
<int name="hl.fragsize">100</int>
</lst>
</fragmenter>
<!-- A regular-expression-based fragmenter (f.i., for sentence extraction) -->
<fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
<lst name="defaults">
<!-- slightly smaller fragsizes work better because of slop -->
<int name="hl.fragsize">70</int>
<!-- allow 50% slop on fragment sizes -->
<float name="hl.regex.slop">0.5</float>
<!-- a basic sentence pattern -->
<str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
</lst>
</fragmenter>
<!-- Configure the standard formatter -->
<formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
</highlighting>
and here is the fieldtype definition in schema.xml:
<fieldtype name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" luceneMatchVersion="LUCENE_29"/>
<filter class="solr.StandardFilterFactory"/>
<!-- <filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" luceneMatchVersion="LUCENE_29"/>
<filter class="solr.EnglishPorterFilterFactory"/>-->
</analyzer>
</fieldtype>
and here is the field definition:
<field name="Contents" type="text" indexed="true" stored="true" />
Have you tried storing the term vectors too? If you're using the FastVectorHighlighter (which I think Solr might use by default), you'll need them.
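A sketch of the Contents field with term vectors enabled (these are standard Solr schema attributes):

```xml
<field name="Contents" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

After changing the schema you will need to re-index for the term vectors to actually be stored.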