Fulltext Solr statistical search - lucene

Consider I'm having a couple of documents indexed with Solr 4.0. Each has 2 fields - unique ID and text DATA field. DATA field contains few paragraphs of text. Who could advise me what kind of analyzers/parsers I should use and how to build statistical query to find out sorted list of most frequently used words in all DATA fields of all documents.

for the most frequent terms look into the terms- and statistical component

besides the answers mentioned here, you can use the "HighFreqTerms" class: its in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command line application which lets you see the top terms for any field either by document frequency or by total term frequency (the -t option)
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which is committed and in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393

For ID field use analyzer based on keyword tokenizer - it will take all the content of the field as a single token.
For DATA field use language specific analyzer. Notice, that there's possibility to auto-detect the language of the text (patch).
I'm not sure, if it's possible to find the most frequent words with Solr, but if you can use Lucene itself, pay attention to this question. My own suggestion for it is to use HighFreqTerms class from Luke project.

Related

Lucene difference between Term and Fields

I've read a lot about Lucene indexing and searching and still can't understand what Term is?What is the difference between term and fields?
A very rough analogy would be that fields are like columns in a database table, and terms are like the contents in each database column.
More specifically to Lucene:
Terms
Terms are indexed tokens. See here:
Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms
So, for example, if you have the following sentence in a document...
"This is a list of terms"
...and you pass it through a whitespace tokenizer, this will generate the following terms:
This
is
a
list
of
terms
Terms are therefore also what you place into queries, when performing searches. See here for a definition of how they are used in the classic query parser.
Fields
A field is a section of a document.
A simple example is the title of a document versus the body (the remaining text/content) of the document. These can be defined as two separate Lucene fields within a Lucene index.
(You obviously need to be able to parse the source document so that you can separate the title from the body - otherwise you cannot populate each separate field correctly, while building your Lucene index.)
You can then place all of the title's terms into the title field; and the body's terms into the body field.
Now you can search title data separately from body data.
You can read about fields here and here. There are various different types of fields, specific to the type of data (terms) they will be holding.

How to exclude text indexed from PDF in solr query

I have a solr index generated from a catalog of PDF files and correspoing metadata fields pertaining to the pdf files themselves. Still, I would like to provide my users an option to exclude in the query any text indexed from within a PDF. This is so the query results would be based on the metadata fields instead and not biased by the vast text within the pdf files.
I have thought of maybe having two indexes (cores) - one with the indexed pdf files and one without.
Is there another way?
Sounds like you are doing a general search against a default field. Which means you have a lot of copyField instructions (or just one copyField * -> text), which include the PDF content field.
You can create a second destination and copyField everything but the PDF content field into that as well. This way, users can search against or another combined field.
However, remember that this parses all content according to the analysis chain of the destination field. So, eDisMax with a list of source fields may be a better approach there. And, remember, you can use several request handlers (like 'select') and define different default parameters there. That usually makes the client code a bit easier.
You do not need to use 2 separate indexes. You can use the edismax parser and specify the qf parameter at query time. That will help determine what fields are searched.
You can look at field aliases
If you have 3 index fields
pdfmeta
pdftext
Then you can create two field aliases
quicksearch : pdfmeta
fullsearch : pdfmeta, pdftext
One advantage of using a field alias over qf is if your users have bookmarks like q=quicksearch:value, you can change the alias for quicksearch without affecting the user's bookmark.

Apache Solr 5 - deduplicating data within a field

Here is my question (pardon the wordiness):
I have millions of documents and all of them are unique.
However, all documents contain a 'description' field and this field contains data that only has a few different variations in the text across all 10 million documents. This field is large-ish -400-800 words or so.
What is the most appropriate way to eliminate this repetition of data in the 'description' field?
Let me elaborate. Here is an example schema that been simplified:
Doc_id <-- this is unique
Title <-- always unique as well
Description <-- contains mostly dupe data
I search over both the title and description but only return the title itself.
I'm fairly new to Solr but have been unable to find any information on how to tackle a scenario like this. In case it matters, I'm running Solr 5 on Ubuntu.
Thanks for any help!
I will try to provide some strategies to tackle your problem.
You are saying that you search over title and description, this means you should set these fields to indexed=true in your schema.xml. Only title is returned, this means only title needs to be set to stored=true, description should be set to stored=false. See this posting for more information on stored vs. indexed: Solr index vs stored
Another useful option you could try is the field option compression. If you need to store a field, you can use gzip compression on certain fields, such as TextField and StrField, see: https://wiki.apache.org/solr/SchemaXml for more info.
Lastly, deduplication is supported in Solr, see: https://wiki.apache.org/solr/Deduplication. I did not try this feature, but from the sounds of it, you can prevent (nearly) duplicate documents to be indexed or tag duplicates. Maybe its goal "Allow for both duplicate collapsing in search results as well as deduplication on adding a document." is what you are looking for?

Lucene/Solr Searching problem?

I have a problem that i want to search in the specific locations in the indexed text, let we have a lucene document that contains text as
<Cover>
This document contains following items
1. Business overview.
2. Risk Factors.
3. Management
</Cover>
<BusinessOverview>
our business is xyz
</BusinessOverview>
<RiskFactors>
we have xyz risk factors
</RiskFactors>
<Management>
we have xyz type of management
</Mangement>
now in above code html tags(could be any thing) divide main document in sections now i want to have a functionality if user give some text to search and does not mention any specific section the text should be searched in whole document but user if user specify some section along with text to search, the search should be done only in that particular section. I want to know is this type of search is possible with solr/lucene.
Regards
Ahsan
You can use the <copyField> option to have a "field of fields"
se here:
http://wiki.apache.org/solr/FAQ#How_do_I_use_copyField_with_wildcards.3F
http://www.ibm.com/developerworks/java/library/j-solr1/
Your schema should reflect that need ; the data sent to the indexer would have then to match this schema properly. Once done, you'll be able to query against scpcific fields.
You could also use an xml importer.
I have never worked with solr but lucene itself has very flexible query language, see this link. So answer is yes it is possible.

All of these words feature

I have a "description" field indexed in Lucene.This field contains a book's description.
How do i achieve "All of these words" functionality on this field using BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser packagage. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1+term2 +term3 ...".