Lucene/Solr Searching problem? - lucene

I have a problem that i want to search in the specific locations in the indexed text, let we have a lucene document that contains text as
<Cover>
This document contains following items
1. Business overview.
2. Risk Factors.
3. Management
</Cover>
<BusinessOverview>
our business is xyz
</BusinessOverview>
<RiskFactors>
we have xyz risk factors
</RiskFactors>
<Management>
we have xyz type of management
</Mangement>
now in above code html tags(could be any thing) divide main document in sections now i want to have a functionality if user give some text to search and does not mention any specific section the text should be searched in whole document but user if user specify some section along with text to search, the search should be done only in that particular section. I want to know is this type of search is possible with solr/lucene.
Regards
Ahsan

You can use the <copyField> option to have a "field of fields"
se here:
http://wiki.apache.org/solr/FAQ#How_do_I_use_copyField_with_wildcards.3F
http://www.ibm.com/developerworks/java/library/j-solr1/

Your schema should reflect that need ; the data sent to the indexer would have then to match this schema properly. Once done, you'll be able to query against scpcific fields.
You could also use an xml importer.

I have never worked with solr but lucene itself has very flexible query language, see this link. So answer is yes it is possible.

Related

How to exclude text indexed from PDF in solr query

I have a solr index generated from a catalog of PDF files and correspoing metadata fields pertaining to the pdf files themselves. Still, I would like to provide my users an option to exclude in the query any text indexed from within a PDF. This is so the query results would be based on the metadata fields instead and not biased by the vast text within the pdf files.
I have thought of maybe having two indexes (cores) - one with the indexed pdf files and one without.
Is there another way?
Sounds like you are doing a general search against a default field. Which means you have a lot of copyField instructions (or just one copyField * -> text), which include the PDF content field.
You can create a second destination and copyField everything but the PDF content field into that as well. This way, users can search against or another combined field.
However, remember that this parses all content according to the analysis chain of the destination field. So, eDisMax with a list of source fields may be a better approach there. And, remember, you can use several request handlers (like 'select') and define different default parameters there. That usually makes the client code a bit easier.
You do not need to use 2 separate indexes. You can use the edismax parser and specify the qf parameter at query time. That will help determine what fields are searched.
You can look at field aliases
If you have 3 index fields
pdfmeta
pdftext
Then you can create two field aliases
quicksearch : pdfmeta
fullsearch : pdfmeta, pdftext
One advantage of using a field alias over qf is if your users have bookmarks like q=quicksearch:value, you can change the alias for quicksearch without affecting the user's bookmark.

Lucene Field.Store.YES versus Field.Store.NO

Will someone please explain under what circumstance I may use Field.Store.NO instead of Field.Store.YES? I am extremely new to Lucene. And I am trying to create a document. Per my basic knowledge, I am doing
doc.add(new StringField(fieldNameA,fieldValueA,Field.Store.YES));
doc.add(new TextField(fieldNameB,fieldValueB,Field.Store.YES));
There are two basic ways a document can be written into Lucene.
Indexed - The field is analyzed and indexed, and can be searched.
Stored - The field's full text is stored and will be returned with search results.
If a document is indexed but not stored, you can search for it, but it won't be returned with search results.
One reasonably common pattern is to use lucene for search, but only have an ID field being stored which can be used to retrieve the full contents of the document/record from, for instance, a SQL database, a file system, or an web resource.
You might also opt not to store a field when that field is just a search tool, but you wouldn't display it to the user, such as a soundex/metaphone, or an alternate analysis of a content field.
Use Field.Store.YES when you need a document back from Lucene document. Use NO when you just need a search from document. Here is a link explained with a scenario.
https://handyopinion.com/java-lucene-saving-fields-or-not/

Apache Solr 5 - deduplicating data within a field

Here is my question (pardon the wordiness):
I have millions of documents and all of them are unique.
However, all documents contain a 'description' field and this field contains data that only has a few different variations in the text across all 10 million documents. This field is large-ish -400-800 words or so.
What is the most appropriate way to eliminate this repetition of data in the 'description' field?
Let me elaborate. Here is an example schema that been simplified:
Doc_id <-- this is unique
Title <-- always unique as well
Description <-- contains mostly dupe data
I search over both the title and description but only return the title itself.
I'm fairly new to Solr but have been unable to find any information on how to tackle a scenario like this. In case it matters, I'm running Solr 5 on Ubuntu.
Thanks for any help!
I will try to provide some strategies to tackle your problem.
You are saying that you search over title and description, this means you should set these fields to indexed=true in your schema.xml. Only title is returned, this means only title needs to be set to stored=true, description should be set to stored=false. See this posting for more information on stored vs. indexed: Solr index vs stored
Another useful option you could try is the field option compression. If you need to store a field, you can use gzip compression on certain fields, such as TextField and StrField, see: https://wiki.apache.org/solr/SchemaXml for more info.
Lastly, deduplication is supported in Solr, see: https://wiki.apache.org/solr/Deduplication. I did not try this feature, but from the sounds of it, you can prevent (nearly) duplicate documents to be indexed or tag duplicates. Maybe its goal "Allow for both duplicate collapsing in search results as well as deduplication on adding a document." is what you are looking for?

How do I get accurate search result in Lucene using Query syntax

So far I have been testing the keywords that I inputted in Sitecore using the query syntax but the search result does not rank the page first.
For example if I put query syntax on the word book....(title:book)^1
I want the index page that is name book to appear first in the search result and not bookmark.
Also, every time I publish a new page in Sitecore the keywords for the word Book get push down to the last result or doesn't appear in the search page.
How do I get accurate result in Lucene for the search engine page?
Also I've been following http://www.lucenetutorial.com/lucene-query-syntax.html about how to increase search result but it doesn't work.
Can someone explain how the boost of the search term works.
I recommend you leverage the Advanced Database Crawler to get the best use of Lucene.NET with Sitecore. From that, there's a config file for the indexes with a section called <dynamicFields ... >. In that section, you can specify an individual Sitecore field and adjust the boost attribute. The default boost for every field is 1f which is 1 floating point.
More reading:
Sitecore Searcher and Advanced Database Crawler
Source code for the ADC

Fulltext Solr statistical search

Consider I'm having a couple of documents indexed with Solr 4.0. Each has 2 fields - unique ID and text DATA field. DATA field contains few paragraphs of text. Who could advise me what kind of analyzers/parsers I should use and how to build statistical query to find out sorted list of most frequently used words in all DATA fields of all documents.
for the most frequent terms look into the terms- and statistical component
besides the answers mentioned here, you can use the "HighFreqTerms" class: its in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command line application which lets you see the top terms for any field either by document frequency or by total term frequency (the -t option)
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which is committed and in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393
For ID field use analyzer based on keyword tokenizer - it will take all the content of the field as a single token.
For DATA field use language specific analyzer. Notice, that there's possibility to auto-detect the language of the text (patch).
I'm not sure, if it's possible to find the most frequent words with Solr, but if you can use Lucene itself, pay attention to this question. My own suggestion for it is to use HighFreqTerms class from Luke project.