Apache Solr - Need to index the same documents two times for plain text query to work

I have a dataset of 76000 large JSON documents. I indexed it into Solr with post.jar:
java -jar -Dc=sampleindex -Dauto example/exampledocs/post.jar "/home/sample/sample.json"
I can see in the Solr UI dashboard that the 76000 documents are indexed, and I can run field-based searches against the index.
For example, if customer_name is one of the fields, I can search with
http://localhost:8983/solr/sampleindex/select?fl=customer_name&q=customer_name%3Arachel
I get results from Solr. However, when I search with plain text only
http://localhost:8983/solr/sampleindex/select?q=rachel
I don't get any results.
I have to index the same 76000 documents a second time, which brings the total document count to 152000.
Now, if I search with the plain text "rachel"
http://localhost:8983/solr/sampleindex/select?q=rachel
I get results.
I am not sure if something is wrong with the way I am indexing.
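One thing worth checking, offered as an assumption rather than a confirmed diagnosis: with Solr's standard query parser, a term without a field prefix (q=rachel) is searched only in the default field named by the df parameter. The sketch below builds both URL shapes so the plain-text query can be pointed at customer_name explicitly. The class and method names are hypothetical; the host, collection, and field come from the question.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SolrQueryUrl {
    // Builds a /select URL. df (may be null) names the field that is searched
    // for query terms that carry no "field:" prefix.
    static String selectUrl(String base, String collection, String q, String df) {
        StringBuilder url = new StringBuilder(base)
                .append("/").append(collection)
                .append("/select?q=")
                .append(URLEncoder.encode(q, StandardCharsets.UTF_8));
        if (df != null) {
            url.append("&df=").append(URLEncoder.encode(df, StandardCharsets.UTF_8));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr";
        // Fielded query, as in the question.
        System.out.println(selectUrl(base, "sampleindex", "customer_name:rachel", null));
        // Plain-text query with an explicit default field.
        System.out.println(selectUrl(base, "sampleindex", "rachel", "customer_name"));
    }
}
```

If the second URL returns hits while the bare q=rachel does not, the default field is the place to look; if both behave the same, the cause lies elsewhere.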

Related

How Solr indexes documents?

I'm new to Solr and I want to understand exactly how it indexes documents.
Let's say I have a 100 MB document (document1) full of text. The text is not structured, it's just raw text. I send that document to Solr in order to be indexed.
As far as I understood, Lucene will parse the document, extract all the words, based on the default schema (let's assume we're using the default schema), and create an index that is basically a mapping between a word and a list of documents, like so:
word1 -> [document1]
word2 -> [document1]
etc
Now, if I want to search for the word "word1", Solr will give me the entire 100 MB document that contains the word "word1", correct?
Please correct me if I'm wrong, I need to understand exactly how it works.

You described most of the indexing part reasonably well, at least at a high level. The reason you get the whole document back is that your field is a stored one in your Solr schema (which is the default, at least).
This means, that apart from having postings list of
word1 -> doc1, doc3
word2 -> doc2, doc3
etc.
Solr/Lucene also stores the original content of the field, so it is able to return it to you. You can either turn this off explicitly by setting stored=false in your schema, or filter the field out with the fl parameter and request only fl=id (or something similar).
If you would like to return only the parts of the document around the matched terms, you can do that with the Solr Highlighting feature. Highlighting in Solr allows fragments of documents that match the user’s query to be included with the query response.
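The postings list and the stored-field behaviour described in this answer can be illustrated with a toy model in plain Java. This is only a sketch of the idea, not Lucene's or Solr's actual data structures; ToyIndex and its methods are hypothetical names, and the "analyzer" is just lowercasing plus splitting on non-word characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

class ToyIndex {
    // term -> ids of the documents that contain it (the postings list)
    private final Map<String, Set<String>> postings = new HashMap<>();
    // id -> original text, kept only for documents added with store=true
    private final Map<String, String> stored = new HashMap<>();

    void add(String id, String text, boolean store) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        if (store) {
            stored.put(id, text);
        }
    }

    // Returns the ids of matching documents, whether or not their text is stored.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    // The original text can only be returned if it was stored.
    Optional<String> storedText(String id) {
        return Optional.ofNullable(stored.get(id));
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("doc1", "word1", true);
        index.add("doc2", "word2", false);       // searchable, but text not kept
        index.add("doc3", "word1 word2", true);
        System.out.println(index.search("word1"));                 // [doc1, doc3]
        System.out.println(index.storedText("doc2").isPresent());  // false
    }
}
```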

Lucene Field.Store.YES versus Field.Store.NO

Will someone please explain under what circumstances I would use Field.Store.NO instead of Field.Store.YES? I am extremely new to Lucene and I am trying to create a document. Based on my basic knowledge, I am doing
doc.add(new StringField(fieldNameA,fieldValueA,Field.Store.YES));
doc.add(new TextField(fieldNameB,fieldValueB,Field.Store.YES));

There are two basic ways a field can be written into Lucene.
Indexed - The field is analyzed and indexed, and can be searched.
Stored - The field's full text is stored and will be returned with search results.
If a field is indexed but not stored, you can search on it, but its value won't be returned with search results.
One reasonably common pattern is to use Lucene for search but store only an ID field, which is then used to retrieve the full contents of the document/record from, for instance, a SQL database, a file system, or a web resource.
You might also opt not to store a field when it is purely a search aid that you wouldn't display to the user, such as a soundex/metaphone encoding or an alternate analysis of a content field.
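The "store only the ID" pattern can be sketched in plain Java as follows. This is a toy model under stated assumptions, not Lucene code: the two maps stand in for the search index and for the external system of record, and IdOnlySearch, save, and find are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class IdOnlySearch {
    // Stand-in for the external system of record (SQL table, file system, ...).
    private final Map<Integer, String> database = new HashMap<>();
    // Stand-in for the search index: terms are indexed, only the id is kept.
    private final Map<String, List<Integer>> index = new HashMap<>();

    void save(int id, String body) {
        database.put(id, body);
        for (String term : body.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = index.computeIfAbsent(term, t -> new ArrayList<>());
            if (!ids.contains(id)) {
                ids.add(id);
            }
        }
    }

    // Step 1: the index yields ids only. Step 2: full records come from the database.
    List<String> find(String term) {
        List<String> results = new ArrayList<>();
        for (int id : index.getOrDefault(term.toLowerCase(Locale.ROOT), List.of())) {
            results.add(database.get(id));
        }
        return results;
    }

    public static void main(String[] args) {
        IdOnlySearch search = new IdOnlySearch();
        search.save(1, "Invoice for Rachel");
        search.save(2, "Invoice for Monica");
        System.out.println(search.find("rachel")); // [Invoice for Rachel]
    }
}
```

The index stays small because it never holds the record bodies, at the cost of a second lookup per hit.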

Use Field.Store.YES when you need the field's value back from the Lucene document. Use Field.Store.NO when you only need to search on the field. Here is a link that explains this with a scenario.
https://handyopinion.com/java-lucene-saving-fields-or-not/

lucene multiple documents created for large and unique number of resources?

I am a beginner in Lucene search. Suppose I have a collection of resources like:
id, name, {list of products}, {list of keywords}. I want to search based on name, products, or keywords, and I have some doubts about Lucene and its usage:
1) For document creation, I create a document with the structure id, name, products (multiple values), keywords (multiple values). If I have a thousand unique resources, will it create 1000 unique documents?
2) Also, if I make the name and products fields searchable (as StringField), will the search result (the ScoreDocs) contain exactly the set of documents that have the text I searched for?

Q> <..> will it create 1000 unique documents?
A> Lucene doesn't have the concept of "uniqueness" - it is only in your head. Alternatively, think of it as if all documents are unique for Lucene: each document you add is a separate document, so a thousand resources give you a thousand documents. If you search by these fields, relevant documents will be returned.
Q> <..> will the result also contains(ScoreDocs contains) exactly the same set of documents that has the text I searched?
A> Strange/unclear question. If you search for all documents, you will get all documents. If your search query only matches some documents, only those documents will be returned. The internals are more complex - it all depends on how you analyze the text. Maybe you can give a more concrete example with use cases?
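To make "it all depends on how you analyze the text" concrete: a StringField is indexed as one untokenized term, while a TextField goes through an analyzer that splits the value into terms. The plain-Java sketch below imitates that difference. It is a toy model, not Lucene code; the method names and the product name are made up, and the "analyzer" is only lowercasing plus splitting on non-word characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenizationSketch {
    // StringField-like: the whole value is indexed as a single term.
    static List<String> asSingleTerm(String value) {
        return List.of(value);
    }

    // TextField-like, with a deliberately simplified analyzer.
    static List<String> asAnalyzedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // A term query matches only if the exact term is in the index.
    static boolean matches(List<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        String product = "Red Wine Glass";
        System.out.println(matches(asSingleTerm(product), "wine"));           // false
        System.out.println(matches(asSingleTerm(product), "Red Wine Glass")); // true
        System.out.println(matches(asAnalyzedTerms(product), "wine"));        // true
    }
}
```

So with StringField, a search returns only documents whose field value equals the query term exactly; matching on individual words inside the value needs an analyzed field.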

How do I get accurate search result in Lucene using Query syntax

So far I have been testing the keywords that I entered in Sitecore using the query syntax, but the search results do not rank the expected page first.
For example, if I use the query syntax on the word book: (title:book)^1
I want the index page that is named book to appear first in the search results, and not bookmark.
Also, every time I publish a new page in Sitecore, the results for the keyword Book get pushed down to the end of the list or don't appear on the search page at all.
How do I get accurate results from Lucene for the search engine page?
I've also been following http://www.lucenetutorial.com/lucene-query-syntax.html on how to boost search results, but it doesn't work.
Can someone explain how boosting a search term works?

I recommend you leverage the Advanced Database Crawler to get the best use of Lucene.NET with Sitecore. It comes with a config file for the indexes that has a section called <dynamicFields ... >. In that section, you can specify an individual Sitecore field and adjust its boost attribute. The default boost for every field is 1f, i.e. the floating-point value 1.
More reading:
Sitecore Searcher and Advanced Database Crawler
Source code for the ADC

Will Lucene ALWAYS return ALL the documents that match my query same way as the SQL select query does?

I'm using Lucene to index the values that I'm storing in an object database. I'm storing a reference (UUID) to the object along with the field names and their corresponding values (Lucene Fields) in the Lucene Document.
My question is will Lucene ALWAYS return ALL the documents that match my query?
Thanks.

It depends on the analyzer you are using. You can also limit the number of results when searching, in which case only that many top hits are returned.
For better searching you can also use Apache's open source search platform, Solr.
