Lucene: are multiple documents created for a large number of unique resources?

I am a beginner with Lucene search. I have a collection of resources, each shaped like:
id, name, {list of products}, {list of keywords}
I want to search by name, product, or keyword, and I have some doubts about how Lucene is used:
1) For document creation, I create a document with the structure id, name, products (multiple values), keywords (multiple values). If I have a thousand unique resources, will that create 1000 unique documents?
2) If I make the name and products fields searchable (as StringField), will the search result (the ScoreDocs) contain exactly the set of documents that have the text I searched for?

Q> <..> will it create 1000 unique documents?
A> Lucene doesn't have the concept of "uniqueness"; it exists only in your head. Alternatively, think of it as if every document is unique to Lucene: each document you add becomes its own entry in the index, so adding 1000 resources gives you 1000 documents. If you search by these fields, the relevant documents will be returned.
Q> <..> will the result (the ScoreDocs) contain exactly the set of documents that have the text I searched for?
A> Strange/unclear question. If you search for all documents, you will get all documents. If your query matches only some documents, only those documents will be returned. The internals are more complex: it all depends on how you analyze the text. For example, a StringField is indexed verbatim as a single token, so it matches only a query for the exact whole value, whereas a TextField is split into terms by the analyzer. Maybe you can give a more concrete example with use cases?

Related

Lucene difference between Term and Fields

I've read a lot about Lucene indexing and searching and still can't understand what a Term is. What is the difference between terms and fields?
A very rough analogy would be that fields are like columns in a database table, and terms are like the contents in each database column.
More specifically to Lucene:
Terms
Terms are indexed tokens. See here:
Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms.
So, for example, if you have the following sentence in a document...
"This is a list of terms"
...and you pass it through a whitespace tokenizer, this will generate the following terms:
This
is
a
list
of
terms
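As a rough illustration of the splitting step above, here is a minimal plain-Java sketch. It only imitates what a whitespace tokenizer does and does not use Lucene itself; the class name WhitespaceTermsSketch and the method terms are made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java imitation of whitespace tokenization; this is NOT Lucene's tokenizer.
class WhitespaceTermsSketch {

    // Splits the text on runs of whitespace and returns the terms in order.
    static List<String> terms(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // Prints each term on its own line: This, is, a, list, of, terms
        for (String term : terms("This is a list of terms")) {
            System.out.println(term);
        }
    }
}
```

Note that nothing is lowercased or dropped here: a whitespace-only split keeps "This" with its capital letter and keeps short words such as "a".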
Terms are therefore also what you place into queries, when performing searches. See here for a definition of how they are used in the classic query parser.
Fields
A field is a section of a document.
A simple example is the title of a document versus the body (the remaining text/content) of the document. These can be defined as two separate Lucene fields within a Lucene index.
(You obviously need to be able to parse the source document so that you can separate the title from the body - otherwise you cannot populate each separate field correctly, while building your Lucene index.)
You can then place all of the title's terms into the title field; and the body's terms into the body field.
Now you can search title data separately from body data.
You can read about fields here and here. There are various different types of fields, specific to the type of data (terms) they will be holding.

Lucene Field.Store.YES versus Field.Store.NO

Will someone please explain under what circumstance I may use Field.Store.NO instead of Field.Store.YES? I am extremely new to Lucene. And I am trying to create a document. Per my basic knowledge, I am doing
doc.add(new StringField(fieldNameA,fieldValueA,Field.Store.YES));
doc.add(new TextField(fieldNameB,fieldValueB,Field.Store.YES));
There are two basic ways a field can be written into Lucene.
Indexed - The field is indexed (and, for analyzed types such as TextField, tokenized first), and can be searched.
Stored - The field's full text is stored and will be returned with search results.
If a field is indexed but not stored, you can search on it, but its value won't be returned with search results.
One reasonably common pattern is to use Lucene for search but store only an ID field, which can then be used to retrieve the full contents of the document/record from, for instance, a SQL database, a file system, or a web resource.
You might also opt not to store a field when that field is just a search tool that you wouldn't display to the user, such as a soundex/metaphone encoding or an alternate analysis of a content field.
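To make the "search in the index, fetch the record elsewhere" pattern concrete, here is a small plain-Java model. It is only a sketch and does not use Lucene's API: the class StoreOnlyIdSketch, its methods, and the in-memory "database" map are all hypothetical stand-ins.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of "index the content, store only the ID"; NOT Lucene's API.
class StoreOnlyIdSketch {
    // term -> internal document numbers (models the indexed, searchable side)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // internal document number -> record ID (models the only stored field)
    private final Map<Integer, String> storedIds = new HashMap<>();
    private int nextDoc = 0;

    // Makes the content searchable but keeps only the record ID retrievable.
    void add(String recordId, String content) {
        int doc = nextDoc++;
        storedIds.put(doc, recordId);
        for (String term : content.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(doc);
            }
        }
    }

    // Returns the stored IDs of matching documents; the content is not kept here.
    Set<String> search(String term) {
        Set<String> ids = new TreeSet<>();
        Set<Integer> docs =
            postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        for (int doc : docs) {
            ids.add(storedIds.get(doc));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical external store standing in for a SQL table or file system.
        Map<String, String> database = new HashMap<>();
        database.put("post-1", "Apache Lucene Java");
        database.put("post-2", "Java Library");

        StoreOnlyIdSketch index = new StoreOnlyIdSketch();
        for (Map.Entry<String, String> row : database.entrySet()) {
            index.add(row.getKey(), row.getValue());
        }
        // The search yields IDs only; the full record comes from the external store.
        for (String id : index.search("lucene")) {
            System.out.println(id + " -> " + database.get(id)); // post-1 -> Apache Lucene Java
        }
    }
}
```

The trade-off is the same as in the real thing: the index stays smaller because the content is not duplicated in it, at the cost of a second lookup for every hit you want to display.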
Use Field.Store.YES when you need the field's value back from the Lucene document. Use NO when you only need to search on the field. Here is a link that explains this with a scenario:
https://handyopinion.com/java-lucene-saving-fields-or-not/

Sitecore Search Ranking

Does Sitecore/Lucene support filtering/ranking of content?
I cannot find any related documentation.
Lucene returns ranked results, and you can structure queries to filter results using the QueryOccurance.MustNot clause, or to boost results using the QueryOccurance.Should clause.
From Sitecore's documentation of the QueryOccurance class:
Lucene uses the following operators for the search terms in complex queries:
- Must – the search term must occur in the document to be included into the search results.
- Should – the search term may occur in the document but is not necessary, and the document may be included in search results based on other criteria. However, the documents containing the search term are ranked higher than equivalent documents that do not contain the search term.
- Must not – the search term must not occur in the document in order to be included in the search results. Documents with the search term will be excluded from the results.
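The three operators quoted above can be modelled in a few lines of plain Java. This is a toy sketch, not Sitecore's QueryOccurance class or Lucene's query API: the names OccurrenceSketch, Occur, Clause, and score are invented here, and the "score" is just a count of matched clauses rather than a real relevance score. It also assumes that, when a query has no Must clause, at least one Should clause has to match for the document to be included.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of Must / Should / Must-not clauses; NOT Sitecore's or Lucene's API.
class OccurrenceSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    static final class Clause {
        final String term;
        final Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    // Returns -1 if the document is excluded, otherwise the number of matching
    // MUST and SHOULD clauses (a crude stand-in for a relevance score).
    static int score(Set<String> docTerms, Clause... clauses) {
        int score = 0;
        boolean hasMust = false;
        for (Clause c : clauses) {
            boolean present = docTerms.contains(c.term);
            switch (c.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) return -1; // a required term is missing
                    score++;
                    break;
                case MUST_NOT:
                    if (present) return -1;  // a forbidden term is present
                    break;
                default:                     // SHOULD: optional, but raises the rank
                    if (present) score++;
            }
        }
        // Assumption: with no MUST clause, at least one SHOULD clause has to match.
        if (!hasMust && score == 0) return -1;
        return score;
    }

    public static void main(String[] args) {
        Clause[] query = {
            new Clause("lucene", Occur.MUST),
            new Clause("sitecore", Occur.SHOULD),
            new Clause("draft", Occur.MUST_NOT)
        };
        Set<String> a = new HashSet<>(Arrays.asList("lucene", "search", "sitecore"));
        Set<String> b = new HashSet<>(Arrays.asList("lucene", "search"));
        Set<String> c = new HashSet<>(Arrays.asList("lucene", "draft"));
        System.out.println(score(a, query)); // 2: Must and Should both match
        System.out.println(score(b, query)); // 1: only Must matches, so it ranks lower
        System.out.println(score(c, query)); // -1: excluded by Must not
    }
}
```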
Some additional resources for Lucene in Sitecore:
Syntax of Lucene Queries: http://sitecoregadgets.blogspot.com/2009/11/working-with-lucene-search-index-in_25.html
Lucene Walkthrough: http://learnsitecore.cmsuniverse.net/en/Developers/Articles/2009/06/LuceneQuery1.aspx
Alex Shyba's Lucene posts: http://sitecoreblog.alexshyba.com/search/label/lucene
This question may also be useful: Sitecore + Lucene + QueryOccurance.Should not returning desired results
Sitecore has built-in sitecore_master_content, sitecore_web_content, and sitecore_core_content indexes, which index all the content in Sitecore and already have an API for searching them. You can specify a boosting value in the "Indexing" section of a Sitecore item (by default it is empty).
You can also set boosting for individual fields in your search query.

Querying by user name in Lucene

I want to provide a search function on a blog website. I want to search not only across all documents, but also within just one author's documents.
Since I want to use Lucene to provide a full-text index, how can I do this when creating the index?
Indexing the author's name as a separate field would let you search for all documents containing "Lucene" with an author of "fisher", for example ("lucene author:fisher" in QueryParser syntax).

What are indexes in Lucene?

What are indexes in Lucene, and how do they work?
I have gone through some articles on the net and on Google, but I could not fully understand the concepts of the index, documents, etc.
Please help if anyone can explain, in simple terms, what an index is and what indexing means.
Thanks!
Say you have a bunch of information you would like to make searchable. For example, some HTML files, some PDFs, and some information stored in a database. When a user does a search, you could write a search engine that trawls through this information and returns results that match. However, this is typically way too slow for large sets of data.
So in advance of running our application, we create an index of the information that needs to be searchable. The index contains a summary of each piece of information we would like to include in the search. In Lucene, the summary for an information piece is called a document. A document contains a number of fields.
When creating the index you decide which fields to include based on what you would like to be searchable. For example, you may include a title, an id, category string and so forth. Once the fields are defined you create a document in the index for each information item (html, pdf, database entries etc). This process is called indexing.
The search engine can now use the index to search for things. The index is highly optimized for the typical searches that we do. You can search for information in specific fields and do boolean logic. You can search for precise matches or fuzzy ones. And the search engine will weigh/score your documents in the index, returning the most relevant first.
Hope that helps at a high level.
Lucene creates an inverted full-text index: it splits the documents into words and records, for each word, which documents contain it.
For instance:
Document 1: "Apache Lucene Java"
Document 2: "Java Library"
Inverted index (tokens shown lowercased):
Token    Documents
apache   1
java     1, 2
library  2
lucene   1
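The same idea can be sketched in plain Java. This is only a toy model of an inverted index, not how Lucene stores it internally; InvertedIndexSketch and build are names made up for this example, and tokens are produced by a simple lowercase-and-split-on-whitespace step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: token -> numbers of the documents that contain it.
class InvertedIndexSketch {

    // Documents are numbered from 1 in the order they appear in the list.
    static Map<String, List<Integer>> build(List<String> documents) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < documents.size(); i++) {
            int docId = i + 1;
            for (String token : documents.get(i).toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(token, t -> new ArrayList<>());
                // Avoid listing a document twice when a token repeats within it.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index =
            build(Arrays.asList("Apache Lucene Java", "Java Library"));
        System.out.println(index); // {apache=[1], java=[1, 2], library=[2], lucene=[1]}
    }
}
```

Looking up a word is then a single map access (for example index.get("java")), which is why searching the index is so much faster than scanning every document.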
Let's expand this further and consider a document with two fields, body and title.
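import org.apache.lucene.document.*;
// TextField is tokenized; StringField is not (older releases: Field.Index.TOKENIZED / UN_TOKENIZED).
Document doc = new Document();
doc.add(new TextField("body", "This is my Test document", Field.Store.YES));
doc.add(new StringField("title", "Test document", Field.Store.YES));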
You have the flexibility to tokenize or not tokenize a Field.
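Lucene has various analyzers. Using the StandardAnalyzer:
Analyzer analyzer = new StandardAnalyzer();
the body field is lowercased and split into terms. In releases where StandardAnalyzer removes English stop words by default, "this" and "is" are dropped, leaving "my", "test", "document". The untokenized title field is indexed as the single term "Test document".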