What are indexes in Lucene? - lucene

What are indexes in Lucene, and how do they work?
I have gone through some articles on the web, but I could not fully understand the concepts of the index, documents, and so on.
Could anyone explain the terms "index" and "indexing" in simple terms?
Thanks !

Say you have a bunch of information you would like to make searchable. For example, some HTML files, some PDFs, and some information stored in a database. When a user does a search, you could write a search engine that trawls through this information and returns results that match. However, this is typically way too slow for large sets of data.
So in advance of running our application, we create an index of the information that needs to be searchable. The index contains a summary of each piece of information we would like to include in the search. In Lucene, the summary for an information piece is called a document. A document contains a number of fields.
When creating the index you decide which fields to include based on what you would like to be searchable. For example, you may include a title, an id, category string and so forth. Once the fields are defined you create a document in the index for each information item (html, pdf, database entries etc). This process is called indexing.
The search engine can now use the index to search for things. The index is highly optimized for the typical searches that we do. You can search for information in specific fields and do boolean logic. You can search for precise matches or fuzzy ones. And the search engine will weigh/score your documents in the index, returning the most relevant first.
Hope that helps at a high level.

Lucene creates an inverted full-text index: it splits the documents into words and records, for each word, which documents contain it.
For instance:
Document 1: "Apache Lucene Java"
Document 2: "Java Library"
Inverted index:
Token     Documents
apache    1
java      1, 2
library   2
lucene    1
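To make the table above concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene's implementation or API; the class name `InvertedIndexSketch` and its methods are made up for illustration, and tokenization is assumed to be a simple lowercase-and-split-on-whitespace step.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; Lucene's real index structures are far more elaborate.
class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Lowercase the text, split it on whitespace, and record the document id under each token.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look up a single term; unknown terms yield an empty set.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Apache Lucene Java");
        index.add(2, "Java Library");
        System.out.println(index.search("java"));   // prints [1, 2]
        System.out.println(index.search("lucene")); // prints [1]
    }
}
```

A search never scans the documents themselves; it only looks up the term in the map, which is why this layout stays fast as the collection grows.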
Let's expand this further and consider a document with two fields, body and title.
// Older Lucene (2.x) Field API; later versions renamed these flags to ANALYZED / NOT_ANALYZED.
Document doc = new Document();
doc.add(new Field("body", "This is my Test document", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("title", "Test document", Field.Store.YES, Field.Index.UN_TOKENIZED));
You have the flexibility to tokenize or not tokenize a Field.
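Lucene has various analyzers. Using the StandardAnalyzer,
Analyzer analyzer = new StandardAnalyzer();
the body of the document above would be tokenized into "my", "test", "document" (the StandardAnalyzer of that era lowercases tokens and drops English stop words such as "this" and "is"), while the untokenized title is indexed as the single term "Test document".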

Related

Lucene difference between Term and Fields

I've read a lot about Lucene indexing and searching and still can't understand what a term is. What is the difference between terms and fields?
A very rough analogy would be that fields are like columns in a database table, and terms are like the contents in each database column.
More specifically to Lucene:
Terms
Terms are indexed tokens. See here:
Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms
So, for example, if you have the following sentence in a document...
"This is a list of terms"
...and you pass it through a whitespace tokenizer, this will generate the following terms:
This
is
a
list
of
terms
Terms are therefore also what you place into queries, when performing searches. See here for a definition of how they are used in the classic query parser.
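As a rough illustration of the whitespace example above, the following plain-Java sketch splits a sentence into terms. It only mimics what a whitespace tokenizer does and does not use Lucene's analysis classes; `WhitespaceTermsSketch` and `terms` are names invented for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a whitespace tokenizer; not Lucene's API.
class WhitespaceTermsSketch {
    // Split on runs of whitespace. No lowercasing or stop-word removal is applied,
    // so "This" and "this" would be two different terms.
    static List<String> terms(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(terms("This is a list of terms"));
        // prints [This, is, a, list, of, terms]
    }
}
```

Because the index stores exactly these strings, a query term has to go through the same analysis step to match them.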
Fields
A field is a section of a document.
A simple example is the title of a document versus the body (the remaining text/content) of the document. These can be defined as two separate Lucene fields within a Lucene index.
(You obviously need to be able to parse the source document so that you can separate the title from the body - otherwise you cannot populate each separate field correctly, while building your Lucene index.)
You can then place all of the title's terms into the title field; and the body's terms into the body field.
Now you can search title data separately from body data.
You can read about fields here and here. There are various different types of fields, specific to the type of data (terms) they will be holding.

Lucene Field.Store.YES versus Field.Store.NO

Will someone please explain under what circumstance I may use Field.Store.NO instead of Field.Store.YES? I am extremely new to Lucene. And I am trying to create a document. Per my basic knowledge, I am doing
doc.add(new StringField(fieldNameA,fieldValueA,Field.Store.YES));
doc.add(new TextField(fieldNameB,fieldValueB,Field.Store.YES));
There are two basic ways a field can be written into Lucene.
Indexed - The field is analyzed and indexed, and can be searched.
Stored - The field's full text is stored and will be returned with search results.
If a field is indexed but not stored, you can search on it, but its value won't be returned with search results.
One reasonably common pattern is to use Lucene for search, but only have an ID field being stored which can be used to retrieve the full contents of the document/record from, for instance, a SQL database, a file system, or a web resource.
You might also opt not to store a field when that field is just a search tool, but you wouldn't display it to the user, such as a soundex/metaphone, or an alternate analysis of a content field.
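The "store only the ID" pattern can be sketched in plain Java as follows. This is a toy model under stated assumptions, not Lucene code: `StoredVsIndexedSketch`, its methods, and the `row-42` style ids are hypothetical, and a `Map` stands in for the external database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of "indexed but not stored"; not Lucene's API.
class StoredVsIndexedSketch {
    // Indexed side: term -> internal document numbers (searchable only).
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Stored side: internal document number -> stored field values (only "id" here).
    private final List<Map<String, String>> stored = new ArrayList<>();

    // Index the body for searching, but store nothing except the external id.
    void add(String externalId, String body) {
        int docNum = stored.size();
        stored.add(Map.of("id", externalId));
        for (String token : body.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docNum);
            }
        }
    }

    // Return the stored fields of every document whose body contains the term.
    List<Map<String, String>> search(String term) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (int docNum : postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of())) {
            hits.add(stored.get(docNum));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for the system of record (SQL database, file system, ...).
        Map<String, String> database = Map.of(
                "row-42", "Full text search with Lucene",
                "row-43", "Relational database");

        StoredVsIndexedSketch index = new StoredVsIndexedSketch();
        database.forEach(index::add);

        for (Map<String, String> hit : index.search("lucene")) {
            // The hit carries only the id; the full text comes from the database.
            System.out.println(hit.get("id") + " -> " + database.get(hit.get("id")));
        }
    }
}
```

The body text is searchable but cannot be read back from the index, which keeps the index small and leaves the database as the single source of the full record.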
Use Field.Store.YES when you need the field's value back from the Lucene document in the search results. Use Field.Store.NO when you only need to search on the field. Here is a link that explains this with a scenario:
https://handyopinion.com/java-lucene-saving-fields-or-not/

lucene multiple documents created for large and unique number of resources?

I am a beginner in Lucene search. Suppose I have a collection of resources like:
id, name, {list of products}, {list of keywords}. I want to search based on name, products, or keywords, and I have some doubts about Lucene and its usage:
1) For document creation, I create a document that has the structure id, name, products (multiple values), keywords (multiple values). If I have a thousand unique resources, will it create 1000 unique documents?
2) Also, if I make the name and products fields searchable (as StringField), then after searching, will the result (the ScoreDocs) contain exactly the set of documents that have the text I searched for?
Q> <..> will it create 1000 unique documents?
A> Lucene doesn't have the concept of "uniqueness" - it is only in your head. Alternatively, think of this as if all documents are unique for Lucene. If you search by these fields, relevant documents will be returned.
Q> <..> will the result also contains(ScoreDocs contains) exactly the same set of documents that has the text I searched?
A> Strange/unclear question. If you search for all documents, you will get all documents. If your search query matches only some documents, only those documents will be returned. The internals are more complex - it all depends on how you analyze the text. Maybe you can give a more concrete example with use cases?

Index multi-language field with Lucene

I have multi-language document records to index with lucene. That is, each document record is in one language, but there are different language records. I intend to keep them in one index so that I could search with multi-language queries. Currently the document records are in one data input file like this:
<DOCID>1<\DOCID>
<LANGUAGE>CHINESE<\LANGUAGE>
<TEXT>中文内容<\TEXT>
<DOCID>2<\DOCID>
<LANGUAGE>ENGLISH<\LANGUAGE>
<TEXT>Some English text<\TEXT>
My question is: Is there a way to use different analyzers for the same field with one index writer? Or should I split the document records into two input files, one per language, and use a different index writer for each while appending to the same index?
Thank you in advance for your advice!
You can provide the Analyzer you intend to use for a document when you call IndexWriter.addDocument.
However, you would probably benefit more from splitting different language texts into different fields. This would prevent hits on the wrong language and allow you to create an AnalyzerWrapper that assigns the appropriate analyzer once the language has been detected.
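The one-field-per-language idea can be sketched in plain Java as a map from field name to an analysis function. This is only a conceptual model, not Lucene's analyzer API: the field names `text_en` and `text_zh`, the class `PerLanguageFieldSketch`, and both toy analyzers (lowercased whitespace splitting for English, overlapping character bigrams for Chinese) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of per-field analysis; real analyzers do much more.
class PerLanguageFieldSketch {
    // Toy English analysis: lowercase, then split on whitespace.
    static List<String> englishTerms(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Toy CJK analysis: overlapping two-character windows over the text.
    static List<String> bigramTerms(String text) {
        String s = text.replaceAll("\\s+", "");
        List<String> out = new ArrayList<>();
        if (s.length() == 1) {
            out.add(s);
        }
        for (int i = 0; i + 1 < s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    // field name -> analysis function used for that field
    static final Map<String, Function<String, List<String>>> ANALYZERS = Map.of(
            "text_en", PerLanguageFieldSketch::englishTerms,
            "text_zh", PerLanguageFieldSketch::bigramTerms);

    // Choose the target field from the record's LANGUAGE value.
    static String fieldFor(String language) {
        return "CHINESE".equals(language) ? "text_zh" : "text_en";
    }

    // Analyze a record's text with the analyzer registered for its language field.
    static List<String> analyze(String language, String text) {
        return ANALYZERS.get(fieldFor(language)).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("CHINESE") + " " + analyze("CHINESE", "中文内容"));
        System.out.println(fieldFor("ENGLISH") + " " + analyze("ENGLISH", "Some English text"));
    }
}
```

At query time the same mapping would be applied, so a Chinese query is analyzed and matched only against the Chinese field.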

In accordance with the user name query in lucene

I want to provide a search function on a blog website. I want to search not only across all documents, but also within just one author's documents.
Since I want to use Lucene to provide the full-text index, how can I do this when creating the index?
Indexing the author's name as a separate field would let you search for all documents containing "Lucene" with an author of "fisher", for example ("lucene author:fisher" in QueryParser syntax).
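A rough plain-Java sketch of why a separate author field makes this possible: each field gets its own term dictionary, and a query intersects the matches from the two fields. This is not Lucene's API. `FieldSearchSketch` and its methods are hypothetical, and the sketch assumes both clauses are required (boolean AND), whereas a real query parser's default operator may differ.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical per-field inverted index; not Lucene's API.
class FieldSearchSketch {
    // field name -> (term -> ids of documents whose field contains the term)
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    // Tokenized fields are lowercased and split on whitespace;
    // untokenized fields are indexed as one exact term.
    void addField(int docId, String field, String text, boolean tokenize) {
        String[] terms = tokenize
                ? text.toLowerCase(Locale.ROOT).trim().split("\\s+")
                : new String[] { text };
        for (String term : terms) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(field, f -> new HashMap<>())
                     .computeIfAbsent(term, t -> new TreeSet<>())
                     .add(docId);
            }
        }
    }

    // Documents whose given field contains the given term.
    Set<Integer> lookup(String field, String term) {
        return index.getOrDefault(field, Map.of()).getOrDefault(term, Set.of());
    }

    // Documents that satisfy both clauses (boolean AND).
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        FieldSearchSketch blog = new FieldSearchSketch();
        blog.addField(1, "body", "Getting started with Lucene", true);
        blog.addField(1, "author", "fisher", false);
        blog.addField(2, "body", "Lucene analyzers explained", true);
        blog.addField(2, "author", "smith", false);
        blog.addField(3, "body", "Cooking pasta", true);
        blog.addField(3, "author", "fisher", false);

        // Posts that mention "lucene" AND were written by "fisher".
        System.out.println(and(blog.lookup("body", "lucene"), blog.lookup("author", "fisher")));
        // prints [1]
    }
}
```

Searching the whole blog simply omits the author clause; searching one author's posts adds it.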