Lucene Field.Store.YES versus Field.Store.NO - lucene

Will someone please explain under what circumstances I would use Field.Store.NO instead of Field.Store.YES? I am extremely new to Lucene and am trying to create a document. With my basic knowledge, I am doing:
doc.add(new StringField(fieldNameA,fieldValueA,Field.Store.YES));
doc.add(new TextField(fieldNameB,fieldValueB,Field.Store.YES));

There are two basic ways a field can be written into Lucene:
Indexed - The field is analyzed and indexed, and can be searched.
Stored - The field's full text is stored and will be returned with search results.
If a field is indexed but not stored, you can search on it, but its value won't be returned with search results.
One reasonably common pattern is to use Lucene for search, but store only an ID field, which can then be used to retrieve the full contents of the document/record from, for instance, a SQL database, a file system, or a web resource.
You might also opt not to store a field when that field is just a search tool that you wouldn't display to the user, such as a soundex/metaphone encoding or an alternate analysis of a content field.
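The distinction can be illustrated without Lucene itself. The following is a toy sketch in plain Java, not Lucene's API: the class ToyIndex and its methods are hypothetical names invented for illustration. It keeps one map for searching (the "indexed" part) and a separate map of original values (the "stored" part), so a field added with store = false still matches a query but has no value to return.

```java
import java.util.*;

// Toy model only: ToyIndex is a hypothetical class, not part of Lucene.
class ToyIndex {
    // "field:token" -> ids of documents containing it (the searchable part)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // docId -> (field -> original value), only for fields added with store = true
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void addField(int docId, String field, String value, boolean store) {
        // Every field is indexed here; only the store flag varies
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
            }
        }
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, value);
        }
    }

    Set<Integer> search(String field, String token) {
        String key = field + ":" + token.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, Collections.emptySet());
    }

    // Returns the stored value, or null if the field was not stored
    String storedValue(int docId, String field) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(field);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addField(1, "id", "row-42", true);                                // like Field.Store.YES
        index.addField(1, "content", "Full text kept in a SQL database", false); // like Field.Store.NO
        System.out.println(index.search("content", "database")); // the document is still found
        System.out.println(index.storedValue(1, "id"));          // the id can be read back
        System.out.println(index.storedValue(1, "content"));     // null: nothing was stored
    }
}
```

In this sketch the stored id plays the role of the pointer back to the external database described above.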

Use Field.Store.YES when you need the field's value back from the Lucene document in your search results. Use Field.Store.NO when you only need to search on the field. Here is a link that explains this with a scenario:
https://handyopinion.com/java-lucene-saving-fields-or-not/

Related

Lucene difference between Term and Fields

I've read a lot about Lucene indexing and searching and still can't understand what a Term is. What is the difference between terms and fields?
A very rough analogy would be that fields are like columns in a database table, and terms are like the contents in each database column.
More specifically to Lucene:
Terms
Terms are indexed tokens. See here:
Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms
So, for example, if you have the following sentence in a document...
"This is a list of terms"
...and you pass it through a whitespace tokenizer, this will generate the following terms:
This
is
a
list
of
terms
Terms are therefore also what you place into queries, when performing searches. See here for a definition of how they are used in the classic query parser.
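As a minimal sketch of the whitespace tokenization described above, the following plain Java splits the example sentence on whitespace. It is a stand-in for illustration and does not use Lucene's own tokenizer classes; the class name WhitespaceTerms is hypothetical.

```java
import java.util.*;

// Hypothetical helper, not a Lucene class
class WhitespaceTerms {
    // Splits on runs of whitespace; no lowercasing or stop-word removal is applied
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [This, is, a, list, of, terms]
        System.out.println(tokenize("This is a list of terms"));
    }
}
```

Other analyzers would produce different terms from the same text, for example by lowercasing or removing common words.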
Fields
A field is a section of a document.
A simple example is the title of a document versus the body (the remaining text/content) of the document. These can be defined as two separate Lucene fields within a Lucene index.
(You obviously need to be able to parse the source document so that you can separate the title from the body - otherwise you cannot populate each separate field correctly, while building your Lucene index.)
You can then place all of the title's terms into the title field; and the body's terms into the body field.
Now you can search title data separately from body data.
You can read about fields here and here. There are various different types of fields, specific to the type of data (terms) they will be holding.

How to exclude text indexed from PDF in solr query

I have a Solr index generated from a catalog of PDF files and corresponding metadata fields pertaining to the PDF files themselves. Still, I would like to provide my users an option to exclude from the query any text indexed from within a PDF. This is so the query results would be based on the metadata fields instead and not biased by the vast text within the PDF files.
I have thought of maybe having two indexes (cores) - one with the indexed pdf files and one without.
Is there another way?
Sounds like you are doing a general search against a default field. Which means you have a lot of copyField instructions (or just one copyField * -> text), which include the PDF content field.
You can create a second destination field and copyField everything but the PDF content field into that as well. This way, users can search against one combined field or the other.
However, remember that this parses all content according to the analysis chain of the destination field. So, eDisMax with a list of source fields may be a better approach there. And, remember, you can use several request handlers (like 'select') and define different default parameters there. That usually makes the client code a bit easier.
You do not need to use 2 separate indexes. You can use the edismax parser and specify the qf parameter at query time. That will help determine what fields are searched.
You can look at field aliases.
For example, if you have these index fields
pdfmeta
pdftext
Then you can create two field aliases
quicksearch : pdfmeta
fullsearch : pdfmeta, pdftext
One advantage of using a field alias over qf is if your users have bookmarks like q=quicksearch:value, you can change the alias for quicksearch without affecting the user's bookmark.
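As a rough sketch of how such aliases could be passed at query time, the plain Java below builds a Solr query string using eDisMax per-field qf overrides (parameters of the form f.<alias>.qf). The class name, the helper methods, and the search term "invoice" are hypothetical; the aliases could equally be set as defaults in a request handler rather than sent by the client.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.*;

// Hypothetical helper for illustration; not part of SolrJ
class SolrAliasParams {
    // Parameters that define the two aliases and run one query against them
    static Map<String, String> aliasParams(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");
        params.put("q", query);
        params.put("f.quicksearch.qf", "pdfmeta");          // quicksearch : pdfmeta
        params.put("f.fullsearch.qf", "pdfmeta pdftext");   // fullsearch : pdfmeta, pdftext
        return params;
    }

    // URL-encodes the parameters into a query string
    static String toQueryString(Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            joined.add(encode(p.getKey()) + "=" + encode(p.getValue()));
        }
        return joined.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // The result would be appended to a select handler URL of your core
        System.out.println(toQueryString(aliasParams("quicksearch:invoice")));
    }
}
```

Changing which fields quicksearch covers then only means changing the f.quicksearch.qf value; the q parameter stays the same.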

Elastic Search Engine Without Saving Data

Does Elastic/Lucene really need to store all indexed data in a document? Couldn't you just pass data through it so that Lucene may index the words into its hash table and have a single field for each document with the URL (or whatever pointer makes sense for you) that returns where each document came from?
A quick example may be indexing Wikipedia.org. If I pass each webpage to Elastic/Lucene to index, why do I need to save each webpage's main text in a field if Lucene indexes it and has a corresponding URL field to reply for searches?
We pay the cloud so much money to store so much redundant data. I'm just wondering: if Lucene is searching from its hash table and not the actual fields we save data into, why save that data if we don't want it?
Is there a way to index full text documents in Elastic without having to save all of the full text data from those documents?
There are a lot of options for the _source field. This is the field that actually stores the original document. You can disable it completely or decide which fields to keep. More information can be found in the docs:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html

index multi-language field with lucene

I have multi-language document records to index with lucene. That is, each document record is in one language, but there are different language records. I intend to keep them in one index so that I could search with multi-language queries. Currently the document records are in one data input file like this:
<DOCID>1<\DOCID>
<LANGUAGE>CHINESE<\LANGUAGE>
<TEXT>中文内容<\TEXT>
<DOCID>2<\DOCID>
<LANGUAGE>ENGLISH<\LANGUAGE>
<TEXT>Some English text<\TEXT>
My question is: Is there a way to use different analyzers for the same field with one index writer? Or should I split the document records into two input documents, one per language, and apply different index writers that append to the same index?
Thank you in advance for your advice!
You can provide the Analyzer you intend to use for a document when you call IndexWriter.addDocument.
However, you would probably benefit more from splitting different language texts into different fields. This would prevent hits on the wrong language, and allow you to just create an AnalyzerWrapper to assign the appropriate analyzer after having detected the correct language.
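The idea of one field per language, each with its own analysis, can be sketched in plain Java. This is a toy stand-in and not Lucene code: in Lucene the per-field assignment would typically be done with an analyzer wrapper such as PerFieldAnalyzerWrapper. The field names (text_english, text_chinese) and both tokenizers below are assumptions made for illustration, and the per-character Chinese tokenizer is far cruder than a real CJK analyzer.

```java
import java.util.*;
import java.util.function.Function;

// Toy stand-in for a per-field analyzer setup; all names here are hypothetical
class PerLanguageFields {
    // One field per language, derived from the <LANGUAGE> value of a record
    static String fieldFor(String language) {
        return "text_" + language.toLowerCase(Locale.ROOT);
    }

    // Field name -> tokenizer used for that field
    static final Map<String, Function<String, List<String>>> ANALYZERS = new HashMap<>();
    static {
        // English: lowercase, then split on whitespace
        ANALYZERS.put("text_english",
            text -> Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        // Chinese: one token per non-whitespace character (a crude approximation)
        ANALYZERS.put("text_chinese", text -> {
            List<String> tokens = new ArrayList<>();
            text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
            return tokens;
        });
    }

    static List<String> analyze(String language, String text) {
        Function<String, List<String>> analyzer = ANALYZERS.get(fieldFor(language));
        if (analyzer == null) {
            throw new IllegalArgumentException("No analyzer for language: " + language);
        }
        return analyzer.apply(text);
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("ENGLISH") + " " + analyze("ENGLISH", "Some English text"));
        // Unicode escapes spell the sample Chinese <TEXT> value from the question
        System.out.println(fieldFor("CHINESE") + " " + analyze("CHINESE", "\u4e2d\u6587\u5185\u5bb9"));
    }
}
```

Because each language lands in its own field, a query can be limited to the field that matches its language.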

What are indexes in Lucene?

What are indexes in Lucene and how do they work?
I have gone through some articles on the net and on Google, but I could not fully understand the concepts of the index, documents, etc.
Please help if anyone can explain the terms index and indexing in simple terms.
Thanks !
Say you have a bunch of information you would like to make searchable. For example, some HTML files, some PDFs, and some information stored in a database. When a user does a search, you could write a search engine that trawls through this information and returns results that match. However, this is typically way too slow for large sets of data.
So in advance of running our application, we create an index of the information that needs to be searchable. The index contains a summary of each piece of information we would like to include in the search. In Lucene, the summary for an information piece is called a document. A document contains a number of fields.
When creating the index you decide which fields to include based on what you would like to be searchable. For example, you may include a title, an id, category string and so forth. Once the fields are defined you create a document in the index for each information item (html, pdf, database entries etc). This process is called indexing.
The search engine can now use the index to search for things. The index is highly optimized for the typical searches that we do. You can search for information in specific fields and do boolean logic. You can search for precise matches or fuzzy ones. And the search engine will weigh/score your documents in the index, returning the most relevant first.
Hope that helps at a high level.
Lucene creates an inverted full-text index: it splits the documents into words and builds an index entry for each word.
For instance:
Document 1: "Apache Lucene Java"
Document 2: "Java Library"
Inverted Index:
Tokens     Document Location
apache     1
Library    2
Java       1, 2
Lucene     1
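A minimal sketch of how such an inverted index could be built, in plain Java rather than with Lucene's own classes. The class name InvertedIndexSketch is hypothetical, and lowercasing the tokens is an added assumption so that "Java" in both documents maps to the same entry.

```java
import java.util.*;

// Hypothetical illustration; Lucene's real index structures are far more elaborate
class InvertedIndexSketch {
    // Maps each lowercased token to the sorted ids of the documents containing it
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String token : doc.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "Apache Lucene Java");
        docs.put(2, "Java Library");
        // prints one line per token, e.g. "java -> [1, 2]"
        build(docs).forEach((token, ids) -> System.out.println(token + " -> " + ids));
    }
}
```

A search for a word then becomes a lookup in this map instead of a scan over every document.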
Let's expand this further. Now consider a Document with two Fields, body and title.
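Document doc = new Document();
// Legacy Field constructor from early Lucene releases; current versions use TextField and StringField instead
doc.add(new Field("body", "This is my Test document", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("title", "Test document", Field.Store.YES, Field.Index.UN_TOKENIZED));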
You have the flexibility to tokenize or not tokenize a Field.
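Lucene has various analyzers. Using the StandardAnalyzer:
Analyzer analyzer = new StandardAnalyzer();
the body of the above document would be tokenized into "my", "test", "document" (StandardAnalyzer lowercases tokens and, in the older releases this API belongs to, drops English stop words such as "this" and "is"), while the untokenized title is indexed as the single term "Test document".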