How does Solr index documents?

I'm new to Solr and I want to understand exactly how it indexes documents.
Let's say I have a 100 MB document (document1) full of text. The text is not structured, it's just raw text. I send that document to Solr in order to be indexed.
As far as I understand, Lucene will parse the document, extract all the words based on the schema (let's assume we're using the default schema), and create an index that is basically a mapping from each word to the list of documents containing it, like so:
word1 -> [document1]
word2 -> [document1]
etc
Now, if I want to search for the word "word1", Solr will give me the entire 100 MB document that contains the word "word1", correct?
Please correct me if I'm wrong, I need to understand exactly how it works.

You described the indexing part reasonably well, at least at a high level. The reason you are getting your whole document back is that your field is a stored one in your Solr schema (which is the default).
This means that, apart from having a postings list like
word1 -> doc1, doc3
word2 -> doc2, doc3
etc.
Solr/Lucene also stores the original content of the field, so it is able to return it to you. You can either explicitly turn this off by setting stored=false in your schema, or filter the field out of the response via the fl parameter, e.g. by requesting fl=id (or something similar).
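For illustration, a minimal sketch, assuming a field named content and a core named mycore (both names are assumptions). The schema entry indexes the text without storing it, and the query asks for only the id back:

<field name="content" type="text_general" indexed="true" stored="false"/>

http://localhost:8983/solr/mycore/select?q=content:word1&fl=id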
If you would like to return only parts of the document, around the matched terms, you can do that with Solr's Highlighting feature. Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response.
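For example, a request like the following (the parameter values are just a sketch) returns short highlighted fragments of the content field instead of making you scan the whole stored text:

http://localhost:8983/solr/mycore/select?q=content:word1&hl=true&hl.fl=content&hl.snippets=3&hl.fragsize=100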

Related

Lucene index - Export/Query 'indexed' text field values that are not 'stored'

I have a Lucene index and the document text is 'indexed' but not 'stored'.
I am using Luke v7.6.0 and it's great for 'visualising' the index.
Obviously, because my document text is indexed but not stored, I cannot copy or query the 'stored' value (there isn't one). But can I somehow extract the indexed text values to the clipboard or a text file, to allow me to analyse exactly what was indexed from my file?
One option available to you is to inspect the Lucene index files manually.
I suspect that the most important ones are the Term Dictionary files (*.tim).
I indexed a document with no stored values and the terms test#test.com in the field email (a TextField with the Standard analyzer) and John in the field name (a StringField).
After that, I opened the .tim file with a hex editor and was able to see the indexed terms directly.
You can clearly see the values test, test, com, which were tokenized by the Standard analyzer, and you can see that John stays the same, since I used a StringField. In my other examples, I was able to see the effect of lowercasing as well.
Just a reminder if you would like to repeat this: by default, for small indices, Lucene puts everything into a compound file, which I don't prefer for this kind of debugging. You can disable it with IndexWriterConfig.setUseCompoundFile(false).
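If a hex editor feels too low-level, the term dictionary can also be enumerated programmatically. Here is a minimal sketch against Lucene 7.x (to match Luke 7.6.0); the index path and the field name email are assumptions:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class DumpTerms {
    public static void main(String[] args) throws Exception {
        // Open the index read-only ("/path/to/index" is a placeholder).
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            // Merged view of the email field's term dictionary across all segments.
            Terms terms = MultiFields.getTerms(reader, "email");
            if (terms != null) {
                TermsEnum it = terms.iterator();
                for (BytesRef term = it.next(); term != null; term = it.next()) {
                    System.out.println(term.utf8ToString());
                }
            }
        }
    }
}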

Lucene Field.Store.YES versus Field.Store.NO

Will someone please explain under what circumstances I should use Field.Store.NO instead of Field.Store.YES? I am extremely new to Lucene, and I am trying to create a document. With my basic knowledge, I am doing:
doc.add(new StringField(fieldNameA,fieldValueA,Field.Store.YES));
doc.add(new TextField(fieldNameB,fieldValueB,Field.Store.YES));
There are two basic ways a field can be written into Lucene.
Indexed - The field is analyzed and indexed, and can be searched.
Stored - The field's full text is stored and will be returned with search results.
If a field is indexed but not stored, you can search on it, but its value won't be returned with search results.
One reasonably common pattern is to use Lucene for search, but to store only an ID field, which can be used to retrieve the full contents of the document/record from, for instance, a SQL database, a file system, or a web resource.
You might also opt not to store a field when it is purely a search tool that you wouldn't display to the user, such as a soundex/metaphone field or an alternate analysis of a content field.
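A minimal sketch of that ID-only pattern, assuming an open IndexWriter writer, an IndexSearcher searcher, a TopDocs hits from a search, and a String fullText (all hypothetical names):

Document doc = new Document();
// Store the ID so results can point back to the full record in the database.
doc.add(new StringField("id", "42", Field.Store.YES));
// Index the body for search, but don't store it; the full text lives elsewhere.
doc.add(new TextField("body", fullText, Field.Store.NO));
writer.addDocument(doc);

// At query time, read back only the stored ID and fetch the record externally.
String id = searcher.doc(hits.scoreDocs[0].doc).get("id");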
Use Field.Store.YES when you need the field's value back from the Lucene document. Use Field.Store.NO when you only need to search on it. Here is a link that explains it with a scenario:
https://handyopinion.com/java-lucene-saving-fields-or-not/

Elastic Search Engine Without Saving Data

Does Elastic/Lucene really need to store all indexed data in a document? Couldn't you just pass data through it, so that Lucene indexes the words into its hash table, and keep a single field for each document with the URL (or whatever pointer makes sense for you) that says where each document came from?
A quick example would be indexing Wikipedia.org. If I pass each webpage to Elastic/Lucene to index, why do I need to save each webpage's main text in a field, if Lucene indexes it and has a corresponding URL field to return for searches?
We pay the cloud so much money to store so much redundant data. I'm just wondering: if Lucene is searching from its hash table and not the actual fields we save data into, why save that data if we don't want it?
Is there a way to index full text documents in Elastic without having to save all of the full text data from those documents?
There are a lot of options for the _source field, which is the field that actually stores the original document. You can disable it completely or decide which fields to keep. More information can be found in the docs:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
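For example, _source can be disabled when the index is created (my-index is an assumed index name; note that disabling _source also disables features that need the original document, such as the reindex and update APIs):

PUT my-index
{
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}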

Neo4j - Querying with Lucene

I am using Neo4j embedded as my database. I have to store thousands of articles daily, and I need to provide a search functionality that returns the articles whose content matches the keywords entered by users. I indexed the content of every article and queried the index like below:
val articles = article_content_index.query("article_content", searchString)
This works fine. But it's taking a lot of time when the search string contains common words like "the", "a", etc., which are present in almost every article.
How do I solve this problem?
This is probably a Lucene issue.
You can configure your own analyzer, which can leave out those frequent (stop) words:
http://docs.neo4j.org/chunked/stable/indexing-create-advanced.html
http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/Analyzer.html
http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
You might configure article_content_index as a fulltext index, see http://docs.neo4j.org/chunked/stable/indexing-create-advanced.html. To switch to a fulltext index, you first have to remove the existing index, and the first usage of IndexManager.forNodes(String, Map) needs to configure the index properly on creation.
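A sketch of that first, index-creating usage with the embedded legacy index API (graphDb is an assumed GraphDatabaseService, and MyStopWordAnalyzer is a hypothetical custom Analyzer with a no-argument constructor that filters out stop words):

import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.helpers.collection.MapUtil;

// The first forNodes() call creates the index with this configuration; an
// existing "article_content_index" must be deleted before recreating it as fulltext.
Index<Node> articleContentIndex = graphDb.index().forNodes(
    "article_content_index",
    MapUtil.stringMap(
        IndexManager.PROVIDER, "lucene",
        "type", "fulltext",
        // Hypothetical analyzer class that drops frequent stop words.
        "analyzer", MyStopWordAnalyzer.class.getName()));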

How do I get accurate search results in Lucene using query syntax

So far I have been testing the keywords that I entered in Sitecore using the query syntax, but the search results do not rank the page first.
For example, for the word book I use the query syntax (title:book)^1.
I want the index page that is named book to appear first in the search results, not bookmark.
Also, every time I publish a new page in Sitecore, the results for the word book get pushed down to the end or don't appear on the search page at all.
How do I get accurate results in Lucene for the search engine page?
Also, I've been following http://www.lucenetutorial.com/lucene-query-syntax.html on how to improve search results, but it doesn't work.
Can someone explain how boosting a search term works?
I recommend you leverage the Advanced Database Crawler to get the best out of Lucene.NET with Sitecore. It comes with a config file for the indexes that has a section called <dynamicFields ... >. In that section, you can specify an individual Sitecore field and adjust its boost attribute. The default boost for every field is 1f, i.e. a floating-point value of 1.
More reading:
Sitecore Searcher and Advanced Database Crawler
Source code for the ADC
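As a side note on the query syntax the question mentions: with raw Lucene query syntax, a term or clause is weighted by appending ^ and a factor, for example (field names are assumptions):

title:book^4 content:book

Here matches in title are weighted four times more heavily than matches in content. Also note that a plain term query for book will not match bookmark at all; bookmark can only rank if the field's tokenization produces a matching term.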