Elasticsearch and Spark, how to write one field to docs - lucene

I am using Elasticsearch and Spark. I create an index completely in an rdd and write it using rdd.saveToEs("index", mappings).
Later I want to update the index by overwriting a single field in each doc. But the rdd writing code seems to need the entire doc; writing a map with one property will erase the rest of the doc. For example, the docs may have a title, body, and popularity, with popularity being a Double.
When I recreate the popularity number I now have:
RDD[(String, Map[String, Double])]
where a single element will have the following value:
("doc1", Map(("popularity" -> 1.5d)))
If I write this RDD of only popularity fields, I believe it will erase the other fields. What I need is some way to do an upsert type of op that overwrites or adds the "popularity" field but leaves the rest of the doc unchanged.
I've read about the include and exclude mappings, but I'm not sure whether these apply to this case. When I'm writing the popularity I don't know the rest of the doc structure, only the single field.
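To make the desired behaviour concrete, here is a minimal sketch of the merge semantics being asked for. It uses plain Java maps as stand-ins for stored documents and is not the Elasticsearch or elasticsearch-hadoop API; the class and method names are made up for illustration. For the real thing, the elasticsearch-hadoop configuration documents an es.write.operation setting with update and upsert values (used together with a document id, for example via es.mapping.id), which is probably worth checking against the version in use.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: plain maps stand in for stored documents.
// PartialUpdateSketch and upsert are hypothetical names, not a library API.
final class PartialUpdateSketch {

    // Returns a copy of `existing` with the fields in `partial` added or overwritten.
    // A null `existing` models the "document does not exist yet" (insert) case.
    static Map<String, Object> upsert(Map<String, Object> existing, Map<String, Object> partial) {
        Map<String, Object> merged = new HashMap<>();
        if (existing != null) {
            merged.putAll(existing);
        }
        merged.putAll(partial);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> doc1 = new HashMap<>();
        doc1.put("title", "A title");
        doc1.put("body", "Some body text");
        doc1.put("popularity", 0.5d);

        // Only the recomputed field is known at update time.
        Map<String, Object> update = new HashMap<>();
        update.put("popularity", 1.5d);

        Map<String, Object> result = upsert(doc1, update);
        System.out.println(result.get("title"));      // title is left untouched
        System.out.println(result.get("popularity")); // popularity is overwritten
    }
}
```

A plain index operation corresponds to replacing the whole map with `update`, which is exactly the field loss described above.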

Related

Lucene difference between Term and Fields

I've read a lot about Lucene indexing and searching and still can't understand what a Term is. What is the difference between terms and fields?
A very rough analogy would be that fields are like columns in a database table, and terms are like the contents in each database column.
More specifically to Lucene:
Terms
Terms are indexed tokens. See here:
Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms
So, for example, if you have the following sentence in a document...
"This is a list of terms"
...and you pass it through a whitespace tokenizer, this will generate the following terms:
This
is
a
list
of
terms
Terms are therefore also what you place into queries, when performing searches. See here for a definition of how they are used in the classic query parser.
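The whitespace example above can be reproduced without Lucene. The following is only a plain-Java sketch of what a whitespace tokenizer does conceptually (split on whitespace, nothing else); it does not use Lucene's analyzer classes, and the class name is invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: mimics whitespace tokenization with String.split.
final class WhitespaceTermsSketch {

    // Splits on runs of whitespace; no lowercasing, stemming or stop-word removal.
    static List<String> terms(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // prints [This, is, a, list, of, terms]
        System.out.println(terms("This is a list of terms"));
    }
}
```

Note that "This" keeps its capital letter here. Other analyzers typically lowercase terms as well, so the terms that end up in the index depend on the analyzer chosen.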
Fields
A field is a section of a document.
A simple example is the title of a document versus the body (the remaining text/content) of the document. These can be defined as two separate Lucene fields within a Lucene index.
(You obviously need to be able to parse the source document so that you can separate the title from the body - otherwise you cannot populate each separate field correctly, while building your Lucene index.)
You can then place all of the title's terms into the title field; and the body's terms into the body field.
Now you can search title data separately from body data.
You can read about fields here and here. There are various different types of fields, specific to the type of data (terms) they will be holding.
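As a rough picture of the title/body split, the sketch below models one document as a map from field name to that field's terms, so that a lookup can be restricted to a single field. This is plain Java for illustration, not Lucene's Document or Field API, and all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a "document" is a map of field name -> terms in that field.
final class FieldSearchSketch {

    static Map<String, List<String>> document(String title, String body) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("title", Arrays.asList(title.trim().split("\\s+")));
        fields.put("body", Arrays.asList(body.trim().split("\\s+")));
        return fields;
    }

    // True if the given field of the document contains the exact term.
    static boolean fieldContains(Map<String, List<String>> doc, String field, String term) {
        List<String> terms = doc.get(field);
        return terms != null && terms.contains(term);
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = document("Lucene basics", "Fields and terms explained");
        System.out.println(fieldContains(doc, "title", "Lucene")); // true
        System.out.println(fieldContains(doc, "body", "Lucene"));  // false
    }
}
```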

How to prevent a field from not analyzing in lucene

I want some fields like urls, to be indexed and stored but not to be analyzed. Field class had a constructor to do the same.
Field(String name, String value, Field.Store store, Field.Index index)
But this constructor has been deprecated since Lucene 4, and it is suggested to use StringField or TextField objects instead. But they don't have any constructors to specify whether the field should be analyzed. So can it be done?
The correct way to index and store an un-analyzed field, as a single token, is to use StringField. It is designed to handle atomic strings, like id numbers, urls, etc. You can specify whether it is stored, similarly to Lucene 3.X.
Such as:
new StringField("myUrl", "http://stackoverflow.com/questions/19042587/how-to-prevent-a-field-from-not-analyzing-in-lucene", Field.Store.YES)
Hello, you are totally right with what you are saying. With the new fields provided by Lucene you cannot achieve what you want.
You can either continue using the Field as you described or implement your own field by implementing the interface IndexableField. There you can decide for yourself what behavior you want your field to have.

Neo4j - Querying with Lucene

I am using Neo4j embedded as a database. I have to store thousands of articles daily, and I need to provide a search functionality that returns the articles whose content matches the keywords entered by the users. I indexed the content of each and every article and queried the index like below:
val articles = article_content_index.query("article_content", searchString)
This works fine. But it takes a lot of time when the search string contains common words like "the", "a", etc., which will be present in each and every article.
How do I solve this problem?
Probably a lucene issue.
You can configure your own analyzer which could leave off those frequent (stop-)words:
http://docs.neo4j.org/chunked/stable/indexing-create-advanced.html
http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/Analyzer.html
http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
You might configure article_content_index as a fulltext index, see http://docs.neo4j.org/chunked/stable/indexing-create-advanced.html. To switch to a fulltext index, you first have to remove the existing index; the first call to IndexManager.forNodes(String, Map) then needs to configure the index properly on creation.
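The idea of leaving off stop words can be shown in plain Java. The sketch below only strips a small, made-up stop-word list from a query string before it is sent to the index; it is not a Lucene Analyzer or Neo4j configuration, and the class name and word list are assumptions for illustration. With a real analyzer, the same filtering would normally apply when the articles are indexed as well as when they are queried.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Illustration only: removes very common words from a query string.
final class StopWordSketch {

    // Hypothetical stop-word list; real analyzers ship with their own.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    // Lowercases the query, splits on whitespace and drops stop words.
    static List<String> withoutStopWords(String query) {
        return Arrays.stream(query.trim().toLowerCase(Locale.ROOT).split("\\s+"))
                .filter(term -> !term.isEmpty() && !STOP_WORDS.contains(term))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [history, search, engine]
        System.out.println(withoutStopWords("The history of a search engine"));
    }
}
```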

How to index and search for custom fields using Lucene or Hibernate Search?

How can I index and search for custom fields using Lucene or Hibernate Search? I cannot find a way to index the custom fields, because they are dynamic.
'Custom fields' here means fields that can be edited by the user; they are not hard-coded.
Any help would be appreciated!
Query of Custom Fields
Just use the projection API:
FullTextQuery hibernateQuery = fullTextSession
    .createFullTextQuery(luceneQuery)
    .setProjection("myField1", "myField2");
List results = hibernateQuery.list();
Using projections you get to read any field as long as it's STORED.
If it matches some property name of your indexed entities it will be materialized after being converted to the appropriate type (if you have a TwoWayFieldBridge); if not you will get the String value.
If for some reason you need to bypass this conversion, or just want to have fun decoding the raw Lucene Document, you can open an IndexReader directly.
Indexing Custom Fields
When defining a FieldBridge you get to add as many fields as you like to the indexed Document, and you can name each of them as you like.
The method parameter name is a hint - useful for example to scope the field name - but you can ignore it.
An example FieldBridge implementation writing multiple fields is the DateSplitBridge in the documentation.
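To illustrate the "one property, several index fields" idea without pulling in Hibernate Search, here is a plain-Java sketch that turns one date value into three named fields, using the passed-in name as a prefix. It is loosely modelled on what a date-splitting bridge does, but it is not the FieldBridge interface or the DateSplitBridge source; the class, method and field suffixes are assumptions for illustration.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustration only: one value becomes several named fields.
final class MultiFieldSketch {

    // `name` plays the role of the hint passed to a bridge; here it scopes the field names.
    static Map<String, String> splitDate(String name, LocalDate date) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(name + ".year", String.valueOf(date.getYear()));
        fields.put(name + ".month", String.format(Locale.ROOT, "%02d", date.getMonthValue()));
        fields.put(name + ".day", String.format(Locale.ROOT, "%02d", date.getDayOfMonth()));
        return fields;
    }

    public static void main(String[] args) {
        // prints {published.year=2013, published.month=09, published.day=27}
        System.out.println(splitDate("published", LocalDate.of(2013, 9, 27)));
    }
}
```

For user-defined custom fields the same pattern would apply: loop over whatever key/value pairs the user created and add one index field per pair.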

Fulltext Solr statistical search

Suppose I have a couple of documents indexed with Solr 4.0. Each has 2 fields: a unique ID field and a text DATA field. The DATA field contains a few paragraphs of text. Could anyone advise what kind of analyzers/parsers I should use, and how to build a statistical query that returns a sorted list of the most frequently used words across the DATA fields of all documents?
For the most frequent terms, look into the terms and statistics components.
Besides the answers mentioned here, you can use the "HighFreqTerms" class: it's in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command line application which lets you see the top terms for any field either by document frequency or by total term frequency (the -t option)
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which is committed and in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393
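The difference between the two rankings (document frequency versus total term frequency) is easy to miss, so here is a small plain-Java sketch that computes both over a few in-memory strings. It does not read a Lucene index or call HighFreqTerms; the class and method names are invented for illustration, and tokenization is a simple lowercase whitespace split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustration only: counts terms over in-memory "documents".
final class TermStatsSketch {

    private static List<String> tokens(String doc) {
        List<String> result = new ArrayList<>();
        for (String t : doc.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Document frequency: in how many documents does each term appear at least once?
    static Map<String, Integer> docFreq(List<String> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            for (String term : new HashSet<>(tokens(doc))) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Total term frequency: how many times does each term occur across all documents?
    static Map<String, Integer> totalTermFreq(List<String> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            for (String term : tokens(doc)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Top n terms by count, descending; ties are broken alphabetically.
    static List<String> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the cat and the dog", "a dog");
        System.out.println(top(totalTermFreq(docs), 1)); // [the]
        System.out.println(top(docFreq(docs), 3));       // [cat, dog, the]
    }
}
```

In this sample "the" occurs three times but only in two documents, so it leads the total-frequency ranking while merely tying with "cat" and "dog" on document frequency.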
For the ID field, use an analyzer based on the keyword tokenizer; it will take all the content of the field as a single token.
For the DATA field, use a language-specific analyzer. Note that it is possible to auto-detect the language of the text (patch).
I'm not sure whether it's possible to find the most frequent words with Solr, but if you can use Lucene itself, pay attention to this question. My own suggestion is to use the HighFreqTerms class from the Luke project.