Lucene complex structure search - lucene

I have a fairly simple database that I'd like to index with Lucene.
The domain classes are:
// Person domain
class Person {
    Set<Pair> keys;
}
// Pair domain
class Pair {
    KeyItem keyItem;
    String value;
}
// KeyItem domain, name is unique field within the DB (!!)
class KeyItem {
    String name;
}
I have tens of millions of profiles and hundreds of millions of Pairs. However, since most KeyItem "name" values are duplicates, there are only a few dozen KeyItem instances.
I came up with that structure to save on KeyItem instances.
Basically, any Profile with any fields can be saved into that structure.
Let's say we have a profile with these properties:
- name: Andrew Morton
- education: University of New South Wales
- country: Australia
- occupation: Linux programmer
To store it, we'll have a single Profile instance, 4 KeyItem instances (name, education, country and occupation), and 4 Pair instances with the values "Andrew Morton", "University of New South Wales", "Australia" and "Linux programmer".
All other profiles will reference (all or some of) the same KeyItem instances: name, education, country and occupation.
My question is how to index all of that so I can search for a Profile by particular values of KeyItem::name and Pair::value. Ideally I'd like this kind of query to work:
name:Andrew* AND occupation:Linux*
Should I create a custom Indexer and Searcher? Or could I use the standard ones and just map KeyItem and Pair to Lucene components somehow?

I believe you can use standard Lucene methodology.
I would:
Translate every profile to a Lucene Document.
Translate every Pair to a Field in this Document. All Fields need to be indexed, but not necessarily stored.
Add a stored Field with a profile id to the Document.
Search using name:value pairs similarly to your example.
If you choose bare Lucene, you will need a custom Indexer and Searcher, but they are not hard to build.
It may be easier for you to use Solr, where you need less programming. However, I do not know if Solr allows an open-ended schema like the one I described - I believe you have to predefine all field names, so this may prevent you from using Solr.
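To make the Pair-to-Field mapping above a bit more concrete, here is a minimal plain-Java sketch of the flattening step only. It does not call Lucene. The names ProfileFlattener, toFields and the "profileId" field are made up for illustration, and Pair is simplified to carry the key name directly instead of a KeyItem reference. In a real indexer, each map entry would presumably become one or more indexed fields on the document, with only the profile id stored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: flattens one profile into "field name -> values".
class ProfileFlattener {

    /** Simplified stand-in for the Pair domain: key name plus value. */
    static final class Pair {
        final String keyName;
        final String value;

        Pair(String keyName, String value) {
            this.keyName = keyName;
            this.value = value;
        }
    }

    /**
     * Groups a profile's pairs by key name. Each map entry corresponds to one
     * field name; a key that occurs several times yields several values.
     * Assumes no KeyItem is itself named "profileId".
     */
    static Map<String, List<String>> toFields(String profileId, List<Pair> pairs) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        // The profile id is the only value that would need to be stored.
        fields.computeIfAbsent("profileId", k -> new ArrayList<>()).add(profileId);
        for (Pair pair : pairs) {
            fields.computeIfAbsent(pair.keyName, k -> new ArrayList<>()).add(pair.value);
        }
        return fields;
    }
}
```

For the Andrew Morton profile this gives the fields name, education, country and occupation plus profileId, which is the shape a query like name:Andrew* AND occupation:Linux* needs.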

Lucene returns the list of hit documents essentially based on the occurrence of the keyword(s), regardless of the type of query. The fundamental segment reader checks for the presence of keywords in the entire index database rather than in the specific field of interest.
I suggest introducing a custom searcher that does the following:
1. Read the short-listed documents using the document id. (I guess the collect() method may be overridden to pass the document id from the search() method of the IndexSearcher class.)
2. Get the field value and check for the presence of the expected keywords.
3. Submit the document for scoring only if it meets your custom criteria.
Note: the default standard searcher can be modified rather than writing a custom searcher from scratch.

Related

Is it possible to order lucene documents by matching term?

I'm using Lucene 4.10.3 with Java 1.7
I'm wondering whether it's possible to order query results by the matching term.
Simply put, suppose my documents contain a text field and the query is
text:a*
I want documents with ab, then ac, then ad etc.
The real case is more complex, however. What I'm actually trying to accomplish is to "stuff" a relational DB into my Lucene index (probably not the best idea?).
An appropriate example would be:
I have documents representing books in a library. Every book has a title and also a list of the people who have borrowed it, with the date of borrowing.
When a user searches for a book with a title containing "JAVA", I want to give priority to books that were borrowed by this user. (This could be accomplished by adding a TextField "borrowers", adding a SHOULD clause on it and ordering by score.)
Also, if there are several books with "JAVA" that this user has borrowed before, I want to show the most recently borrowed ones first. So I thought of creating a TextField "borrowers" that would look like
borrowers : "user1__20150505 user2__20150506" etc.
I would then add a BooleanClause borrowers: user1* and order by the matching term.
Any other solution ideas are welcome.
I understand your real problem is more complex, but maybe this is helpful anyway.
You could first search for tokens in the index that match your query, then for each matching token execute a query using that token specifically.
See https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/index/TermsEnum.html for that. Just seek to the prefix and iterate until the prefix stops matching.
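The "seek to the prefix and iterate until the prefix stops matching" idea can be sketched without Lucene, using a sorted set as a stand-in for the term dictionary. This is only an illustration of the scan; PrefixTermScan and termsWithPrefix are hypothetical names. With a real TermsEnum, the equivalent would presumably be a seekCeil to the prefix followed by next() calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Hypothetical stand-in for a prefix scan over a sorted term dictionary.
class PrefixTermScan {

    /** Returns all terms that start with the prefix, in ascending sorted order. */
    static List<String> termsWithPrefix(NavigableSet<String> sortedTerms, String prefix) {
        List<String> matches = new ArrayList<>();
        // "Seek": start at the first term that is >= the prefix.
        for (String term : sortedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order means no later term can match either
            }
            matches.add(term);
        }
        return matches;
    }
}
```

With the borrowers tokens from the question, scanning for the prefix user1__ returns that user's tokens in ascending date order. Walking the list in reverse and issuing one exact-term query per token would then give the most recently borrowed books first.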
In general, it is sometimes easiest to just issue two queries: for example, one within the corpus of books the user has borrowed before and another within the whole corpus.
These approaches may not work, but in that case you could implement a custom Scorer that somehow maps the ordering to a number.
See http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/

How to use Lucene FastVectorHighlighter on multiple fields?

I've got a basic search working, and I'm highlighting using FastVectorHighlighter. When you ask the highlighter for a "best fragment" you have a few overloads of getBestFragment(s) to choose from, documented here. I'm now using the simplest one, like this:
highlightedText = highlighter.getBestFragment(fieldQuery, searcher.getIndexReader(),
scoreDoc.doc, "description", 100)
So I'm highlighting the match from the "description" field. My query however searches another field, "notes". How do I include that in the highlighting? There is an overload that takes a Set<String> matchedFields and one String storedField, but I don't understand the docs. The doc for the method says:
it is advisable that all matchedFields share the same source as storedField or are at least a prefix of it.
What does that mean? How do I index the "notes" and "description" Strings, and what do I pass for matchedFields and storedField?
That call, I believe, is intended to highlight against multiple indexed forms of the same content, that is, when you have one stored full-text content field but have indexed it in a number of different ways to expand how you can search it. Perhaps you have one indexed field that uses standard analysis, another with language-specific stemming, another that uses ngrams, and another indexing metaphones.
If you want to highlight two different stored fields, two calls to getBestFragment would be called for. Or you could use a different highlighter that allows multiple stored fields to be highlighted at the same time, PostingsHighlighter, for instance.

How to prevent a field from not analyzing in lucene

I want some fields, like URLs, to be indexed and stored but not analyzed. The Field class had a constructor for this:
Field(String name, String value, Field.Store store, Field.Index index)
But this constructor has been deprecated since Lucene 4, and it is suggested to use StringField or TextField objects instead. They don't have constructors that let me specify how the field is indexed, though. So can it be done?
The correct way to index and store an un-analyzed field, as a single token, is to use StringField. It is designed to handle atomic strings, like id numbers, URLs, etc. You can specify whether it is stored, similarly to Lucene 3.x.
Such as:
new StringField("myUrl", "http://stackoverflow.com/questions/19042587/how-to-prevent-a-field-from-not-analyzing-in-lucene", Field.Store.YES)
Hello, you are totally right in what you are saying. With the new fields provided by Lucene you cannot achieve what you want.
You can either continue using Field as you described or create your own field by implementing the IndexableField interface. There you can decide for yourself what behavior you want your field to have.

How do i include other fields in a lucene search?

Let's use emails as an example of a document. You have the subject, the body, the person it's from and, let's say, tags (as Gmail does).
From my understanding of QueryParser, I give it ONE field and the parser type. If a user enters text, the search only covers whatever field I set. I notice it will look in the subject or body field if I write fieldName: text to search. However, how do I make a regular query such as "funny SO question unicorn" find result(s) with some of those strings in the subject and the others in the body? At the moment, because I knew it would be easy, I made a field called ALL and combined all the other fields into it, but I would like to know how to do it properly, especially since my next app is text-search dependent.
Use MultiFieldQueryParser. You can specify the list of fields to be searched using the following constructor:
MultiFieldQueryParser(Version matchVersion, String[] fields, Analyzer analyzer)
This will generate a query as if you had created multiple queries on different fields. This partially addresses your problem: it still will not match one term in field1 and another in field2. For that, as you have rightly pointed out, you will need to combine all the fields into one single field and search in that field. Nevertheless, you will find MultiFieldQueryParser useful when query terms do not cross field boundaries.
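For the combined-field approach mentioned here (the ALL field from the question), the indexing side only needs to concatenate the other field values before adding them as one extra field. A minimal sketch of that concatenation, assuming a hypothetical helper named CatchAllField.combine; it is plain Java and says nothing about how the result is analyzed or indexed:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that builds the text for a catch-all field.
class CatchAllField {

    /** Joins the non-empty field values with single spaces, in the given order. */
    static String combine(String... fieldValues) {
        List<String> parts = new ArrayList<>();
        for (String value : fieldValues) {
            // Skip missing or blank fields so they leave no stray separators.
            if (value != null && !value.trim().isEmpty()) {
                parts.add(value.trim());
            }
        }
        return String.join(" ", parts);
    }
}
```

For an email, the call would be something like combine(subject, body, from, tags), with the result indexed under ALL and used as the default search field.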

hibernate search multiple fields based on language

I'm interested in moving our DB full-text search to Lucene. I'm using Hibernate, so I guess it would be smart to use Hibernate Search. I have a problem though.
Our record has a list of information texts and titles in different languages, and I need to be able to search both within a single language and across all languages.
I could probably do it in plain Lucene, but I don't know how well that would work with our current transactions. So using Hibernate Search and letting Hibernate deal with the index would be much better.
Is it possible to create fields in the index that let me search the way I described?
class Record{
List<Info> infos;
}
class Info{
String title;
String infoText;
String langCode;
}
Can I do it like this? Create getters in Record like these:
public String getEnglishTitle(){...}
public String getFullInfos(){...}
And then put index annotations on these getters to get the necessary fields in the index?
I would write a custom FieldBridge for the infos property. Then you have full control over which fields you add to the index, e.g. you could use text. as field names. This should allow you to decide dynamically which language to search. Remember that you have to think about the analyzers too. A custom per-field analyzer would work.
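As a rough illustration of what such a bridge could emit, the following plain-Java sketch maps a list of Info objects to per-language field names. It does not use the Hibernate Search API. The naming scheme (title.<langCode>, text.<langCode>, and the .all variants for searching across languages) is an assumption made for this example, as are the names LanguageFieldMapper and toFields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the field layout a custom bridge might produce.
class LanguageFieldMapper {

    /** Mirrors the Info class from the question. */
    static final class Info {
        final String title;
        final String infoText;
        final String langCode;

        Info(String title, String infoText, String langCode) {
            this.title = title;
            this.infoText = infoText;
            this.langCode = langCode;
        }
    }

    /** Builds "field name -> values", assuming no language code is literally "all". */
    static Map<String, List<String>> toFields(List<Info> infos) {
        Map<String, List<String>> fields = new TreeMap<>();
        for (Info info : infos) {
            // Per-language fields, for searching within a single language.
            add(fields, "title." + info.langCode, info.title);
            add(fields, "text." + info.langCode, info.infoText);
            // Language-independent copies, for searching over all languages.
            add(fields, "title.all", info.title);
            add(fields, "text.all", info.infoText);
        }
        return fields;
    }

    private static void add(Map<String, List<String>> fields, String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
}
```

A search restricted to English would then target title.en and text.en, while a search over everything would target the .all fields. Each per-language field could get its own analyzer, as noted above.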