searching lucene index on multiple fields - lucene

I have an index with 2 content fields (analyzed, indexed & stored):
for example: name , hobbies. (The hobbies field can be added multiple times with different values).
I have another field that is only indexed (un_analyzed & not stored) used for filtering:
for example: country_code
Now I want to build a query that retrieves the documents that best match some "search" input, but only those documents whose country_code has a given exact value.
What would be the most suitable query syntax / query parser combination for building such a query?

You can use the following query (the leading + marks a clause as required, so country_code acts as a filter):
+country_code:india +(name:search_value OR hobbies:search_value)
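A query string of this shape can also be assembled in code. The following is a minimal sketch in plain Java, not part of Lucene: the class name CountryFilteredQuery and its methods are hypothetical, and escape is only a simplified stand-in for QueryParser.escape.

```java
final class CountryFilteredQuery {
    // Characters that the classic Lucene query syntax treats as special.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Backslash-escapes special characters (simplified stand-in for QueryParser.escape). */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds a query with a required country clause and a required name/hobbies group. */
    static String build(String countryCode, String userInput) {
        String terms = escape(userInput.trim());
        return "+country_code:" + escape(countryCode)
                + " +(name:(" + terms + ") hobbies:(" + terms + "))";
    }

    public static void main(String[] args) {
        System.out.println(build("india", "rock climbing"));
    }
}
```

Because country_code is indexed un-analyzed, a query-time analyzer that lower-cases or drops terms may change the value before it is matched. Building that clause as a TermQuery instead of parsing it avoids the problem.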

Why not start with QueryParser? It might work for your use case, and it requires the least amount of effort.
It's not clear from your question, but let's assume you have a single input field ('search') and a combobox for the country code. You would then read those values and create a query:
// you don't have to use two parsers, you can do this using one.
QueryParser nameParser = new QueryParser(Version.LUCENE_CURRENT, "name", your_analyzer);
QueryParser hobbiesParser = new QueryParser(Version.LUCENE_CURRENT, "hobbies", your_analyzer);
BooleanQuery q = new BooleanQuery();
q.add(nameParser.parse(query), BooleanClause.Occur.SHOULD);
q.add(hobbiesParser.parse(query), BooleanClause.Occur.SHOULD);
/* Filtering by country code can be done using a BooleanQuery
* or a filter, the difference will be how Lucene scores matches.
* For example, using a filter:
*/
// countryCode holds the value read from the country code combobox
Filter countryCodeFilter = new QueryWrapperFilter(new TermQuery(new Term("country_code", countryCode)));
//and finally searching:
TopDocs topDocs = searcher.search(q, countryCodeFilter, 10);


Get distinct values for a group of fields from a list of records

We are using Liferay (6.2 CE GA4) with Lucene to perform search on custom assets. Currently we can retrieve the proper hits and the full documents.
We want to return a unique combination of certain fields for our custom asset.
To make it more clear, we want to do something similar to the following SQL query but using Lucene in Liferay:
SELECT DISTINCT
field01, field02, field03
FROM
FieldsTable
WHERE
someOtherField04 LIKE '%test%'
ORDER BY
field01 ASC, field02 ASC, field03 ASC;
How we are doing it currently
Currently we fetch the field values manually by iterating through all the documents and then filtering out the duplicate combinations. This takes a long time when more than 5k records have to be processed on each request, even though the distinct field values usually amount to only a few hundred records.
Any help is much appreciated.
Thanks
P.S.: Also cross-posted on Liferay forums: https://www.liferay.com/community/forums/-/message_boards/message/55513210
First you need to create the SearchContext for your query (just as reference):
SearchContext searchContext = new SearchContext();
searchContext.setAndSearch(true);
// Add any specific attributes for your use case below:
Map<String, Serializable> attributes = new HashMap<>();
attributes.put(Field.CLASS_NAME_ID, 0L);
attributes.put(Field.DESCRIPTION, null);
attributes.put(Field.STATUS, String.valueOf(WorkflowConstants.STATUS_APPROVED));
attributes.put(Field.TITLE, null);
attributes.put(Field.TYPE, null);
attributes.put("articleId", null);
attributes.put("ddmStructureKey", ...);
attributes.put("ddmTemplateKey", ...);
attributes.put("params", new LinkedHashMap<String, Object>());
searchContext.setAttributes(attributes);
searchContext.setCompanyId(... the ID of my portal instance ..);
searchContext.setGroupIds(new long[] { ... the ID of the site ... });
searchContext.setFolderIds(new long[] {});
Now you can find the list of all values for one or more specific fields:
// We don't need any result document, just the field values
searchContext.setStart(0);
searchContext.setEnd(0);
// A facet is responsible for collecting the values
final MultiValueFacet fieldFacet = new MultiValueFacet(searchContext);
String fieldNameInLucene = "ddm/" + structureId + "/" + fieldName + "_" + LocaleUtil.toLanguageId(locale);
fieldFacet.setFieldName(fieldNameInLucene);
searchContext.addFacet(fieldFacet);
// Do search
IndexerRegistryUtil.getIndexer(JournalArticle.class).search(searchContext);
// Retrieve all terms
final List<String> terms = new ArrayList<>();
for (final TermCollector collector : fieldFacet.getFacetCollector().getTermCollectors()) {
    terms.add(collector.getTerm());
}
At the end, terms will contain every term of your field across all matching documents.

how to get search hits when at least one character present in field value using lucene search

How do I get search hits when at least one character searched is present in a field's value, using lucene search?
I got search hits only when I search with a complete word.
Example:
Hello world
In above example, if I enter "Hello", then I will get a hit, but not if I enter "Hel"
Here is my code to get hits:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT, new HashSet<String>());
BooleanQuery.setMaxClauseCount(32767);
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("searchString");
TopDocs topResultDocs = searcher.search(query, null, 20);
Append * to the search term to turn it into a prefix query, which matches every term that starts with the given characters: Hel* will match Hello.
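If the user input can contain several words, each one needs its own trailing wildcard before the string goes to the parser. A minimal sketch in plain Java follows; the class PrefixQueryText and the method toPrefixQuery are hypothetical helpers, not Lucene API, and special characters are not escaped here.

```java
final class PrefixQueryText {
    /** Appends '*' to every whitespace-separated term that does not already end with one. */
    static String toPrefixQuery(String userInput) {
        StringBuilder sb = new StringBuilder();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields an empty query string
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(term);
            if (!term.endsWith("*")) {
                sb.append('*');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPrefixQuery("Hel wor"));
    }
}
```

The result would then be passed to parser.parse(...) in place of the raw input. Matching characters in the middle of a word would need a leading wildcard as well, which is considerably more expensive because all terms of the field have to be scanned.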

lucene - most relevant search and sort the results

I am trying to make a search page based on the data we have. Here is my code.
SortField sortField = new SortField(TEXT_FIELD_RANK, SortField.Type.INT, true);
Sort sort = new Sort(sortField);
Query q = queryParser.parse(useQuery);
TopDocs topDocs = searcher.search(q, totalLimit, sort);
ScoreDoc[] hits = topDocs.scoreDocs;
log.info("totalResults="+ topDocs.totalHits);
int index = getStartIndex(start, maxReturn);
int resultsLength = start * maxReturn;
if (resultsLength > totalLimit) {
    resultsLength = totalLimit;
}
log.info("index:"+ index + "==resultsLength:"+ resultsLength);
for (int i = index; i < resultsLength; ++i) {
}
Basically, here is my requirement: if there is an exact match, I need to display the exact match. If there is no exact match, I need to sort the results by the field. So I check for the exact match inside the for loop.
But it seems to me that it sorts the results no matter what, so even though there is an exact match, it doesn't show up at the first page.
Thanks.
You have set it to sort on a field value, not on relevance, so there is no guarantee that the best matches will be on the first page. You can sort by relevance first, then by your field value, like:
Sort sort = new Sort(SortField.FIELD_SCORE, sortField);
If that is what you were looking for.
Otherwise, if you are looking to ignore relevance for anything except a direct match, you could query using a more restrictive (exact matching) query first, getting your exact matches as an entirely separate result set.
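The ordering produced by a relevance-first sort can be illustrated without Lucene. The sketch below is plain Java under stated assumptions: Hit is a hypothetical stand-in for a search hit, and the secondary key is descending to mirror the reverse=true flag on the SortField in the question.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RelevanceThenRank {
    /** Hypothetical stand-in for a search hit: relevance score plus the rank field. */
    static final class Hit {
        final String id;
        final float score;
        final int rank;

        Hit(String id, float score, int rank) {
            this.id = id;
            this.score = score;
            this.rank = rank;
        }
    }

    /** Score descending first, then rank descending as the tie-breaker. */
    static final Comparator<Hit> ORDER =
            Comparator.comparingDouble((Hit h) -> h.score).reversed()
                    .thenComparing(Comparator.comparingInt((Hit h) -> h.rank).reversed());

    /** Returns the hit ids in sorted order without modifying the input list. */
    static List<String> sortedIds(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(ORDER);
        List<String> ids = new ArrayList<>();
        for (Hit h : copy) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a", 1.0f, 5));
        hits.add(new Hit("b", 2.0f, 1));
        hits.add(new Hit("c", 1.0f, 9));
        System.out.println(sortedIds(hits));
    }
}
```

Note that the secondary key only decides the order between hits whose scores are exactly equal, which is uncommon with floating-point relevance scores. That is one reason the separate exact-match query may fit the requirement better.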

how to search word from String field in Lucene Index

How do I search for a word in a String field of a Lucene index?
I have a Lucene index with a TITLE field that contains document titles,
e.g.: TV not working, Mobile not working
I want to search for a particular word in the title.
The code below gives me results from the full content; if I change FULL_CONTENT to TITLE, I don't get any results.
Query qry = null;
qry = new QueryParser(FULL_CONTENT, new SimpleAnalyzer()).parse("not");
Searcher searcher = null;
searcher = new IndexSearcher(indexDirectory);
Hits hits = null;
hits = searcher.search(qry);
System.out.println(hits.length());
As "NOT" is a Lucene query syntax operator, that may be your problem.
The problem is that the analyzer applies a lower-case filter (SimpleAnalyzer, used in your code, does this). Your query will be in lower case:
e.g. title:mobile.
StringField doesn't apply any analysis, so your text is indexed as is. If you change StringField to TextField, it will be analyzed at index time and converted to lower case in the index.
If you replace the analyzer with WhitespaceAnalyzer, there is no lower-case filter and it will work again (because your query doesn't get converted to lower case).
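The mismatch between an analyzed query and an un-analyzed field can be shown with a small simulation. This is plain Java, not Lucene code: analyzed only approximates what a letter-splitting, lower-casing analyzer such as SimpleAnalyzer produces, and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class AnalysisSketch {
    /** Approximates a simple analyzer: split on non-letters, lower-case each token. */
    static List<String> analyzed(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** An un-analyzed field indexes the whole value as a single, unchanged term. */
    static List<String> unanalyzed(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        String title = "Mobile not working";
        // The query term "mobile" is found among the analyzed tokens...
        System.out.println(analyzed(title).contains("mobile"));
        // ...but the un-analyzed field holds only the full original string.
        System.out.println(unanalyzed(title).contains("mobile"));
    }
}
```

In this model a single-word query can only match the un-analyzed field if it equals the entire stored value, including case, which is why searching for one word of the title returns nothing.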

How to get reliable docid from Lucene 3.0.3?

I would like to get the int docid of a Document I just added to a Lucene index so that I can stick it into a Filter to update a standing query. My documents have a unique external id, so I thought that doing a TermDocs enumeration on the unique id would return the correct document, like this:
protected int getDocId(IndexReader reader, String idField, Document doc) throws IOException {
    String id = doc.get(idField);
    TermDocs termDocs = reader.termDocs(new Term(idField, id));
    int docid = -1;
    while (termDocs.next()) {
        docid = termDocs.doc();
        Document aDoc = reader.document(docid);
        String docIdString = aDoc.get(idField);
        System.out.println(docIdString + ": " + docid);
    }
    return docid;
}
Unfortunately, this loops and loops, returning the same docIdString and increasing docids.
What is the recommended way to get the docids for newly-added documents so that I could use them in a Filter immediately after the documents are committed?
The doc Id of a document is not the same as the value in your id field. The doc ID is an internal Lucene identifier, which you probably shouldn't access. Your field is just a field - you can call it "ID", but Lucene won't do anything special with it.
Why are you trying to manually update the filter? When you commit, merges can happen etc. so the IDs before will not be the same as the IDs afterwards. (Which is just an example of the general point that you shouldn't rely on Lucene IDs for anything.) So you don't need to just add that one document to the filter, you need to update the whole thing.
To update a cached filter, just run a query for "foo" and use your filter with a CachingWrapperFilter.
EDIT: Because your id field is just a field, you do a search for it like you would anything else:
TopDocs results = searcher.search(new TermQuery(new Term("MyIdField", id)), 1);
int internalId = results.scoreDocs[0].doc;
However, like I said, I think you want to ignore internal IDs. So I would build a filter from a query:
BooleanQuery filterQuery = new BooleanQuery(); // or get existing query from cache
filterQuery.add(new TermQuery(new Term("MyIdField", id)), BooleanClause.Occur.SHOULD);
// add more sub queries for each ID you want in the filter here
Filter myFilter = new CachingWrapperFilter(new QueryWrapperFilter(filterQuery));
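To illustrate why the answer advises against holding on to internal doc IDs, here is a toy model in plain Java. It is not how Lucene stores documents: the assumption is simply that a doc ID is a position in a list and that a merge drops deleted entries and renumbers the rest. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DocIdShiftSketch {
    /** Toy model: the internal doc ID is the position of the document in the list. */
    static int docIdOf(List<String> segment, String externalId) {
        return segment.indexOf(externalId);
    }

    /** Toy "merge": drops a deleted document and renumbers the survivors. */
    static List<String> mergeWithout(List<String> segment, String deletedExternalId) {
        List<String> merged = new ArrayList<>(segment);
        merged.remove(deletedExternalId);
        return merged;
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("doc-a", "doc-b", "doc-c");
        List<String> after = mergeWithout(before, "doc-a");
        // The external id stays "doc-c", but its position changes after the merge.
        System.out.println(docIdOf(before, "doc-c") + " -> " + docIdOf(after, "doc-c"));
    }
}
```

The external ID field is the only stable handle in this model, which matches the suggestion to build the filter from term queries on that field.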