Get distinct values for a group of fields from a list of records - lucene

We are using Liferay (6.2 CE GA4) with Lucene to perform search on custom assets. Currently we can retrieve the proper hits and the full documents.
We want to return a unique combination of certain fields for our custom asset.
To make it clearer: we want to do something similar to the following SQL query, but using Lucene in Liferay:
SELECT DISTINCT
field01, field02, field03
FROM
FieldsTable
WHERE
someOtherField04 LIKE "%test%"
ORDER BY
field01 ASC, field02 ASC, field03 ASC;
How we are doing it currently
Currently we fetch the field values manually by iterating through all the documents and then filtering out the duplicate combinations. This is slow when more than 5k records have to be processed on each request, even though the distinct field values mostly amount to only a few hundred records.
Any help is much appreciated.
Thanks
P.S.: Also cross-posted on Liferay forums: https://www.liferay.com/community/forums/-/message_boards/message/55513210

First you need to create the SearchContext for your query (just for reference):
SearchContext searchContext = new SearchContext();
searchContext.setAndSearch(true);
// Add any specific attributes for your use case below:
Map<String, Serializable> attributes = new HashMap<>();
attributes.put(Field.CLASS_NAME_ID, 0L);
attributes.put(Field.DESCRIPTION, null);
attributes.put(Field.STATUS, String.valueOf(WorkflowConstants.STATUS_APPROVED));
attributes.put(Field.TITLE, null);
attributes.put(Field.TYPE, null);
attributes.put("articleId", null);
attributes.put("ddmStructureKey", ...);
attributes.put("ddmTemplateKey", ...);
attributes.put("params", new LinkedHashMap<String, Object>());
searchContext.setAttributes(attributes);
searchContext.setCompanyId(... the ID of my portal instance ..);
searchContext.setGroupIds(new long[] { ... the ID of the site ... });
searchContext.setFolderIds(new long[] {});
Now you can find the list of all values for one or more specific fields:
// We don't need any result document, just the field values
searchContext.setStart(0);
searchContext.setEnd(0);
// A facet is responsible for collecting the values
final MultiValueFacet fieldFacet = new MultiValueFacet(searchContext);
String fieldNameInLucene = "ddm/" + structureId + "/" + fieldName + "_" + LocaleUtil.toLanguageId(locale);
fieldFacet.setFieldName(fieldNameInLucene);
searchContext.addFacet(fieldFacet);
// Do search
IndexerRegistryUtil.getIndexer(JournalArticle.class).search(searchContext);
// Retrieve all terms
final List<String> terms = new ArrayList<>();
for (final TermCollector collector : fieldFacet.getFacetCollector().getTermCollectors()) {
terms.add(collector.getTerm());
}
At the end, terms will contain every term of your field across all matching documents.
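A facet like this collects the values of one field at a time. If distinct combinations of several fields are needed, as in the SQL from the question, the combinations still have to be deduplicated and sorted somewhere. The following is only a sketch of that step in plain Java, assuming each hit has already been reduced to a list of its field values (non-null, same length per row). The class and method names are hypothetical and are not part of Liferay or Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: plain-Java equivalent of
// SELECT DISTINCT field01, field02, field03 ... ORDER BY field01, field02, field03
final class DistinctCombinationsSketch {

    // Compares two rows field by field, left to right (assumes non-null values).
    static int compareRows(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            int c = a.get(i).compareTo(b.get(i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.size(), b.size());
    }

    // A TreeSet drops rows that compare as equal and keeps the rest sorted.
    static List<List<String>> distinctSorted(List<List<String>> rows) {
        TreeSet<List<String>> distinct = new TreeSet<>(DistinctCombinationsSketch::compareRows);
        distinct.addAll(rows);
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("b", "1", "x"),
                Arrays.asList("a", "2", "y"),
                Arrays.asList("b", "1", "x"));
        System.out.println(distinctSorted(rows));
    }
}
```

This does not remove the cost of loading the field values for every hit; it only keeps the deduplication itself cheap and ordered.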

Related

Dynamic spinner using a SQLite Database

I want my spinner to be dynamically updated from my SQLite database. The spinner should contain months and years (e.g. April 2013) from the database.
I have searched the web a lot, and also many of the questions on this site, but I can't solve the remaining problem.
Right now my spinner list is getting longer, but the entries show no text. What is the problem?
Here is the relevant code in my main class
KilometerSQL info = new KilometerSQL(this);
info.open();
String data = info.getData();
final Cursor cSpinner;
cSpinner = (Cursor) KilometerSQL.getSpinnerData();
startManagingCursor(cSpinner);
SimpleCursorAdapter scaYear = new SimpleCursorAdapter(this, android.R.layout.simple_spinner_item,cSpinner,new String[] {KilometerSQL.KEY_MONTH},new int[]{});
scaYear.setDropDownViewResource(android.R.layout.simple_spinner_dropdown_item);
spinner1.setAdapter(scaYear);
info.close();
tvView.setText(data);
And here is my SQLiteDatabase:
public static Cursor getSpinnerData() {
// TODO Auto-generated method stub
return ourDatabase.query(DATABASE_TABLE, //table name
new String[] {KEY_ROWID, KEY_MONTH}, //list of columns to return
null, //filter declaring which rows to return; formatted as SQL WHERE clause
null,
KEY_MONTH, //filter declaring how to group rows; formatted as SQL GROUP BY clause
null, //filter declaring which row groups to include in cursor; formatted as SQL HAVING clause
null); //how to order rows; formatted as SQL ORDER BY clause
}
Don't hesitate to ask questions if you need some info or code.
Thank you very much.
I found the solution.
The "to" array passed to the adapter was empty, so the adapter had no view to bind KEY_MONTH to. I needed to add android.R.id.text1 to it, like this:
SimpleCursorAdapter scaYear = new SimpleCursorAdapter(this, android.R.layout.simple_spinner_item,cSpinner,new String[] {KilometerSQL.KEY_MONTH},new int[]{android.R.id.text1});

Modeshape querying mixinTypes

I'm using ModeShape and modeshape-connector-jdbc-metadata. I want to get all nodes representing tables in the storage. Those nodes have the [mj:catalog] mixin type.
I'm querying the storage with the following code:
public List getDatabases() throws RepositoryException {
// Obtain the query manager for the session ...
QueryManager queryManager = dbSession.getWorkspace().getQueryManager();
// Create a query object ...
Query query = queryManager.createQuery("SELECT * FROM [mj:table]"
, Query.JCR_SQL2);
// Execute the query and get the results ...
QueryResult result = query.execute();
// Iterate over the nodes in the results ...
NodeIterator nodeIter = result.getNodes();
List stringResult = new ArrayList();
while (nodeIter.hasNext()) {
stringResult.add(nodeIter.nextNode().getName());
}
return stringResult;
}
But it always returns an empty list.
I also tried the following queries:
SELECT unst.*, tbl.* FROM [nt:unstructured] AS unst
JOIN [mj:table] AS tbl ON ISSAMENODE(unst,tbl)
SELECT * FROM [nt:unstructured] WHERE [jcr:mixinTypes] = [mj:table]
But the result remains the same.
What am I doing wrong?
Thank you for any help.
There is a known issue that the database metadata nodes are not indexed automatically. A simple workaround is to cast the JCR Session's getWorkspace() instance to org.modeshape.jcr.api.Workspace (the public API for ModeShape's workspace) and call the reindex(String path) method, passing in the path to the database catalog node (or an ancestor if desired).

searching lucene index on multiple fields

I have an index with 2 content fields (analyzed, indexed & stored):
for example: name, hobbies. (The hobbies field can be added multiple times with different values).
I have another field that is only indexed (un_analyzed & not stored) used for filtering:
for example: country_code
Now I want to build a query that retrieves the documents that best match some "search" input field, but only those documents where country_code has some exact value.
What would be the most suitable combination of query syntax and query parser for building such a query?
You can use the following query:
+country_code:india +(name:search_value OR hobbies:search_value)
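A minimal sketch of assembling such a query string in Java, with the country code clause prefixed by + so that it is required rather than optional. The class name, method names, and the simple character escaping are assumptions made for illustration; real code would more likely rely on the query parser's own escaping. It also assumes the country code contains no whitespace.

```java
// Hypothetical helper that builds a query string in the classic Lucene syntax.
final class CountryQuerySketch {

    // Characters with special meaning in the classic Lucene query syntax.
    private static final String SPECIAL_CHARS = "+-&|!(){}[]^\"~*?:\\/";

    // Puts a backslash in front of every special character.
    static String escape(String text) {
        StringBuilder escaped = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL_CHARS.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    // Both top-level clauses are required (+); inside the second clause
    // a match in either name or hobbies is enough.
    static String build(String countryCode, String searchValue) {
        String value = escape(searchValue);
        return "+country_code:" + escape(countryCode)
                + " +(name:(" + value + ") OR hobbies:(" + value + "))";
    }

    public static void main(String[] args) {
        System.out.println(build("india", "rock climbing"));
    }
}
```

The parentheses after each field name keep multi-word input attached to that field instead of letting later words fall through to the parser's default field.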
Why don't you start with QueryParser? It might work for your use case, and it requires the least amount of effort.
It's not clear from your question, but let's assume you have a single input field ('search') and a combobox for the country code. You would then read those values and create a query:
// you don't have to use two parsers, you can do this using one.
QueryParser nameParser = new QueryParser(Version.LUCENE_CURRENT, "name", your_analyzer);
QueryParser hobbiesParser = new QueryParser(Version.LUCENE_CURRENT, "hobbies", your_analyzer);
BooleanQuery q = new BooleanQuery();
q.add(nameParser.parse(query), BooleanClause.Occur.SHOULD);
q.add(hobbiesParser.parse(query), BooleanClause.Occur.SHOULD);
/* Filtering by country code can be done using a BooleanQuery
* or a filter, the difference will be how Lucene scores matches.
* For example, using a filter:
*/
// countryCode is an assumed variable name: the value read from the combobox
Filter countryCodeFilter = new QueryWrapperFilter(new TermQuery(new Term("country_code", countryCode)));
//and finally searching:
TopDocs topDocs = searcher.search(q, countryCodeFilter, 10);

How to get reliable docid from Lucene 3.0.3?

I would like to get the int docid of a Document I just added to a Lucene index so that I can stick it into a Filter to update a standing query. My documents have a unique external id, so I thought that doing a TermDocs enumeration on the unique id would return the correct document, like this:
protected int getDocId(IndexReader reader, String idField, Document doc) throws IOException {
String id = doc.get(idField);
TermDocs termDocs = reader.termDocs(new Term(idField, id));
int docid = -1;
while (termDocs.next()) {
docid = termDocs.doc();
Document aDoc = reader.document(docid);
String docIdString = aDoc.get(idField);
System.out.println(docIdString + ": " + docid);
}
return docid;
}
Unfortunately, this loops and loops, returning the same docIdString with increasing docids.
What is the recommended way to get the docids for newly added documents so that I could use them in a Filter immediately after the documents are committed?
The doc Id of a document is not the same as the value in your id field. The doc ID is an internal Lucene identifier, which you probably shouldn't access. Your field is just a field - you can call it "ID", but Lucene won't do anything special with it.
Why are you trying to manually update the filter? When you commit, merges can happen etc. so the IDs before will not be the same as the IDs afterwards. (Which is just an example of the general point that you shouldn't rely on Lucene IDs for anything.) So you don't need to just add that one document to the filter, you need to update the whole thing.
To update a cached filter, just run a query for "foo" and use your filter with a CachingWrapperFilter.
EDIT: Because your id field is just a field, you do a search for it like you would anything else:
TopDocs results = searcher.search(new TermQuery(new Term("MyIdField", id)), 1);
int internalId = results.scoreDocs[0].doc;
However, like I said, I think you want to ignore internal IDs. So I would build a filter from a query:
BooleanQuery filterQuery = new BooleanQuery(); // or get existing query from cache
filterQuery.add(new TermQuery(new Term("MyIdField", id)), BooleanClause.Occur.SHOULD);
// add more sub queries for each ID you want in the filter here
Filter myFilter = new CachingWrapperFilter(new QueryWrapperFilter(filterQuery));

Lucene Field Grouping

Say I have the fields stud_roll_number and date_leave.
select stud_roll_number,count(*) from some_table where date_leave > some_date group by stud_roll_number;
How do I write the same query using Lucene? After querying for date_leave > some_date, I tried:
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
Document doc = search.doc(scoreDoc.doc);
String value = doc.get(fieldName);
Integer key = mapGrouper.get(value);
if (key == null) {
key = 1;
} else {
key = key+1;
}
mapGrouper.put(value, key);
}
But I have a huge data set, and this takes a long time to compute. Is there any other way to do it? Thanks in advance.
Your performance bottleneck is almost certainly the I/O it takes to perform the document and field value lookups. What you want to do in this situation is use a FieldCache for the field you want to group by. Once you have a field cache, you can look up the values by Lucene doc ID, which will be fast because all the values are in memory.
Also remember to give your HashMap an initial capacity to avoid array resizing.
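As a rough sketch of the counting step, assume the grouped field's values are already held in an in-memory array indexed by Lucene doc ID (in Lucene 3.x, FieldCache.DEFAULT.getStrings(reader, fieldName) should provide such an array) and that the matching doc IDs have been collected from the search. The class and method names below are hypothetical, and the code uses no Lucene types so that it stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: SELECT field, COUNT(*) ... GROUP BY field over matching docs.
final class GroupCountSketch {

    static Map<String, Integer> countByField(String[] valuesByDocId, int[] matchingDocIds) {
        // Sized for the worst case where every hit has a different value,
        // so the map never has to resize; a smaller estimate may fit better.
        int capacity = Math.max(16, (int) (matchingDocIds.length / 0.75f) + 1);
        Map<String, Integer> counts = new HashMap<>(capacity);
        for (int docId : matchingDocIds) {
            String value = valuesByDocId[docId]; // in-memory lookup, no stored-field I/O
            if (value != null) {                 // documents without the field are skipped
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] values = {"r1", "r2", "r1", null, "r3"};
        int[] hits = {0, 1, 2, 3};
        System.out.println(countByField(values, hits));
    }
}
```

Compared with the loop in the question, the only per-hit work left is an array read and a map update, which is where the speedup from a field cache would come from.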
There is also a very new grouping module, available as a patch at https://issues.apache.org/jira/browse/LUCENE-1421, that will do this.