Lucene Field Grouping - lucene

Say I have the fields stud_roll_number and date_leave.
select stud_roll_number, count(*) from some_table where date_leave > some_date group by stud_roll_number;
How can I write the same query using Lucene? After querying for date_leave > some_date, I tried:
Map<String, Integer> mapGrouper = new HashMap<String, Integer>();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document doc = search.doc(scoreDoc.doc);
    String value = doc.get(fieldName);
    Integer key = mapGrouper.get(value);
    if (key == null) {
        key = 1;
    } else {
        key = key + 1;
    }
    mapGrouper.put(value, key);
}
But I have a huge data set, and it takes a long time to compute this. Is there another way to do it? Thanks in advance.

Your performance bottleneck is almost certainly the I/O it takes to perform the document and field value lookups. What you want to do in this situation is use a FieldCache for the field you want to group by. Once you have a field cache, you can look up the values by Lucene doc ID, which will be fast because all the values are in memory.
Also remember to give your HashMap an initial capacity to avoid array resizing.
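As a rough illustration, here is a minimal sketch of the FieldCache approach against the Lucene 3.x API; reader is assumed to be the IndexReader behind your searcher, topDocs and fieldName are the same as in your snippet, and the grouping field must be indexed as a single untokenized token:
// Load all values of the field into memory once (cached per reader)
String[] groupValues = FieldCache.DEFAULT.getStrings(reader, fieldName);
// Give the map an initial capacity to avoid repeated resizing
Map<String, Integer> mapGrouper = new HashMap<String, Integer>(1024);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    String value = groupValues[scoreDoc.doc]; // in-memory lookup by Lucene doc ID, no stored-field I/O
    Integer count = mapGrouper.get(value);
    mapGrouper.put(value, count == null ? 1 : count + 1);
}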

There is a very new grouping module, available as a patch on https://issues.apache.org/jira/browse/LUCENE-1421, that will do this.

Related

Get distinct values for a group of fields from a list of records

We are using Liferay (6.2 CE GA4) with Lucene to perform search on custom assets. Currently we can retrieve the proper hits and the full documents.
We want to return a unique combination of certain fields for our custom asset.
To make it more clear, we want to do something similar to the following SQL query but using Lucene in Liferay:
SELECT DISTINCT
field01, field02, field03
FROM
FieldsTable
WHERE
someOtherField04 LIKE '%test%'
ORDER BY
field01 ASC, field02 ASC, field03 ASC;
How we are doing it currently
Currently we are manually fetching field values by iterating through all the documents and then filtering out duplicate combinations. This process takes time when there are more than 5k records to process on each request, even though the distinct values usually amount to only a few hundred records.
Any help is much appreciated.
Thanks
P.S.: Also cross-posted on Liferay forums: https://www.liferay.com/community/forums/-/message_boards/message/55513210
First you need to create the SearchContext for your query (just as reference):
SearchContext searchContext = new SearchContext();
searchContext.setAndSearch(true);
// Add any specific attributes for your use case below:
Map<String, Serializable> attributes = new HashMap<>();
attributes.put(Field.CLASS_NAME_ID, 0L);
attributes.put(Field.DESCRIPTION, null);
attributes.put(Field.STATUS, String.valueOf(WorkflowConstants.STATUS_APPROVED));
attributes.put(Field.TITLE, null);
attributes.put(Field.TYPE, null);
attributes.put("articleId", null);
attributes.put("ddmStructureKey", ...);
attributes.put("ddmTemplateKey", ...);
attributes.put("params", new LinkedHashMap<String, Object>());
searchContext.setAttributes(attributes);
searchContext.setCompanyId(... the ID of my portal instance ..);
searchContext.setGroupIds(new long[] { ... the ID of the site ... });
searchContext.setFolderIds(new long[] {});
Now you can find the list of all values for one or more specific fields:
// We don't need any result document, just the field values
searchContext.setStart(0);
searchContext.setEnd(0);
// A facet is responsible for collecting the values
final MultiValueFacet fieldFacet = new MultiValueFacet(searchContext);
String fieldNameInLucene = "ddm/" + structureId + "/" + fieldName + "_" + LocaleUtil.toLanguageId(locale);
fieldFacet.setFieldName(fieldNameInLucene);
searchContext.addFacet(fieldFacet);
// Do search
IndexerRegistryUtil.getIndexer(JournalArticle.class).search(searchContext);
// Retrieve all terms
final List<String> terms = new ArrayList<>();
for (final TermCollector collector : fieldFacet.getFacetCollector().getTermCollectors()) {
terms.add(collector.getTerm());
}
In the end, terms will contain all terms of your field from all matching documents.
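If you also need the ORDER BY part, one option (a sketch, assuming plain lexicographic ordering is good enough) is to sort the collected distinct terms in memory and to repeat the facet setup once per field (field01, field02, field03):
// Sort the distinct values client-side to mimic ORDER BY ... ASC
Collections.sort(terms);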

Tuples are not inserted sequentially in database table?

I am trying to insert 10 values of the format "typename_" + i, where i is the loop counter, into a table named roomtype with attributes typename (the primary key, of SQL type character varying(45)) and samplephoto (it can be NULL and I am not dealing with it for now). What seems strange to me is that the tuples end up in a different order than the one in which the loop inserted them. That is:
typename_1
typename_10
typename_2
typename_3
...
I suppose it's not very important but I can't understand why this is happening. I am using PostgreSQL 9.3.4, pgAdmin III version 1.18.1 and Eclipse Kepler.
The Java code that creates the connection (using JDBC driver) and makes the query is:
import java.sql.*;
import java.util.Random;
public class DBC {
    Connection _conn;

    public DBC() throws Exception {
        try {
            Class.forName("org.postgresql.Driver");
        } catch (java.lang.ClassNotFoundException e) {
            java.lang.System.err.print("ClassNotFoundException: Postgres Server JDBC");
            java.lang.System.err.println(e.getMessage());
            throw new Exception("No JDBC Driver found in Server");
        }
        try {
            _conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/hotelreservation", "user", "0000");
            ZipfGenerator p = new ZipfGenerator(new Random(System.currentTimeMillis()));
            _conn.setCatalog("jdbcTest");
            Statement statement = _conn.createStatement();
            String query;
            for (int i = 1; i <= 10; i++) {
                String roomtype_typename = "typename_" + i;
                query = "INSERT INTO roomtype VALUES ('" + roomtype_typename + "','" + "NULL" + "')";
                System.out.println(i);
                statement.execute(query);
            }
        } catch (SQLException E) {
            java.lang.System.out.println("SQLException: " + E.getMessage());
            java.lang.System.out.println("SQLState: " + E.getSQLState());
            java.lang.System.out.println("VendorError: " + E.getErrorCode());
            throw E;
        }
    }
}
But what I get in the pgAdmin table is the alphabetical order listed above, not the insertion order.
This is a misunderstanding. There is no "natural" order in a relational database table. While rows are normally inserted in sequence to the physical file holding a table, a wide range of activities can reshuffle physical order. And queries doing anything more than a basic (non-parallelized) sequential scan may return rows in any opportune order. That's according to standard SQL.
The order you see is arbitrary unless you add ORDER BY to the query.
pgAdmin3 by default orders rows by the primary key (unless specified otherwise). Your column is of type varchar and rows are ordered alphabetically (according to your current locale). All by design, all as it should be.
To sort rows like you seem to be expecting, you could pad some '0' in your text:
...
typename_0009
typename_0010
...
The proper solution would be to have a numeric column with just the number, though.
You may be interested in natural-sort. You may also be interested in a serial column.
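As a sketch of the padding idea in the Java loop above (String.format with a fixed width keeps the alphabetical order identical to the numeric order; the width of 4 is an arbitrary choice):
for (int i = 1; i <= 10; i++) {
    // Zero-pad the counter so that alphabetical and numeric order agree
    String roomtype_typename = String.format("typename_%04d", i);
    // NULL without quotes inserts an SQL NULL rather than the string 'NULL'
    statement.execute("INSERT INTO roomtype VALUES ('" + roomtype_typename + "', NULL)");
}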
I guess the output is ordered alphabetically. If you only create typename_1 through typename_9, everything should look fine. You can also use typename_01 (padded with zeros) to get the expected order.
If you are unsure about that, you can also add a sleep between the insert statements and record the insert time in the database (as a column).
You are not seeing the order in which PostgreSQL stores the data, but rather the order in which pgAdmin displays it.
The edit-table feature of pgAdmin automatically sorts the data by the primary key by default; that is what you are seeing.
In general, databases store table data in whatever order is convenient. Since you did not supply an ORDER BY, you cannot expect any particular order.
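The only reliable way to get a defined order back is to ask for one in the query itself; a minimal JDBC sketch, reusing the statement from the code above (with the zero-padded or numeric variant discussed earlier, this yields the expected 1..10 sequence):
// An explicit ORDER BY is the only guarantee of row order in SQL
ResultSet rs = statement.executeQuery("SELECT typename FROM roomtype ORDER BY typename");
while (rs.next()) {
    System.out.println(rs.getString("typename"));
}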

Berkeley DB equivalent of SELECT COUNT(*) All, SELECT COUNT(*) WHERE LIKE "%...%"

I'm looking for the Berkeley DB equivalent of
SELECT COUNT(*) (over all records) and SELECT COUNT(*) WHERE key LIKE "%...%".
I have got 100 records with keys: 1, 2, 3, ... 100.
I have got the following code:
//Key = 1
i = 1;
strcpy_s(buf, to_string(i).size()+1, to_string(i).c_str());
key.data = buf;
key.size = to_string(i).size()+1;
key.flags = 0;
data.data = rbuf;
data.size = sizeof(rbuf)+1;
data.flags = 0;
//Cursor
if ((ret = dbp->cursor(dbp, NULL, &dbcp, 0)) != 0) {
    dbp->err(dbp, ret, "DB->cursor");
    goto err1;
}
//Get
dbcp->get(dbcp, &key, &data, DB_SET_RANGE);
db_recno_t cnt;
dbcp->count(dbcp, &cnt, 0);
cout << "count: " << cnt << endl;
The count cnt is always 1, but I expected it to count all the partial key matches for Key = 1: 1, 10, 11, 21, ... 91.
What is wrong in my code or in my understanding of DB_SET_RANGE?
Is it possible to get SELECT COUNT(*) WHERE ... LIKE "%...%" in BDB?
Also, is it possible to get a SELECT COUNT(*) of all records in the file?
Thanks
You're expecting Berkeley DB to be way more high-level than it actually is. It doesn't contain anything like what you're asking for. If you want the equivalent of WHERE field LIKE '%1%' you have to make a cursor, read through all the values in the DB, and do the string comparison yourself to pick out the ones that match. That's what an SQL engine actually does to implement your query, and if you're using libdb instead of an SQL engine, it's up to you. If you want it done faster, you can use a secondary index (much like you can create additional indexes for a table in SQL), but you have to provide some code that links the secondary index to the main DB.
DB_SET_RANGE is useful to optimize a very specific case: you're looking for items whose key starts with a specific substring. You can DB_SET_RANGE to find the first matching key, then DB_NEXT your way through the matches, and stop when you get a key that doesn't match. This works only on DB_BTREE databases because it depends on the keys being returned in lexical order.
The count method tells you how many exact duplicate keys there are for the item at the current cursor position.
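Since the rest of this page is Java, here is a hedged sketch of that DB_SET_RANGE / DB_NEXT prefix scan using the Berkeley DB Java Edition API (com.sleepycat.je); getSearchKeyRange and getNext are the cursor counterparts of those C flags, and db is assumed to be an already opened BTREE Database:
// requires: import com.sleepycat.je.*;
long countKeysWithPrefix(Database db, String prefix) throws Exception {
    DatabaseEntry key = new DatabaseEntry(prefix.getBytes("UTF-8"));
    DatabaseEntry data = new DatabaseEntry();
    long count = 0;
    Cursor cursor = db.openCursor(null, null);
    try {
        // Position on the first key >= prefix (the C API's DB_SET_RANGE)
        OperationStatus status = cursor.getSearchKeyRange(key, data, LockMode.DEFAULT);
        // Walk forward in key order while keys still start with the prefix (the C API's DB_NEXT)
        while (status == OperationStatus.SUCCESS
                && new String(key.getData(), "UTF-8").startsWith(prefix)) {
            count++;
            status = cursor.getNext(key, data, LockMode.DEFAULT);
        }
    } finally {
        cursor.close();
    }
    return count;
}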
You can use the DB->stat() method. For example, to get the number of unique keys in a DB_BTREE database:
bool row_amount(DB *db, size_t &amount) {
    amount = 0;
    if (db == NULL) return false;
    DB_BTREE_STAT *sp;
    int ret = db->stat(db, NULL, &sp, 0);
    if (ret != 0) return false;
    amount = (size_t)sp->bt_nkeys;
    return true;
}

lucene - most relevant search and sort the results

I am trying to make a search page based on the data we have. Here is my code.
SortField sortField = new SortField(TEXT_FIELD_RANK, SortField.Type.INT, true);
Sort sort = new Sort(sortField);
Query q = queryParser.parse(useQuery);
TopDocs topDocs = searcher.search(q, totalLimit, sort);
ScoreDoc[] hits = topDocs.scoreDocs;
log.info("totalResults=" + topDocs.totalHits);
int index = getStartIndex(start, maxReturn);
int resultsLength = start * maxReturn;
if (resultsLength > totalLimit) {
    resultsLength = totalLimit;
}
log.info("index:" + index + "==resultsLength:" + resultsLength);
for (int i = index; i < resultsLength; ++i) {
    // exact-match check and result collection happen here
}
Basically, here is my requirement: if there is an exact match, I need to display the exact match first; if there is no exact match, I need to sort the results by the field. So I check for the exact match inside the for loop.
But it seems that the results are sorted no matter what, so even when there is an exact match, it doesn't show up on the first page.
Thanks.
You set it to sort on a field value, not on relevance, so there is no guarantee that the best matches will be on the first page. You can sort by relevance first and then by your field value, like:
Sort sort = new Sort(SortField.FIELD_SCORE, sortField);
That may be what you are looking for.
Otherwise, if you are looking to ignore relevance for anything except a direct match, you could query using a more restrictive (exact matching) query first, getting your exact matches as an entirely separate result set.
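A rough sketch of that two-pass idea (the field name "exactField" and the variable userInput are made-up placeholders; the exact value is assumed to be indexed as a single untokenized term):
// Pass 1: look only for an exact match
TopDocs exact = searcher.search(new TermQuery(new Term("exactField", userInput)), totalLimit);
if (exact.totalHits > 0) {
    // show the exact matches first
} else {
    // Pass 2: fall back to the broader query, sorted by the rank field as before
    TopDocs fallback = searcher.search(queryParser.parse(useQuery), totalLimit, sort);
}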

How to get reliable docid from Lucene 3.0.3?

I would like to get the int docid of a Document I just added to a Lucene index so that I can stick it into a Filter to update a standing query. My documents have a unique external id, so I thought that doing a TermDocs enumeration on the unique id would return the correct document, like this:
protected int getDocId(IndexReader reader, String idField, Document doc) throws IOException {
    String id = doc.get(idField);
    TermDocs termDocs = reader.termDocs(new Term(idField, id));
    int docid = -1;
    while (termDocs.next()) {
        docid = termDocs.doc();
        Document aDoc = reader.document(docid);
        String docIdString = aDoc.get(idField);
        System.out.println(docIdString + ": " + docid);
    }
    return docid;
}
Unfortunately, this loops and loops, returning the same docIdString and increasing docids.
What is the recommended way to get the docids for newly-added documents so that I can use them in a Filter immediately after the documents are committed?
The doc Id of a document is not the same as the value in your id field. The doc ID is an internal Lucene identifier, which you probably shouldn't access. Your field is just a field - you can call it "ID", but Lucene won't do anything special with it.
Why are you trying to manually update the filter? When you commit, merges can happen etc. so the IDs before will not be the same as the IDs afterwards. (Which is just an example of the general point that you shouldn't rely on Lucene IDs for anything.) So you don't need to just add that one document to the filter, you need to update the whole thing.
To update a cached filter, just run a query for "foo" and use your filter with a CachingWrapperFilter.
EDIT: Because your id field is just a field, you do a search for it like you would anything else:
TopDocs results = searcher.search(new TermQuery(new Term("MyIdField", id)), 1);
int internalId = results.scoreDocs[0].doc;
However, like I said, I think you want to ignore internal IDs. So I would build a filter from a query:
BooleanQuery filterQuery = new BooleanQuery(); // or get the existing query from a cache
filterQuery.add(new TermQuery(new Term("MyIdField", id)), BooleanClause.Occur.SHOULD);
// add more sub-queries for each ID you want in the filter here
Filter myFilter = new CachingWrapperFilter(new QueryWrapperFilter(filterQuery));
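For completeness, a small usage sketch (Lucene 3.x): the cached filter is then handed to an ordinary search, so only documents whose IDs were added to the filter can match:
// Apply the cached filter to any query; a match-all query returns exactly the filtered documents
TopDocs filtered = searcher.search(new MatchAllDocsQuery(), myFilter, 10);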