Finding total number of matches to a query with lucene - lucene

I'm new to lucene so I don't know if it is possible, but I have an index and I would like to get the total amount of phrases in a subset of the index(the subset is defined by a filter).
I can use FilteredQuery with my Filter and a PhraseQuery to search for the phrase and thus I can count the documents in which this phrase occurs, but I can't seem to find a way to count the number of matches per document as well.

You can do this, see LUCENE-2590 for details.
For example code you can look at the unit tests for this feature.
I've copied the relevant code for phrase searchers below,
This is the collector,
private static class CountingCollector extends Collector {
private final Collector other;
private int docBase;
public final Map<Integer, Map<Query, Float>> docCounts = new HashMap<Integer, Map<Query, Float>>();
private final Map<Query, Scorer> subScorers = new HashMap<Query, Scorer>();
private final ScorerVisitor<Query, Query, Scorer> visitor = new MockScorerVisitor();
private final EnumSet<Occur> collect;
private class MockScorerVisitor extends ScorerVisitor<Query, Query, Scorer> {
#Override
public void visitOptional(Query parent, Query child, Scorer scorer) {
if (collect.contains(Occur.SHOULD))
subScorers.put(child, scorer);
}
#Override
public void visitProhibited(Query parent, Query child, Scorer scorer) {
if (collect.contains(Occur.MUST_NOT))
subScorers.put(child, scorer);
}
#Override
public void visitRequired(Query parent, Query child, Scorer scorer) {
if (collect.contains(Occur.MUST))
subScorers.put(child, scorer);
}
}
public CountingCollector(Collector other) {
this(other, EnumSet.allOf(Occur.class));
}
public CountingCollector(Collector other, EnumSet<Occur> collect) {
this.other = other;
this.collect = collect;
}
#Override
public void setScorer(Scorer scorer) throws IOException {
other.setScorer(scorer);
scorer.visitScorers(visitor);
}
#Override
public void collect(int doc) throws IOException {
final Map<Query, Float> freqs = new HashMap<Query, Float>();
for (Map.Entry<Query, Scorer> ent : subScorers.entrySet()) {
Scorer value = ent.getValue();
int matchId = value.docID();
freqs.put(ent.getKey(), matchId == doc ? value.freq() : 0.0f);
}
docCounts.put(doc + docBase, freqs);
other.collect(doc);
}
#Override
public void setNextReader(IndexReader reader, int docBase)
throws IOException {
this.docBase = docBase;
other.setNextReader(reader, docBase);
}
#Override
public boolean acceptsDocsOutOfOrder() {
return other.acceptsDocsOutOfOrder();
}
}
The unit test is,
#Test
public void testPhraseQuery() throws Exception {
PhraseQuery q = new PhraseQuery();
q.add(new Term("f", "b"));
q.add(new Term("f", "c"));
CountingCollector c = new CountingCollector(TopScoreDocCollector.create(10,
true));
s.search(q, null, c);
final int maxDocs = s.maxDoc();
assertEquals(maxDocs, c.docCounts.size());
for (int i = 0; i < maxDocs; i++) {
Map<Query, Float> doc0 = c.docCounts.get(i);
assertEquals(1, doc0.size());
assertEquals(2.0F, doc0.get(q), FLOAT_TOLERANCE);
Map<Query, Float> doc1 = c.docCounts.get(++i);
assertEquals(1, doc1.size());
assertEquals(1.0F, doc1.get(q), FLOAT_TOLERANCE);
}
}

Related

What happens if no records exists when I use Room to query?

I use Room in my Android Studio App, the Code A will crash when no record exists to query
What happens if no records exists when I use Room to query with Code B ?
#Dao
interface RecordDao {
// Code A
#Query("SELECT * FROM record_table where id=:id")
fun getByID(id:Int): Flow<RecordEntity>
// Code B
#Query("SELECT * FROM record_table")
fun getAll(): Flow<List<RecordEntity>>
}
If you look at the code generated by room for the RecordDao_Impl class implementation of RecordDao, you'll notice multiple things:
The getById function code returns null when no matching records exist in the table, and since your 'Code A' function return type is not null, kotlin throws a NullPointerException for the first function.
The getAll function code returns a Flow object with an ArrayList, in case it found any records then it adds them to the list, otherwise it just emits the empty ArrayList to the Flow object, therefore, you'll always get a flow object with a list inside it regardless if room found matching records or not, so no exception is thrown.
You can understand this a bit more if you look at the code generated for the two functions here:
#Override
public Flow<RecordEntity> getByID(final int id) {
final String _sql = "SELECT * FROM record_table where id=?";
final RoomSQLiteQuery _statement = RoomSQLiteQuery.acquire(_sql, 1);
int _argIndex = 1;
_statement.bindLong(_argIndex, id);
return CoroutinesRoom.createFlow(__db, false, new String[]{"record_table"}, new Callable<RecordEntity>() {
#Override
public RecordEntity call() throws Exception {
final Cursor _cursor = DBUtil.query(__db, _statement, false, null);
try {
final int _cursorIndexOfId = CursorUtil.getColumnIndexOrThrow(_cursor, "id");
final RecordEntity _result;
if (_cursor.moveToFirst()) {
final int _tmpId;
_tmpId = _cursor.getInt(_cursorIndexOfId);
_result = new RecordEntity(_tmpId);
} else {
_result = null;
}
return _result;
} finally {
_cursor.close();
}
}
#Override
protected void finalize() {
_statement.release();
}
});
}
#Override
public Flow<List<RecordEntity>> getAll() {
final String _sql = "SELECT * FROM record_table";
final RoomSQLiteQuery _statement = RoomSQLiteQuery.acquire(_sql, 0);
return CoroutinesRoom.createFlow(__db, false, new String[]{"record_table"}, new Callable<List<RecordEntity>>() {
#Override
public List<RecordEntity> call() throws Exception {
final Cursor _cursor = DBUtil.query(__db, _statement, false, null);
try {
final int _cursorIndexOfId = CursorUtil.getColumnIndexOrThrow(_cursor, "id");
final List<RecordEntity> _result = new ArrayList<RecordEntity>(_cursor.getCount());
while (_cursor.moveToNext()) {
final RecordEntity _item;
final int _tmpId;
_tmpId = _cursor.getInt(_cursorIndexOfId);
_item = new RecordEntity(_tmpId);
_result.add(_item);
}
return _result;
} finally {
_cursor.close();
}
}
#Override
protected void finalize() {
_statement.release();
}
});
}

Recyclerview: can't get the item count value

im newbie here... I have a problem in getting the item count on my recyclerview? I've try this code but it's not working.
int count = 0;
if (recyclerViewInstance.getAdapter() != null) {
count = recyclerViewInstance.getAdapter().getItemCount();
}
also this one but it's not working too..
int count = 0;
if (mAdapter != null) {
count = mAdapter.getItemCount();
}
this is my code:
mainActivity:
private List<NavDrawerFleetGetterSetter> navList= new ArrayList<>();
private RecyclerView recyclerView;
private NavDrawerFleetAdapter mAdapter;
Button add;
TextView successCount;
protected void onCreate(final Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_nav_drawer_fleet);
Toolbar toolbar = findViewById(R.id.toolbar);
setSupportActionBar(toolbar);
add = findViewById(R.id.fab);
successCount=findViewById(R.id.count);
recyclerView =findViewById(R.id.nav_sent);
mAdapter = new NavDrawerFleetAdapter(navList);
RecyclerView.LayoutManager mLayoutManager = new
LinearLayoutManager(getApplicationContext());
recyclerView.setLayoutManager(mLayoutManager);
recyclerView.setAdapter(mAdapter);
if (recyclerView.getAdapter() != null) {
successCount.getText(recyclerViewInstance.getAdapter().getItemCount());
}
add.setOnClickListener(new View.OnClickListener() {
#Override
public void onClick(View view) {
NavDrawerFleetGetterSetter navD= new
NavDrawerFleetGetterSetter(lati,longi,dateTime);
navList.add(navD);
mAdapter.notifyDataSetChanged();
}
myGetterSetter:
public class NotSentModuleGetterSetter {
String lat,lon,dateTime;
public NotSentModuleGetterSetter(String lat, String lon, String dateTime) {
this.lat = lat;
this.lon = lon;
this.dateTime = dateTime;
}
public String getLat() {
return lat;
}
public void setLat(String lat) {
this.lat = lat;
}
public String getLon() {
return lon;
}
public void setLon(String lon) {
this.lon = lon;
}
public String getDateTime() {
return dateTime;
}
public void setDateTime(String dateTime) {
this.dateTime = dateTime;
}
}
myOutput:
as you see, the success count was turned into zero.
btw, my data was getting on my database and i was pickup the code need since i have a bunch of codes as for now for my project.
an also, using debug, the data of my mAdapter=null.
In this case, i was getting the count data on my adapter but it should be get the "Size" of adapter than getting the "count". It is now working on me. :)

Lucene sort by score and then modified date

I have three fields in my document
Title
Content
Modified Date
So when I search a term it's giving by results sorted by score
Now I would like to further sort the results with same score based upon on modifiedDate i.e. showing recent documents on top with the same score.
I tried sort by score, modified date but it's not working. Anyone can point me to the right direction?
This can be done simply by defining a Sort:
Sort sort = new Sort(
SortField.FIELD_SCORE,
new SortField("myDateField", SortField.Type.STRING));
indexSearcher.search(myQuery, numHits, sort);
Two possible gotchas here:
You should make sure your date is indexed in a searchable, and sortable, form. Generally, the best way to accomplish this is to convert it using DateTools.
The field used for sorting must be indexed, and should not be analyzed (a StringField, for instance). Up to you whether it is stored.
So adding the date field might look something like:
Field dateField = new StringField(
"myDateField",
DateTools.DateToString(myDateInstance, DateTools.Resolution.MINUTE),
Field.Store.YES);
document.add(dateField);
Note: You can also index dates as a numeric field using Date.getTime(). I prefer the DateTools string approach, as it provides some nicer tools for handling them, particularly with regards to precision, but either way can work.
You can use a custom collector for solving this problem. It will sort result by score, then by timestamp. In this collector you should retrieve the timestamp value for second sorting. See class below
public class CustomCollector extends TopDocsCollector<ScoreDocWithTime> {
ScoreDocWithTime pqTop;
// prevents instantiation
public CustomCollector(int numHits) {
super(new HitQueueWithTime(numHits, true));
// HitQueue implements getSentinelObject to return a ScoreDoc, so we know
// that at this point top() is already initialized.
pqTop = pq.top();
}
#Override
public LeafCollector getLeafCollector(LeafReaderContext context)
throws IOException {
final int docBase = context.docBase;
final NumericDocValues modifiedDate =
DocValues.getNumeric(context.reader(), "modifiedDate");
return new LeafCollector() {
Scorer scorer;
#Override
public void setScorer(Scorer scorer) throws IOException {
this.scorer = scorer;
}
#Override
public void collect(int doc) throws IOException {
float score = scorer.score();
// This collector cannot handle these scores:
assert score != Float.NEGATIVE_INFINITY;
assert !Float.isNaN(score);
totalHits++;
if (score <= pqTop.score) {
// Since docs are returned in-order (i.e., increasing doc Id), a document
// with equal score to pqTop.score cannot compete since HitQueue favors
// documents with lower doc Ids. Therefore reject those docs too.
return;
}
pqTop.doc = doc + docBase;
pqTop.score = score;
pqTop.timestamp = modifiedDate.get(doc);
pqTop = pq.updateTop();
}
};
}
#Override
public boolean needsScores() {
return true;
}
}
Also to do second sorting you need add to ScoreDoc an additional field
public class ScoreDocWithTime extends ScoreDoc {
public long timestamp;
public ScoreDocWithTime(long timestamp, int doc, float score) {
super(doc, score);
this.timestamp = timestamp;
}
public ScoreDocWithTime(long timestamp, int doc, float score, int shardIndex) {
super(doc, score, shardIndex);
this.timestamp = timestamp;
}
}
and create a custom priority queue to support this
public class HitQueueWithTime extends PriorityQueue<ScoreDocWithTime> {
public HitQueueWithTime(int numHits, boolean b) {
super(numHits, b);
}
#Override
protected ScoreDocWithTime getSentinelObject() {
return new ScoreDocWithTime(0, Integer.MAX_VALUE, Float.NEGATIVE_INFINITY);
}
#Override
protected boolean lessThan(ScoreDocWithTime hitA, ScoreDocWithTime hitB) {
if (hitA.score == hitB.score)
return (hitA.timestamp == hitB.timestamp) ?
hitA.doc > hitB.doc :
hitA.timestamp < hitB.timestamp;
else
return hitA.score < hitB.score;
}
}
After this you can search result as you need. See example below
public class SearchTest {
public static void main(String[] args) throws IOException {
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
Directory directory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
addDoc(indexWriter, "w1", 1000);
addDoc(indexWriter, "w1", 3000);
addDoc(indexWriter, "w1", 500);
addDoc(indexWriter, "w1 w2", 1000);
addDoc(indexWriter, "w1 w2", 3000);
addDoc(indexWriter, "w1 w2", 2000);
addDoc(indexWriter, "w1 w2", 5000);
final IndexReader indexReader = DirectoryReader.open(indexWriter, false);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("desc", "w1")), BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("desc", "w2")), BooleanClause.Occur.SHOULD);
CustomCollector results = new CustomCollector(100);
indexSearcher.search(query, results);
TopDocs search = results.topDocs();
for (ScoreDoc sd : search.scoreDocs) {
Document document = indexReader.document(sd.doc);
System.out.println(document.getField("desc").stringValue() + " " + ((ScoreDocWithTime) sd).timestamp);
}
}
private static void addDoc(IndexWriter indexWriter, String decs, long modifiedDate) throws IOException {
Document doc = new Document();
doc.add(new TextField("desc", decs, Field.Store.YES));
doc.add(new LongField("modifiedDate", modifiedDate, Field.Store.YES));
doc.add(new NumericDocValuesField("modifiedDate", modifiedDate));
indexWriter.addDocument(doc);
}
}
Program will output following results
w1 w2 5000
w1 w2 3000
w1 w2 2000
w1 w2 1000
w1 3000
w1 1000
w1 500
P.S. this solution for Lucene 5.1

Oracle Coherence index not working with ContainsFilter query

I've added an index to a cache. The index uses a custom extractor that extends AbstractExtractor and overrides only the extract method to return a List of Strings. Then I have a ContainsFilter which uses the same custom extractor that looks for the occurence of a single String in the List of Strings. It does not look like my index is being used based on the time it takes to execute my test. What am I doing wrong? Also, is there some debugging I can switch on to see which indices are used?
public class DependencyIdExtractor extends AbstractExtractor {
/**
*
*/
private static final long serialVersionUID = 1L;
#Override
public Object extract(Object oTarget) {
if (oTarget == null) {
return null;
}
if (oTarget instanceof CacheValue) {
CacheValue cacheValue = (CacheValue)oTarget;
// returns a List of String objects
return cacheValue.getDependencyIds();
}
throw new UnsupportedOperationException();
}
}
Adding the index:
mCache = CacheFactory.getCache(pCacheName);
mCache.addIndex(new DependencyIdExtractor(), false, null);
Performing the ContainsFilter query:
public void invalidateByDependencyId(String pDependencyId) {
ContainsFilter vContainsFilter = new ContainsFilter(new DependencyIdExtractor(), pDependencyId);
#SuppressWarnings("rawtypes")
Set setKeys = mCache.keySet(vContainsFilter);
mCache.keySet().removeAll(setKeys);
}
I solved this by adding a hashCode and equals method implementation to the DependencyIdExtractor class. It is important that you use exactly the same value extractor when adding an index and creating your filter.
public class DependencyIdExtractor extends AbstractExtractor {
/**
*
*/
private static final long serialVersionUID = 1L;
#Override
public Object extract(Object oTarget) {
if (oTarget == null) {
return null;
}
if (oTarget instanceof CacheValue) {
CacheValue cacheValue = (CacheValue)oTarget;
return cacheValue.getDependencyIds();
}
throw new UnsupportedOperationException();
}
#Override
public int hashCode() {
return 1;
}
#Override
public boolean equals(Object obj) {
if (obj == null) {
return false;
}
if (obj instanceof DependencyIdExtractor) {
return true;
}
return false;
}
}
To debug Coherence indices/queries, you can generate an explain plan similar to database query explain plans.
http://www.oracle.com/technetwork/tutorials/tutorial-1841899.html
#SuppressWarnings("unchecked")
public void invalidateByDependencyId(String pDependencyId) {
ContainsFilter vContainsFilter = new ContainsFilter(new DependencyIdExtractor(), pDependencyId);
if (mLog.isTraceEnabled()) {
QueryRecorder agent = new QueryRecorder(RecordType.EXPLAIN);
Object resultsExplain = mCache.aggregate(vContainsFilter, agent);
mLog.trace("resultsExplain = \n" + resultsExplain + "\n");
}
#SuppressWarnings("rawtypes")
Set setKeys = mCache.keySet(vContainsFilter);
mCache.keySet().removeAll(setKeys);
}

How does hive achieve count(distinct ...)?

In the GenericUDAFCount.java:
#Description(name = "count",
value = "_FUNC_(*) - Returns the total number of retrieved rows, including "
+ "rows containing NULL values.\n"
+ "_FUNC_(expr) - Returns the number of rows for which the supplied "
+ "expression is non-NULL.\n"
+ "_FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for "
+ "which the supplied expression(s) are unique and non-NULL.")
but I don`t see any code to deal with the 'distinct' expression.
public static class GenericUDAFCountEvaluator extends GenericUDAFEvaluator {
private boolean countAllColumns = false;
private LongObjectInspector partialCountAggOI;
private LongWritable result;
#Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters)
throws HiveException {
super.init(m, parameters);
partialCountAggOI =
PrimitiveObjectInspectorFactory.writableLongObjectInspector;
result = new LongWritable(0);
return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
}
private GenericUDAFCountEvaluator setCountAllColumns(boolean countAllCols) {
countAllColumns = countAllCols;
return this;
}
/** class for storing count value. */
static class CountAgg implements AggregationBuffer {
long value;
}
#Override
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
CountAgg buffer = new CountAgg();
reset(buffer);
return buffer;
}
#Override
public void reset(AggregationBuffer agg) throws HiveException {
((CountAgg) agg).value = 0;
}
#Override
public void iterate(AggregationBuffer agg, Object[] parameters)
throws HiveException {
// parameters == null means the input table/split is empty
if (parameters == null) {
return;
}
if (countAllColumns) {
assert parameters.length == 0;
((CountAgg) agg).value++;
} else {
assert parameters.length > 0;
boolean countThisRow = true;
for (Object nextParam : parameters) {
if (nextParam == null) {
countThisRow = false;
break;
}
}
if (countThisRow) {
((CountAgg) agg).value++;
}
}
}
#Override
public void merge(AggregationBuffer agg, Object partial)
throws HiveException {
if (partial != null) {
long p = partialCountAggOI.get(partial);
((CountAgg) agg).value += p;
}
}
#Override
public Object terminate(AggregationBuffer agg) throws HiveException {
result.set(((CountAgg) agg).value);
return result;
}
#Override
public Object terminatePartial(AggregationBuffer agg) throws HiveException {
return terminate(agg);
}
}
How does hive achieve count(distinct ...)? When task runs, it really cost much time.
Where is it in the source code?
As you can just run SELECT DISTINCT column1 FROM table1, DISTINCT expression isn't a flag or option, it's evaluated independently
This page says:
The actual filtering of data bound to parameter types for DISTINCT
implementation is handled by the framework and not the COUNT UDAF
implementation.
If you want drill down to source details, have a look into hive git repository