HBase secondary index with observer coprocessor, .put on the index table results in recursion

In HBase database I want to create a secondary index by using additional "linking" table. I have followed the example given in this answer: Create secondary index using coprocesor HBase
I am not very familiar with the entire concept of HBase, but I have read some examples on creating secondary indexes. I am attaching the coprocessor to a single table only, like this:
disable 'Entry2'
alter 'Entry2', METHOD => 'table_att', 'COPROCESSOR' => '/home/user/hbase/rootdir/hcoprocessors.jar|com.acme.hobservers.EntryParentIndex||'
enable 'Entry2'
Its source code is as follows:
public class EntryParentIndex extends BaseRegionObserver {
    private static final Log LOG = LogFactory.getLog(CoprocessorHost.class);

    private HTablePool pool = null;

    private final static String INDEX_TABLE = "EntryParentIndex";
    private final static String SOURCE_TABLE = "Entry2";

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        pool = new HTablePool(env.getConfiguration(), 10);
    }

    @Override
    public void prePut(
            final ObserverContext<RegionCoprocessorEnvironment> observerContext,
            final Put put,
            final WALEdit edit,
            final boolean writeToWAL)
            throws IOException {
        try {
            final List<KeyValue> filteredList = put.get(Bytes.toBytes("data"), Bytes.toBytes("parentId"));
            byte[] id = put.getRow(); // Get the Entry ID
            KeyValue kv = filteredList.get(0); // Get the Entry PARENT_ID
            byte[] parentId = kv.getValue();

            HTableInterface htbl = pool.getTable(Bytes.toBytes(INDEX_TABLE));

            // Create the row key for the index table
            byte[] p1 = concatTwoByteArrays(parentId, ":".getBytes()); // insert a colon between the two UUIDs
            byte[] rowkey = concatTwoByteArrays(p1, id);
            Put indexput = new Put(rowkey);

            // The following call sets up a strange(?) recursion, resulting
            // ...in the same prePut method being invoked again and again. Interestingly,
            // ...the recursion limits itself to 6 iterations. The actual row does not
            // ...get inserted into the INDEX_TABLE.
            htbl.put(indexput);

            htbl.close();
        } catch (IllegalArgumentException ex) {
        }
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        pool.close();
    }

    public static final byte[] concatTwoByteArrays(byte[] first, byte[] second) {
        byte[] result = Arrays.copyOf(first, first.length + second.length);
        System.arraycopy(second, 0, result, first.length, second.length);
        return result;
    }
}
This executes when I perform put on the SOURCE_TABLE.
There is a comment in the code (please look for it) starting with "The following call sets up a strange(?) recursion".
I set a debug print in the log confirming that the prePut method is executed only on the SOURCE_TABLE and never on the INDEX_TABLE. Yet I don't understand why this strange recursion happens, given that in the coprocessor I execute only one put, on the INDEX_TABLE.
I have also confirmed that only one put is performed on the source table.

I have fixed my problem. It turned out that I was adding the same observer multiple times, mistakenly thinking that it got lost after an HBase restart.
Also, the reason the .put call to the INDEX_TABLE was not working is that I did not set any column value on it, only a row key, mistakenly thinking that this is possible. Yet HBase did not throw any exception whatsoever; it simply did not perform the put and gave no feedback, which may be confusing for newcomers to this technology.
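For reference, a minimal sketch of an index Put that does carry a value; the column family and qualifier names ("data" and "child") are placeholders chosen for illustration, not names from the original schema:

Put indexput = new Put(rowkey);
// Give the index row at least one column value; a Put with only a row key is silently dropped.
indexput.add(Bytes.toBytes("data"), Bytes.toBytes("child"), id); // store the Entry ID as the cell value
htbl.put(indexput);
htbl.close();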


PagedListAdapter jumps to beginning of the list on receiving new PagedList

I'm using the Paging Library to load data from the network using ItemKeyedDataSource. After fetching items the user can edit them; these updates are done in an in-memory cache (no database like Room is used).
Now, since the PagedList itself cannot be updated (discussed here), I have to recreate the PagedList and pass it to the PagedListAdapter.
The update itself is no problem, but after updating the RecyclerView with the new PagedList, the list jumps to the beginning, destroying the previous scroll position. Is there any way to update the PagedList while keeping the scroll position (like it works with Room)?
The DataSource is implemented this way:
public class MentionKeyedDataSource extends ItemKeyedDataSource<Long, Mention> {
    private Repository repository;
    ...
    private List<Mention> cachedItems;

    public MentionKeyedDataSource(Repository repository, ..., List<Mention> cachedItems) {
        super();
        this.repository = repository;
        this.teamId = teamId;
        this.inboxId = inboxId;
        this.filter = filter;
        this.cachedItems = new ArrayList<>(cachedItems);
    }

    @Override
    public void loadInitial(@NonNull LoadInitialParams<Long> params, final @NonNull ItemKeyedDataSource.LoadInitialCallback<Mention> callback) {
        Observable.just(cachedItems)
                .filter(items -> items != null && !items.isEmpty())
                .switchIfEmpty(repository.getItems(..., params.requestedLoadSize).map(...))
                .subscribeOn(Schedulers.io())
                .observeOn(AndroidSchedulers.mainThread())
                .subscribe(response -> callback.onResult(response.data.list));
    }

    @Override
    public void loadAfter(@NonNull LoadParams<Long> params, final @NonNull ItemKeyedDataSource.LoadCallback<Mention> callback) {
        repository.getOlderItems(..., params.key, params.requestedLoadSize)
                .subscribeOn(Schedulers.io())
                .observeOn(AndroidSchedulers.mainThread())
                .subscribe(response -> callback.onResult(response.data.list));
    }

    @Override
    public void loadBefore(@NonNull LoadParams<Long> params, final @NonNull ItemKeyedDataSource.LoadCallback<Mention> callback) {
        repository.getNewerItems(..., params.key, params.requestedLoadSize)
                .subscribeOn(Schedulers.io())
                .observeOn(AndroidSchedulers.mainThread())
                .subscribe(response -> callback.onResult(response.data.list));
    }

    @NonNull
    @Override
    public Long getKey(@NonNull Mention item) {
        return item.id;
    }
}
The PagedList is created like this:
PagedList.Config config = new PagedList.Config.Builder()
        .setPageSize(PAGE_SIZE)
        .setInitialLoadSizeHint(preFetchedItems != null && !preFetchedItems.isEmpty()
                ? preFetchedItems.size()
                : PAGE_SIZE * 2)
        .build();

pagedMentionsList = new PagedList.Builder<>(
        new MentionKeyedDataSource(mRepository, team.id, inbox.id, mCurrentFilter, preFetchedItems),
        config)
        .setFetchExecutor(ApplicationThreadPool.getBackgroundThreadExecutor())
        .setNotifyExecutor(ApplicationThreadPool.getUIThreadExecutor())
        .build();
The PagedListAdapter is created like this:
public class ItemAdapter extends PagedListAdapter<Item, ItemAdapter.ItemHolder> {
    // Adapter from the Google guide, nothing special here...
}

mAdapter = new ItemAdapter(new DiffUtil.ItemCallback<Item>() {
    @Override
    public boolean areItemsTheSame(Item oldItem, Item newItem) {
        return oldItem.id == newItem.id;
    }

    @Override
    public boolean areContentsTheSame(Item oldItem, Item newItem) {
        return oldItem.equals(newItem);
    }
});
and it is updated like this:
mAdapter.submitList(pagedList);
You should use a blocking call on your observable. If you don't submit the result on the same thread as loadInitial, loadAfter or loadBefore, the adapter will compute the diff of the existing list items against an empty list first, and then against the newly loaded items. So effectively it's as if all items were deleted and then inserted again, which is why the list seems to jump to the beginning.
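For illustration, a minimal sketch of a blocking loadInitial with RxJava 2. The repository arguments and the response shape are assumptions carried over from the question's elided code:

@Override
public void loadInitial(@NonNull LoadInitialParams<Long> params,
                        @NonNull LoadInitialCallback<Mention> callback) {
    // Deliver the first page synchronously so the adapter diffs against a
    // fully populated list instead of an empty one.
    List<Mention> initial = (cachedItems != null && !cachedItems.isEmpty())
            ? cachedItems
            : repository.getItems(teamId, inboxId, filter, params.requestedLoadSize) // assumed signature
                        .blockingFirst()
                        .data.list;
    callback.onResult(initial);
}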
You're not using androidx.paging.ItemKeyedDataSource.LoadInitialParams#requestedInitialKey in your implementation of loadInitial, and I think you should be.
I took a look at another implementation of ItemKeyedDataSource, the one used by autogenerated Room DAO code: LimitOffsetDataSource. Its implementation of loadInitial contains (Apache 2.0 licensed code follows):
// bound the size requested, based on known count
final int firstLoadPosition = computeInitialLoadPosition(params, totalCount);
final int firstLoadSize = computeInitialLoadSize(params, firstLoadPosition, totalCount);
... where those functions do something with params.requestedStartPosition, params.requestedLoadSize and params.pageSize.
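Translated to your ItemKeyedDataSource, honouring that key could look roughly like the sketch below. getItemsAround() is a hypothetical repository method that loads a page surrounding a given item id; it is not part of your posted code:

@Override
public void loadInitial(@NonNull LoadInitialParams<Long> params,
                        final @NonNull LoadInitialCallback<Mention> callback) {
    Long initialKey = params.requestedInitialKey; // null on the very first load
    if (initialKey == null) {
        repository.getItems(teamId, inboxId, filter, params.requestedLoadSize)
                .subscribeOn(Schedulers.io())
                .observeOn(AndroidSchedulers.mainThread())
                .subscribe(response -> callback.onResult(response.data.list));
    } else {
        // hypothetical call: reload the page around the item the user is currently viewing
        repository.getItemsAround(teamId, inboxId, filter, initialKey, params.requestedLoadSize)
                .subscribeOn(Schedulers.io())
                .observeOn(AndroidSchedulers.mainThread())
                .subscribe(response -> callback.onResult(response.data.list));
    }
}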
So what's going wrong?
Whenever you pass a new PagedList, you need to make sure that it contains the elements that the user is currently scrolled to. Otherwise, your PagedListAdapter will treat this as a removal of those elements. Then, later, when your loadAfter or loadBefore calls load those elements, it will treat them as a subsequent insertion. You need to avoid this removal and re-insertion of any visible items. Since it sounds like you're scrolling to the top, maybe you're accidentally removing all items and inserting them all again.
The way I think this works when using Room with PagedLists is:
The database is updated.
A Room observer invalidates the data source.
The PagedListAdapter code spots the invalidation and uses the factory to create a new data source, and calls loadInitial with the params.requestedStartPosition set to a visible element.
A new PagedList is provided to the PagedListAdapter, which runs the diff checking code to see what's actually changed. Usually, nothing has changed in what's visible, but maybe an element has been inserted, changed or removed. Everything outside the initial load is treated as being removed - this shouldn't be noticeable in the UI.
When scrolling, the PagedListAdapter code can spot that new items need to be loaded, and call loadBefore or loadAfter.
When these complete, an entire new PagedList is provided to the PagedListAdapter, which runs the diff checking code to see what's actually changed. Usually - just an insertion.
I'm not sure how that corresponds to what you're trying to do, but maybe that helps? Whenever you provide a new PagedList, it will be diffed against the previous one, and you want to make sure that there's no spurious insertions or deletions, or it can get really confused.
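If you want to mimic that flow without Room, one option is to hand data to the adapter through a DataSource.Factory and invalidate the current source after each in-memory edit, so the library rebuilds the PagedList around the currently visible key. This is only a sketch; the constructor arguments are simplified compared to the question's MentionKeyedDataSource:

public class MentionDataSourceFactory extends DataSource.Factory<Long, Mention> {
    private final Repository repository;
    private final List<Mention> cachedItems;
    private MentionKeyedDataSource current;

    public MentionDataSourceFactory(Repository repository, List<Mention> cachedItems) {
        this.repository = repository;
        this.cachedItems = cachedItems;
    }

    @Override
    public DataSource<Long, Mention> create() {
        // Called again after invalidate(); the new source sees the edited cache.
        current = new MentionKeyedDataSource(repository, cachedItems);
        return current;
    }

    public void onCacheEdited() {
        if (current != null) {
            current.invalidate();
        }
    }
}

This is normally wired up through LivePagedListBuilder or RxPagedListBuilder, which listen for the invalidation and rebuild the PagedList for you.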
Other ideas
I've also seen issues where PAGE_SIZE is not big enough. The docs recommend several times the maximum number of elements that can be visible at a time.
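For example, if roughly ten items fit on screen at once, a config along these lines would follow that advice (the numbers are illustrative, not taken from the question):

PagedList.Config config = new PagedList.Config.Builder()
        .setPageSize(30)              // several times the visible item count
        .setPrefetchDistance(30)      // start loading well before the list edge is reached
        .setInitialLoadSizeHint(90)   // defaults to pageSize * 3 if not set
        .build();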
This also happens when DiffUtil.ItemCallback is not correctly implemented. By a correct implementation I mean that you should properly check whether oldItem and newItem are the same and return true or false accordingly from the areItemsTheSame() and areContentsTheSame() methods.
For example, if I always return false from both of these methods, like:
new DiffUtil.ItemCallback<Item>() {
    @Override
    public boolean areItemsTheSame(Item oldItem, Item newItem) {
        return false;
    }

    @Override
    public boolean areContentsTheSame(Item oldItem, Item newItem) {
        return false;
    }
}
then the library thinks that all the items are new, and it therefore jumps to the top to display all the "new" items.
So make sure you carefully compare oldItem and newItem and return true or false based on those comparisons.

How to catch any exceptions thrown by BigQueryIO.Write and rescue the data that failed to be written?

I want to read data from Cloud Pub/Sub and write it to BigQuery with Cloud Dataflow. Each record contains a table ID indicating where the record itself should be saved.
There are various reasons why writing to BigQuery can fail:
Table ID format is wrong.
Dataset does not exist.
Dataset does not allow the pipeline access.
Network failure.
When one of these failures occurs, a streaming job will retry the task and stall. I tried using WriteResult.getFailedInserts() in order to rescue the bad data and avoid stalling, but it did not work well. Is there a good way to handle this?
Here is my code:
public class StarterPipeline {
    private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

    public static class MyData implements Serializable {
        String table_id;
    }

    public interface MyOptions extends PipelineOptions {
        @Description("PubSub topic to read from, specified as projects/<project_id>/topics/<topic_id>")
        @Validation.Required
        ValueProvider<String> getInputTopic();
        void setInputTopic(ValueProvider<String> value);
    }

    public static void main(String[] args) {
        MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
        Pipeline p = Pipeline.create(options);

        PCollection<MyData> input = p
            .apply("ReadFromPubSub", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
            .apply("ParseJSON", MapElements.into(TypeDescriptor.of(MyData.class))
                .via((String text) -> new Gson().fromJson(text, MyData.class)));

        WriteResult writeResult = input
            .apply("WriteToBigQuery", BigQueryIO.<MyData>write()
                .to(new SerializableFunction<ValueInSingleWindow<MyData>, TableDestination>() {
                    @Override
                    public TableDestination apply(ValueInSingleWindow<MyData> input) {
                        MyData myData = input.getValue();
                        return new TableDestination(myData.table_id, null);
                    }
                })
                .withSchema(new TableSchema().setFields(new ArrayList<TableFieldSchema>() {{
                    add(new TableFieldSchema().setName("table_id").setType("STRING"));
                }}))
                .withFormatFunction(new SerializableFunction<MyData, TableRow>() {
                    @Override
                    public TableRow apply(MyData myData) {
                        return new TableRow().set("table_id", myData.table_id);
                    }
                })
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry()));

        writeResult.getFailedInserts()
            .apply("LogFailedData", ParDo.of(new DoFn<TableRow, TableRow>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    TableRow row = c.element();
                    LOG.info(row.get("table_id").toString());
                }
            }));

        p.run();
    }
}
There is no easy way to catch exceptions when writing to output in a pipeline definition. I suppose you could do it by writing a custom PTransform for BigQuery. However, there is no way to do it natively in Apache Beam. I also recommend against this because it undermines Cloud Dataflow's automatic retry functionality.
In your code example, you have the failed insert retry policy set to never retry. You can set the policy to always retry. This is only effective during something like an intermittent network failure (4th bullet point).
.withFailedInsertRetryPolicy(InsertRetryPolicy.alwaysRetry())
If the table ID format is incorrect (1st bullet point), then the CREATE_IF_NEEDED create disposition configuration should allow the Dataflow job to automatically create a new table without error, even if the table ID is incorrect.
If the dataset does not exist or there is an access permission issue to the dataset (2nd and 3rd bullet points), then my opinion is that the streaming job should stall and ultimately fail. There is no way to proceed under any circumstances without manual intervention.

Java 8 map with Map.get nullPointer Optimization

public class StartObject {
    private Something something;
    private Set<ObjectThatMatters> objectThatMattersSet;
}

public class Something {
    private Set<SomeObject> someObjecSet;
}

public class SomeObject {
    private AnotherObject anotherObjectSet;
}

public class AnotherObject {
    private Set<ObjectThatMatters> objectThatMattersSet;
}

public class ObjectThatMatters {
    private Long id;
}
private void someMethod(StartObject startObject) {
    Map<Long, ObjectThatMatters> objectThatMattersMap = startObject.getSomething()
        .getSomeObject().stream()
        .map(SomeObject::getAnotherObject)
        .flatMap(anotherObject -> anotherObject.getObjectThatMattersSet().stream())
        .collect(Collectors.toMap(ObjectThatMatters::getId, Function.identity()));

    Set<ObjectThatMatters> dbObjectThatMatters = new HashSet<>();
    try {
        dbObjectThatMatters.addAll(startObject.getObjectThatMatters().stream()
            .map(objectThatMatters -> objectThatMattersMap.get(objectThatMatters.getId()))
            .collect(Collectors.toSet()));
    } catch (NullPointerException e) {
        throw new someCustomException();
    }
    startObject.setObjectThatMattersSet(dbObjectThatMatters);
}
Given a StartObject that contains a set of ObjectThatMatters
And a Something that contains the already-fetched database structure, filled with all valid ObjectThatMatters.
When I want to swap the StartObject set of ObjectThatMatters for the valid corresponding db objects that only exist in the scope of the Something
Then I compare the set of ObjectThatMatters on the StartObject
And replace every one of them with the valid ObjectThatMatters inside the Something object
And if some ObjectThatMatters doesn't have a valid counterpart, I throw a someCustomException
This someMethod seems pretty horrible; how can I make it more readable?
I already tried changing the try/catch to an Optional, but that doesn't actually help.
I used a Map instead of a List with List.contains because of performance; was this a good idea? The total number of ObjectThatMatters will usually be around 500.
I'm not allowed to change the other classes' structure, and I'm only showing the fields that affect this method, not every field, since these are extremely rich objects.
You don’t need a mapping step at all. The first operation, which produces a Map, can be used to produce the desired Set in the first place. Since there might be more objects than you are interested in, you may perform a filter operation.
So first, collect the IDs of the desired objects into a set, then collect the corresponding db objects, filtering by the Set of IDs. You can verify whether all IDs have been found, by comparing the resulting Set’s size with the ID Set’s size.
private void someMethod(StartObject startObject) {
    Set<Long> id = startObject.getObjectThatMatters().stream()
        .map(ObjectThatMatters::getId).collect(Collectors.toSet());

    HashSet<ObjectThatMatters> objectThatMattersSet =
        startObject.getSomething().getSomeObject().stream()
            .flatMap(so -> so.getAnotherObject().getObjectThatMattersSet().stream())
            .filter(obj -> id.contains(obj.getId()))
            .collect(Collectors.toCollection(HashSet::new));

    if(objectThatMattersSet.size() != id.size())
        throw new SomeCustomException();

    startObject.setObjectThatMattersSet(objectThatMattersSet);
}
This code produces a HashSet; if this is not a requirement, you can just use Collectors.toSet() to get an arbitrary Set implementation.
It’s even easy to find out which IDs were missing:
private void someMethod(StartObject startObject) {
    Set<Long> id = startObject.getObjectThatMatters().stream()
        .map(ObjectThatMatters::getId)
        .collect(Collectors.toCollection(HashSet::new)); // ensure mutable Set

    HashSet<ObjectThatMatters> objectThatMattersSet =
        startObject.getSomething().getSomeObject().stream()
            .flatMap(so -> so.getAnotherObject().getObjectThatMattersSet().stream())
            .filter(obj -> id.contains(obj.getId()))
            .collect(Collectors.toCollection(HashSet::new));

    if(objectThatMattersSet.size() != id.size()) {
        objectThatMattersSet.stream().map(ObjectThatMatters::getId).forEach(id::remove);
        throw new SomeCustomException("The following IDs were not found: " + id);
    }
    startObject.setObjectThatMattersSet(objectThatMattersSet);
}

Way of working with Lucene index custom Collector

Can someone help me understand how to work with customized implementations of the abstract Collector class in Lucene?
I've implemented two ways of querying the index with some test text:
1. Total hits is equal to 2. Both file names are the same, hence the results size is equal to 1 because I keep them in a set.
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
LOG.info("Total hits " + topDocs.totalHits);
ScoreDoc[] scoreDocsArray = topDocs.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocsArray) {
    Document doc = searcher.doc(scoreDoc.doc);
    String fileName = doc.get(FILENAME_FIELD);
    results.add(fileName);
}
2. countCollect is equal to 2. Both documents from which I get file names in the collect method of the Collector are unique, hence the final results size is also equal to 2. The countNextReader variable is equal to 10 at the end of the logic.
private Set<String> doStreamingSearch(final IndexSearcher searcher, Query query) throws IOException {
    final Set<String> results = new HashSet<String>();
    Collector collector = new Collector() {
        private int base;
        private Scorer scorer;
        private int countCollect;
        private int countNextReader;

        @Override
        public void collect(int doc) throws IOException {
            Document document = searcher.doc(doc);
            String filename = document.get(FILENAME_FIELD);
            results.add(filename);
            countCollect++;
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            this.scorer = scorer;
        }

        @Override
        public void setNextReader(AtomicReaderContext ctx) throws IOException {
            this.base = ctx.docBase;
            countNextReader++;
        }

        @Override
        public String toString() {
            LOG.info("CountCollect: " + countCollect);
            LOG.info("CountNextReader: " + countNextReader);
            return null;
        }
    };
    searcher.search(query, collector);
    collector.toString();
    return results;
}
I don't understand why, within the collect method, I get different documents and different file names compared with the previous implementation. I would expect the same result, wouldn't I?
The Collector#collect method is the hotspot of a search request. It's called for every document that matches the query, not only the ones that you get back. In fact, you usually get back only the top documents, which are effectively the ones that you show to the users.
I would suggest not to do things like:
TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
which would force lucene to return too many documents.
Anyway, if you only have two matching documents (or you are asking for all the documents that match), the number of documents that you get back and the number of calls to the collect method should be the same.
The setNextReader method is something completely different that you shouldn't care that much about. Have a look at this article if you want to know more about AtomicReader and so on. To keep it short, Lucene stores data as segments, which are mini searchable inverted indexes. Every query is executed on each segment sequentially. Every time the search switches to the next segment, the setNextReader method is called so that you can perform operations at the segment level in the Collector. For example, the internal Lucene document id is unique only within the segment, thus you need to add docBase to it to make it unique within the whole index. That's why you need to store it when the segment changes and take it into account. Your countNextReader variable just contains the number of segments that have been analyzed for your query; it doesn't have anything to do with your documents.
Looking deeper at your Collector code I also noticed you are not taking into account the docBase when retrieving documents by id. This should fix it:
Document document = searcher.doc(doc + base);
Keep also in mind that loading a stored field within a Collector is not really a wise thing to do. It's gonna make your searches really slow, because stored fields are loaded from disk. You usually load stored fields only for the subset of documents that you want to return. Within a Collector you usually load information needed to score documents like payloads or similar things, usually making use of the lucene field cache too.
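One way to keep the collect callback cheap is to collect only the segment-corrected document ids inside the Collector and load the stored field afterwards, for just the ids you actually need. A sketch against the same pre-5.x Collector API used above:

final List<Integer> docIds = new ArrayList<Integer>();
Collector collector = new Collector() {
    private int base;

    @Override
    public void collect(int doc) throws IOException {
        // store only the index-wide id; no stored-field access in the hot path
        docIds.add(doc + base);
    }

    @Override
    public void setNextReader(AtomicReaderContext ctx) throws IOException {
        base = ctx.docBase;
    }

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        // scores are not needed for this use case
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }
};
searcher.search(query, collector);

Set<String> results = new HashSet<String>();
for (int id : docIds) {
    results.add(searcher.doc(id).get(FILENAME_FIELD));
}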

Pros & cons bean vs SSJS?

I was trying to build a bean that always retrieves the same document (a counter document), gets the current value, increments it, and saves the document with the new value. Finally, it should return the value to the calling method, which would give me a new sequential number in my XPage.
Since the Domino objects cannot be serialized or turned into singletons, what's the benefit of creating a bean for this over creating an SSJS function that does the exact same thing?
My bean must make calls to session, database, view and document, which will then be executed every time.
The same goes for the SSJS function, except for session and database.
Bean:
public double getTransNo() {
    try {
        Session session = ExtLibUtil.getCurrentSession();
        Database db = session.getCurrentDatabase();
        View view = db.getView("vCount");
        view.refresh();
        doc = view.getFirstDocument();
        transNo = doc.getItemValueDouble("count");
        doc.replaceItemValue("count", ++transNo);
        doc.save();
        doc.recycle();
        view.recycle();
    } catch (NotesException e) {
        e.printStackTrace();
    }
    return transNo;
}
SSJS:
function getTransNo() {
    var view:NotesView = database.getView("vCount");
    var doc:NotesDocument = view.getFirstDocument();
    var transNo = doc.getItemValueDouble("count");
    doc.replaceItemValue("count", ++transNo);
    doc.save();
    doc.recycle();
    view.recycle();
    return transNo;
}
Thank you
Both pieces of code are not good (sorry to be blunt).
If you have only one document in your view, you don't need a view refresh, which might be queued behind a refresh on another view and be very slow. Presumably you are talking about a single-server solution (since replication of the counter document would for sure lead to conflicts).
What you do in XPages is create a Java class and declare it as an application-scoped bean:
public class SequenceGenerator {
    // Error handling is missing in this class
    private double sequence = 0;
    private String docID;

    public SequenceGenerator() {
        // Here you load from the document
        Session session = ExtLibUtil.getCurrentSession();
        Database db = session.getCurrentDatabase();
        View view = db.getView("vCount");
        Document doc = view.getFirstDocument();
        this.sequence = doc.getItemValueDouble("count");
        this.docID = doc.getUniversalId();
        Utils.shred(doc, view); // Shredding currentDatabase isn't a good idea
    }

    public synchronized double getNextSequence() {
        return this.updateSequence();
    }

    private double updateSequence() {
        this.sequence++;
        // If speed is of the essence I would spin out a new thread here
        Session session = ExtLibUtil.getCurrentSession();
        Database db = session.getCurrentDatabase();
        Document doc = db.getDocumentByUnid(this.docID);
        doc.replaceItemValue("count", this.sequence);
        doc.save(true, true);
        Utils.shred(doc);
        // End of the candidate for a thread
        return this.sequence;
    }
}
The problem with the SSJS code: what happens if two users hit it at the same time? At least you would need to use synchronized there too. Using a bean also makes it accessible in EL (you need to watch out not to call it too often). Also, in Java you can defer the write-back to a different thread, or not write it back at all and instead, in your class initialization code, read the view with the actual documents and pick the value from there.
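As an aside, a minimal sketch of pulling the bean up from Java once it has been registered as an application-scoped managed bean; the name "sequenceGenerator" is an assumed registration name, and in EL you would simply reference sequenceGenerator.nextSequence:

FacesContext ctx = FacesContext.getCurrentInstance();
SequenceGenerator generator = (SequenceGenerator) ctx.getApplication()
        .getVariableResolver()
        .resolveVariable(ctx, "sequenceGenerator"); // resolves (and lazily creates) the managed bean
double next = generator.getNextSequence();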
Update: Utils is a class with static methods:
/**
 * Get rid of all Notes objects
 *
 * @param morituri = the ones designated to die, read your Caesar!
 */
public static void shred(Base... morituri) {
    for (Base obsoleteObject : morituri) {
        if (obsoleteObject != null) {
            try {
                obsoleteObject.recycle();
            } catch (NotesException e) {
                // We don't care; we want to get
                // rid of it anyway
            } finally {
                obsoleteObject = null;
            }
        }
    }
}