AEM 6.2 OAK Indexing Behavior

According to the Adobe docs, Oak does not index anything by default, and custom indexes need to be created when necessary.
However, the OOTB Lucene index under /oak:index/lucene indexes all content (text and binary) by default, which is a 180-degree shift from the statement above. If this is true, then ideally that same Lucene index should be used for search and we should not see this error:
Source - AEM Lucene OOTB Index - Q43
WARN Traversed 1000 nodes with filter Filter(query=select ...) consider creating an index or changing the query
Of course it does not index any properties, but things should still be fine, since most of the time the query goes for content only. Can anybody suggest what is going on?

As per the Oak docs, the following indexes are available OOTB, and this holds true for the AEM repository as well. They may or may not fulfill your indexing/search needs depending on the use case, as AEM will try to use the index definitions below as best it can:
A property index for each indexed property.
A full-text index which is based on Apache Lucene / Solr.
A node type index (which is based on a property index for the properties jcr:primaryType and jcr:mixinTypes).
A traversal index that iterates over a subtree.
Finally, if for a given query the query engine does not find any matching index definition among the above, it falls back to repository traversal and logs the warning shown above, asking you to create an index. Those scenarios always call for the custom index definition creation process, as sketched below.
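For illustration, a custom property index can be registered by creating a node under /oak:index. Below is a minimal sketch using the JCR API; the index name "myPropertyIndex" and the property "myProperty" are hypothetical placeholders:

import javax.jcr.Node;
import javax.jcr.PropertyType;
import javax.jcr.Session;
import javax.jcr.Value;

public class CreatePropertyIndex {
    // Registers an Oak property index for a hypothetical "myProperty" property.
    public static void create(Session session) throws Exception {
        Node oakIndex = session.getNode("/oak:index");
        Node def = oakIndex.addNode("myPropertyIndex", "oak:QueryIndexDefinition");
        def.setProperty("type", "property");
        Value name = session.getValueFactory().createValue("myProperty", PropertyType.NAME);
        def.setProperty("propertyNames", new Value[]{ name });
        def.setProperty("reindex", true); // ask Oak to (re)build the index on save
        session.save();
    }
}

Once such an index exists, queries with a constraint on that property should be answered from the index instead of triggering the traversal warning.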

Related

Performance difference in Couchbase's get by Key and select by index

As part of benchmark tests on our Couchbase DB, we tried to compare searching for an item by its id/key with searching for items by a query that uses a secondary index.
Following this article about indexing and performance in Couchbase, we thought the performance of the two would be the same.
However, in our tests we discovered that the search by key/id was sometimes much faster than the search that uses the secondary index.
E.g. ~3 ms to search using the index versus ~0.3 ms to search by the key (a factor of 10).
The point is that this difference is not consistent: the search by key varies from 0.3 ms to 15 ms.
We are wondering:
Should search by key perform better than search by secondary index?
Should there be such time variance between key searches?
The results you get are consistent with what I would expect. Couchbase works as a key-value store when you do any operation by id. A key-value store is roughly a big distributed hashmap, and in this data structure you get very good performance on get/save/delete when using the id.
Whenever you store a new document, Couchbase hashes the key and assigns the document to a virtual bucket (something similar to a shard). When you need to get the document back, it uses the same algorithm to find out which virtual bucket the document is located in; since the SDK has the cluster map and knows exactly which node owns which shards, your application will request the document directly from the node that owns it.
On the other hand, when you query the database, Couchbase internally has to do a map/reduce to find out where the documents are located, which is why operations by id are faster.
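To make the direct path concrete, here is a minimal sketch of a key-value get with the Couchbase Java SDK 2.x; the cluster address, bucket, and key are assumptions:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;

public class KvGet {
    public static void main(String[] args) {
        // Hypothetical cluster address and bucket name.
        CouchbaseCluster cluster = CouchbaseCluster.create("localhost");
        Bucket bucket = cluster.openBucket("userBucket");

        // The SDK hashes the key, looks up the owning node in the cluster map,
        // and fetches the document directly from that node (no query service involved).
        JsonDocument doc = bucket.get("user::000000000001");
        System.out.println(doc.content());

        cluster.disconnect();
    }
}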
As for the results varying from 0.3 ms to 15 ms, that is hard to explain without debugging your environment. However, a number of factors could contribute to it, e.g. whether the document is cached or not, whether the node is undersized, etc.
To add to @deniswrosa's answer, the secondary index will always be slower, because first the index must be traversed based on your query to find the document key, and then a key lookup is performed. Doing just the key lookup is faster if you already have the key. The amount of work to traverse the index can vary depending on how selective the index is, whether the entire index is in memory, etc. Memory-optimized indexes can ensure that the whole index is in memory, if you have enough memory to support that.
Of course even a simple key lookup can be slower if the document in question is not in the cache, and needs to be brought in to memory from storage.
It is possible to achieve sub-millisecond secondary lookups at scale, but it requires some tuning of your query, your index, and possibly some of Couchbase's system parameters. Consider the following simple example:
Sample document in userBucket:
"user::000000000001" : {
"email" : "benjamin1#couchbase.com",
"userId" : "000000000001"
}
This query:
SELECT userId
FROM userBucket
WHERE
email = "benjamin1#couchbase.com"
AND userId IS NOT NULL
;
...should be able to achieve sub-millisecond performance with a properly tuned secondary index:
CREATE INDEX idx01 ON userBucket(email, userId);
Since the index covers the query completely, there is no need for the query engine to FETCH the document from the K/V store. However, "SELECT * ..." will always cause the query service to FETCH the document and will thus be slower than a simple K/V GET("user::000000000001").
For the best latencies, review your query plan (using the EXPLAIN syntax) and make sure your query is not FETCHing: https://docs.couchbase.com/server/6.0/n1ql/n1ql-language-reference/explain.html
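As a sketch (the connection details are assumptions), the plan for the query above can be fetched through the Java SDK and scanned for a "Fetch" operator:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryResult;
import com.couchbase.client.java.query.N1qlQueryRow;

public class ExplainCheck {
    public static void main(String[] args) {
        CouchbaseCluster cluster = CouchbaseCluster.create("localhost");
        Bucket bucket = cluster.openBucket("userBucket");

        // EXPLAIN returns the query plan instead of executing the query.
        N1qlQueryResult plan = bucket.query(N1qlQuery.simple(
                "EXPLAIN SELECT userId FROM userBucket "
              + "WHERE email = 'benjamin1@couchbase.com' AND userId IS NOT NULL"));

        // If the plan contains a "Fetch" operator, the index is not covering the query.
        for (N1qlQueryRow row : plan) {
            System.out.println(row);
        }
        cluster.disconnect();
    }
}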

B+ Tree and Index Page in Apache Ignite

I'm trying to understanding the purpose of B+ Tree and Index Pages for Apache Ignite as described here: https://apacheignite.readme.io/docs/page-memory
I have a few questions:
What exactly does an index page contain? An ordered list of hash code values for the keys that fall into the index page, plus "other" information used to locate the entry in the data page where the key-value pair is stored?
Since hash codes are used in the index pages, what happens if a collision occurs?
For a "typical" application, do we expect the number of data pages to be much higher than the number of index pages (since data pages contain the key-value pairs)?
What type of relation exists between a distributed cache that we create using ignite.getOrCreateCache(name) and a memory region? 1-to-1, Many-to-1, 1-to-Many, or Many-to-Many?
Consider the following pseudo code:
Ignite ignite = Ignition.start("two_server_node_config");
IgniteCache<Integer,String> cache = ignite.getOrCreateCache("my_cache");
cache.put(7, "abcd");
How does Ignite determine which node to put the key on?
Once the target node is determined, how does Ignite locate the specific memory region the key belongs to?
Thanks
An index page contains an ordered list of hash values along with links to the key-value pairs stored in durable memory (a link is a page ID plus an offset inside that page).
All links to objects with colliding hashes will be present in the index page. To perform a lookup, Ignite dereferences the links and compares the actual keys.
This depends on object size. You can roughly estimate the ratio of data pages to index pages in a "typical" application as 90 to 10. However, the share of index pages will grow if you add extra indexes: https://apacheignite.readme.io/v2.1/docs/indexes#section-registering-indexed-types
You may also find the most recent version of the docs useful: https://apacheignite.readme.io/v2.1/docs/memory-architecture
Answering the last two questions:
Many-to-1: the same memory region can be used by multiple caches.
This is based on affinity. Basically, the cache key is mapped to an affinity key (by default they are the same), and then the affinity function is called to determine the partition and the node. Some more details about affinity here: https://apacheignite.readme.io/docs/affinity-collocation
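As a small illustration, the same key-to-partition-to-node mapping can be queried explicitly through the Affinity API; the config path and cache name are taken from the question's pseudo code:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.Affinity;
import org.apache.ignite.cluster.ClusterNode;

public class AffinityDemo {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("two_server_node_config");
        ignite.getOrCreateCache("my_cache");

        // The same affinity function Ignite uses internally for cache.put(7, "abcd"):
        Affinity<Integer> aff = ignite.affinity("my_cache");
        int partition = aff.partition(7);        // key -> partition
        ClusterNode owner = aff.mapKeyToNode(7); // partition -> primary node
        System.out.println("Key 7 -> partition " + partition + " on node " + owner.id());
    }
}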

Duplicates when using nutch -> elasticsearch solution

I have crawled some data using Nutch and managed to inject it into Elasticsearch. But I have one problem: if I inject the crawled data again, it creates duplicates. Is there any way of disallowing this?
Has anyone managed to solve this or have any suggestions on how to solve it?
/Samus
If you index each crawled page/document with the same ID in Elasticsearch, it won't be duplicated. You could use a checksum/hash function to turn the page's URL into a distinct ID.
You can also use op_type to ensure that if that ID is already indexed it will not be reindexed:
The index operation also accepts an op_type that can be used to force
a create operation, allowing for “put-if-absent” behavior. When create
is used, the index operation will fail if a document by that id
already exists in the index.
ElasticSearch index API
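For instance, both ideas (a URL-derived ID and op_type create) could be combined with the older Elasticsearch Java client roughly as follows; the index/type names and the helper are hypothetical:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;

public class DedupIndexer {
    // Derive a stable document ID from the page URL.
    static String idForUrl(String url) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    static void indexPage(Client client, String url, String json) throws Exception {
        client.prepareIndex("pages", "page", idForUrl(url))
              .setSource(json)
              .setOpType(IndexRequest.OpType.CREATE) // fails if the ID already exists
              .get();
    }
}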
One way: you can keep checksums of all the data you have entered into Elasticsearch in some DB and cross-reference those before attempting to send data to Elasticsearch.
Alternatively, you can run a "more like this" query to find similar documents and take a decision based on that.
LINK - http://www.elasticsearch.org/guide/reference/query-dsl/mlt-field-query.html

How do I remove logically deleted documents from a Solr index?

I am implementing Solr for free-text search in a project where the records available to be searched will need to be added and deleted on a large scale every day.
Because of the scale, I need to make sure that the size of the index stays appropriate.
On my test installation of Solr, I index a set of 10 documents. Then I make a change in one of the documents and want to replace the document with the same ID in the index. This works correctly and behaves as expected when I search.
I am using this code to update the document:
getSolrServer().deleteById(document.getIndexId());    // remove the old version
getSolrServer().add(document.getSolrInputDocument()); // index the updated version
getSolrServer().commit();                             // make the change visible
What I noticed, though, is that when I look at the stats page for the Solr server, the figures are not what I expect.
After the initial index, numDocs and maxDocs both equal 10 as expected. When I update the document however, numDocs is still equal to 10 (expected) but maxDocs equals 11 (unexpected).
When reading the documentation I see that
maxDoc may be larger as the maxDoc count includes logically deleted documents that have not yet been removed from the index.
So the question is, how do I remove logically deleted documents from the index?
If these documents still exist in the index, do I run the risk of performance penalties when this runs with a very large volume of documents?
Thanks :)
You have to optimize your index.
Note that an optimize is expensive; you probably should not do it more than daily.
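With the SolrJ client of that era, the optimize can be triggered programmatically; a minimal sketch, assuming a default local Solr URL:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeIndex {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // optimize() merges segments and physically removes logically deleted
        // documents, after which maxDocs should drop back to numDocs.
        server.optimize();
    }
}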
Here is some more info on optimize:
http://www.lucidimagination.com/search/document/CDRG_ch06_6.3.1.3
http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations

How to deal with constantly changing data and SOLR indexes?

Afternoon guys,
I'm using a Solr index for searching through items on my site. The search results contain the average rating of each item and the number of comments the item has. The results can be sorted by both rating and number of comments.
But obviously, with the Solr index these numbers aren't updated until the DB (~2 million rows) is reindexed (done nightly, probably).
What would you guys think is the best way to approach this?
Well, I think you should change your db-index sync policy:
First approach: when committing database changes, also post the changes (a batch of them) to the index. You should write a mapper tier to map your domain objects to Solr docs (remember: persist, and if that goes OK, then index; this works fine for us ;-)). See the sketch below. If you want to achieve near-real-time index updates, you should look at solutions like Zoie (LinkedIn's Lucene-based search framework).
Second approach: take a look at delta imports (and schedule more frequent index updates).
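Here is a rough sketch of the first approach with SolrJ; the Item and ItemDao types and the field names are hypothetical placeholders:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ItemIndexer {
    // Hypothetical domain types standing in for your own.
    interface Item { String getId(); double getAverageRating(); int getCommentCount(); }
    interface ItemDao { void save(Item item) throws Exception; }

    private final SolrServer solr;
    private final ItemDao itemDao;

    public ItemIndexer(SolrServer solr, ItemDao itemDao) {
        this.solr = solr;
        this.itemDao = itemDao;
    }

    // Persist first; only if that succeeds, push the change to the index.
    public void saveItem(Item item) throws Exception {
        itemDao.save(item);
        solr.add(toSolrDoc(item));
        solr.commit(); // or rely on autoCommit / commitWithin for near-real-time updates
    }

    // Mapper tier: domain object -> Solr document.
    private SolrInputDocument toSolrDoc(Item item) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", item.getId());
        doc.addField("avgRating", item.getAverageRating());
        doc.addField("numComments", item.getCommentCount());
        return doc;
    }
}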