Lucene index cleanup in Cassandra

We are using a Lucene index on one Cassandra table. As Cassandra data is removed (with TWCS compaction for expired tombstones), we can see that index cleanup is not happening automatically. What is the best way to clean up a Lucene index in Cassandra?

If this is an index that Cassandra knows about (and the nodetool option exists in your version), then you can use 'nodetool rebuild_index' to clean up the index:
https://cassandra.apache.org/doc/4.1/cassandra/tools/nodetool/rebuild_index.html#rebuild_index
Otherwise, you can delete all the documents in the current index, then reindex, which should also do the trick.
https://solr.apache.org/guide/8_0/reindexing.html
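For example, with hypothetical keyspace, table, and index names:
nodetool rebuild_index my_keyspace my_table my_lucene_idx
If rebuild_index is not an option, here is a drop-and-recreate sketch in CQL. It assumes the Stratio cassandra-lucene-index plugin; the class name and exact syntax depend on the plugin and version you run:
DROP INDEX my_keyspace.my_lucene_idx;
-- Recreate so the index is rebuilt from the current, post-compaction data:
CREATE CUSTOM INDEX my_lucene_idx ON my_keyspace.my_table (my_column)
USING 'com.stratio.cassandra.lucene.Index';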

Related

Performance difference in Couchbase's get by Key and select by index

As we are doing benchmark tests on our Couchbase DB, we tried to compare searching for items by their id/key with searching for items by a query that uses a secondary index.
Following this article about indexing and performance in Couchbase, we thought the performance of the two would be the same.
However, in our tests we discovered that sometimes the search by key/id was much faster than the search that uses the secondary index.
For example, ~3 ms to search using the index versus ~0.3 ms to search by the key (a factor of 10).
The point is that this difference is not consistent: the search by key varies from 0.3 ms to 15 ms.
We are wondering:
Should there be better performance for search by key over search by secondary index?
Should there be such a time difference between key searches?
The results you get are consistent with what I would expect. Couchbase works as a key-value store when you do any operation using the id. A key-value store is roughly a big distributed hashmap, and in this data structure you get very good performance on get/save/delete while using the id.
Whenever you store a new document, Couchbase hashes the key and assigns a virtual bucket to it (something similar to a shard). When you need to get this document back, it uses the same algorithm to find out which virtual bucket the document is located in; as the SDK has the cluster map and knows exactly which node owns which shards, your application will request the document directly from the node that owns it.
On the other hand, when you query the database, Couchbase internally has to run a map/reduce to find out where the documents are located, which is why operations by id are faster.
As for the results varying from 0.3 ms to 15 ms, it is hard to tell without debugging your environment. However, a number of factors could contribute to it, e.g. whether the document is cached or not, an undersized node, etc.
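To make the contrast concrete, here is a hedged N1QL sketch (the bucket name and document key are illustrative): the first query is routed straight to the owning node via the cluster map, while the second has to consult a secondary index first.
-- Direct key-value access:
SELECT * FROM userBucket USE KEYS ["user::000000000001"];
-- Index-based access:
SELECT * FROM userBucket WHERE email = "benjamin1@couchbase.com";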
To add to @deniswrosa's answer, the secondary index will always be slower, because first the index must be traversed based on your query to find the document key, and then a key lookup is performed. Doing just the key lookup is faster if you already have the key. The amount of work to traverse the index can vary depending on how selective the index is, whether the entire index is in memory, etc. Memory-optimized indexes can ensure that the whole index is in memory, if you have enough memory to support that.
Of course, even a simple key lookup can be slower if the document in question is not in the cache and needs to be brought into memory from storage.
It is possible to achieve sub-millisecond secondary lookups at scale, but it requires some tuning of your query, your index, and possibly some of Couchbase's system parameters. Consider the following simple example:
Sample document in userBucket:
"user::000000000001" : {
"email" : "benjamin1#couchbase.com",
"userId" : "000000000001"
}
This query:
SELECT userId
FROM userBucket
WHERE
email = "benjamin1#couchbase.com"
AND userId IS NOT NULL
;
...should be able to achieve sub-millisecond performance with a properly tuned secondary index:
CREATE INDEX idx01 ON userBucket(email, userId);
Since the index covers the query completely, there is no need for the query engine to FETCH the document from the K/V store. However, "SELECT * ..." will always cause the Query service to FETCH the document, and thus will be slower than a simple K/V GET("user::000000000001").
For the best latencies, review your query plan (using the EXPLAIN syntax) and make sure your query is not FETCHing. https://docs.couchbase.com/server/6.0/n1ql/n1ql-language-reference/explain.html
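As a sketch of that check, run EXPLAIN on the query above:
EXPLAIN SELECT userId
FROM userBucket
WHERE email = "benjamin1@couchbase.com"
AND userId IS NOT NULL;
In the resulting plan, an IndexScan on idx01 whose "covers" list includes email and userId means the query is fully covered; a Fetch operator means documents are still being pulled from the K/V store.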

AEM 6.2 OAK Indexing Behavior

As mentioned in the Adobe docs, Oak does not index anything by default, and custom indexes need to be created when necessary.
But the OOTB Lucene index under /oak:index/lucene indexes all content (text and binary) by default, which is a 180-degree shift from the statement above. If this is true, then ideally the same Lucene index should be used for search and we should not see the error below.
Source - AEM Lucene OOTB Index - Q43
WARN Traversed 1000 nodes with filter Filter(query=select ...) consider creating an index or changing the query
Of course it does not index any property, but things should still be fine, as most of the time the query goes after content only. Can anybody suggest?
As per the Oak docs, the following indexes are available OOTB, and this holds true for the AEM repository as well. They may or may not fulfill your indexing/search needs depending on the use case, as I hope AEM will try to use the index definitions below as best as possible.
A property index for each indexed property.
A full-text index which is based on Apache Lucene / Solr.
A node type index (which is based on a property index for the properties jcr:primaryType and jcr:mixinTypes).
A traversal index that iterates over a subtree.
Finally, for any search, if the AEM indexing module does not find a matching index definition as above, it will fall back to repository traversal and log a warning (like the one above) to create an index. So these scenarios will always fall under the custom index definition creation process.
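For illustration, here is a minimal custom property index definition in the usual Oak node form (the index name and property are hypothetical; setting reindex to true triggers indexing of existing content):
/oak:index/myPropertyIndex
  - jcr:primaryType = "oak:QueryIndexDefinition"
  - type = "property"
  - propertyNames = ["myProperty"]
  - reindex = true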

How do I drop an index using Lucandra?

I am using Lucandra and want to drop an entire index. The IndexReader and IndexWriter don't have all methods implemented, so even iterating through calls to deleteDocument(int docNum) isn't possible.
Has anyone run up against this and figured out how to either
hack the Cassandra keyspace, or
make additions to the Lucandra code, or
construct an iterator to delete all docs?
The current version of Lucandra doesn't store documents numbered 1..N, so deleteDocument(int) doesn't work.
What I've done is index a field with the same term in all documents, so you can match all documents and then delete them with a deleteDocuments(Term) call.
Another option (if you only have one index per keyspace) is to truncate the Cassandra column families.
The next version of Lucandra (in development) does store documents in 1..N fashion.

Does MySQL use existing indexes on creating new indexes?

I have a large table with millions of records.
Table `price`
------------
id
product
site
value
The table is brand new, and there are no indexes created.
I then issued a request for new index creation with the following query:
CREATE INDEX ix_price_site_product_value_id ON price (site, product, value, id);
This took a very long time; the last time I checked, it had been running for 5000+ seconds, because of the machine.
I am wondering: if I issue another index creation, will it use the existing index in the computation? If so, in what form?
Next query to run (query 1):
CREATE INDEX ix_price_product_value_id ON price (product, value, id);
Next query to run (query 2):
CREATE INDEX ix_price_value_id ON price (value, id);
I am wondering: if I issue another index creation, will it use the existing index in the computation? If so, in what form?
No, it won't.
Theoretically, an index on (site, product, value, id) has everything required to build an index on any subset of these fields (including the indices on (product, value, id) and (value, id)).
However, building an index from a secondary index is not supported.
First, MySQL does not support fast full index scan (that is, scanning an index in physical order rather than logical order), which makes an index access path more expensive than a table read. This is not a problem for InnoDB, since the table itself is always clustered.
Second, the record orders in these indexes are completely different so the records need to be sorted anyway.
However, the main problem with index creation speed in MySQL is that it generates the order on the fly (just inserting the records one by one into a B-Tree) instead of using a presorted source. As @Daniel mentioned, fast index creation solves this problem. It is available as a plugin for 5.1 and comes preinstalled in 5.5.
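For completeness, a hedged sketch of what this looks like on a modern MySQL (5.6+), where you can request in-place (fast) index creation explicitly; the statement below fails with an error, rather than silently copying the table, if the storage engine cannot comply:
ALTER TABLE price
    ADD INDEX ix_price_product_value_id (product, value, id),
    ALGORITHM=INPLACE, LOCK=NONE;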
If you're using MySQL version 5.1, and the InnoDB storage engine, you may want to use the InnoDB Plugin 1.0, which supports a new feature called Fast Index Creation. This allows the storage engine to create indexes without copying the contents of the entire table.
Overview of the InnoDB Plugin:
Starting with version 5.1, MySQL AB has promoted the idea of a “pluggable” storage engine architecture, which permits multiple storage engines to be added to MySQL. Currently, however, most users have accessed only those storage engines that are distributed by MySQL AB, and are linked into the binary (executable) releases.
Since 2001, MySQL AB has distributed the InnoDB transactional storage engine with its releases (both source and binary). Beginning with MySQL version 5.1, it is possible for users to swap out one version of InnoDB and use another.
Source: Introduction to the InnoDB Plugin
Overview of Fast Index Creation:
In MySQL versions up to 5.0, adding or dropping an index on a table with existing data can be very slow if the table has many rows. The CREATE INDEX and DROP INDEX commands work by creating a new, empty table defined with the requested set of indexes. It then copies the existing rows to the new table one-by-one, updating the indexes as it goes. Inserting entries into the indexes in this fashion, where the key values are not sorted, requires random access to the index nodes, and is far from optimal. After all rows from the original table are copied, the old table is dropped and the copy is renamed with the name of the original table.
Beginning with version 5.1, MySQL allows a storage engine to create or drop indexes without copying the contents of the entire table. The standard built-in InnoDB in MySQL version 5.1, however, does not take advantage of this capability. With the InnoDB Plugin, however, users can in most cases add and drop indexes much more efficiently than with prior releases.
...
Changing the clustered index requires copying the data, even with the InnoDB Plugin. However, adding or dropping a secondary index with the InnoDB Plugin is much faster, since it does not involve copying the data.
Source: Overview of Fast Index Creation

SQL Server: What is the difference between Index Rebuilding and Index Reorganizing?

What is the difference between Index Rebuilding and Index Reorganizing?
Think about how the index is implemented. It's generally some kind of tree, like a B+ Tree or B- Tree. The index itself is created by looking at the keys in the data, and building the tree so the table can be searched efficiently.
When you reorganize the index, you go through the existing index, cleaning up blocks for deleted records etc. This could be done (and is, in some databases) when you make a deletion, but that imposes some performance penalty. Instead, you do it separately, in order to do it in more or less batch mode.
When you rebuild the index, you delete the existing tree and read all the records, building a new tree directly from the data. That gives you a new, and hopefully optimized, tree that may be better than the result of reorganizing the index; it also lets you regenerate the tree if it has somehow been corrupted.
REBUILD locks the table for the whole operation period (which may be hours or days if the table is large).
REORGANIZE doesn't lock the table.
Well, actually, it places some temporary locks on the pages it is working with at any given moment, but they are removed as soon as the operation on them is complete (which is a fraction of a second for any given lock).
As @Andomar noted, there is an option to REBUILD an index online, which creates the new index, and when the operation is complete, just replaces the old index with the new one.
This of course means you should have enough space to keep both the old and the new copy of the index.
REBUILD is also a DDL operation which changes the system tables, affects statistics, enables disabled indexes, etc.
REORGANIZE is a pure cleanup operation which leaves all system state as is.
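In T-SQL the two operations look like this (the table and index names are hypothetical; ONLINE = ON requires Enterprise edition):
ALTER INDEX ix_orders_date ON dbo.Orders REBUILD WITH (ONLINE = ON);
ALTER INDEX ix_orders_date ON dbo.Orders REORGANIZE;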
There are a number of differences. Basically, rebuilding is a total rebuild of an index: it will build a new index and then drop the existing one, whereas reorganizing will simply, well... reorganize it.
This blog entry I came across a while back will explain it much better than I can. :)
Rebuild: drop the current index and recreate a new one.
Reorganizing is like putting the house in order with what you already have.
It is a good practice to use 30% fragmentation to decide between rebuild and reorganize:
<30% reorganize vs. >30% rebuild.
"Reorganize index" is a process of cleaning, organizing, and defragmenting of "leaf level" of the B-tree (really, data pages).
Rebuilding of the index is changing the whole B-tree, recreating the index.
It’s recommended that index should be reorganized when index fragmentation is from 10% to 40%; if index fragmentation is great than 40%, it’s better to rebuild it.
Rebuilding of an index takes more resources, produce locks and slowing performance (if you choose to keep table online). So, you need to find right time for that process.
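To measure fragmentation per index and decide, here is a sketch using the standard DMV (the database and table names are hypothetical):
SELECT i.name, ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id AND i.index_id = ips.index_id;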
In addition to the differences above (basically rebuild will create the index anew, and then "swap it in" for the existing one, rather than trying to fix the existing one), an important consideration is that a rebuild - even an Enterprise ONLINE rebuild - will interfere with snapshot isolation transactions.
TX1 starts snapshot transaction
TX1 reads from T
TX2 rebuilds index on T
TX2 rebuild complete
TX1 reads from T again:
Error 3961, Snapshot isolation transaction failed in database because the object accessed by the statement has been modified by a DDL statement in another concurrent transaction since the start of this transaction. It is disallowed because the metadata is not versioned. A concurrent update to metadata can lead to inconsistency if mixed with snapshot isolation.
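A hedged T-SQL repro of that sequence (it assumes snapshot isolation has been enabled on the database, and a hypothetical table T):
-- Session 1 (TX1):
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT COUNT(*) FROM dbo.T;
-- Session 2 (TX2):
ALTER INDEX ALL ON dbo.T REBUILD;
-- Session 1 again: this read fails with error 3961.
SELECT COUNT(*) FROM dbo.T;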
Rebuild index - rebuilds one or more indexes for a table in the specified database.
Reorganize index - defragments clustered and secondary indexes of the specified table.