Secondary index support in Cassandra? - indexing

On a blog I saw the statement below:
Secondary Indexes
Secondary indexes are a first-class construct in MongoDB. This makes
it easy to index any property of an object stored in MongoDB even if
it is nested. This makes it really easy to query based on these
secondary indexes
Cassandra has only cursory support for secondary
indexes. Secondary indexes are also limited to single columns and
equality comparisons. If you are mostly going to be querying by the
primary key then Cassandra will work well for you.
My question is: can't Cassandra create more than one secondary index on separate columns?
Also, can't we execute operations such as LIKE or full-text search on Cassandra, given that it says secondary indexes are good only for equality comparisons?
Update:
What is the difference between a Cassandra secondary index and a MongoDB secondary index?

Cassandra create more than one secondary index on separate columns ?
Yes, it can. You can create several secondary indexes, one per column, but a query that restricts on more than one indexed column must use ALLOW FILTERING, which hurts performance. Secondary indexes in Cassandra are not like the ones in an RDBMS, and you should analyze your access patterns carefully before using them.
can't we execute operation like or full text search on Cassandra as it
says secondary index are good for only equality comparison
A regular secondary index does not support the LIKE operator. Newer Cassandra versions (3.4 and later) do, however, provide SASI indexes, which support LIKE queries, including substring matches when the index is created in CONTAINS mode.
Custom SASI Index

Related

Can we create index on external table in Hive?

Is it possible to create an index on an external table in Hive? It could be any index, Compact or Bitmap. In one place I read that it is not possible to create an index on an external table, but elsewhere I read that it doesn't matter. So I want to know for sure.
Hive indexing was added in version 0.7.0, and bitmap indexing was added in version 0.8.0.
Create/Drop/Alter Index
more details
You can build indexes on both kinds of table. Whether a table is internal or external makes no difference as far as performance is concerned. Either way, building indexes on large data sets is counterintuitive.
Here are a few points to keep in mind when deciding whether to index:
Build indexes on the columns on which you frequently perform operations.
Building a large number of indexes also degrades the performance of your queries.
Identify the type of index before creating it (if your data calls for a bitmap index, you should not create a compact one). Choosing the wrong type increases the time needed to execute your query.
You can refer to the below link for more details on how to perform indexing in Hive
https://acadgild.com/blog/indexing-in-hive/

What types of indexing does Neo4j use for schema index?

I am new to Neo4j. When I came across Neo4j indexes, all I found was that there is a legacy index and a newer one (the schema index), but I want to know what types these indexes are and whether there is a way to specify the type. For example, in Oracle we have clustered/non-clustered/b-tree/bitmap etc. Do we have something similar in Neo4j?
In Neo4j there are two kinds of index:
Schema indexes (via create index or create constraint)
Legacy indexes
All of these indexes are built internally on Lucene, and there are no index types like those in Oracle.
When you use legacy indexes, you can configure them as described here: http://neo4j.com/docs/stable/indexing-create-advanced.html
You can find some additional information here:
http://jexp.de/blog/2015/04/on-neo4j-indexes-match-merge/
http://blog.armbruster-it.de/2013/12/indexing-in-neo4j-an-overview/
Cheers

Apache Cassandra. Advantage and disadvantage of Secondary Index

I have read that secondary indexes in Cassandra are a fairly useless feature. Indeed, they make writes to the database much slower, you can look up a value only by an exact match on the index, and you need to send requests to all servers in the cluster to find a value by index. Can anyone tell me what benefit would justify using a secondary index?
Querying becomes more flexible when you add secondary indexes to table columns. You can add indexed columns to the WHERE clause of a SELECT.
When to use secondary indexes
You want to query on a column that isn't the primary key and isn't part of a composite key. The column you want to query on should have few unique values. For example, a Town column is a good choice for secondary indexing because many people will be from the same town; date of birth, however, is not such a good choice.
When to avoid secondary indexes
Avoid secondary indexes on columns that contain a high number of unique values, since each lookup on such a column will return only a few results.
As always, check out the documentation:
About Indexes in Cassandra
FAQ for Secondary Indexes

multiple cluster indices effect

My question is about limitation of clustered index on a table.
In theory, a single table can have only one clustered index. But what if I have datetime columns in a table, say "From date" and "To date"? These columns will often be required in the WHERE clause to populate reports in my application. If I also require a clustered index on the primary key in the same table, how can I still get the advantage of a clustered index on the other columns? In this case my queries will still run slower as the number of records grows.
In practice, too, you can have only a single clustered index on a table, since the table's data is physically ordered by that clustered index.
If you need two datetime columns frequently in WHERE clauses, the best choice would be to have a non-clustered index on those two columns, and possibly include additional columns that you frequently retrieve with those queries, in order to make it a covering index.
There's really not much difference between a good, covering non-clustered index, and the clustered index, in terms of query performance.
However, you don't want to bloat your clustered index, since those columns will also be added to all non-clustered indices on the same table - keep it small, preferably an INT, ever-increasing, stable (not changing) and you should be just fine.
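The covering non-clustered index suggested above for the two datetime columns can be sketched as follows. This is only an illustration, not the original poster's schema: it uses SQLite through Python's standard sqlite3 module as a stand-in for the actual database, and the table name bookings, the column names, and the index name idx_bookings_dates are all hypothetical.

```python
import sqlite3

# Hypothetical schema: table, column and index names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings ("
    " id INTEGER PRIMARY KEY,"  # small, ever-increasing key the table is ordered by
    " from_date TEXT, to_date TEXT, amount REAL, notes TEXT)"
)
conn.executemany(
    "INSERT INTO bookings (from_date, to_date, amount, notes) VALUES (?, ?, ?, ?)",
    [("2024-01-%02d" % d, "2024-02-%02d" % d, d * 10.0, "n/a") for d in range(1, 29)],
)

# Non-clustered index on both date columns plus the column the report reads.
conn.execute(
    "CREATE INDEX idx_bookings_dates ON bookings (from_date, to_date, amount)"
)


def plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


date_range = ("2024-01-10", "2024-02-20")

# Every column the query touches is in the index, so the index covers it.
covered_plan = plan(
    "SELECT amount FROM bookings WHERE from_date >= ? AND to_date <= ?", date_range
)
# 'notes' is not in the index, so the base table must be read as well.
uncovered_plan = plan(
    "SELECT notes FROM bookings WHERE from_date >= ? AND to_date <= ?", date_range
)

print(covered_plan)
print(uncovered_plan)
```

The first plan reports a covering index, while the second has to go back to the table for notes. That extra lookup is what adding the frequently retrieved columns to the index avoids.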
Another option is indexed (or materialized) views: you could create multiple views on the table, each with a different clustered index. That might be useful in a reporting scenario, but indexed views have lots of restrictions and will affect the performance of queries that modify the table data. Books Online has all the information you need to create and test them.
I suspect your real requirement is indeed to implement a reporting solution, and if so then it might be best to do it properly: create a separate database with a schema optimized for reporting (Google "star schema") and load data regularly from the main database into the reporting one. But that's a whole new area of development to investigate, and I wouldn't rush into it.
If you need clustered-index performance for multiple indexes on the same table, the only route I see is to hold a copy of the table for each clustered index.
The clustered index affects the physical storage of the data in a table, so by definition there can only be one. You can widen your clustered index to include the other columns, but this can have its own disadvantages.
The performance advantage of a clustered index is that the records are stored in a manner that reflects the index (which is why random inserts and updates to the clustered index very quickly fragment the table). Query performance based on this index can therefore be as good as the read performance of your storage device; you can't get this from other indexes.
I suggest that you choose the index from which you would derive the most benefit from clustering and make that your clustered index. Make the rest of the indexes non-clustered. You may want to run some tests to find out what benefits will be derived from making different indexes clustered vs. non-clustered.
Share and enjoy.

What are the different types of indexes, what are the benefits of each?

What are the different types of indexes, what are the benefits of each?
I heard of covering and clustered indexes, are there more? Where would you use them?
Unique - Guarantees unique values for the column(or set of columns) included in the index
Covering - Includes all of the columns that are used in a particular query (or set of queries), allowing the database to use only the index and not actually have to look at the table data to retrieve the results
Clustered - This is the way in which the actual data is ordered on disk, which means that if a query uses the clustered index to look up values, it does not have to take the additional step of looking up the actual table row for any data not included in the index.
OdeToCode has a good article covering the basic differences
As it says in the article:
Proper indexes are crucial for good
performance in large databases.
Sometimes you can make up for a poorly
written query with a good index, but
it can be hard to make up for poor
indexing with even the best queries.
Quite true, too... If you're just starting out with it, I'd focus on clustered and composite indexes, since they'll probably be what you use the most.
I'll add a couple of index types
BITMAP - for when you have a very low number of different possible values; very fast and doesn't take up much space.
PARTITIONED - allows the index to be partitioned based on some property; usually advantageous on very large database objects for storage or performance reasons.
FUNCTION/EXPRESSION indexes - used to pre-calculate some value based on the table and store it in the index; a very simple example might be an index based on lower() or a substring function.
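A minimal sketch of the function/expression index mentioned in the list above, using the lower() example. It assumes SQLite (via Python's standard sqlite3 module) rather than any particular server database, and the users table and the index name idx_users_email_lower are hypothetical.

```python
import sqlite3

# Hypothetical table and index names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("Alice@Example.com",), ("bob@example.com",)],
)

# The index stores lower(email), so the value is computed once at write time.
conn.execute("CREATE INDEX idx_users_email_lower ON users (lower(email))")

# The query must use the same expression for the planner to consider the index.
lookup_sql = "SELECT id FROM users WHERE lower(email) = ?"
key = ("alice@example.com",)

plan_rows = conn.execute("EXPLAIN QUERY PLAN " + lookup_sql, key).fetchall()
lookup_plan = " | ".join(row[3] for row in plan_rows)
matches = conn.execute(lookup_sql, key).fetchall()

print(lookup_plan)
print(matches)
```

The lookup is case-insensitive even though the stored email has mixed case, and the plan names the expression index instead of a full table scan.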
PostgreSQL allows partial indexes, where only rows that match a predicate are indexed. For instance, you might want to index the customer table for only those records which are active. This might look something like:
create index i on customers (id, name, whatever) where is_active is true;
If you index many columns, and you have many inactive customers, this can be a big win in terms of space (the index will be stored in fewer disk pages) and thus performance. To hit the index you need to, at a minimum, specify the predicate:
select name from customers where is_active is true;
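The point about having to repeat the predicate can be checked with a small experiment. The sketch below is an assumption-laden stand-in: it uses SQLite (which also supports partial indexes) through Python's sqlite3 module instead of PostgreSQL, writes the predicate as is_active = 1, and uses a hypothetical index name idx_active_customers.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, is_active INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (name, is_active) VALUES (?, ?)",
    [("customer%d" % i, 1 if i % 10 == 0 else 0) for i in range(100)],
)

# Partial index: only rows with is_active = 1 are stored in the index.
conn.execute(
    "CREATE INDEX idx_active_customers ON customers (name) WHERE is_active = 1"
)


def plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# The query repeats the index predicate, so the partial index is usable.
with_predicate = plan(
    "SELECT name FROM customers WHERE is_active = 1 AND name = ?", ("customer10",)
)
# Without the predicate the index may be missing rows, so it cannot be used.
without_predicate = plan(
    "SELECT name FROM customers WHERE name = ?", ("customer10",)
)

print(with_predicate)
print(without_predicate)
```

Only the first query's plan mentions the partial index. The second falls back to scanning the table, because the planner cannot prove that every matching row is in the index.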
Conventional wisdom suggests that index choice should be based on cardinality. They'll say,
For a low cardinality column like GENDER, use bitmap. For a high cardinality like LAST_NAME, use b-tree.
This is not the case with Oracle, where index choice should instead be based on the type of application (OLTP vs. OLAP). DML on tables with bitmap indexes can cause serious lock contention. On the other hand, the Oracle CBO can easily combine multiple bitmap indexes together, and bitmap indexes can be used to search for nulls. As a general rule:
For an OLTP system with frequent DML and routine queries, use btree. For an OLAP system with infrequent DML and adhoc queries, use bitmap.
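To see why combining bitmap indexes is cheap, here is a toy model in Python. It is not how Oracle stores or evaluates bitmap indexes; it only illustrates the idea that each distinct value gets a bitmask of row positions, so AND/OR conditions across columns become bitwise operations. The function names and sample rows are invented for the example.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bitmask of row positions."""
    index = {}
    for position, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_positions(bitmap):
    """Expand a bitmask back into the list of row positions whose bit is set."""
    return [pos for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1]


# Invented sample data with two low-cardinality columns.
rows = [
    {"gender": "F", "region": "EU"},
    {"gender": "M", "region": "EU"},
    {"gender": "F", "region": "US"},
    {"gender": "F", "region": "EU"},
]

gender_idx = build_bitmap_index(rows, "gender")
region_idx = build_bitmap_index(rows, "region")

# WHERE gender = 'F' AND region = 'EU' is a single bitwise AND.
both = matching_positions(gender_idx["F"] & region_idx["EU"])
# WHERE gender = 'M' OR region = 'US' is a single bitwise OR.
either = matching_positions(gender_idx["M"] | region_idx["US"])

print(both, either)
```

The same model hints at the DML cost: changing one row means updating a bit in a structure shared by many rows, which in a real system is where lock contention comes from.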
I'm not sure if this applies to other databases; comments are welcome. The following articles discuss the subject further:
Bitmap Index vs. B-tree Index: Which and When?
Understanding Bitmap Indexes
Different database systems have different names for the same type of index, so be careful with this. For example, what SQL Server and Sybase call "clustered index" is called in Oracle an "index-organised table".
I suggest you search the blogs of Jason Massie (http://statisticsio.com/) and Brent Ozar (http://www.brentozar.com/) for related info. They have some posts about real-life scenarios that deal with indexes.
Oracle has various combinations of b-tree, bitmap, partitioned and non-partitioned, reverse key, bitmap join, and domain indexes.
Here's a link to the 11gR1 documentation on the subject: http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/data_acc.htm#PFGRF004
Unique
cluster
non-cluster
column store
Index with included column
index on computed column
filtered
spatial
xml
full text
SQL Server 2008 has filtered indexes, similar to PostgreSQL's partial indexes. Both let you include in the index only the rows that match specified criteria.
The syntax is identical to PostgreSQL:
create index i on Customers(name) where is_alive = cast(1 as bit);
To see the types of indexes and what they mean, visit:
https://msdn.microsoft.com/en-us/library/ms175049.aspx