Is it possible to create index on external table in HIVE? It could be any index, Compact or Bitmap. In some place I read that it is not possible to create index on external table but somewhere else I also read that it doesn't matter. So I want to know for sure.
Hive indexing was added in version 0.7.0, and bitmap indexing was added in version 0.8.0.
Create/Drop/Alter Index
more details
You can perform indexing on both the tables. Internal or External table does not make a difference as far as performance is considered. You can build indexes on both. Either ways building indexes on large data sets is counter intuitive.
Here are few scenarios when indexing is not preferred
Indexes are advised to build on the columns on which you frequently
perform operations.
Building more number of indexes also degrade the performance of your query.
Type of index to be created should be identified prior to its creation (if your data requires bitmap you should not create
compact). This leads to increase in time for executing your query.
You can refer to the below link for more details on how to perform indexing in Hive
https://acadgild.com/blog/indexing-in-hive/
Related
In the project I'm working with, we have a table which sees a lot of read/write activity.
It's sort of a "visibilities" table; a background job is constantly running to generate records into this table based on the creation of other business domain entities.
This table needs to get searched against (and is being updated) on a regular basis, and we're running into performance problems because of it.
When we introduce indexes to improve the speed of search, it causes timeout issues with writing to the table when people perform updates. The table is relatively large and the search criteria is a bit complex so the indexes are large.
What I'm wondering, is if I added an "archived" bit column to the table, consistently marked somewhat old records as archived, could I re-structure the indexes to only index data which is Archived=0? Would that allow me to reduce the size of the indexes (and thus the performance impact of writing to those tables)?
I would assume no since the indexes must still consider which records are archived or not, but I'm not a SQL expert so I wanted to check.
If that would not be an ideal setup, what might I do to accomplish a similar result?
You can create a Filtered Index, which can index only the columns where Archived=0, and can be used only in queries that specify WHERE Archived=0 and ....
How can I improve my performance issue? I have an sql query with 'IN' I guess 'IN' making some costly performance issue. But I need index my sql query?
My sql query:
SELECT [p].[ReferencedxxxId]
FROM [Common].[xxxReference] AS [p]
WHERE ([p].[IsDeleted] = 0)
AND (([p].[ReferencedxyzType] = #__refxyzType_0)
AND [p].[ReferencedxxxId] IN ('42342','ffsdfd','5345345345'))
My solution: (BUT I NEED YOUR HELP FOR BETTER ADVISE) Whichone is correct clustered or nonclustred index?
USE [xxx]
GO
CREATE NONCLUSTERED INDEX IX_NonClusteredIndexDemo_xxxId
ON [Common].[xxxReference](xxxId)
INCLUDE ([ID],[ReferencedxxxId])
WITH (DROP_EXISTING=ON, ONLINE=ON, FILLFACTOR=90)
GO
Second:
CREATE INDEX xxxReference_ReferencedxxxId_index
ON [Common].[xxxReference] (ReferencedxxxId)[/code]
Whichone is correct or do you have better solution?
The performance problem of this query is not the result of using the IN operator.
This operator performs very well with small lists (say, less than 1000 members).
The performance bottle neck here is the fact that SQL Server performs an index scan instead of an index seek (which is very costly), and the key lookup, which is 20% of the query cost.
To avoid both problems, you can add an index on IsDeleted, ReferencedxyzType and ReferencedxxxId - probably in this exact order.
SQL Performance tuning is a science that tends to look a little like art or magic - either way you look at it it requires a good knowledge of both the theory and practice of index settings and the relevant systems requirements.
Therefor, my suggestion is this: Do not attempt to solve it yourself with the help of strangers on the internet. Get an expert for a consulting job for a couple of hours/days to analyze the system and help you fine-tune it.
Learn whatever you can during this process. Ask questions about everything that is not trivial. This will be money well spent.
Couple of things:
If you have a SELECT statement inside the IN, that should be avoided
and should be replaced with an EXISTS clause. But in your above
example, that is not relevant as you have direct values inside IN.
Using EXISTS and NOT EXISTS instead of IN and NOT IN helps SQL
Server to not needing to scan each value of the column for each
values inside the IN / NOT IN and rather can short circuit the
search once a match or non-match found.
Avoid the implicit conversion. They degrade the performance due to
many reasons including i> SQL Server not able to find proper
statistics on an index and hence not able to leverage an index and
would rather go make use of a clustered index available in the table
(which may not be covering your query), ii> Not assigning proper
required RAM during memory allocation phase of the query by storage
engine, iii> Cardinality estimation becomes wrong as SQL Server
would not have statistics on the computed value of that column, and
rather probably had statistics on that column.
If you look at your execution plan posted above, you will see a
yellow mark in your 'SELECT'. If you hover over it, you will see
one/more warning messages. If your warning is related to implicit
conversion, try to use proper datatypes during comparison.
Eg. What is the datatype of the column '[ReferencedxxxId]'? If it
is not an NVARCHAR and is rather a VARCHAR, then I would suggest:
Make the values inside the IN as VARCHAR (currently you are making them NVARCHAR). This way you will still be able to take full advantage of the rowstore index created on [ReferencedxxxId] column.
If you must have the values as NVARCHAR inside the IN clause, then you should:
CONVERT/CAST the column [ReferencedxxxId] in your IN clause. This is going to get rid of the Implicit conversion but you will no longer be able to take full advantage of the rowstore index on [ReferencedxxxId] column.
+
Rather create a clustered/nonclustered columnstore index on the table covering the columns used in the query. This should significantly enhance the performance of your SELECT query.
If you decided to go with the route of using rowstore index by correcting the values inside the IN, you need to make sure that you create a clustered/nonclustered index which covers the query. Meaning the index covers the columns on which you are doing search ([ReferencedxxxId], [ReferencedxxxType], [IsDeleted]) and then including the columns used in SELECT statement under INCLUDE clause (if it is a nonclustered index)
Also, when you are creating a composite rowstore index, try to keep the order of columns inside the index high cardinality to low cardinality from left to right to make the best use of that index.
On the basis of assuming an OLTP based system and not OLAP, my first pass would be an NC Index - given isDeleted is likely to have the least selectivity, I would place it last, first pass would be an NC index ReferencedxyzType, ReferencedxxxId, IsDeleted
I might even be tempted in a higher volume scenario to move the IsDeleted out of the index onto an include instead, since it provides so little selectivity to the index itself.
There is clearly already a clustered index in place on the table (from the query plan we can see it), we don't have the details of what is in it.
The question around clustered vs non-clustered is more complex and requires a lot more knowledge of the system and usage.
I know that indexing has been removed in the latest versions of hive, but I'd still like to know the difference between the 2.
The main difference is how they store the mapping from values to the rows in which the value occurs so that when we query we can identify the blocks fast which has relevant data.
Compact indexing stores the pair of indexed column’s value and its block id while Bitmap indexing stores the combination of indexed column value and list of rows as a bitmap.
Bitmap indexing is a standard technique for indexing columns with few distinct values.
I would recommend to read this excellent blog post about Hive Indexing.
Additional Information
There are other things which you might want to know here.
Indexes has been removed with Hive 3.0, they recommend to use materialized view to achieve similar results but I would say go with columnar storage like PARQUET or ORC, they they can do selective scanning and even skip entire files/blocks.
ORC format has build in Indexes which allow the format to skip blocks of data during read, they also support Bloom filters index.
At blog I see below statement
Secondary Indexes
Secondary indexes are a first-class construct in MongoDB. This makes
it easy to index any property of an object stored in MongoDB even if
it is nested. This makes it really easy to query based on these
secondary indexes
Cassandra has only cursory support for secondary
indexes. Secondary indexes are also limited to single columns and
equality comparisons. If you are mostly going to be querying by the
primary key then Cassandra will work well for you.
My question is can't Cassandra create more than one secondary index on separate columns ?
Also can't we execute operation like or full text search on Cassandra as it says secondary index are good for only equality comparison
Update :-
What is the difference between cassandra secondary index and Mongo secondary index ?
Cassandra create more than one secondary index on separate columns ?
Yes it can. Multiple Indexes are possible but ALLOW FILTERING must be used to query, which affects the performance. Secondary index in cassandra are not like the one in RDBMS and proper analysis should be done before using it.
can't we execute operation like or full text search on Cassandra as it
says secondary index are good for only equality comparison
Normal Secondary index does not support like operation. Though latest cassandra version (3.x) has support for SASI Index which has support for like or CONTAINS operation.
Custom SASI Index
What are the different types of indexes, what are the benefits of each?
I heard of covering and clustered indexes, are there more? Where would you use them?
Unique - Guarantees unique values for the column(or set of columns) included in the index
Covering - Includes all of the columns that are used in a particular query (or set of queries), allowing the database to use only the index and not actually have to look at the table data to retrieve the results
Clustered - This is way in which the actual data is ordered on the disk, which means if a query uses the clustered index for looking up the values, it does not have to take the additional step of looking up the actual table row for any data not included in the index.
OdeToCode has a good article covering the basic differences
As it says in the article:
Proper indexes are crucial for good
performance in large databases.
Sometimes you can make up for a poorly
written query with a good index, but
it can be hard to make up for poor
indexing with even the best queries.
Quite true, too... If you're just starting out with it, I'd focus on clustered and composite indexes, since they'll probably be what you use the most.
I'll add a couple of index types
BITMAP - when you have very low number of different possible values, very fast and doesn't take up much space
PARTITIONED - allows the index to be partitioned based on some property usually advantageous on very large database objects for storage or performance reasons.
FUNCTION/EXPRESSION indexes - used to pre-calculate some value based on the table and store it in the index, a very simple example might be an index based on lower() or a substring function.
PostgreSQL allows partial indexes, where only rows that match a predicate are indexed. For instance, you might want to index the customer table for only those records which are active. This might look something like:
create index i on customers (id, name, whatever) where is_active is true;
If your index many columns, and you have many inactive customers, this can be a big win in terms of space (the index will be stored in fewer disk pages) and thus performance. To hit the index you need to, at a minimum, specify the predicate:
select name from customers where is_active is true;
Conventional wisdom suggests that index choice should be based on cardinality. They'll say,
For a low cardinality column like GENDER, use bitmap. For a high cardinality like LAST_NAME, use b-tree.
This is not the case with Oracle, where index choice should instead be based on the type of application (OLTP vs. OLAP). DML on tables with bitmap indexes can cause serious lock contention. On the other hand, the Oracle CBO can easily combine multiple bitmap indexes together, and bitmap indexes can be used to search for nulls. As a general rule:
For an OLTP system with frequent DML and routine queries, use btree. For an OLAP system with infrequent DML and adhoc queries, use bitmap.
I'm not sure if this applies to other databases, comments are welcome. The following articles discuss the subject further:
Bitmap Index vs. B-tree Index: Which and When?
Understanding Bitmap Indexes
Different database systems have different names for the same type of index, so be careful with this. For example, what SQL Server and Sybase call "clustered index" is called in Oracle an "index-organised table".
I suggest you search the blogs of Jason Massie (http://statisticsio.com/) and Brent Ozar (http://www.brentozar.com/) for related info. They have some post about real-life scenario that deals with indexes.
Oracle has various combinations of b-tree, bitmap, partitioned and non-partitioned, reverse byte, bitmap join, and domain indexes.
Here's a link to the 11gR1 documentation on the subject: http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/data_acc.htm#PFGRF004
Unique
cluster
non-cluster
column store
Index with included column
index on computed column
filtered
spatial
xml
full text
SQL Server 2008 has filtered indexes, similar to PostgreSQL's partial indexes. Both allow to include in index only rows matching specified criteria.
The syntax is identical to PostgreSQL:
create index i on Customers(name) where is_alive = cast(1 as bit);
To view the types of indexes and its meaning visits:
https://msdn.microsoft.com/en-us/library/ms175049.aspx