I know that indexing has been removed in the latest versions of Hive, but I'd still like to know the difference between the two.
The main difference is how they store the mapping from values to the rows in which each value occurs, so that a query can quickly identify the blocks that hold relevant data.
Compact indexing stores pairs of the indexed column's value and its block id, while Bitmap indexing stores the indexed column's value together with the list of rows, encoded as a bitmap.
Bitmap indexing is a standard technique for indexing columns with few distinct values.
I would recommend reading this excellent blog post about Hive Indexing.
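To make the two mappings concrete, here is a minimal Python sketch. It is not Hive's actual on-disk layout (as far as I know the real index tables also carry file names and offsets); the rows, block ids and function names are made up for illustration.

```python
from collections import defaultdict

# Toy table: each row is (block_id, value of the indexed column).
# Block ids and values are invented for this example.
rows = [
    (0, "red"), (0, "blue"), (0, "red"),
    (1, "blue"), (1, "blue"),
    (2, "green"), (2, "red"),
]

def build_compact_index(rows):
    """Map each value to the set of blocks that contain it."""
    index = defaultdict(set)
    for block_id, value in rows:
        index[value].add(block_id)
    return dict(index)

def build_bitmap_index(rows):
    """Map each value to an integer bitmap; bit i is set when row i holds the value."""
    index = defaultdict(int)
    for row_number, (_, value) in enumerate(rows):
        index[value] |= 1 << row_number
    return dict(index)

compact = build_compact_index(rows)
bitmap = build_bitmap_index(rows)

# Compact index: which blocks must be read for value = 'red'?
red_blocks = sorted(compact["red"])

# Bitmap index: predicates combine with cheap bitwise operations.
red_or_green = bitmap["red"] | bitmap["green"]
matching_rows = [i for i in range(len(rows)) if (red_or_green >> i) & 1]
```

The compact index only narrows the search down to blocks, while the bitmap pinpoints rows and makes OR/AND predicates a single bitwise operation, which is why bitmaps suit columns with few distinct values.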
Additional Information
There are other things which you might want to know here.
Indexes were removed in Hive 3.0. The recommendation is to use materialized views to achieve similar results, but I would say go with a columnar storage format like PARQUET or ORC: they can do selective scanning and even skip entire files/blocks.
The ORC format has built-in indexes that allow a reader to skip blocks of data, and it also supports Bloom filter indexes.
Related
I wonder how I can find a specific value in a DB without going through the entire DB table.
For example:
There is a DB of students and we are looking for all the students with a certain name. How do you do that without going through the whole DB table?
Use INDEXES
Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed. ... Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
SQL Server has four options for improving performance for this type of query:
A regular index (either clustered or non-clustered).
A full text index.
Partitioning.
Hash index (for memory optimized tables).
A regular index, created using create index, is the "canonical" answer to this question. It is like an alphabetical list of all names with a pointer to the record. The implementation uses something called B-trees, so the analogy is not perfect. These indexes can be used for equality comparisons (e.g. =, in, is null) and inequality comparisons (e.g. <, >).
A full text index indexes all words in a text column (for some definition of "word"). This can be used for a range of full text search options -- and available through contains.
Partitioning is used when you have lots and lots of data and only a handful of categories. That is highly unlikely with a name in a student database. But it physically splits the data into separate files for each name or range of names.
Hash-based indexing is only available on memory-optimized tables. These are only useful for comparisons using = and in (and <> and not in).
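The practical difference between an ordered (B-tree style) index and a hash index can be sketched in a few lines of Python. This is only an analogy: a sorted list with binary search stands in for the B-tree, a dict stands in for the hash index, and the student names are made up.

```python
import bisect

# Hypothetical student table: row position -> name.
students = ["Dana", "Alice", "Bob", "Alice", "Carol", "Eve"]

# "Regular" index: (name, row) pairs kept in sorted order.
# A sorted list with binary search stands in for a B-tree here.
sorted_index = sorted((name, row) for row, name in enumerate(students))

def lookup_equal(name):
    """Rows whose name equals `name`, found by binary search."""
    lo = bisect.bisect_left(sorted_index, (name, -1))
    hi = bisect.bisect_left(sorted_index, (name, len(students)))
    return [row for _, row in sorted_index[lo:hi]]

def lookup_range(low, high):
    """Rows with low <= name < high; possible because the index is ordered."""
    lo = bisect.bisect_left(sorted_index, (low, -1))
    hi = bisect.bisect_left(sorted_index, (high, -1))
    return [row for _, row in sorted_index[lo:hi]]

# Hash index: a dict gives equality lookups only, with no useful ordering.
hash_index = {}
for row, name in enumerate(students):
    hash_index.setdefault(name, []).append(row)

alice_rows = lookup_equal("Alice")
a_to_c_rows = lookup_range("A", "C")
alice_rows_hash = hash_index.get("Alice", [])
```

Both structures find "Alice" without scanning every row, but only the ordered one can answer the range question, which matches the = / in restriction on hash indexes mentioned above.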
Are column store indexes in SQL Server useful only when the query uses aggregate functions?
Certainly not. Although they were designed for DWH environments, they can be used in OLTP environments too.
And even when used in a DWH, aggregation is not a requirement.
Columnstore indexes use a different storage format for data, storing
compressed data on a per-column rather than a per-row basis. This
storage format benefits query processing in data warehousing,
reporting, and analytics environments where, although they typically
read a very large number of rows, queries work with just a subset of
the columns from a table.
So the first benefit is data compression.
Compression in columnstore is table-wide, not page-wide as it is (I mean the dictionary that gets applied) when you use PAGE data compression. So the compression ratio is the best. A table with a clustered columnstore index uses less space than the same table with no columnstore but with page compression enabled.
The second benefit is for queries that filter nothing (or almost nothing, needing (almost) all the rows) but need only some columns to be returned.
When the table is stored "per-row", even if you want only 10 columns of 100, and you want all the rows, the whole table will be read, because there is a need to read the whole row to get your 10 requested columns out of it. When you use "per-column" storage, only needed columns will be read.
Of course you can define an index with your 10 needed columns as included, but it will use additional space and add the overhead of maintaining this index. Now imagine that your queries need these 10, and another 10, and another 20 of the 100, so you need to create more indexes for these queries.
With one columnstore index you will be able to satisfy all these queries.
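A rough way to see the "10 columns out of 100" argument is to count how many values each layout has to touch. This is a toy Python model, not how SQL Server actually stores pages or segments; the table shape and names are assumptions for the example.

```python
# Hypothetical table with 100 columns and 1,000 rows of integers.
NUM_COLUMNS = 100
NUM_ROWS = 1000
column_names = [f"c{i}" for i in range(NUM_COLUMNS)]

# Row store: one record per row, every column stored together.
row_store = [
    {name: r * NUM_COLUMNS + i for i, name in enumerate(column_names)}
    for r in range(NUM_ROWS)
]

# Column store: one list per column.
column_store = {
    name: [row[name] for row in row_store] for name in column_names
}

wanted = column_names[:10]  # the query needs 10 of the 100 columns

def values_touched_row_store(wanted):
    """Every row is read in full before the wanted columns are picked out."""
    touched = 0
    result = []
    for row in row_store:
        touched += len(row)
        result.append(tuple(row[name] for name in wanted))
    return touched, result

def values_touched_column_store(wanted):
    """Only the wanted columns are read."""
    touched = sum(len(column_store[name]) for name in wanted)
    result = list(zip(*(column_store[name] for name in wanted)))
    return touched, result

row_cost, row_result = values_touched_row_store(wanted)
col_cost, col_result = values_touched_column_store(wanted)
```

Both functions return the same rows, but in this model the row layout touches all 100,000 values while the column layout touches 10,000, and that holds for any other subset of columns without building another index.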
Columnstore indexes store data in a columnar format, so they are quite helpful when you are using aggregate functions. One reason is that homogeneous data compresses well, which makes aggregating a column much faster.
But this is not the only use of columnstore indexes. It is really helpful when you are processing millions of rows (in multidimensional data models).
Check out the official documentation and this as well for better understanding.
You can't say they are always useful for aggregate functions, as it depends on which rows are included in the aggregation. If you are aggregating over all rows, they are useful. If you are selecting only a small number of rows because of filtering, you can even get a worse result than with a traditional non-clustered index.
As it is written in MSDN, they can be used:
to achieve up to 10x query performance gains in your data warehouse
over traditional row-oriented storage
to get 10x data compression over the uncompressed data size (if you are interested in compression you should check the COLUMNSTORE_ARCHIVE option)
Also, depending on your SQL Server version (SQL Server 2017 or later), you can look into Adaptive Query Processing, as one of its conditions is to have such an index.
You should look through the documentation to see what options you have for your SQL Server version, and test how this index actually affects performance, because it is very possible to make things worse.
It's good that in every article Microsoft mentions the scenarios in which each type of columnstore index can be put to good use.
Is it possible to create an index on an external table in Hive? It could be any index, Compact or Bitmap. In one place I read that it is not possible to create an index on an external table, but somewhere else I read that it doesn't matter. So I want to know for sure.
Hive indexing was added in version 0.7.0, and bitmap indexing was added in version 0.8.0.
Create/Drop/Alter Index
more details
You can build indexes on both kinds of table. Whether a table is internal or external makes no difference as far as performance is concerned. Either way, building indexes on large data sets is counterintuitive.
Here are a few scenarios in which indexing is not preferred
Indexes are advised to be built on the columns on which you frequently perform operations.
Building a large number of indexes also degrades the performance of your queries.
The type of index should be identified before it is created (if your data calls for a bitmap index, you should not create a compact one). The wrong choice increases the time it takes to execute your query.
You can refer to the below link for more details on how to perform indexing in Hive
https://acadgild.com/blog/indexing-in-hive/
In the Redshift FAQ under
Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?
It says the following:
Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
Why is this the case?
It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.
Materialised Views
I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.
I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?
Indexes
RedShift does not have indexes.
It does have SORT ORDER which is exceptionally similar to a clustered index. It is simply a list of fields by which the data is ordered (like a composite clustered index).
It even has recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to have multiple independent sort orders. Instead of ordering by a THEN b THEN c it effectively orders by each of them at the same time.
That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics
As well as being the storage pattern this effectively becomes a set of pseudo indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This in turn allows RedShift to identify which blocks to read from the other columns
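The block-skipping steps in the list above can be sketched as follows. This is a simplified Python model of per-block min/max statistics, not Redshift's implementation; the block size of 4 values and the sample data are assumptions chosen to keep the example readable.

```python
BLOCK_SIZE = 4  # values per block; real blocks are far larger

def build_zone_map(column):
    """Split a column into blocks and record (min, max) for each block."""
    blocks = [column[i:i + BLOCK_SIZE] for i in range(0, len(column), BLOCK_SIZE)]
    zone_map = [(min(b), max(b)) for b in blocks]
    return blocks, zone_map

def find_rows(column, target):
    """Return matching row positions and how many blocks were actually scanned."""
    blocks, zone_map = build_zone_map(column)
    rows, scanned = [], 0
    for block_no, (lo, hi) in enumerate(zone_map):
        if target < lo or target > hi:
            continue  # the statistics rule this block out
        scanned += 1
        for offset, value in enumerate(blocks[block_no]):
            if value == target:
                rows.append(block_no * BLOCK_SIZE + offset)
    return rows, scanned

# Column z, roughly clustered, so the block ranges overlap little.
z = [10, 12, 11, 15, 40, 42, 41, 45, 1234, 1230, 1240, 1234, 90, 95, 91, 99]

rows, scanned = find_rows(z, 1234)

# The matching row positions tell us which blocks of the *other* columns to read.
blocks_to_read = sorted({r // BLOCK_SIZE for r in rows})
```

Only one of the four blocks is scanned here. If the values of z were spread randomly across blocks, every block's range would include 1234 and nothing could be skipped, which is why the sort order matters so much.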
As of Dec 2019, Redshift has a preview of materialized views: Announcement
from the documentation: A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Amazon Redshift returns the precomputed results from the materialized view, without having to access the base tables at all. From the user standpoint, the query results are returned much faster compared to when retrieving the same data from the base tables.
Indexes are basically used in OLTP systems to retrieve a specific value or a small group of values. OLAP systems, by contrast, retrieve a large set of values and perform aggregation on it. Indexes would not be the right fit for OLAP systems. Instead, Redshift uses a secondary structure called zone maps, together with sort keys.
The indexes operate on B trees. The 'life without a btree' section in the below blog explains with examples how an index based out of btree affects OLAP workloads.
https://blog.chartio.com/blog/understanding-interleaved-sort-keys-in-amazon-redshift-part-1
The combination of columnar storage, compression encodings, data distribution, query compilation, optimization, etc. is what makes Redshift fast.
Applying these factors reduces IO operations on Redshift and ultimately gives better performance. Implementing an efficient solution requires a great deal of knowledge of the areas above, as well as of the queries that you will run on Amazon Redshift.
For example:
Redshift supports Sort keys, Compound Sort keys and Interleaved Sort keys.
If your table structure is lineitem(orderid,linenumber,supplier,quantity,price,discount,tax,returnflat,shipdate).
If you select orderid as your sort key but your queries are based on shipdate, Redshift will not be operating efficiently.
If you have a compound sort key on (orderid, shipdate) and your queries filter only on shipdate, Redshift will not be operating efficiently.
If you have an interleaved sort key on (orderid, shipdate) and if your query
Redshift does not support materialized views, but it easily allows you to create (temporary/permanent) tables by running select queries on existing tables. This duplicates data, but stores it in the format your queries need (similar to a materialized view). The blog below gives you some information on this approach.
https://www.periscopedata.com/blog/faster-redshift-queries-with-materialized-views-lifetime-daily-arpu.html
Redshift fared well against other systems like Hive, Impala, Spark, BQ etc. in one of our recent benchmarks.
The simple answer is: because it can read the needed data really, really fast and in parallel.
One of the primary uses of indexes is "needle-in-the-haystack" queries. These are queries where only a relatively small number of rows are needed and these match a WHERE clause. Columnar datastores handle these differently. The entire column is read into memory -- but only the column, not the rest of the row's data. This is sort of similar to having an index on each column, except the values need to be scanned for the match (that is where the parallelism comes in handy).
Other uses of indexes are for matching key pairs for joining or for aggregations. These can be handled by alternative hash-based algorithms.
As for materialized views, RedShift's strength is not updating data. Many such queries are quite fast enough without materialization. And, materialization incurs a lot of overhead for maintaining the data in a high transaction environment. If you don't have a high transaction environment, then you can increment temporary tables after batch loads.
They recently added support for Materialized Views in Redshift: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-introduces-support-for-materialized-views-preview/
Syntax for materialized view creation:
CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query
Syntax for refreshing a materialized view:
REFRESH MATERIALIZED VIEW mv_name
What are the different types of indexes, what are the benefits of each?
I heard of covering and clustered indexes, are there more? Where would you use them?
Unique - Guarantees unique values for the column(or set of columns) included in the index
Covering - Includes all of the columns that are used in a particular query (or set of queries), allowing the database to use only the index and not actually have to look at the table data to retrieve the results
Clustered - This is the way in which the actual data is ordered on the disk, which means that if a query uses the clustered index to look up values, it does not have to take the additional step of looking up the actual table row for any data not included in the index.
OdeToCode has a good article covering the basic differences
As it says in the article:
Proper indexes are crucial for good
performance in large databases.
Sometimes you can make up for a poorly
written query with a good index, but
it can be hard to make up for poor
indexing with even the best queries.
Quite true, too... If you're just starting out with it, I'd focus on clustered and composite indexes, since they'll probably be what you use the most.
I'll add a couple of index types
BITMAP - when you have very low number of different possible values, very fast and doesn't take up much space
PARTITIONED - allows the index to be partitioned based on some property usually advantageous on very large database objects for storage or performance reasons.
FUNCTION/EXPRESSION indexes - used to pre-calculate some value based on the table and store it in the index, a very simple example might be an index based on lower() or a substring function.
PostgreSQL allows partial indexes, where only rows that match a predicate are indexed. For instance, you might want to index the customer table for only those records which are active. This might look something like:
create index i on customers (id, name, whatever) where is_active is true;
If you index many columns, and you have many inactive customers, this can be a big win in terms of space (the index will be stored in fewer disk pages) and thus performance. To hit the index you need to, at a minimum, specify the predicate:
select name from customers where is_active is true;
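If you want to try the idea without a PostgreSQL server, here is a small sketch using Python's built-in sqlite3 module. SQLite also supports partial indexes, but this is an assumption-laden stand-in: the table contents are made up, the predicate is written as is_active = 1 rather than is_active is true, and the planner's behaviour is SQLite's, not PostgreSQL's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table customers (id integer primary key, name text, is_active integer)"
)
conn.executemany(
    "insert into customers (id, name, is_active) values (?, ?, ?)",
    [(1, "Ann", 1), (2, "Ben", 0), (3, "Cy", 1), (4, "Di", 0)],
)

# Partial index: only rows with is_active = 1 get an index entry.
conn.execute("create index i on customers (id, name) where is_active = 1")

# The query repeats the index predicate so the planner is allowed to use the index.
query = "select name from customers where is_active = 1 order by id"
active_names = [name for (name,) in conn.execute(query)]

# The plan text differs between SQLite versions, so it is only collected here
# for inspection; the last field of each plan row is the human-readable detail.
plan = [row[-1] for row in conn.execute("explain query plan " + query)]
```

Printing plan shows whether the index was picked for this query; dropping the where is_active = 1 condition from the query means the partial index can no longer be used to answer it.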
Conventional wisdom suggests that index choice should be based on cardinality. They'll say,
For a low cardinality column like GENDER, use bitmap. For a high cardinality like LAST_NAME, use b-tree.
This is not the case with Oracle, where index choice should instead be based on the type of application (OLTP vs. OLAP). DML on tables with bitmap indexes can cause serious lock contention. On the other hand, the Oracle CBO can easily combine multiple bitmap indexes together, and bitmap indexes can be used to search for nulls. As a general rule:
For an OLTP system with frequent DML and routine queries, use btree. For an OLAP system with infrequent DML and adhoc queries, use bitmap.
I'm not sure if this applies to other databases, comments are welcome. The following articles discuss the subject further:
Bitmap Index vs. B-tree Index: Which and When?
Understanding Bitmap Indexes
Different database systems have different names for the same type of index, so be careful with this. For example, what SQL Server and Sybase call "clustered index" is called in Oracle an "index-organised table".
I suggest you search the blogs of Jason Massie (http://statisticsio.com/) and Brent Ozar (http://www.brentozar.com/) for related info. They have some posts about real-life scenarios that deal with indexes.
Oracle has various combinations of b-tree, bitmap, partitioned and non-partitioned, reverse byte, bitmap join, and domain indexes.
Here's a link to the 11gR1 documentation on the subject: http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/data_acc.htm#PFGRF004
Unique
cluster
non-cluster
column store
Index with included column
index on computed column
filtered
spatial
xml
full text
SQL Server 2008 has filtered indexes, similar to PostgreSQL's partial indexes. Both allow an index to include only the rows that match specified criteria.
The syntax is identical to PostgreSQL:
create index i on Customers(name) where is_alive = cast(1 as bit);
To see the types of indexes and what they mean, visit:
https://msdn.microsoft.com/en-us/library/ms175049.aspx