What is the correlation between HBase and HCatalog? - hive

Can anyone explain what the correlation between HCatalog and HBase is, please?
I've found these definitions:
Apache HCatalog
HCatalog is a metadata abstraction layer for referencing data without using the underlying filenames or formats. It insulates users and scripts from how and where the data is physically stored.
Apache HBase
HBase (Hadoop DataBase) is a distributed, column oriented database. HBase uses HDFS for the underlying storage. It supports both batch style computations using MapReduce and point queries (random reads).
When we use CREATE TABLE in Hive, it creates the table in HCatalog. I just don't get it. Why not in a real DATABASE, which is HBase?
HCatalog seems to be some kind of metadata repository for all data stores. Does it mean it also keeps information about databases and tables in HBase?
I'll be grateful for an explanation.
Regards
Pawel

When you CREATE TABLE in Hive, it registers it in HCatalog. A table in Hive may be an HBase table, but it can also be an abstraction above HDFS files and directories.
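For illustration, here is a minimal sketch of that mapping, using the Hive HBase storage handler (the HBase table name test, the column family cf and the Hive column names are just examples):
CREATE EXTERNAL TABLE hbase_test (rowkey STRING, a STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:a")
TBLPROPERTIES ("hbase.table.name" = "test");
Dropping this external table only removes the Hive/HCatalog metadata; the data itself stays in HBase.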
You can find a nice explanation of HCatalog on HortonWorks' site

Because I've noticed the question is quite popular, I've decided to answer it myself, since I've come to understand it quite well after asking it.
So, first of all, since Hadoop 2.0 HCatalog and Hive are treated as one product. Hive creates tables in HCatalog by default, which means that the natural interface for HCatalog is Hive. So you can use all the SQL-92 DMLs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML) and DDLs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL), starting from create/alter/drop database, through create/alter/drop table, and ending with select, insert into, etc. The only exception is that insert works only as insert into ... select ... from.
For a typical insert we have to use:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Tables can have partitions and indexes (though from my experience indexes don't work well), but it is not a relational database, so you cannot use foreign keys.
With HBase it is quite different. It is one of the NoSQL databases (but, as answered in the previous post, Hive can be an SQL interface for HBase).
Its tables are organized as key -> value stores.
Let's compare a few commands (create table, insert into table, select from table, drop table).
Hive:
create table table_name (
id int,
value1 string,
value2 string
)
partitioned by (date string)
LOAD DATA INPATH 'filepath' INTO TABLE table_name [PARTITION (partcol1=val1, partcol2=val2 ...)]
INSERT INTO TABLE table_name SELECT * FROM othertable
SELECT * FROM table_name
DROP TABLE table_name
HBase:
hbase> create 'test', 'cf'
hbase> put 'test', 'row1', 'cf:a', 'value1'
hbase> get 'test', 'row1'
hbase> disable 'test'
hbase> drop 'test'
As you can see, the syntax is completely different. For SQL users, working with HCatalog is natural, while those used to NoSQL databases will feel comfortable with HBase.

Related

Optimize Temporary Table on Presto/Hive SQL

I would like to optimize the computation time of queries run on Presto/Hive SQL. One of the techniques I used on Redshift was to improve the efficiency of temporary tables, as in the following:
BEGIN;
CREATE TEMPORARY TABLE my_temp_table(
column_a varchar(128) encode lzo,
column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b) -- Assuming you are sorting or grouping by column_b
;
INSERT INTO my_temp_table SELECT column_a, column_b FROM my_table;
COMMIT;
I have tried that on Presto/Hive SQL but it is not supported. Do you know the equivalent of this technique on Presto/Hive SQL?
Many thanks!
Redshift is a relational database; Presto is a distributed SQL query engine. Presto currently doesn't support the creation of temporary tables, nor the creation of indexes. But you may create tables based on a SQL statement via CREATE TABLE AS - see the Presto documentation.
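For example, a rough equivalent of the Redshift snippet above as a sketch (the table and column names are taken from the question; the result is a regular table, not a session-scoped temporary one):
CREATE TABLE my_temp_table AS
SELECT column_a, column_b
FROM my_table;

-- drop it yourself when done, since it is not session-scoped
DROP TABLE my_temp_table;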
You can optimize the performance of Presto in two ways:
Optimizing the query itself
Optimizing how the underlying data is stored
One of the best articles around is Top 10 Performance Tuning Tips for Amazon Athena - Athena is an AWS service based on Presto 0.172, so the tips should also work for Presto.
I am not a Redshift expert, but it seems you want to precompute a data set, distribute it, and sort it by selected columns, so that it is faster to query.
This corresponds to the Presto Hive connector's ability to:
partition data -- data with the same value in the partitioning column(s) forms a single partition, which is a folder on storage; do not use partitioning on high-cardinality columns. This is defined using the partitioned_by table property.
bucket data -- data is grouped into files using a hash of the bucketing column(s); this is similar to partitioning to a certain extent. This is defined using the bucketed_by and bucket_count table properties.
sort data -- within a data file, data is sorted by the given column(s). This is defined using the sorted_by table property.
See examples in Trino (formerly Presto SQL) Hive connector documentation
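As a sketch (untested, assuming a Hive connector catalog named hive and a schema named default), the Redshift distkey/sortkey intent could be approximated with bucketing and sorting properties; the bucket_count value here is an arbitrary tuning choice:
CREATE TABLE hive.default.my_precomputed_table
WITH (
    bucketed_by = ARRAY['column_a'],
    bucket_count = 32,
    sorted_by = ARRAY['column_b']
)
AS
SELECT column_a, column_b
FROM hive.default.my_table;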
Note: while I realize the documentation is scarce at the moment, I filed an issue to improve it. In the meantime, you can get additional information on the Trino (formerly Presto SQL) community Slack.

Identifying Partitioned tables in Hive [duplicate]

This question already has an answer here:
Check if a hive table is partitioned on a given column
Is there any way to list the partitioned tables in Hive?
I found a way to do this in SQL Server:
https://dba.stackexchange.com/questions/14996/how-do-i-get-a-list-of-all-the-partitioned-tables-in-my-database
I want to list only the partitioned tables under a specific database, so that I don't have to check the DDLs of numerous tables to find out whether each table is partitioned or not. Is there any similar functionality in Hive? Please suggest.
You can connect directly to the Hive metastore database and get information about which tables are partitioned.
You need to know the following, which may change according to your cluster configuration:
The database (e.g. PostgreSQL, MySQL, etc.) in which the Hive metastore is configured to store the metadata about the tables.
Usually metastore is the name of the database in which the Hive metastore stores this table information.
TBLS is the table which stores Hive table information, DBS is the table which stores Hive database information, and PARTITION_KEYS is the table which stores the partition key definitions for partitioned tables.
DB_ID is the foreign key to DBS in TBLS, and TBL_ID is the foreign key to TBLS in PARTITION_KEYS.
Join the tables like below (example for a PostgreSQL-backed metastore; the identifiers are quoted because they are upper case):
select d."NAME" as database_name, t."TBL_NAME" as table_name, p."PKEY_NAME" as partition_key_name
from "PARTITION_KEYS" p
join "TBLS" t on p."TBL_ID" = t."TBL_ID"
join "DBS" d on t."DB_ID" = d."DB_ID"
where d."NAME" = 'filterdbname';
This is the SQL approach. If a programmatic approach is needed, the HiveMetaStoreClient API can be used to query the metastore; a metastore connection setup is needed. In Java, the pseudocode is as follows:
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

HiveConf conf = new HiveConf();
// metastore URI, e.g. "thrift://metastore-host:9083"
conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://" + address + ":" + port);
HiveMetaStoreClient hiveMetaStoreClient = new HiveMetaStoreClient(conf);
// a table is partitioned if its partition key list is non-empty, e.g.
// !hiveMetaStoreClient.getTable(dbName, tableName).getPartitionKeys().isEmpty()

Hive - where is table information stored

I am creating and inserting into tables in Hive, and the files are created on HDFS, some on external storage (S3).
Assuming I created 10 tables, is there any system table in Hive where I can find the information about the tables created by the user? (For example, in Teradata we have DBC.tablesv, which holds information about all the user-defined tables.)
You can find where your metastore is configured to be in the hive-site.xml file.
Its usual location is under /etc/hive/{$hadoop_version}/ or /etc/hive/conf/.
grep for "hive.metastore.uris" or "javax.jdo.option.ConnectionURL" to see which db you are using for the metastore. The credentials should also be there.
If, for example, your metastore is on a MySQL server, you can run queries like
SELECT * FROM TBLS;
SELECT * FROM PARTITIONS;
and so on.
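For something closer to Teradata's DBC.tablesv (a listing of user tables per database), here is a sketch joining TBLS with DBS, assuming a MySQL-backed metastore with the default schema:
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE, t.OWNER
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
ORDER BY d.NAME, t.TBL_NAME;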
You can't query (as in SELECT ... FROM...) the metadata from within Hive.
You do, however, have commands that display that information, e.g. show databases, show tables, desc MyTable, etc.
I'm not sure I understood your question 100%. If you mean the information about the creation of the table, like the query itself, the location on HDFS, the table properties, etc., you can try:
SHOW CREATE TABLE <table>;
If you need to retrieve a list of the columns names and datatypes try with:
DESCRIBE <table>;
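If you also want the storage location and the table parameters in the output, there is additionally:
DESCRIBE FORMATTED <table>;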

hive doesn't change parquet schema

I have a problem with ALTER TABLE: it changes the table schema but not the Parquet schema.
For example, I have a PARQUET table with these columns:
column1(string) column2(string)
column3(string) column4(string)
column5(bigint)
Now, I try to change the table's schema with
ALTER TABLE name_table DROP COLUMN column3;
With DESCRIBE TABLE I can see that the column is no longer there.
Now I try to execute select * from table, but I receive an error like this:
"data.0.parq' has an incompatible type with the table schema for column column4. Expected type: INT64. Actual type: BYTE_ARRAY"
The values of the deleted column are still present in the Parquet file, which has 5 columns and not 4 (as the table schema does).
Is this a bug? How can I change the Parquet file's schema using Hive?
This is not a bug. When you drop the columns, that just updates the definition in the Hive Metastore, which only holds information about the table; the underlying files on HDFS remain unchanged. Since the Parquet metadata is embedded in the files, the files have no idea what the metadata change has been.
Hence you see this issue.
The solution is described here: if you want to add column(s) to a Parquet table and stay compatible with both Impala and Hive, you need to add the column(s) at the end.
If you alter the table and change column names or drop a column, that table will no longer be compatible with Impala.
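For example, appending a column at the end of the schema (column6 is a made-up name) looks like this:
ALTER TABLE name_table ADD COLUMNS (column6 STRING);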
I had the same error after adding a column to a Hive table.
The solution is to set the query option below in each session:
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
If you're using the Cloudera distribution, set it permanently in Cloudera Manager => Impala configuration => Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve):
set the config value as PARQUET_FALLBACK_SCHEMA_RESOLUTION=name

Schema Evolution in Parquet Hive table

I have a lot of data in a Parquet-based Hive table (Hive version 0.10). I have to add a few new columns to the table. I want the new columns to have data going forward; if the value is NULL for already-loaded data, that is fine with me.
If I add the new columns and do not update the old Parquet files, it gives an error, and it looks strange because I am adding String columns only:
Error getting row data with exception java.lang.UnsupportedOperationException: Cannot inspect java.util.ArrayList
Can you please tell me how to add new fields to a Parquet Hive table without affecting the data already in the table?
I use Hive version 0.10.
Thanks.
1)
Hive, starting with version 0.13, has Parquet schema evolution built in:
https://issues.apache.org/jira/browse/HIVE-6456
https://github.com/Parquet/parquet-mr/pull/297
P.S. Notice that out-of-the-box support for schema evolution might take a toll on performance. For example, Spark has a knob to turn Parquet schema merging on and off; after one of the recent Spark releases it is off by default because of the performance hit (especially when there are a lot of Parquet files). I'm not sure whether Hive 0.13+ has such a setting too.
2)
I also wanted to suggest creating views in Hive on top of such Parquet tables where you expect frequent schema changes, and using the views everywhere rather than the tables directly.
For example, if you have two tables - A and B - with compatible schemas, but table B has two more columns, you could work around this with:
CREATE VIEW view_1 AS
SELECT col1,col2,col3,null as col4,null as col5 FROM tableA
UNION ALL
SELECT col1,col2,col3,col4,col5 FROM tableB
;
So you don't actually have to recreate any tables as #miljanm has suggested; you can just recreate the view. It'll help with the agility of your projects.
Create a new table with the two new columns, then insert data by issuing:
insert into new_table select old_table.col1, old_table.col2,...,null,null from old_table;
The last two nulls are for the two new columns. That's it.
If you have too many columns, it may be easier for you to write a program that reads the old files and writes the new ones.
Hive 0.10 does not have support for schema evolution in Parquet as far as I know; Hive 0.13 does, so you may try to upgrade Hive.