Identifying Partitioned tables in Hive [duplicate] - sql

Is there any way to list the partitioned tables in Hive?
I found a way to do this in SQL Server:
https://dba.stackexchange.com/questions/14996/how-do-i-get-a-list-of-all-the-partitioned-tables-in-my-database
I want to list only the partitioned tables in a specific database so that I don't have to check the DDL of numerous tables to find out whether each one is partitioned. Is there similar functionality in Hive? Please suggest.

You can connect directly to the Hive metastore database and get the information about which tables are partitioned.
You need to know the following, which may vary with your cluster configuration:
The database (e.g. PostgreSQL, MySQL, etc.) in which the Hive metastore is configured to store the table metadata.
The name of the metastore database; usually it is metastore.
TBLS is the table that stores Hive table information, DBS is the table that stores Hive database information, and PARTITION_KEYS is the table that stores the partition columns of each table (PARTITIONS holds one row per partition that actually exists).
DB_ID is the foreign key to DBS in TBLS, and TBL_ID is the foreign key to TBLS in PARTITION_KEYS.
Join the tables as below:
select d."NAME" as DATABASE_NAME, t."TBL_NAME" as TABLE_NAME, p."PKEY_NAME" as PARTITION_KEY_NAME
from "PARTITION_KEYS" p
join "TBLS" t on p."TBL_ID" = t."TBL_ID"
join "DBS" d on t."DB_ID" = d."DB_ID"
where d."NAME" = 'filterdbname' and p."PKEY_NAME" is not null;
This is the SQL approach. If a programmatic approach is needed, the HiveMetaStoreClient API can be used to query the metastore. A metastore connection must be set up first. In Java, the pseudocode is below:
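import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
// address and port are placeholders for your metastore Thrift endpoint;
// hive.metastore.uris expects the form thrift://host:port, so address should carry the thrift:// scheme
HiveConf conf = new HiveConf();
conf.setVar(HiveConf.ConfVars.METASTOREURIS, address + ":" + port);
HiveMetaStoreClient hiveMetaStoreClient = new HiveMetaStoreClient(conf);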
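The metastore join described in this answer can be tried without a cluster. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the real metastore RDBMS, creates cut-down versions of DBS, TBLS and PARTITION_KEYS (the real tables have many more columns), and fills them with made-up rows. The database and table names (sales, orders, and so on) and the helper partitioned_tables are assumptions for the example, not part of Hive.

```python
import sqlite3

# In-memory stand-in for the metastore database; a real metastore would be
# MySQL, PostgreSQL, etc., and its tables have many more columns than these.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, TBL_NAME TEXT);
CREATE TABLE PARTITION_KEYS (TBL_ID INTEGER, PKEY_NAME TEXT, INTEGER_IDX INTEGER);
INSERT INTO DBS VALUES (1, 'sales'), (2, 'staging');
INSERT INTO TBLS VALUES (10, 1, 'orders'), (11, 1, 'customers'), (12, 2, 'raw_events');
INSERT INTO PARTITION_KEYS VALUES (10, 'order_date', 0), (10, 'region', 1), (12, 'dt', 0);
""")


def partitioned_tables(conn, db_name):
    """Return {table_name: [partition keys in order]} for one Hive database."""
    rows = conn.execute(
        """
        SELECT t.TBL_NAME, p.PKEY_NAME
        FROM PARTITION_KEYS p
        JOIN TBLS t ON p.TBL_ID = t.TBL_ID
        JOIN DBS d ON t.DB_ID = d.DB_ID
        WHERE d.NAME = ?
        ORDER BY t.TBL_NAME, p.INTEGER_IDX
        """,
        (db_name,),
    ).fetchall()
    result = {}
    for table_name, key_name in rows:
        # A table with several partition columns has several PARTITION_KEYS rows.
        result.setdefault(table_name, []).append(key_name)
    return result


print(partitioned_tables(conn, "sales"))
```

Only tables that have at least one row in PARTITION_KEYS come back, so the unpartitioned customers table is left out of the result for sales.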

Related

Impala - Find what tables have a specific column

In Impala, is there a way to check which tables in the database contain a specific column name?
Something like:
select tablename, columnname
from dbc.columns
where databasename = 'mydatabasename'
and columnname like '%findthis%'
order by tablename
The above query works in a teradata environment, but throws an error in Impala.
Thanks,
Impala shares the metastore with Hive. Unlike traditional RDBMS, Hive metadata is stored in a separate database. In most cases it is in MySQL or Postgres. If you have access to the metastore database, you can run SELECT on table TBLS to get the details about the tables and COLUMNS_V2 to get the details about columns.
If you do not have access to the metastore, the only option is to describe each table to get the column names. If you have a lot of databases and tables, you could write a shell script to get the list of tables using "show tables" and loop around the tables to describe them using "desc tablename".
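The "show tables, then describe each one" loop from the last paragraph could look roughly like the sketch below. It is written in Python rather than shell, and it assumes a hypothetical run_query callable that takes a statement string and returns a list of row tuples (for example a thin wrapper around whatever client you use to reach Impala or Hive). The function name find_tables_with_column and the fake results used to exercise it are also made up for the example.

```python
import fnmatch


def find_tables_with_column(run_query, database, pattern):
    """Return sorted (table, column) pairs whose column name matches pattern.

    run_query is a hypothetical callable: statement string in, list of row
    tuples out. pattern is a shell-style wildcard such as '*findthis*'.
    """
    matches = set()  # a set, because Hive's describe can list partition columns twice
    for row in run_query(f"show tables in {database}"):
        table = row[0]
        for col in run_query(f"describe {database}.{table}"):
            column = str(col[0])
            if fnmatch.fnmatch(column.lower(), pattern.lower()):
                matches.add((table, column))
    return sorted(matches)


# Canned results standing in for a live connection, just to exercise the loop.
fake_results = {
    "show tables in mydb": [("orders",), ("customers",)],
    "describe mydb.orders": [("order_id", "bigint", ""), ("customer_id", "bigint", "")],
    "describe mydb.customers": [("customer_id", "bigint", ""), ("name", "string", "")],
}

found = find_tables_with_column(fake_results.__getitem__, "mydb", "*customer*")
print(found)
```

This issues one describe per table, so it will be slow on a large catalog; querying the metastore database directly, when you have access, avoids that.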

Hive -where are tables information stored

I am creating tables and inserting into them in Hive; the files are created on HDFS, and some on external S3 storage.
Assuming I have created 10 tables, is there any system table in Hive where I can find information about the tables created by the user? (For example, Teradata has DBC.TablesV, which holds information about all user-defined tables.)
You can find where your metastore is configured in the hive-site.xml file.
Its usual location is under /etc/hive/{$hadoop_version}/ or /etc/hive/conf/.
grep for "hive.metastore.uris" or "javax.jdo.option.ConnectionURL" to see which db you are using for the metastore. The credentials should also be there.
If, for example, your metastore is on a MySQL server, you can run queries like
SELECT * FROM TBLS;
SELECT * FROM PARTITIONS;
etc
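As an alternative to grepping, the hive-site.xml lookup described above can be sketched in a few lines of Python. The property names are the two mentioned in the answer; the sample XML, its host name and its ports are made-up values for illustration, and read_hive_properties is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def read_hive_properties(xml_text, wanted):
    """Return {name: value} for the requested property names in a hive-site.xml body."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name in wanted:
            found[name] = prop.findtext("value", default="").strip()
    return found


# Made-up sample; a real file would be read from e.g. /etc/hive/conf/hive-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/metastore</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>"""

props = read_hive_properties(
    SAMPLE, {"javax.jdo.option.ConnectionURL", "hive.metastore.uris"}
)
for key, value in sorted(props.items()):
    print(key, "=", value)
```

The JDBC URL tells you which database server and schema hold the metastore tables; that is where queries such as SELECT * FROM TBLS would be run.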
You can't query (as in SELECT ... FROM ...) the metadata from within Hive.
You do, however, have commands that display that information, e.g. show databases, show tables, desc MyTable, etc.
I'm not sure I fully understood your question. If you mean the information about the creation of the table, such as the query itself, the location on HDFS, table properties, etc., you can try:
SHOW CREATE TABLE <table>;
If you need to retrieve a list of the column names and data types, try:
DESCRIBE <table>;

What is correlation between HBase and HCatalog?

Can anyone explain the correlation between HCatalog and HBase, please?
I've found these definitions:
Apache HCatalog
HCatalog is a metadata abstraction layer for referencing data without using the underlying filenames or formats. It insulates users and scripts from how and where the data is physically stored.
Apache HBase
HBase (Hadoop DataBase) is a distributed, column oriented database. HBase uses HDFS for the underlying storage. It supports both batch style computations using MapReduce and point queries (random reads).
When we use CREATE TABLE in Hive, it creates the table in HCatalog. I just don't get it. Why not in the real database, which is HBase?
HCatalog seems to be some kind of metadata repository for all data stores. Does that mean it also keeps information about databases and tables in HBase?
I'll be grateful for an explanation.
Regards
Pawel
When you CREATE TABLE in Hive, it registers the table in HCatalog. A table in Hive may be an HBase table, but it can also be an abstraction over HDFS files and directories.
You can find a nice explanation of HCatalog on Hortonworks' site.
Because I've noticed the question is quite popular, I've decided to answer it myself, as I've come to understand it quite well since I asked it.
So, first of all, since Hadoop 2.0 HCatalog and Hive are treated as one product. Hive creates tables in HCatalog by default, which means that the natural interface for HCatalog is Hive. So you can use all SQL-92 DMLs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML) and DDLs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL), starting from create/alter/drop database, through create/alter/drop table, and ending with select, insert into, etc. The only exception is that insert works only as insert into ... select ... from.
For a typical insert we have to use:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Tables can have partitions and indexes (although in my experience indexes don't work well), but it is not a relational database, so you cannot use foreign keys.
HBase is quite different. It is one of the NoSQL databases (but, as answered in the previous post, Hive can be an HBase interface for SQL queries).
It has key -> value organized tables.
Let's compare a few commands (create table, insert into table, select from table, drop table):
Hive:
create table table_name (
id int,
value1 string,
value2 string
)
partitioned by (date string)
LOAD DATA INPATH 'filepath' INTO TABLE table_name [PARTITION (partcol1=val1, partcol2=val2 ...)]
INSERT INTO TABLE table_name SELECT * FROM othertable
SELECT * FROM table_name
DROP TABLE table_name
HBase:
hbase> create 'test', 'cf'
hbase> put 'test', 'row1', 'cf:a', 'value1'
hbase> get 'test', 'row1'
hbase> disable 'test'
hbase> drop 'test'
As you can see, the syntax is completely different. For SQL users, working with HCatalog is natural; those working with NoSQL databases will feel comfortable with HBase.

What does the hive metastore and name node do in a cluster?

In a cluster with Hive installed, what do the metastore and the NameNode hold? I understand that the metastore has all the table schema, partition details, and metadata. What is this metadata? What does the NameNode hold? And where does the metastore live in a cluster?
The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It also keeps track of all the DataNodes (dead and live) through a heartbeat mechanism. It also helps clients with reads and writes by receiving their requests and directing them to the appropriate DataNodes.
The metadata that the metastore stores includes things like:
IDs of databases
IDs of tables
IDs of indexes
The time of creation of an index
The time of creation of a table
IDs of roles assigned to a particular user
InputFormat used for a table
OutputFormat used for a table, etc.
Is this what you wanted to know?
And it is not mandatory to have the metastore in the cluster itself. Any machine (inside or outside the cluster) with a JDBC-compliant database can be used for the metastore.
HTH
P.S : You might find the E/R diagram of metastore useful.
Hive data (not metadata) is spread across Hadoop HDFS DataNode servers. Typically, each block of data is stored on 3 different DataNodes. The NameNode keeps track of which DataNodes have which blocks of actual data.
For a Hive production environment, the metastore service should run in an isolated JVM. Hive processes can communicate with the metastore service using Thrift. The Hive metastore data is persisted in an ACID database such as Oracle DB or MySQL. You can use SQL to find out what is in the Hive metastore:
Here are the tables in the Hive metastore:
SQL> select table_name from user_tables;
DBS
DATABASE_PARAMS
SEQUENCE_TABLE
SERDES
TBLS
SDS
CDS
BUCKETING_COLS
TABLE_PARAMS
PARTITION_KEYS
SORT_COLS
SD_PARAMS
COLUMNS_V2
SERDE_PARAMS
You can describe the structure of each table:
SQL> describe partition_keys;
TBL_ID NUMBER
PKEY_COMMENT VARCHAR2(4000)
PKEY_NAME VARCHAR2(128)
PKEY_TYPE VARCHAR2(767)
INTEGER_IDX NUMBER(10)
And find the contents of each table:
SQL> select * from partition_keys;
So if in Hive you run "CREATE TABLE xxx (...) PARTITIONED BY (...)", the partitioning information is stored in the metastore (Oracle, MySQL, ...) database.
For example, in Hive if you create a table like this:
hive> create table employee_table (id bigint, name string) partitioned by (region string);
You will find this in the metastore:
SQL> select tbl_id,pkey_name from partition_keys;
TBL_ID PKEY_NAME
------ ---------
8 region
SQL> select tbl_name from tbls where tbl_id=8;
TBL_NAME
--------
employee_table
When you insert data into employee_table, the data will be stored in HDFS on Hadoop DataNodes and the NameNode will keep track of which DataNodes have the data.
Metastore: it's a database which stores metadata, i.e. all the details about the tables you create in Hive. By default, Hive comes with and uses a Derby database, but you can use another database such as MySQL or Oracle.
Use of the metastore: whenever you fire a query from your Hive CLI, the execution engine gathers all the details regarding the table and creates an execution plan (job). These details come from the metastore. Finally, the execution engine sends the job to Hadoop. From there, an ordinary Hadoop MapReduce job is executed and the result is sent back to Hive. The NameNode communicates with the execution engine to successfully execute the MR job.
The above diagram is an excellent one for understanding how Hive and Hadoop communicate.
Regarding the Hive metastore (not a Hadoop metastore):
It is not compulsory to have a metastore in your Hadoop environment; it is only required if you are using Hive on top of your HDFS cluster.
The metastore is the metadata repository for Hive only, used by Hive to store meta information about created database objects (not the actual data, which is already in HDFS; Hive does not store data itself but uses data already stored in the file system).
A Hive installation requires a metastore service backed by an RDBMS.
Regarding the NameNode (Hadoop NameNode):
It is a core part of Hadoop, which behaves like a metastore for the cluster.
It is not an RDBMS. It stores file system meta information in the file system only.

Hive - How to see the table created in metastore?

Here is our setup -
We have Hive that uses MySQL on another machine as a metastore.
I can start the Hive command line shell and create a table and describe it.
But when I log on to the other machine, where MySQL is used as the metastore, I cannot see the Hive table details in MySQL.
e.g. Here are hive commands -
hive> create table student(name STRING, id INT);
OK
Time taken: 7.464 seconds
hive> describe student;
OK
name string
id int
Time taken: 0.408 seconds
hive>
Next, I log on to the machine where MySQL is installed and used as the Hive metastore. I use the "metastore" database. But when I list the tables, I cannot see the table or the table info I created in Hive.
How can I see the Hive table information in the metastore?
First, find which MySQL database the metastore is stored in. This will be in the connection URL (javax.jdo.option.ConnectionURL) in your hive-site.xml. Then, once you connect to MySQL, you can run:
use metastore;
show tables;
select * from TBLS; -- this will give you the list of your Hive tables
Another useful query, if you want to find which tables contain a particular column:
SELECT c.column_name, tbl_name, c.comment, c.type_name, c.integer_idx,
tbl_id, create_time, owner, retention, t.sd_id, tbl_type, input_format, is_compressed, location,
num_buckets, output_format, serde_id, s.cd_id
FROM TBLS t, SDS s, COLUMNS_V2 c
-- WHERE tbl_name = 'my_table'
WHERE t.SD_ID = s.SD_ID
AND s.cd_id = c.cd_id
AND c.column_name = 'my_col'
order by create_time
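The TBLS -> SDS -> COLUMNS_V2 join above can be tried on throwaway data. The sketch below is an illustration only: SQLite stands in for the MySQL metastore, the three tables are reduced to the columns the join needs, and the rows (student, course, the HDFS paths, the timestamps) are invented. The helper name tables_with_column is likewise an assumption.

```python
import sqlite3

# Cut-down copies of three metastore tables, held in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, TBL_NAME TEXT, SD_ID INTEGER, CREATE_TIME INTEGER);
CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, CD_ID INTEGER, LOCATION TEXT);
CREATE TABLE COLUMNS_V2 (CD_ID INTEGER, COLUMN_NAME TEXT, TYPE_NAME TEXT, INTEGER_IDX INTEGER);
INSERT INTO TBLS VALUES (1, 'student', 100, 1700000000), (2, 'course', 101, 1700000500);
INSERT INTO SDS VALUES (100, 500, 'hdfs:///warehouse/student'),
                       (101, 501, 'hdfs:///warehouse/course');
INSERT INTO COLUMNS_V2 VALUES (500, 'name', 'string', 0), (500, 'id', 'int', 1),
                              (501, 'id', 'int', 0), (501, 'title', 'string', 1);
""")


def tables_with_column(conn, column_name):
    """Return (table name, column type) for every table that has the column."""
    return conn.execute(
        """
        SELECT t.TBL_NAME, c.TYPE_NAME
        FROM TBLS t
        JOIN SDS s ON t.SD_ID = s.SD_ID      -- table -> storage descriptor
        JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID  -- storage descriptor -> columns
        WHERE c.COLUMN_NAME = ?
        ORDER BY t.CREATE_TIME
        """,
        (column_name,),
    ).fetchall()


print(tables_with_column(conn, "id"))
print(tables_with_column(conn, "title"))
```

The column name is passed as a bound parameter rather than pasted into the SQL string, which is worth keeping if the search term ever comes from user input.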
You can query the metastore schema in your MySQL database. Something like:
mysql> select * from TBLS;
More details on how to configure a MySQL metastore to store metadata for Hive and verify and see the stored metadata here.
When setting up Hadoop services (or any other services), admins in most cases use a relational database to store the metadata of services such as Hive and Oozie.
So find which database (MySQL, PostgreSQL, SQL Server, etc.) backs your Hive, and you can see the metadata in the TBLS table.
When upgrading Hive, you have to take a backup of these tables.