I have a table named "HISTORY" in HBase having column family "VDS" and the column names ROWKEY, ID, START_TIME, END_TIME, VALUE. I am using Cloudera Hadoop Distribution. I want to provide SQL interface to HBase table using Impala. In order to do this we have to create respective External Table in Hive? So how to create external hive table pointing to this HBase table?
Run the following code in Hive Query Editor:
CREATE EXTERNAL TABLE IF NOT EXISTS HISTORY
(
ROWKEY STRING,
ID STRING,
START_TIME STRING,
END_TIME STRING,
VALUE DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
(
"hbase.columns.mapping" = ":key,VDS:ID,VDS:START_TIME,VDS:END_TIME,VDS:VALUE"
)
TBLPROPERTIES("hbase.table.name" = "HISTORY");
Don't forget to Refresh Impala Metadata after External Table Creation with the following bash command:
echo "INVALIDATE METADATA" | impala-shell;
Related
I have created a table using partition. I tried two ways for my s3 bucket folder as following but both ways I get no records found when I query with where clause containing partition clause.
My S3 bucket looks like following. part*.csv is what I want to query in Athena. There are other folders at same location along side output, within output.
s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/part-filename121231.csv
s3://bucket-rootname/XYZ-CASE/report/678d1234-2c3a-481b-a1eb-5169d2a97747/output/part-filename213123.csv
my table looks like following
Version 1:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",id="f78dea49-2c3a-481b-a1eb-5169d2a97747") location "s3://bucket-rootname/casename=ABC-CASE/report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
select * from mytable_trial1 where casename='ABC-CASE' and report='report' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747' and foldername='output';
Version 2:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`report` string,
`id` string,
`foldername` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/casename=ABC-CASE/report=report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/foldername=output/";
select * from mytable_trial1 where casename='ABC-CASE' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747'
Show partitions shows this partition but no records found with where clause.
I worked with the AWS Support and we were able to narrow down the issue. Version 2 was right one to use since it has four partitions like my S3 bucket. Also, the Alter table command had issue with location. I used hive format location which was incorrect since my actual S3 location is not hive format. So correcting the command to following worked for me.
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
Preview table now shows my entries.
Let's imagine I store one file per day in a format:
/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv
How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to 2016-08-03)?
Assuming every files follow the same schema, I would then suggest that you store the files with the following naming convention :
/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv
You could then create an external table partitioned by dt and pointing to the location /path/to/files/
CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'
If you have several partitions and don't want to write alter table yourtable add partition ... queries for each one, you can simply use the repair command that will automatically add partitions.
msck repair table yourtable
You can then simply select data within a date range by specifying the partition range
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
Without moving your file:
Design your table schema. In hive shell, create the table (partitioned by date)
Loading files into tables
Query with HiveQL ( select * from table where dt between '2016-06-04 ' and '2016-08-03')
Moving your file:
Design your table schema. In hive shell, create the table (partitioned by date)
move /path/to/files/2016/07/31.csv under /dbname.db/tableName/dt=2016-07-31, then you'll have
/dbname.db/tableName/dt=2016-07-31/file1.csv
/dbname.db/tableName/dt=2016-08-01/file1.csv
/dbname.db/tableName/dt=2016-08-02/file1.csv
load partition with
alter table tableName add partition (dt=2016-07-31);
See Add partitions
In Spark-shell, read hive table
/path/to/data/user_info/dt=2016-07-31/0000-0
1.create sql
val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"
2. run it
spark.sql(sql)
3.load data
val rlt= spark.sql("alter table user_info add partition (dt=2016-09-21)")
4.now you can select data from table
val df = spark.sql("select * from user_info")
I have a map reduce job, that already writes out record to hdfs using hive partition naming convention.
eg
/user/test/generated/code=1/channel=A
/user/test/generated/code=1/channel=B
After I create an external table, it does not see the partition.
create external table test_1 ( id string, name string ) partitioned by
(code string, channel string) STORED AS PARQUET LOCATION
'/user/test/generated'
Even with the alter command
alter table test_1 ADD PARTITION (code = '1', channel = 'A')
, it does not see the partition or record,
because
select * from test_1 limit 1 produces 0 result.
If I use empty location when I create external table, and then use
load data inpath ...
then it works. But the issue is there is too many partitions for the load data inpath to work.
Is there a way to make hive recognize the partition automatically (without doing insert query)?
Using msck, it seems to be working. But I had to exit the hive session, and connect again.
MSCK REPAIR TABLE test_1
I use flume to write data to hdfs,path like /hive/logs/dt=20151002.Then,i use hive to select data,but the count of response is always 0.
Here is my create table sql,CREATE EXTERNAL TABLE IF NOT EXISTS test (id STRING) partitioned by (dt string) ROW FORMAT DELIMITED fields terminated by '\t' lines terminated by '\n' STORED AS TEXTFILE LOCATION '/hive/logs'
Here is my select sql,select count(*) from test
It seems that you are not registering partition in hive meta-store.
Although partition is present in hdfs path,Hive won't know it if its not registered in meta store. To register it you can do the following:
ALTER TABLE test ADD PARTITION (dt='20151002') location '/hive/logs/dt=20151002';
I have a External Hive table with 4 partitions. I also have 4 hive views based on a different Hive table.
Every week I want the hive view to overwrite the partitions in the External Hive table.
I know I can create an unpartitioned hive table from a view like show below
CREATE TABLE hive_table AS SELECT * FROM hive_view;
But is there a way to overwrite partitions from view data?
Yes, there is a way:
INSERT OVERWRITE TABLE <table_name>
PARTITION(<partition_clause>)
SELECT <select_clause>
It is required to set hive.exec.dynamic.partition to true before such operations. See details here: Hive Language Manual DML - Dynamic Partitions
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--partition table
create external table pracitse_part (
id int,
first_name string,
last_name string,
email string,
ip_address string
)
partitioned by (gender string)
row format delimited
fields terminated by ',';
--create veiw table
create view practise_view as
select p.*
from practise p join practise_temp pt
on p.id=pt.id
where p.id < 11;
--load data into partition table from view table
insert overwrite table pracitse_part partition(gender)
select id,first_name,last_name,email,ip_address,gender from practise_view;