Is it possible to create n number of external tables are pointing to a single hdfs path using Hive. If yes what are the advantages and its limitations.
It is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS.
Creating tables with exactly the same schema on top of the same data is not useful at all, but you can create different tables with different number of columns for example or with differently parsed columns using RegexSerDe for example, so you can have different schemas in these tables. And you can have different permissions on these tables in Hive. Also table can be created on top of the sub-folder of some other tables folder, in this case it will contain a sub-set of data. Better use partitions in single table for the same.
And the drawback is that it is confusing because you can rewrite the same data using more than one table and also you may drop it accidentally, thinking this data belongs to the only table and you can drop data because you do not need that table any more.
And this is few tests:
Create table with INT column:
create table T(id int);
OK
Time taken: 1.033 seconds
Check location and other properties:
hive> describe formatted T;
OK
# col_name data_type comment
id int
# Detailed Table Information
Database: my
Owner: myuser
CreateTime: Fri Jan 04 04:45:03 PST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://myhdp/user/hive/warehouse/my.db/t
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1546605903
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.134 seconds, Fetched: 26 row(s)
sts)
Create second table on top of the same location but with STRING column:
hive> create table T2(id string) location 'hdfs://myhdp/user/hive/warehouse/my.db/t';
OK
Time taken: 0.029 seconds
Insert data:
hive> insert into table T values(1);
OK
Time taken: 33.266 seconds
Check data:
hive> select * from T;
OK
1
Time taken: 3.314 seconds, Fetched: 1 row(s)
Insert into second table:
hive> insert into table T2 values( 'A');
OK
Time taken: 23.959 seconds
Check data:
hive> select * from T2;
OK
1
A
Time taken: 0.073 seconds, Fetched: 2 row(s)
Select from first table:
hive> select * from T;
OK
1
NULL
Time taken: 0.079 seconds, Fetched: 2 row(s)
String was selected as NULL because this table is defined as having INT column.
And now insert STRING into first table (INT column):
insert into table T values( 'A');
OK
Time taken: 84.336 seconds
Surprise, it is not failing!
What was inserted?
hive> select * from T2;
OK
1
A
NULL
Time taken: 0.067 seconds, Fetched: 3 row(s)
NULL was inserted, because during previous insert string was converted to int and this resulted in NULL
Now let's try to drop one table and select from another one:
hive> drop table T;
OK
Time taken: 4.996 seconds
hive> select * from T2;
OK
Time taken: 6.978 seconds
Returned 0 rows because first table was MANAGED and drop table also removed common location.
THE END,
data is removed, do We need T2 table without data in it?
drop table T2;
OK
Second table is removed, you see, it was metadata only. The table was also managed and drop table should remove the location with data also, but it's already nothing to remove in HDFS, only metadata was removed.
I have a hive table 'videotracking_playevent' which uses the following partition format (all strings): source/createyear/createmonth/createday.
Example: source=home/createyear=2016/createmonth=9/createday=1
I'm trying to update the partition values of createmonth and createday to consistently use double digits instead.
Example: source=home/createyear=2016/createmonth=09/createday=01
I've tried to the following query:
ALTER TABLE videotracking_playevent PARTITION (
source='home',
createyear='2015',
createmonth='11',
createday='1'
) RENAME TO PARTITION (
source='home',
createyear='2015',
createmonth='11',
createday='01'
);
However that returns the following, non-descriptive error from hive: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null
I've confirmed that this partition exists, and I think I'm using the correct syntax. My hive version is Hive 1.1.0
Any ideas what I might be doing wrong?
There was an issue with old version of Hive with renaming partition. This might be an issue for your case too. Please see this link for detail.
You need to set below two property before executing the rename partition command if you are using Older version of Hive.
set fs.hdfs.impl.disable.cache=false;
set fs.file.impl.disable.cache=false;
Now run the query by setting this property.
hive> set fs.hdfs.impl.disable.cache=false;
hive> set fs.file.impl.disable.cache=false;
hive> ALTER TABLE partition_test PARTITION (year='2016',day='1') RENAME TO PARTITION (year='2016',day='01');
OK
Time taken: 0.28 seconds
hive> show partitions partition_test;
OK
year=2016/day=01
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive>
This issue is fixed in Hive latest version. In my case Hive version is 1.2.1 and it works, without setting that property. Please see the example below.
Create a partitioned table.
hive> create table partition_test(
> name string,
> age int)
> partitioned by (year string, day string);
OK
Time taken: 5.35 seconds
hive>
Now add the partition and check the newly added partition.
hive> alter table partition_test ADD PARTITION (year='2016', day='1');
OK
Time taken: 0.137 seconds
hive>
hive> show partitions partition_test;
OK
year=2016/day=1
Time taken: 0.169 seconds, Fetched: 1 row(s)
hive>
Rename the partition using RENAME TO PARTITION command and check it.
hive> ALTER TABLE partition_test PARTITION (year='2016',day='1') RENAME TO PARTITION (year='2016',day='01');
OK
Time taken: 0.28 seconds
hive> show partitions partition_test;
OK
year=2016/day=01
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive>
Hope it helps you.
Rename lets you change the value of a partition column. One of use cases is that you can use this statement to normalize your legacy partition column value to conform to its type. In this case, the type conversion and normalization are not enabled for the column values in old partition_spec even with property hive.typecheck.on.insert set to true (default) which allows you to specify any legacy data in form of string in the old partition_spec"
Bug open
https://issues.apache.org/jira/browse/HIVE-10362
You can create a copy of the table without partition, then update the column of the table, and then recreate the first one with partition
create table table_name partitioned by (table_column) as
select
*
from
source_table
That worked for me.
I created a table named table1 in hive and i need to insert data from table2 into table1. I used the below statemnt to get the output.
Also i need to add a new column with some constant value -- colx = 'colval' along with the columns in table2 but am not sure how to add it.. Thanks!
INSERT INTO TABLE table1 select * FROM table2;
If you are willing to drop table1 and recreate it from scratch, you could do this:
-- I'm using Hive 0.13.0
DROP TABLE IF EXISTS table1;
CREATE TABLE table1 AS SELECT *, 'colval' AS colx FROM TABLE2;
If that is not an option for some reason, you can use INSERT OVERWRITE:
ALTER TABLE table1 ADD COLUMNS (colx STRING); -- Assuming you haven't created the column already
INSERT OVERWRITE TABLE table1 SELECT *, 'colval' FROM table2;
I am new to hive, and want to know if there is anyway to insert data into Hive table like we do in SQL. I want to insert my data into hive like
INSERT INTO tablename VALUES (value1,value2..)
I have read that you can load the data from a file to hive table or you can import data from one table to hive table but is there any way to append the data as in SQL?
Some of the answers here are out of date as of Hive 0.14
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingvaluesintotablesfromSQL
It is now possible to insert using syntax such as:
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
You can use the table generating function stack to insert literal values into a table.
First you need a dummy table which contains only one line. You can generate it with the help of limit.
CREATE TABLE one AS
SELECT 1 AS one
FROM any_table_in_your_database
LIMIT 1;
Now you can create a new table with literal values like this:
CREATE TABLE my_table AS
SELECT stack(3
, "row1", 1
, "row2", 2
, "row3", 3
) AS (column1, column2)
FROM one
;
The first argument of stack is the number of rows you are generating.
You can also add values to an existing table:
INSERT INTO TABLE my_table
SELECT stack(2
, "row4", 1
, "row5", 2
) AS (column1, column2)
FROM one
;
Slightly better version of the unique2 suggestion is below:
insert overwrite table target_table
select * from
(
select stack(
3, # generating new table with 3 records
'John', 80, # record_1
'Bill', 61 # record_2
'Martha', 101 # record_3
)
) s;
Which does not require the hack with using an already exiting table.
You can use below approach. With this, You don't need to create temp table OR txt/csv file for further select and load respectively.
INSERT INTO TABLE tablename SELECT value1,value2 FROM tempTable_with_atleast_one_records LIMIT 1.
Where tempTable_with_atleast_one_records is any table with atleast one record.
But problem with this approach is that If you have INSERT statement which inserts multiple rows like below one.
INSERT INTO yourTable values (1 , 'value1') , (2 , 'value2') , (3 , 'value3') ;
Then, You need to have separate INSERT hive statement for each rows. See below.
INSERT INTO TABLE yourTable SELECT 1 , 'value1' FROM tempTable_with_atleast_one_records LIMIT 1;
INSERT INTO TABLE yourTable SELECT 2 , 'value2' FROM tempTable_with_atleast_one_records LIMIT 1;
INSERT INTO TABLE yourTable SELECT 3 , 'value3' FROM tempTable_with_atleast_one_records LIMIT 1;
No. This INSERT INTO tablename VALUES (x,y,z) syntax is currently not supported in Hive.
You could definitely append data into an existing table. (But it is actually not an append at the HDFS level). It's just that whenever you do a LOAD or INSERT operation on an existing Hive table without OVERWRITE clause the new data will be put without replacing the old data. A new file will be created for this newly inserted data inside the directory corresponding to that table. For example :
I have a file named demo.txt which has 2 lines :
ABC
XYZ
Create a table and load this file into it
hive> create table demo(foo string);
hive> load data inpath '/demo.txt' into table demo;
Now,if I do a SELECT on this table it'll give me :
hive> select * from demo;
OK
ABC
XYZ
Suppose, I have one more file named demo2.txt which has :
PQR
And I do a LOAD again on this table without using overwrite,
hive> load data inpath '/demo2.txt' into table demo;
Now, if I do a SELECT now, it'll give me,
hive> select * from demo;
OK
ABC
XYZ
PQR
HTH
Ways to insert data into Hive table:
for demonstration, I am using table name as table1 and table2
create table table2 as select * from table1 where 1=1;
or
create table table2 as select * from table1;
insert overwrite table table2 select * from table1;
--it will insert data from one to another. Note: It will refresh the target.
insert into table table2 select * from table1;
--it will insert data from one to another. Note: It will append into the target.
load data local inpath 'local_path' overwrite into table table1;
--it will load data from local into the target table and also refresh the target table.
load data inpath 'hdfs_path' overwrite into table table1;
--it will load data from hdfs location iand also refresh the target table.
or
create table table2(
col1 string,
col2 string,
col3 string)
row format delimited fields terminated by ','
location 'hdfs_location';
load data local inpath 'local_path' into table table1;
--it will load data from local and also append into the target table.
load data inpath 'hdfs_path' into table table1;
--it will load data from hdfs location and also append into the target table.
insert into table2 values('aa','bb','cc');
--Lets say table2 have 3 columns only.
Multiple insertion into hive table
Yes you can insert but not as similar to SQL.
In SQL we can insert the row level data, but here you can insert by fields (columns).
During this you have to make sure target table and the query should have same datatype and same number of columns.
eg:
CREATE TABLE test(stu_name STRING,stu_id INT,stu_marks INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
INSERT OVERWRITE TABLE test SELECT lang_name, lang_id, lang_legacy_id FROM export_table;
To insert entire data of table2 in table1. Below is a query:
INSERT INTO TABLE table1 SELECT * FROM table2;
You can't do insert into to insert single record. It's not supported by Hive. You may place all new records that you want to insert in a file and load that file into a temp table in Hive. Then using insert overwrite..select command insert those rows into a new partition of your main Hive table. The constraint here is your main table will have to be pre partitioned. If you don't use partition then your whole table will be replaced with these new records.
Enter the following command to insert data into the testlog table with some condition:
INSERT INTO TABLE testlog SELECT * FROM table1 WHERE some condition;
I think in such scenarios you should be using HBASE which facilitates such kind of insertion but it does not provide any SQL kind of query language. You need you use Java API of HBASE like the put method to do such kind of insertion. Moreover HBASE is column oriented no-sql database.
You still can insert into complex type in Hive - it works
(id is int, colleagues array)
insert into emp (id,colleagues) select 11, array('Alex','Jian') from (select '1')
you can add values to specific columns as well, just specify the column names in which you like to add corresponding values:
Insert into Table (Col1, Col2, Col4,col5,Col7) Values ('Va11','Va2','Val4','Val5','Val7');
Make sure the columns you skip dont have not null value type.
There are few properties to set to make a Hive table support ACID properties and to insert the values into tables as like in SQL .
Conditions to create a ACID table in Hive.
The table should be stored as ORC file. Only ORC format can support ACID prpoperties for now.
The table must be bucketed
Properties to set to create ACID table:
set hive.support.concurrency =true;
set hive.enforce.bucketing =true;
set hive.exec.dynamic.partition.mode =nonstrict
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads= 1;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set the property hive.in.test to true in hive.site.xml
After setting all these properties , the table should be created with tblproperty 'transactional' ='true'. The table should be bucketed and saved as orc
CREATE TABLE table_name (col1 int,col2 string, col3 int) CLUSTERED BY col1 INTO 4
BUCKETS STORED AS orc tblproperties('transactional' ='true');
Now its possible to inserte values into the table like SQL query.
INSERT INTO TABLE table_name VALUES (1,'a',100),(2,'b',200),(3,'c',300);
Yes we can use Insert query in Hive.
hive> create table test (id int, name string);
INSERT: INSERT...VALUES is available starting in version 0.14.
hive> insert into table test values (1,'mytest');
This is going to work for insert. We have to use values keyword.
Note: User cannot insert data into a complex datatype column (array, map, struct, union) using the INSERT INTO...VALUES clause.