I want to do the following thing in Hive: I create an external table stored as a Textfile and I convert this table in an ORC table (with the usual way: first create an empty ORC table, and second load the data from the original one).
For my TextFile table, my data is located in HDFS in a directory, say /user/MY_DATA/.
So when I add/drop files from MY_DATA, my TextFile table is automatically updated. Now I would like the ORC table to be automatically updated too. Do you know if this is possible?
Thank you!
No, there is no straight forward way for this, u need to add the new data in the ORC table as u did for the first load, or u can create a new orc
CREATE TABLE orc_emp STORED AS ORC AS SELECT * FROM employees.emp;
table and drop the old orc table.
Related
Description
I have a managed partitioned Hive table table_a with data stored in Amazon S3 in parquet format. I renamed column col_old to col_new. And, I lost all the data of col_old because of the way parquet file works.
Question
Is there any way to recover values of col_old? (I still have the old parquet data files.)
Here are few things I tried:
Created a new table with old files and renamed col_new to col_old.
Created a new table with old files and added col_old.
creae table with new column and do
insert into new table as select * from old_table
I am new to Hive. Could you please let me know answer for below question?
Why do we need base table while loading the data in ORC?
Can't we directly create table as ORC and load data in it?
1. Why do we need base table while loading the data in ORC?
We need of the base table, because most of the time we get the data file in text file format, i.e. CSV, TXT, DAT or any other delimiter that we can open the file and see the content. But the file Format ORC maintain in a different way by using their algorithm to optimized the Row and Column.
Hence we need of a base table, so, Actually what happened in that case. We create a table with the textFile format and select the data over their and write it into ORC table.
2. Can't we directly create table as ORC and load data in it?
Yes, you can load the data into ORC file directly.
To understand more about ORC, you can refer to https://orc.apache.org/docs/
Usually if you don't define file format , for hive it is textfile by default.
Need of base table arises because when you create a hive table with orc format and then trying to load data using command:
load data in path '' ..
it simply moves data from one location to another.
hive orc table won't understand textfile. that's when serde comes into picture. you define serde while creating table.
so when a operation like :
1. select * (read)
2. insert into (write)
serde will serialize and desiarlize various format to orc and map data to hive columns.
I am new to HDFS and HIVE. I got some introduction of both after reading some books and documentation. I have a question regarding creation of a table in HIVE for which file is present in HDFS.
I have this file with 300 fields in HDFS. I want to create a table accessing this file in HDFS. But I want to make use of say 30 fields from this file.
My questions are
1. Does hive create a separate file directory?
2. Do I have to create hive table first and import data from HDFS?
3. Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
4. Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
My questions are
Does hive create a separate file directory?
YES if you create a hive table (managed/external) and load the data using load command.
NO if you create an external table and point to the existing file.
Do I have to create hive table first and import data from HDFS?
Not Necessarily you can create a hive external table and point to this existing file.
Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
You can do it easily using hiveQL. follow the below steps (note: this is not the only approach):
create a external table with 300 column and point to the existing
file.
create another hive table with desired 30 columns and insert data to this new table from 300 column table using "insert into
table30col select ... from table300col". Note: hive will create the
file with 30 columns during this insert operation.
Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
Yes this can be an alternative.
I personally like solution mentioned in question 3 as I don't have to recreate the file and I can do all of that in hadoop without depending on some other system.
You have several options. One is to have Hive simply point to the existing file, i.e. create an external HIVE table:
CREATE EXTERNAL TABLE ... LOCATION '<your existing hdfs file>';
This table in Hive will, obviously, match exactly your existing table. You must declare all 300 columns. There will be no data duplication, there is only one one file, Hive simply references the already existing file.
A second option would be to either IMPORT or LOAD the data into a Hive table. This would copy the data into a Hive table and let Hive control the location. But is important to understand that neither IMPORT nor LOAD do not transform the data, so the result table will have exactly the same structure layout and storage as your original table.
Another option, which I would recommend, is to create a specific Hive table and then import the data into it, using a tool like Sqoop or going through an intermediate staging table created by one of the methods above (preferably external reference to avoid an extra copy). Create the desired table, create the external reference staging table, insert the data into the target using INSERT ... SELECT, then drop the staging table. I recommend this because it lets you control not only the table structure/schema (ie. have only the needed 30 columns) but also, importantly, the storage. Hive has a highly columnar performant storage format, namely ORC, and you should thrive to use this storage format because will give you tremendous query performance boost.
I am now preparing to store data in .csv files into hive. Of course, because of the good performance of parquet file format, the hive table should is parquet format. So, the normal way, is to create a temp table whose format is textfile, then I load local CSV file data into this temp table, and finally, create a same-structure parquet table and use sql insert into parquet_table values (select * from textfile_table);.
But I don't think this temp textfile table is necessary. So, my question is, is there a way for me to load these local .csv files into hive parquet-format table directly, namely, not to resort the a temp table? Or a easier way to accomplish this task?
As stated in Hive documentation:
NO verification of data against the schema is performed by the load command.
If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
You could skip a step by using CREATE TABLE AS SELECT for the parquet table.
So you'll have 3 steps:
Create text table defining the schema
Load data into text table (move the file into the new table)
CREATE TABLE parquet_table AS SELECT * FROM textfile_table STORED AS PARQUET; supported from hive 0.13
In a database, I have 50+ tables, I was wondering is there any way to copy these tables into second database at one shot?
I have used this, but running this 50+ times isn't efficient.
create table database2.table1 as select * from database1.table1;
Thanks!
Copying data from one database table to another database table in Hive is like copying data file from existing location in HDFS to new location in HDFS.
The best way of copying data from one Database table to another Database table would be to create external Hive table in new Database and put the location value as for e.g. LOCATION '/user/hive/external/ and copy the file of older table data using distcp to from the old HDFS location to new one.
Example: Existing table in older Database:
CREATE TABLE stations( number STRING, latitude INT, longitude INT, elevation INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH "/home/cloudera/Desktop/Stations.csv"
Now you create external table in new Database:
CREATE EXTERNAL TABLE external_stations( number STRING, latitude INT, longitude INT, elevation INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hive/external/';
Now you just copy the data file from /user/hive/warehouse/training.db/stations/ to /user/hive/external/ using distcp command. These two paths are specific to my hive locations. You can have similarly in yours.
In this way you can copy table data of any number.
One approach would be to create your table structures in the new database and use distcp to copy data from the old HDFS location to new one.