Copy tables in Hive from one database to another database - sql

In one database I have 50+ tables. Is there any way to copy these tables into a second database in one shot?
I have used the statement below, but running it 50+ times isn't efficient.
create table database2.table1 as select * from database1.table1;
Thanks!

Copying data from one database table to another in Hive is essentially like copying a data file from its existing location in HDFS to a new location in HDFS.
The best way to copy data from one database table to another database table is to create an external Hive table in the new database with a location value such as LOCATION '/user/hive/external/', and then copy the old table's data file from the old HDFS location to the new one using distcp.
Example. Existing table in the old database:
CREATE TABLE stations( number STRING, latitude INT, longitude INT, elevation INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH "/home/cloudera/Desktop/Stations.csv" INTO TABLE stations;
Now create an external table in the new database:
CREATE EXTERNAL TABLE external_stations( number STRING, latitude INT, longitude INT, elevation INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hive/external/';
Now just copy the data file from /user/hive/warehouse/training.db/stations/ to /user/hive/external/ using the distcp command. These two paths are specific to my Hive locations; yours will be similar.
In this way you can copy the data of any number of tables.
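For reference, the copy step with distcp could look like the line below (a sketch reusing the example paths above; adjust them to your own warehouse and external locations):
hadoop distcp /user/hive/warehouse/training.db/stations /user/hive/external/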

One approach would be to create your table structures in the new database and use distcp to copy data from the old HDFS location to the new one.
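If you want to keep the CREATE TABLE ... AS SELECT statement from the question and just avoid typing it 50+ times, a small shell loop can generate it for every table. A rough sketch, assuming the hive CLI is on the path and database2 already exists:
# list the tables in the source database, then run a CTAS for each one
for t in $(hive -S -e "SHOW TABLES IN database1;"); do
  hive -S -e "CREATE TABLE database2.${t} AS SELECT * FROM database1.${t};"
done
Note that CREATE TABLE ... AS SELECT does not carry over partitioning or table properties, so partitioned tables need extra handling.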

Related

Automatic updating of an ORC table

I want to do the following in Hive: I create an external table stored as a TextFile and then convert this table into an ORC table (in the usual way: first create an empty ORC table, and second load the data from the original one).
For my TextFile table, my data is located in HDFS in a directory, say /user/MY_DATA/.
So when I add/drop files from MY_DATA, my TextFile table is automatically updated. Now I would like the ORC table to be automatically updated too. Do you know if this is possible?
Thank you!
No, there is no straightforward way to do this. You need to add the new data to the ORC table as you did for the first load, or you can create a new ORC table and drop the old one:
CREATE TABLE orc_emp STORED AS ORC AS SELECT * FROM employees.emp;
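If you would rather refresh the existing ORC table instead of recreating it, a minimal sketch (reusing the table names from the statement above; INSERT OVERWRITE replaces the previous contents):
INSERT OVERWRITE TABLE orc_emp SELECT * FROM employees.emp;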

Load local csv file to hive parquet table directly, not resort to a temp textfile table

I am preparing to store data from .csv files in Hive. Because of the good performance of the Parquet file format, the Hive table should be in Parquet format. So the normal way is to create a temp table whose format is textfile, load the local CSV file data into this temp table, and finally create a Parquet table with the same structure and run insert into parquet_table select * from textfile_table;.
But I don't think this temp textfile table is necessary. So my question is: is there a way to load these local .csv files into a Hive Parquet-format table directly, without resorting to a temp table? Or an easier way to accomplish this task?
As stated in the Hive documentation:
NO verification of data against the schema is performed by the load command.
If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
You could skip a step by using CREATE TABLE AS SELECT for the parquet table.
So you'll have 3 steps:
Create text table defining the schema
Load data into text table (move the file into the new table)
CREATE TABLE parquet_table STORED AS PARQUET AS SELECT * FROM textfile_table; (supported from Hive 0.13; see the combined sketch below)
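Put together, the three steps might look like this (the column names, delimiter, and local path are placeholders for illustration):
-- 1) text staging table matching the CSV layout
CREATE TABLE textfile_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- 2) move the local CSV file into the staging table
LOAD DATA LOCAL INPATH '/path/to/file.csv' INTO TABLE textfile_table;
-- 3) create the Parquet table directly from the staging table (Hive 0.13+)
CREATE TABLE parquet_table STORED AS PARQUET AS SELECT * FROM textfile_table;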

Hive: source data gets moved to hive datawarehouse even when table is external

In Hue --> Hive Query Browser I created an external table in Hive and loaded data from one of my CSV files into it using the following statements:
CREATE EXTERNAL TABLE movies(movieId BIGINT, title VARCHAR(100), genres VARCHAR(100)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
LOAD DATA INPATH '/user/admin/movie_data/movies' INTO TABLE movies;
I see that the source file "movies" disappears from HDFS and moves to the Hive data warehouse. I was under the impression that an external table acts only as a link to the original source data.
Shouldn't the external table be independent of the source data, meaning that if I were to drop the table the source file would still persist? How do I create such an external table?
An external table stores data in the HDFS location specified when you create the table. If you don't provide a location while creating the table, it defaults to the warehouse HDFS folder.
Try running "use mydatabase_name; show create table mytable_name;" to get the table definition and see which location it points to.
If you need an HDFS location other than the default one, you need to specify it while creating the table. Refer to the query below:
create external table test (col1 string) location '/data/database/tablename';
Secondly, LOAD DATA INPATH does not leave the data at its source path: it moves the file from the INPATH into your table's HDFS location, which is why the source file disappears from its original directory.
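For the movies example from the question, one way to keep the source data in place is to point the external table's LOCATION at the existing directory instead of loading the file into it (a sketch reusing the DDL and path from the question; with this, no LOAD DATA step is needed and DROP TABLE leaves the files where they are):
CREATE EXTERNAL TABLE movies(movieId BIGINT, title VARCHAR(100), genres VARCHAR(100))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/admin/movie_data/';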

Difference between `load data inpath` and `location` in hive?

At my firm, I see these two commands used frequently, and I'd like to be aware of the differences, because their functionality seems the same to me:
1)
create table <mytable>
(name string,
number double);
load data inpath '/directory-path/file.csv' into table <mytable>;
2)
create table <mytable>
(name string,
number double)
location '/directory-path/file.csv';
They both copy the data from the directory on HDFS into the directory for the table in Hive. Are there differences that one should be aware of when using these? Thank you.
Yes, they are used for entirely different purposes.
The load data inpath command is used to load data into a Hive table. 'LOCAL' signifies that the input file is on the local file system; if 'LOCAL' is omitted, Hive looks for the file in HDFS.
load data inpath '/directory-path/file.csv' into table <mytable>;
load data local inpath '/local-directory-path/file.csv' into table <mytable>;
The LOCATION keyword lets the table point to any HDFS location for its storage, rather than being stored in the folder specified by the configuration property hive.metastore.warehouse.dir.
In other words, with LOCATION '/your-path/' specified, Hive does not use the default location for this table. This comes in handy if you already have data generated.
Remember, LOCATION is usually specified on EXTERNAL tables; for regular tables, the default warehouse location is used.
To summarize:
load data inpath tells Hive where to look for the input files, and the LOCATION keyword tells Hive where to keep the table's data files on HDFS.
References:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Option 1: Internal table
create table <mytable>
(name string,
number double);
load data inpath '/directory-path/file.csv' into table <mytable>;
This command removes the content from the source directory (moving it into the table's warehouse directory) and creates an internal table.
Option 2: External table
create external table <mytable>
(name string,
number double)
location '/directory-path/file.csv';
This creates an external table that points at the data in place. The data won't be moved from the source; you can drop the external table and the source data is still available.
When you drop an external table, it only drops the metadata of the Hive table. The data still exists at the HDFS file location.
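A quick way to check this behaviour, sketched with the table and path from the option above:
-- dropping the external table removes only the Hive metadata
DROP TABLE <mytable>;
-- the files under '/directory-path/' remain and can still be listed (e.g. with hdfs dfs -ls /directory-path/)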
Have a look at this related SE question regarding use cases for both internal and external tables:
Difference between Hive internal tables and external tables?

How do I dump an entire impala database

Is there a way to dump all the schema / data of an impala database so I can recreate in a new database instance?
Something akin to what mysqldump does?
Yes,
you can take all the data from the Impala warehouse (usually /user/hive/warehouse),
use distcp to copy it from one cluster to the other cluster at the same location,
and fire SHOW CREATE TABLE to get the schema of each table, changing only the location to the destination location.
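Scripted, that could look roughly like the sketch below (the database name mydb and the destination namenode are placeholders; the generated DDL would still need its LOCATION clauses edited before running it against the new instance):
# dump the CREATE TABLE statement for every table in the source database
for t in $(impala-shell -B -q "SHOW TABLES IN mydb" 2>/dev/null); do
  impala-shell -B -q "SHOW CREATE TABLE mydb.${t}" >> mydb_schema.sql 2>/dev/null
done
# copy the underlying warehouse files to the destination cluster
hadoop distcp /user/hive/warehouse/mydb.db hdfs://dest-namenode/user/hive/warehouse/mydb.db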
Since there is no DUMP command (or something similar):
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_shell_commands.html
I think the best solution is to use only external tables in the database.
That way, you know where your data is saved and can potentially copy it to another place.
CREATE EXTERNAL TABLE table_name(one_field INT, another_field BIGINT,
another_field1 STRING)
COMMENT 'This is an external table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<my_hdfs_location>';