Copying data from External table to database - hive

I have data in an external table. Now I'm copying the data from the external table to a newly created table in a database. What kind of table will the table in the database be? Is it a managed table or an external table? I need your help to understand the concept behind this question.
Thanks,
Madan Mohan S

Hive tables get their type, "Managed" or "External", at the time of their creation, not when data is inserted.
So the table employees is external (because it was created using "CREATE EXTERNAL" in the DDL and a location for the data file was provided).
The table emp is a managed table because "EXTERNAL" was NOT used in the DDL and a location for the data was not needed.
The difference now is: if the table employees is dropped, the data it was reading from the provided "location" is not deleted. So an external table is useful when the data is being read by multiple tools, e.g. Pig. If a Pig script is reading the same location, it will still function even after the employees table is dropped.
But emp is managed (in other words, both metadata and data are managed by Hive), so when emp is dropped the data is also deleted. After dropping it, if you check the Hive warehouse directory you will not find an "emp" HDFS directory anymore.
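A minimal DDL sketch of the two cases, reusing the employees and emp table names from above (the column list, delimiter, and HDFS path are assumptions for illustration):
CREATE EXTERNAL TABLE employees (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employees';              -- assumed path; Hive only registers metadata here

CREATE TABLE emp (id INT, name STRING);  -- managed: data will live under the Hive warehouse directory

DROP TABLE employees;  -- metadata is removed, but the files under /data/employees remain
DROP TABLE emp;        -- metadata AND the files under the warehouse directory are deleted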


Unable to load managed table with maptype column (complex datatype) from external table in hive

I have an external table with a complex datatype, map(string, array(struct)), and I'm able to select and query this external table without any issue.
However, when I try to load this data into a managed table, it runs forever. Is there a best approach to load this data into a managed table in Hive?
CREATE EXTERNAL TABLE DB.TBL(
  id string,
  list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
) LOCATION '<path>';
BTW, you can convert the table to managed (though this may not work on the Cloudera distribution due to a warehouse dir restriction):
use DB;
alter table TBL SET TBLPROPERTIES('EXTERNAL'='FALSE');
If you need to load into another managed table, you can simply copy files into its location.
--Create managed table (or use existing one)
use db;
create table tbl_managed(
  id string,
  list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
);
--Check table location
use db;
desc formatted tbl_managed;
This will print the location along with other info; use it to copy the files.
Copy all the files from the external table location into the managed table location; this works most efficiently and is much faster than insert ... select:
hadoop fs -cp external/location/path/* managed/location/path
After copying the files, the table will be selectable. You may want to analyze the table to compute statistics:
ANALYZE TABLE db_name.tablename COMPUTE STATISTICS [FOR COLUMNS]

How to rename a database in Azure Databricks?

I am trying to rename a database in Azure Databricks, but I am getting the following error:
no viable alternative at input 'ALTER DATABASE inventory
Below is the code:
%sql
use inventory;
ALTER DATABASE inventory MODIFY NAME = new_inventory;
Please explain what is meant by the error "no viable alternative at input 'ALTER DATABASE inventory'"
and how I can solve it.
It's not possible to rename a database on Databricks. If you go to the documentation, you will see that you can only set DBPROPERTIES.
If you really need to rename the database, then you have two choices:
if you have unmanaged tables (not created via saveAsTable, etc.), then you can produce SQL using SHOW CREATE TABLE, drop your database (be careful anyway), and recreate all tables from the saved SQL
if you have managed tables, then the solution would be to create a new database and either use CLONE (only for Delta tables) or CREATE TABLE ... AS SELECT for other file types, and after that drop your old database (sketched below)
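A rough SQL sketch of that second option, using the inventory/new_inventory names from the question and a hypothetical managed table called items:
CREATE DATABASE IF NOT EXISTS new_inventory;

-- Delta table: copy it with CLONE (note: the clone starts without the source's version history)
CREATE TABLE new_inventory.items DEEP CLONE inventory.items;

-- non-Delta table: copy it with CTAS instead
-- CREATE TABLE new_inventory.items AS SELECT * FROM inventory.items;

DROP DATABASE inventory CASCADE;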
Alex Ott's answer, to use Clone, is OK if you do not need to maintain the versioning history of your database when you rename it.
However if you wish to time travel on the database of Delta tables after the renaming, this solution works:
Create your new database, specifying its location
Move the file system from the old location to the new location
For each table on the old database, create a table on the new database, based on the location (my code relies on the standard file structure of {database name}/{table name} being observed). No need to specify schema as it's just taken from the files in place
Drop old database
You will then be left with a database with your new name, that has all of the data and all of the history of your old database, i.e. a renamed database of Delta tables.
PySpark method (on Databricks, with "spark" and "dbutils" already defined by default):
def rename_db(original_db_name, original_db_location, new_db_name, new_db_location):
    # create the new database at the new location
    spark.sql(f"create database if not exists {new_db_name} location '{new_db_location}'")
    # move the files (including Delta transaction logs) from the old location to the new one
    dbutils.fs.mv(original_db_location, new_db_location, True)
    # re-register each table in the new database on top of its moved files
    for table in list(map(lambda x: x.tableName, spark.sql(f"SHOW TABLES FROM {original_db_name}").select("tableName").collect())):
        spark.sql(f"create table {new_db_name}.{table} location '{new_db_location}/{table}'")
    # drop the old database (its files have already been moved away)
    spark.sql(f"drop database {original_db_name} cascade")
    return spark.sql(f"SHOW TABLES FROM {new_db_name}")

How does Hive create a table from a file present in HDFS?

I am new to HDFS and Hive. I got some introduction to both after reading some books and documentation. I have a question regarding the creation of a table in Hive for which the file is present in HDFS.
I have a file with 300 fields in HDFS. I want to create a table accessing this file in HDFS, but I want to make use of only, say, 30 fields from this file.
My questions are
1. Does Hive create a separate file directory?
2. Do I have to create the Hive table first and import data from HDFS?
3. Since I want to create a table with 30 columns out of the 300 columns, does Hive create a file with only those 30 columns?
4. Do I have to create a separate file with 30 columns, import it into HDFS, and then create a Hive table pointing to that HDFS directory?
My questions are
Does Hive create a separate file directory?
YES, if you create a Hive table (managed/external) and load the data using the LOAD command.
NO, if you create an external table and point it to an existing file.
Do I have to create the Hive table first and import data from HDFS?
Not necessarily; you can create a Hive external table and point it to the existing file.
Since I want to create a table with 30 columns out of the 300 columns, does Hive create a file with only those 30 columns?
You can do it easily using HiveQL. Follow the steps below (note: this is not the only approach):
create an external table with 300 columns and point it to the existing file.
create another Hive table with the desired 30 columns and insert data into this new table from the 300-column table using "insert into table30col select ... from table300col". Note: Hive will create the file with 30 columns during this insert operation.
Do I have to create a separate file with 30 columns, import it into HDFS, and then create a Hive table pointing to that HDFS directory?
Yes, this can be an alternative.
I personally like the solution mentioned in question 3, as I don't have to recreate the file and I can do all of it in Hadoop without depending on some other system.
You have several options. One is to have Hive simply point to the existing file, i.e. create an external HIVE table:
CREATE EXTERNAL TABLE ... LOCATION '<your existing hdfs file>';
This table in Hive will, obviously, exactly match your existing table. You must declare all 300 columns. There will be no data duplication; there is only one file, and Hive simply references the already existing file.
A second option would be to either IMPORT or LOAD the data into a Hive table. This would copy the data into a Hive table and let Hive control the location. But it is important to understand that neither IMPORT nor LOAD transforms the data, so the resulting table will have exactly the same structure, layout, and storage as your original table.
Another option, which I would recommend, is to create a specific Hive table and then import the data into it, using a tool like Sqoop or going through an intermediate staging table created by one of the methods above (preferably an external reference, to avoid an extra copy). Create the desired table, create the external-reference staging table, insert the data into the target using INSERT ... SELECT, then drop the staging table. I recommend this because it lets you control not only the table structure/schema (i.e. have only the needed 30 columns) but also, importantly, the storage. Hive has a highly performant columnar storage format, namely ORC, and you should strive to use this storage format because it will give you a tremendous query performance boost.
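A rough HiveQL sketch of that recommended flow (the table names, the two column names shown, the delimiter, and the HDFS path are all hypothetical, and the remaining columns are elided):
-- 1. External staging table declaring all 300 columns, pointing at the existing file
CREATE EXTERNAL TABLE staging_300 (
  col1 STRING,
  col2 STRING
  -- ... declare the remaining 298 columns here ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/existing_file_dir';

-- 2. Target table with only the 30 needed columns, stored as ORC
CREATE TABLE target_30 (
  col1 STRING,
  col2 STRING
  -- ... declare the other 28 needed columns here ...
)
STORED AS ORC;

-- 3. Copy just the needed columns, then drop the staging table (the original file stays, since the table is external)
INSERT INTO TABLE target_30
SELECT col1, col2   -- list the other 28 needed columns here
FROM staging_300;
DROP TABLE staging_300;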

Loading data into Hive dynamic partitioned tables

I have created a Hive table with dynamic partitioning on a column. Is there a way to directly load the data from files using the "LOAD DATA" statement? Or do we have to depend on creating a non-partitioned intermediate table, loading the file data into it, and then inserting data from this intermediate table into the partitioned table, as mentioned in Hive loading in partitioned table?
No, the LOAD DATA command ONLY copies the files to the destination directory. It doesn't read the records of the input file, so it CANNOT do partitioning based on record values.
If your input data is already split into multiple files based on partitions, you can directly copy the files into the table location in HDFS under partition directories you create manually (OR just point to their current location in the case of an EXTERNAL table) and use the following ALTER command to ADD the partition. This way you can skip the LOAD DATA statement altogether.
ALTER TABLE <table-name>
ADD PARTITION (<...>)
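For example, a concrete version of the command above (the table name, partition column, and HDFS path are hypothetical), pointing a new partition at files that are already in place:
ALTER TABLE sales
ADD PARTITION (dt='2015-06-01')
LOCATION '/data/sales/dt=2015-06-01';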
There is no other way: if we need to insert directly, we have to specify the partitions manually.
For dynamic partitioning, we need a staging table and then insert from there, as sketched below.
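A minimal sketch of that staging-table route, assuming a target table partitioned by a column dt and a flat staging table with the same columns (all names and the input path are hypothetical):
-- allow dynamic partitioning for this session
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- LOAD DATA is fine here because the staging table has no partitions
LOAD DATA INPATH '/data/incoming/sales' INTO TABLE sales_staging;

-- each row lands in the partition derived from its dt value (partition column last in the SELECT)
INSERT INTO TABLE sales_partitioned PARTITION (dt)
SELECT id, amount, dt FROM sales_staging;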

How do I dump an entire impala database

Is there a way to dump all the schema / data of an impala database so I can recreate in a new database instance?
Something akin to what mysqldump does?
Yes,
you can take all the data from the Impala warehouse (usually /user/hive/warehouse)
use distcp to copy it from one cluster to the other cluster at the same location
fire SHOW CREATE TABLE to get the schema of each table and just change the location to the destination location
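For the schema step, a small sketch (the database and table names are hypothetical; run it in impala-shell on the source cluster, then replay the edited DDL on the destination):
-- prints the full DDL, including the LOCATION clause
SHOW CREATE TABLE my_db.my_table;
-- change LOCATION in the emitted DDL to the destination path before running it on the new cluster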
Since there is no DUMP command (or something similar):
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_shell_commands.html
I think the best solution will be to use only external tables in one database.
That way, you know where your data is saved and can potentially copy it to another place.
CREATE EXTERNAL TABLE table_name(one_field INT, another_field BIGINT,
another_field1 STRING)
COMMENT 'This is an external table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<my_hdfs_location>';