I have created a hive table with dynamic partitioning on a column. Is there a way to directly load the data from files using "LOAD DATA" statement? Or do we have to only depend on creating a non-partitioned intermediate table and load file data to it and then inserting data from this intermediate table to partitioned table as mentioned in Hive loading in partitioned table?
No, the LOAD DATA command ONLY copies the files to the destination directory. It doesn't read the records of the input file, so it CANNOT do partitioning based on record values.
If your input data is already split into multiple files based on partitions, you could directly copy the files to table location in HDFS under their partition directory manually created by you (OR just point to their current location in case of EXTERNAL table) and use the following ALTER command to ADD the partition. This way you could skip the LOAD DATA statement altogether.
ALTER TABLE <table-name>
ADD PARTITION (<...>)
No other go, if we need to insert directly, we'll need to specify partitions manually.
For dynamic partitioning, we need staging table and then insert from there.
Related
I have external table with complex datatype,(map(string,array(struct))) and I'm able to select and query this external table without any issue.
However if I am trying to load this data to a managed table, it runs forever. Is there any best approach to load this data to managed table in hive?
CREATE EXTERNAL TABLE DB.TBL(
id string ,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
) LOCATION <path>
BTW, you can convert table to managed (though this may not work on cloudera distribution due warehouse dir restriction):
use DB;
alter table TBLSET TBLPROPERTIES('EXTERNAL'='FALSE');
If you need to load into another managed table, you can simply copy files into it's location.
--Create managed table (or use existing one)
use db;
create table tbl_managed(id string,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>> ) ;
--Check table location
use db;
desc formatted tbl_managed;
This will print location along with other info, use it to copy files.
Copy all files from external table location into managed table location, this will work most efficiently, much faster than insert..select:
hadoop fs -cp external/location/path/* managed/location/path
After copying files, table will be selectable. You may want to analyze table to compute statistics:
ANALYZE TABLE db_name.tablename COMPUTE STATISTICS [FOR COLUMNS]
Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web-search, and variants exist for SQL server, or adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on folder structure in S3. Unlike standard RDBMS that are loading the data into their disks or memory, Athena is based on scanning data in S3. This is how you enjoy the scale and low cost of the service.
What it means is that you have to have your data in different folders in a meaningful structure such as year=2019, year=2020, and make sure that the data for each year is all and only in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
I am new to HDFS and HIVE. I got some introduction of both after reading some books and documentation. I have a question regarding creation of a table in HIVE for which file is present in HDFS.
I have this file with 300 fields in HDFS. I want to create a table accessing this file in HDFS. But I want to make use of say 30 fields from this file.
My questions are
1. Does hive create a separate file directory?
2. Do I have to create hive table first and import data from HDFS?
3. Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
4. Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
My questions are
Does hive create a separate file directory?
YES if you create a hive table (managed/external) and load the data using load command.
NO if you create an external table and point to the existing file.
Do I have to create hive table first and import data from HDFS?
Not Necessarily you can create a hive external table and point to this existing file.
Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
You can do it easily using hiveQL. follow the below steps (note: this is not the only approach):
create a external table with 300 column and point to the existing
file.
create another hive table with desired 30 columns and insert data to this new table from 300 column table using "insert into
table30col select ... from table300col". Note: hive will create the
file with 30 columns during this insert operation.
Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
Yes this can be an alternative.
I personally like solution mentioned in question 3 as I don't have to recreate the file and I can do all of that in hadoop without depending on some other system.
You have several options. One is to have Hive simply point to the existing file, i.e. create an external HIVE table:
CREATE EXTERNAL TABLE ... LOCATION '<your existing hdfs file>';
This table in Hive will, obviously, match exactly your existing table. You must declare all 300 columns. There will be no data duplication, there is only one one file, Hive simply references the already existing file.
A second option would be to either IMPORT or LOAD the data into a Hive table. This would copy the data into a Hive table and let Hive control the location. But is important to understand that neither IMPORT nor LOAD do not transform the data, so the result table will have exactly the same structure layout and storage as your original table.
Another option, which I would recommend, is to create a specific Hive table and then import the data into it, using a tool like Sqoop or going through an intermediate staging table created by one of the methods above (preferably external reference to avoid an extra copy). Create the desired table, create the external reference staging table, insert the data into the target using INSERT ... SELECT, then drop the staging table. I recommend this because it lets you control not only the table structure/schema (ie. have only the needed 30 columns) but also, importantly, the storage. Hive has a highly columnar performant storage format, namely ORC, and you should thrive to use this storage format because will give you tremendous query performance boost.
After insertion of orc files into the folder of a table with hdfs copy, how to update that hive table's data to see those data when querying with hive.
Best Regards.
If the table is not partitioned then once the files are in HDFS in the folder that is specified in the LOCATION clause, then the data should be available for querying.
If the table is partitioned then u first need to run an ADD PARTITION statement.
As mentioned in upper answer by belostoky. if the table is not partitioned then you can directly query your table with the updated data
But in case if you table is partitioned you need to add partitions first in hive table that you can do using
You can use alter table statement to add partitions like shown below
ALTER TABLE table1
ADD PARTITION (dt='<date>')
location '<hdfs file path>'
once partitions are added hive metastore should be aware of changes so you need to run
msck repair table table1
to add partitions in metastore.
Once done you can query your data
I am trying to learn Hive and while referring the The Hadoop Definitive Guide, I had some confusions.
As per the text, partition in Hive is done by creating sub-directories of the same values of partitioning column. But as in Hive data loading simply means copying of files, and no data validation checks are done during loading, but during querying only, so does Hive check the data for partitioning. Or how does it determine which file should go to which directory?
Or how does it determine which file should go to which directory?
It doesn't, you have to set the value of the destination partition in the LOAD DATA command. When you perform a LOAD operation into a partitioned table, you have to specify the specific partition (the directory) in which you are going to load the data by means of the PARTITION argument. According to the documentation:
The target being loaded to can be a table or a partition. If the table
is partitioned, then one must specify a specific partition of the
table by specifying values for all of the partitioning columns.
For instance, in this example:
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
The two files will be stored in the invites/ds=2008-08-15 and invites/ds=2008-08-08 folders.