Inserting csv files into Hive table dynamically - hive

There is a folder named "Sample" which contains n csv files. How can all the csv files be loaded into a Hive table dynamically?
For normal insert we use
load data inpath "file1.csv" into table Person;
Without hardcoding, can it be done for all the files?

You just need to pass in the directory name like:
load data inpath "/directory/name/here" into table Person;
Quoting the manual:
filepath can refer to a file (in which case Hive will move the file
into the table) or it can be a directory (in which case Hive will move
all the files within that directory into the table). In either case,
filepath addresses a set of files.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
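To make that concrete, both forms below are valid; the paths are hypothetical:
load data inpath '/data/Sample/file1.csv' into table Person;
-- moves just that one file into the table
load data inpath '/data/Sample' into table Person;
-- moves every file in the folder into the table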

Related

INSERT OVERWRITE on just created table

I have to replicate a process for a client. I have never worked with Hive, so I am trying to understand what they were doing in other cases.
The Hive script I am trying to understand is this one:
DROP TABLE IF EXISTS distribution.030601_TI11;
CREATE EXTERNAL TABLE IF NOT EXISTS distribution.030601_TI11(
mygroup STRING, year STRING, type1 STRING, type2 STRING,
type3 STRING, myvalue INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/warehouse/distribution/030601_TI11';
INSERT OVERWRITE TABLE distribution.030601_TI11
SELECT *
FROM develop.030601_TI11;
What are they doing?
As far as I have read about Hive, a DROP TABLE IF EXISTS statement over an external table will only delete the table metadata, not the table data. But I would like to know whether that INSERT OVERWRITE statement drops the previous entries stored in the table and inserts only the new rows contained in the specified location.
And also, how is the LOCATION managed? I want to create the table from a single .csv file. Can I write something like LOCATION '/warehouse/develop/myfile.csv', or can I only provide an HDFS directory as a location?
INSERT OVERWRITE TABLE removes all files inside the table location and moves in the new files. This happens at the very end, when the query has already executed successfully and the result files have been created in a temporary location; after that, the load task removes all files in the table location and moves the files from the temp location into the table location. See also this answer: https://stackoverflow.com/a/63378038/2700344
If you want to create a table on top of a single file, put it in some folder, make sure there are no other files in the same folder, and specify that folder as the location in the CREATE TABLE DDL. You can also put the file into an existing table location using the hdfs dfs -put command, using the LOAD command, or by some other means. The main point here is that the table should have its own location; it does not matter how many files are in that location, one or many, because the location is a folder (directory), not a file. Even if it were possible to create a table on top of a single file instead of a folder, it would be unsafe, because an overwrite can create other files and the table would then have a location pointing to a non-existing file. Carefully read the answers on this question: How to point to a single file with external table
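A minimal sketch of the one-file-per-folder approach, with hypothetical paths and a placeholder schema:
-- stage the single file into its own directory first:
--   hdfs dfs -mkdir -p /warehouse/develop/myfile_dir
--   hdfs dfs -put myfile.csv /warehouse/develop/myfile_dir/
CREATE EXTERNAL TABLE develop.myfile_table (
  name STRING,
  number DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/warehouse/develop/myfile_dir';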
You are right, the location for the external table will remain as is. So, with the drop-create statements, they are ensuring that the table doesn't exist before creating it. And the table seems to be dynamic in nature, so that can be another reason for the drop-create.
Please notice you are using CREATE EXTERNAL TABLE IF NOT EXISTS, which means that if the table exists, it will not be recreated.
Storage will be cleaned and loaded using INSERT OVERWRITE.
Now, if you want to create a table on top of a csv file, just use LOCATION '/warehouse/develop/myfile'. You don't have to use .csv in the location.

Hive: load gziped CSV from hdfs as read-only into a table

I have an hdfs folder with many csv.gz files within, all with the same schema. My customer needs to read the content of these files through Hive.
I tried to apply https://cwiki.apache.org/confluence/display/Hive/CompressedStorage . However, it moves the file, whereas I need it to stay in its initial directory.
Another problem is that I would have to load each file one by one; I would rather create a table from the directory and not manage files individually.
I'm not at all proficient with Hive. Is this possible?
Yes, this is possible via Hive. You can create an external table and reference the existing HDFS location containing the gzip files. The schema for the data should be specified during the table creation.
CREATE EXTERNAL TABLE my_data
(
column_1 int,
column_2 string
)
-- assuming the gzipped files are comma-separated; adjust the delimiter to your data
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs:///my_data_folder_with_gzip_files';
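The gzip files stay in place, and Hive decompresses .gz text files transparently at read time based on the file extension, so the table is queryable right away. For example, using the names above:
SELECT column_1, column_2 FROM my_data LIMIT 10;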

Load local csv file to hive parquet table directly,not resort to a temp textfile table

I am now preparing to store data from .csv files into Hive. Of course, because of the good performance of the parquet file format, the Hive table should be in parquet format. So the normal way is to create a temp table whose format is textfile, then load the local CSV file data into this temp table, and finally create a same-structure parquet table and use INSERT INTO parquet_table SELECT * FROM textfile_table;.
But I don't think this temp textfile table is necessary. So, my question is: is there a way for me to load these local .csv files into a hive parquet-format table directly, namely, without resorting to a temp table? Or an easier way to accomplish this task?
As stated in Hive documentation:
NO verification of data against the schema is performed by the load command.
If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
You could skip a step by using CREATE TABLE AS SELECT for the parquet table.
So you'll have 3 steps (a full sketch follows the list):
Create text table defining the schema
Load data into text table (move the file into the new table)
CREATE TABLE parquet_table STORED AS PARQUET AS SELECT * FROM textfile_table; (supported from Hive 0.13)
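Putting the three steps together, a minimal sketch, assuming a comma-delimited local file at a hypothetical path:
-- 1. text staging table matching the csv schema (placeholder columns)
CREATE TABLE textfile_table (
  name STRING,
  number DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- 2. load the local csv into the staging table
LOAD DATA LOCAL INPATH '/tmp/data.csv' INTO TABLE textfile_table;
-- 3. create the parquet table from it in a single statement (Hive 0.13+)
CREATE TABLE parquet_table STORED AS PARQUET
AS SELECT * FROM textfile_table;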

Creating external table from file, not directory

When I run the create external table query, I have to provide a directory for the 'Location' attribute. But if the directory I point to has more than one file, then it reads both files. For example, if I put LOCATION 'dir1/', and dir1 contains file1 and file2, both files will be read.
To avoid this, I want to point to a single file. When I tried LOCATION 'dir1/file1', it gave me an error that the file path is not a directory or unable to create one. Is there a way to point to just the single file?
If you want to load data from HDFS, try this:
LOAD DATA INPATH '/user/data/file1' INTO TABLE table1;
And if you want to load data from local storage:
LOAD DATA LOCAL INPATH '/data/file1' INTO TABLE table1;
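If the goal is a table that sees only that one file, a common pattern is to give the table its own empty directory and then load the single file into it. A sketch with hypothetical paths and a placeholder schema:
CREATE EXTERNAL TABLE table1 (line STRING)
LOCATION '/user/data/table1_dir';
LOAD DATA INPATH '/user/data/file1' INTO TABLE table1;
-- file1 is moved into /user/data/table1_dir, so the table reads only that file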

Difference between `load data inpath ` and `location` in hive?

At my firm, I see these two commands used frequently, and I'd like to be aware of the differences, because their functionality seems the same to me:
1
create table <mytable>
(name string,
number double);
load data inpath '/directory-path/file.csv' into table <mytable>;
2
create table <mytable>
(name string,
number double)
location '/directory-path/file.csv';
They both copy the data from the directory on HDFS into the directory for the table on HIVE. Are there differences that one should be aware of when using these? Thank you.
Yes, there are differences; the two are used for entirely different purposes.
The load data inpath command is used to load data into a hive table. 'LOCAL' signifies that the input file is on the local file system; if 'LOCAL' is omitted, Hive looks for the file in HDFS.
load data inpath '/directory-path/file.csv' into table <mytable>;
load data local inpath '/local-directory-path/file.csv' into table <mytable>;
The LOCATION keyword allows you to point to any HDFS location for the table's storage, rather than having it stored in the folder specified by the configuration property hive.metastore.warehouse.dir.
In other words, with specified LOCATION '/your-path/', Hive does not use a default location for this table. This comes in handy if you already have data generated.
Remember, LOCATION is typically specified on EXTERNAL tables; for regular tables, the default location will be used.
To summarize,
load data inpath tells hive where to look for the input files, and the LOCATION keyword tells hive where to keep the table's files on HDFS.
References:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Option 1: Internal table
create table <mytable>
(name string,
number double);
load data inpath '/directory-path/file.csv' into table <mytable>;
This command will remove the content from the source directory (it is moved into the warehouse directory) and create an internal table.
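Because the load is a move rather than a copy, the source file disappears from its original directory. You can check this from the Hive CLI itself (paths hypothetical):
dfs -ls /directory-path/;
-- file.csv is listed here before the load
load data inpath '/directory-path/file.csv' into table <mytable>;
dfs -ls /directory-path/;
-- file.csv is gone: it was moved into the table's warehouse directory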
Option 2: External table
create external table <mytable>
(name string,
number double)
location '/directory-path/file.csv';
This creates an external table that points at the data in place; the data is not moved from the source. You can drop the external table, but the source data is still available.
When you drop an external table, it only drops the meta data of HIVE table. Data still exists at HDFS file location.
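In other words, with hypothetical table names:
drop table internal_t;
-- deletes the metadata AND the files under the table's warehouse directory
drop table external_t;
-- deletes the metadata only; the files at LOCATION stay on HDFS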
Have a look at this related SE question regarding use cases for both internal and external tables:
Difference between Hive internal tables and external tables?