At my firm, I see these two commands used frequently, and I'd like to be aware of the differences, because their functionality seems the same to me:
1
create table <mytable>
(name string,
number double);
load data inpath '/directory-path/file.csv' into table <mytable>;
2
create table <mytable>
(name string,
number double)
location '/directory-path/file.csv';
They both copy the data from the directory on HDFS into the directory for the table on HIVE. Are there differences that one should be aware of when using these? Thank you.
Yes, they are used for entirely different purposes.
The load data inpath command is used to load data files into a Hive table. 'LOCAL' signifies that the input file is on the local file system, in which case the file is copied into the table's directory. If 'LOCAL' is omitted, Hive looks for the file in HDFS and moves it into the table's directory.
load data inpath '/directory-path/file.csv' into table <mytable>;
load data local inpath '/local-directory-path/file.csv' into table <mytable>;
The LOCATION keyword lets you point a table at any HDFS location for its storage, rather than having it stored in the folder specified by the configuration property hive.metastore.warehouse.dir.
In other words, with specified LOCATION '/your-path/', Hive does not use a default location for this table. This comes in handy if you already have data generated.
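Remember, LOCATION is normally used with EXTERNAL tables. It can also be specified for a regular (managed) table, but dropping that table then deletes the data at that location. When LOCATION is omitted, the default location is used.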
To summarize,
load data inpath tells Hive where to look for input files, and the LOCATION keyword tells Hive where the table's data is stored on HDFS.
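As a rough sketch of the contrast, the two approaches might look like this. The table names, the paths /staging/people.csv and /data/people/, and the comma delimiter are illustrative assumptions, not taken from the question.

```sql
-- Variant A: managed table plus LOAD.
-- The file is moved from its HDFS source path into the table's
-- directory under the warehouse.
CREATE TABLE people_managed (name STRING, number DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/staging/people.csv' INTO TABLE people_managed;

-- Variant B: external table with LOCATION.
-- Nothing is moved or copied; Hive reads whatever files sit in the
-- directory at query time. LOCATION names a directory, not a file.
CREATE EXTERNAL TABLE people_external (name STRING, number DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/people/';
```

After Variant A, /staging/people.csv no longer exists at its original path. After Variant B, the files under /data/people/ are untouched.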
References:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Option 1: Internal table
create table <mytable>
(name string,
number double);
load data inpath '/directory-path/file.csv' into table <mytable>;
This creates an internal (managed) table and moves the file out of the source directory into the table's directory, so the file no longer exists at the source path.
Option 2: External table
create external table <mytable>
(name string,
number double)
location '/directory-path/';
This creates an external table over the data where it already sits; LOCATION must name a directory, not a single file. The data is neither copied nor moved from the source. You can drop the external table and the source data is still available.
When you drop an external table, only the Hive table's metadata is dropped. The data still exists at the HDFS file location.
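A minimal sketch of how the difference shows up in practice, assuming two hypothetical tables already exist: people_external (created with CREATE EXTERNAL TABLE ... LOCATION) and people_managed (a regular table).

```sql
-- Check whether a table is managed or external and where its data lives.
-- Look for the "Table Type" and "Location" rows in the output.
DESCRIBE FORMATTED people_external;

-- Dropping an EXTERNAL table removes only the metastore entry;
-- the files under its LOCATION stay in HDFS.
DROP TABLE people_external;

-- Dropping a managed table also removes its data directory
-- (it goes to the HDFS trash first if trash is enabled).
DROP TABLE people_managed;
```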
Have a look at this related SE question regarding use cases for both internal and external tables:
Difference between Hive internal tables and external tables?
Related
I have a requirement to load an Avro file into Hive. I am using the following to create the table:
create external table tblName stored as avro location 'hdfs://host/pathToData' tblproperties ('avro.schema.url'='/hdfsPathTo/schema.avsc');
I am getting the error FOUND NULL, EXPECTED STRING when doing a select on the table. Is it possible to load only a few columns and find out which column's data is causing this error?
Actually, you first need to create a Hive external table that points to the location of your Avro files and uses the AvroSerDe format.
At this stage, nothing is loaded. The external table is just a mask on files.
Then you can create an internal HIVE table and load data (the expected columns) from the external one.
If you already have an Avro file, load it into an HDFS directory of your choice. Next, create an external table on top of that directory.
CREATE EXTERNAL TABLE external_table_name(col1 string, col2 string, col3 string ) STORED AS AVRO LOCATION '<HDFS location>';
Next, create an internal Hive table from the external table to load the data. A CREATE TABLE ... AS SELECT statement takes its column names and types from the query, so no column list is given:
CREATE TABLE internal_table_name AS SELECT col2, col3 FROM external_table_name;
You can schedule the internal table load using a batch script in any scripting language or tools.
Hope this helps :)
I have an HDFS folder containing many csv.gz files, all with the same schema. My customer needs to read the content of these files through Hive.
I tried to apply https://cwiki.apache.org/confluence/display/Hive/CompressedStorage . However it moves the file, whereas I need it to stay in its initial directory.
Another problem is that I would have to load each file one by one. I would rather create a table from the directory and not manage files individually.
I am not at all proficient with Hive. Is this possible?
Yes, this is possible via Hive. You can create an external table and reference the existing HDFS location containing the gzip files. The schema for the data should be specified during the table creation.
hive> CREATE EXTERNAL TABLE my_data
(
column_1 int,
column_2 string
)
LOCATION 'hdfs:///my_data_folder_with_gzip_files';
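Without a ROW FORMAT clause, Hive falls back to its default field delimiter (Ctrl-A), so comma-separated files usually need the delimiter declared explicitly. A fuller sketch of the statement above could look like the following, assuming plain comma-delimited rows with no quoted fields and no header line:

```sql
CREATE EXTERNAL TABLE my_data (
  column_1 INT,
  column_2 STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///my_data_folder_with_gzip_files';

-- Quick sanity check that the rows parse as expected.
SELECT * FROM my_data LIMIT 10;
```

Hive decompresses .gz text files transparently based on the file extension, and every file in the directory is read as part of the table. Gzip files are not splittable, so each file is processed by a single mapper.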
In Hue --> Hive Query Browser I created an external table in Hive and loaded data from one of my CSV files into it using the following statements:
CREATE EXTERNAL TABLE movies(movieId BIGINT, title VARCHAR(100), genres VARCHAR(100)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
LOAD DATA INPATH '/user/admin/movie_data/movies' INTO TABLE movies;
I see that the source file "movies" disappears from HDFS and moves to the hive datawarehouse. I am under the impression that an external table acts only as a link to original source data.
Should the external table not be independent of source data - as in if I were to drop the table, source file will still persist? How do I create such an external table?
An external table stores its data in the HDFS location given when the table is created. If no location is provided at creation time, it defaults to the warehouse HDFS folder.
Try running "use mydatabase_name;show create table mytable_name;" to get the table definition to see what is the location it is pointed to.
If you need an HDFS location other than the default one, you need to specify it when creating the table. See the query below:
Create external table test (col1 string) location '/data/database/tablename';
Secondly, LOAD DATA INPATH with an HDFS source moves the files from INPATH into the table's HDFS location; only LOAD DATA LOCAL INPATH copies them. Because your table was created without a LOCATION, the file was moved into the warehouse folder, which is why it disappeared from its original path.
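For the movies case, a sketch of an external table that leaves the source file where it is could look like this. It assumes /user/admin/movie_data is a directory that holds only the movies file; if other files live there, the movies file would need a directory of its own.

```sql
CREATE EXTERNAL TABLE movies (
  movieId BIGINT,
  title   VARCHAR(100),
  genres  VARCHAR(100)
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/admin/movie_data';
```

No LOAD DATA statement is needed here. The table reads the files in that directory directly, and dropping the table leaves them in place.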
By going through the internet about external tables and managed table, I understood that we need to specify the Location while creating the external table as hive will create the tables in the given location but in case of managed table, the default directory mentioned in hive.metastore.warehouse.dir will be used.
Please correct me if anything wrongly stated.
What is confusing me is:
Is the LOCATION clause used to specify where the data exists for an external table, or where to create the directory to store the actual data?
If the LOCATION clause is used to specify where the data exists, then why are we using the INPATH clause in the LOAD statement?
The location clause in the DDL of an external table is used to
specify the hdfs location where the data needs to be stored. Later
on when we query the table the data would be read from this specified
path.
The load data inpath is the path of the source file from where the data
is loaded into the table. The source could be either a local file
path or a hdfs file path.
Hope I have cleared your confusion.
I have 1 TB of data in my HDFS in .csv format. When I load it into my Hive table, what will be the total size of the data? I mean, will there be 2 copies of the same data, i.e. 1 copy in HDFS and another in the Hive table? Please clarify. Thanks in advance.
If you create a Hive external table, you provide an HDFS location for the table and the data is stored at that location.
When you create a Hive internal table, Hive creates a directory for it under the warehouse directory, for example /apps/hive/warehouse/ (the actual path is set by hive.metastore.warehouse.dir).
Say your table name is table1; then your directory will be /apps/hive/warehouse/table1
This directory is also an HDFS directory, and when you load data into the internal table it goes into that directory.
Hive keeps a mapping between each table and its HDFS location, so when you read the data it is read from the mapped directory.
Hence there won't be a duplicate copy of the data: the table and its HDFS location are the same data.
However, if the data replication factor in your Hadoop cluster is set to 3 (the default), your 1 TB of data will take 3 TB of cluster disk space. That is an HDFS-level effect and has nothing to do with the Hive table.
Please see below link to know more about Data replication.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
It depends whether you are creating an internal or external table in Hive.
If you create an external table in Hive, it creates a mapping to where your data is stored in HDFS, and there won't be any duplication at all. Hive will read the data from wherever it is stored in HDFS.
Read more about external tables here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables