importing data into hive table from external s3 bucket url link - amazon-s3

I need to import data from a public s3 bucket which url is shared with me. how to load the data into hive table?
I have tried below command but its not working:
create external table airlines_info (.... ) row format
delimited fields terminated by '|' lines terminated by '\n'
stored as textfile location 'https://ml-cloud-dataset.....*.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txt is not a directory or unable to create one)
I am very new to hive and I am not sure about the code. I also tried below code after creating the table to load the data into hive table but that's also not working
load data inpath 'https://ml-cloud-dataset.....*.txt' into table airlines_info;

Table location should be directory in HDFS or S3, not file and not https link.
Download file manually, put into local filesystem and if you already have the table created then use
load data local inpath 'local_path_to_file' into table airlines_info;
If you do not have the table yet, create it and specify some location inside your S3, or alternatively create MANAGED table (remove EXTERNAL from your DDL), without location specified, it will create location for you, check location using DESCRIBE FORMATTED command, later you can convert table to EXTERNAL if necessary using ALTER TABLE airlines_info SET TBLPROPERTIES('EXTERNAL'='TRUE');
Instead of load data command you can simply copy file into table location using AWS CLI (provide correct local path and table directory S3 URL):
aws s3 cp C:\Users\My_user\Downloads\Airlines_data.txt s3://mybucket/path/airlines_info/

Related

Load Data into Hive on EMR

I created a cluster under the EMR service then I connected with Putty.
In the meantime, I chose 'presto' when building the cluster.
How do I transfer a file from S3 or on my local computer into the hive?
For example, I need to upload the student file but when I run the following code, I naturally get an error. Where do I put the student file?
hive > load data local inpath 'student' into table student_nopart;
I'm trying to make an example here.
https://github.com/weltond/LearnBasicBigDataTech
In your code,
load data local inpath ...
the local is meaning the EMR node, not your computer. By using sftp or something, you should upload the file into EMR first and load it.
OR use this.
load data inpath 's3://bucket/path/to/file/' into table <tablename>
If you already have data in S3, you can build Hive table on top of the S3 location or alter existing Hive table.
ALTER TABLE student SET location='s3://bucket/path/to/folder_with_table_files';

Hive: load gziped CSV from hdfs as read-only into a table

I have an hdfs folder with many csv.gz within, all with the same schema. My customer needs to read the content of these tables through Hive.
I tried to apply https://cwiki.apache.org/confluence/display/Hive/CompressedStorage . However it moves the file, whereas I need it to stay in its initial directory.
Another problem is that I should load each file one by one, I would rather create a table from the directory and not manage file individually.
I do not master Hive at all. Is his possible?
Yes, this is possible via Hive. You can create an external table and reference the existing HDFS location containing the gzip files. The schema for the data should be specified during the table creation.
hive> CREATE EXTERNAL TABLE my_data
(
column_1 int,
column_2 string
)
LOCATION 'hdfs:///my_data_folder_with_gzip_files';

Creating external table from file, not directory

When I run the create external table query, I have to provide a directory for the 'Location' attribute. But if the directory I point to has more than one file, then it reads both files. For example, if I put LOCATION 'dir1/', and dir1 contains file1 and file2, both files will be read.
To avoid this, I want to point to a single file. When I tried LOCATION 'dir1/file1', it gave me an error that the file path is not a directory or unable to create one. Is there a way to point to just the single file?
If You want to load data from HDFS so try this
LOAD DATA INPATH '/user/data/file1' INTO TABLE table1;
And if you want to load data from local storage so,
LOAD DATA LOCAL INPATH '/data/file1' INTO TABLE table1;

Difference between `load data inpath ` and `location` in hive?

At my firm, I see these two commands used frequently, and I'd like to be aware of the differences, because their functionality seems the same to me:
1
create table <mytable>
(name string,
number double);
load data inpath '/directory-path/file.csv' into <mytable>;
2
create table <mytable>
(name string,
number double);
location '/directory-path/file.csv';
They both copy the data from the directory on HDFS into the directory for the table on HIVE. Are there differences that one should be aware of when using these? Thank you.
Yes, they are used for different purposes at all.
load data inpath command is use to load data into hive table. 'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted then it looks for the file in HDFS.
load data inpath '/directory-path/file.csv' into <mytable>;
load data local inpath '/local-directory-path/file.csv' into <mytable>;
LOCATION keyword allows to point to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
In other words, with specified LOCATION '/your-path/', Hive does not use a default location for this table. This comes in handy if you already have data generated.
Remember, LOCATION can be specified on EXTERNAL tables only. For regular tables, the default location will be used.
To summarize,
load data inpath tell hive where to look for input files and LOCATION keyword tells hive where to save output files on HDFS.
References:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Option 1: Internal table
create table <mytable>
(name string,
number double);
load data inpath '/directory-path/file.csv' into <mytable>;
This command will remove content at source directory and create a internal table
Option 2: External table
create table <mytable>
(name string,
number double);
location '/directory-path/file.csv';
Create external table and copy the data into table. Now data won't be moved from source. You can drop external table but still source data is available.
When you drop an external table, it only drops the meta data of HIVE table. Data still exists at HDFS file location.
Have a look at this related SE questions regarding use cases for both internal and external tables
Difference between Hive internal tables and external tables?

How to load data into Hive table

I'm using the hortonworks's Hue (more like a GUI interface that connects hdfs, hive, pig together)and I want to load the data within the hdfs into my current created table.
Suppose the table's name is "test", and the file which contains the data, the path is:
/user/hdfs/test/test.txt"
But I'm unable to load the data into the table, I tried:
load data local inpath '/user/hdfs/test/test.txt' into table test
But there's error said can't find the file, there's no matching path.
I'm still so confused.
Any suggestions?
Thanks
As you said "load the data within the hdfs into my current created table".
But in you command you are using :
load data local inpath '/user/hdfs/test/test.txt' into table test
Using local keyword it looks for the file in your local filesystem. But you file is in HDFS.
I think you need to remove local keyword from you command.
Hope it helps...!!!
Since you are using the hue and the output is showing not matching path. I think you have to give the complete path.
for example:
load data local inpath '/home/cloudera/hive/Documents/info.csv' into table tablename; same as you can give the complete path where the hdfs in which the document resides.
You can use any other format file
remove local keyword as ur referring to local file system