I created a cluster under the EMR service and then connected to it with PuTTY. When building the cluster, I selected 'Presto' as well.
How do I transfer a file from S3 or from my local computer into Hive?
For example, I need to load the student file, but when I run the following command I get an error, as expected. Where do I put the student file?
hive > load data local inpath 'student' into table student_nopart;
I'm trying to work through the examples here:
https://github.com/weltond/LearnBasicBigDataTech
In your code,
load data local inpath ...
the local keyword refers to the filesystem of the EMR node, not your computer. You should first upload the file to the EMR node (with sftp, scp, or similar) and then load it.
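For example, a rough sketch of that route (the key file and the EMR master public DNS are placeholders for your own values):
scp -i my-key.pem student hadoop@<emr-master-public-dns>:/home/hadoop/
Then, in the Hive shell on the EMR node:
hive> load data local inpath '/home/hadoop/student' into table student_nopart;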
Or use this to load straight from S3:
load data inpath 's3://bucket/path/to/file/' into table <tablename>
If you already have the data in S3, you can build a Hive table on top of the S3 location or alter an existing Hive table:
ALTER TABLE student SET location='s3://bucket/path/to/folder_with_table_files';
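If the table does not exist yet, a minimal sketch of creating it directly over the S3 folder (the table name student_s3, the columns, and the delimiter are placeholders; use your real schema):
CREATE EXTERNAL TABLE student_s3 (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://bucket/path/to/folder_with_table_files/';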
I'm loading data into a Cloudera Impala ODBC table using a post-SQL statement, but I'm getting a "URI path must be absolute" error. Below is my SQL.
REFRESH sw_cfnusdata.CPN_Sales_Data;
DROP TABLE IF EXISTS sw_cfnusdata.CPN_Sales_Data_parquet;
CREATE TABLE IF NOT EXISTS sw_cfnusdata.CPN_Sales_Data_parquet LIKE
sw_cfnusdata.CPN_Sales_Data STORED AS PARQUET;
REFRESH sw_cfnusdata.CPN_Sales_Data_parquet;
LOAD DATA INPATH 'data/shared_workspace/sw_cfnusdata/Alteryx_CPN_Sales_Data' OVERWRITE INTO TABLE sw_cfnusdata.CPN_Sales_Data_parquet;
REFRESH sw_cfnusdata.CPN_Sales_Data_parquet;
COMPUTE STATS sw_cfnusdata.CPN_Sales_Data;
DROP TABLE sw_cfnusdata.CPN_Sales_Data;
Any ideas on what I'm missing here? I tried the same statement without the COMPUTE STATS function and still got the same error. Thank you in advance.
You need to provide an HDFS path.
Upload the file into HDFS and try the same command with an HDFS path, e.g. hdfs://DEV/data/sampletable.
Alternatively, you can put the file on the local disk and try the command below:
load data local inpath "/data/sampletable.txt" into table sampletable;
So the statement below needs to be changed to use either an absolute HDFS path or a local path:
LOAD DATA INPATH 'data/shared_workspace/sw_cfnusdata/Alteryx_CPN_Sales_Data' OVERWRITE INTO TABLE sw_cfnusdata.CPN_Sales_Data_parquet;
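For example, a sketch of the HDFS route (the target HDFS directory is a placeholder; adjust the local source path to wherever your files actually live):
hdfs dfs -mkdir -p /data/sw_cfnusdata/Alteryx_CPN_Sales_Data
hdfs dfs -put /local/path/Alteryx_CPN_Sales_Data/* /data/sw_cfnusdata/Alteryx_CPN_Sales_Data/
and then:
LOAD DATA INPATH '/data/sw_cfnusdata/Alteryx_CPN_Sales_Data' OVERWRITE INTO TABLE sw_cfnusdata.CPN_Sales_Data_parquet;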
I'm trying to read data partitions in S3 from Trino.
What I did exactly:
I uploaded my data, with all its partitions, to S3. I have an Avro schema, which I put on the local file system.
Then I created an external Hive table pointing to the data location in S3 and to the Avro schema on the local file system.
The table is created.
Then, normally, I should be able to query my data and partitions in S3 from Trino.
trino> select * from hive.default.my_table;
It returns only the column names.
trino>select * from hive.default."my_table$partitions";
It returns only the names of the partitions.
Could you please suggest a way to read the data partitions in S3 from Trino?
Note that I'm using Apache Hive 2; even when I query the table in Hive to return the table partitions, it returns OK but doesn't display anything. I think that with Hive 2 we should use the MSCK command.
In Hive, uploading partition folders and files into S3 and creating the table is not enough; the partition metadata also has to be created. It is normal to have folders in the table location that are not mounted as partitions yet. To mount all existing sub-folders in the table location as partitions:
Use the MSCK REPAIR TABLE command:
MSCK [REPAIR] TABLE tablename;
or Amazon EMR version:
ALTER TABLE tablename RECOVER PARTITIONS;
This will create the partition metadata in the Hive metastore, and the partitions will become available.
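For example, if the table is partitioned by dt and the files are laid out like s3://bucket/path/my_table/dt=2021-01-01/... (the table and partition names here are only illustrative):
MSCK REPAIR TABLE my_table;
-- or, on Amazon EMR:
ALTER TABLE my_table RECOVER PARTITIONS;
-- check that the partitions are now registered:
SHOW PARTITIONS my_table;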
Read more details about both commands here: RECOVER PARTITIONS
I faced the same issue. Once the table is created, we need to manually sync the partition metadata to the metastore using the Trino command below.
CALL system.sync_partition_metadata('<schema>', '<table>', 'ADD');
Ref.: https://trino.io/episodes/5.html
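For example, assuming the session catalog is set to hive and using the schema and table from the question (the last argument can be ADD, DROP, or FULL):
CALL system.sync_partition_metadata('default', 'my_table', 'ADD');
-- afterwards the partitions should show up:
SELECT * FROM hive.default."my_table$partitions";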
I need to import data from a public S3 bucket whose URL was shared with me. How do I load the data into a Hive table?
I have tried the command below, but it's not working:
create external table airlines_info (.... ) row format
delimited fields terminated by '|' lines terminated by '\n'
stored as textfile location 'https://ml-cloud-dataset.....*.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txt is not a directory or unable to create one)
I am very new to Hive and I am not sure about the code. After creating the table, I also tried the code below to load the data into the Hive table, but that's not working either:
load data inpath 'https://ml-cloud-dataset.....*.txt' into table airlines_info;
The table location should be a directory in HDFS or S3, not a file and not an https link.
Download the file manually, put it on the local filesystem, and if you already have the table created, then use:
load data local inpath 'local_path_to_file' into table airlines_info;
If you do not have the table yet, create it and specify some location inside your S3 bucket. Alternatively, create a MANAGED table (remove EXTERNAL from your DDL) without specifying a location; it will create the location for you, and you can check it using the DESCRIBE FORMATTED command. Later you can convert the table to EXTERNAL if necessary using ALTER TABLE airlines_info SET TBLPROPERTIES('EXTERNAL'='TRUE');
Instead of the LOAD DATA command, you can simply copy the file into the table location using the AWS CLI (provide the correct local path and the table directory's S3 URL):
aws s3 cp C:\Users\My_user\Downloads\Airlines_data.txt s3://mybucket/path/airlines_info/
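Putting it together, a minimal sketch (the bucket and path match the aws s3 cp command above; the column list is made up for illustration since the full DDL wasn't shown):
CREATE EXTERNAL TABLE airlines_info (
  carrier STRING,   -- illustrative columns; use your real schema
  origin STRING,
  dest STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://mybucket/path/airlines_info/';
-- check the location Hive resolved for the table:
DESCRIBE FORMATTED airlines_info;
Once the aws s3 cp command above has copied the file into that folder, the data is queryable without any LOAD DATA statement.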
I have created an HDInsight cluster, but I want to upload a database through the portal and use Hive on it. What steps do I need to take?
I know how to use Hive, but I don't know how to connect the data being uploaded to the blob container with Hive. By the way, I am using PowerShell.
You need to link the container's storage account with the HDInsight cluster.
To do that, add the following property to core-site.xml:
<property>
<name>fs.azure.account.key.[STORAGE ACCOUNT NAME].blob.core.windows.net</name>
<value>[STORAGE ACCOUNT KEY]</value>
</property>
Once it's linked, you will be able to access that storage account.
To create a Hive table on data residing in the blob, use an external Hive table with its location pointing to the blob directory of your data.
Example:
CREATE EXTERNAL TABLE <tablename> (col1 datatype, ....)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://<container>@<storage_account>.blob.core.windows.net/PATH/OF/DATA/';
I'm using Hortonworks' Hue (basically a GUI that ties HDFS, Hive, and Pig together), and I want to load data that is already in HDFS into a table I just created.
Suppose the table's name is "test", and the path of the file that contains the data is:
/user/hdfs/test/test.txt
But I'm unable to load the data into the table. I tried:
load data local inpath '/user/hdfs/test/test.txt' into table test
But there's an error saying it can't find the file; there's no matching path.
I'm still so confused.
Any suggestions?
Thanks
As you said, you want to "load the data within the hdfs into my current created table".
But in your command you are using:
load data local inpath '/user/hdfs/test/test.txt' into table test
With the local keyword, it looks for the file in your local filesystem. But your file is in HDFS.
I think you need to remove the local keyword from your command.
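Something like this should work with the HDFS path from your question:
load data inpath '/user/hdfs/test/test.txt' into table test;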
Hope it helps...!!!
Since you are using Hue and the output says there is no matching path, I think you have to give the complete path.
For example:
load data local inpath '/home/cloudera/hive/Documents/info.csv' into table tablename;
In the same way, you can give the complete HDFS path where the document resides.
You can use a file in any other format as well.
Remove the local keyword, since it refers to the local file system and your file is in HDFS.