Exporting Hive Table to a S3 bucket - amazon-s3

I've created a Hive Table through an Elastic MapReduce interactive session and populated it from a CSV file like this:
CREATE TABLE csvimport(id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/home/hadoop/file.csv' OVERWRITE INTO TABLE csvimport;
I now want to store the Hive table in a S3 bucket so the table is preserved once I terminate the MapReduce instance.
Does anyone know how to do this?

Yes you have to export and import your data at the start and end of your hive session
To do this you need to create a table that is mapped onto S3 bucket and directory
CREATE TABLE csvexport (
id BIGINT, time STRING, log STRING
)
row format delimited fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
LOCATION 's3n://bucket/directory/';
Insert data into s3 table and when the insert is complete the directory will have a csv file
INSERT OVERWRITE TABLE csvexport
select id, time, log
from csvimport;
Your table is now preserved and when you create a new hive instance you can reimport your data
Your table can be stored in a few different formats depending on where you want to use it.

Above Query needs to use EXTERNAL keyword, i.e:
CREATE EXTERNAL TABLE csvexport ( id BIGINT, time STRING, log STRING )
row format delimited fields terminated by ',' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 's3n://bucket/directory/';
INSERT OVERWRITE TABLE csvexport select id, time, log from csvimport;
An another alternative is to use the query
INSERT OVERWRITE DIRECTORY 's3n://bucket/directory/' select id, time, log from csvimport;
the table is stored in the S3 directory with HIVE default delimiters.

If you could access aws console and have the "Access Key Id" and "Secret Access Key" for your account
You can try this too..
CREATE TABLE csvexport(id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3n://"access id":"secret key"#bucket/folder/path';
Now insert the data as other stated above..
INSERT OVERWRITE TABLE csvexport select id, time, log from csvimport;

Related

Skipping header in hive is removing first line of my data

I have the following query in hive:
CREATE EXTERNAL TABLE shop.id_store (
person_id INT,
shop_category STRING
)
row format delimited fields terminated by ',' stored as textfile
LOCATION "user/schema/table"
tblproperties('skip.header.line.count'='1', 'external.table.purge'='true');
LOAD DATA INPATH 'tmp/ids.csv' OVERWRITE INTO TABLE shop.id_store;
INSERT OVERWRITE TABLE shop.id_store
SELECT
*
FROM
shop.id_store
my csv ids.csv, does contain headers, however i have noticed that the above code actually removes the first row of my actual data. What is going on?

Hive overwrite table with new s3 location

I have a hive external table point to a location on s3. My requirement is I will be uploading a new file to this s3 location everyday and the data in my hive table should be overwritten.
Every day my script will create a folder under 's3://employee-data/' and place a csv file there.
eg. s3://employee-data/20190812/employee_data.csv
Now I want my hive table to pick up this new file under new folder everyday and overwrite the existing data. I can get the folder name - '20190812' through my ETL.
Can someone help.
I tried ALTER table set location 'new location'. However, this does not overwrite the data.
create external table employee
{
name String,
hours_worked Integer
}
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://employee-data/';
Set new location and the data will be accessible:
ALTER table set location 's3://employee-data/20190812/';
This statement points table to the new location, nothing is being overwritten of course.
Or alternatively make the table partitioned:
create external table employee
(
name String,
hours_worked Integer
)
partitioned by (load_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://employee-data/';
then do ALTER TABLE employee recover partitions;
and all dates will be mounted in separate partitions and you can query them using
WHERE load_date='20190812'

Parquet Files Generation with hive

I'm trying to generate some parquet files with hive,to accomplish this i loaded a regular hive table from some .tbl files, throuh this command in hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this i just execute this 2 lines:
create table parquet_reion LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when i check the output generated in HDFS, i dont find any .parquet file, intead i find files names like 0000_0 to 0000_21, and the sum of their sizes are much bigger that the original tbl file.
What im i doing Wrong?
Insert statement doesn't create file with extension but these are the parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>
Additional Note: You can also create new table from source table using below query:
CREATE TABLE new_test row STORED AS PARQUET AS select * from source_table
It will create new table as parquet format and copies the structure as well as the data.

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
your file should be in this sequence.
int,string
here you file contents are in below sequence
string, int
change your file to below.
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also it is not recommended to use keywords as column names (date);
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder with a csv file corresponding to the structure of the table, data will be automatically loaded and you can already use select queries to see the data.
Solved!

why hive can‘t select data from hdfs when use partition?

I use flume to write data to hdfs,path like /hive/logs/dt=20151002.Then,i use hive to select data,but the count of response is always 0.
Here is my create table sql,CREATE EXTERNAL TABLE IF NOT EXISTS test (id STRING) partitioned by (dt string) ROW FORMAT DELIMITED fields terminated by '\t' lines terminated by '\n' STORED AS TEXTFILE LOCATION '/hive/logs'
Here is my select sql,select count(*) from test
It seems that you are not registering partition in hive meta-store.
Although partition is present in hdfs path,Hive won't know it if its not registered in meta store. To register it you can do the following:
ALTER TABLE test ADD PARTITION (dt='20151002') location '/hive/logs/dt=20151002';