Hive overwrite table with new s3 location - amazon-s3

I have a Hive external table pointing to a location on S3. I will be uploading a new file to this S3 location every day, and the data in my Hive table should be overwritten each time.
Every day my script creates a new folder under 's3://employee-data/' and places a CSV file there,
e.g. s3://employee-data/20190812/employee_data.csv
Now I want my Hive table to pick up the new file under the new folder every day and replace the existing data. I can get the folder name ('20190812') from my ETL.
Can someone help?
I tried ALTER TABLE ... SET LOCATION 'new location'. However, this does not overwrite the data.
create external table employee
(
name String,
hours_worked Integer
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://employee-data/';

Set the new location and the data will be accessible:
ALTER TABLE employee SET LOCATION 's3://employee-data/20190812/';
This statement only repoints the table at the new folder. Nothing is overwritten or deleted on S3; queries simply read the files in the new location instead.
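Since the ETL already knows the folder name, one way to wire this in is Hive variable substitution. A minimal sketch, assuming the statement is saved in a script file (the file name set_location.hql and the variable name load_date are made up for illustration):

```sql
-- set_location.hql (hypothetical file name)
-- Invoke from the ETL with the day's folder, e.g.:
--   hive --hivevar load_date=20190812 -f set_location.hql
ALTER TABLE employee SET LOCATION 's3://employee-data/${hivevar:load_date}/';

-- Sanity check: should now count only the rows from the new folder.
SELECT COUNT(*) FROM employee;
```

The older folders stay on S3 untouched, so going back to a previous day is just another SET LOCATION.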
Or alternatively make the table partitioned:
create external table employee
(
name String,
hours_worked Integer
)
partitioned by (load_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://employee-data/';
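Then run ALTER TABLE employee RECOVER PARTITIONS; (this is the Amazon EMR form; on Apache Hive the equivalent is MSCK REPAIR TABLE employee;)
Every date folder is then mounted as a separate partition, provided the folders follow the load_date=20190812 naming convention, and you can query a single day using
WHERE load_date='20190812'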
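If the ETL keeps creating plain 20190812 folders rather than load_date=20190812, each day's folder can also be registered as a partition explicitly. A sketch, assuming the partitioned table definition above:

```sql
-- Register the day's folder as a partition, pointing at its actual path.
ALTER TABLE employee ADD IF NOT EXISTS PARTITION (load_date='20190812')
LOCATION 's3://employee-data/20190812/';

-- Query only the latest load.
SELECT name, hours_worked
FROM employee
WHERE load_date='20190812';
```

This keeps every day's data queryable, and "overwrite" becomes a matter of filtering on the latest load_date.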

Related

Find hive external table name from HDFS directory

Is it possible to get the external table name if the only information I have is the HDFS directory.
For example, I create the table with
CREATE EXTERNAL TABLE IF NOT EXISTS userinfo(id String, name String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///user/testuser/log/2019-02-18/';
To get the location from table name, I can use
show create table userinfo;
But what if I want to get the table name from "hdfs:///user/testuser/log/2019-02-18/"?
Is it possible to find the table name "userinfo" from the directory?
Thanks
David

Parquet Files Generation with hive

I'm trying to generate some Parquet files with Hive. To accomplish this, I loaded a regular Hive table from some .tbl files through this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these two lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet files. Instead I find files named like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The INSERT statement doesn't create files with an extension, but these are Parquet files.
You can use DESCRIBE FORMATTED <table> to show the table information, including its storage format.
hive> DESCRIBE FORMATTED <table_name>;
Additional note: you can also create a new table from a source table using the query below:
CREATE TABLE new_test STORED AS PARQUET AS SELECT * FROM source_table;
It creates the new table in Parquet format and copies the structure as well as the data.
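To confirm the storage format, these are the lines to look for in the DESCRIBE FORMATTED output of a Parquet table. The compression part is only a guess at why the files come out larger than the source (Parquet output from Hive may be written uncompressed unless told otherwise), and the table name region_snappy is made up for the example:

```sql
DESCRIBE FORMATTED parquet_region;
-- Expect, among other lines:
--   SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
--   InputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

-- Hypothetical variant of the CTAS above that requests Snappy compression.
CREATE TABLE region_snappy
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM region;
```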

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
Your file should have its columns in this order:
int,string
Here your file contents are in the order below:
string, int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords (such as date) as column names.
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder containing a CSV file that matches the structure of the table, the data is picked up automatically and you can run SELECT queries against it right away.
Solved!
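For reference, a sketch of the layout and checks this relies on. The file name data.csv is a placeholder, and MSCK REPAIR TABLE is shown as an alternative to adding each partition by hand when the folders follow the date_string=... naming:

```sql
-- Expected layout on HDFS (data.csv is a placeholder name):
--   /flumania/google_analytics/date_string=2016-09-06/data.csv

-- Register every date_string=... folder found under the table location.
MSCK REPAIR TABLE google_analytics;

-- Confirm the partition is known and readable.
SHOW PARTITIONS google_analytics;
SELECT * FROM google_analytics WHERE date_string = '2016-09-06';
```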

Bucket is not creating on hadoop-hive

I'm trying to create a bucketed table in Hive using the following commands:
hive> create table emp( id int, name string, country string)
clustered by( country)
row format delimited
fields terminated by ','
stored as textfile ;
The command executes successfully. When I load data into this table, the load also succeeds and all the data is shown by select * from emp.
However, on HDFS only one table directory is created, containing a single file with all the data. That is, there is no folder for each country's records.
First of all, in the DDL statement you have to explicitly mention how many buckets you want.
create table emp( id int, name string, country string)
clustered by( country)
INTO 2 BUCKETS
row format delimited
fields terminated by ','
stored as textfile ;
In the above statement I have specified 2 buckets; you can specify any number you want.
You are still not done!
After that, before loading data into the table, you also have to apply the setting below in Hive.
set hive.enforce.bucketing = true;
That should do it.
After this you should see that the number of files created under the table directory matches the number of buckets specified in the DDL statement.
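One detail worth spelling out: LOAD DATA only moves the file into the table directory, so the rows are bucketed only when they are written through an INSERT. A minimal sketch, assuming a staging table (emp_staging and the input path are made up for illustration):

```sql
-- Hypothetical staging table holding the raw comma-separated rows.
CREATE TABLE emp_staging (id INT, name STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Placeholder path; point this at the real input file.
LOAD DATA LOCAL INPATH '/path/to/emp.csv' INTO TABLE emp_staging;

-- Needed on Hive 1.x. The property was removed in Hive 2.0, where
-- bucketing is always enforced, so leave this line out there.
SET hive.enforce.bucketing = true;

-- Writing through INSERT lets Hive hash rows on country into the buckets.
INSERT OVERWRITE TABLE emp
SELECT id, name, country FROM emp_staging;
```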
Bucketing doesn't create HDFS folders. If you want a separate folder to be created for each country, you should PARTITION the table instead.
Please go through Hive partitioning and bucketing in detail.
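To illustrate the partitioned variant, here is a sketch that yields one folder per country. The table name emp_by_country is made up; the dynamic partition settings are standard Hive properties:

```sql
-- The partition column is declared separately from the regular columns.
CREATE TABLE emp_by_country (id INT, name STRING)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Let Hive derive the partition value from the data.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE emp_by_country PARTITION (country)
SELECT id, name, country FROM emp;
```

Each distinct country then gets its own country=<value> folder under the table directory.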

Exporting Hive Table to a S3 bucket

I've created a Hive Table through an Elastic MapReduce interactive session and populated it from a CSV file like this:
CREATE TABLE csvimport(id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/home/hadoop/file.csv' OVERWRITE INTO TABLE csvimport;
I now want to store the Hive table in an S3 bucket so the table is preserved once I terminate the MapReduce instance.
Does anyone know how to do this?
Yes, you have to export your data at the end of your Hive session and import it again at the start of the next one.
To do this, you need to create a table that is mapped onto an S3 bucket and directory:
CREATE TABLE csvexport (
id BIGINT, time STRING, log STRING
)
row format delimited fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
LOCATION 's3n://bucket/directory/';
Insert the data into the S3 table; when the insert is complete, the directory will contain a CSV file:
INSERT OVERWRITE TABLE csvexport
select id, time, log
from csvimport;
Your table is now preserved, and when you create a new Hive instance you can reimport your data.
Your table can be stored in a few different formats, depending on where you want to use it.
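The reimport step could look roughly like this in the new session. It is a sketch that reuses the bucket/directory placeholder from above; the backticks around time are a precaution in case the Hive version treats it as a reserved word:

```sql
-- Run in the new Hive session. The files already sit in S3, so defining
-- the table over the same location is enough to see the data again.
CREATE EXTERNAL TABLE csvexport (
  id BIGINT, `time` STRING, log STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3n://bucket/directory/';

-- Optional: copy the rows back into a cluster-local table.
CREATE TABLE csvimport AS
SELECT id, `time`, log FROM csvexport;
```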
The above query needs to use the EXTERNAL keyword, i.e.:
CREATE EXTERNAL TABLE csvexport ( id BIGINT, time STRING, log STRING )
row format delimited fields terminated by ',' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 's3n://bucket/directory/';
INSERT OVERWRITE TABLE csvexport select id, time, log from csvimport;
Another alternative is to use the query
INSERT OVERWRITE DIRECTORY 's3n://bucket/directory/' select id, time, log from csvimport;
The data is then stored in the S3 directory with Hive's default delimiters.
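If comma-separated output is wanted instead of the default delimiters, the row format can be given on the statement itself. A sketch; as far as I know this form works for non-LOCAL directories only from Hive 1.2.0 on, so treat it as version dependent:

```sql
-- Same export, but with an explicit field delimiter.
-- Backticks guard against time being a reserved word in some versions.
INSERT OVERWRITE DIRECTORY 's3n://bucket/directory/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT id, `time`, log FROM csvimport;
```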
If you can access the AWS console and have the "Access Key Id" and "Secret Access Key" for your account,
you can try this too:
CREATE TABLE csvexport(id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3n://<access id>:<secret key>@bucket/folder/path';
Now insert the data as others stated above:
INSERT OVERWRITE TABLE csvexport select id, time, log from csvimport;