Bucket is not creating on hadoop-hive - hive

I'm trying to create a bucket in hive by using following commands:
hive> create table emp( id int, name string, country string)
clustered by( country)
row format delimited
fields terminated by ','
stored as textfile ;
Command is executing successfully: when I load data into this table, it executes successfully and all data is shown when using select * from emp.
However, on HDFS it is only creating one table and only one file is there with all data. That is, there is no folder for specific country records.

First of all, in the DDL statement you have to explicitly mention how many buckets you want.
create table emp( id int, name string, country string)
clustered by( country)
INTO 2 BUCKETS
row format delimited
fields terminated by ','
stored as textfile ;
In the above statement I have mention 2 buckets, similarly you can mention any number you want.
Still you are not done!!
After that, while loading data into the table you also have to mention the below hint to hive.
set hive.enforce.bucketing = true;
That should do it.
After this you should be able to see that number of files created under the table directory is same as the number of buckets mentioned in the DDL statement.
Bucketing doesn't create HDFS folders, rather if you want a separate floder to be created for a country then you should PARTITION.
Please go through hive partitioning and bucketing in detail.

Related

Parquet Files Generation with hive

I'm trying to generate some parquet files with hive,to accomplish this i loaded a regular hive table from some .tbl files, throuh this command in hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this i just execute this 2 lines:
create table parquet_reion LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when i check the output generated in HDFS, i dont find any .parquet file, intead i find files names like 0000_0 to 0000_21, and the sum of their sizes are much bigger that the original tbl file.
What im i doing Wrong?
Insert statement doesn't create file with extension but these are the parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>
Additional Note: You can also create new table from source table using below query:
CREATE TABLE new_test row STORED AS PARQUET AS select * from source_table
It will create new table as parquet format and copies the structure as well as the data.

Athena table with multiple locations

My data are distributed over multiple directories and multiple tab-separated files within those directories. The general structure looks like this:
s3://bucket_name/directory/{year}{month}/{iso_2}/{year}{month}{day}_table.bcp.gz
where {year} is the 4-digit year, {month} is the 2-digit month, {day} is the 2-digit day and {iso_2} is the ISO2 country code.
How do I set this up as a table in Athena?
Athena uses Hive DDL, so you just need to run a normal Hive create statement:
CREATE EXTERNAL TABLE table_name(
col_1 string,
...
col_n string)
PARTITIONED BY (
year_month string,
iso_2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://bucket_name/directory/';
Then register these directories as new partitions to the required table by running MSCK REPAIR TABLE table_name. If this fails for some reason (which it sometimes does in Athena) you'll need to run all the add partition statements for your existing directories:
ALTER TABLE table_name ADD PARTITION
(year_month=201601,iso=US) LOCATION 's3://bucket_name/directory/201601/US/';
ALTER TABLE table_name ADD PARTITION
(year_month=201602,iso=US) LOCATION 's3://bucket_name/directory/201602/US/';
ALTER TABLE table_name ADD PARTITION
(year_month=201601,iso=GB) LOCATION 's3://bucket_name/directory/201601/GB/';
etc.

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
your file should be in this sequence.
int,string
here you file contents are in below sequence
string, int
change your file to below.
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also it is not recommended to use keywords as column names (date);
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder with a csv file corresponding to the structure of the table, data will be automatically loaded and you can already use select queries to see the data.
Solved!

Delete records from Hive table using filename

I have a use case where I build a hive table from a bunch of csv files. While writing csv information into hive table, I assign INPUT__FILE__NAME (part of the name) to one of the columns. When I want to the update the records for the same filename, I need to delete the records of the csv file before writing it again.
I use the below query but failed
CREATE EXTERNAL TABLE T_TEMP_CSV(
F_FRAME_RANK BIGINT,
F_FRAME_RATE BIGINT,
F_SOURCE STRING,
F_PARAMETER STRING,
F_RECORDEDVALUE STRING,
F_VALIDITY INT,
F_VALIDITY_INTERPRETATION STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
location '/user/baamarna5617/HUMS/csv'
TBLPROPERTIES ("skip.header.line.count"="2");
DELETE FROM T_RECORD
WHERE T_RECORD.F_SESSION = split(reverse(split(reverse(T_TEMP_CSV.INPUT__FILE__NAME),"/")[0]), "[.]")[0]
from T_TEMP_CSV;
The T_RECORD table has a column called F_SESSION which was assigned part of the INPUT__FILE__NAME using the split method shown above. I want to use the same method while removing those records. Can someone please point me where i am going wrong in this query?
I could successfully delete the records using the below syntax
DELETE FROM T_RECORD
WHERE F_SESSION = 68;
I need to get that 68 from the INPUT_FILE_NAME.

automatically partition Hive tables based on S3 directory names

I have data stored in S3 like:
/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN
/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...
My understanding is that if I pull in that data via Hive, it will automatically interpret date as a partition. My table creation looks like:
CREATE EXTERNAL TABLE search_input(
col 1 STRING,
col 2 STRING,
...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';
However Hive doesn't recognize any data. Any queries I run return with 0 results. If I instead just grab one of the dates via:
CREATE EXTERNAL TABLE search_input_20140701(
col 1 STRING,
col 2 STRING,
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';
I can query data just fine.
Why doesn't Hive recognize the nested directories with the "date=date_str" partition?
Is there a better way to have Hive run a query over multiple sub-directories and slice it based on a datetime string?
In order to get this to work I had to do 2 things:
Enable recursive directory support:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
For some reason it would still not recognize my partitions so I had to recover them via:
ALTER TABLE search_input RECOVER PARTITIONS;
You can use:
SHOW PARTITIONS table;
to check and see that they've been recovered.
I had faced the same issue and realized that hive does not have partitions metadata with it. So we need to add that metadata using ALTER TABLE ADD PARTITION query. It becomes tedious, if you have few hundred partitions to create same queries with different values.
ALTER TABLE <table name> ADD PARTITION(<partitioned column name>=<partition value>);
Once you run above query for all available partitions. You should see the results in hive queries.