Hive - partition external table by content of data

I have a bunch of gzipped files in HDFS under directories of the form /home/myuser/salesdata/some_date/ALL/<country>.gz , for instance /home/myuser/salesdata/20180925/ALL/us.gz
The data is of the form
<country> \t count1,count2,count3
So essentially it's first tab-separated, and then I need to extract the comma-separated values into separate columns.
I'd like to create an external table, partitioned by country, year, month and day. The data is pretty huge, potentially hundreds of TB, so I'd like to stick with an external table rather than duplicate the data by importing it into a standard table.
Is it possible to achieve this by using only an external table?

Considering that your country field is separated by a tab '\t' and the other fields are separated by ',', this is what you can do.
You can create a temporary table whose first column is a string and whose remaining counts form an array.
create external table temp.test_csv (country string, count array<int>)
row format delimited
fields terminated by "\t"
collection items terminated by ','
stored as textfile
location '/apps/temp/table';
Now if you drop your files into the /apps/temp/table location, you should be able to select the data as shown below.
select country, count[0] as count_1, count[1] count_2, count[2] count_3 from temp.test_csv
Now, to create the partitions, create another table as shown below.
drop table temp.test_csv_orc;
create table temp.test_csv_orc ( count_1 int, count_2 int, count_3 int)
partitioned by(year string, month string, day string, country string)
stored as orc;
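(If you need the target table to stay external, as the question asks, the same DDL also works with an external declaration and an explicit location; a sketch with a hypothetical path. Note that the INSERT below still rewrites the data as ORC either way.)
create external table temp.test_csv_orc (count_1 int, count_2 int, count_3 int)
partitioned by (year string, month string, day string, country string)
stored as orc
location '/apps/temp/table_orc';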
And load the data from the temporary table into this one.
insert into temp.test_csv_orc partition(year="2018", month="09", day="28", country)
select count[0] as count_1, count[1] count_2, count[2] count_3, country from temp.test_csv
I have taken country as a dynamic partition since its value comes from the file; the others don't, so they are static.
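One caveat, which depends on your Hive version and configuration: dynamic partition inserts require dynamic partitioning to be enabled. These are the standard Hive settings to run before the INSERT above:
set hive.exec.dynamic.partition = true;
-- nonstrict is only required when every partition column is dynamic;
-- here year/month/day are static, so strict mode would also work.
set hive.exec.dynamic.partition.mode = nonstrict;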

Related

Hive: How to create multiple tables from multiple files or count entries per file

My goal is to combine entries from multiple files into one table, but I am having some trouble getting there.
So I understand that you can add all entries into a table by doing:
CREATE EXTERNAL TABLE tablename
(
teams STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data';
Where each text file's data looks something like:
student#A18645
student#V86541
student#X78543
However, the code above combines the data from all N files in my directory, which makes the data hard to work with. What I want is for the first entries from all the files to be concatenated together into a single string and entered into a new table, and so forth.
I have tried to number each entry using ROW_NUMBER(), but that gives each entry's position in the table rather than its position within its file.
Therefore, is there a way I can create a table per file, number the entries, and join all the tables together, so that in the end I get a table that looks like:
number students
1 student#A18645,student#D94655,...student#S45892
2 student#V86541,student#D45645,...student#F46444
3 student#X78543,student#T78722,...student#M99846
Or rather, a way to number each entry with the line number of the file it came from, so I can do an inner join on my table.
Note: the number of files can vary, so I do not have a set number of files to loop through.
You can use this approach to build the final table.
Let's say that these are two files for two teams.
-- team1.txt
student#A18645
student#V86541
student#X78543
-- team2.txt
student#P20045
student#Q30041
student#R40043
Load them into HDFS, each file in a separate directory:
hadoop fs -mkdir /hive-data/team1
hadoop fs -mkdir /hive-data/team2
hadoop fs -put team1.txt /hive-data/team1
hadoop fs -put team2.txt /hive-data/team2
Create two tables in Hive, one for each team
CREATE EXTERNAL TABLE team1
(
teams STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data/team1';
CREATE EXTERNAL TABLE team2
(
teams STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data/team2';
Create the final table in Hive to hold combined data
CREATE TABLE teams
(
team_number INT,
students STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Populate the final table
(Since the comma is the field delimiter, a colon is used to build the concatenated string of students.)
INSERT INTO teams (team_number, students)
SELECT 1, CONCAT_WS(":", COLLECT_LIST(teams)) FROM team1;
INSERT INTO teams (team_number, students)
SELECT 2, CONCAT_WS(":", COLLECT_LIST(teams)) FROM team2;
Verify the final table
SELECT * FROM teams;
teams.team_number teams.students
1 student#A18645:student#V86541:student#X78543
2 student#P20045:student#Q30041:student#R40043
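As an aside, not part of the original answer: if creating one table per file is impractical because the number of files varies, Hive's virtual columns INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE can number entries per file directly on the single combined table. A sketch, assuming each record sits on its own line so byte offsets preserve line order:
SELECT ROW_NUMBER() OVER (PARTITION BY fname ORDER BY off) AS line_number,
       teams
FROM (
  -- virtual columns: source file name and byte offset of each row
  SELECT INPUT__FILE__NAME AS fname,
         BLOCK__OFFSET__INSIDE__FILE AS off,
         teams
  FROM tablename
) t;
Grouping this by line_number with the same CONCAT_WS/COLLECT_LIST trick would then build the combined rows, though COLLECT_LIST gives no ordering guarantee across files.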

Filter Dynamic Partitioning in Apache Hive

I am trying to create a Hive table, but due to the folder structure it's going to take hours just to partition.
Below is an example of what I am currently using to create the table, but it would be really helpful if I could filter the partitioning.
In the below I need every child_company, just one year, every month, and just one type of report.
Is there any way to do something like set hcat.dynamic.partitioning.custom.pattern = '${child_company}/year=${2016}/${month}/report=${inventory}'; when partitioning, to avoid the need to read through all the folders (> 300k)?
Language: Hive
Version: 1.2
Interface: Qubole
use my_database;
set hcat.dynamic.partitioning.custom.pattern = '${child_company}/${year}/${month}/${report}';
drop table if exists table_1;
create external table table_1
(
Date_Date string,
Product string,
Quantity int,
Cost int
)
partitioned by
(
child_company string,
year int,
month int,
report string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
location 's3://mycompany-myreports/parent/partner_company-12345';
alter table table_1 recover partitions;
show partitions table_1;
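One common way to sidestep scanning all 300k+ folders (a sketch with hypothetical child_company and path values, assuming the S3 layout matches the custom pattern above) is to register only the partitions you actually need, instead of recovering everything:
alter table table_1 add if not exists
partition (child_company='some_company', year=2016, month=1, report='inventory')
location 's3://mycompany-myreports/parent/partner_company-12345/some_company/2016/1/inventory';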

Athena table with multiple locations

My data are distributed over multiple directories and multiple tab-separated files within those directories. The general structure looks like this:
s3://bucket_name/directory/{year}{month}/{iso_2}/{year}{month}{day}_table.bcp.gz
where {year} is the 4-digit year, {month} is the 2-digit month, {day} is the 2-digit day and {iso_2} is the ISO2 country code.
How do I set this up as a table in Athena?
Athena uses Hive DDL, so you just need to run a normal Hive create statement:
CREATE EXTERNAL TABLE table_name(
col_1 string,
...
col_n string)
PARTITIONED BY (
year_month string,
iso_2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://bucket_name/directory/';
Then register these directories as new partitions by running MSCK REPAIR TABLE table_name. Note that MSCK only discovers partitions stored in Hive-style key=value directories; since your paths look like {year}{month}/{iso_2}/ instead, it will likely fail (as it sometimes does in Athena even for compatible layouts), in which case you'll need to run explicit ADD PARTITION statements for your existing directories:
ALTER TABLE table_name ADD PARTITION
(year_month='201601', iso_2='US') LOCATION 's3://bucket_name/directory/201601/US/';
ALTER TABLE table_name ADD PARTITION
(year_month='201602', iso_2='US') LOCATION 's3://bucket_name/directory/201602/US/';
ALTER TABLE table_name ADD PARTITION
(year_month='201601', iso_2='GB') LOCATION 's3://bucket_name/directory/201601/GB/';
etc.
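Where many partitions are involved, Hive and Athena both accept multiple partition specs in a single statement, which cuts down on round-trips; a sketch reusing the values above:
ALTER TABLE table_name ADD
PARTITION (year_month='201601', iso_2='US') LOCATION 's3://bucket_name/directory/201601/US/'
PARTITION (year_month='201602', iso_2='US') LOCATION 's3://bucket_name/directory/201602/US/'
PARTITION (year_month='201601', iso_2='GB') LOCATION 's3://bucket_name/directory/201601/GB/';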

Bucket is not creating on hadoop-hive

I'm trying to create a bucket in hive by using following commands:
hive> create table emp( id int, name string, country string)
clustered by( country)
row format delimited
fields terminated by ','
stored as textfile ;
The command executes successfully: when I load data into this table, it works and all the data is shown when using select * from emp.
However, on HDFS only one file containing all the data is created under the table directory. That is, there is no folder for specific country records.
First of all, in the DDL statement you have to explicitly state how many buckets you want.
create table emp( id int, name string, country string)
clustered by( country)
INTO 2 BUCKETS
row format delimited
fields terminated by ','
stored as textfile ;
In the statement above I have specified 2 buckets; similarly, you can specify any number you want.
Still, you are not done!
After that, while loading data into the table, you also have to give Hive the hint below:
set hive.enforce.bucketing = true;
That should do it.
After this, you should be able to see that the number of files created under the table directory is the same as the number of buckets specified in the DDL statement.
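A related caveat, assuming the data was loaded with LOAD DATA: that command only moves files into the table directory without rewriting them, so even with the flag set, bucketed files are produced only by an INSERT ... SELECT that runs through reducers. A sketch using a hypothetical staging table emp_stage and file path:
create table emp_stage (id int, name string, country string)
row format delimited
fields terminated by ','
stored as textfile;
load data local inpath '/tmp/emp.csv' into table emp_stage;
set hive.enforce.bucketing = true;
-- this rewrite is what actually produces one file per bucket
insert into table emp select id, name, country from emp_stage;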
Bucketing doesn't create HDFS folders. If you want a separate folder to be created for each country, you should PARTITION the table instead.
Please go through Hive partitioning and bucketing in detail.

Is there a way to partition an existing text file with Impala without pre-splitting the files into the partitioned directories?

Say I have a single file "fruitsbought.csv" that contains many records that contain a date field.
Is it possible to "partition" for better performance by creating the "fruits" table based on that text file, so that all the rows in fruitsbought.csv that match a given partition (say, by year and month) end up in that partition?
Or do I have to, as a separate process, create a directory for each year and put the appropriate ".csv" files, filtered down for that year, into the directory structure on HDFS before creating the table in impala-shell?
I have heard that you can create an empty table, set up partitions, and then use INSERT statements that specify the partition each record goes into. In my current case, though, I already have a single fruitsbought.csv that contains every record I want, and I like how I can make that into a table right there (though it does not have partitioning).
Do I have to develop a separate process to pre-split the one file into multiple files sorted under the right partition? (The one file is very, very big.)
Create an external table using fruitsbought.csv (id is just an example; "....." stands for the rest of the columns in the table):
CREATE EXTERNAL TABLE fruitsboughexternal
(
id INT,
.....
mydate STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'somelocationwithfruitsboughtfile/';
Create table with partition on date
CREATE TABLE fruitsbought(id INT, .....)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Import the data into the fruitsbought table; the partition columns have to come last in the SELECT (and of course mydate has to be in a format Impala understands, such as 2014-06-20 06:05:25):
INSERT INTO fruitsbought PARTITION(year, month, day) SELECT id, ..., year(mydate), month(mydate), day(mydate) FROM fruitsboughexternal;
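As a small usage note, not part of the original answer: Impala's standard statements can confirm the partitions were created, and computing stats is generally advisable after a large load.
SHOW PARTITIONS fruitsbought;
COMPUTE STATS fruitsbought;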