My data are distributed over multiple directories and multiple tab-separated files within those directories. The general structure looks like this:
s3://bucket_name/directory/{year}{month}/{iso_2}/{year}{month}{day}_table.bcp.gz
where {year} is the 4-digit year, {month} is the 2-digit month, {day} is the 2-digit day and {iso_2} is the ISO2 country code.
How do I set this up as a table in Athena?
Athena uses Hive DDL, so you just need to run a normal Hive create statement:
CREATE EXTERNAL TABLE table_name(
col_1 string,
...
col_n string)
PARTITIONED BY (
year_month string,
iso_2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://bucket_name/directory/';
Then register the directories as partitions of the table by running MSCK REPAIR TABLE table_name. Note that MSCK REPAIR TABLE only discovers directories named in Hive style (key=value), so if it adds nothing, or fails for some other reason (which it sometimes does in Athena), you'll need to run ADD PARTITION statements for all of your existing directories:
ALTER TABLE table_name ADD PARTITION
(year_month='201601', iso_2='US') LOCATION 's3://bucket_name/directory/201601/US/';
ALTER TABLE table_name ADD PARTITION
(year_month='201602', iso_2='US') LOCATION 's3://bucket_name/directory/201602/US/';
ALTER TABLE table_name ADD PARTITION
(year_month='201601', iso_2='GB') LOCATION 's3://bucket_name/directory/201601/GB/';
etc.
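Writing those statements by hand gets tedious once there are many months and countries. As a rough sketch, the statements could be generated with a small script. The function name and its arguments are made up for illustration, the partition columns are taken from the CREATE statement above (year_month and iso_2), and the script assumes every month/country combination you pass in actually exists in S3:

```python
from itertools import product


def add_partition_statements(table, base_location, year_months, iso_codes):
    """Build one ALTER TABLE ... ADD PARTITION statement per (year_month, iso_2) pair."""
    base = base_location.rstrip("/")
    statements = []
    for year_month, iso_2 in product(year_months, iso_codes):
        # Partition values are strings, so they are quoted in the partition spec.
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
            f"(year_month='{year_month}', iso_2='{iso_2}') "
            f"LOCATION '{base}/{year_month}/{iso_2}/';"
        )
    return statements


if __name__ == "__main__":
    for stmt in add_partition_statements(
        "table_name", "s3://bucket_name/directory/", ["201601", "201602"], ["US", "GB"]
    ):
        print(stmt)
```

The printed statements can then be pasted into the Athena query editor or run one at a time from whatever client you use.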
I have created a partitioned table. I tried two versions of the table for my S3 bucket layout, shown below, but with both I get no records when I query with a WHERE clause on the partition columns.
My S3 bucket looks like the following. The part*.csv files are what I want to query in Athena. There are other folders at the same level as output, and within output.
s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/part-filename121231.csv
s3://bucket-rootname/XYZ-CASE/report/678d1234-2c3a-481b-a1eb-5169d2a97747/output/part-filename213123.csv
My table looks like the following.
Version 1:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",id="f78dea49-2c3a-481b-a1eb-5169d2a97747") location "s3://bucket-rootname/casename=ABC-CASE/report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
select * from mytable_trial1 where casename='ABC-CASE' and report='report' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747' and foldername='output';
Version 2:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`report` string,
`id` string,
`foldername` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/casename=ABC-CASE/report=report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/foldername=output/";
select * from mytable_trial1 where casename='ABC-CASE' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747'
SHOW PARTITIONS lists this partition, but no records are returned with the WHERE clause.
I worked with AWS Support and we were able to narrow down the issue. Version 2 was the right one to use, since it has four partition columns, matching the four folder levels in my S3 bucket. The ALTER TABLE command also had a problem with its location: I used a Hive-style location (key=value folder names), which was incorrect because my actual S3 layout is not in Hive format. Correcting the command to the following worked for me.
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
Preview table now shows my entries.
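With many case folders, the same fix has to be repeated for every partition. A minimal sketch of deriving the statement from a data file's S3 URL follows. It assumes every file sits exactly four folders deep, in the order casename/report/id/foldername as in Version 2, and the helper name is hypothetical:

```python
from urllib.parse import urlparse

# Partition columns in the same order as the folder levels in the bucket.
PARTITION_COLUMNS = ("casename", "report", "id", "foldername")


def partition_statement(table, object_url):
    """Derive an ADD PARTITION statement from one data file's S3 URL."""
    parsed = urlparse(object_url)
    parts = parsed.path.strip("/").split("/")
    # The last component is the file name; the ones before it are partition values.
    folders = parts[:-1]
    if len(folders) != len(PARTITION_COLUMNS):
        raise ValueError(
            f"expected {len(PARTITION_COLUMNS)} folder levels, got {len(folders)}"
        )
    spec = ", ".join(f"{col}='{val}'" for col, val in zip(PARTITION_COLUMNS, folders))
    # The location is the real (non-Hive-style) folder, not a key=value path.
    location = f"s3://{parsed.netloc}/" + "/".join(folders) + "/"
    return f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) LOCATION '{location}';"


if __name__ == "__main__":
    print(partition_statement(
        "mytable_trial1",
        "s3://bucket-rootname/ABC-CASE/report/"
        "f78dea49-2c3a-481b-a1eb-5169d2a97747/output/part-filename121231.csv",
    ))
```

The point of the sketch is that the partition values come from the folder names, while the LOCATION keeps the folders exactly as they exist in S3.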
I'm trying to create a Hive table, but because of the folder structure it is going to take hours just to recover the partitions.
Below is an example of what I am currently using to create the table, but it would be really helpful if I could filter the partitioning.
In the example below I need every child_company, just one year, every month, and just one type of report.
Is there any way to do something like set hcat.dynamic.partitioning.custom.pattern = '${child_company}/year=${2016}/${month}/report=${inventory}'; when partitioning, to avoid having to read through all the folders (> 300k)?
Language: Hive
Version: 1.2
Interface: Qubole
use my_database;
set hcat.dynamic.partitioning.custom.pattern = '${child_company}/${year}/${month}/${report}';
drop table if exists table_1;
create external table table_1
(
Date_Date string,
Product string,
Quantity int,
Cost int
)
partitioned by
(
child_company string,
year int,
month int,
report string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
location 's3://mycompany-myreports/parent/partner_company-12345';
alter table table_1 recover partitions;
show partitions table_1;
I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
The fields in your file should be in this order:
int,string
Here, your file's contents are in this order instead:
string, int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords (such as date) as column names.
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder with a csv file corresponding to the structure of the table, data will be automatically loaded and you can already use select queries to see the data.
Solved!
Let's imagine I store one file per day in a format:
/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv
How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to 2016-08-03)?
Assuming all the files follow the same schema, I would suggest storing them with the following naming convention:
/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv
You could then create an external table partitioned by dt and pointing to the location /path/to/files/:
CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'
If you have several partitions and don't want to write an alter table yourtable add partition ... query for each one, you can simply use the repair command, which adds the partitions automatically:
msck repair table yourtable
You can then simply select data within a date range by specifying the partition range
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
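To get from the original year/month/day.csv layout to the dt=... layout suggested above, each file has to be moved to a new path. A small sketch of that mapping is below. The function name is hypothetical, it only computes the target path (the actual move would be done with your usual HDFS or S3 tooling), and it assumes every file matches the .../YYYY/MM/DD.csv pattern from the question:

```python
import re

# Matches paths such as /path/to/files/2016/07/31.csv
OLD_PATH = re.compile(
    r"^(?P<root>.*)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})\.csv$"
)


def partitioned_path(old_path, file_name="data.csv"):
    """Map a year/month/day.csv path to the Hive-style dt=YYYY-MM-DD layout."""
    m = OLD_PATH.match(old_path)
    if m is None:
        raise ValueError(f"path does not match YYYY/MM/DD.csv layout: {old_path}")
    return f"{m['root']}/dt={m['year']}-{m['month']}-{m['day']}/{file_name}"


if __name__ == "__main__":
    for path in [
        "/path/to/files/2016/07/31.csv",
        "/path/to/files/2016/08/01.csv",
        "/path/to/files/2016/08/02.csv",
    ]:
        print(path, "->", partitioned_path(path))
```

Because dt is stored as a zero-padded YYYY-MM-DD string, plain string comparison in the BETWEEN clause orders the partitions the same way as the dates themselves.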
Without moving your file:
Design your table schema. In the Hive shell, create the table (partitioned by date)
Load the files into the table
Query with HiveQL (select * from table where dt between '2016-06-04' and '2016-08-03')
Moving your file:
Design your table schema. In the Hive shell, create the table (partitioned by date)
Move /path/to/files/2016/07/31.csv under /dbname.db/tableName/dt=2016-07-31, and likewise for the other files; then you'll have
/dbname.db/tableName/dt=2016-07-31/file1.csv
/dbname.db/tableName/dt=2016-08-01/file1.csv
/dbname.db/tableName/dt=2016-08-02/file1.csv
Add the partition with
alter table tableName add partition (dt='2016-07-31');
See Add partitions
In spark-shell, read the Hive table. Given a data file at:
/path/to/data/user_info/dt=2016-07-31/0000-0
1. Create the SQL statement
val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"
2. Run it
spark.sql(sql)
3. Add the partition
val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")
4. Now you can select data from the table
val df = spark.sql("select * from user_info")
I'm trying to create a bucketed table in Hive using the following command:
hive> create table emp( id int, name string, country string)
clustered by( country)
row format delimited
fields terminated by ','
stored as textfile ;
The command executes successfully. When I load data into this table, that also succeeds, and all the data is shown when using select * from emp.
However, on HDFS there is only one table directory containing a single file with all the data. That is, there is no folder for each country's records.
First of all, in the DDL statement you have to explicitly mention how many buckets you want.
create table emp( id int, name string, country string)
clustered by( country)
INTO 2 BUCKETS
row format delimited
fields terminated by ','
stored as textfile ;
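In the above statement I have specified 2 buckets; you can specify any number you want.
You are still not done!
You also have to set the following property before writing data into the table. Note that the data must be written with an INSERT ... SELECT from another table; LOAD DATA only copies files into the table directory and does not redistribute rows into buckets.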
set hive.enforce.bucketing = true;
That should do it.
After this you should see that the number of files created under the table directory is the same as the number of buckets specified in the DDL statement.
Bucketing doesn't create HDFS folders; if you want a separate folder for each country, you should PARTITION the table by country instead.
Please read up on Hive partitioning and bucketing in detail.
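To make the difference concrete, here is a simplified illustration of why bucketing yields a fixed number of files while partitioning yields one folder per distinct value. It assumes a Java-style string hash followed by a modulo on the bucket count; Hive's actual hash function may differ by version and column type, so treat the bucket numbers as illustrative only:

```python
def java_string_hash(text):
    """Java-style String.hashCode(), kept as a signed 32-bit value."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(value, num_buckets):
    """Map a clustering-column value to a bucket index in [0, num_buckets)."""
    # Clear the sign bit so the modulo result is never negative.
    return (java_string_hash(value) & 0x7FFFFFFF) % num_buckets


def files_per_layout(countries, num_buckets):
    """Return (buckets actually used, partitions needed) for the given rows."""
    buckets = {bucket_for(c, num_buckets) for c in countries}
    partitions = set(countries)
    return len(buckets), len(partitions)


if __name__ == "__main__":
    rows = ["US", "GB", "IN", "US"]
    for country in rows:
        print(country, "-> bucket", bucket_for(country, 2))
    print(files_per_layout(rows, 2))
```

Several countries can share one bucket file, which is why no per-country folder ever appears with CLUSTERED BY alone.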