Read multiple files in Hive table by date range - hive

Let's imagine I store one file per day in a format:
/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv
How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to 2016-08-03)?

Assuming every file follows the same schema, I would suggest that you store the files with the following naming convention:
/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv
You could then create an external table partitioned by dt and pointing to the location /path/to/files/
CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'
If you have several partitions and don't want to write alter table yourtable add partition ... queries for each one, you can simply use the repair command, which will automatically add the partitions.
msck repair table yourtable
You can then simply select data within a date range by specifying the partition range
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
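For reference, the manual alternative to msck repair mentioned above is one alter table statement per day; a sketch using the example directories above:
ALTER TABLE yourtable ADD PARTITION (dt='2016-07-31');
ALTER TABLE yourtable ADD PARTITION (dt='2016-08-01');
ALTER TABLE yourtable ADD PARTITION (dt='2016-08-02');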

Without moving your files:
Design your table schema and, in the Hive shell, create the table (partitioned by date).
Load the files into the table (LOAD DATA).
Query with HiveQL (select * from table where dt between '2016-06-04' and '2016-08-03'); see the sketch below.
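A minimal sketch of these three steps, where the table and column names are assumptions:
-- create a table partitioned by date (table and column names are placeholders)
CREATE TABLE daily_data (id INT, value INT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- load one day's file into its partition (LOAD DATA INPATH moves the file; Hive handles the move)
LOAD DATA INPATH '/path/to/files/2016/07/31.csv' INTO TABLE daily_data PARTITION (dt='2016-07-31');
-- query a date range
SELECT * FROM daily_data WHERE dt BETWEEN '2016-06-04' AND '2016-08-03';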
Moving your file:
Design your table schema. In hive shell, create the table (partitioned by date)
Move /path/to/files/2016/07/31.csv under /dbname.db/tableName/dt=2016-07-31 (and do the same for the other days); you'll then have
/dbname.db/tableName/dt=2016-07-31/file1.csv
/dbname.db/tableName/dt=2016-08-01/file1.csv
/dbname.db/tableName/dt=2016-08-02/file1.csv
Load the partition with
alter table tableName add partition (dt='2016-07-31');
See Add Partitions in the Hive DDL documentation.

In spark-shell, you can read the Hive table as follows. Suppose the data sits at:
/path/to/data/user_info/dt=2016-07-31/0000-0
1. Create the DDL statement:
val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"
2. Run it:
spark.sql(sql)
3. Add the partition:
val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")
4. Now you can select data from the table:
val df = spark.sql("select * from user_info")
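To answer the original date-range question from spark-shell, the same partition filter applies; a sketch using the dates that appear above:
val range = spark.sql("select * from user_info where dt between '2016-07-31' and '2016-09-21'")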

Related

Athena returns blank response for Partitioned data, what am I missing?

I have created a partitioned table. I tried two approaches for my S3 bucket folder, shown below, but with both I get no records found when I query with a WHERE clause containing the partition columns.
My S3 bucket looks like the following; part*.csv is what I want to query in Athena. There are other folders at the same level alongside output, and within output as well.
s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/part-filename121231.csv
s3://bucket-rootname/XYZ-CASE/report/678d1234-2c3a-481b-a1eb-5169d2a97747/output/part-filename213123.csv
My table looks like the following:
Version 1:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",id="f78dea49-2c3a-481b-a1eb-5169d2a97747") location "s3://bucket-rootname/casename=ABC-CASE/report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
select * from mytable_trial1 where casename='ABC-CASE' and report='report' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747' and foldername='output';
Version 2:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`report` string,
`id` string,
`foldername` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/casename=ABC-CASE/report=report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/foldername=output/";
select * from mytable_trial1 where casename='ABC-CASE' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747'
SHOW PARTITIONS shows this partition, but no records are found with the WHERE clause.
I worked with AWS Support and we were able to narrow down the issue. Version 2 was the right one to use, since it has four partition columns matching the four folder levels in my S3 path. Also, the ALTER TABLE command had an issue with the location: I had used a Hive-format (key=value) location, which was incorrect since my actual S3 path is not in Hive format. Correcting the command to the following worked for me.
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
Preview table now shows my entries.
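Because the S3 layout is not in Hive key=value format, MSCK REPAIR TABLE cannot discover these partitions, so each one has to be added explicitly the same way; for example, the XYZ-CASE prefix shown above would be registered with:
ALTER TABLE mytable_trial1 add partition (casename="XYZ-CASE",report="report",id="678d1234-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/XYZ-CASE/report/678d1234-2c3a-481b-a1eb-5169d2a97747/output/";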

Filter Dynamic Partitioning in Apache Hive

I'm trying to create a Hive table, but due to the folder structure it's going to take hours just to partition it.
Below is an example of what I am currently using to create the table, but it would be really helpful if I could filter the partitioning.
In the example below I need every child_company, just one year, every month, and just one type of report.
Is there any way to do something like set hcat.dynamic.partitioning.custom.pattern = '${child_company}/year=${2016}/${month}/report=${inventory}'; when partitioning, to avoid the need to read through all the folders (> 300k)?
Language: Hive
Version: 1.2
Interface: Quobole
use my_database;
set hcat.dynamic.partitioning.custom.pattern = '${child_company}/${year}/${month}/${report}';
drop table if exists table_1;
create external table table_1
(
Date_Date string,
Product string,
Quantity int,
Cost int
)
partitioned by
(
child_company string,
year int,
month int,
report string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
location 's3://mycompany-myreports/parent/partner_company-12345';
alter table table_1 recover partitions;
show partitions table_1;
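Not an answer from the original thread, but one way to avoid scanning all 300k+ folders is to skip recover partitions and register only the partitions you need, each with an explicit location; a sketch where the child_company and month values are hypothetical placeholders (year 2016 and report 'inventory' come from the question):
-- 'some_company' and month=1 are placeholders; adjust to the folders you actually need
alter table table_1 add partition (child_company='some_company', year=2016, month=1, report='inventory')
location 's3://mycompany-myreports/parent/partner_company-12345/some_company/2016/1/inventory/';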

Overwrite hive schema metadata without dropping and creating

Say I have a predefined Hive table with partitions loaded into it.
CREATE EXTERNAL TABLE t1
(
c1 STRING
)
PARTITIONED BY ( dt STRING )
LOCATION...
ALTER TABLE t1 ADD PARTITION ( dt = '2017-01-01' )
Now I got a new text representing the schema:
CREATE EXTERNAL TABLE t1
(
user_id STRING
)
PARTITIONED BY ( dt STRING )
LOCATION...
If I drop and then recreate the table, I'll lose the partition info.
I am looking for a way to redefine the column schema without manually adding/removing/renaming columns (this isn't a one-time thing; I'm trying to automate a schema update process).
I found a way to do 'almost' what I needed:
Hive supports REPLACE COLUMNS, which means I can replace all the old columns with new ones.
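A minimal sketch of that, using the two schemas above (the partition column dt and the existing partitions are left in place):
ALTER TABLE t1 REPLACE COLUMNS (user_id STRING);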

Athena table with multiple locations

My data are distributed over multiple directories and multiple tab-separated files within those directories. The general structure looks like this:
s3://bucket_name/directory/{year}{month}/{iso_2}/{year}{month}{day}_table.bcp.gz
where {year} is the 4-digit year, {month} is the 2-digit month, {day} is the 2-digit day and {iso_2} is the ISO2 country code.
How do I set this up as a table in Athena?
Athena uses Hive DDL, so you just need to run a normal Hive create statement:
CREATE EXTERNAL TABLE table_name(
col_1 string,
...
col_n string)
PARTITIONED BY (
year_month string,
iso_2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://bucket_name/directory/';
Then register these directories as new partitions of the table by running MSCK REPAIR TABLE table_name. If this fails for some reason (which it sometimes does in Athena), you'll need to run an ADD PARTITION statement for each of your existing directories:
ALTER TABLE table_name ADD PARTITION
(year_month='201601',iso_2='US') LOCATION 's3://bucket_name/directory/201601/US/';
ALTER TABLE table_name ADD PARTITION
(year_month='201602',iso_2='US') LOCATION 's3://bucket_name/directory/201602/US/';
ALTER TABLE table_name ADD PARTITION
(year_month='201601',iso_2='GB') LOCATION 's3://bucket_name/directory/201601/GB/';
etc.
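Once the partitions are registered, a query that filters on the partition columns only reads the matching directories; for example:
SELECT * FROM table_name WHERE year_month = '201601' AND iso_2 = 'US';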

Create External Hive Table Pointing to HBase Table

I have a table named "HISTORY" in HBase, with the column family "VDS" and the columns ROWKEY, ID, START_TIME, END_TIME, and VALUE. I am using the Cloudera Hadoop distribution. I want to provide a SQL interface to the HBase table using Impala. To do this, we have to create a corresponding external table in Hive, right? So how do I create an external Hive table pointing to this HBase table?
Run the following code in Hive Query Editor:
CREATE EXTERNAL TABLE IF NOT EXISTS HISTORY
(
ROWKEY STRING,
ID STRING,
START_TIME STRING,
END_TIME STRING,
VALUE DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
(
"hbase.columns.mapping" = ":key,VDS:ID,VDS:START_TIME,VDS:END_TIME,VDS:VALUE"
)
TBLPROPERTIES("hbase.table.name" = "HISTORY");
Don't forget to refresh the Impala metadata after creating the external table, using the following shell command:
echo "INVALIDATE METADATA" | impala-shell;