We have a table (table1) which is partitioned on year, month and day.
I created a second table similar to table1, with the same partitions but of type ORC. I am trying to insert
data into its partitions using the following statement, but the data is dumped into folders that carry the partition names.
How can I make sure the folders don't have partition names in them?
create external table table1_orc(
col1 string,
col2 string,
col3 int)
PARTITIONED BY (
`year` string,
`month` string,
`day` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION '/base_path_orc/';
set hive.exec.dynamic.partition=true;
insert overwrite table table1_orc partition(year,month,day) select * from table1 where year = '2015' and month = '10' and day = '01';
Path to table1 in hdfs - /base_path/2015/10/01/data.csv
Path to orc table in hdfs (current output) -/base_path_orc/year=2015/month=10/day=01/000000_0
Desired output - /base_path_orc/2015/10/01/000000_0
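A hedged sketch of how that layout can be achieved: because table1_orc is external, each partition can be mapped to a plain folder with alter table ... add partition ... location, and a static-partition insert then writes into that folder (the paths below simply mirror the example above):
alter table table1_orc add partition (year='2015', month='10', day='01')
location '/base_path_orc/2015/10/01/';
insert overwrite table table1_orc partition (year='2015', month='10', day='01')
select col1, col2, col3 from table1 where year = '2015' and month = '10' and day = '01';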
Related
I need to create an external Hive table on top of a CSV file. The CSV has col1, col2, col3 and col4.
But my external Hive table should be partitioned on month, and my CSV file doesn't have any month field; col1 is the date field.
How can I do this?
You need to reload the data into a partitioned table.
Create a non-partitioned table (mytable) on top of the folder with the CSV, as in the sketch below.
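A minimal sketch of that staging table, assuming all-string columns and a hypothetical /path/to/csv/ folder:
create external table mytable(
col1 string, --date field
col2 string,
col3 string,
col4 string
)
row format delimited fields terminated by ','
stored as textfile
location '/path/to/csv/';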
Create a partitioned table (mytable_part):
create table mytable_part(
--columns specification here for col1, col2, col3, col4
)
partitioned by (part_month string) ...
stored as textfile --you can choose any format you need
Load data into the partitioned table using dynamic partitioning, calculating the partition column in the query:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table mytable_part partition (part_month)
select
col1, col2, col3, col4,
substr(col1, 1, 7) as part_month --partition column in yyyy-MM format
from mytable
distribute by substr(col1, 1, 7) --to reduce the number of files
;
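After the load, queries can prune on the new partition column, for example (the month value is illustrative):
select * from mytable_part where part_month = '2016-07';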
Try this way:
Copy the CSV data into a folder at HDFS location hdfs://somepath/5 and add that path to your external table as a partition.
create external table ext1(
col1 string
,col2 string
,col3 string
,col4 string
)
partitioned by (mm int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
alter table ext1 add partition(mm = 5) location 'hdfs://yourpath/5';
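After adding the partition, partition pruning works as usual, for example:
select * from ext1 where mm = 5;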
This "create table" statement is working correctly.
CREATE EXTERNAL TABLE default.no_details_2018_csv (
`id` string,
`client_id` string,
`client_id2` string,
`id_1` string,
`id_2` string,
`client_id3` string,
`code_1` string,
`code_2` string,
`code_3` string
)
STORED AS PARQUET
LOCATION 's3://some_bucket/athena-parquet/no_details/2018/'
tblproperties ("parquet.compress"="SNAPPY");
The data for the year 2018, available in Parquet format, can be found in that bucket/folder.
1) How do I add partitions to this table? I need to add the year 2019 data to the same table by referring to the new location of s3://some_bucket/athena-parquet/no_details/2019/ The data for both years is available in parquet (snappy) format.
2) Is it possible to partition by month instead of years? In other words, is it OK to have 24 partitions instead of 2? Will the new target table also have Parquet format just like the source data? The code_2 column mentioned above looks like this "20181013133839". I need to use the first 4 characters for yearly (or 6 for monthly) partitions.
First, the table needs to be created as a partitioned EXTERNAL TABLE. Check this sample:
CREATE EXTERNAL TABLE default.no_details_table (
`id` string,
`client_id` string,
`client_id2` string,
`id_1` string,
`id_2` string,
`client_id3` string,
`code_1` string,
`code_2` string,
`code_3` string
)
PARTITIONED BY (year string)
STORED AS PARQUET
LOCATION 's3://some_bucket/athena-parquet/no_details/'
tblproperties ("parquet.compress"="SNAPPY");
You can add a partition as
ALTER TABLE default.no_details_table ADD PARTITION (year='2018') LOCATION 's3://some_bucket/athena-parquet/no_details/2018/';
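And likewise for the 2019 data mentioned in the question (same pattern, different year and location):
ALTER TABLE default.no_details_table ADD PARTITION (year='2019') LOCATION 's3://some_bucket/athena-parquet/no_details/2019/';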
If you want more partitions, one for each month or day, create the table with
PARTITIONED BY (day string)
but then you need to put each day's data under a path like
s3://some_bucket/athena-parquet/no_details/20181013/
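A hedged sketch of registering one such day folder, assuming the table was created with PARTITIONED BY (day string) as just described (the 20181013 value mirrors the sample path):
ALTER TABLE default.no_details_table ADD PARTITION (day='20181013') LOCATION 's3://some_bucket/athena-parquet/no_details/20181013/';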
So, I have a table whose data is partitioned by date (dt) and stored in S3, where the partitions look like this
dt=2019-03-22/
dt=2019-03-23/
dt=2019-03-24/
and so on. What I want to do is change how I partition the data, from this pattern into sub-partitions like this
year=2019/month=03/day=22/
year=2019/month=03/day=23/
year=2019/month=03/day=24/
But I don't want to alter the original table, so I created an external table that points to another location in S3, which will be the location for this new partition pattern. I have tried creating a table that points to that location (with the same schema as the original one):
CREATE EXTERNAL TABLE `test_partition_new`(
`order_id` string,
`outlet_code` string,
.
.
.
.
`business_date` string,
.
.
.
.
)
PARTITIONED BY (
`year` string,
`month` string,
`day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://data-test/test_partition/db.new_partition/'
TBLPROPERTIES (
'orc.compress'='SNAPPY')
which will partition by year, month and day respectively. So from what I understand, I should insert data from the original table into this one. How should I insert data into this new table when the date to partition by comes from the column 'business_date', which contains data like '2019-03-20'? Is there any function that can separate this column into three columns containing year, month and day?
If the date format is consistent, you can split it into 3 columns and load.
INSERT INTO `test_partition_new` PARTITION(year,month,day)
SELECT --cols to select
,SPLIT(business_date,'-')[0] --year
,SPLIT(business_date,'-')[1] --month
,SPLIT(business_date,'-')[2] --day
FROM ORIGINAL_TABLE
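Because all three partition columns here are dynamic, Hive also needs dynamic partitioning enabled in nonstrict mode (the same settings used in an earlier answer above):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;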
I have created a partitioned external table in Hive that stores Parquet format files. I have a timestamp column in that table; when I load data, it gives nulls in the timestamp column.
Create table query:
CREATE EXTERNAL TABLE abc(
timestamp1 timestamp,
tagname string,
value string,
quality bigint,
own string)
PARTITIONED BY (
etldate string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'adl://refdatalakeprod.azuredatalakestore.net/iconic'
TBLPROPERTIES (
'PARQUET.COMPRESS'='SNAPPY');
Any suggestions pls?
Thanks in advance.
Your question is wrong: it's not a timestamp type, it is a string type. I think you need to check your data.
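A hedged workaround sketch, assuming the Parquet field really was written as a string in Hive's default timestamp format: declare the column as string in the DDL and cast at query time.
--assumes timestamp1 is stored as a string like 'yyyy-MM-dd HH:mm:ss'
select cast(timestamp1 as timestamp) as ts, tagname, value, quality, own
from abc;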
Let's imagine I store one file per day in a format:
/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv
How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to 2016-08-03)?
Assuming every file follows the same schema, I would then suggest that you store the files with the following naming convention:
/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv
You could then create an external table partitioned by dt and pointing to the location /path/to/files/
CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'
If you have several partitions and don't want to write alter table yourtable add partition ... queries for each one, you can simply use the repair command, which will automatically add the partitions.
msck repair table yourtable
You can then simply select data within a date range by specifying the partition range
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
Without moving your file:
Design your table schema. In hive shell, create the table (partitioned by date)
Load the files into the table partitions (see the sketch after this list)
Query with HiveQL (select * from table where dt between '2016-06-04' and '2016-08-03')
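A hedged sketch of those steps (table and column names are placeholders; LOAD DATA is one way to have Hive place a file under the right partition for you):
create table daily_table(id int, value int)
partitioned by (dt string)
row format delimited fields terminated by ',';
load data inpath '/path/to/files/2016/07/31.csv' into table daily_table partition (dt='2016-07-31');
select * from daily_table where dt between '2016-06-04' and '2016-08-03';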
Moving your file:
Design your table schema. In hive shell, create the table (partitioned by date)
move /path/to/files/2016/07/31.csv under /dbname.db/tableName/dt=2016-07-31, then you'll have
/dbname.db/tableName/dt=2016-07-31/file1.csv
/dbname.db/tableName/dt=2016-08-01/file1.csv
/dbname.db/tableName/dt=2016-08-02/file1.csv
load partition with
alter table tableName add partition (dt='2016-07-31');
See Add partitions
In spark-shell, read a Hive table whose data is at a path like
/path/to/data/user_info/dt=2016-07-31/0000-0
1. Create the SQL
val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"
2. Run it
spark.sql(sql)
3. Load data
val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")
4. Now you can select data from the table
val df = spark.sql("select * from user_info")