Hive creates partitions with leading zero

I'm using Hive 1.0.0-amzn-3.
In this version Hive creates partitions dynamically and ignores the declared type of the partition field. In my case it creates partitions with a leading zero, even though the field type is integer.
day_ts=16/hour_ts=05
When I upgraded to Hive 2.1.1-amzn-0,
the behavior changed: new partitions are aligned with the type of the field and the leading zero is removed.
day_ts=16/hour_ts=4
I'd appreciate any lead on how to make Hive 2.1 keep the leading zero.
Thanks.
Here is an example:
CREATE EXTERNAL TABLE `partitions_hell`(
`a` string,
`b` string
)
PARTITIONED BY (
`day_ts` bigint,
`hour_ts` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://xarkadiy-snbx/partitions_hell'
TBLPROPERTIES (
'orc.compress'='SNAPPY'
);
Run the following query in Hive 1.0.0:
INSERT INTO TABLE partitions_hell PARTITION(day_ts=16, hour_ts=05)
SELECT '11',
'rwr'
FROM partitions_hell;
Run the following query in Hive 2.1.0:
INSERT INTO TABLE partitions_hell PARTITION(day_ts=16, hour_ts=04)
SELECT '131', 'rw33rr'
FROM partitions_hell;
Result:
hive> show partitions default.partitions_hell;
OK
day_ts=16/hour_ts=05
day_ts=16/hour_ts=4
Time taken: 0.111 seconds, Fetched: 5 row(s)
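If keeping the literal zero matters more than the numeric type, one workaround is to declare hour_ts as a string partition column, since string partition values are stored verbatim and are not normalized. A minimal sketch, assuming the table can be recreated (the table name here is hypothetical):
CREATE EXTERNAL TABLE `partitions_hell_str`(
`a` string,
`b` string)
PARTITIONED BY (
`day_ts` bigint,
`hour_ts` string)
STORED AS ORC
LOCATION 's3://xarkadiy-snbx/partitions_hell_str';
-- With a string column, PARTITION(day_ts=16, hour_ts='05') keeps the '05' spelling.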

Related

Hive table with Avro Schema

I have created a Hive external table with an Avro schema (complex types) and partition columns. After adding the required partition files, a select query returns null values for all the columns except the partition columns. The Avro schema has array and struct types inside.
Here is the DDL:
CREATE EXTERNAL TABLE mytable PARTITIONED BY(date int, city string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/cloud/location'
TBLPROPERTIES ('avro.schema.url'='cloud location for schema file');
I tried giving the schema file directly, and tried with a schema literal as well in the TBLPROPERTIES.
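For reference, the schema-literal attempt looked roughly like this (a sketch; the record definition below is a simplified placeholder, not my real schema):
CREATE EXTERNAL TABLE mytable PARTITIONED BY(date int, city string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/cloud/location'
TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"mytable","fields":[{"name":"id","type":"int"},{"name":"tags","type":{"type":"array","items":"string"}}]}');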
Select query returns null for all the columns.
Any suggestions for fixing this issue? Is there anything missing in this scenario?

Data appears as null on a Redshift external table while working correctly on Athena

So I'm trying to run the following simple query on Redshift Spectrum:
select * from company.vehicles where vehicle_id is not null
and it returns 0 rows (all of the rows in the table come back as null). However, when I run the same query on Athena it works fine and returns results. I tried msck repair, but both Athena and Redshift use the same metastore, so it shouldn't matter.
I also don't see any errors.
The files are in ORC format.
The create table query is:
CREATE EXTERNAL TABLE `vehicles`(
`vehicle_id` bigint,
`parent_id` bigint,
`client_id` bigint,
`assets_group` int,
`drivers_group` int)
PARTITIONED BY (
`dt` string,
`datacenter` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'classification'='orc',
'compressionType'='none')
Any idea?
How did you create your external table?
For Spectrum, you have to explicitly set the parameters that say what should be treated as null.
Add the parameter 'serialization.null.format'='' in TABLE PROPERTIES so that all columns with '' will be treated as NULL in your external table in Spectrum:
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
Alternatively, you can set up SERDEPROPERTIES while creating the external table, which will automatically recognize NULL values.
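For example, a sketch of that variant (the column and location are placeholders; this assumes delimited text data):
CREATE EXTERNAL TABLE external_schema.your_table_name(
col1 varchar(100)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.null.format'='')
STORED AS TEXTFILE
LOCATION [filelocation];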
Eventually it turned out to be a bug in Redshift. In order to fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE PROPERTIES ('orc.schema.resolution'='position');
I had a similar problem and found this solution.
In my case I had external tables that were created with Athena pointing to an S3 bucket that contained heavily nested JSON data. To access them with Redshift I ran SET json_serialization_enable TO true; before my queries to make the nested JSON columns queryable. This led to some columns being NULL when the JSON exceeded a size limit; see here:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.
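For reference, the tutorial defines the nested structure directly in the external table, roughly like this (a sketch with placeholder names and location):
CREATE EXTERNAL TABLE spectrum.customers (
id int,
name struct<given:varchar(20), family:varchar(20)>,
phones array<varchar(20)>)
STORED AS PARQUET
LOCATION 's3://your-bucket/customers/';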

How to rename partition value in Hive?

I have a Hive table 'videotracking_playevent' which uses the following partition format (all strings): source/createyear/createmonth/createday.
Example: source=home/createyear=2016/createmonth=9/createday=1
I'm trying to update the partition values of createmonth and createday to consistently use double digits instead.
Example: source=home/createyear=2016/createmonth=09/createday=01
I've tried the following query:
ALTER TABLE videotracking_playevent PARTITION (
source='home',
createyear='2015',
createmonth='11',
createday='1'
) RENAME TO PARTITION (
source='home',
createyear='2015',
createmonth='11',
createday='01'
);
However, that returns the following non-descriptive error from Hive: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null
I've confirmed that this partition exists, and I think I'm using the correct syntax. My Hive version is 1.1.0.
Any ideas what I might be doing wrong?
There was an issue with renaming partitions in older versions of Hive. This might be the issue in your case too. Please see this link for details.
You need to set the two properties below before executing the rename partition command if you are using an older version of Hive:
set fs.hdfs.impl.disable.cache=false;
set fs.file.impl.disable.cache=false;
Now run the query after setting these properties:
hive> set fs.hdfs.impl.disable.cache=false;
hive> set fs.file.impl.disable.cache=false;
hive> ALTER TABLE partition_test PARTITION (year='2016',day='1') RENAME TO PARTITION (year='2016',day='01');
OK
Time taken: 0.28 seconds
hive> show partitions partition_test;
OK
year=2016/day=01
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive>
This issue is fixed in the latest Hive versions. In my case the Hive version is 1.2.1, and it works without setting those properties. Please see the example below.
Create a partitioned table:
hive> create table partition_test(
> name string,
> age int)
> partitioned by (year string, day string);
OK
Time taken: 5.35 seconds
hive>
Now add the partition and check the newly added partition.
hive> alter table partition_test ADD PARTITION (year='2016', day='1');
OK
Time taken: 0.137 seconds
hive>
hive> show partitions partition_test;
OK
year=2016/day=1
Time taken: 0.169 seconds, Fetched: 1 row(s)
hive>
Rename the partition using the RENAME TO PARTITION command and check it:
hive> ALTER TABLE partition_test PARTITION (year='2016',day='1') RENAME TO PARTITION (year='2016',day='01');
OK
Time taken: 0.28 seconds
hive> show partitions partition_test;
OK
year=2016/day=01
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive>
Hope it helps you.
Rename lets you change the value of a partition column. One of the use cases is that you can use this statement to normalize a legacy partition column value to conform to its type. In this case, the type conversion and normalization are not enabled for the column values in the old partition_spec, even with the property hive.typecheck.on.insert set to true (the default), which allows you to specify any legacy data in the form of strings in the old partition_spec.
The bug is still open:
https://issues.apache.org/jira/browse/HIVE-10362
You can create a copy of the table without the partitions, then update the column values, and then recreate the first one with partitions:
create table table_name partitioned by (table_column) as
select *
from source_table
That worked for me.
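A more explicit variant of that idea, as a sketch under stated assumptions (the table has a single hypothetical data column called payload, and dynamic partitioning is enabled):
-- Enable dynamic partitioning for the reload step.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Stage the data; partition columns become ordinary columns in the copy.
CREATE TABLE playevent_staging AS
SELECT * FROM videotracking_playevent;
-- Reload with zero-padded month/day values via dynamic partitioning.
INSERT OVERWRITE TABLE videotracking_playevent
PARTITION (source, createyear, createmonth, createday)
SELECT payload,
       source,
       createyear,
       LPAD(createmonth, 2, '0'),
       LPAD(createday, 2, '0')
FROM playevent_staging;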

Read multiple files in Hive table by date range

Let's imagine I store one file per day in the following format:
/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv
How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to 2016-08-03)?
Assuming all files follow the same schema, I would then suggest that you store the files with the following naming convention:
/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv
You could then create an external table partitioned by dt and pointing to the location /path/to/files/:
CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'
If you have several partitions and don't want to write alter table yourtable add partition ... queries for each one, you can simply use the repair command, which will automatically add partitions:
msck repair table yourtable
You can then simply select data within a date range by specifying the partition range
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
Without moving your files:
Design your table schema. In the hive shell, create the table (partitioned by date).
Load the files into the table.
Query with HiveQL: select * from table where dt between '2016-06-04' and '2016-08-03'
Moving your files:
Design your table schema. In the hive shell, create the table (partitioned by date).
Move /path/to/files/2016/07/31.csv under /dbname.db/tableName/dt=2016-07-31, and so on; then you'll have
/dbname.db/tableName/dt=2016-07-31/file1.csv
/dbname.db/tableName/dt=2016-08-01/file1.csv
/dbname.db/tableName/dt=2016-08-02/file1.csv
Load the partition with
alter table tableName add partition (dt='2016-07-31');
See Add partitions
In spark-shell, read a Hive table whose data is stored at
/path/to/data/user_info/dt=2016-07-31/0000-0
1. Create the SQL:
val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"
2. Run it:
spark.sql(sql)
3. Load the data:
val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")
4. Now you can select data from the table:
val df = spark.sql("select * from user_info")

Hive query is not working properly

I have created a Hive table to load data from another table. When I execute the query it starts but doesn't produce any results.
CREATE TABLE fact_orders1 (order_number String, created timestamp, last_upd timestamp)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
OK
Time taken: 0.188 seconds
INSERT OVERWRITE TABLE fact_orders1 SELECT * FROM fact_orders;
Query ID = hadoop_20151230051654_78edfb70-4d41-4fa7-9110-fa9a98d5405d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1451392201160_0007, Tracking URL = http://localhost:8088/proxy/application_1451392201160_0007/
Kill Command = /home/hadoop/hadoop-2.6.1/bin/hadoop job -kill job_1451392201160_0007
You get no output from the query because there is no data stored in the table. I assume you use the default warehouse under /user/hive/warehouse, so what you need to do is:
LOAD DATA INPATH '/path/on/hdfs/to/data' OVERWRITE INTO TABLE fact_orders1;
That should work.
Also edit your table creation query, adding a LOCATION clause:
CREATE TABLE fact_orders1 (order_number String, created timestamp, last_upd timestamp)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
LOCATION '/user/hive/warehouse/fact_orders1';
If you want to use the data outside the Hive warehouse, you need to use external tables.
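For example, a sketch of that external variant (table name and location are placeholders):
CREATE EXTERNAL TABLE fact_orders1_ext (order_number String, created timestamp, last_upd timestamp)
STORED AS ORC
LOCATION '/data/external/fact_orders1';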