Reordering columns in Hive tables with Parquet data format

We are using Hive 1.1.0 and need to reorder the columns of a huge table with hundreds of columns for user readability. But when we reorder the columns, queries on the table fail with the error below.
The alternative to reordering the columns is to create a view on the table.
hive> desc test_parquet;
OK
name string
age int
dept string
salary string
city string
# Partition Information
# col_name data_type comment
city string
Time taken: 0.053 seconds, Fetched: 10 row(s)
hive> ALTER TABLE test_parquet REPLACE COLUMNS (age int,name string, dept string, salary string);
OK
Time taken: 0.451 seconds
hive> desc test_parquet;
OK
age int
name string
dept string
salary string
city string
# Partition Information
# col_name data_type comment
city string
Time taken: 0.051 seconds, Fetched: 10 row(s)
hive> select * from test_parquet;
OK
Failed with exception java.io.IOException:java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.IntWritable
Time taken: 0.121 seconds

ALTER TABLE ... REPLACE COLUMNS works only at the metadata level (the metastore); it does not rewrite the data files.
The new column order therefore no longer matches the actual data.
For Parquet you get an exception; for a text file you would get NULL values.
You should go with your alternative solution: using a view.
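A view leaves the table's metadata and data files alone and only changes the order in which columns are presented. A minimal sketch, assuming the table has been restored to its original column order; the view name test_parquet_v is made up for illustration:

```sql
-- Assumption: undo the earlier REPLACE COLUMNS so the metastore schema
-- matches the Parquet files again (original order: name, age, dept, salary).
ALTER TABLE test_parquet REPLACE COLUMNS (name string, age int, dept string, salary string);

-- Hypothetical view name; list the columns in the order users want to see.
CREATE VIEW test_parquet_v AS
SELECT age, name, dept, salary, city
FROM test_parquet;

-- Readers query the view instead of the base table.
SELECT * FROM test_parquet_v;
```

Because the view is only a stored query, it can be dropped and recreated with a different column order at any time without touching the data.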

This post is a bit old; however, I thought it would benefit others who have a similar issue.
You can try the CHANGE COLUMN method below, which works:
ALTER TABLE test_parquet CHANGE COLUMN age age int FIRST;
This moves the column to the first position. If you instead want to place a column after a certain column, the following would help:
ALTER TABLE test_parquet CHANGE COLUMN name name string AFTER age;

Related

How to specify fields when there are keywords in the field list in Hive?

I am trying to parse some historical SQL in a newer version of Hive (2.3.7) so that tasks can be migrated to it. I ran into the following keyword problem. I cannot delete the column list after the table name because that may disrupt the insertion order.
How can I deal with this keyword problem? The field names in the old SQL cannot be changed.
hive> create database db_test;
OK
Time taken: 0.017 seconds
hive> use db_test;
OK
Time taken: 0.007 seconds
hive> create table tb_test_to(
> `name` String,
> `interval` STRING
> );
OK
Time taken: 0.037 seconds
hive> create table tb_test_from(
> `name` String,
> `interval` STRING
> );
OK
Time taken: 0.052 seconds
hive> show tables;
OK
tb_test_from
tb_test_to
Time taken: 0.011 seconds, Fetched: 2 row(s)
hive> insert into tb_test_to (name,`interval`) select name, `interval` from tb_test_from;
FAILED: SemanticException 1:24 '`interval`' in insert schema specification is not found among regular columns of db_test.tb_test_to nor dynamic partition columns.. Error encountered near token '`interval`'
hive>
You can only do
insert into tb_test_to select name, `interval` from tb_test_from;
because, according to the HiveQL manual at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Synopsis.3, you cannot specify a column list in this statement:
"Values must be provided for every column in the table. The standard SQL syntax that allows the user to insert values into only some columns is not yet supported. To mimic the standard SQL, nulls can be provided for columns the user does not wish to assign a value to."
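Without a column list, Hive matches the SELECT expressions to the target columns by position, so the order of the SELECT list is what protects the insertion order. A sketch using the tables from the question:

```sql
-- Target columns are (name, `interval`), in that order.
-- The SELECT list must produce values in the same order; names are not matched.
INSERT INTO TABLE tb_test_to
SELECT
  name,         -- goes into tb_test_to.name
  `interval`    -- goes into tb_test_to.`interval`
FROM tb_test_from;
```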

How to create n number of external tables with a single hdfs path using Hive

Is it possible to create n external tables that all point to a single HDFS path using Hive? If yes, what are the advantages and limitations?
It is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS.
Creating tables with exactly the same schema on top of the same data is not useful, but you can create tables with different schemas: for example, with a different number of columns, or with columns parsed differently using RegexSerDe. You can also grant different permissions on these tables in Hive. A table can also be created on top of a sub-folder of another table's folder, in which case it will contain a subset of the data, although partitions in a single table are a better way to achieve that.
The drawback is that it is confusing: the same data can be rewritten through more than one table, and you may drop it accidentally, thinking the data belongs to only one table that you no longer need.
Here are a few tests:
Create table with INT column:
create table T(id int);
OK
Time taken: 1.033 seconds
Check location and other properties:
hive> describe formatted T;
OK
# col_name data_type comment
id int
# Detailed Table Information
Database: my
Owner: myuser
CreateTime: Fri Jan 04 04:45:03 PST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://myhdp/user/hive/warehouse/my.db/t
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1546605903
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.134 seconds, Fetched: 26 row(s)
Create second table on top of the same location but with STRING column:
hive> create table T2(id string) location 'hdfs://myhdp/user/hive/warehouse/my.db/t';
OK
Time taken: 0.029 seconds
Insert data:
hive> insert into table T values(1);
OK
Time taken: 33.266 seconds
Check data:
hive> select * from T;
OK
1
Time taken: 3.314 seconds, Fetched: 1 row(s)
Insert into second table:
hive> insert into table T2 values( 'A');
OK
Time taken: 23.959 seconds
Check data:
hive> select * from T2;
OK
1
A
Time taken: 0.073 seconds, Fetched: 2 row(s)
Select from first table:
hive> select * from T;
OK
1
NULL
Time taken: 0.079 seconds, Fetched: 2 row(s)
The string was selected as NULL because this table defines the column as INT.
Now insert a STRING into the first table (INT column):
insert into table T values( 'A');
OK
Time taken: 84.336 seconds
Surprise, it is not failing!
What was inserted?
hive> select * from T2;
OK
1
A
NULL
Time taken: 0.067 seconds, Fetched: 3 row(s)
NULL was inserted because, during the previous insert, the string was converted to int, which resulted in NULL.
Now let's try to drop one table and select from another one:
hive> drop table T;
OK
Time taken: 4.996 seconds
hive> select * from T2;
OK
Time taken: 6.978 seconds
It returned 0 rows because the first table was MANAGED, so DROP TABLE also removed the common location.
The end: the data is removed. Do we need the T2 table without data in it?
drop table T2;
OK
The second table is removed. As you can see, only metadata was removed: this table was also managed, so DROP TABLE would normally remove the location with the data as well, but there was nothing left to remove in HDFS.
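The tests above use managed tables. With external tables, which is what the question asks about, DROP TABLE removes only the metastore entry and leaves the files in place. A sketch under that assumption; the table names and the path /data/shared/ids are hypothetical:

```sql
-- Two external tables over the same (hypothetical) HDFS directory,
-- each reading the same files with a different column type.
CREATE EXTERNAL TABLE ids_as_int (id int)
LOCATION '/data/shared/ids';

CREATE EXTERNAL TABLE ids_as_string (id string)
LOCATION '/data/shared/ids';

-- Dropping an external table deletes metadata only; the files stay in HDFS,
-- so the other table keeps returning the data.
DROP TABLE ids_as_int;
SELECT * FROM ids_as_string;
```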

How to rename partition value in Hive?

I have a hive table 'videotracking_playevent' which uses the following partition format (all strings): source/createyear/createmonth/createday.
Example: source=home/createyear=2016/createmonth=9/createday=1
I'm trying to update the partition values of createmonth and createday to consistently use double digits instead.
Example: source=home/createyear=2016/createmonth=09/createday=01
I've tried the following query:
ALTER TABLE videotracking_playevent PARTITION (
source='home',
createyear='2015',
createmonth='11',
createday='1'
) RENAME TO PARTITION (
source='home',
createyear='2015',
createmonth='11',
createday='01'
);
However that returns the following, non-descriptive error from hive: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null
I've confirmed that this partition exists, and I think I'm using the correct syntax. My hive version is Hive 1.1.0
Any ideas what I might be doing wrong?
There was an issue with renaming partitions in old versions of Hive. This might be the issue in your case too. Please see this link for detail.
If you are using an older version of Hive, you need to set the two properties below before executing the rename partition command.
set fs.hdfs.impl.disable.cache=false;
set fs.file.impl.disable.cache=false;
Now run the query with these properties set.
hive> set fs.hdfs.impl.disable.cache=false;
hive> set fs.file.impl.disable.cache=false;
hive> ALTER TABLE partition_test PARTITION (year='2016',day='1') RENAME TO PARTITION (year='2016',day='01');
OK
Time taken: 0.28 seconds
hive> show partitions partition_test;
OK
year=2016/day=01
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive>
This issue is fixed in later versions of Hive. In my case the Hive version is 1.2.1, and it works without setting those properties. Please see the example below.
Create a partitioned table.
hive> create table partition_test(
> name string,
> age int)
> partitioned by (year string, day string);
OK
Time taken: 5.35 seconds
hive>
Now add the partition and check the newly added partition.
hive> alter table partition_test ADD PARTITION (year='2016', day='1');
OK
Time taken: 0.137 seconds
hive>
hive> show partitions partition_test;
OK
year=2016/day=1
Time taken: 0.169 seconds, Fetched: 1 row(s)
hive>
Rename the partition using RENAME TO PARTITION command and check it.
hive> ALTER TABLE partition_test PARTITION (year='2016',day='1') RENAME TO PARTITION (year='2016',day='01');
OK
Time taken: 0.28 seconds
hive> show partitions partition_test;
OK
year=2016/day=01
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive>
Hope it helps you.
"Rename lets you change the value of a partition column. One of use cases is that you can use this statement to normalize your legacy partition column value to conform to its type. In this case, the type conversion and normalization are not enabled for the column values in old partition_spec even with property hive.typecheck.on.insert set to true (default) which allows you to specify any legacy data in form of string in the old partition_spec"
Open bug:
https://issues.apache.org/jira/browse/HIVE-10362
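Applied to the table from the question, normalizing one partition would look roughly like this. It is a sketch: it assumes the rename works on your Hive version (see the notes above) and that the target partition does not already exist. One statement is needed per partition.

```sql
-- Rename the single-digit month/day partition to its zero-padded form.
ALTER TABLE videotracking_playevent
PARTITION (source='home', createyear='2016', createmonth='9', createday='1')
RENAME TO
PARTITION (source='home', createyear='2016', createmonth='09', createday='01');

-- Check the result for that source and year.
SHOW PARTITIONS videotracking_playevent PARTITION (source='home', createyear='2016');
```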
You can create a copy of the table without partitions, update the column in that copy, and then recreate the original table with partitions:
create table table_name partitioned by (table_column) as
select
*
from
source_table
That worked for me.

Hive external table not showing partitions

I have created an external table using Hive. My
hive> desc <table_name>;
shows the following output:
OK
transactiontype string
transactionid int
sourcenumber int
destnumber int
amount int
assumedfield1 int
transactionstatus string
assumedfield2 int
assumedfield3 int
transactiondate date
customerid int
# Partition Information
# col_name data_type comment
transactiondate date
customerid int
Time taken: 0.094 seconds, Fetched: 17 row(s)
But when I execute the following command:
hive> show partitions <dbname.tablename>;
OK
Time taken: 0.11 seconds
No partitions are shown. What might be the problem? When I look at hive.log, the data in the table seems to be partitioned properly according to the 'transactiondate' and 'customerid' fields. What is the maximum number of partitions that a single node should have? I have set 1000 partitions.
2015-06-15 10:33:44,713 INFO [LocalJobRunner Map Task Executor #0]: exec.FileSinkOperator (FileSinkOperator.java:createBucketForFileIdx(593)) - Writing to temp file: FS hdfs://localhost:54310/home/deepak/mobile_money_jan.txt/.hive-staging_hive_2015-06-15_10-30-53_308_5507019849041735537-1/_task_tmp.-ext-10002/transactiondate=2015-01-16/customerid=34560544/_tmp.000002_0
I am running hive on a single node hadoop cluster.
Try adding partitions manually
> alter table db.table add IF NOT EXISTS
> partition(datadate='2017-01-01') location
>'hdfs_location/datadate=2017-01-01'
Hi, whenever we create an external table, its location is set to the specified location in the Hive metadata, so that change is reflected in the Hive metastore.
But the partition information is not updated in the Hive metastore, so we need to add those partitions manually.
ALTER TABLE <table_name> ADD PARTITION(transactiondate='datevalue',customerid='id-value');
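If the directories under the table location already follow the transactiondate=.../customerid=... layout, Hive can also register them in bulk instead of one ALTER TABLE per partition. A sketch, assuming that layout; the table name transactions is hypothetical:

```sql
-- Hypothetical table name. Scans the table's location and adds any
-- partition directories that are missing from the metastore.
MSCK REPAIR TABLE transactions;

-- The partitions should now be listed.
SHOW PARTITIONS transactions;
```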

Adding a default value to a column while creating table in hive

I'm able to create a Hive table from data in an external file. Now I wish to create another table from the data in the previous table, with additional columns that have a default value.
I understand that CREATE TABLE AS SELECT can be used, but how do I add additional columns with a default value?
You can specify which columns to select from the table on create/insert. Simply provide the default value as one of the columns. An example with INSERT OVERWRITE is below:
Creating simple table and populating it with value:
hive> create table table1(col1 string);
hive> insert into table table1 values('val1');
hive> select col1 from table1;
OK
val1
Time taken: 0.087 seconds, Fetched: 1 row(s)
Allowing dynamic partitions:
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
Creating second table:
hive> create table table2(col1 string, col2 string);
Populating it from table1 with default value:
hive> insert overwrite table table2 select col1, 'DEFAULT' from table1;
hive> select * from table2;
OK
val1 DEFAULT
Time taken: 0.081 seconds, Fetched: 1 row(s)
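Since the question asks about CREATE TABLE AS SELECT, the same constant-column idea can be written as a single statement. A sketch; table3 is a made-up name, and the CAST is only there to make the new column's type explicit:

```sql
-- Copy col1 from table1 and add col2 filled with a constant default.
CREATE TABLE table3 AS
SELECT
  col1,
  CAST('DEFAULT' AS string) AS col2
FROM table1;

SELECT * FROM table3;
```

This fills the rows copied at creation time only; rows inserted later do not pick up the value automatically.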
I've been looking for a solution for this too and came up with this:
CREATE TABLE test_table AS SELECT
CASE
WHEN TRUE
THEN "desired_value"
END AS default_column_name;