Is it possible to change the camel case in a Hive ORC table's schema metadata?

This is my old DDL. The table has history data.
CREATE TABLE orc_table (
first_name STRING,
address array<struct<num:int,p_num:int,STREEt:string>>
)
STORED AS ORC;
After we altered STREEt to street (a camel-case change) in the array-of-struct column address, all the historical data for that column reads as NULL.
Alter command used:
ALTER TABLE orc_table CHANGE COLUMN address address array<struct<num:int,p_num:int,street:string>>
Is it possible to make ORC table metadata case insensitive, or is there any way to fix the historical and current data?

Related

Change column name of an external partitioned parquet table in hive without null/lost data

I have the following table:
CREATE EXTERNAL TABLE aggregate_status(
m_point VARCHAR(50),
territory VARCHAR(50),
reading_meter VARCHAR(50),
meter_type VARCHAR(500)
)
PARTITIONED BY(
insert_date VARCHAR(10))
STORED AS PARQUET
LOCATION '<the s3 route>/aggregate_status'
TBLPROPERTIES(
'parquet.compression'='SNAPPY'
)
I wish to change the reading_meter column to reading_mode, without losing data.
ALTER TABLE works, but the column now shows NULL.
I'm not the owner of the Hadoop environment I'm working on, so changing properties such as set parquet.column.index.access = true is not an option.
Any help would be appreciated. Thanks.
I managed to find a solution, at least for small amounts of data.
Create a backup of the table, with the column name already changed.
CREATE TABLE aggregate_status_bkp AS
SELECT
m_point,
territory,
reading_meter AS reading_mode,
meter_type,
insert_date
FROM aggregate_status
Perform the ALTER TABLE
ALTER TABLE aggregate_status CHANGE COLUMN reading_meter reading_mode VARCHAR(50)
INSERT OVERWRITE from the backup to the original.
--You might need to temporarily disable strict partition mode, depending on your case; this is safe since it's only a safety lock on dynamic partition inserts, not a data change.
--set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE aggregate_status PARTITION(insert_date)
SELECT
m_point,
territory,
reading_mode,
meter_type,
insert_date
FROM aggregate_status_bkp;
--set hive.exec.dynamic.partition.mode=strict;
Another situation we want to protect against dynamic partition insert is that the user may accidentally specify all partitions to be dynamic partitions without specifying one static partition, while the original intention is to just overwrite the sub-partitions of one root partition. We define another parameter hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic partition case.
See https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-QueryingandInsertingData
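Before dropping the backup, a quick sanity check may be worthwhile; this is just an illustrative sketch against the two tables above:
-- row counts should match between the original and the backup
SELECT COUNT(*) FROM aggregate_status;
SELECT COUNT(*) FROM aggregate_status_bkp;
-- spot-check that the renamed column is populated again
SELECT reading_mode FROM aggregate_status WHERE reading_mode IS NOT NULL LIMIT 10;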
Optional: delete the backup table after you're finished.
DROP TABLE aggregate_status_bkp;

Table can't be queried after change column position

When querying the table using "select * from t2p", the response is as below. I think I have missed some concepts; please help me out.
Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazy.objectinspector.LazyMapObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
Step 1, create the table:
create table t2p(id int, name string, score map<string,double>)
partitioned by (class int)
row format delimited
fields terminated by ','
collection items terminated by '\;'
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
Step 2, insert data like:
1,zs,math:90.0;english:92.0
2,ls,chinese:89.0;math:80.0
3,xm,geo:87.0;math:80.0
4,lh,chinese:89.0;english:81.0
5,xw,physics:91.0;english:81.0
Step 3, add another column:
alter table t2p add columns (school string);
Step 4, change the column order:
alter table t2p change school school string after name;
Step 5, run the query and get the error mentioned above:
select * from t2p;
This is an expected error.
Your command alter table t2p change school school string after name; changes metadata only. If you are moving columns, the data must already match the new schema, or you must change it to match by some other means.
That means the existing map data has to match the column that now occupies its position. In other words, if you want to move columns around, make sure the new column layout and the existing data types line up.
I did a simple experiment with the int data type. It "worked" because the data types are not hugely different, but you can see that the metadata changed while the data stayed the same.
create table t2p(id int, name string, score int)
partitioned by (class int)
stored as textfile;
insert into t2p partition(class=1) select 100,'dum', 199;
alter table t2p add columns (school string);
alter table t2p change school school string after name;
MSCK REPAIR TABLE t2p;
select * from t2p;
You can see the new column school is mapped to position 3 (which was defined as INT).
Solution: you can do this, but make sure the new structure and data types are compatible with the old structure, or rewrite the data to match, for example as sketched below.
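A rough sketch of that rewrite, assuming a hypothetical table name t2p_new and that it is done instead of Step 4 (i.e. while school is still the last column, so the data still lines up with the schema):
create table t2p_new(id int, name string, school string, score map<string,double>)
partitioned by (class int)
row format delimited
fields terminated by ','
collection items terminated by '\;'
map keys terminated by ':'
stored as textfile;
-- dynamic partitions are needed to copy every class partition in one statement
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table t2p_new partition(class)
select id, name, school, score, class from t2p;
Once the copy is verified, the old table can be dropped and the new one renamed with alter table t2p_new rename to t2p.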

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive, but I cannot see any data.
Since it's an external table, data insertion should be done automatically, right?
Your file should have its columns in this sequence:
int,string
Here, your file contents are in the sequence below:
string,int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords as column names (date).
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder under the table location containing a CSV file that matches the table structure, the data is picked up automatically and you can already run SELECT queries to see it.
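As a quick sanity check (just an illustrative sketch using the partition value from above):
SHOW PARTITIONS google_analytics;
SELECT * FROM google_analytics WHERE date_string = '2016-09-06' LIMIT 10;
If the partition shows up but the SELECT is still empty, the file is most likely not inside the date_string=2016-09-06 subfolder.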
Solved!

Buckets are not being created in hadoop-hive

I'm trying to create a bucket in hive by using following commands:
hive> create table emp( id int, name string, country string)
clustered by( country)
row format delimited
fields terminated by ','
stored as textfile ;
The command executes successfully: when I load data into this table and run select * from emp, all the data is shown.
However, on HDFS only one table directory is created and there is only one file in it with all the data. That is, there is no folder for the records of a specific country.
First of all, in the DDL statement you have to explicitly mention how many buckets you want.
create table emp( id int, name string, country string)
clustered by( country)
INTO 2 BUCKETS
row format delimited
fields terminated by ','
stored as textfile ;
In the above statement I have mentioned 2 buckets; similarly, you can specify any number you want.
Still, you are not done!
After that, while inserting data into the table, you also have to give Hive the hint below (note that bucketing is applied by INSERT ... SELECT queries; a plain LOAD DATA just copies files and does not bucket them).
set hive.enforce.bucketing = true;
That should do it.
After this, you should see that the number of files created under the table directory is the same as the number of buckets mentioned in the DDL statement.
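As a rough illustration (the staging table emp_stage below is hypothetical, standing in for wherever the raw rows currently live), the bucketed files are produced by an INSERT ... SELECT rather than by LOAD DATA:
set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE emp SELECT id, name, country FROM emp_stage;
With the 2-bucket DDL above, this should leave two files under the emp table directory.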
Bucketing doesn't create HDFS folders; rather, if you want a separate folder to be created for each country then you should PARTITION the table, as sketched below.
Please go through Hive partitioning and bucketing in detail.
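For example, a partitioned variant of the same table (the name emp_part is just illustrative):
create table emp_part( id int, name string)
partitioned by (country string)
row format delimited
fields terminated by ','
stored as textfile;
Here each distinct country value gets its own country=<value> folder under the table directory once data is inserted for that partition.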

External table does not return the data in its folder

I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder, but when I query the table, it returns nothing. The table is structured so that it fits the structure of the data.
SELECT * FROM tb LIMIT 3;
Is there a kind of permission issue with Hive tables: do specific users have permissions to query some tables?
Do you know some solutions or workarounds?
You have created your table as a partitioned table based on the column datehour, but you are putting your data directly in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Since it is an external table, Hive will not update the metastore on its own; you need to run an ALTER statement to register the partition.
So here are the steps for external tables with partition:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401)
2.) After creating your table, run an ALTER statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401);
Hope it helps...!!!
When we create an EXTERNAL TABLE with PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401) LOCATION '/user/cloudera/data/somedatafor_datehour';
When we specify LOCATION '/user/cloudera/data' (though it's optional) while creating an EXTERNAL TABLE, we can take advantage of repair operations on that table. So when we copy files into that directory through some process like ETL, we can sync up the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement for every new partition.
If we already know the directory structure of the partition that Hive would create, we can simply place the data file in that location, for example '/user/cloudera/data/datehour=0909201401/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync the partitions into the Hive metastore for the table "tb".
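A quick way to confirm the partition is now registered (illustrative only):
hive> SHOW PARTITIONS tb;
hive> SELECT * FROM tb WHERE datehour=0909201401 LIMIT 3;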