HiveServer2 is unable to read the HDFS data. I have built the table on top of HDFS.
These are the table properties and settings I have tried in order to resolve the problem; none of them works reliably on HDP 3.x:
Tried both Internal and External tables
Stored as ORC
777 permission recursively for all the folders
Executing the table as the same owner of the table
Transactional = true (internal table only)
ORC compression set to ZLIB
MSCK REPAIR executed successfully. The partition values shown and the folder sizes are the same as in prod
Partitioned and Bucketed
CREATE EXTERNAL TABLE `machine_data`(
  `ids` string,
  `delta` string,
  `locatio` string,
  `time_data` string,
  `valid` boolean,
  `measure` string,
  `val` float)
PARTITIONED BY (`nodename` string)
I moved the data from Server A to Server B and Server C in HDFS and built a table on top of the HDFS data. All three servers run HDP 3.1. Server A is the production server, which has been working fine since the initial setup. I moved the data to the dev and test servers respectively.
Server B is the test server, which started working the next day without any change.
Server C is the dev server, which is still not working after three days.
I compared the HS2 configs; they are almost the same across the servers.
I find this kind of scenario really strange to deal with.
I tried creating a hive external table:
CREATE EXTERNAL TABLE TestXML (storexml string)
LOCATION 'wasb:///test/';
However, when I try executing a query like the one below, it is not able to extract the fields:
xpath_string (storexml, '/trades/trade/USI')
I saw a post that talked about specifying the input format:
add JARS <>
set xmlinput.element=Store;
CREATE EXTERNAL TABLE EventStoreXML (storexml string)
STORED AS INPUTFORMAT 'msdn.hadoop.mapreduce.input.XmlElementStreamingInputFormat'
LOCATION 'wasb:///';
I could not determine which JARs to include in the ADD JARS statement. I am using HDInsight on Linux.
Any pointers will be appreciated.
I realised the issue was that the XML contained carriage returns; as a result, Hive was not able to read the XML.
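Since a plain-text Hive table treats every line of a file as a separate row, one possible workaround is to flatten each XML document onto a single line before uploading it. This is only a minimal Python sketch, assuming line breaks occur only between elements (text content that spans lines would be joined without a space). The function name flatten_xml and the sample document are made up for illustration:

```python
import re


def flatten_xml(text: str) -> str:
    """Collapse an XML document onto one line so it lands in a single row."""
    # Drop every CR/LF run together with the indentation around it.
    # Assumption: line breaks only separate elements, never words in text.
    return re.sub(r"\s*[\r\n]+\s*", "", text).strip()


# Hypothetical sample with Windows-style line endings.
sample = "<trades>\r\n  <trade>\r\n    <USI>abc123</USI>\r\n  </trade>\r\n</trades>\r\n"
print(flatten_xml(sample))
```

After flattening, the whole document sits in one storexml value, so an expression such as xpath_string(storexml, '/trades/trade/USI') has a complete document to work on.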
A CREATE TABLE script in Hive hangs and does not complete for a long time. I am using CDH 5.7. 'show databases' also takes a long time, but it eventually lists all the databases. Below is the create script I am using:
create table dept
( dep_id int,
dep_name string
);
Am I missing some configuration settings related to Hive? Also, I can see a green status icon for Hive in Cloudera Manager (CM).
It looks like the Hive metastore was hanging; after restarting the Hive service it started working. Thanks for your help in the Cloudera community.
I'm trying to use the Hive (0.13) MSCK REPAIR TABLE command to recover partitions, but it only lists the partitions that are not in the metastore instead of also adding them to the metastore.
Here's the output of the command:
partitions not in metastore externalexample:CreatedAt=26 04%3A50%3A56 UTC 2014/profileLocation="Chicago"
Here's how I'm creating the external table:
create external table ExternalExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
location '/user/hue/exttable/';
Am I missing something?
I had a similar issue with the MSCK REPAIR TABLE listing the partitions that were not in the metastore but not actually adding them (and no error message).
I tried manually adding the partition with the ALTER TABLE ADD PARTITION command, and this gave me an error message, leading me to the root cause which was that the HDFS folder containing the 'missing' partition had been set up with incorrect permissions.
Once the permissions issue was resolved, then the MSCK REPAIR TABLE command worked correctly.
If you encounter this issue, it may be worthwhile to try adding it manually with the ALTER TABLE ADD PARTITION command. It may produce a useful error message that would help you determine the root cause of the problem.
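The manual check described above can be scripted when there are many partition directories. The following Python sketch builds an ALTER TABLE ... ADD PARTITION statement from a partition directory path. The helper name add_partition_statement, the table name and the path are all hypothetical, and it assumes the usual key=value directory layout, string-typed partition columns, and values without single quotes:

```python
from urllib.parse import unquote


def add_partition_statement(table: str, partition_path: str) -> str:
    """Build one ALTER TABLE ... ADD PARTITION statement for a directory."""
    specs = []
    for segment in partition_path.strip("/").split("/"):
        if "=" not in segment:
            continue  # base directories of the table, not partition levels
        key, value = segment.split("=", 1)
        # Directory names may be percent-encoded (e.g. %3A for ':').
        specs.append(f"{key}='{unquote(value)}'")
    if not specs:
        raise ValueError(f"no key=value segments in {partition_path!r}")
    location = partition_path.rstrip("/")
    return (f"ALTER TABLE {table} ADD PARTITION ({', '.join(specs)}) "
            f"LOCATION '{location}';")


# Hypothetical table and path, for illustration only.
print(add_partition_statement("my_table", "/data/my_table/year=2014/city=Chicago"))
```

Running the generated statement by hand for one of the directories that MSCK reports is what may surface the underlying error, such as a permissions problem on the folder.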
Please make sure that the names of the partitions defined in your table definition match the names of the partitions on HDFS.
For example, in your table creation example, I see that you haven't defined any partitions at all.
I think you want to do something like this (note the use of PARTITIONED BY):
create external table ExternalExample(tweetId BIGINT, username STRING, txt STRING,favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT) PARTITIONED BY (CreatedAt STRING, profileLocation STRING) COMMENT 'This is the Twitter streaming data' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE location '/user/hue/exttable/';
Then on hdfs you should have the following folder structure:
The partition names for MSCK REPAIR TABLE ExternalTable should be in lowercase; only then will it add them to the Hive metastore. I faced a similar issue in Hive 1.2.1, where there was no support for ALTER TABLE ExternalTable RECOVER PARTITION. After spending some time debugging, I found that the partition names must be in lowercase, i.e. /some_external_path/mypartion=01 is valid and /some_external_path/myParition=01 is invalid.
Change your profileLocation to profilelocation or profile_location and test again; it should work.
My question is here: Not able to recover partitions through alter table in Hive 1.2
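The lowercase rule described in this answer can be checked before running MSCK REPAIR TABLE. Below is a small Python sketch that reports partition column names containing uppercase letters in a directory path. The function name mixed_case_partition_keys and the example path values are hypothetical, and the key=value directory layout is assumed:

```python
from typing import List


def mixed_case_partition_keys(partition_path: str) -> List[str]:
    """Return the partition keys in the path that are not all lowercase."""
    keys = []
    for segment in partition_path.strip("/").split("/"):
        if "=" in segment:
            key = segment.split("=", 1)[0]
            if key != key.lower():
                keys.append(key)
    return keys


# Hypothetical path modelled on the table from the question.
print(mixed_case_partition_keys(
    "/user/hue/exttable/CreatedAt=2014/profileLocation=Chicago"))
```

Any key it reports would need its directory renamed to the lowercase form before the repair is retried.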
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (manually by hadoop fs -put command), the metastore will not be aware of these partitions.
You need to add each partition with ALTER TABLE ... ADD PARTITION, once for every partition,
or in short you can run:
MSCK REPAIR TABLE ExternalExample;
It will add any partitions that exist on HDFS but not in metastore to the metastore.
1) You need to specify partitions
2) Partition names must be all lowercase letters. See this -
You might not be running as the hive user:
sudo -u hive hive -e "set hive.msck.path.validation=ignore;msck repair table T1"
set hive.msck.path.validation=ignore; (this is for tables with a large number of partitions.)
You are just missing the PARTITIONED BY (CreatedAt STRING, profileLocation STRING).
I have a requirement to build a data warehouse in Hive and use HBase to serve real-time access.
So I would like to know what the architecture for this would be.
Can I first dump the data into HBase, access it through a REST service, and create an external table in Hive to run Hive queries on it?
Will Hive be distributed, i.e. do I need to install Hive on all nodes of my cluster, or will it be central?
In answer to your questions:
Hive will be distributed.
For best performance, I would consider installing Hive on every node of the cluster. Hive translates HiveQL into MapReduce jobs - the jobs will be performed where the data is. If that's not possible, the data will have to move to the job. For the sake of response time, you'll want Hive on every node.
To create a Hive table that references data stored in HBase, you can check out the Hive - HBase Integration wiki. Here's a quick example:
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");