Creating an ORC table in Hive on Ubuntu fails with "FAILED: Error in semantic analysis: Unrecognized file format in STORED AS clause: orc" - hive

hive> create table orc_table (name string,img_loc string) stored as orc tblproperties("orc.compress"="none");
FAILED: Error in semantic analysis: Unrecognized file format in STORED AS clause: orc
hive> create table orc_table (name string,img_loc string) stored as orcfile tblproperties("orc.compress"="none");
FAILED: Error in semantic analysis: Unrecognized file format in STORED AS clause: orcfile
hive> create table orc_table(name string,img_loc string) stored as orcfile;
FAILED: Error in semantic analysis: Unrecognized file format in STORED AS clause: orcfile
hive> create table orc_table(name string,img_loc string) stored as orc;
FAILED: Error in semantic analysis: Unrecognized file format in STORED AS clause: orc

You need to make sure your Hive version is 0.11 or later; ORC was introduced in version 0.11.
ORC -- (Note: Available in Hive 0.11.0 and later)
How to check the Hive version:
$ hive --version
Hive 0.14.0.2.2.4.8-40
HIVE-3874: Create a new Optimized Row Columnar file format for Hive. - this is the implementation ticket.
Hive Create Table syntax - check file_format for the minimum version required by each storage type.
ORC Files - Information about ORC files.

This error occurs because you are loading a non-ORC file directly into an ORC table. The best solution is to first create a staging table, load the data into it, and then insert from that table into the ORC table:
CREATE TABLE data(value1 string, value2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
Here the fields are terminated by "|" because I am using a pipe-separated (PSV) file; set the delimiter according to your own file format.
LOAD DATA INPATH '/user/hive/data.psv' INTO TABLE data;
CREATE TABLE data2 (value1 string, value2 string) STORED AS ORC TBLPROPERTIES ("orc.compress" = "SNAPPY");
INSERT INTO TABLE data2 SELECT * FROM data;
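If you want to confirm that the new table really is stored as ORC, describing it is a quick optional check (using the data2 table created above):
-- the InputFormat / OutputFormat rows of the output should mention OrcInputFormat / OrcOutputFormat
DESCRIBE FORMATTED data2;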

Related

Creating external hive table in databricks

I am using Databricks Community Edition.
I am using a Hive query to create an external table. The query runs without any error, but the table is not getting populated with the file specified in the query.
Any help would be appreciated.
From the official docs ... make sure your S3/storage location path and schema (with respect to the file format [TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, and LIBSVM]) are correct
DROP TABLE IF EXISTS <example-table> // deletes the metadata
dbutils.fs.rm("<your-s3-path>", true) // deletes the data
CREATE TABLE <example-table>
USING org.apache.spark.sql.parquet
OPTIONS (PATH "<your-s3-path>")
AS SELECT <your-sql-query-here>
// alternative
CREATE TABLE <table-name> (id long, date string) USING PARQUET LOCATION "<storage-location>"
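If the table still shows up empty after this, a couple of quick sanity checks can help narrow things down (generic SQL, not from the original answer, using the placeholder table name above):
-- confirm the location/path the table actually points at
SHOW CREATE TABLE <example-table>;
-- confirm rows are visible once the path is correct
SELECT COUNT(*) FROM <example-table>;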

Migrating data from a Hive PARQUET table to BigQuery: the Hive STRING data type is getting converted to the BYTES datatype in BQ

I am trying to migrate data from Hive to BigQuery. The data in the Hive table is stored in the PARQUET file format, and the data type of one column is STRING. I am uploading the files behind the Hive table to Google Cloud Storage and creating a BigQuery internal table from them with the GUI. The datatype of that column in the imported table is getting converted to BYTES.
But when I imported CHAR or VARCHAR datatypes, the resultant datatype was STRING.
Could someone please help me understand why this is happening?
This does not answer the original question, as I do not know exactly what happened, but I have had experience with similar odd behavior.
I was facing a similar issue when trying to move a table between Cloudera and BigQuery.
I first created the table as external in Impala like:
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table
original_table has columns with the STRING datatype.
Then I transferred that to GCS and imported it into BigQuery from the console GUI; there are not many options, you just select the Parquet format and point to GCS.
To my surprise, the columns were now of type BYTES; the column names were preserved fine, but the content was scrambled.
Trying different codecs, and pre-creating the table and inserting into it, still all in Impala, led to no change.
Finally I tried to do the same in Hive, and that helped.
So I ended up creating an external table in Hive like:
CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION 's3a://table_migration/test2';
insert into table test2 select * from original_table;
I then repeated the same dance of copying from S3 to GCS and importing into BQ - this time without any issue. The columns are now recognized in BQ as STRING and the data is as it should be.
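If you want to double-check the result on the BigQuery side, the dataset's INFORMATION_SCHEMA view lists the imported column types (a quick check; mydataset is a placeholder for your actual dataset name):
-- col1 and col2 should now appear as STRING rather than BYTES
SELECT column_name, data_type
FROM mydataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'test2';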

Getting ClassCastException in Hive ORC table

Using Cloudera 8.1. In Hive, I loaded an ORC-format table from a CSV file. I am getting this error when attempting to query the loaded table:
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.IntWritable
This is a common mistake I see a lot of people make.
You cannot load a CSV file directly into an ORC table. Instead, create a Hive external table over the CSV data and then run
"INSERT INTO TABLE FINAL SELECT * FROM TEMP_TABLE", which copies the CSV data into the ORC table.
Using this method, Hive converts the CSV data into ORC with its built-in libraries, as sketched below.
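A minimal sketch of that two-step flow (the table names TEMP_TABLE and FINAL, the columns, and the HDFS path are placeholders for your own schema and layout):
-- external staging table defined over the raw CSV files
CREATE EXTERNAL TABLE TEMP_TABLE (col1 INT, col2 VARCHAR(50))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/staging/csv_data';
-- final ORC table with the same layout; the INSERT rewrites the rows as ORC
CREATE TABLE FINAL (col1 INT, col2 VARCHAR(50)) STORED AS ORC;
INSERT INTO TABLE FINAL SELECT * FROM TEMP_TABLE;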

Dynamically create Hive external table with Avro schema on Parquet Data

I'm trying to dynamically (without listing column names and types in the Hive DDL) create a Hive external table on Parquet data files. I have the Avro schema of the underlying Parquet file.
My attempt is to use the DDL below:
CREATE EXTERNAL TABLE parquet_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath'
TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc');
My Hive table is successfully created with the right schema, but when I try to read the data:
SELECT * FROM parquet_test;
I get the following error:
java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Expecting a AvroGenericRecordWritable
Is there a way to successfully create and read a Parquet table without listing the column names and types in the DDL?
The queries below work: the first creates an Avro-backed table whose columns are derived from the Avro schema, and the second copies that schema onto an external table stored as Parquet:
CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';

Apache hive create table for given structure

My CSV file contains a data structure like:
99999,{k1:v1,k2:v2,k3:v3},9,1.5,http://www.asd.com
What is the CREATE TABLE query for this structure?
I don't want to have to do any processing on the CSV file before it is loaded into the table.
You need to use the OpenCSV serde to read/write CSV data to/from a Hive table. Download it here: https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar
Add the serde to the library path of Hive. This step can be skipped, but do upload the jar to the HDFS cluster your Hive server is running on; we will use it later when querying.
Create Table
CREATE TABLE my_table(a int, b string, c int, d double, url string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Note that if you use the OpenCSV serde, every column is treated as a string by Hive, no matter what type you declare. But no worries, since Hive is loosely typed: it will cast strings into int, JSON, etc. at query time.
Query
To query at the Hive prompt, first add the library if it is not already on Hive's library path:
add jar hdfs:///user/hive/aux_jars/opencsv.jar;
Now you can query as follows:
select a, get_json_object(b, '$.k1') from my_table where get_json_object(b, '$.k2') > val;
The above is an example of accessing a JSON field in a Hive table.
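In the same way, numeric columns that the serde hands back as strings can be cast explicitly at query time (an illustrative query against the my_table definition above, not part of the original answer):
SELECT CAST(a AS INT) + 1 AS a_plus_one,
       CAST(d AS DOUBLE) * 2 AS d_doubled
FROM my_table
WHERE CAST(c AS INT) > 5;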
References:
http://documentation.altiscale.com/using-csv-serde-with-hive
http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html
PS: json_tuple is the faster way to access JSON elements, but I find the syntax of get_json_object more appealing; a json_tuple version of the query above is sketched below for comparison.
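The json_tuple form is written with a LATERAL VIEW (a sketch against the my_table definition above; 'val' remains a placeholder value, as in the get_json_object example):
SELECT a, t.k1
FROM my_table
LATERAL VIEW json_tuple(b, 'k1', 'k2') t AS k1, k2
WHERE t.k2 > 'val';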