Hive create table for json data - hive

I am trying to create a Hive table that can read JSON data, but executing the CREATE statement throws an error.
Create statement:
CREATE TABLE employee_exp_json
( id INT,
fname STRING,
lname STRING,
profession STRING,
experience INT,
exp_service STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serede2.Jsonserede'
STORED AS TEXTFILE;
Error:
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde:
org.apache.hadoop.hive.contrib.serede2.Jsonserede
I have also added the JAR hive-json-serde.jar, but I am still facing the same issue. I am creating this table on Cloudera, and the Hive version is 1.1.0.

The correct class name is
org.apache.hive.hcatalog.data.JsonSerDe
Refer: Hive SerDes
As for the other JAR you added, check its documentation. It uses yet another class:
org.openx.data.jsonserde.JsonSerDe

Try adding json-serde-with-dependencies.jar.
You can download it from Download Hive Serde.
Then use the class
'org.openx.data.jsonserde.JsonSerDe'
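As a rough sketch of the fix described above, the original statement can be rewritten with the HCatalog class name. The JAR path is a placeholder (an assumption, not from the question); on many distributions the class ships in hive-hcatalog-core, and the ADD JAR line can be dropped if that JAR is already on the Hive classpath.

```sql
-- Placeholder path: point this at the hive-hcatalog-core JAR on your cluster.
ADD JAR /path/to/hive-hcatalog-core.jar;

CREATE TABLE employee_exp_json
( id INT,
  fname STRING,
  lname STRING,
  profession STRING,
  experience INT,
  exp_service STRING
)
-- Class name as given in the answer above; it replaces the misspelled
-- 'org.apache.hadoop.hive.contrib.serede2.Jsonserede'.
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```

If you go with the OpenX JAR instead, only the SERDE line changes, to 'org.openx.data.jsonserde.JsonSerDe'.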

Related

Presto fails to import PARQUET files from S3

I have a presto table that imports PARQUET files based on partitions from s3 as follows:
create table hive.data.datadump
(
tUnixEpoch varchar,
tDateTime varchar,
temperature varchar,
series varchar,
sno varchar,
date date
)
WITH (
format = 'PARQUET',
partitioned_by = ARRAY['series','sno','date'],
external_location = 's3a://dev/files');
The S3 folder structure where the parquet files are stored looks like:
s3a://dev/files/series=S5/sno=242=/date=2020-1-23
and the partition starts from series.
The original PySpark code that produces the Parquet files writes every column as String type, so I am trying to import the columns as strings. When I run my create script in Presto, the table is created successfully, but reading the data fails.
On Running,
select * from hive.data.datadump;
I get the following error:
[Code: 16777224, SQL State: ] Query failed (#20200123_191741_00077_tpmd5): The column tunixepoch is declared as type string, but the Parquet file declares the column as type DOUBLE
Can you guys help to resolve this issue?
Thank You in advance!
I ran into the same issue and found that it was caused by a record in my source whose datatype did not match the column the error complained about. I am sure this is just a data problem. You need to find the exact record that doesn't have the right type.
This might have been solved already, but just for info: this could be due to a column declaration mismatch between Hive and the Parquet file. To match columns by name instead of by position, set this Presto Hive connector property:
hive.parquet.use-column-names=true
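Another way to approach the error, sketched here under the assumption that the file really does store tUnixEpoch as a double (as the error message reports), is to declare that column with the file's type and cast it at query time. The table name datadump_typed is hypothetical, and other columns may need the same treatment if they turn out to be mismatched too.

```sql
-- Hypothetical table name; only tUnixEpoch is retyped, to match the DOUBLE
-- that the error message reports for the Parquet file.
CREATE TABLE hive.data.datadump_typed
(
    tUnixEpoch double,
    tDateTime varchar,
    temperature varchar,
    series varchar,
    sno varchar,
    date date
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['series','sno','date'],
    external_location = 's3a://dev/files'
);

-- Cast back to varchar at query time if a string is needed downstream.
SELECT CAST(tUnixEpoch AS varchar) AS tUnixEpoch, tDateTime, temperature
FROM hive.data.datadump_typed;
```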

Creating external hive table in databricks

I am using databricks community edition.
I am using a Hive query to create an external table. The query runs without any error, but the table is not populated from the file specified in the query.
Any help would be appreciated.
From the official docs: make sure your S3/storage location path and schema (with respect to the file format [TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, and LIBSVM]) are correct.
-- SQL cell: deletes the metadata
DROP TABLE IF EXISTS <example-table>;
// Scala cell: deletes the data
dbutils.fs.rm("<your-s3-path>", true)
-- SQL cell: recreates the table from a query
CREATE TABLE <example-table>
USING org.apache.spark.sql.parquet
OPTIONS (PATH "<your-s3-path>")
AS SELECT <your-sql-query-here>;
-- alternative
CREATE TABLE <table-name> (id long, date string) USING PARQUET LOCATION "<storage-location>";

How to load data to Hive table and make it also accessible in Impala

I have a table in Hive:
CREATE EXTERNAL TABLE sr2015(
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES (
'colelction.delim'='\u0002',
'field.delim'=',',
'mapkey.delim'='\u0003',
'serialization.format'=',', 'skip.header.line.count'='1',
'quoteChar'= "\"")
The table is loaded this way:
LOAD DATA INPATH "hdfs:///user/rxie/SR2015.csv" INTO TABLE sr2015;
Why is the table only accessible in Hive? When I attempt to access it in the HUE/Impala editor, I get the following error:
AnalysisException: Could not resolve table reference: 'sr2015'
which seems to say that there is no such table, yet the table does show up in the left panel.
In impala-shell, the error is different:
ERROR: AnalysisException: Failed to load metadata for table: 'sr2015'
CAUSED BY: TableLoadingException: Failed to load metadata for table:
sr2015 CAUSED BY: InvalidStorageDescriptorException: Impala does not
support tables of this type. REASON: SerDe library
'org.apache.hadoop.hive.serde2.OpenCSVSerde' is not supported.
I have always thought that Hive tables and Impala tables are essentially the same, and that the only difference is that Impala is a more efficient query engine.
Can anyone help sort it out? Thank you very much.
Assuming that sr2015 is located in a database called db, in order to make the table visible in Impala you need to issue either
invalidate metadata;
(which reloads metadata for all tables) or
invalidate metadata db.sr2015;
in the Impala shell.
However, in your case the cause is probably the version of Impala you are using: as the error says, it does not support this table's SerDe (OpenCSVSerde) at all.
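One possible workaround, offered only as a sketch and not something the asker confirmed, is to copy the data in Hive into a table whose storage format Impala can read, then refresh Impala's metadata for the copy. The table name sr2015_parquet is hypothetical, and the statements assume both engines are pointed at the same database.

```sql
-- Run in Hive: copy the CSV-backed table into a Parquet-backed one.
-- sr2015_parquet is a hypothetical name.
CREATE TABLE sr2015_parquet
STORED AS PARQUET
AS SELECT * FROM sr2015;

-- Run in impala-shell: pick up the table that was created outside Impala.
INVALIDATE METADATA sr2015_parquet;

SELECT COUNT(*) FROM sr2015_parquet;
```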

Getting exception while updating table in Hive

I have created a table in Hive from an existing S3 file as follows:
create table reconTable (
entryid string,
run_date string
)
LOCATION 's3://abhishek_data/dump1';
Now I would like to update one entry as follows:
update reconTable set entryid='7.24E-13' where entryid='7.24E-14';
But I am getting the following error:
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
I have gone through a few posts here, but I still have no idea how to fix this.
I think you should create an external table when reading data from a source like S3.
Also, you should declare the table in ORC format and set the table property 'transactional'='true'.
Please refer to this for more info: attempt-to-do-update-or-delete-using-transaction-manager
You can refer to this Cloudera Community Thread:
https://community.cloudera.com/t5/Support-Questions/Hive-update-delete-and-insert-ERROR-in-cdh-5-4-2/td-p/29485
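A minimal sketch of the ORC/transactional setup mentioned above follows. Several things here are assumptions rather than details from the question: the table name recon_table_acid, the bucket count of 4, and the choice of run_date as the bucketing column (picked because Hive does not allow updating a bucketing column, and the update targets entryid). Note that Hive ACID tables must be managed tables, not external ones, and the metastore may need additional compactor settings that are not shown.

```sql
-- Session settings commonly required for Hive ACID operations.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Needed on Hive 1.x; the setting was removed in Hive 2.x.
SET hive.enforce.bucketing=true;

-- Hypothetical managed, bucketed ORC table with transactions enabled.
CREATE TABLE recon_table_acid (
  entryid STRING,
  run_date STRING
)
CLUSTERED BY (run_date) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Copy the existing rows, then run the update against the ACID table.
INSERT INTO TABLE recon_table_acid
SELECT entryid, run_date FROM reconTable;

UPDATE recon_table_acid SET entryid='7.24E-13' WHERE entryid='7.24E-14';
```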

Apache hive create table for given structure

My CSV file contains a data structure like:
99999,{k1:v1,k2:v2,k3:v3},9,1.5,http://www.asd.com
What is the CREATE TABLE query for this structure?
I don't have to do any processing on the CSV file before it is loaded into the table.
You need to use the OpenCSV SerDe to read/write CSV data to/from a Hive table. Download it here: https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar
Add the SerDe to Hive's library path. This step can be skipped, but do upload the JAR to the HDFS cluster your Hive server runs on, because we will use it later when querying.
Create Table
CREATE TABLE my_table(a int, b string, c int, d double, url string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Notice that if you use the OpenCSV SerDe, every column is treated as a string by Hive, no matter what type you declare. But no worries, as Hive is a loosely typed language: it will cast strings to int, JSON, etc. at runtime.
Query
To query at the Hive prompt, first add the library if it is not already on Hive's library path:
add jar hdfs:///user/hive/aux_jars/opencsv.jar;
Now you could query as:
select a, get_json_object(b, '$.k1') from my_table where get_json_object(b, '$.k2') > val;
The above is an example of accessing a JSON field from a Hive table.
References:
http://documentation.altiscale.com/using-csv-serde-with-hive
http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html
PS: json_tuple is a faster way to access the JSON elements, but I found the syntax of get_json_object more appealing.
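For comparison, here is a small sketch of the json_tuple form mentioned in the PS. It assumes column b of my_table holds a valid JSON object containing the keys k1 and k2; the aliases t and j are arbitrary.

```sql
-- json_tuple extracts several keys from a JSON string in one call.
-- It is a table-generating function, so it is used with LATERAL VIEW.
SELECT t.a, j.k1, j.k2
FROM my_table t
LATERAL VIEW json_tuple(t.b, 'k1', 'k2') j AS k1, k2;
```

Because all requested keys are extracted together, this can avoid the repeated per-key calls that the get_json_object version makes.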