Create a table in Hive and populate it with data - hive

While trying to load data into a Hive table I encountered behavior that looks strange to me. My data is made up of JSON objects loaded as records into a table called twitter_test, which contains a single column named "json".
Now I want to extract three fields from each JSON and build a new table called "my_twitter". I thus issue the command
CREATE TABLE my_twitter AS
SELECT
  regexp_replace(get_json_object(t.json, '$.body[0]'), '\n', '') AS text,
  get_json_object(t.json, '$.publishingdate[0]') AS created_at,
  get_json_object(t.json, '$.author_screen_name[0]') AS author
FROM twitter_test AS t;
The result is a table with three columns that contains no data. However, if I run the SELECT command alone it returns data as expected.
By trial and error I found out that I need to add LIMIT x at the end of the query for data to be inserted into the new table. The question is: why?
Furthermore, it seems strange that I need to know in advance the number x of rows returned by the SELECT statement for the CREATE to work correctly. Is there any workaround?

You could create a table over this JSON data using the JSON SerDe, which parses the JSON objects so that you can easily select each individual column.
Below is a sample Hive DDL for creating a JSON table using the JSON SerDe:
CREATE EXTERNAL TABLE `json_table`(
A string
,B string
)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'PATH'
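Applied to the Twitter example above, a possible follow-up would be to declare the three fields as columns and select them directly. This is only a sketch: the column names and array types are assumptions based on the JSON paths in the question, and the LOCATION is hypothetical.

-- Sketch: field names/types assumed from the JSON paths in the question
CREATE EXTERNAL TABLE twitter_json(
  body array<string>,
  publishingdate array<string>,
  author_screen_name array<string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/path/to/twitter/json';  -- hypothetical path

CREATE TABLE my_twitter AS
SELECT
  regexp_replace(body[0], '\n', '') AS text,
  publishingdate[0]                 AS created_at,
  author_screen_name[0]             AS author
FROM twitter_json;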

Related

Hive: FieldName in Uppercase in avrofile schema required

I have created a hive table. Below is the create statement:
CREATE EXTERNAL TABLE schemanm.tbl_name(
FIELD_NAME_1 string COMMENT '...',
FIELD_NAME_2 string COMMENT '...',
.....)
PARTITIONED BY (
`part_1` string,
`part_2` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'path/to/directory'
When I write data into the table through an INSERT OVERWRITE statement, the field names in the Avro file's schema are written in lowercase. The requirement is to have the field names in uppercase in the schema. Below is a snippet showing the schema from the Avro file.
Objavro.schema∂,
{"type":"record","name":"tbl_name","namespace":"schemanm","fields":
[{"name":"field_name_1","type":["null","string"],"default":null},
{"name":"field_name_2","type":["null","string"],"default":null}]
Here field_name_1 and field_name_2 should be FIELD_NAME_1, FIELD_NAME_2 respectively.
I am stuck on this. Any help would be appreciated. I am not able to figure out what changes I can make in the create statement so that the field names get written in uppercase. Thanks in advance.
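One possible approach (only a sketch: it assumes the AvroSerDe uses an explicitly supplied avro.schema.literal when writing, so the case used in that schema is preserved) is to declare the Avro schema yourself with uppercase field names:

-- Sketch: the column list is omitted because it is derived from avro.schema.literal
CREATE EXTERNAL TABLE schemanm.tbl_name
PARTITIONED BY (
`part_1` string,
`part_2` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'path/to/directory'
TBLPROPERTIES ('avro.schema.literal'='{
  "type":"record","name":"tbl_name","namespace":"schemanm","fields":[
    {"name":"FIELD_NAME_1","type":["null","string"],"default":null},
    {"name":"FIELD_NAME_2","type":["null","string"],"default":null}]}');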

select row from orc snappy table in hive

I have created a table employee_orc, which is in ORC format with Snappy compression.
create table employee_orc(emp_id string, name string)
row format delimited fields terminated by '\t' stored as orc tblproperties("orc.compress"="SNAPPY");
I have uploaded data into the table using the insert statement.
employee_orc table has 1000 records.
When I run the below query, it shows all the records
select * from employee_orc;
But when I run the below query, it shows zero results even though the record exists.
select * from employee_orc where emp_id = "EMP456";
Why am I unable to retrieve a single record from the employee_orc table?
The record does not exist. You may think the values are the same because they look the same, but there is some difference. One possibility is spaces at the beginning or end of the string. For this, you can use LIKE:
where emp_id like '%EMP456%'
This might help you.
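To confirm whether hidden whitespace is the issue, a quick diagnostic query (just a sketch) can compare the raw and trimmed lengths of the stored values:

-- Sketch: raw_len greater than trimmed_len indicates leading/trailing whitespace
select emp_id, length(emp_id) as raw_len, length(trim(emp_id)) as trimmed_len
from employee_orc
where trim(emp_id) = 'EMP456';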
For my part, I don't understand why you want to specify a delimiter in ORC. Are you confusing CSV and ORC, or external vs. managed tables? I advise you to create your table differently:
create table employee_orc(emp_id string, name string)
stored as ORC
TBLPROPERTIES (
"orc.compress"="ZLIB");

Is there a way to specify Date/Timestamp format for the incoming data within the Hive CREATE TABLE statement itself?

I have CSV files which contain date and timestamp values in the below formats. E.g.:
Col1|col2
01JAN2019|01JAN2019:17:34:41
But when I define Col1 as DATE and Col2 as TIMESTAMP in my create statement, the Hive table simply returns NULL when I query it.
CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table
(Col1 date,
Col2 timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'my_path';
Instead, if I define the data types as simply string then it works. But that's not how I want my tables to be.
I want the table to be able to read the incoming data with the correct types. How can I achieve this? Is it possible to define the expected data format of the incoming data in the CREATE statement itself?
Can someone please help?
As of Hive 1.2.0 it is possible to provide the additional SerDe property "timestamp.formats". See this JIRA for more details: HIVE-9298
ALTER TABLE timestamp_formats SET SERDEPROPERTIES ("timestamp.formats"="ddMMMyyyy:HH:mm:ss");
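Applied to the table above, a possible CREATE statement might look like the sketch below. It assumes the default LazySimpleSerDe and keeps Col1 as a string, since timestamp.formats only affects TIMESTAMP parsing:

-- Sketch: the timestamp.formats pattern mirrors the sample data in the question
CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table
(Col1 string,
Col2 timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'timestamp.formats'='ddMMMyyyy:HH:mm:ss')
STORED AS TEXTFILE
LOCATION 'my_path';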

Can hive tables that contain DATE type columns be queried using impala?

Every time I try to select a DATE type field in Impala from a table created in Hive, I get AnalysisException: Unsupported type 'DATE'.
Are there any workarounds?
UPDATE: this is an example of a create table schema from Hive and an Impala query.
Schema:
CREATE TABLE myschema.mytable(day_dt date,
event string)
PARTITIONED BY (day_id int)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Impala query
select b.day_dt
from myschema.mytable b;
Impala doesn't have a DATE data type, whereas Hive does. You will get AnalysisException: Unsupported type 'DATE' when you access such a column from Impala. A quick fix would be to create a string column holding that date value in Hive and access it in whichever way you want from Impala.
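A minimal sketch of that quick fix (mytable_str is a hypothetical table name) is a CTAS in Hive that casts the column:

-- Sketch: expose the DATE value as a string so Impala can read it
CREATE TABLE myschema.mytable_str STORED AS TEXTFILE AS
SELECT CAST(day_dt AS string) AS day_dt, event, day_id
FROM myschema.mytable;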
If the underlying files store the values as strings, it may work to create a new external Hive table that points to the same HDFS location as the existing table, but with day_dt declared as STRING instead of DATE in the schema.
This is strictly a workaround: it may only suit some use cases, and you'd at least need to run "MSCK REPAIR" on the external Hive table whenever a new partition is added.
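A sketch of that overlay table (the LOCATION must point at the existing table's HDFS path, which is hypothetical here):

-- Sketch: same data files as the original table, but day_dt declared as STRING
CREATE EXTERNAL TABLE myschema.mytable_impala(
day_dt string,
event string)
PARTITIONED BY (day_id int)
STORED AS TEXTFILE
LOCATION '/hdfs/path/of/existing/table';  -- hypothetical path

MSCK REPAIR TABLE myschema.mytable_impala;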

Ingesting from Existing Table string field in Serde

I'm looking to parse a JSON string in Hive using a SerDe, but I don't see an easy way of doing so for a string that is already in a Hive table. Do you know how I can do this?
To make my scenario more understandable, here is a butchered example I may try:
ADD JAR hdfs:////user/d/libs/json-serde-1.3.8-jar-with-dependencies.jar;
CREATE Temporary TABLE TN (v string);
Insert overwrite table TN select '
[
{"t1":31646203,"t2":"h","s1":
[
{"r1":"w","r2":"w2"}
]
}
]' as v;
CREATE TABLE deserializeThis (jsonDeserialized array<struct<t1:int,t2:string,s1:array<struct<r1:string, r2:string>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
Insert overwrite table deserializeThis select v FROM TN;
Thanks for all your help!
In order to use a SerDe we have to operate at the file system level. To do this, we can push the information into a table at a known location and then read from that location using the SerDe.
This answer describes the above:
How can I parse a Json column of a Hive table using a Json serde?
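A sketch of that approach, continuing the example above (the staging path is hypothetical, and the column definition is reused from the question): write the JSON strings into a plain text table at a known location, then put a JSON SerDe table over the same location.

-- Sketch: plain text staging table whose files hold one JSON document per line
CREATE TABLE tn_staging (v string)
STORED AS TEXTFILE
LOCATION '/user/d/tmp/tn_json';  -- hypothetical path

INSERT OVERWRITE TABLE tn_staging SELECT v FROM TN;

-- JSON SerDe table over the same location; query the parsed structure from here
CREATE EXTERNAL TABLE deserialize_this (
jsonDeserialized array<struct<t1:int,t2:string,s1:array<struct<r1:string,r2:string>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/d/tmp/tn_json';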