Select row from ORC Snappy table in Hive - SQL

I have created a table employee_orc in ORC format with Snappy compression:
create table employee_orc(emp_id string, name string)
row format delimited fields terminated by '\t' stored as orc tblproperties("orc.compress"="SNAPPY");
I have loaded data into the table using an INSERT statement. The employee_orc table has 1000 records.
When I run the query below, it shows all the records:
select * from employee_orc;
But when I run the query below, it shows zero results even though the record exists:
select * from employee_orc where emp_id = "EMP456";
Why am I unable to retrieve a single record from the employee_orc table?

The record probably does not match exactly. The values may look the same, but there is some difference; one possibility is spaces at the beginning or end of the string. To check for this, you can use like:
where emp_id like '%EMP456%'
This might help you.
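If you want to confirm whether hidden whitespace is the cause, a quick check using Hive's built-in trim and length functions might look like this (a sketch, not part of the original answer):
-- compare on the trimmed value and inspect the stored length
select emp_id, length(emp_id) from employee_orc where trim(emp_id) = 'EMP456';
If this returns the row while the exact-match query does not, the stored value carries extra whitespace.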

On my part, I don't understand why you want to specify a delimiter for ORC; the ROW FORMAT DELIMITED clause applies to text files and is ignored for ORC storage. Are you confusing CSV and ORC, or external vs. managed tables?
I advise you to create your table differently:
create table employee_orc(emp_id string, name string)
stored as ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

Related

Is there a way to specify Date/Timestamp format for the incoming data within the Hive CREATE TABLE statement itself?

I have CSV files which contain date and timestamp values in the formats below. E.g.:
Col1|col2
01JAN2019|01JAN2019:17:34:41
But when I define Col1 as date and Col2 as timestamp in my CREATE statement, the Hive table simply returns NULL when I query it.
CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table
(Col1 date,
Col2 timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'my_path';
Instead, if I define the data types simply as string then it works, but that's not how I want my tables to be.
I want the table to be able to read the incoming data in the correct types. How can I achieve this? Is it possible to define the expected format of the incoming data in the CREATE statement itself?
Can someone please help?
As of Hive 1.2.0 it is possible to provide the additional SerDe property "timestamp.formats". See this Jira for more details: HIVE-9298
ALTER TABLE timestamp_formats SET SERDEPROPERTIES ("timestamp.formats"="ddMMMyyyy:HH:mm:ss");
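If you would rather set this at creation time instead of via ALTER TABLE, a sketch follows. It assumes the default LazySimpleSerDe; note that per HIVE-9298 the property applies to timestamp columns only, so a date like 01JAN2019 would likely still need to be read as a string and converted afterwards:
CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table (
Col1 string, -- no equivalent property exists for date parsing; convert later
Col2 timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = '|',
'timestamp.formats' = 'ddMMMyyyy:HH:mm:ss'
)
STORED AS TEXTFILE
LOCATION 'my_path';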

Create a table in Hive and populate it with data

While trying to load data into a Hive table I encountered behavior that looks strange to me. My data is made up of JSON objects loaded as records into a table called twitter_test containing a single column named "json".
Now I want to extract three fields from each JSON object and build a new table called "my_twitter". I thus issue the command:
CREATE TABLE my_twitter AS
SELECT
regexp_replace(get_json_object(t.json, '$.body[0]'), '\n', '') as text,
get_json_object(t.json, '$.publishingdate[0]') as created_at,
get_json_object(t.json, '$.author_screen_name[0]') as author
FROM twitter_test AS t;
The result is a table with three columns that contains no data. However, if I run the SELECT command alone it returns data as expected.
By trial and error I found out that I need to add LIMIT x at the end of the query for data to be inserted into the new table. The question is: why?
Furthermore, it seems strange that I need to know in advance the number x of rows returned by the SELECT statement for the CREATE to work correctly. Is there any workaround?
You could create a table over this JSON data using the JSON SerDe, which parses the JSON objects so that you can select each individual column directly.
Below is a sample Hive DDL for creating a JSON table using the JSON SerDe:
CREATE EXTERNAL TABLE `json_table`(
A string,
B string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'PATH'
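As a hypothetical follow-up, if the columns were declared to match the JSON keys from the question (body, publishingdate, author_screen_name instead of A and B), the CTAS would reduce to a plain select:
-- assumes json_table was declared with columns body, publishingdate, author_screen_name
CREATE TABLE my_twitter AS
SELECT body AS text, publishingdate AS created_at, author_screen_name AS author
FROM json_table;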

Is there a way to define replacement of one string with another in external table creation in Greenplum?

I need to create an external table for an HDFS location. The data has null instead of empty space for a few fields. If the field length is less than 4 for such fields, it throws an error when selecting data. Is there a way to define replacement of all such nulls with empty space while creating the table itself?
I am trying this in Greenplum; I just tagged Hive to see what can be done for such cases in Hive.
You could use the serialization property for mapping a NULL string to an empty string.
CREATE TABLE IF NOT EXISTS abc ( )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("serialization.null.format"="")
In this case, when you query it from Hive you would get an empty value for that field, and HDFS would have "\N".
Or, if you want to represent an empty string instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE TABLE tabname SELECT NULL, COALESCE(NULL, "") FROM data_table;
The answer to the problem is using the NULL AS 'null' clause in the CREATE TABLE syntax for Greenplum. As I mentioned, I wanted to get a few inputs from people who have faced such issues in Hive, so I tagged Hive as well. But Greenplum's external table syntax supports a NULL AS phrase in which you can specify the form of NULL that you want to keep.
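For reference, a hedged sketch of that Greenplum syntax; the table name, columns, and location here are hypothetical:
CREATE EXTERNAL TABLE abc_ext (col1 text, col2 text)
LOCATION ('gphdfs://namenode:8020/path/to/data') -- hypothetical HDFS location
FORMAT 'TEXT' (DELIMITER '|' NULL AS 'null');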

Join two hive tables and search for a string

I am new to Hive and I have two tables which contain access logs, created like this:
CREATE EXTERNAL TABLE rwloglines(line string) STORED AS TEXTFILE LOCATION 'hdfs:///rwlogs'
CREATE EXTERNAL TABLE dpxloglines(line string) STORED AS TEXTFILE LOCATION 'hdfs:///dpxlogs'
Both of these will contain an ID which is made of 20 characters [A-Z][0-9]. I want to join these two tables and search for the ID. What query should I write in Hive?
Can someone please help me.
The easiest approach would be to split the contents of each file into various columns like id, ip address, error message, etc., and then load it into a Hive table specifying these columns in its schema.
Then
select a.id from rwloglines a join dpxloglines b on a.id = b.id where a.id = '';
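If restructuring the files is not an option, a sketch that extracts the ID directly from the raw line column with Hive's regexp_extract might look like this (the [A-Z0-9]{20} pattern and the sample ID value are assumptions based on the question):
-- pull the 20-character ID out of each raw log line, then join on it
SELECT a.id
FROM (SELECT regexp_extract(line, '([A-Z0-9]{20})', 1) AS id FROM rwloglines) a
JOIN (SELECT regexp_extract(line, '([A-Z0-9]{20})', 1) AS id FROM dpxloglines) b
ON a.id = b.id
WHERE a.id = 'ABCDE12345FGHIJ67890'; -- hypothetical ID to search for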

Hive inconsistent input file

I have an inconsistent log file which I would like to partition with Hive using dynamic partitioning. File example:
20/06/13 20:21:42.637 FLW CPTView::OnInitialUpdate nRemoveAppShareQSize0=50000\n
20/06/13 20:21:42.638 FLW \n
BandwidthGlobalSettings:Old Bandwidth common defines\n
Sometimes the log file contains a line which starts with some word other than a date. Each line is delimited by \n.
I am running commands:
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_temp (date STRING,time STRING,severity STRING,message STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040' LOCATION '/examples/hive/tmp';
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_partitioned (time STRING,severity STRING,message STRING) PARTITIONED BY (date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040' LOCATION '/examples/hive/partitions';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM log_messages_temp pvs INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date) SELECT pvs.time, pvs.severity, pvs.message, pvs.date;
As a result two dynamic partitions were created: date=20/06/13 and date=BandwidthGlobalSettings:Old
I would like to tell Hive to ignore lines that do not start with a date string.
How can I do this? Or does another solution exist?
Thanks.
I think you can write a UDF that uses a regular expression to accept only the date format (e.g. 20/06/13) and discard all others, like "BandwidthGlobalSettings:Old". You can use this UDF in your last query while inserting into the final table.
I hope this explanation helps with your requirement.
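Alternatively, a plain WHERE filter with Hive's built-in RLIKE operator may be enough, avoiding a custom UDF. A sketch against the tables above; the two-digit dd/dd/dd pattern is assumed from the sample log:
-- keep only rows whose first field actually looks like a date
FROM log_messages_temp pvs
INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date)
SELECT pvs.time, pvs.severity, pvs.message, pvs.date
WHERE pvs.date RLIKE '^[0-9]{2}/[0-9]{2}/[0-9]{2}$';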