Is there a way to replace one string with another when creating an external table in Greenplum? - hive

I need to create an external table for an HDFS location. The data has null instead of an empty space in a few fields. If the field length is less than 4 for such fields, selecting the data throws an error. Is there a way to replace all such nulls with an empty space while creating the table itself?
I am trying this in Greenplum; I tagged hive just to see what can be done for such cases in Hive.

You could use the serialization property for mapping NULL string to empty string.
CREATE TABLE IF NOT EXISTS abc ( ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE TBLPROPERTIES ("serialization.null.format"="")
With this setting, Hive treats an empty field in the underlying HDFS file as NULL when you query the table; without it, the default NULL marker in the file is "\N".
Or
If you want to store an empty string instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE TABLE tabname SELECT NULL, COALESCE(NULL,"") FROM data_table;
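If the files contain the literal text null rather than an empty field, one option in Hive is to declare that text as the NULL marker when creating the table. The following is a minimal sketch, assuming Hive 0.13 or later (where the NULL DEFINED AS clause is available); the table name, columns, and location are hypothetical placeholders.

```sql
-- Hypothetical table name, columns, and HDFS path; only NULL DEFINED AS is the point here.
-- Fields whose text is exactly 'null' are read as NULL.
CREATE EXTERNAL TABLE IF NOT EXISTS abc_ext (
  id   STRING,
  code STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  NULL DEFINED AS 'null'
STORED AS TEXTFILE
LOCATION '/path/to/data';

-- Optionally turn those NULLs into empty strings at query time.
SELECT id, COALESCE(code, '') AS code FROM abc_ext;
```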

The answer to the problem is to use the NULL AS 'null' clause in the create table syntax for Greenplum. As I mentioned, I wanted input from people who have faced such issues in Hive, so I tagged hive as well. Greenplum's external table syntax supports a NULL AS clause, in which you can specify the string that should be treated as NULL.
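As an illustration of the NULL AS clause, here is a hedged sketch of a Greenplum external table definition. The table name, columns, host, and path are hypothetical, and the LOCATION protocol depends on the Greenplum version and setup (gphdfs is assumed here).

```sql
-- Hypothetical table, columns, host, and path; gphdfs is an assumed protocol.
-- NULL AS 'null' tells Greenplum to read the text 'null' as SQL NULL.
CREATE EXTERNAL TABLE ext_abc (
    id   text,
    code text
)
LOCATION ('gphdfs://namenode:8020/path/to/data')
FORMAT 'TEXT' (DELIMITER '|' NULL AS 'null');
```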

Related

Why is DataStage writing NULL string values as empty strings, while other data types correctly have NULL values

I have a DataStage parallel job that writes to Hive as the final stage in a long job. I can view the data that is about to be written and there are many NULL strings that I want to see in the Hive table.
However, when I view the table that is created, there are no NULL strings, they all get converted into empty strings '' instead. I can see other datatypes, like DECIMAL(5,0) have NULL values and I can select these, e.g.
SELECT * FROM mytable WHERE decimal_column IS NULL;
The process for writing to Hive is to store the data in a staging table in a delimited text format. This is then pushed through a generic CDC process and results in data being written to a new partition in an ORC format table.
The only option I can see for handling NULL values is "Null Value" in the HDFS File Connector Stage. If I leave this blank then I get empty strings and if I type in 'NULL' then 'NULL' is what I get, i.e. not a NULL, but the string 'NULL'.
I can't change the process as it's in place for literally thousands of jobs already. Is there any way to get my string values to be NULL or am I stuck with empty strings?
According to the IBM documentation, specifying an empty string as two double quotation marks ("") should help.
Null value
Specify the character or string that represents null values in the data. For a source stage, input data that has the value that you specify is set to null on the output link. For a target stage, in the output file that is written to the file system, null values are represented by the value that is specified for this property. To specify that an empty string represents a null value, specify "" (two double quotation marks).
Source: https://www.ibm.com/docs/en/iis/11.7?topic=reference-properties-file-connector

select row from orc snappy table in hive

I have created a table employee_orc, which is in ORC format with Snappy compression.
create table employee_orc(emp_id string, name string)
row format delimited fields terminated by '\t' stored as orc tblproperties("orc.compress"="SNAPPY");
I have uploaded data into the table using the insert statement.
employee_orc table has 1000 records.
When I run the below query, it shows all the records
select * from employee_orc;
But when I run the below query, it shows zero results even though the records exist.
select * from employee_orc where emp_id = "EMP456";
Why am I unable to retrieve a single record from the employee_orc table?
The record does not exist. You may think the values are the same because they look the same, but there is some difference. One possibility is spaces at the beginning or end of the string. For this, you can use like:
where emp_id like '%EMP456%'
This might help you.
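To check whether hidden whitespace is the cause, it can help to compare the stored length with the expected one and to match on the trimmed value. This is a small sketch using Hive's built-in length and trim functions on the table from the question. Note that trim only strips space characters, so other invisible characters (tabs, for example) would need a different approach.

```sql
-- Show the raw value and its length; 'EMP456' should have length 6.
SELECT emp_id, length(emp_id) AS len
FROM employee_orc
WHERE emp_id LIKE '%EMP456%';

-- Match after stripping leading and trailing spaces.
SELECT *
FROM employee_orc
WHERE trim(emp_id) = 'EMP456';
```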
For my part, I don't understand why you want to specify a delimiter for ORC. Are you confusing CSV with ORC, or external with managed tables?
I advise you to create your table differently:
create table employee_orc(emp_id string, name string)
stored as ORC
TBLPROPERTIES (
"orc.compress"="ZLIB");

Why do array values appear in Impala but not in Hive?

I have a column defined as an array in my Hive table.
create external table rule (
id string,
names array<string>
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '|'
stored as parquet
location 'hdfs://folder'
Example of a value in names: Joe|Jimmy
When I query the table in Impala, I retrieve the data, but in Hive I only get NULL. Why this behavior? I would even understand the inverse.
I found the answer. The data was written by a Spark job as a string instead of an array.
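If rewriting the Spark job is not an option, one possible workaround is to declare the column with the type that was actually written and split it at query time. This sketch assumes the Parquet files hold names as a single pipe-delimited string; the table name rule_raw is hypothetical.

```sql
-- Hypothetical table: declare names as the string the files really contain.
CREATE EXTERNAL TABLE rule_raw (
  id    string,
  names string
)
STORED AS PARQUET
LOCATION 'hdfs://folder';

-- split() takes a regular expression, so the pipe has to be escaped.
SELECT id, split(names, '\\|') AS names
FROM rule_raw;
```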

Create hive timestamp from pig

How can I create a timestamp field in Pig from a string so that Hive accepts it as a timestamp?
I have formatted the string in Pig to match the timestamp format in Hive, but after loading it is null instead of showing the date.
2014-04-10 09:45:56 is what the value looks like in Pig, and it matches the Hive timestamp format, but it cannot be loaded (it only works if I load it into a string field).
Any ideas why?
Quick update: no HCatalog is available.
The problem is that in some cases the timestamp field contains null values, and all the fields become null when using the timestamp data type. When loading timestamps into a column where every row is in the above format, it works fine. So the real question is how null values can be handled.
I suspect you have written your data to HDFS using PigStorage and you want to load it into a Hive table. The problem is that a missing tuple field is written by Pig as an empty value, which Hive 0.11 treats as null. So far so good.
But then all the subsequent fields are also treated as null, even though they may have different values. Hive 0.12 doesn't have this issue.
Depending on the SerDe type, Hive can interpret different strings as null. In case of LazySimpleSerDe it is \N.
You have two options:
set the table's null format property to the empty string which is produced by Pig
or store \N in Pig for null fields
E.g:
Given the following data in Pig 0.11 :
A = load 'data' as (txt:chararray, ts:chararray);
dump A;
(a,2014-04-10 09:45:56)
(b,2014-04-11 10:45:56)
(,)
(e,2014-04-12 11:45:56)
Option 1:
store A into '/user/data';
Hive 0.11 :
CREATE EXTERNAL TABLE test (txt string, tms TimeStamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/data';
alter table test SET SERDEPROPERTIES('serialization.null.format' = '');
Option 2:
...
B = foreach A generate txt, (ts is null?'\\N':ts);
store B into '/user/data';
Then create the table in Hive without setting the serde property.

Hive non consistent input file

I have an inconsistent log file which I would like to partition with Hive using dynamic partitioning. File example:
20/06/13 20:21:42.637 FLW CPTView::OnInitialUpdate nRemoveAppShareQSize0=50000\n
20/06/13 20:21:42.638 FLW \n
BandwidthGlobalSettings:Old Bandwidth common defines\n
Sometimes the log file contains a line that starts with a word rather than a date. Each line is delimited with \n.
I am running commands:
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_temp (date STRING,time STRING,severity STRING,message STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040' LOCATION '/examples/hive/tmp';
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_partitioned (time STRING,severity STRING,message STRING) PARTITIONED BY (date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040' LOCATION '/examples/hive/partitions';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM log_messages_temp pvs INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date) SELECT pvs.time, pvs.severity, pvs.message, pvs.date;
As a result, two dynamic partitions were created: date=20/06/13 and date=BandwidthGlobalSettings:Old.
I would like to tell Hive to ignore lines that do not start with a date string.
How can I do this? Or is there another solution?
Thanks.
I think you can write a UDF that uses a regular expression to keep only values in the date format (e.g. 20/06/13) and discard all others, such as "BandwidthGlobalSettings:Old". You can use this UDF in your last query when inserting into the final table.
I hope this helps.
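For a simple pattern like this, a custom UDF may not be necessary: Hive's built-in RLIKE operator can filter on a regular expression directly. The sketch below reuses the tables and column names from the question and assumes dates always look like NN/NN/NN. Newer Hive versions may treat date as a reserved word and require backticks around it. Lines that do not start with a date are dropped entirely, so continuation text such as the "BandwidthGlobalSettings" line is lost rather than attached to the previous message.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Keep only rows whose first field is a two-digit/two-digit/two-digit date.
FROM log_messages_temp pvs
INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date)
SELECT pvs.time, pvs.severity, pvs.message, pvs.date
WHERE pvs.date RLIKE '^[0-9]{2}/[0-9]{2}/[0-9]{2}$';
```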