Why is DataStage writing NULL string values as empty strings, while other data types correctly have NULL values - hive

I have a DataStage parallel job that writes to Hive as the final stage in a long job. I can view the data that is about to be written, and there are many NULL strings that I want to appear as NULLs in the Hive table.
However, when I view the table that is created, there are no NULL strings; they have all been converted into empty strings ('') instead. I can see that other data types, such as DECIMAL(5,0), have NULL values, and I can select on these, e.g.
SELECT * FROM mytable WHERE decimal_column IS NULL;
The process for writing to Hive is to store the data in a staging table in a delimited text format. This is then pushed through a generic CDC process and results in data being written to a new partition in an ORC format table.
The only option I can see for handling NULL values is "Null Value" in the HDFS File Connector Stage. If I leave this blank, I get empty strings; if I type in NULL, then the literal string 'NULL' is what I get, i.e. not a NULL but the string 'NULL'.
I can't change the process as it's in place for literally thousands of jobs already. Is there any way to get my string values to be NULL or am I stuck with empty strings?

According to the IBM documentation, specifying an empty string in double quotation marks ("") should help:
Null value
Specify the character or string that represents null values in the data. For a source stage, input data that has the value that you specify is set to null on the output link. For a target stage, in the output file that is written to the file system, null values are represented by the value that is specified for this property. To specify that an empty string represents a null value, specify "" (two double quotation marks).
Source: https://www.ibm.com/docs/en/iis/11.7?topic=reference-properties-file-connector
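For example, after setting the Null Value property to "" in the File Connector stage, a quick sanity check in Hive might look like this (string_column is a hypothetical column name; mytable is the table from the question):
-- Rows that previously came through as empty strings should now be real NULLs:
SELECT COUNT(*) FROM mytable WHERE string_column IS NULL;
-- ...and the empty-string count should drop to zero:
SELECT COUNT(*) FROM mytable WHERE string_column = '';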

Related

Can one map more than one string to NULL in an SQL COPY command? [duplicate]

I have a source of CSV files from a web query which contains two variations of a string that I would like to class as NULL when copying to a PostgreSQL table.
e.g.
COPY my_table FROM STDIN WITH CSV DELIMITER AS ',' NULL AS ('N/A', 'Not applicable');
I know this query will throw an error, so I'm looking for a way to specify two separate NULL strings in a COPY CSV query.
Since COPY does not support multiple NULL strings, I think your best bet is to set the NULL string argument to one of them and then, once everything is loaded, run an UPDATE that sets any value matching the other string to an actual NULL (the exact query would depend on which columns could contain those values).
If you have a bunch of columns, you could use CASE expressions in your SET clause to return NULL when a value matches your special string and the original value otherwise. NULLIF could also be used and is more compact, e.g. NULLIF(col1, 'Not applicable').
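A minimal sketch of that two-step approach, assuming hypothetical text columns col1 and col2 in my_table:
-- Load with one of the two strings declared as the NULL marker:
COPY my_table FROM STDIN WITH (FORMAT csv, DELIMITER ',', NULL 'N/A');
-- Then convert the second marker to real NULLs after the load:
UPDATE my_table
SET col1 = NULLIF(col1, 'Not applicable'),
    col2 = NULLIF(col2, 'Not applicable');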

How to get the NULL values named as some string after performing ROLLUP

How can I get the NULL values produced by ROLLUP named as some string?
Check out the "NULL in Result Sets" section at the link below:
https://technet.microsoft.com/en-us/library/bb522495(v=sql.105).aspx
Actually, I assumed you want to replace the NULLs with a string rather than rename the column...
Even when replacing, there is a difference between imputing original NULL values and labeling the NULLs that ROLLUP generates for the rolled-up totals.
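One way to separate the two cases is the GROUPING() function, which returns 1 only for the NULLs that ROLLUP itself generates (sales, region, and amount are hypothetical names here):
-- Label only the rolled-up total rows; original NULLs in region stay NULL:
SELECT
    CASE WHEN GROUPING(region) = 1 THEN 'All regions' ELSE region END AS region,
    SUM(amount) AS total_amount
FROM sales
GROUP BY ROLLUP(region);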

importing data with commas in numeric fields into redshift

I am importing data into redshift using the SQL COPY statement. The data has comma thousands separators in the numeric fields which the COPY statement rejects.
The COPY statement has a number of options to specify field separators, date and time formats and NULL values. However I do not see anything to specify number formatting.
Do I need to preprocess the data before loading, or is there a way to get Redshift to parse the numbers correctly?
Import the columns as a TEXT data type into a temporary table.
Then insert from the temporary table into your target table. Have the SELECT statement for the INSERT replace commas with empty strings and cast the values to the correct numeric type.
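A minimal sketch of that approach (staging, target_table, and the column names are hypothetical):
-- Stage the raw text, COPY the file into it, then cast on the way out:
CREATE TEMP TABLE staging (id VARCHAR(16), amount_raw VARCHAR(32));
-- COPY staging FROM 's3://...' ... (load the raw file here)
INSERT INTO target_table
SELECT id::INT, REPLACE(amount_raw, ',', '')::DECIMAL(12,2)
FROM staging;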

Is there a way to define replacement of one string with another in external table creation in Greenplum?

I need to create an external table for an HDFS location. The data has the string null instead of an empty space for a few fields. For fields whose defined length is less than 4, this throws an error when selecting the data. Is there a way to define replacement of all such nulls with an empty space while creating the table itself?
I am trying this in Greenplum; I just tagged hive to see what can be done for such cases in Hive.
You could use the serialization property for mapping the NULL string to an empty string.
CREATE TABLE IF NOT EXISTS abc ( ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE TBLPROPERTIES ("serialization.null.format"="")
In this case, when you query it from Hive you would get an empty value for that field, and HDFS would have "\N".
Or
If you want to represent an empty string instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE TABLE tabname SELECT NULL, COALESCE(NULL, "") FROM data_table;
The answer to the problem is using the NULL AS 'null' clause in the CREATE EXTERNAL TABLE syntax for Greenplum. As I mentioned, I wanted to get a few inputs from people who have faced such issues in Hive, so I tagged hive as well. But Greenplum's external table syntax supports a NULL AS phrase in which you can specify the form of NULL that you want to keep.
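A minimal sketch of that Greenplum syntax (the column names and HDFS location are hypothetical; older Greenplum releases read HDFS through the gphdfs protocol):
CREATE EXTERNAL TABLE abc_ext (col1 TEXT, col2 TEXT)
LOCATION ('gphdfs://namenode:8020/path/to/data')
FORMAT 'TEXT' (DELIMITER '|' NULL AS 'null');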

Create Hive timestamp from Pig

How can I create a timestamp field in Pig from a string that Hive accepts as a timestamp?
I have formatted the string in Pig to match the timestamp format in Hive, but after loading it is null instead of showing the date.
2014-04-10 09:45:56 is how the format looks in Pig, and this matches the Hive timestamp format, but it cannot be loaded (it only works if I load it into a string field).
Any ideas why?
Quick update: no HCatalog is available.
The problem is that in some cases the timestamp fields contain null values, and then all the fields become null when using the timestamp data type. When putting a timestamp into a column where every row is in the above format, it works fine. So the real question is how null values can be handled.
I suspect you have written your data to HDFS using PigStorage and you want to load it into a Hive table. The problem is that a missing tuple field will be written by Pig as an empty field, which Hive 0.11 will treat as null. So far so good.
But then all the subsequent fields will be treated as null as well, even though they can have different values. Hive 0.12 doesn't have this issue.
Depending on the SerDe type, Hive can interpret different strings as null. In the case of LazySimpleSerDe it is \N.
You have two options:
set the table's null format property to the empty string that is produced by Pig,
or store \N in Pig for null fields.
E.g., given the following data in Pig 0.11:
A = load 'data' as (txt:chararray, ts:chararray);
dump A;
(a,2014-04-10 09:45:56)
(b,2014-04-11 10:45:56)
(,)
(e,2014-04-12 11:45:56)
Option 1:
store A into '/user/data';
Hive 0.11:
CREATE EXTERNAL TABLE test (txt STRING, tms TIMESTAMP)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/data';
ALTER TABLE test SET SERDEPROPERTIES ('serialization.null.format' = '');
Option 2:
...
B = foreach A generate txt, (ts is null?'\\N':ts);
store B into '/user/data';
Then create the table in Hive without setting the serde property.
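With either option, the missing values should come back as real NULLs when you query the example table, e.g.:
SELECT txt, tms FROM test WHERE tms IS NULL;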