How to extract null values from Bigquery table as a TableRow object - google-bigquery

I am trying to extract data from a BigQuery table using Google Cloud Dataflow.
My BigQuery table has few empty values(for String datatype) and null (for Numeric data types).
When I try to extract the data in dataflow using BigQueryIO.readTableRows().fromQuery(select * from table_name), I don't see the columns with null values.
How can I achieve this to get all the columns as part of the TableRow object?
Any help is appreciated

I believe this is the current behavior BigQueryIO connector. Null values are ignored in the resulting element. But empty string values should be available. Can you just assume values that are not available in the resulting element to be null ?

Related

Parquet with Null Value for column is converted to Integer

I'm using python pandas to write a DataFrame to parquet in GCS, then using Bigquery Transfer Service to transfer the GCS parquet file to a Bigquery table. Sometimes when the DataFrame is small, an entire column might have NULL values. When this occurs, Bigquery treats that null value column as an INTEGER type instead of what the parquet claims it to be.
When trying to append it to an existing table that expects that column to be NULLABLE STRING, Big Query Transfer Service will fail with INVALID_ARGUMENT: Provided Schema does not match Table project.dataset.dataset_health_reports. Field asin has changed type from STRING to INTEGER; JobID: xxx
When I use BQDTS to write the parquet to a new table, it can create the table, but the null column becomes an Integer type.
Any idea how to make BQDTS respect the original type or to manually specify types?
to remedy this issue you can pre-define the schema for columns which can be ambigous. For example I want the street_address_two column to be string then I can define the schema argument in LoadJobConfig as:
[bigquery.SchemaField("street_address_two", "STRING")].
The code will look like:
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("street_address_two", "STRING")
],
source_format=bigquery.SourceFormat.PARQUET,
)

Why is DataStage writing NULL string values as empty strings, while other data types correctly have NULL values

I have a DataStage parallel job that writes to Hive as the final stage in a long job. I can view the data that is about to be written and there are many NULL strings that I want to see in the Hive table.
However, when I view the table that is created, there are no NULL strings, they all get converted into empty strings '' instead. I can see other datatypes, like DECIMAL(5,0) have NULL values and I can select these, e.g.
SELECT * FROM mytable WHERE decimal_column IS NULL;
The process for writing to Hive is to store the data in a staging table in a delimited text format. This is then pushed through a generic CDC process and results in data being written to a new partition in an ORC format table.
The only option I can see for handling NULL values is "Null Value" in the HDFS File Connector Stage. If I leave this blank then I get empty strings and if I type in 'NULL' then 'NULL' is what I get, i.e. not a NULL, but the string 'NULL'.
I can't change the process as it's in place for literally thousands of jobs already. Is there any way to get my string values to be NULL or am I stuck with empty strings?
According to the IBM documentation, an empty String in double-quotation "" should help.
Null value
Specify the character or string that represents null values in the data. For a source stage, input data that has the value
that you specify is set to null on the output link. For a target
stage, in the output file that is written to the file system, null
values are represented by the value that is specified for this
property. To specify that an empty string represents a null value,
specify "" (two double quotation marks).
Source: https://www.ibm.com/docs/en/iis/11.7?topic=reference-properties-file-connector

Why array values appear in impala but not hive?

I have a column defined as array in my table (HIVE) .
create external table rule
id string,
names array<string>
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '|'stored as parquet
location 'hdfs://folder'
Exemple of value in names : Joe|Jimmy
As i query the table in Impala, i retrieve the data but in hive i only have NULL. Why this behavior? I would even understand the inverse.
I found the answer. the data was written from a spark job in string instead of array.

Is there a way to define replacement of one string to other in external table creation in greenplum.?

I need to create external table for a hdfs location. The data is having null instead of empty space for few fields. If the field length is less than 4 for such fields, it is throwing error when selecting data. Is there a way to define replacement of all such nulls with empty space while creating table it self.?
I am trying it in greenplum, just tagged hive to see what can be done for such cases in hive.
You could use the serialization property for mapping NULL string to empty string.
CREATE TABLE IF NOT EXISTS abc ( ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE TBLPROPERTIES ("serialization.null.format"="")
In this case when you query it from hive you would get empty value for that field and hdfs would have "\N".
Or
If you want to represented empty string instead of '\N', you can using COALESCE function:
INSERT OVERWRITE tabname SELECT NULL, COALESCE(NULL,"") FROM data_table;
the answer to the problem is using NULL as 'null' statement in create table syntax for greenplum. As i have mentioned, i wanted to get few inputs from people who faced such issues in hive. so i have tagged hive as well. But, greenplum external table syntax supports NULL AS phrase in which we can specify the form of NULL that you want to keep.

Create hive timestamp from pig

How i can create a timestamp field in pig from a string that hive accepts as timestamp?
I have formatted the string in pig to match timestamp format in hive, but after loading it is null instead of showing the date.
2014-04-10 09:45:56 this is how the format looks like in pig, and this is matching the format with hive timestamp, but cannot load. (only if i load into string field)
any ideas why?
quick update: no hcatalog is available
problem is some case the timestamp fields contains null values and all the filed become null when using timestamp data type. When putting timestamp to a column where all the row is in the above format it works fine. So the real question is how null values can be handle
I suspect you have written your data to HDFS using PigStorage and you want to load it into a Hive table. The problem is that a missing tuple field will be written by Pig as null which will be treated by Hive 0.11 as null. So far so good.
But then all the subsequent fields will be treated as null, however they can have different values. Hive 0.12 doesn't have this issue.
Depending on the SerDe type, Hive can interpret different strings as null. In case of LazySimpleSerDe it is \N.
You have two option:
set the table's null format property to the empty string which is produced by Pig
or store \N in Pig for null fields
E.g:
Given the following data in Pig 0.11 :
A = load 'data' as (txt:chararray, ts:chararray);
dump A;
(a,2014-04-10 09:45:56)
(b,2014-04-11 10:45:56)
(,)
(e,2014-04-12 11:45:56)
Option 1:
store A into '/user/data';
Hive 0.11 :
CREATE EXTERNAL TABLE test (txt string, tms TimeStamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/data';
alter table test SET SERDEPROPERTIES('serialization.null.format' = '');
Option 2:
...
B = foreach A generate txt, (ts is null?'\\N':ts);
store B into '/user/data';
Then create the table in Hive without setting the serde property.