Parquet with Null Value for column is converted to Integer - google-bigquery

I'm using Python pandas to write a DataFrame to Parquet in GCS, then using the BigQuery Data Transfer Service (BQDTS) to transfer the Parquet file to a BigQuery table. Sometimes, when the DataFrame is small, an entire column contains only NULL values. When this happens, BigQuery treats that all-NULL column as INTEGER instead of the type the Parquet file declares.
When I try to append it to an existing table that expects that column to be NULLABLE STRING, the transfer fails with INVALID_ARGUMENT: Provided Schema does not match Table project.dataset.dataset_health_reports. Field asin has changed type from STRING to INTEGER; JobID: xxx
When I use BQDTS to write the Parquet file to a new table, it can create the table, but the all-NULL column becomes INTEGER.
Any idea how to make BQDTS respect the original type, or how to specify the types manually?

To remedy this issue, you can pre-define the schema for columns whose type can be ambiguous. For example, if I want the street_address_two column to be a string, I can define the schema argument of LoadJobConfig as:
[bigquery.SchemaField("street_address_two", "STRING")]
The code will look like:
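from google.cloud import bigquery

# Declare the ambiguous column explicitly so its type is not inferred from all-NULL data.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("street_address_two", "STRING"),
    ],
    source_format=bigquery.SourceFormat.PARQUET,
)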
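To know which columns need an explicit type in a given batch, one option is to check for columns that are NULL in every row before writing the file. The following is only a sketch using plain Python dicts rather than a DataFrame; the helper names and the "score" column are hypothetical, and the name/type/mode keys are assumed to follow BigQuery's JSON schema layout.

```python
from typing import Any, Dict, List


def all_null_columns(rows: List[Dict[str, Any]]) -> List[str]:
    """Return the column names whose value is None (or missing) in every row."""
    columns: List[str] = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return [c for c in columns if all(r.get(c) is None for r in rows)]


def schema_entries(expected_types: Dict[str, str], columns: List[str]) -> List[Dict[str, str]]:
    """Build schema entries (name/type/mode) for the given columns."""
    return [
        {"name": c, "type": expected_types[c], "mode": "NULLABLE"}
        for c in columns
    ]


# Hypothetical batch: "asin" is entirely NULL, so its type would be ambiguous.
rows = [
    {"asin": None, "score": 1.5},
    {"asin": None, "score": 2.0},
]
expected = {"asin": "STRING", "score": "FLOAT"}  # types the target table expects

ambiguous = all_null_columns(rows)
print(schema_entries(expected, ambiguous))
```

The resulting entries identify the fields that would need an explicit SchemaField in the load configuration.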

Related

Exploding array of struct, spark-sql returns null for orc file format

I have an external table that stores its data in the ORC file format. When I try to explode the array<struct> field from the table, it returns null.
Setting spark.sql.hive.convertMetastoreOrc=false works for me.
https://spark.apache.org/docs/latest/configuration.html

AWS Athena: HIVE_BAD_DATA ERROR: Field type DOUBLE in parquet is incompatible with type defined in table schema

I use AWS Athena to query some data stored in S3, namely partitioned parquet files with pyarrow compression.
I have three columns with string values, one column called "key" with int values, and one column called "result" which has both double and int values.
With those columns, I created a schema like:
create external table (
key int,
result double,
location string,
vehicle_name string,
filename string
)
When I queried the table, I would get
HIVE_BAD_DATA: Field results type INT64 in parquet is incompatible with type DOUBLE defined in table schema
So I changed the schema so that the result column has datatype INT.
Then I queried the table and got
HIVE_BAD_DATA: Field results type DOUBLE in parquet is incompatible with type INT defined in table schema
I've looked around to try to understand why this might happen but found no solution.
Any suggestion is much appreciated.
It sounds to me like you have some files where the column is typed as double and some where it is typed as int. When you type the column of the table as double Athena will eventually read a file where the corresponding column is int and throw this error, and vice versa if you type the table column as int.
Athena doesn't do type coercion as far as I can tell, but even if it did, the types are not compatible: a DOUBLE column in Athena can't represent all possible values of a Parquet INT64 column, and an INT column in Athena can't represent a floating point number (and a BIGINT column is required in Athena for a Parquet INT64).
The solution is to make sure your files all have the same schema. You probably need to be explicit in the code that produces the files about what schema to produce (e.g. make it always use DOUBLE).
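As a rough illustration of being explicit in the producing code, the sketch below casts the result field to float for every record before it is handed to whatever writes the Parquet files. It assumes the writer infers the column type from the values; normalize_result is a hypothetical helper, not part of any library.

```python
from typing import Any, Dict, Iterable, List


def normalize_result(records: Iterable[Dict[str, Any]], field: str = "result") -> List[Dict[str, Any]]:
    """Return copies of the records with `field` cast to float (None is kept as-is)."""
    normalized = []
    for record in records:
        fixed = dict(record)  # copy so the caller's data is not mutated
        value = fixed.get(field)
        if value is not None:
            fixed[field] = float(value)
        normalized.append(fixed)
    return normalized


# One record has an int result, the other a float; after normalization both are floats.
batch = [{"key": 1, "result": 3}, {"key": 2, "result": 2.5}]
clean = normalize_result(batch)
print(clean)
```

With every value a float, each file should end up with the same column type, which is what the DOUBLE table definition expects.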

Why array values appear in impala but not hive?

I have a column defined as an array in my Hive table.
create external table rule (
id string,
names array<string>
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '|'
stored as parquet
location 'hdfs://folder'
Example of a value in names: Joe|Jimmy
When I query the table in Impala, I get the data back, but in Hive I only get NULL. Why this behavior? I would even understand the inverse.
I found the answer: the data was written by a Spark job as a string instead of an array.

Is there a way to define replacement of one string with another in external table creation in Greenplum?

I need to create an external table for an HDFS location. The data has null instead of an empty space in a few fields. If the field length is less than 4 for such fields, selecting the data throws an error. Is there a way to define replacement of all such nulls with an empty space while creating the table itself?
I am trying this in Greenplum; I also tagged hive to see what can be done for such cases in Hive.
You could use the serialization property for mapping NULL string to empty string.
CREATE TABLE IF NOT EXISTS abc ( ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE TBLPROPERTIES ("serialization.null.format"="")
In this case, when you query it from Hive you would get an empty value for that field, and HDFS would have "\N".
Or
If you want to represent an empty string instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE tabname SELECT NULL, COALESCE(NULL,"") FROM data_table;
The answer to the problem is to use the NULL AS 'null' clause in the create table syntax for Greenplum. As I mentioned, I wanted input from people who had faced such issues in Hive, so I tagged hive as well. Greenplum's external table syntax supports a NULL AS phrase, in which you can specify the string that represents NULL.

Create hive timestamp from pig

How can I create a timestamp field in Pig from a string that Hive accepts as a timestamp?
I have formatted the string in Pig to match the timestamp format in Hive, but after loading it is null instead of showing the date.
2014-04-10 09:45:56 is what the format looks like in Pig, and it matches the Hive timestamp format, but it cannot be loaded (it only works if I load it into a string field).
Any ideas why?
Quick update: no HCatalog is available.
The problem is that in some cases the timestamp field contains null values, and then all the fields become null when using the timestamp data type. When loading timestamps into a column where every row is in the above format, it works fine. So the real question is how null values can be handled.
I suspect you have written your data to HDFS using PigStorage and you want to load it into a Hive table. The problem is that a missing tuple field will be written by Pig as null which will be treated by Hive 0.11 as null. So far so good.
But then all the subsequent fields will be treated as null as well, even though they may have different values. Hive 0.12 doesn't have this issue.
Depending on the SerDe type, Hive can interpret different strings as null. In case of LazySimpleSerDe it is \N.
You have two options:
1. Set the table's null format property to the empty string, which is what Pig produces.
2. Store \N in Pig for null fields.
E.g:
Given the following data in Pig 0.11 :
A = load 'data' as (txt:chararray, ts:chararray);
dump A;
(a,2014-04-10 09:45:56)
(b,2014-04-11 10:45:56)
(,)
(e,2014-04-12 11:45:56)
Option 1:
store A into '/user/data';
Hive 0.11 :
CREATE EXTERNAL TABLE test (txt string, tms TimeStamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/data';
alter table test SET SERDEPROPERTIES('serialization.null.format' = '');
Option 2:
...
B = foreach A generate txt, (ts is null?'\\N':ts);
store B into '/user/data';
Then create the table in Hive without setting the serde property.
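The difference between the two options can be modelled roughly in Python. This is a simplified sketch of how a configurable null format changes the way a tab-delimited line is read; parse_line is a hypothetical helper, it is not Hive's actual SerDe code, and it does not model conversion to the timestamp type.

```python
from typing import List, Optional


def parse_line(line: str, null_format: str = "\\N", delimiter: str = "\t") -> List[Optional[str]]:
    """Split a delimited line and map fields equal to null_format to None."""
    return [
        None if field == null_format else field
        for field in line.rstrip("\n").split(delimiter)
    ]


# A regular row reads the same under either setting.
print(parse_line("a\t2014-04-10 09:45:56"))

# Option 1: the (,) tuple is stored as empty fields, and the null format is set to ''.
print(parse_line("\t", null_format=""))

# Option 2: the null ts is stored as \N, which matches the default null format.
print(parse_line("\t\\N"))
```

In this model, Option 1 turns every empty field into a null, while Option 2 only marks the ts field, because the Pig expression in Option 2 replaces ts alone.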