Pub/Sub Bigquery subscriptions schema with GEOGRAPHY fields - google-bigquery

I tested Pub/Sub BigQuery subscriptions with normal string, integer, and other typical field types, but I can't find a way to push GEOGRAPHY field types.
With streaming inserts I send the value as a GeoJSON string and BigQuery does the conversion to GEOGRAPHY, but with Pub/Sub, when I try to have GEOGRAPHY in BigQuery and string in the Avro topic schema, I get an error message:
Incompatible schema type for field 'gps': field is STRING in the topic
schema, but GEOGRAPHY in the BigQuery table schema.
How can this be done? Any suggestion?
Regards,
Rui
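For reference, here is a minimal sketch of the streaming-insert approach described in the question, where the geometry is sent as a GeoJSON (or WKT) string and BigQuery converts it into the GEOGRAPHY column; the project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical table with a GEOGRAPHY column named "gps".
table_id = "my-project.my_dataset.my_table"

rows = [
    # Streaming inserts accept GeoJSON or WKT strings for GEOGRAPHY columns.
    {"gps": '{"type": "Point", "coordinates": [-9.14, 38.72]}'},
]
errors = client.insert_rows_json(table_id, rows)
print(errors or "row streamed")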

Related

Why does loading Parquet files into BigQuery give me back gibberish values in the table?

When I load Parquet files into a BigQuery table, the stored values are weird. It seems to be related to the encoding of BYTES fields or something similar.
Here's the format of the created fields:
So when I read the table with the fields cast, I get readable values.
I found the solution here
My question is: why is BigQuery behaving like this?
According to this GCP documentation, some Parquet data types can be converted into multiple BigQuery data types. A workaround is to annotate the Parquet column with the data type you want BigQuery to use.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
Another way is to pass the schema to the bq load command:
bq load --source_format=PARQUET --noreplace --noautodetect --parquet_enum_as_string=true --decimal_target_types=STRING [project]:[dataset].[table] gs://[bucket]/[file].parquet Column_name:Data_type
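As a minimal sketch, the same idea with the BigQuery Python client instead of the bq CLI, passing an explicit schema and decimal target types; the bucket, project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    autodetect=False,                                           # like --noautodetect
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,   # like --noreplace
    decimal_target_types=["STRING"],                            # like --decimal_target_types=STRING
    schema=[bigquery.SchemaField("date_col", "DATE")],          # the BigQuery type you want
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/my-file.parquet",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish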

Pubsub Subscription error for TIMESTAMP data type: Incompatible schema type

I was working with the new BigQuery subscription for Pub/Sub and came across an error when trying to create the subscription:
Incompatible schema type for field 'timestamp': field is STRING in the topic schema, but TIMESTAMP in the BigQuery table schema.
However, the Pub/Sub documentation contains this statement:
When the type in the topic schema is a string and the type in the BigQuery table is TIMESTAMP, DATETIME, DATE, or TIME, then any value for this field in a Pub/Sub message must adhere to the format specified for the BigQuery data type.
As far as I understand, that means it's possible to insert into a TIMESTAMP column in BigQuery as long as the string complies with the canonical format defined for the TIMESTAMP data type.
Am I missing something?
For more info, this is the conflicting part of the BigQuery and Pub/Sub topic schemas that I use:
Bigquery schema
[
  {
    "description": "The exact time when this request is being sent. Set as UTC",
    "mode": "REQUIRED",
    "name": "timestamp",
    "type": "TIMESTAMP"
  }
]
Pubsub topic schema
syntax = "proto2";

message ExampleMessage {
  required string timestamp = 10;
}
Update: removed the space in the message name for the Pub/Sub topic
It looks like the issue isn't with the field timestamp itself, but with the name of the message type, Example Message. The message type name has a space, which should not be permitted. At this time, Pub/Sub's Protocol Buffer schema validation is more permissive than the standard parser and does not catch errors like this. If you change the name to ExampleMessage, it should work.
A fix for this issue has already been prepared. You can follow the progress of the fix in the public bug on issue tracker.
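For reference, once the schema is accepted, the string value itself only has to follow the canonical BigQuery TIMESTAMP format. A minimal sketch, assuming a hypothetical topic named example-topic whose ExampleMessage schema uses JSON encoding (project and topic names are placeholders):
import json
from datetime import datetime, timezone
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "example-topic")

# RFC 3339 / UTC string that BigQuery can write into a TIMESTAMP column.
message = {"timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")}
future = publisher.publish(topic_path, json.dumps(message).encode("utf-8"))
print(future.result())  # message ID once the publish succeeds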

Is there a way to avoid the data type conversion from STRING to STRUCT<string STRING, text STRING, provided STRING> for Datastore imports to BigQuery?

We are automatically loading Datastore backups into BigQuery for further analysis, overwriting the table every day.
When a Datastore Kind with at least one Entity with long text is imported in BigQuery, that field is automatically converted to a STRUCT<string STRING, text STRING, provided STRING> instead of a STRING field like all the other text/string fields. This then changes the schema of the BigQuery table and makes any further processing or analysis really hard as queries need to be adapted to account for this. We cannot control the length of text on the Datastore side, so we need to find a way to at least stabilize the schema on the BigQuery side.
Any idea on how to deal with this elegantly?
Any way this conversion can be avoided so the schema of the BigQuery table does not change?
Setting a schema on a load job from a Datastore export is not possible in BigQuery, which means the schema will always be inferred from the data. If you try to load it through the UI, for example, you will see a message saying:
Source file defines the schema
In this link you can find how the type conversion works between Datastore and BigQuery.
Try using a view as the final table, or create a scheduled query that reads your table once it is loaded and saves the results to another table with the right schema.
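As a minimal sketch of the view approach with the BigQuery Python client, assuming the long-text property arrives in the STRUCT form and is called description; the project, dataset, table, and column names are hypothetical:
from google.cloud import bigquery

client = bigquery.Client()
view = bigquery.Table("my-project.analysis.my_kind_stable")  # hypothetical view name
# Collapse the STRUCT<string, text, provided> that long Datastore text produces
# back into a single STRING column so the downstream schema stays stable.
view.view_query = """
SELECT
  * EXCEPT (description),
  COALESCE(description.text, description.string) AS description
FROM `my-project.raw.my_kind`
"""
client.create_table(view, exists_ok=True)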

Parquet troubles with decimal in Azure Data Factory V2

For the past three or four days I've been having trouble writing decimal values in the Parquet file format with Azure Data Factory V2.
The repro steps are quite simple: from a SQL source containing a numeric value, I map it to a Parquet file using the Copy activity.
At runtime the following exception is thrown:
{
  "errorCode": "2200",
  "message": "Failure happened on 'Source' side. ErrorCode=UserErrorParquetTypeNotSupported,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Decimal Precision or Scale information is not found in schema for column: ADDRESSLONGITUDE,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=System.InvalidCastException,Message=Object cannot be cast from DBNull to other types.,Source=mscorlib,'",
  "failureType": "UserError",
  "target": "Copy Data"
}
In the source, the offending column is defined as numeric(32,6).
I think the problem is confined to the Parquet sink, because changing the destination format to CSV results in a successful pipeline.
Any suggestions?
Based on Jay's answer, here is the whole dataset:
SELECT
[ADDRESSLATITUDE]
FROM
[dbo].[MyTable]
Based on SQL Types to Parquet Logical Types and the data type mapping for Parquet files in the Data Factory Copy activity, the Decimal data type is supported; decimal data is converted into a binary data type.
Back to your error message:
Failure happened on 'Source' side.
ErrorCode=UserErrorParquetTypeNotSupported,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Decimal Precision or Scale information is not found in schema
for column:
ADDRESSLONGITUDE,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''
Type=System.InvalidCastException,Message=Object cannot be cast from
DBNull to other types.,Source=mscorlib,'
If your numeric data contains null values, it is converted into an int data type without any decimal precision or scale information.
The CSV format does not go through this transformation process, so as a workaround you could set a default value for your numeric data.
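To illustrate the underlying point outside of Data Factory, a small sketch with pyarrow: a Parquet decimal column needs its precision and scale declared in the schema, and a column holding only NULLs gives inference nothing to derive them from, so they have to be stated explicitly.
import pyarrow as pa
import pyarrow.parquet as pq

# Declare precision/scale explicitly, matching the source numeric(32,6) column.
table = pa.table(
    {"ADDRESSLONGITUDE": pa.array([None, None], type=pa.decimal128(32, 6))}
)
pq.write_table(table, "out.parquet")  # works because the decimal type is fully specified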

Writing Avro to BigQuery using Beam

Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning schemes in order to test table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when converting from the Avro schema type to the BigQuery schema type. For example, both the double and float Avro types are converted to the FLOAT type in BigQuery. Is this expected?
Q1: If the table already exists and its schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() on the original table. That said, I suggest using BigQuery's TableCopy command instead.
Q2: That's expected; BigQuery does not have a DOUBLE type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Logical types will soon be supported as well: https://issuetracker.google.com/issues/35905894.
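For Q1, a rough equivalent of the advice above as a sketch in the Beam Python SDK (the original answer refers to the Java SDK's readTableRows(); the table names here are placeholders):
import apache_beam as beam
from apache_beam.io.gcp.bigquery import (
    BigQueryDisposition,
    ReadFromBigQuery,
    WriteToBigQuery,
)

with beam.Pipeline() as p:
    (
        p
        | "ReadSource" >> ReadFromBigQuery(table="my-project:my_dataset.source_table")
        | "WriteCopy" >> WriteToBigQuery(
            table="my-project:my_dataset.partitioned_copy",
            # CREATE_NEVER assumes the destination table already exists
            # with a matching schema, as the answer describes.
            create_disposition=BigQueryDisposition.CREATE_NEVER,
            write_disposition=BigQueryDisposition.WRITE_APPEND,
        )
    )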