What is the proper way to declare a simple Timestamp in Avro - hive

How can we declare a simple timestamp in Avro, please?
type: timestamp doesn't work, so for now I use a plain string, but I want it stored as a timestamp. (This is my value: 27/01/1999 08:45:34.)
Thank you

Use Avro's logical types. timestamp-millis annotates an Avro long that holds milliseconds since the Unix epoch:
{"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}}
A few useful links:
Avro timestamp-millis
Avro Logical types
Hortonworks community question about Avro timestamp

Related

Pubsub Subscription error for TIMESTAMP data type: Incompatible schema type

I was working with the new BigQuery subscription for Pub/Sub and came across an error when trying to create the subscription:
Incompatible schema type for field 'timestamp': field is STRING in the topic schema, but TIMESTAMP in the BigQuery table schema.
However, in the pubsub documentation there is this statement:
When the type in the topic schema is a string and the type in the BigQuery table is TIMESTAMP, DATETIME, DATE, or TIME, then any value for this field in a Pub/Sub message must adhere to the format specified for the BigQuery data type.
As far as I understand, that means it's possible to insert into a TIMESTAMP column in BigQuery as long as the string complies with the canonical format defined for the TIMESTAMP data type.
Am I missing something?
For more info, this is the conflicting part of the BigQuery and Pub/Sub topic schemas that I use:
BigQuery schema
[
  {
    "description": "The exact time when this request is being sent. Set as UTC",
    "mode": "REQUIRED",
    "name": "timestamp",
    "type": "TIMESTAMP"
  }
]
Pub/Sub topic schema
syntax = "proto2";

message ExampleMessage {
  required string timestamp = 10;
}
Update: removed the space in the message name for the Pub/Sub topic
It looks like the issue isn't with the field timestamp itself, but with the name of the message type, Example Message. The message type name has a space, which should not be permitted. At this time, Pub/Sub's Protocol Buffer schema validation is more permissive than the standard parser and does not catch errors like this. If you change the name to ExampleMessage, it should work.
A fix for this issue has already been prepared. You can follow the progress of the fix in the public bug on the issue tracker.
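For the timestamp field itself, here is a minimal sketch (not part of the original question or answer; the payload layout and the commented publish call are illustrative only) of formatting the string so it matches BigQuery's canonical TIMESTAMP format before the message is published:

import json
from datetime import datetime, timezone

# BigQuery accepts TIMESTAMP strings such as "YYYY-MM-DD HH:MM:SS[.ffffff] UTC".
now_utc = datetime.now(timezone.utc)
payload = {"timestamp": now_utc.strftime("%Y-%m-%d %H:%M:%S.%f UTC")}

data = json.dumps(payload).encode("utf-8")
# The bytes can then be published to the topic, e.g. with google-cloud-pubsub:
# publisher.publish(topic_path, data=data)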

Spark createDataFrame cannot infer schema - default data types?

I am creating a Spark DataFrame in Databricks using createDataFrame and getting the error:
'Some of types cannot be determined after inferring'
I know I can specify the schema, but that does not help if I am creating the DataFrame each time with source data from an API and they decide to restructure it.
Instead I would like to tell Spark to use 'string' for any column where a data type cannot be inferred.
Is this possible?
This can be easily handled with schema evolution in the Delta format. Quick ref: https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
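As a minimal sketch of what that answer points to (not from the thread; the table path and columns are made up), the incoming records can be cast to strings up front and written to a Delta table with mergeSchema enabled, so newly appearing columns evolve the table instead of failing the write:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Illustrative API payload; a column that is entirely None would otherwise break inference.
records = [{"id": 1, "name": "a", "maybe_null": None}]

# Build an all-string schema from the observed keys and stringify the values.
fields = sorted({k for r in records for k in r})
schema = StructType([StructField(f, StringType(), True) for f in fields])
rows = [tuple(None if r.get(f) is None else str(r[f]) for f in fields) for r in records]
df = spark.createDataFrame(rows, schema=schema)

# Delta schema evolution: new columns are merged into the existing table on write.
(df.write
   .format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .save("/mnt/example/delta_table"))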

Google BigQuery: Importing DATETIME fields using Avro format

I have a script that downloads data from an Oracle database and uploads it to Google BigQuery. This is done by writing to an Avro file, which is then uploaded directly using BQ's Python framework. The BigQuery tables I'm uploading the data to have predefined schemas, some of which contain DATETIME fields.
As BigQuery now has support for Avro Logical fields, import of timestamp data is no longer a problem. However, I'm still not able to import datetime fields. I tried using string, but then I got the following error:
Field CHANGED has incompatible types. Configured schema: datetime; Avro file: string.
I also tried to convert the field data to timestamps on export, but that produced an internal error in BigQuery:
An internal error occurred and the request could not be completed. Error: 3144498
Is it even possible to import datetime fields using Avro?
In Avro, logical data types must include the logicalType attribute; it is possible that this attribute is not included in your schema definition.
There are a couple of examples like the following one. As far as I know, the underlying type for date is int (timestamps use long), and the logicalType should be date:
{
  "name": "DateField",
  "type": {"type": "int", "logicalType": "date"}
}
Once the logical data type is set, try again. The documentation does indicate it should work:
Avro logical type --> date
Converted BigQuery data type --> DATE
In case you get an error, it would be helpful to check the schema of your Avro file; you can use this command to obtain its details:
java -jar avro-tools-1.9.2.jar getschema my-avro-file.avro
UPDATE
For cases where DATE alone doesn't work, consider that TIMESTAMP can store the date and time as a number of milli- or microseconds from the Unix epoch, 1 January 1970 00:00:00.000000 UTC (UTC seems to be the default for Avro). Additionally, the values stored in an Avro file (of type DATE or TIMESTAMP) are independent of a particular time zone; in this sense, they are very similar to the BigQuery TIMESTAMP data type.
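On the load side, here is a minimal sketch (not from the original answer; project, dataset, table and file names are placeholders) of loading the Avro file with the google-cloud-bigquery client so that logical types are honoured:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    use_avro_logical_types=True,  # map Avro date/timestamp-* logical types to DATE/TIMESTAMP
)

with open("my-avro-file.avro", "rb") as f:
    job = client.load_table_from_file(
        f, "my-project.my_dataset.my_table", job_config=job_config
    )
job.result()  # wait for the load job to finish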

Parquet troubles with decimal in Azure Data Factory V2

For the last 3 or 4 days I've been experiencing trouble writing decimal values in the Parquet file format with Azure Data Factory V2.
The repro steps are quite simple: from a SQL source containing a numeric value, I map it to a Parquet file using the copy activity.
At runtime the following exception is thrown:
{
  "errorCode": "2200",
  "message": "Failure happened on 'Source' side. ErrorCode=UserErrorParquetTypeNotSupported,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Decimal Precision or Scale information is not found in schema for column: ADDRESSLONGITUDE,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=System.InvalidCastException,Message=Object cannot be cast from DBNull to other types.,Source=mscorlib,'",
  "failureType": "UserError",
  "target": "Copy Data"
}
In the source, the offending column is defined as the numeric(32,6) type.
I think the problem is confined to the Parquet sink, because changing the destination format to CSV results in a successful pipeline.
Any suggestions?
Based on Jay's answer, here is the whole dataset:
SELECT
[ADDRESSLATITUDE]
FROM
[dbo].[MyTable]
Based on the SQL Types to Parquet Logical Types and the data type mapping for Parquet files in the Data Factory copy activity, it supports the Decimal data type. Decimal data is converted into the binary data type.
Back to your error message:
Failure happened on 'Source' side.
ErrorCode=UserErrorParquetTypeNotSupported,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Decimal Precision or Scale information is not found in schema
for column:
ADDRESSLONGITUDE,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''
Type=System.InvalidCastException,Message=Object cannot be cast from
DBNull to other types.,Source=mscorlib,'
If your numeric data has null values, it will be converted into an Int data type without any decimal precision or scale information.
The CSV format does not have this transformation process, so you could set a default value for your numeric data.
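To illustrate the mechanism this answer describes, here is a minimal sketch (not from the thread; it uses pyarrow rather than Data Factory, and the column values are made up) showing how an all-null decimal column loses its precision/scale unless the schema declares the type explicitly:

import pyarrow as pa
import pyarrow.parquet as pq

# Without a schema, an all-null column is typed as null -- no precision/scale survive.
inferred = pa.table({"ADDRESSLONGITUDE": [None, None]})
print(inferred.schema)  # ADDRESSLONGITUDE: null

# Declaring decimal128(32, 6) up front keeps the numeric(32,6) metadata intact.
schema = pa.schema([("ADDRESSLONGITUDE", pa.decimal128(32, 6))])
explicit = pa.table({"ADDRESSLONGITUDE": [None, None]}, schema=schema)
pq.write_table(explicit, "example.parquet")
print(pq.read_schema("example.parquet"))  # ADDRESSLONGITUDE: decimal128(32, 6)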

Writing Avro to BigQuery using Beam

Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning in order to test the table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when converting from the Avro schema type to the BigQuery schema type. For example, both the double and float Avro types are converted to the FLOAT type in BigQuery. Is this expected?
Q1: If the table already exists and its schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() on the original table. Although I suggest using BigQuery's TableCopy command instead.
Q2: That's expected; BigQuery does not have a Double type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Also, Logical Types will soon be supported as well: https://issuetracker.google.com/issues/35905894.
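As a minimal sketch of the table-to-table copy the Q1 answer describes (not from the answer itself, which refers to the Java SDK; this uses the Beam Python SDK, and the project, dataset and table names are placeholders):

import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery, WriteToBigQuery, BigQueryDisposition

with beam.Pipeline() as p:
    (p
     | "ReadSource" >> ReadFromBigQuery(table="my-project:my_dataset.source_table")
     | "WriteCopy" >> WriteToBigQuery(
         table="my-project:my_dataset.partitioned_copy",
         create_disposition=BigQueryDisposition.CREATE_NEVER,   # destination table already exists
         write_disposition=BigQueryDisposition.WRITE_APPEND))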