BigQuery: Load Avro files with date columns stored as long converted to timestamp

I'm having trouble getting BigQuery to load timestamps from Avro files correctly.
The Avro files have date columns stored as long with the logical type timestamp-micros. According to the documentation, BigQuery should store these as the TIMESTAMP data type. I have also tried timestamp-millis as the logical type.
The data is stored in Avro like this:
{'id': '<masked>', '<masked>': '<masked>', 'tm': 1553990400000, '<masked>': <masked>, '<masked>': <masked>, 'created': 1597056958864}
The fields tm and created are longs representing 2019-03-31T00:00:00Z and 2020-08-10T11:50:58.986816592Z, respectively.
The Avro schema is:
{"type":"record","name":"SomeMessage","namespace":"com.df",
"fields":
[{"name":"id","type":"string"},
{"name":"<masked>","type":"string"},
{"name":"tm","type":"long","logicalType":"timestamp-micros"},
{"name":"<masked>","type":"int"},
{"name":"<masked>","type":"float"},
{"name":"created","type":"long","logicalType":"timestamp-micros"}]}";
When imported into BigQuery through bq load, the records end up like this:
<masked> | <masked> | tm | <masked> | <masked> | created
<masked> | <masked> | 1970-01-18 23:39:50.400 UTC | <masked> | <masked> | 1970-01-19 11:37:36.958864 UTC
The import command used is:
bq load --source_format=AVRO --use_avro_logical_types some_dataset.some_table "gs://some-bucket/some.avro"
The timestamps in BigQuery are nowhere near the actual values provided in the Avro file.
Anyone have any ideas on how to do this properly?

I figured out that the Avro schema was actually wrong: the logicalType has to be nested inside the type object rather than declared as a sibling of it, and since the values are epoch milliseconds the logical type must be timestamp-millis (BigQuery was reading them as microseconds, a factor of 1000 too small, hence the January 1970 results).
The timestamp fields should look like this:
{"name":"created","type":{"type":"long", "logicalType":"timestamp-millis"}}

Related

timestamp VS TIMESTAMP_NTZ in snowflake sql

I am using an SQL script to parse a JSON into a Snowflake table using dbt.
One of the columns contains this datetime value: '2022-02-09T20:28:59+0000'.
What's the correct way to define an ISO datetime's data type in Snowflake?
I tried DATE, TIMESTAMP and TIMESTAMP_NTZ like this in my dbt SQL script:
JSON_DATA:",my_date"::TIMESTAMP_NTZ AS MY_DATE
but clearly these aren't the correct ones, because later on when I test it in Snowflake with SELECT *, I get this error:
SQL Error [100040] [22007]: Date '2022-02-09T20:28:59+0000' is not recognized
or
SQL Error [100035] [22007]: Timestamp '2022-02-13T03:32:55+0100' is not recognized
so I need to know which Snowflake time/date data type suits this best.
EDIT:
This is what I am trying now:
SELECT
JSON_DATA:"date_transmission" AS DATE_TRANSMISSION
, TO_TIMESTAMP(DATE_TRANSMISSION:text, 'YYYY-MM-DDTHH24:MI:SS.FFTZH:TZM') AS DATE_TRANSMISSION_TS_UTC
, JSON_DATA:"authorizerClientId"::text AS AUTHORIZER_CLIENT_ID
, JSON_DATA:"apiPath"::text API_PATH
, MASTERCLIENT_ID
, META_FILENAME
, META_LOAD_TS_UTC
, META_FILE_TS_UTC
FROM {{ source('INGEST_DATA', 'TABLENAME') }}
I get this error:
000939 (22023): SQL compilation error: error line 6 at position 4
10:21:46 too many arguments for function [TO_TIMESTAMP(GET(DATE_TRANSMISSION, 'text'), 'YYYY-MM-DDTHH24:MI:SS.FFTZH:TZM')] expected 1, g
However, if I comment out the first two lines (related to the timestamp types), the other two work perfectly fine. What's the correct syntax for parsing JSON with TO_TIMESTAMP?
Note that JSON_DATA:"apiPath"::text API_PATH gives the correct value in my Snowflake tables.
I did some testing and it seems you have two options.
You can either get rid of the +0000 at the end: left(column_date, len(column_date)-5)
or use try_to_timestamp with a format:
try_to_timestamp('2022-02-09T20:28:59+0000','YYYY-MM-DD"T"HH24:MI:SS+TZHTZM')
TZH and TZM are TimeZone Offset Hours and Minutes
So there are two main points here.
First, when getting data from JSON to pass to any of the timestamp functions, those functions want a ::TEXT value, but the values pulled from JSON are still ::VARIANT, so they need to be cast. This is the cause of the error you quoted:
(22023): SQL compilation error: error line 6 at position 4
10:21:46 too many arguments for function [TO_TIMESTAMP(GET(DATE_TRANSMISSION, 'text'), 'YYYY-MM-DDTHH24:MI:SS.FFTZH:TZM')] expected 1, g
Also, your SQL is wrong there; it should have been:
TO_TIMESTAMP(DATE_TRANSMISSION::text,
Second, how you handle the timezone format. As others have noted (and as I did in your last question), you need to decide whether to ignore the timezone values or read them. I forgot about the TZHTZM formatting. Given you have timezone data, you should use TO_TIMESTAMP_TZ / TRY_TO_TIMESTAMP_TZ to make sure the timezone data is kept, given your second example shows +0100.
Putting those together (assuming you didn't want an extra date_transmission as a variant in your data):
SELECT
TO_TIMESTAMP_TZ(JSON_DATA:"date_transmission"::text, 'YYYY-MM-DDTHH24:MI:SS+TZHTZM') AS DATE_TRANSMISSION_TS_UTC
, JSON_DATA:"authorizerClientId"::text AS AUTHORIZER_CLIENT_ID
, JSON_DATA:"apiPath"::text AS API_PATH
, MASTERCLIENT_ID
, META_FILENAME
, META_LOAD_TS_UTC
, META_FILE_TS_UTC
FROM {{ source('INGEST_DATA', 'TABLENAME') }}
You should use TIMESTAMP (not DATE, which does not store the time information), but the format you are using is probably not auto-detected. You can specify the input format as YYYY-MM-DD"T"HH24:MI:SSTZHTZM as shown here. The auto-detected one has a : between the TZH and TZM.
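For illustration, here is a sketch wrapping both approaches from this thread in the Snowflake Python connector; the connection details are placeholders, and the TIMESTAMP_INPUT_FORMAT session parameter in option 1 is my own suggestion rather than something from the answers above.

# Sketch: parse '2022-02-09T20:28:59+0000' either via the session input format
# or with an explicit TO_TIMESTAMP_TZ format (assumes snowflake-connector-python).
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",  # placeholders
)
cur = conn.cursor()

# Option 1: set the session-wide input format so ::TIMESTAMP_TZ casts succeed
cur.execute("""ALTER SESSION SET TIMESTAMP_INPUT_FORMAT = 'YYYY-MM-DD"T"HH24:MI:SSTZHTZM'""")
cur.execute("SELECT '2022-02-09T20:28:59+0000'::TIMESTAMP_TZ")
print(cur.fetchone()[0])

# Option 2: pass the format explicitly, keeping the +0000 offset
cur.execute(
    """SELECT TO_TIMESTAMP_TZ('2022-02-09T20:28:59+0000', 'YYYY-MM-DD"T"HH24:MI:SS+TZHTZM')"""
)
print(cur.fetchone()[0])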

How to convert String to Timestamp in kafka connect using transforms and insert into postgres using jdbc sink connector from confluent?

I am using Confluent 6.0.1. Below is my kafka-connect-sink.properties file:
name=enba-sink-postgres
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
connection.url=jdbc:postgresql://IP:PORT/DB
connection.user=USERNAME
connection.password=PASSWORD
tasks.max=1
topics=postgresInsert
insert.mode=INSERT
table.name.format=schema."tableName"
auto.create=false
key.converter.schema.registry.url=http://localhost:8081
key.converter.schemas.enable=false
value.converter.schemas.enable=false
config.action.reload=restart
value.converter.schema.registry.url=http://localhost:8081
errors.tolerance=all
errors.log.enable=true
errors.log.include.messages=true
print.key=true
# Transforms
transforms=TimestampConverter
transforms.TimestampConverter.type=org.apache.kafka.connect.transforms.TimestampConverter$Value
transforms.TimestampConverter.format=yyyy-MM-dd HH:mm:ss
transforms.TimestampConverter.target.type=Timestamp
transforms.TimestampConverter.target.field=DATE_TIME
I am using Avro data and the schema is:
{\"type\":\"record\",\"name\":\"log\",\"namespace\":\"transform.name.space\",\"fields\":[{\"name\":\"TRANSACTION_ID\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"},{\"name\":\"MSISDN\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"},{\"name\":\"TRIGGER_NAME\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"},{\"name\":\"W_ID\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"},{\"name\":\"STEP\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"},{\"name\":\"REWARD_ID\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"},{\"name\":\"CAM_ID\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"},{\"name\":\"STATUS\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"},{\"name\":\"COMMENTS\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"},{\"name\":\"CCR_JSON\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"},{\"name\":\"DATE_TIME\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"avro.java.string\":\"String\"}]}
Basically, the DATE_TIME column in Postgres is of type timestamp, and from Avro I tried sending the date as a String and also as a long.
DATE_TIME = 2022-12-15 14:38:02
The issue is that if I don't use the transform, I get this error:
ERROR: column "DATE_TIME" is of type timestamp with time zone but expression is of type character varying
And if I use the transform as mentioned above, the error is:
[2021-02-06 21:47:41,897] ERROR Error encountered in task enba-sink-postgres-0. Executing stage 'TRANSFORMATION' with class 'org.apache.kafka.connect.transforms.TimestampConverter$Value', where consumed record is {topic='enba', partition=0, offset=69, timestamp=1612628261605, timestampType=CreateTime}. (org.apache.kafka.connect.runtime.errors.LogReporter:66)
org.apache.kafka.connect.errors.ConnectException: Schema Schema{com.package.kafkaconnect.Enbalog:STRUCT} does not correspond to a known timestamp type format
I got it working using:
# Transforms
transforms= timestamp
transforms.timestamp.type= org.apache.kafka.connect.transforms.TimestampConverter$Value
transforms.timestamp.target.type= Timestamp
transforms.timestamp.field= DATE_TIME
transforms.timestamp.format= yyyy-MM-dd HH:mm:ss
For some reason transforms=TimestampConverter was not working. (Most likely the first attempt failed because it set transforms.TimestampConverter.target.field, which is not a TimestampConverter property; the SMT expects field, so without it the transform tried to convert the entire record value, a STRUCT, which matches the error above.)
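As a side note, the format property is a Java SimpleDateFormat pattern and has to describe the incoming string; here is a quick Python sanity check (my own illustration, not from the original answer) of the equivalent pattern against the sample value from the question.

# Sketch: verify the sample DATE_TIME string matches the expected pattern
# (Python's %Y-%m-%d %H:%M:%S mirrors Java's yyyy-MM-dd HH:mm:ss).
from datetime import datetime

print(datetime.strptime("2022-12-15 14:38:02", "%Y-%m-%d %H:%M:%S"))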

Schema conflict when storing dataframes with datetime objects using load_table_from_dataframe()

I'm trying to load data from a pandas DataFrame into a BigQuery table. The DataFrame has a column of dtype datetime64[ns], and when I try to store the df using load_table_from_dataframe(), I get:
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table [table name]. Field computation_triggered_time has changed type from DATETIME to TIMESTAMP.
The table has a schema which reads
CREATE TABLE `[table name]` (
...
computation_triggered_time DATETIME NOT NULL,
...
)
In the DataFrame, computation_triggered_time is a datetime64[ns] column. When I read the original DataFrame from CSV, I convert it from text to datetime like so:
df['computation_triggered_time'] = \
    pd.to_datetime(df['computation_triggered_time']).values.astype('datetime64[ms]')
Note:
The .values.astype('datetime64[ms]') part is necessary because load_table_from_dataframe() uses PyArrow to serialize the df, and that fails if the data has nanosecond precision. The error is something like:
[...] Casting from timestamp[ns] to timestamp[ms] would lose data
This looks like a problem with Google's google-cloud-python package; can you report the bug there? https://github.com/googleapis/google-cloud-python
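In the meantime, one possible workaround (a sketch of mine, not from the answer above) is to pass an explicit schema in the load job config so the column is loaded as DATETIME instead of being inferred as TIMESTAMP:

# Sketch: force DATETIME for the column when loading the DataFrame
# (assumes google-cloud-bigquery; "project.dataset.table" is a placeholder).
import pandas as pd
from google.cloud import bigquery

df = pd.DataFrame({"computation_triggered_time": pd.to_datetime(["2023-01-01 12:00:00"])})

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("computation_triggered_time", "DATETIME")],
)
job = client.load_table_from_dataframe(df, "project.dataset.table", job_config=job_config)
job.result()  # wait for the load job to complete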

Syncing Qubole Hive table to Snowflake with Struct field

I have a table like the following in Qubole:
use dm;
CREATE EXTERNAL TABLE IF NOT EXISTS fact (
id string,
fact_attr struct<
attr1 : String,
attr2 : String
>
)
STORED AS PARQUET
LOCATION 's3://my-bucket/DM/fact'
I have created a parallel table in Snowflake like the following:
CREATE TABLE IF NOT EXISTS dm.fact (
id string,
fact_attr variant
)
My ETL process loads the data into the Qubole table like:
+------------+--------------------------------+
| id | fact_attr |
+------------+--------------------------------+
| 1 | {"attr1": "a1", "attr2": "a2"} |
| 2 | {"attr1": "a3", "attr2": null} |
+------------+--------------------------------+
I am trying to sync this data to Snowflake using the MERGE command, like:
MERGE INTO DM.FACT dst USING %s src
ON dst.id = src.id
WHEN MATCHED THEN UPDATE SET
fact_attr = parse_json(src.fact_attr)
WHEN NOT MATCHED THEN INSERT (
id,
fact_attr
) VALUES (
src.id,
parse_json(src.fact_attr)
);
I am using PySpark to sync the data:
df.write \
.option("sfWarehouse", sf_warehouse) \
.option("sfDatabase", sf_database) \
.option("sfSchema", sf_schema) \
.option("postactions", query) \
.mode("overwrite") \
.snowflake("snowflake", sf_warehouse, sf_temp_table)
With the above command I am getting the following error:
pyspark.sql.utils.IllegalArgumentException: u"Don't know how to save StructField(fact_attr,StructType(StructField(attr1,StringType,true), StructField(attr2,StringType,true)),true) of type attributes to Snowflake"
I have read through the following links but with no success:
Semi-structured Data Types
Querying Semi-structured Data
Question:
How can I insert/sync data from a Qubole Hive table which has a STRUCT field into Snowflake?
The version of the Spark Connector for Snowflake in use at the time of trying this lacked support for variant data types.
Support was introduced in connector version 2.4.4 (released July 2018), from which StructType fields are auto-mapped to a VARIANT data type that works with your MERGE command.
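If upgrading the connector is not possible, a common workaround (again a sketch, not part of the original answer) is to serialize the struct column to a JSON string in Spark and let the parse_json calls in the MERGE above turn it back into a VARIANT:

# Sketch: convert the struct column to a JSON string before handing the
# DataFrame to the Snowflake writer.
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, to_json

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(id="1", fact_attr=Row(attr1="a1", attr2="a2"))])

# Serialize the struct so the connector only sees a plain string column
df_out = df.withColumn("fact_attr", to_json(col("fact_attr")))
df_out.show(truncate=False)  # 1 | {"attr1":"a1","attr2":"a2"}
# Write df_out with the same .option(...) chain as in the question; the
# parse_json(src.fact_attr) in the MERGE post-action restores the VARIANT.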

Hive JSON Serde MetaStore Issue

I have an external table with JSON data and I am using a JSON SerDe to populate the table. The data is populated properly, and when I query it I can see the results correctly.
But when I use the desc command on that table, I get the text "from deserializer" for all the column comments.
Below is the table creation DDL:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
field1 string COMMENT 'This is a field1',
field2 int COMMENT 'This is a field2',
field3 string COMMENT 'This is a field3',
field4 double COMMENT 'This is a field4'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
Location '/user/uszszb6/json_test/data';
Entries in the data file.
{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
{"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
When I use the command desc my_table, I get the output below:
+-----------+------------+--------------------+--+
| col_name | data_type | comment |
+-----------+------------+--------------------+--+
| field1 | string | from deserializer |
| field2 | int | from deserializer |
| field3 | string | from deserializer |
| field4 | double | from deserializer |
+-----------+------------+--------------------+--+
The JSON SerDe is not able to capture the comments properly. I have also tried other JSON SerDes like:
org.openx.data.jsonserde.JsonSerDe
org.apache.hive.hcatalog.data.JsonSerDe
com.amazon.elasticmapreduce.JsonSerde
But the desc command output is the same. There is a JIRA ticket for this bug: https://issues.apache.org/jira/browse/HIVE-6681
According to the ticket it was resolved in version 0.13; I am using Hive 1.2.1 but am still facing this issue.
Could anyone share their thoughts on resolving this issue?
Yeah, it looks like it's a Hive bug that affects all the JSON SerDes, but have you tried using DESCRIBE EXTENDED?
DESCRIBE EXTENDED my_table;
hive> describe extended json_serde_test;
OK
browser string from deserializer
device_uuid string from deserializer
custom struct<customer_id:string> from deserializer
Detailed Table Information
Table(tableName:json_serde_test,dbName:default, owner:rcongiu,
createTime:1448477902, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:browser, type:string,
comment:hello), FieldSchema(name:device_uuid, type:string, comment:my
name is elder price), FieldSchema(name:custom,
type:struct<customer_id:string>, comment:null)],
location:hdfs://localhost:9000/user/hive/warehouse/json_serde_test,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.openx.data.jsonserde.JsonSerDe, parameters:
{serialization.format=1, mapping.customer_id=Customer ID}),
bucketCols:[], sortCols:[], parameters:{},
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{}), storedAsSubDirectories:false),
partitionKeys:[], parameters:{numFiles=1,
transient_lastDdlTime=1448477903, COLUMN_STATS_ACCURATE=true,
totalSize=128, numRows=0, rawDataSize=0}, viewOriginalText:null,
viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.073 seconds, Fetched: 5 row(s)
It will output a JSON-ish detailed description that includes the comments. It's kind of hard to read, but it does show the comments and may be enough for your purposes... or not.