How to convert the spanner TIMESTAMP to BigQuery TIMESTAMP? - google-bigquery

I am trying to copy some tables from Spanner to BigQuery.
I dumped Spanner database in csv file and when I try to upload that csv to BigQuery it is throwing error of the timestamp format.
Here they mentioned limitation of BigQuery TIMESTAMP.
How do I convert spanner TIMESTAMP to BigQuery TIMESTAMP?

There may be two ways to go about this.
Keep the timestamp field as a string as exported by Cloud Spanner and load it into BigQuery as a string. It should still be sortable and used in predicates.
Use a user-defined function to do the string conversion required to load the timestamp natively in BigQuery, via the TextToBigQuery Dataflow template.
You may also write a script to convert the Timestamp to the BigQuery format.

In addition to what #Biswa-nag wrote -
We export our Spanner tables to avro files then import to BigQuery.
Unfortunately, the timestamps turned out to be Strings in BigQuery.
Our workaround for ad-hoc queries is to use user defined function to convert the timestamp in the queries (it took some time to find the correct format...)
An example:
CREATE TEMP FUNCTION ConvertTimestamp(dt STRING) AS (PARSE_DATETIME("%Y-%m-%dT%H:%M:%E*SZ", dt));
select count(*) from `[db].Games` where ConvertTimestamp(StartTime) >= DateTime(2019,8,1,0,0,0)

I converted timestamp to epoch time like this
SELECT myTime , FORMAT_TIMESTAMP("%s", myTime, "America/Los_Angeles") FROM MyTable
and it worked.

Related

Importing CSV file but getting timestamp error

I'm trying to import CSV files into BigQuery and on any of the hourly reports I attempt to upload it gives the code
Error while reading data, error message: Could not parse 4/12/2016 12:00:00 AM as TIMESTAMP for field SleepDay (position 1) starting at location 65 with message Invalid time zone: AM
I get that the format is trying to use AM as a timezone and causing an error but I'm not sure how best to work around it. All of the hourly entries will have AM or PM after the date-time and that will be thousands of entries.
I'm using the autodetect for my schema and I believe that's where the issue is coming up, but I'm not sure what to put in the edit as text schema option to fix it
To successfully parse an imported string to timestamp in Bigquery, the string must be in the ISO 8601 format.
YYYY-MM-DDThh:mm:ss.sss
If your source data is not available in this format, then try the below approach.
Import the CSV into a temporary table by providing explicit schema, where timestamp fields are strings.
2. Select the data from the created temporary table, use the BigQuery PARSE_TIMESTAMP function as specified below and write to the permanent table.
INSERT INTO `example_project.example_dataset.permanent_table`
SELECT
PARSE_TIMESTAMP('%m/%d/%Y %H:%M:%S %p',time_stamp) as time_stamp,
value
FROM `example_project.example_dataset.temporary_table`;

Why loading parquet files into Bigquery gives me back gibberish values into the table?

When I load parquet files into Bigquery table, values stored are wierd. It seems to be the encoding of BYTES fields or whatever else.
Here's the format of the create fields
So when I read the table with casted fields, I get the readable values.
I found the solution here
Ma question is WHY TF bigquery is bahaving like this?
According to this GCP documentation, there are some parquet data types that can be converted into multiple BigQuery data types. A workaround is to add the data type that you want to parse to BigQuery.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
And another way is to add the schema to the bq load command:
bq load --source_format=PARQUET --noreplace --noautodetect --parquet_enum_as_string=true --decimal_target_types=STRING [project]:[dataset].[tables] gs://[bucket]/[file].parquet Column_name:Data_type

Google BigQuery: Importing DATETIME fields using Avro format

I have a script that downloads data from an Oracle database, and uploads it to Google BigQuery. This is done by writing to an Avro file, which is then uploaded directly using BQ's python framework. The BigQuery tables I'm uploading the data to has predefined schemas, some of which contain DATETIME fields.
As BigQuery now has support for Avro Logical fields, import of timestamp data is no longer a problem. However, I'm still not able to import datetime fields. I tried using string, but then I got the following error:
Field CHANGED has incompatible types. Configured schema: datetime; Avro file: string.
I also tried to convert the field data to timestamps on export, but that produced an internal error in BigQuery:
An internal error occurred and the request could not be completed. Error: 3144498
Is it even possible to import datetime fields using Avro?
In Avro, the logical data types must include the attribute logicalType, it is possible that this field is not included in your schema definition.
Here there are a couple of examples like the following one. As far as I know the type can be int or long, but logicalType should be date:
{
'name': 'DateField',
'type': 'int',
'logicalType': 'date'
}
Once the logical data type is set, try again. The documentation does indicate it should work:
Avro logical type --> date
Converted BigQuery data type --> DATE
In case you get an error, it would be helpful to check the schema of your avro file, you can use this command to obtain its details:
java -jaravro-tools-1.9.2.jargetschema my-avro-file.avro
UPDATE
For cases where DATE alone doesn't work, please consider that the TIMESTAMP can store the date and time with a number of micro/nano seconds from the unix epoch, 1 January 1970 00:00:00.000000 UTC (UTC seems to be the default for avro). Additionally, the values stored in an avro file (of type DATE o TIMESTAMP) are independent of a particular time zone, in this sense, it is very similar to BigQuery Timestamp data type.

How to export AVRO files from a BigQuery table with a DATE column and load it again to BigQuery

For moving data from a BigQuery (BQ) table that resides in the US, I want to export the table to a Cloud Storage (GCS) bucket in the US, copy it to an EU bucket, and from there import it again.
The problem is that AVRO does not support DATE types, but it is crucial to us as we are using the new partitioning feature that is not relying on ingestion time, but a column in the table itself.
The AVRO files contain the DATE column as a STRING and therefore a
Field date has changed type from DATE to STRING error is thrown, when trying to load the files via bq load.
There has been a similar question, but it is about timestamps - in my case it absolutely needs to be a DATE as dates don't carry timezone information and timestamps are always interpreted in UTC by BQ.
It works when using NEWLINE_DELIMITED_JSON, but is it possible to make this work with AVRO files?
As #ElliottBrossard pointed out in the comments, there's a public feature request regarding this where it's possible to sign up for the whitelist.

TIMESTAMP on HIVE table

I'm trying to load data from Oracle to Hive as parquet. Every time i load a table with date/timestamp column to hive, it automatically converts these columns to BIGINT. Is is possible to load timestamp/date formats to hive using sqoop and as a parquet file?
Already tried creating the table first in hive then using impala to LOAD DATA INPATH the parquet file.
Still failed with errors
"file XX has an incompatible Parquet schema for column XX column:
TIMESTAMP"
BTW, I'm using cloudera quickstart vm. Thanks
From the Cloudera documentation:
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE, DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
Or you can also use your Hive query like this to get the result in your desired TIMESTAMP format.
FROM_UNIXTIME(CAST(SUBSTR(timestamp_column, 1,10) AS INT)) AS timestamp_column;
Try using configuration of sqoop
--map-column-hive
<cols_name>=TIMESTAMP