Google BigQuery: Importing DATETIME fields using Avro format - google-bigquery

I have a script that downloads data from an Oracle database and uploads it to Google BigQuery. This is done by writing to an Avro file, which is then uploaded directly using BQ's Python framework. The BigQuery tables I'm uploading the data to have predefined schemas, some of which contain DATETIME fields.
As BigQuery now has support for Avro logical types, importing timestamp data is no longer a problem. However, I'm still not able to import DATETIME fields. I tried using string, but then I got the following error:
Field CHANGED has incompatible types. Configured schema: datetime; Avro file: string.
I also tried to convert the field data to timestamps on export, but that produced an internal error in BigQuery:
An internal error occurred and the request could not be completed. Error: 3144498
Is it even possible to import datetime fields using Avro?

In Avro, logical data types must include the attribute logicalType; it is possible that this attribute is missing from your schema definition.
Here are a couple of examples like the following one. For a date, the underlying type should be int (long is used for timestamps) and the logicalType should be date:
{
  "name": "DateField",
  "type": {"type": "int", "logicalType": "date"}
}
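As a concrete illustration, here is a minimal sketch of writing such a file from Python with the fastavro library (field and file names are placeholders); fastavro converts Python date and datetime values into the int/long encodings that the logical types require:

from datetime import date, datetime, timezone
import fastavro

# Record schema with logical types that BigQuery maps to DATE and TIMESTAMP.
schema = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "DateField", "type": {"type": "int", "logicalType": "date"}},
        {"name": "CHANGED", "type": {"type": "long", "logicalType": "timestamp-micros"}},
    ],
}

records = [{
    "DateField": date(2020, 1, 15),
    "CHANGED": datetime(2020, 1, 15, 12, 30, 0, tzinfo=timezone.utc),
}]

with open("my-avro-file.avro", "wb") as out:
    fastavro.writer(out, schema, records)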
Once the logical data type is set, try again. The documentation does indicate it should work:
Avro logical type --> date
Converted BigQuery data type --> DATE
If you get an error anyway, it is helpful to check the schema of your Avro file; you can use this command to obtain its details:
java -jar avro-tools-1.9.2.jar getschema my-avro-file.avro
UPDATE
For cases where DATE alone doesn't work, please consider that TIMESTAMP can store the date and time as a number of milliseconds or microseconds since the Unix epoch, 1 January 1970 00:00:00.000000 UTC (UTC appears to be the default for Avro). Additionally, the values stored in an Avro file (of type DATE or TIMESTAMP) are independent of a particular time zone; in this sense, it is very similar to the BigQuery TIMESTAMP data type.
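As a quick illustration of that encoding (the value here is purely illustrative):

from datetime import datetime, timezone

# A timestamp-micros value is the number of microseconds since the Unix epoch.
dt = datetime(2017, 11, 21, 8, 45, 27, tzinfo=timezone.utc)
micros_since_epoch = int(dt.timestamp() * 1_000_000)
print(micros_since_epoch)  # 1511253927000000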

Related

Importing CSV file but getting timestamp error

I'm trying to import CSV files into BigQuery, and on any of the hourly reports I attempt to upload it gives this error:
Error while reading data, error message: Could not parse 4/12/2016 12:00:00 AM as TIMESTAMP for field SleepDay (position 1) starting at location 65 with message Invalid time zone: AM
I get that the format is trying to use AM as a time zone, which causes the error, but I'm not sure how best to work around it. All of the hourly entries will have AM or PM after the date-time, and there will be thousands of entries.
I'm using autodetect for my schema and I believe that's where the issue comes from, but I'm not sure what to put in the 'edit as text' schema option to fix it.
To successfully parse an imported string into a TIMESTAMP in BigQuery, the string must be in ISO 8601 format:
YYYY-MM-DDThh:mm:ss.sss
If your source data is not available in this format, then try the below approach.
1. Import the CSV into a temporary table, providing an explicit schema in which the timestamp fields are strings (see the sketch after the query below).
2. Select the data from that temporary table, use the BigQuery PARSE_TIMESTAMP function as shown below, and write the result to the permanent table.
INSERT INTO `example_project.example_dataset.permanent_table`
SELECT
  PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', time_stamp) AS time_stamp,
  value
FROM `example_project.example_dataset.temporary_table`;
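A rough sketch of step 1 in Python, assuming the google-cloud-bigquery client library; the file, table, and column names are placeholders matching the query above:

from google.cloud import bigquery

client = bigquery.Client()

# Load the CSV with an explicit schema in which the timestamp column is a STRING.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[
        bigquery.SchemaField("time_stamp", "STRING"),  # parsed later with PARSE_TIMESTAMP
        bigquery.SchemaField("value", "FLOAT"),
    ],
)

with open("hourly_report.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file,
        "example_project.example_dataset.temporary_table",
        job_config=job_config,
    )
load_job.result()  # wait for the load job to finish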

what are the data types allowed with the sqoop option "--map-column-java"?

I want to use sqoop import to import data from SQL Server, however I am facing some data type conversion issues, and I want to use "--map-column-java" to solve that.
Just in case anybody wants to suggest "--map-column-hive": I can't use it because I am importing with "--as-parquetfile", therefore I have to cast the column data types before they are inserted into the file.
So, what are the data types allowed with the sqoop option "--map-column-java"?
P.S.
In particular, I want to know which data type works for "datetime" columns with "--map-column-java".
It's pretty tough to load from a database into Parquet through Sqoop while keeping the source data types; for example, you can't load timestamp because it's not supported.
I suggest the following workaround:
1. Load with Sqoop with all the data types as String;
2. Insert from table 1 (with all the data types as String) into table 2, using CAST (AS timestamp, AS decimal, etc.).
Example:
--map-column-java "ID=String,NR_CARD=String,TIP_CARD_ID=String,CONT_CURENT_ID=String,AUTORIZ_CONTURI_ID=String,TIP_STARE_ID=String,DATA_STARE=String,COMIS=String,BUGETARI_ID=String,DATA_SOLICITARII=String,DATA_EMITERII=String,DATA_VALABILITATII=String,TIP_DESCOPERIT_ID=String,BRANCH_CODE_EMIT=String,ORG_ID=String,DATA_REGEN=String,FIRMA_ID=String,VOUCHER_BLOC=String,CANAL_CERERE=String,CODE_BUG_OPER=String,CREATED_BY=String,CREATION_DATE=String,LAST_UPDATED_BY=String,LAST_UPDATE_DATE=String,LAST_UPDATE_LOGIN=String,IDPAN=String,MOTIV_STARE_ID=String,DATA_ACTIVARII=String" \
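If the table has many columns, it may be easier to generate that argument than to type it by hand; a small sketch (the column list is shortened and purely illustrative):

# Build the --map-column-java argument from a list of column names.
columns = ["ID", "NR_CARD", "TIP_CARD_ID", "DATA_STARE", "CREATION_DATE"]
mapping = ",".join(f"{col}=String" for col in columns)
print(f'--map-column-java "{mapping}"')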
This way you will have all the data types correctly loaded from the source.

How to export AVRO files from a BigQuery table with a DATE column and load it again to BigQuery

For moving data from a BigQuery (BQ) table that resides in the US, I want to export the table to a Cloud Storage (GCS) bucket in the US, copy it to an EU bucket, and from there import it again.
The problem is that AVRO does not support DATE types, but this is crucial for us, as we are using the new partitioning feature that does not rely on ingestion time but on a column in the table itself.
The AVRO files contain the DATE column as a STRING, and therefore a
Field date has changed type from DATE to STRING error is thrown when trying to load the files via bq load.
There has been a similar question, but it is about timestamps - in my case it absolutely needs to be a DATE as dates don't carry timezone information and timestamps are always interpreted in UTC by BQ.
It works when using NEWLINE_DELIMITED_JSON, but is it possible to make this work with AVRO files?
As @ElliottBrossard pointed out in the comments, there's a public feature request regarding this where it's possible to sign up for the whitelist.

How to load Avro File to BigQuery tables with columns having 'Timestamp' type

By default, Avro doesn't support timestamps, but I can have epoch time values of type long in the file. What I want is to load those values in TIMESTAMP format when loading the Avro file data into a BigQuery table using the command line tool.
For example : I have a column having value 1511253927 and I want this value to be loaded as 2017-11-21 00:00:00 using command line tool.
Any leads will be appreciated.
You can try running a query with your file as a federated data source and using the TIMESTAMP_SECONDS standard SQL function to convert the values.
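For example, a rough sketch of that approach with the Python client library (google-cloud-bigquery); the bucket, table, and column names are placeholders, and the destination table is assumed to have a TIMESTAMP column:

from google.cloud import bigquery

client = bigquery.Client()

# Treat the Avro file in Cloud Storage as a temporary (federated) table.
external_config = bigquery.ExternalConfig("AVRO")
external_config.source_uris = ["gs://example-bucket/epoch_data.avro"]

job_config = bigquery.QueryJobConfig(
    table_definitions={"epoch_data": external_config},
    destination="example_project.example_dataset.events",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Convert the epoch seconds to a TIMESTAMP while writing to the destination table.
sql = """
SELECT TIMESTAMP_SECONDS(epoch_col) AS event_ts
FROM epoch_data
"""
client.query(sql, job_config=job_config).result()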

TIMESTAMP from CSV via API

Does the API support importing a CSV to a new table when there is a TIMESTAMP field?
If I manually (using the BigQuery web interface) upload a CSV file containing timestamp data, and specify the field to be a TIMESTAMP via the schema, it works just fine. The data is loaded. The timestamp data is interpreted as timestamp data and imported into the timestamp field just fine.
However, when I use the API to do the same thing with the same file, I get this error:
"Illegal CSV schema type: TIMESTAMP"
More specifically, I'm using Google Apps Script to connect to the BigQuery API, but the response seems to be coming from the BigQuery API itself, which suggests this is not a feature of the API.
I know I can import as STRING, then convert to TIMESTAMP in my queries, but I was hoping to ultimately end up with a table schema with a timestamp field... populated from a CSV file... using the API... preferably through Apps Script for simplicity.
It looks like TIMESTAMP is missing from the 'inline' schema parser. The fix should be in next week's build. In the meantime, if you pass the schema via the 'schema' field rather than the 'schemaInline' field, it should work for you.
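For reference, here is a rough sketch (shown as a Python dict for readability) of what a load-job configuration with a structured schema field looks like; the project, dataset, table, and field names are placeholders:

# jobs.insert request body using "schema" with explicit fields instead of "schemaInline".
job_body = {
    "configuration": {
        "load": {
            "destinationTable": {
                "projectId": "example_project",
                "datasetId": "example_dataset",
                "tableId": "events",
            },
            "sourceFormat": "CSV",
            "schema": {
                "fields": [
                    {"name": "event_time", "type": "TIMESTAMP"},
                    {"name": "value", "type": "FLOAT"},
                ],
            },
        }
    }
}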