Is there a way to match avro schema with Bigquery and Bigtable? - google-bigquery

I'd like to import BigQuery data to Bigtable using Google Composer.
Exporting BigQuery rows in Avro format to GCS was successful. However, importing the Avro data into Bigtable was not.
The error says
Caused by: org.apache.avro.AvroTypeException: Found Root, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
I guess the schemas of BigQuery and Bigtable should match each other, but I have no idea how to do this.

For every record read from the Avro files:
Attributes present in the files and in the table are loaded into the table.
Attributes present in the file but not in the table are subject to ignore_unknown_fields.
Attributes that exist in the table but not in the file will use their default value, if there is one set.
The links below are helpful:
[1] https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-storage-avro-to-bigtable
[2] https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/resources/schema/avro/bigtable.avsc
[3] Avro to BigTable - Schema issue?

For those of you who, like me, still have problems because you are not familiar with Avro, here is one working schema transformation that I found after some tinkering.
For example, suppose you have a BigQuery table with a user_id column and feature columns channel, zip_code, and history.
If you want to use user_id as the Bigtable row key and ingest all the feature columns, here is example code to encode them as an Avro file.
import json
from datetime import datetime

from avro.schema import Parse
from avro.io import DatumWriter
from avro.datafile import DataFileWriter

# Avro schema expected by the Cloud Storage Avro to Bigtable Dataflow template [2]
bigtable_schema = {
    "name": "BigtableRow",
    "type": "record",
    "namespace": "com.google.cloud.teleport.bigtable",
    "fields": [
        {"name": "key", "type": "bytes"},
        {
            "name": "cells",
            "type": {
                "type": "array",
                "items": {
                    "name": "BigtableCell",
                    "type": "record",
                    "fields": [
                        {"name": "family", "type": "string"},
                        {"name": "qualifier", "type": "bytes"},
                        {"name": "timestamp", "type": "long", "logicalType": "timestamp-micros"},
                        {"name": "value", "type": "bytes"}
                    ]
                }
            }
        }
    ]
}
parsed_schema = Parse(json.dumps(bigtable_schema))

row_key = 'user_id'
family_name = 'feature_name'
feature_list = ['channel', 'zip_code', 'history']

# df is the pandas DataFrame holding the rows exported from BigQuery
with open('features.avro', 'wb') as f:
    writer = DataFileWriter(f, DatumWriter(), parsed_schema)
    for _, row in df.iterrows():
        ts = int(datetime.now().timestamp()) * 1000 * 1000  # timestamp in microseconds
        for feat in feature_list:
            writer.append({
                "key": row[row_key].encode('utf-8'),
                "cells": [{"family": family_name,
                           "qualifier": feat.encode('utf-8'),
                           "timestamp": ts,
                           "value": str(row[feat]).encode('utf-8')}]
            })
    writer.close()
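If you want to sanity-check the generated file before running the import, you can read it back with the same avro package. A quick sketch, not part of the original notebook:
from avro.datafile import DataFileReader
from avro.io import DatumReader

# Read the Avro file back and print the first record to verify its structure.
with open('features.avro', 'rb') as f:
    reader = DataFileReader(f, DatumReader())
    for record in reader:
        print(record['key'], record['cells'][0]['qualifier'], record['cells'][0]['value'])
        break  # only look at the first record
    reader.close()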
Then you can use the Dataflow template job to run the ingestion.
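Since the question mentions Google Composer, here is a rough sketch of how such an ingestion could be triggered from an Airflow DAG using the provided Cloud Storage Avro to Bigtable template from [1]. The operator import path assumes the apache-airflow-providers-google package; the parameter names follow the template documentation, and all project, instance, table, and bucket names are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG("avro_to_bigtable", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    load_avro_to_bigtable = DataflowTemplatedJobStartOperator(
        task_id="load_avro_to_bigtable",
        # Google-provided batch template: Cloud Storage Avro to Bigtable
        template="gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable",
        project_id="my-project",      # placeholder
        location="us-central1",       # placeholder
        parameters={
            "bigtableProjectId": "my-project",    # placeholder
            "bigtableInstanceId": "my-instance",  # placeholder
            "bigtableTableId": "features",        # placeholder
            "inputFilePattern": "gs://my-bucket/features*.avro",  # placeholder
        },
    )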
Complete code can be found here: https://github.com/mitbal/sidu/blob/master/bigquery_to_bigtable.ipynb

Related

Karate - Conditional JSON schema validation

I am just wondering how I can do conditional schema validation. The API response is dynamic based on the customerType key. If customerType is person, then person details will be included, and if customerType is org, organization details will be included in the JSON response. So the response can be in either of the following forms:
{
  "customerType" : "person",
  "person" : {
    "fistName" : "A",
    "lastName" : "B"
  },
  "id" : 1,
  "requestDate" : "2021-11-11"
}
{
  "customerType" : "org",
  "organization" : {
    "orgName" : "A",
    "orgAddress" : "B"
  },
  "id" : 2,
  "requestDate" : "2021-11-11"
}
The schema I created to validate the above two scenarios is as follows:
{
  "customerType" : "#string",
  "organization" : "#? response.customerType=='org' ? karate.match(_,personSchema) : karate.match(_,null)",
  "person" : "#? response.customerType=='person' ? karate.match(_,orgSchema) : karate.match(_,null)",
  "id" : "#number",
  "requestDate" : "#string"
}
But the schema fails to match the actual response. What changes should I make to the schema to make it work?
Note: I am planning to reuse the schema in multiple tests, so I will be keeping the schema in separate files, independent of the feature file.
Can you refer to this answer, which I think is the better approach: https://stackoverflow.com/a/47336682/143475
That said, I think you missed that the JS karate.match() API doesn't return a boolean, but a JSON that contains a pass boolean property.
So you have to do things like this:
* def someVar = karate.match(actual, expected).pass ? {} : {}

How to create Type RECORD of INTEGER in my terraform file for BigQuery

I am trying to write the Terraform schema for my BigQuery table, and I need a column of type RECORD which will be populated with INTEGER values.
The field in question would have the format of brackets with one or more comma-separated integers inside, e.g. [1].
I tried writing like this:
resource "google_bigquery_table" "categories" {
project = "abcd-data-ods-${terraform.workspace}"
dataset_id = google_bigquery_dataset.bq_dataset_op.dataset_id
table_id = "categories"
schema = <<EOF
[
{"type":"STRING","name":"a","mode":"NULLABLE"},
{"type":"RECORD[INTEGER]","name":"b","mode":"NULLABLE"}
]
EOF
}
and like this:
resource "google_bigquery_table" "categories" {
project = "abcd-data-ods-${terraform.workspace}"
dataset_id = google_bigquery_dataset.bq_dataset_op.dataset_id
table_id = "categories"
schema = <<EOF
[
{"type":"STRING","name":"a","mode":"NULLABLE"},
{"type":"RECORD","name":"b","mode":"NULLABLE"}
]
EOF
}
But it didn't work, as I keep getting an error in my CI/CD on GitLab.
The error for the first attempt:
Error: googleapi: Error 400: Invalid value for type: RECORD[INTEGER] is not a valid value, invalid
The error for the second attempt:
Error: googleapi: Error 400: Field b is type RECORD but has no schema, invalid
I presume that the second implementation is the closest to the solution, given the error, but it is still missing something.
Does anyone have an idea about the right way to declare it?
Just as stated in the second error:
Error: googleapi: Error 400: Field b is type RECORD but has no schema, invalid
You must provide a schema for RECORD types (you can read more in the docs). For instance, a valid example could be:
resource "google_bigquery_table" "categories" {
project = "abcd-data-ods-${terraform.workspace}"
dataset_id = google_bigquery_dataset.bq_dataset_op.dataset_id
table_id = "categories"
schema = <<EOF
[
{
"type":"STRING",
"name":"a",
"mode":"NULLABLE"
},
{
"type":"RECORD",
"name":"b",
"mode":"NULLABLE",
"fields": [{
"name": "c",
"type": "INTEGER",
"mode": "NULLABLE"
}]
}
]
EOF
}
Hope this helps.
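As a side note (not part of the original answer), the same rule holds outside Terraform: a RECORD column always needs its own sub-fields. A minimal sketch with the BigQuery Python client, reusing the field names above and a placeholder table ID:
from google.cloud import bigquery

client = bigquery.Client()

# A RECORD (STRUCT) column is declared by passing its nested fields explicitly.
schema = [
    bigquery.SchemaField("a", "STRING", mode="NULLABLE"),
    bigquery.SchemaField(
        "b", "RECORD", mode="NULLABLE",
        fields=[bigquery.SchemaField("c", "INTEGER", mode="NULLABLE")],
    ),
]

table = bigquery.Table("my-project.my_dataset.categories", schema=schema)  # placeholder table ID
client.create_table(table)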

BigQuery: --[no]use_avro_logical_types flag doesn't work

I am trying to use the bq command with the --[no]use_avro_logical_types flag to load Avro files into a BigQuery table which does not exist before executing the command. The Avro schema contains a timestamp-millis logical type value. When the command is executed, a new table is created, but the type of its column becomes INTEGER.
This is a recently released feature, so I cannot find examples and I don't know what I am missing. Could anyone give me a good example?
My Avro schema looks like the following:
...
}, {
"name" : "timestamp",
"type" : [ "null", "long" ],
"default" : null,
"logicalType" : [ "null", "timestamp-millis" ]
}, {
...
And the command being executed is this:
bq load --source_format=AVRO --use_avro_logical_types <table> <path/to/file>
To use the timestamp-millis logical type, you can specify the field in the following way:
{
  "name" : "timestamp",
  "type" : {"type": "long", "logicalType" : "timestamp-millis"}
}
In order to provide an optional 'null' value, you can try out the following spec:
{
  "name" : "timestamp",
  "type" : ["null", {"type" : "long", "logicalType" : "timestamp-millis"}]
}
For a full list of supported Avro logical types please refer to the Avro spec: https://avro.apache.org/docs/1.8.0/spec.html#Logical+Types.
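If you load through the BigQuery Python client library rather than the bq CLI, the corresponding switch is the use_avro_logical_types option on the load job configuration. A minimal sketch with placeholder URIs and table IDs:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    use_avro_logical_types=True,  # honour timestamp-millis and other Avro logical types
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data/*.avro",      # placeholder source URI
    "my-project.my_dataset.my_table",  # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish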
According to https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro, the Avro type timestamp-millis is converted to an INTEGER once loaded into BigQuery.

Data not being populated after altering the hive table

I have a Hive table which is populated from an underlying Parquet file in an HDFS location. Now, I have altered the table schema by changing a column name, but the column is now being populated with NULL instead of the original data in the Parquet file.
Give this a try. Open up the .avsc file. For the column, you will find something like
{
  "name" : "start_date",
  "type" : [ "null", "string" ],
  "default" : null,
  "columnName" : "start_date",
  "sqlType" : "12"
}
Add an alias
{
  "name" : "start_date",
  "type" : [ "null", "string" ],
  "default" : null,
  "columnName" : "start_date",
  "aliases" : [ "sta_dte" ],
  "sqlType" : "12"
}

MultiLevel JSON in PIG

I am new to Pig scripting and to working with JSON. I need to parse multi-level JSON files in Pig. Say,
{
  "firstName": "John",
  "lastName" : "Smith",
  "age" : 25,
  "address" :
  {
    "streetAddress": "21 2nd Street",
    "city" : "New York",
    "state" : "NY",
    "postalCode" : "10021"
  },
  "phoneNumber":
  [
    {
      "type" : "home",
      "number": "212 555-1234"
    },
    {
      "type" : "fax",
      "number": "646 555-4567"
    }
  ]
}
I am able to parse a single-level JSON through JsonLoader(), do joins and other operations, and get the desired results using JsonLoader('name:chararray,field1:int .....');
Is it possible to parse the above JSON file using the built-in JsonLoader() function of Pig 0.10.0? If it is, please explain how it is done and how to access fields of that JSON.
You can handle nested JSON loading with Twitter's Elephant Bird: https://github.com/kevinweil/elephant-bird
a = LOAD 'file3.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
This will parse the JSON into a map (http://pig.apache.org/docs/r0.11.1/basic.html#map-schema); a JSON array gets parsed into a DataBag of maps.
It is possible by creating your own UDF. A simple UDF example is shown at the link below:
http://pig.apache.org/docs/r0.9.1/udf.html#udf-java
C = load 'path' using JsonLoader('firstName:chararray,lastName:chararray,age:int,address:(streetAddress:chararray,city:chararray,state:chararray,postalCode:chararray),
phoneNumber:{(type:chararray,number:chararray)}')