Does Impala work with nested Avro structures?

Assuming I have the following:
"properties" : {
"prop1": "propval",
"prop2": 5
"prop3": {"subprop1":"subpropval1","subprop2":"subpropval2"}
}
"testlist" : [
{"key": "item1", "key2": "value1"},
{"key": "item1", "key2": "value1"}
{"key": "item1", "key2": "value1"}
]
Is this loadable into Impala and queryable without having to specify a schema, or does it have to be a "flat" Avro schema without lists/nested structures?

Nested types in Avro will be available in Impala 2.2:
https://issues.cloudera.org/browse/IMPALA-345

Querying nested data with Impala is slated for the 2.0 release [1], which should happen in the second half of 2014.
[1] http://blog.cloudera.com/blog/2013/09/whats-next-for-impala-after-release-1-1/
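Until nested-type support lands, a common workaround is to flatten the structure yourself before writing the Avro file. Below is a rough Python sketch; the field names come from the example above, but the flattening strategy, the file name, and the one-record-per-list-element choice are assumptions of mine, not something the answers prescribe.

import json
from avro.schema import Parse
from avro.io import DatumWriter
from avro.datafile import DataFileWriter

# Flat schema: nested struct fields become prefixed columns and the list
# is exploded into one record per element.
flat_schema = Parse(json.dumps({
    "type": "record",
    "name": "FlatRecord",
    "fields": [
        {"name": "prop1", "type": "string"},
        {"name": "prop2", "type": "int"},
        {"name": "prop3_subprop1", "type": "string"},
        {"name": "prop3_subprop2", "type": "string"},
        {"name": "key", "type": "string"},
        {"name": "key2", "type": "string"}
    ]
}))

record = {
    "properties": {"prop1": "propval", "prop2": 5,
                   "prop3": {"subprop1": "subpropval1", "subprop2": "subpropval2"}},
    "testlist": [{"key": "item1", "key2": "value1"},
                 {"key": "item1", "key2": "value1"}]
}

with open("flat.avro", "wb") as out:
    writer = DataFileWriter(out, DatumWriter(), flat_schema)
    props = record["properties"]
    for item in record["testlist"]:
        writer.append({
            "prop1": props["prop1"],
            "prop2": props["prop2"],
            "prop3_subprop1": props["prop3"]["subprop1"],
            "prop3_subprop2": props["prop3"]["subprop2"],
            "key": item["key"],
            "key2": item["key2"]
        })
    writer.close()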

Related

Is there a way to match an Avro schema between BigQuery and Bigtable?

I'd like to import BigQuery data into Bigtable using Google Cloud Composer.
Exporting BigQuery rows in Avro format to GCS was successful. However, importing the Avro data into Bigtable was not.
The error says:
Caused by: org.apache.avro.AvroTypeException: Found Root, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
I guess the schemas between BigQuery and Bigtable should match each other, but I have no idea how to do this.
For every record read from the Avro files:
Attributes present in the files and in the table are loaded into the table.
Attributes present in the file but not in the table are subject to ignore_unknown_fields.
Attributes that exist in the table but not in the file will use their default value, if there is one set.
The links below are helpful.
[1] https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-storage-avro-to-bigtable
[2] https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/resources/schema/avro/bigtable.avsc
[3] Avro to BigTable - Schema issue?
For those of you who, like me, still have problems because you are not familiar with Avro, here is one working schema transformation that I found after some tinkering.
For example, suppose you have a BigQuery table with a user_id column and feature columns channel, zip_code, and history, and you want to use user_id as the Bigtable row_key and ingest all the columns. Here is example code to encode them as an Avro file.
import json
from datetime import datetime

from avro.schema import Parse
from avro.io import DatumWriter
from avro.datafile import DataFileWriter

bigtable_schema = {
    "name" : "BigtableRow",
    "type" : "record",
    "namespace" : "com.google.cloud.teleport.bigtable",
    "fields" : [
        { "name" : "key", "type" : "bytes"},
        { "name" : "cells",
          "type" : {
              "type" : "array",
              "items": {
                  "name": "BigtableCell",
                  "type": "record",
                  "fields": [
                      { "name" : "family", "type" : "string"},
                      { "name" : "qualifier", "type" : "bytes"},
                      { "name" : "timestamp", "type" : "long", "logicalType" : "timestamp-micros"},
                      { "name" : "value", "type" : "bytes"}
                  ]
              }
          }
        }
    ]
}

parsed_schema = Parse(json.dumps(bigtable_schema))

row_key = 'user_id'
family_name = 'feature_name'
feature_list = ['channel', 'zip_code', 'history']

# df is the pandas DataFrame holding the rows exported from BigQuery
with open('features.avro', 'wb') as f:
    writer = DataFileWriter(f, DatumWriter(), parsed_schema)
    for item in df.iterrows():
        row = item[1]
        # current time in microseconds, matching the timestamp-micros logical type
        ts = int(datetime.now().timestamp()) * 1000 * 1000
        for feat in feature_list:
            writer.append({
                "key": row[row_key].encode('utf-8'),
                "cells": [{"family": family_name,
                           "qualifier": feat.encode('utf-8'),
                           "timestamp": ts,
                           "value": str(row[feat]).encode('utf-8')}]
            })
    writer.close()
Then you can use the Dataflow template job to run the ingestion.
Complete code can be found here: https://github.com/mitbal/sidu/blob/master/bigquery_to_bigtable.ipynb
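As a side note, one way to kick off the provided "Cloud Storage Avro to Bigtable" template programmatically is through the Dataflow REST API via the Google API client. This is only a rough sketch: the project, region, instance, table, and bucket names are placeholders, and the template path follows the provided-templates documentation linked above.

from googleapiclient.discovery import build

# Launch the provided GCS-Avro-to-Bigtable Dataflow template.
dataflow = build("dataflow", "v1b3")
request = dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="us-central1",
    gcsPath="gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable",
    body={
        "jobName": "avro-to-bigtable",
        "parameters": {
            "bigtableProjectId": "my-project",
            "bigtableInstanceId": "my-instance",
            "bigtableTableId": "features",
            "inputFilePattern": "gs://my-bucket/features.avro",
        },
    },
)
response = request.execute()
print(response["job"]["id"])  # id of the created Dataflow job

The same launch can also be done from the Cloud Console or the gcloud CLI, as described in the template documentation.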

Querying BigQuery Events data in PowerBI

Hi, I have analytics events data moved from Firebase to BigQuery and need to create visualizations in PowerBI using that BigQuery dataset. I'm able to access the dataset in PowerBI, but some fields are arrays. I generally use UNNEST while querying in the console, but how can I run such a query inside PowerBI? Is there any other option available? Thanks.
(screenshot: the table in BigQuery)
What we did until the driver fully supports arrays is to flatten in a view: create a view in BigQuery with UNNEST() and query that in PBI instead.
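As an illustration, here is a minimal sketch of creating such a view with the BigQuery Python client. The project, dataset, and view names are placeholders, and the event_params layout is assumed to follow the standard Firebase Analytics export schema.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# View that flattens the event_params array into one row per parameter.
view = bigquery.Table("my-project.analytics_dataset.events_flat")
view.view_query = """
    SELECT
      event_name,
      event_timestamp,
      param.key AS param_key,
      param.value.string_value AS param_string_value,
      param.value.int_value AS param_int_value
    FROM `my-project.analytics_dataset.events_*`,
         UNNEST(event_params) AS param
"""
client.create_table(view)

PowerBI can then read events_flat like any ordinary table.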
You might need to Transform (parse JSON into columns/rows) your specific column, in your case event_params.
Here is an example JSON:
{
    "quiz": {
        "sport": {
            "q1": {
                "question": "Which one is correct team name in NBA?",
                "options": [
                    "New York Bulls",
                    "Los Angeles Kings",
                    "Golden State Warriros",
                    "Huston Rocket"
                ],
                "answer": "Huston Rocket"
            }
        },
        "maths": {
            "q1": {
                "question": "5 + 7 = ?",
                "options": [
                    "10",
                    "11",
                    "12",
                    "13"
                ],
                "answer": "12"
            },
            "q2": {
                "question": "12 - 8 = ?",
                "options": [
                    "1",
                    "2",
                    "3",
                    "4"
                ],
                "answer": "4"
            }
        }
    }
}
I had this JSON added to my table; currently it has only one column.
Now go to Edit Queries and open the Transform tab; there you will find Parse. In my case I parse as JSON.
When you parse as JSON you will have an expandable column.
Now click to expand it; sometimes it asks to expand to new rows.
Finally you will have a flattened table.

BigQuery: --[no]use_avro_logical_types flag doesn't work

I am trying to use the bq command with the --[no]use_avro_logical_types flag to load Avro files into a BigQuery table that does not exist before executing the command. The Avro schema contains a timestamp-millis logical type value. When the command is executed, a new table is created, but the type of its column becomes INTEGER.
This is a recently released feature, so I cannot find examples and I don't know what I am missing. Could anyone give me a good example?
My Avro schema looks like the following:
...
}, {
"name" : "timestamp",
"type" : [ "null", "long" ],
"default" : null,
"logicalType" : [ "null", "timestamp-millis" ]
}, {
...
And the command I execute is this:
bq load --source_format=AVRO --use_avro_logical_types <table> <path/to/file>
To use the timestamp-millis logical type, you can specify the field in the following way:
{
"name" : "timestamp",
"type" : {"type": "long", "logicalType" : "timestamp-millis"}
}
In order to provide an optional 'null' value, you can try out the following spec:
{
"name" : "timestamp",
"type" : ["null", {"type" : "long", "logicalType" : "timestamp-millis"}]
}
For a full list of supported Avro logical types, please refer to the Avro spec: https://avro.apache.org/docs/1.8.0/spec.html#Logical+Types.
According to https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro, the Avro type timestamp-millis is converted to an INTEGER once loaded into BigQuery.
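If you load through the google-cloud-bigquery Python client instead of the bq CLI, the equivalent switch is LoadJobConfig.use_avro_logical_types. A minimal sketch follows; the bucket, dataset, and table names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # map timestamp-millis/micros to TIMESTAMP instead of INTEGER
    use_avro_logical_types=True,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/path/to/file.avro",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish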

Create table and query json data using Amazon Athena?

I want to query JSON data of the following format using Amazon Athena:
[{"id":"0581b7c92be",
"key":"0581b7c92be",
"value":{"rev":"1-ceeeecaa040"},
"doc":{"_id":"0581b7c92be497d19e5ab51e577ada12","_rev":"1ceeeecaa04","node":"belt","DeviceId":"C001"}},
{"id":"0581b7c92be49",
"key":"0581b7c92be497d19e5",
"value":{"rev":"1-ceeeecaa04031842d3ca"},
"doc":{"_id":"0581b7c92be497","_rev":"1ceeeecaa040318","node":"belt","DeviceId":"C001"}
}
]
Athena DDL is based on Hive, so you will want each JSON object in your array to be on a separate line:
{"id": "0581b7c92be", "key": "0581b7c92be", "value": {"rev": "1-ceeeecaa040"}, "doc": {"_id": "0581b7c92be497d19e5ab51e577ada12", "_rev": "1ceeeecaa04", "node": "belt", "DeviceId": "C001"} }
{"id": "0581b7c92be49", "key": "0581b7c92be497d19e5", "value": {"rev": "1-ceeeecaa04031842d3ca"}, "doc": {"_id": "0581b7c92be497", "_rev": "1ceeeecaa040318", "node": "belt", "DeviceId": "C001"} }
You might have problems with the nested fields ("value", "doc"), so if you can flatten the JSON you will have an easier time (see for example: Hive for complex nested Json).
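If your source file is a single JSON array like the one in the question, a small Python sketch (file names are placeholders) can rewrite it into the one-object-per-line layout shown above before you upload it to S3 for Athena:

import json

# Read the original file, which is one big JSON array.
with open("devices.json") as src:
    records = json.load(src)

# Write newline-delimited JSON: one compact object per line,
# the layout the Hive/Athena JSON SerDe expects.
with open("devices.jsonl", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")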

MultiLevel JSON in PIG

I am new to Pig scripting and to working with JSON. I need to parse multi-level JSON files in Pig. Say:
{
    "firstName": "John",
    "lastName" : "Smith",
    "age" : 25,
    "address" :
    {
        "streetAddress": "21 2nd Street",
        "city" : "New York",
        "state" : "NY",
        "postalCode" : "10021"
    },
    "phoneNumber":
    [
        {
            "type" : "home",
            "number": "212 555-1234"
        },
        {
            "type" : "fax",
            "number": "646 555-4567"
        }
    ]
}
I am able to parse a single-level JSON through JsonLoader(), do joins and other operations, and get the desired results, as in JsonLoader('name:chararray,field1:int .....');
Is it possible to parse the above-mentioned JSON file using the built-in JsonLoader() function of Pig 0.10.0? If it is, please explain how it is done and how to access the fields of that particular JSON.
You can handle nested JSON loading with Twitter's Elephant Bird: https://github.com/kevinweil/elephant-bird
a = LOAD 'file3.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
This will parse the JSON into a map (http://pig.apache.org/docs/r0.11.1/basic.html#map-schema); a JSONArray gets parsed into a DataBag of maps.
It is also possible by creating your own UDF. A simple UDF example is shown in the link below:
http://pig.apache.org/docs/r0.9.1/udf.html#udf-java
C = LOAD 'path' USING JsonLoader('firstName:chararray,lastName:chararray,age:int,address:(streetAddress:chararray,city:chararray,state:chararray,postalCode:chararray),phoneNumber:{(type:chararray,number:chararray)}');