Hive can't create table with nested Avro schema

I'm trying to use a nested Avro schema to create a Hive table, but it does not work. I'm using Hive 1.1 on CDH 5.7.2.
Here is my nested Avro schema:
[
  {
    "type": "record",
    "name": "Id",
    "namespace": "com.test.app_list",
    "doc": "Device ID",
    "fields": [
      {
        "name": "idType",
        "type": "int"
      },
      {
        "name": "id",
        "type": "string"
      }
    ]
  },
  {
    "type": "record",
    "name": "AppList",
    "namespace": "com.test.app_list",
    "doc": "",
    "fields": [
      {
        "name": "appId",
        "type": "string",
        "avro.java.string": "String"
      },
      {
        "name": "timestamp",
        "type": "long"
      },
      {
        "name": "idList",
        "type": [{"type": "array", "items": "com.test.app_list.Id"}]
      }
    ]
  }
]
And my SQL to create the table:
CREATE EXTERNAL TABLE app_list
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='/hive/schema/test_app_list.avsc');
But hive gives me:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.avro.AvroSerdeException Schema for table must be of type RECORD. Received type: UNION)
The Hive doc says the AvroSerDe "Supports arbitrarily nested schemas" (from: https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-Overview–WorkingwithAvrofromHive).
Data sample:
{
  "appId": {"string": "com.test.app"},
  "timestamp": {"long": 1495893601606},
  "idList": {
    "array": [
      {"idType": 15, "id": "6c:5c:14:c3:a5:39"},
      {"idType": 13, "id": "eb297afe56ff340b6bb7de5c5ab09193"}
    ]
  }
}
But I don't know how to make this work, so I need some help to fix it. Thanks!

The top level of your Avro schema is expected to be a record type; that is why Hive doesn't allow this. A workaround could be to create the top level as a record and define the two nested types as fields inside it:
{
  "type": "record",
  "name": "myRecord",
  "namespace": "com.test.app_list",
  "fields": [
    {
      "name": "Id",
      "doc": "Device ID",
      "type": {
        "type": "record",
        "name": "Id",
        "fields": [
          {"name": "idType", "type": "int"},
          {"name": "id", "type": "string"}
        ]
      }
    },
    {
      "name": "AppList",
      "doc": "",
      "type": {
        "type": "record",
        "name": "AppList",
        "fields": [
          {"name": "appId", "type": "string", "avro.java.string": "String"},
          {"name": "timestamp", "type": "long"},
          {"name": "idList", "type": [{"type": "array", "items": "com.test.app_list.Id"}]}
        ]
      }
    }
  ]
}
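With a single top-level record in place, the original DDL should no longer hit the UNION error. A minimal sketch, assuming the corrected schema is saved back to the same /hive/schema/test_app_list.avsc path; the LOCATION directory is a hypothetical example, not something from the question:

CREATE EXTERNAL TABLE app_list
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/hive/data/app_list'   -- hypothetical data directory
TBLPROPERTIES (
  'avro.schema.url'='/hive/schema/test_app_list.avsc');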

Related

Create table from Google Sheet with multi line headings

I have been creating BigQuery tables from Google Sheets via the BQ console GUI. This has been working well, but I am now trying to import a file with multiple heading lines and I don't know how to go about it.
If I try to auto detect schema and skip the first 2 columns I get the error Failed to create table: Duplicate column names: 'Date'.
If I try to manually create the schema, similar to below, then the import works, but when I go to query the records I get the error Unsupported field type: RECORD.
[
  {"name": "SubjectName", "type": "STRING", "mode": "NULLABLE"},
  {"name": "SubjectCode", "type": "STRING", "mode": "NULLABLE"},
  {
    "name": "Class1",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {"name": "Date", "type": "STRING", "mode": "NULLABLE"},
      {"name": "Tutor", "type": "STRING", "mode": "NULLABLE"}
    ]
  },
  {
    "name": "Class2",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {"name": "Date", "type": "STRING", "mode": "NULLABLE"},
      {"name": "Tutor", "type": "STRING", "mode": "NULLABLE"}
    ]
  },
  {
    "name": "Class3",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {"name": "Date", "type": "STRING", "mode": "NULLABLE"},
      {"name": "Tutor", "type": "STRING", "mode": "NULLABLE"}
    ]
  }
]
Is there any way to import a google sheet with multiple headings via the BQ Console gui?

Convert JSON to Avro with nifi

I'm trying to read a RabbitMQ queue and transfer the data to Hive.
My flow is: ConsumeAMQP -> ConvertJSONToAvro -> PutHiveStreaming.
I have an error on the ConvertJSONToAvro processor.
JSON:
{
  "bn": "/27546/0",
  "bt": 48128.94568269015,
  "e": [
    {"n": "1000", "sv": "8125333b8-5cae-4c8d-a5312-bbb215211dab"},
    {"n": "1001", "v": 57.520565032958984},
    {"n": "1002", "v": 22.45258230712891},
    {"n": "1003", "v": 1331.0},
    {"n": "1005", "v": 53.0},
    {"n": "1011", "v": 50.0},
    {"n": "5518", "t": 44119.703412761854},
    {"n": "1023", "v": 0.0},
    {"n": "1024", "v": 48128.94568269015},
    {"n": "1025", "v": 7.0}
  ]
}
Record schema:
{
  "type": "record",
  "namespace": "nifi",
  "fields": [
    {"name": "bn", "type": "string"},
    {"name": "bt", "type": "number"},
    {
      "name": "e",
      "type": "array",
      "items": {
        "type": "record",
        "fields": [
          {"name": "n", "type": "string"},
          {"name": "sv", "type": "string"},
          {"name": "v", "type": "number"},
          {"name": "t", "type": "number"}
        ]
      }
    }
  ]
}
Error
Record schema validated against '{"type":"record"...
I could not figure out what was wrong.
"items" : {
"type" : "record",
You need to add a name to this new record type. Avro doesn't allow "anonymous" record types.
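As a hedged sketch of that fix (the record name "element" is my own choice, and since Avro has no "number" primitive the numeric fields are shown as nullable "double"), the "e" field could look something like:

{
  "name": "e",
  "type": {
    "type": "array",
    "items": {
      "type": "record",
      "name": "element",
      "fields": [
        {"name": "n",  "type": "string"},
        {"name": "sv", "type": ["null", "string"], "default": null},
        {"name": "v",  "type": ["null", "double"], "default": null},
        {"name": "t",  "type": ["null", "double"], "default": null}
      ]
    }
  }
}

Note that the array definition becomes the value of the field's "type" key, and the optional fields default to null so array elements that omit them still convert.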

Amazon Personalize dataset import job creation failed

My schema looks like:
{
  "type": "record",
  "name": "Interactions",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
    {"name": "USER_ID", "type": "string"},
    {"name": "ITEM_ID", "type": "string"},
    {"name": "TIMESTAMP", "type": "long"},
    {"name": "EVENT_TYPE", "type": "string"},
    {"name": "EVENT_VALUE", "type": "float"},
    {"name": "SESSION_ID", "type": "string"},
    {"name": "SERVICE_TYPE", "type": "string"},
    {"name": "SERVICE_LOCATION", "type": "string"},
    {"name": "SERVICE_PRICE", "type": "int"},
    {"name": "SERVICE_TIME", "type": "long"},
    {"name": "USER_LOCATION", "type": "string"}
  ]
}
I uploaded my .csv file to the S3 bucket user-flights-bucket. When I tried to import it into Personalize, the job failed with the reason:
Path does not exist: s3://user-flights-bucket/null;
Give the full object path as the data location, e.g. s3://user-flights-bucket/"your file name".csv, and it will work.
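For reference, a minimal sketch of creating the import job with boto3 and the full S3 key in dataLocation; the job name, ARNs, and file name below are placeholders, not values from the question:

import boto3

personalize = boto3.client("personalize")

# dataLocation must point at the CSV object itself, not just the bucket
response = personalize.create_dataset_import_job(
    jobName="user-flights-import",                       # hypothetical job name
    datasetArn="arn:aws:personalize:...:dataset/...",    # your interactions dataset ARN
    dataSource={"dataLocation": "s3://user-flights-bucket/your-file.csv"},
    roleArn="arn:aws:iam::...:role/PersonalizeS3Role",   # role with read access to the bucket
)
print(response["datasetImportJobArn"])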

Pentaho Kettle Avro input

I was wondering if someone has managed to make this step work?
For example, I have a local Avro schema file with the .avsc extension that I want to read, but I can't.
This is the file; could someone share an example .ktr showing how to read it?
Thanks
{
  "namespace": "$package.bi.ca.cp",
  "type": "record",
  "doc": "sample1",
  "name": "cp",
  "fields": [
    {
      "name": "cpi",
      "type": "string",
      "doc": "sample2"
    },
    {
      "name": "ci",
      "type": ["null", "string"],
      "doc": "sample3"
    },
    {
      "name": "pmv",
      "type": ["null", {"type": "array", "items": "$package.bi.ca.cp.ckmv"}],
      "doc": ""
    }
  ]
}

Resolving error: returned "Output field used as input"

I'm trying to create a BigQuery table using Python. Other operations (queries, retrieving table bodies etc.) are working fine, but when trying to create a table I'm stuck with an error:
apiclient.errors.HttpError: https://www.googleapis.com/bigquery/v2/projects/marechal-consolidation/datasets/marechal_results/tables?alt=json
returned "Output field used as input">
Here's the command I'm executing:
projectId = 'xxxx'
dataSet = 'marechal_results'

with open(filePath + 'tableStructure.json') as data_file:
    structure = json.load(data_file)

table_result = tables.insert(projectId=projectId, datasetId=dataSet, body=structure).execute()
JSON table:
{
  "kind": "bigquery#table",
  "tableReference": {
    "projectId": "xxxx",
    "tableId": "xxxx",
    "datasetId": "xxxx"
  },
  "type": "table",
  "schema": {
    "fields": [
      {"mode": "REQUIRED", "type": "STRING", "description": "Company", "name": "COMPANY"},
      {"mode": "REQUIRED", "type": "STRING", "description": "Currency", "name": "CURRENCY"}
      // bunch of other fields follow...
    ]
  }
}
Why am I receiving this error?
EDIT: Here's the JSON object I'm passing as parameter:
{
  "kind": "bigquery#table",
  "type": "TABLE",
  "tableReference": {
    "projectId": "xxxx",
    "tableId": "xxxx",
    "datasetId": "xxxx"
  },
  "schema": {
    "fields": [
      {"type": "STRING", "name": "COMPANY"},
      {"type": "STRING", "name": "YEAR"},
      {"type": "STRING", "name": "COUNTRY_ISO"},
      {"type": "STRING", "name": "COUNTRY"},
      {"type": "STRING", "name": "COUNTRY_GROUP"},
      {"type": "STRING", "name": "REGION"},
      {"type": "STRING", "name": "AREA"},
      {"type": "STRING", "name": "BU"},
      {"type": "STRING", "name": "REFERENCE"},
      {"type": "FLOAT", "name": "QUANTITY"},
      {"type": "FLOAT", "name": "NET_SALES"},
      {"type": "FLOAT", "name": "GROSS_SALES"},
      {"type": "STRING", "name": "FAM_GRP"},
      {"type": "STRING", "name": "FAMILY"},
      {"type": "STRING", "name": "PRESENTATION"},
      {"type": "STRING", "name": "ORIG_FAMILY"},
      {"type": "FLOAT", "name": "REF_PRICE"},
      {"type": "STRING", "name": "CODE1"},
      {"type": "STRING", "name": "CODE4"}
    ]
  }
}
This is probably too late to help you, but hopefully it helps the next poor soul like me. It took me a while to figure out what "Output field used as input" meant.
Though the API specifies the same object for the request (input) and the response (output), some fields are only allowed in the response. In the docs you will see their descriptions prefixed with "Output only". Looking at your table definition, you have "type": "TABLE", and "type" is listed as an "Output only" property, so I would wager that if you remove it the error will go away. Here is the link to the docs: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables
It would help if they told you which field the violation was on.
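A minimal sketch of that fix applied to the question's own script (assuming the same tables service object and variables as above; the pop() call just strips the output-only key before the insert):

import json

with open(filePath + 'tableStructure.json') as data_file:
    structure = json.load(data_file)

# "type" is documented as Output only, so drop it from the request body
structure.pop('type', None)

table_result = tables.insert(
    projectId=projectId,
    datasetId=dataSet,
    body=structure,
).execute()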