I am using the Avro Maven plugin to generate Java code from an Avro .avsc schema file. I have one common schema that is used in multiple places as a separate record. When I give it a different namespace in each place, the plugin is able to generate the Java code, but the generated classes end up in different folders even though the code in both classes is the same.
Is there any way to generate only a single class for a common reference in a scenario like this? Here is my .avsc:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "TopRecord",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "amount", "type": "double" },
    {
      "name": "AC",
      "type": {
        "type": "record",
        "name": "AC_SCHEMA",
        "fields": [
          { "name": "id", "type": "string" },
          { "name": "amount", "type": "double" },
          {
            "name": "InnerCommon",
            "type": {
              "type": "record",
              "name": "InnerSchema",
              "fields": [
                { "name": "id", "type": "string" }
              ]
            }
          }
        ]
      }
    },
    {
      "name": "BC",
      "type": {
        "type": "record",
        "name": "BC_SCHEMA",
        "fields": [
          { "name": "id", "type": "string" },
          { "name": "amount", "type": "double" },
          {
            "name": "InnerCommon",
            "type": {
              "type": "record",
              "name": "InnerSchema",
              "fields": [
                { "name": "id", "type": "string" }
              ]
            }
          }
        ]
      }
    }
  ]
}
If I give different namespaces to the InnerCommon schema in both places, it is able to generate code, but with classes in two folders that have the same code :(
Here is the working .avsc with namespaces:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "TopRecord",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "amount", "type": "double" },
    {
      "name": "AC",
      "type": {
        "type": "record",
        "name": "AC_SCHEMA",
        "fields": [
          { "name": "id", "type": "string" },
          { "name": "amount", "type": "double" },
          {
            "name": "InnerCommon",
            "type": {
              "type": "record",
              "name": "InnerSchema",
              "namespace": "inner1",
              "fields": [
                { "name": "id", "type": "string" }
              ]
            }
          }
        ]
      }
    },
    {
      "name": "BC",
      "type": {
        "type": "record",
        "name": "BC_SCHEMA",
        "fields": [
          { "name": "id", "type": "string" },
          { "name": "amount", "type": "double" },
          {
            "name": "InnerCommon",
            "type": {
              "type": "record",
              "name": "InnerSchema",
              "namespace": "inner2",
              "fields": [
                { "name": "id", "type": "string" }
              ]
            }
          }
        ]
      }
    }
  ]
}
Here is the generated folder structure (screenshot omitted): the same InnerSchema class is generated under both inner1 and inner2.
Is there any way I can put all the common generated classes in a single folder, with the same namespace, to remove the duplication?
EDIT 1:
I need to register the schema with a schema registry and check for evolution. I was wondering if there is any way to tell the plugin not to regenerate the shared code and to emit only one class.
You can move the common definitions into their own .avsc files, and then import the common files in the avro-maven-plugin configuration in your pom.xml:
<plugins>
  <plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>${avro.version}</version>
    <executions>
      <execution>
        <phase>generate-sources</phase>
        <goals>
          <goal>schema</goal>
        </goals>
        <configuration>
          <sourceDirectory>${avro.schema.dir}</sourceDirectory>
          <imports>
            <import>${my.common.dir}/my_common_type_1.avsc</import>
            <import>${my.common.dir}/my_common_type_2.avsc</import>
          </imports>
        </configuration>
      </execution>
    </executions>
  </plugin>
</plugins>
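For example, the shared record from the question could live in its own file, say ${my.common.dir}/inner_schema.avsc (the file name and the common namespace here are illustrative, not from the question):

```json
{
  "type": "record",
  "namespace": "example.avro.common",
  "name": "InnerSchema",
  "fields": [
    { "name": "id", "type": "string" }
  ]
}
```

The importing schemas then reference it by its full name instead of redefining it inline, e.g. `"type": "example.avro.common.InnerSchema"` for the InnerCommon field in both AC_SCHEMA and BC_SCHEMA, so the plugin generates a single InnerSchema class in one folder.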
Related
I have an Avro schema: a Customer record that references the CustomerAddress record.
[
  {
    "type": "record",
    "namespace": "com.example",
    "name": "CustomerAddress",
    "fields": [
      { "name": "address", "type": "string" },
      { "name": "city", "type": "string" },
      { "name": "postcode", "type": ["string", "int"] },
      { "name": "type", "type": { "type": "enum", "name": "type", "symbols": ["POBOX", "RESIDENTIAL", "ENTERPRISE"] } }
    ]
  },
  {
    "type": "record",
    "namespace": "com.example",
    "name": "Customer",
    "fields": [
      { "name": "first_name", "type": "string" },
      { "name": "middle_name", "type": ["null", "string"], "default": null },
      { "name": "last_name", "type": "string" },
      { "name": "age", "type": "int" },
      { "name": "height", "type": "float" },
      { "name": "weight", "type": "float" },
      { "name": "automated_email", "type": "boolean", "default": true },
      { "name": "customer_emails", "type": { "type": "array", "items": "string" }, "default": [] },
      { "name": "customer_address", "type": "com.example.CustomerAddress" }
    ]
  }
]
I have this JSON payload:
{
  "Customer": {
    "first_name": "John",
    "middle_name": null,
    "last_name": "Smith",
    "age": 25,
    "height": 177.6,
    "weight": 120.6,
    "automated_email": true,
    "customer_emails": ["ning.chang#td.com", "test#td.com"],
    "customer_address": {
      "address": "21 2nd Street",
      "city": "New York",
      "postcode": "10021",
      "type": "RESIDENTIAL"
    }
  }
}
When I run the command java -jar avro-tools-1.8.2.jar fromjson --schema-file customer.avsc customer.json, I get the following exception:
Exception in thread "main" org.apache.avro.AvroTypeException: Unknown union branch Customer
In your JSON data you use the key Customer, but you have to use the fully qualified name. So it should be com.example.Customer.
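Applied to the payload above, the outer key becomes the fully qualified name. Note that avro-tools reads Avro's JSON encoding, in which every non-null union value needs a branch label, so the union-typed postcode field should also be wrapped (middle_name can stay null, because the null branch is encoded as plain null):

```json
{
  "com.example.Customer": {
    "first_name": "John",
    "middle_name": null,
    "last_name": "Smith",
    "age": 25,
    "height": 177.6,
    "weight": 120.6,
    "automated_email": true,
    "customer_emails": ["ning.chang#td.com", "test#td.com"],
    "customer_address": {
      "address": "21 2nd Street",
      "city": "New York",
      "postcode": { "string": "10021" },
      "type": "RESIDENTIAL"
    }
  }
}
```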
My schema looks like this:
{
  "type": "record",
  "name": "Interactions",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
    { "name": "USER_ID", "type": "string" },
    { "name": "ITEM_ID", "type": "string" },
    { "name": "TIMESTAMP", "type": "long" },
    { "name": "EVENT_TYPE", "type": "string" },
    { "name": "EVENT_VALUE", "type": "float" },
    { "name": "SESSION_ID", "type": "string" },
    { "name": "SERVICE_TYPE", "type": "string" },
    { "name": "SERVICE_LOCATION", "type": "string" },
    { "name": "SERVICE_PRICE", "type": "int" },
    { "name": "SERVICE_TIME", "type": "long" },
    { "name": "USER_LOCATION", "type": "string" }
  ]
}
I uploaded my .csv file to the S3 bucket user-flights-bucket. When I tried to import it into Personalize, it failed with the reason:
Path does not exist: s3://user-flights-bucket/null
Give the full path including the file name as the data location, e.g. s3://user-flights-bucket/your-file.csv, and it will work.
I'm helping to build an interface that works with JSON Schema, and I have a question about interface generation based on that schema. There are two display types: one for internal users and one for external users. Both deal with the same data, but external users should see a smaller subset of fields than internal users.
For example, here is one schema, it defines an obituary:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "",
  "type": "object",
  "required": ["id", "deceased"],
  "properties": {
    "id": { "type": "string" },
    "account": {
      "type": "object",
      "required": ["name"],
      "properties": {
        "id": { "type": "number" },
        "name": { "type": "string" },
        "website": {
          "anyOf": [
            { "type": "string", "format": "uri" },
            { "type": "string", "maxLength": 0 }
          ]
        },
        "email": {
          "anyOf": [
            { "type": "string", "format": "email" },
            { "type": "string", "maxLength": 0 }
          ]
        },
        "address": {
          "type": "object",
          "properties": {
            "address1": { "type": "string" },
            "address2": { "type": "string" },
            "city": { "type": "string" },
            "state": { "type": "string" },
            "postalCode": { "type": "string" },
            "country": { "type": "string" }
          }
        },
        "phoneNumber": {
          "anyOf": [
            { "type": "string", "format": "phone" },
            { "type": "string", "maxLength": 0 }
          ]
        },
        "faxNumber": {
          "anyOf": [
            { "type": "string", "format": "phone" },
            { "type": "string", "maxLength": 0 }
          ]
        },
        "type": { "type": "string" }
      }
    },
    "deceased": {
      "type": "object",
      "required": ["fullName"],
      "properties": {
        "fullName": { "type": "string" },
        "prefix": { "type": "string" },
        "firstName": { "type": "string" },
        "middleName": { "type": "string" },
        "nickName": { "type": "string" },
        "lastName1": { "type": "string" },
        "lastName2": { "type": "string" },
        "maidenName": { "type": "string" },
        "suffix": { "type": "string" }
      }
    },
    "description": { "type": "string" },
    "photos": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
Internal users would be able to access all the fields, but external users shouldn't be able to read/write the account fields.
Should I make a second schema for the external users, or is there a way to indicate different display levels or public/private on each field?
You cannot restrict access to the fields defined in a schema, but you can have two schema files: one defining the "public" fields, and the other one defining the restricted fields plus including the public schema.
So
public-schema.json:
{
  "properties": {
    "id": ...
  }
}
restricted-schema.json:
{
  "allOf": [
    { "$ref": "./public-schema.json" },
    {
      "properties": {
        "account": ...
      }
    }
  ]
}
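A filled-in sketch of public-schema.json for that layout, using the non-restricted fields from the obituary schema above (trimmed to a few properties to keep it short):

```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "required": ["id", "deceased"],
  "properties": {
    "id": { "type": "string" },
    "deceased": {
      "type": "object",
      "required": ["fullName"],
      "properties": {
        "fullName": { "type": "string" }
      }
    },
    "description": { "type": "string" }
  }
}
```

The internal (restricted) schema then combines this file with the extra fields via allOf, adding the full account property block unchanged.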
I need to query a table with nested and repeated fields. Using the bq command line gives me a flattened result, while I need the result in the original nested format.
The original format looks like this:
{
  "fields": [
    {
      "mode": "REPEATED",
      "name": "messages",
      "type": "RECORD",
      "fields": [
        { "mode": "REQUIRED", "name": "version", "type": "STRING" },
        { "mode": "REQUIRED", "name": "hash", "type": "STRING" },
        { "mode": "REQUIRED", "name": "header", "type": "STRING" },
        { "name": "organization", "type": "STRING" },
        { "mode": "REQUIRED", "name": "date", "type": "TIMESTAMP" },
        { "mode": "REQUIRED", "name": "encoding", "type": "STRING" },
        { "mode": "REQUIRED", "name": "message_type", "type": "STRING" },
        { "mode": "REQUIRED", "name": "receiver_code", "type": "STRING" },
        { "mode": "REQUIRED", "name": "sender_code", "type": "INTEGER" },
        { "mode": "REQUIRED", "name": "segment_separator", "type": "STRING" },
        {
          "mode": "REPEATED",
          "name": "segments",
          "type": "RECORD",
          "fields": [
            {
              "mode": "REPEATED",
              "name": "elements",
              "type": "RECORD",
              "fields": [
                { "name": "name", "type": "STRING" },
                { "name": "description", "type": "STRING" },
                { "name": "value", "type": "STRING" },
                {
                  "mode": "REPEATED",
                  "name": "composite_elements",
                  "type": "RECORD",
                  "fields": [
                    { "name": "name", "type": "STRING" },
                    { "name": "description", "type": "STRING" },
                    { "name": "value", "type": "STRING" }
                  ]
                }
              ]
            },
            { "name": "description", "type": "STRING" },
            { "mode": "REQUIRED", "name": "name", "type": "STRING" }
          ]
        },
        { "mode": "REQUIRED", "name": "message_identifier", "type": "INTEGER" },
        { "mode": "REQUIRED", "name": "element_separator", "type": "STRING" },
        { "name": "composite_element_separator", "type": "STRING" }
      ]
    },
    { "mode": "REQUIRED", "name": "syntax", "type": "STRING" },
    { "mode": "REQUIRED", "name": "encoding", "type": "STRING" },
    { "mode": "REQUIRED", "name": "file_name", "type": "STRING" },
    { "mode": "REQUIRED", "name": "size", "type": "INTEGER" }
  ]
}
So how can I export the data locally while preserving the nested representation?
[EDIT]
Export via Google Cloud Storage to keep the nested representation
It seems the only way to export the nested representation is to write the query results to a table, extract that table to Google Cloud Storage, and finally download the file:
bq query --destination_table=DEV.EDI_DATA_EXPORT --replace \
--allow_large_results --noflatten_results \
"select * from DEV.EDI_DATA where syntax='EDIFACT' " \
&& bq extract --destination_format=NEWLINE_DELIMITED_JSON DEV.EDI_DATA_EXPORT gs://mybucket/data.json \
&& gsutil cp gs://mybucket/data.json .
It's surprising to me...
Whenever you use --noflatten_results you also have to use --allow_large_results and --destination_table. This stores the non-flattened results in a new table.
I'm trying to create a BigQuery table using Python. Other operations (queries, retrieving table bodies etc.) are working fine, but when trying to create a table I'm stuck with an error:
apiclient.errors.HttpError: https://www.googleapis.com/bigquery/v2/projects/marechal-consolidation/datasets/marechal_results/tables?alt=json
returned "Output field used as input">
Here's the command I'm executing:
projectId = 'xxxx'
dataSet = 'marechal_results'
with open(filePath + 'tableStructure.json') as data_file:
    structure = json.load(data_file)
table_result = tables.insert(projectId=projectId, datasetId=dataSet, body=structure).execute()
JSON table:
{
  "kind": "bigquery#table",
  "tableReference": {
    "projectId": "xxxx",
    "tableId": "xxxx",
    "datasetId": "xxxx"
  },
  "type": "table",
  "schema": {
    "fields": [
      { "mode": "REQUIRED", "type": "STRING", "description": "Company", "name": "COMPANY" },
      { "mode": "REQUIRED", "type": "STRING", "description": "Currency", "name": "CURRENCY" }
      // bunch of other fields follow...
    ]
  }
}
Why am I receiving this error?
EDIT: Here's the JSON object I'm passing as parameter:
{
  "kind": "bigquery#table",
  "type": "TABLE",
  "tableReference": {
    "projectId": "xxxx",
    "tableId": "xxxx",
    "datasetId": "xxxx"
  },
  "schema": {
    "fields": [
      { "type": "STRING", "name": "COMPANY" },
      { "type": "STRING", "name": "YEAR" },
      { "type": "STRING", "name": "COUNTRY_ISO" },
      { "type": "STRING", "name": "COUNTRY" },
      { "type": "STRING", "name": "COUNTRY_GROUP" },
      { "type": "STRING", "name": "REGION" },
      { "type": "STRING", "name": "AREA" },
      { "type": "STRING", "name": "BU" },
      { "type": "STRING", "name": "REFERENCE" },
      { "type": "FLOAT", "name": "QUANTITY" },
      { "type": "FLOAT", "name": "NET_SALES" },
      { "type": "FLOAT", "name": "GROSS_SALES" },
      { "type": "STRING", "name": "FAM_GRP" },
      { "type": "STRING", "name": "FAMILY" },
      { "type": "STRING", "name": "PRESENTATION" },
      { "type": "STRING", "name": "ORIG_FAMILY" },
      { "type": "FLOAT", "name": "REF_PRICE" },
      { "type": "STRING", "name": "CODE1" },
      { "type": "STRING", "name": "CODE4" }
    ]
  }
}
This is probably too late to help you, but hopefully it helps the next poor soul like me. It took me a while to figure out what "Output field used as input" meant.
Though the API specifies the same object for the request (input) and response (output), some fields are only allowed in the response. In the docs you will see their descriptions prefixed with "Output only". From looking at your table definition I see that you have "type": "TABLE", and "type" is listed as an "Output only" property, so I would guess that if you remove it then that error will go away. Here is the link to the docs: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables
It would help if they told you what field the violation was on.
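For reference, here is what the request body looks like with the output-only fields stripped ("type" removed; "kind" is also listed as output-only in the current docs, so it is safest to drop it too). Only the first two schema fields are shown; the rest follow the same pattern:

```json
{
  "tableReference": {
    "projectId": "xxxx",
    "tableId": "xxxx",
    "datasetId": "xxxx"
  },
  "schema": {
    "fields": [
      { "type": "STRING", "name": "COMPANY" },
      { "type": "STRING", "name": "YEAR" }
    ]
  }
}
```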