How to represent repeated fields in Avro schema? - serialization

My data model has few fixed fields and a block of variable fields. The variable fields as a block, can repeat o to n number of times within the same record.
The object person can be used as an analogy for this. The name has just one entry in each record but he can have o to n number of addresses, and the field address has a structure too. Is there a way to loop through the address schema for any number of addresses the person has? How do I mention this in the Avro schema file?

Have you tried using a nested Avro schema. That should solve your one-person-multiple-addresses requirement. Here is a schema that would help.
{
"type": "record",
"name" : "person",
"namespace" : "com.testavro",
"fields": [
{ "name" : "personname", "type": ["null","string"] },
{ "name" : "personId", "type": ["null","string"] },
{ "name" : "Addresses", "type": {
"type": "array",
"items": [ {
"type" : "record",
"name" : "Address",
"fields" : [
{ "name" : "addressLine1", "type": ["null", "string"] },
{ "name" : "addressLine2", "type": ["null", "string"] },
{ "name" : "city", "type": ["null", "string"] },
{ "name" : "state", "type": ["null", "string"] },
{ "name" : "zipcode", "type": ["null", "string"] }
]
}]
}
}
]
}
When code is generated with the above avro schema you get the person class and the Address class. The autogenerated class for person class(only field declarations) looks like
/**
* RecordBuilder for person instances.
*/
public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<person>
implements org.apache.avro.data.RecordBuilder<person> {
private java.lang.String personname;
private java.lang.String personId;
private java.util.List<java.lang.Object> Addresses;
and the Address class (only field declarations) looks like
/**
* RecordBuilder for Address instances.
*/
public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<Address>
implements org.apache.avro.data.RecordBuilder<Address> {
private java.lang.String addressLine1;
private java.lang.String addressLine2;
private java.lang.String city;
private java.lang.String state;
private java.lang.String zipcode;
Is this what you were looking for?

Related

Json Schema required validation

I have my json schema where all values are required. For example:
....
{
"properties" : {
"minimumDelay" : {
"type" : "number"
},
"length" : {
"type" : "number"
},
},
"required": {
"minimumDelay",
"length"
}
Here the json data will be valid if I enter both minimumDelay and length values.
But my requirement is json data must be valid when I enter either 1 of the values(like XOR case). How my schema must be modified to achieve the same?
In JSON Schema, the XOR operator is oneOf.
{
"properties" : {
"minimumDelay" : {
"type" : "number"
},
"length" : {
"type" : "number"
}
},
"oneOf": [
{ "required": ["minimumDelay"] },
{ "required": ["length"] }
]
}

Morphline config file not indexing avro nexted data

I am generating index for my avro data in solr. Index are only getting generated for data elements which are at root level and not which are nested.
Below is the sample schema (not including all of it)
My Avro Schema is as below.
{
"type" : "record",
"name" : "abcd",
"namespace" : "xyz",
"doc" : "Schema Definition for Low Fare Search Shopping Request/Response Data",
"fields" : [ {
"name" : "ShopID",
"type" : "string"
}, {
"name" : "RqSysTimestamp",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "RqTimestamp",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "RsSysTimestamp",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "RsTimestamp",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "Request",
"type" : {
"type" : "record",
"name" : "RequestStruct",
"fields" : [ {
"name" : "TransactionID",
"type" : [ "string", "null" ]
}, {
"name" : "AgentSine",
"type" : [ "string", "null" ]
}, {
"name" : "CabinPref",
"type" : [ {
"type" : "array",
"items" : {
"type" : "record",
"name" : "CabinStruct",
"fields" : [ {
"name" : "Cabin",
"type" : [ "string", "null" ]
}, {
"name" : "PrefLevel",
"type" : [ "string", "null" ]
} ]
}
}, "null" ]
}, {
"name" : "CountryCode",
"type" : [ "string", "null" ]
},
"name" : "PassengerStatus",
"type" : [ "string", "null" ]
}, {
}
How do i refer "TransactionID" in my morphline config file. I tried all options but it does not generate index for data elements which are nested.
Below is the sample of my morphline config file.
extractAvroPaths {
flatten : true
paths : {
ShopID : /ShopID
RqSysTimestamp : /RqSysTimestamp
RqTimestamp : /RqTimestamp
RsSysTimestamp :/RsSysTimestamp
RsTimestamp : /RsTimestamp
TransactionID : "/Request/RequestStruct/TransactionID"
AgentSine : "/Request/RequestStruct/AgentSine"
Cabin :/Cabin
PrefLevel :/PrefLevel
CountryCode :/CountryCode
FrequentFlyerStatus :/FrequentFlyerStatus
The toAvro command expects a java.util.Map as input on conversion to a nested Avro record. So this is my solution.
morphlines: [
{
id: convertJsonToAvro
importCommands: [ "org.kitesdk.**" ]
commands: [
# read the JSON blob
{ readJson: {} }
# java code
{
java {
imports : """
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.kitesdk.morphline.base.Fields;
import java.io.IOException;
import java.util.Set;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
"""
code : """
String jsonStr = record.getFirstValue(Fields.ATTACHMENT_BODY).toString();
ObjectMapper mapper = new ObjectMapper();
Map<String, Object> map = null;
try {
map = (Map<String, Object>)mapper.readValue(jsonStr, Map.class);
} catch (IOException e) {
e.printStackTrace();
}
Set<String> keySet = map.keySet();
for (String o : keySet) {
record.put(o, map.get(o));
}
return child.process(record);
"""
}
}
# convert the extracted fields to an avro object
# described by the schema in this field
{ toAvro {
schemaFile: /etc/flume/conf/a1/like_user_event_realtime.avsc
} }
#{ logInfo { format : "loginfo: {}", args : ["#{}"] } }
# serialize the object as avro
{ writeAvroToByteArray: {
format: containerlessBinary
} }
]
}
]

how to add data to the repeatable fields in avro schema?

I'm trying to test avro serde and deserde without code generation (I completed this task using code generation). Schema is as follows
{
"type": "record",
"name" : "person",
"namespace" : "avro",
"fields": [
{ "name" : "personname", "type": ["null","string"] },
{ "name" : "personId", "type": ["null","string"] },
{ "name" : "Addresses", "type": {
"type": "array",
"items": [ {
"type" : "record",
"name" : "Address",
"fields" : [
{ "name" : "addressLine1", "type": ["null", "string"] },
{ "name" : "addressLine2", "type": ["null", "string"] },
{ "name" : "city", "type": ["null", "string"] },
{ "name" : "state", "type": ["null", "string"] },
{ "name" : "zipcode", "type": ["null", "string"] }
]
}]
}
},
{ "name" : "contact", "type" : ["null", "string"]}
]
}
I understand this is how data is added to the schema.
Schema schema = new Schema.Parser().parse(new File("src/person.avsc.txt"));
GenericRecord person1 = new GenericData.Record(schema);
person1.put("personname", "goud");
But how do I add city, state etc to address and then add it to addresses?
GenericRecord address1 = new GenericData.Record(schema);
address1.put("city", "SanJose");
The above snippet doesn't work. I tried to look into GenericArray, but I couldn't get my head around it.
You need to describe inner complex type ("type" : "record", "name" : "Address") in separate schema, like this:
{
"type" : "record",
"name" : "Address",
"fields" : [
{ "name" : "addressLine1", "type": ["null", "string"] },
{ "name" : "addressLine2", "type": ["null", "string"] },
{ "name" : "city", "type": ["null", "string"] },
{ "name" : "state", "type": ["null", "string"] },
{ "name" : "zipcode", "type": ["null", "string"] }
]
}
Then you may create an inner object:
Schema innerSchema = new Schema.Parser().parse(new File("person_address.avsc"));
GenericRecord address = new GenericData.Record(innerSchema);
address.put("addressLine1", "adr_1");
address.put("addressLine2", "adr_2");
address.put("city", "test_city");
address.put("state", "test_state");
address.put("zipcode", "zipcode_00000");
Then add an inner object you created to ArrayList.
At last, create the main object and add all this staff in it.
Here is full example in java:
Schema innerSchema = new Schema.Parser().parse(new File("person_address.avsc"));
GenericRecord address = new GenericData.Record(innerSchema);
address.put("addressLine1", "adr_1");
address.put("addressLine2", "adr_2");
address.put("city", "test_city");
address.put("state", "test_state");
address.put("zipcode", "zipcode_00000");
ArrayList<GenericRecord> addresses = new ArrayList<>();
addresses.add(address);
Schema mainSchema = new Schema.Parser().parse(new File("person.avsc"));
GenericRecord person1 = new GenericData.Record(mainSchema);
person1.put("personname", "goud");
person1.put("personId", "123_id");
person1.put("Addresses", addresses);
Result:
{
"personname": "goud",
"personId": "123_id",
"Addresses": [
{
"addressLine1": "adr_1",
"addressLine2": "adr_2",
"city": "test_city",
"state": "test_state",
"zipcode": "zipcode_00000"
}
],
"contact": "test_contact"
}

how to execute json-schema validator in java

i am new to json-schema i want to validate json data with json-schema,
this is my json data
{
users: [
{
"id": 1,
"username": "davidwalsh",
"phoneNumber": 987654321
},
{
"id": 2,
"username": "russianprince",
"phoneNumber": 9876541234
}
]
}
this is my json-Schema
{
"type" : "object",
"properties" : {
"users" : {
"type" : "array",
"items" : {
"type" : "object",
"properties" : {
"id": { "type": "number" },
"username": { "type" : "string" },
"phoneNumber": {"type": "number" }
}
}
}
}
}
now i dont no how to execute these files in program.
if any one have any examples please provide me some links.
thank you.
JSON Schemas are not "executable" in themselves. They are documents, that describe other documents.
To do something like validation, you need a validator that takes both the data and the schema as input. The JSON Schema main site has a list of software which might be helpful. :)

ElasticSearch for Attribute(Key) value data set

I am using Elasticsearch with Haystacksearch and Django and want to search the follow structure:
{
{
"title": "book1",
"category" : ["Cat_1", "Cat_2"],
"key_values" :
[
{
"key_name" : "key_1",
"value" : "sample_value_1"
},
{
"key_name" : "key_2",
"value" : "sample_value_12"
}
]
},
{
"title": "book2",
"category" : ["Cat_3", "Cat_2"],
"key_values" :
[
{
"key_name" : "key_1",
"value" : "sample_value_1"
},
{
"key_name" : "key_3",
"value" : "sample_value_6"
},
{
"key_name" : "key_4",
"value" : "sample_value_5"
}
]
}
}
Right now I have set up an index model using Haystack with a "text" that put all the data together and runs a full text search! In my opinion this is not the a well established search 'cause I am not using my data set structure and hence this is some kind odd.
As an example if for an object I have a key-value
{
"key_name": "key_1",
"value": "sample_value_1"
}
and for another object I have
{
"key_name": "key_2",
"value": "sample_value_1"
}
and we it gets a query like "Key_1 sample_value_1" comes I get a thoroughly mixed result of objects who have these words in their fields rather than using their structures.
P.S. I am totally new to ElasticSearch and better to say new to the search technologies and challenges. I have searched the web and SO button didn't find anything satisfying. Please let me know if there is something wrong with my thoughts and expectations from these search engines and if there is SO duplicate question! And also if there is a better approach to design a database for this kind of search
Read the es docs on nested mappings and do something like this:
"book_type" : {
"properties" : {
// title, cat mappings
"key_values" : {
"type" : "nested"
"properties": {
"key_name": {
"type": "string", "index": "not_analyzed"
},
"value": {
"type": "string"
}
}
}
}
}
Then query using a nested query
"nested" : {
"path" : "key_values",
"query" : {
"bool" : {
"must" : [
{
"term" : {"key_values.key_name" : "key_1"}
},
{
"match" : {"key_values.value" : "sample_value_1"}
}
]
}
}
}