Change/update a field name in the NiFi Schema Text property across various parallel flows

I have a few identical parallel flows (as shown in the screenshot). Each flow has a ConvertRecord processor, and in its Record Reader I have used "Schema Text Field Property" as the access strategy and specified the "Schema Text". For example:
{
  "type": "record",
  "name": "AVLRecord0",
  "fields": [
    {"name": "TimeOfDay", "type": "string", "logicalType": "timestamp-millis"},
    {"name": "Field1", "type": "double"},
    {"name": "Field2", "type": "double"},
    {"name": "Field3", "type": "double"},
    {"name": "Filename", "type": "string"}
  ]
}
Let's say I have used the above schema in the ConvertRecord processors of various parallel flows, and now I want to rename one field, say from Field1 to Field_Name. Is there any way to do this in one go across all the ConvertRecord Schema Texts?
If I want to change/update one of the fields in the Schema Text, do I have to change/update the field name in each processor manually? Or is there a global way to change the field name across all the parallel flows I have?
Is there any way to update the Schema Text across various processors in one go?
Any help is much appreciated! Thanks

Since you are using the Schema Text Field property, you would need to change the schema in every ConvertRecord processor manually.
Try this approach instead:
In the ConvertRecord processor, use a Schema Access Strategy of
Use Schema Name Property
Then set up an AvroSchemaRegistry controller service and define your schema by adding a new property.
I have added sch as the schema name and defined the Avro schema as its value.
After the GetFile processor, use an UpdateAttribute processor to add a schema.name attribute (for example, with the value sch) to the flowfile.
Now, in the reader controller service, set the Schema Access Strategy to Use Schema Name Property and the Schema Registry to the AvroSchemaRegistry that has already been set up.
By following this approach we are not defining the schema in every ConvertRecord processor; instead, they all refer to the same schema defined in the AvroSchemaRegistry. If you want to change one field name, you can go into the registry and change it in one place.
Flow:
1. GetFile
2. UpdateAttribute //add schema.name attribute
3. ConvertRecord //use AvroSchemaRegistry with Schema Access Strategy set to Use Schema Name Property
...other processors
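For illustration, here is a sketch of what the single registry entry's schema text might look like after the rename (assuming the field being renamed is Field1; this registry property is the one place that needs editing):

{
  "type": "record",
  "name": "AVLRecord0",
  "fields": [
    {"name": "TimeOfDay", "type": "string", "logicalType": "timestamp-millis"},
    {"name": "Field_Name", "type": "double"},
    {"name": "Field2", "type": "double"},
    {"name": "Field3", "type": "double"},
    {"name": "Filename", "type": "string"}
  ]
}

Every ConvertRecord whose flowfiles carry schema.name = sch then picks up the change without being touched.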
Refer to this link for more details regarding defining/using an AvroSchemaRegistry.

Related

Is there a way to add a default to a json schema array

I just want to understand if there is a way to add a default set of values to an array. (I don't think there is.)
So, ideally, I would like something like the following to work, i.e. the fileTypes element defaults to an array of ["jpg", "png"]:
"fileTypes": {
"description": "The accepted file types.",
"type": "array",
"minItems": 1,
"items": {
"type": "string",
"enum": ["jpg", "png", "pdf"]
},
"default": ["jpg", "png"]
},
Of course, all that being said... the above actually does seem to validate as JSON Schema. However, in VS Code, for example, this default value does not populate when creating documents the way other defaults (like those for strings) do.
It appears to be valid based on the spec.
9.2. "default"
There are no restrictions placed on the value of this keyword. When multiple occurrences of this keyword are applicable to a single sub-instance, implementations SHOULD remove duplicates.
This keyword can be used to supply a default JSON value associated with a particular schema. It is RECOMMENDED that a default value be valid against the associated schema.
See https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.9.2
It's up to the tooling to take advantage of that keyword in the JSON Schema, and it sounds like VS Code is not doing so.
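To see that "default" is purely an annotation, here is a minimal sketch using the Python jsonschema library (my choice for illustration; the question is about VS Code, but any spec-compliant validator behaves the same way):

import jsonschema  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "fileTypes": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "string", "enum": ["jpg", "png", "pdf"]},
            "default": ["jpg", "png"],
        }
    },
}

doc = {}  # "fileTypes" omitted entirely
jsonschema.validate(doc, schema)  # passes: "default" imposes no constraint
print(doc)                        # {} -- the validator does not inject the default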

Value of key A equals value of key B in JSON Schema

For {keyA: valueA}, {keyB: valueB}, is it possible to define in the schema that valueB must equal valueA? In other words, to copy valueA down to valueB?
I understand this causes duplication, but two different keys must be used to meet different standards.
For example, I want to use name as sample name in the schema below.
Schema
{
  "$id": "sampleSchema",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "sample name": {
      "type": "string"
    }
  }
}
The data will be like:
{
  "name": "example1",
  "sample name": "example1"
}
JSON Schema does not support operations like this.
We call this "data consistency validation" because it tests that data in one place is consistent with how it's defined in another location.
Supporting these types of operations would be very difficult. It would probably require a general purpose programming language to support most of the cases that people would like to see.
For more information, see Scope of JSON Schema Validation.
As an alternative, some validators allow you to implement custom keywords, or implement events or hooks when an instance is being validated against a schema with a particular ID. You can use this to implement the functionality you're looking for.
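As an illustration of that alternative, here is a minimal sketch in Python: validate the structure with JSON Schema first, then enforce the cross-field rule in ordinary code (the function name is made up for this example):

import jsonschema  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "sample name": {"type": "string"},
    },
}

def validate_with_consistency(doc):
    # Structural validation via JSON Schema...
    jsonschema.validate(doc, schema)
    # ...then the data-consistency rule JSON Schema cannot express.
    if doc.get("sample name") != doc.get("name"):
        raise ValueError('"sample name" must equal "name"')

validate_with_consistency({"name": "example1", "sample name": "example1"})  # ok
validate_with_consistency({"name": "example1", "sample name": "example2"})  # raises ValueError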

Why is the Avro default value not used? (with avro-python)

I'm serializing some data with Avro (using the Python library), and I'm having a hard time figuring out how to make the "default" value work.
I have this schema:
{
  "type": "record",
  "fields": [
    {"name": "amount", "type": "long"},
    {"name": "currency", "type": "string", "default": "EUR"}
  ],
  "name": "Monetary"
}
As I understood it, I could pass an amount and no currency, and the currency field would take the "EUR" value. However, if I don't pass a "currency" field when writing, I get the error avro.io.AvroTypeException: The datum { ... } is not an example of the schema xxx...
If I make the currency field's type a union ["string", "null"], then the data is serialized, but currency is null.
So it seems the "default" value is not taken into account at all.
What am I missing? Are default values applicable to primitive types?
Thanks in advance
Here is the relevant quote from the Avro specification:
default: A default value for this field, used when reading instances that lack this field (optional)
The default value is used when you read an instance that was written with one schema and convert it to an instance of another schema. If the field does not exist in the writer's schema (and thus the instance lacks the field), the instance you get will take the default value from the reader's schema.
That's it!
The default value is not used when you read/write an instance using the same schema.
So, for your example: when you give the currency field a default value and then read an instance that was written with an older schema that did not contain a currency field, the instance you get will contain the default value defined in your (reader's) schema.
Worth mentioning: when you use a union, the default value applies only to the first type in the union.
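To make the read-time behavior concrete, here is a minimal sketch with the avro Python library (assuming a recent release where avro.schema.parse is the parsing entry point; older versions spell it Parse):

import io
import avro.io
import avro.schema

# Writer's (old) schema: no "currency" field at all.
old_schema = avro.schema.parse(
    '{"type": "record", "name": "Monetary",'
    ' "fields": [{"name": "amount", "type": "long"}]}'
)

# Reader's (new) schema: "currency" added with a default.
new_schema = avro.schema.parse(
    '{"type": "record", "name": "Monetary",'
    ' "fields": [{"name": "amount", "type": "long"},'
    ' {"name": "currency", "type": "string", "default": "EUR"}]}'
)

# Serialize a datum that genuinely lacks "currency", using the old schema.
buf = io.BytesIO()
avro.io.DatumWriter(old_schema).write({"amount": 100}, avro.io.BinaryEncoder(buf))

# Deserialize with the old schema as writer's schema and the new one as reader's.
buf.seek(0)
record = avro.io.DatumReader(old_schema, new_schema).read(avro.io.BinaryDecoder(buf))
print(record)  # {'amount': 100, 'currency': 'EUR'} -- default applied at read time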

choosing between different objects in JSON-schema

I'm creating a schema for receipts and want to have a master schema for the core concepts with a variety of different detail objects for specialized receipt types (e.g. itemized hotel receipts). My current implementation leverages the oneOf mechanism in JSON Schema:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Receipt",
  "type": "object",
  "properties": {
    ...
    "amount": { "type": "number" },
    "detail": {
      "type": "object",
      "oneOf": [
        { "$ref": "general-detail.schema.json" },
        { "$ref": "hotel-detail.schema.json" },
        ...
      ]
    }
  }
}
The problem with this approach is that when I validate (using tv4), it appears that all of the schemas specified in oneOf are being checked and are, in fact, returning errors. I can minimize this effect by getting rid of the detail property, moving oneOf to the schema level (e.g. outside of properties), and then creating root property names in each of the sub-schemas. However, even in that case, I get "Missing required property: generalDetail" whenever there's an error while I'm validating a hotel receipt type.
So 2 questions:
is it even possible to use a generic detail property like I'm currently doing and not have the validator completely validate each sub-schema in the oneOf structure (e.g. am I using oneOf wrongly)?
if it is not possible, I would be more than fine simply having a set of 'typed' detail properties (like 'generalDetail', 'hotelDetail', etc.) - but is there a way to specify that they are a group and that only one of them should exist in the document being validated?
TIA
It is usually better to use anyOf - you very rarely need oneOf. The latter will always validate all schemas; the former will most likely exit at the first one that passes.
You may look at some other validators. tv4 has many deviations from the standard and also is very slow. https://github.com/ebdrup/json-schema-benchmark
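For comparison, a sketch of the same detail property with anyOf substituted (same $refs as in the question; whether a validator short-circuits at the first match is implementation-dependent):

"detail": {
  "type": "object",
  "anyOf": [
    { "$ref": "general-detail.schema.json" },
    { "$ref": "hotel-detail.schema.json" }
  ]
}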
All of the schemas in oneOf need to be validated in order for the validator to ensure that exactly one of the schemas passes. If none pass, or more than one passes, the validator needs to tell you the validation results of each schema so that you can determine how to fix the error.
So, just because the validator is telling you why each of the schemas is failing doesn't mean that it expects all of those schemas to pass.

not_indexed field is stored in index

I'm trying to optimize my elasticsearch scheme.
I have a field which is a URL - I do not want to be able to query or filter by it, just retrieve it.
My understanding is that a field that is defined as "index":"no" is not indexed, but is still stored in the index.
(see slide 5 in http://www.slideshare.net/nitin_stephens/lucene-basics)
This should map to Lucene's UnIndexed, right?
This confuses me. Is there a way to store some fields without them taking up more storage than simply their content, and without encumbering the index for the other fields?
What am I missing?
I'm new to posting on stack exchange but believe I can help a bit!
There are a few considerations here:
Analyzing
As you don't want to do extra work, you should set "index": "no". This means the field will not be run through any tokenizers or filters.
Furthermore, it will not be searchable when directing a query at the specific field (no hits):
"query": {
"term": {
"url": "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"
}
}
*here "url" is the field name.
However, the field will still be searchable in the _all field (might have a hit):
"query": {
"term": {
"_all": "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"
}
}
_all field
By default every field gets put in the _all field. Set "include_in_all": "false" to stop that. This might not be an issue for you, as you are unlikely to search against the _all field with a URL by mistake.
I was working with a schema where countries were stored as 2-letter codes, e.g. "NO" means Norway, and it is possible someone might do a search against the _all field with "NO", so I made sure to set "include_in_all": "false".
Note: Any query where you don't specify a field explicitly will be executed against the _all field.
Storing
By default, elasticsearch will store your entire document (unanalyzed, as you sent it) and this will be returned to you in a hit's _source field. If you turn this off (if your elasticsearch db is getting huge perhaps?) then you need to explicitly set "store": "yes" to store fields individually. (One thing to notice is that store takes yes or no and not true or false - it tripped me up!)
Note: if you do this you will need to request the fields you want returned to you explicitly. e.g.:
curl -XGET http://path/index_name/type_name/id?fields=url,another_field
finally...
I would leave elasticsearch to store your whole document (as the default) and use the following mapping.
"type_name": {
"properties": {
"url": {
"type": "string",
"index": "no",
"include_in_all": "false"
},
// other fields' mappings
}
}
Source: elasticsearch documentation
There are two ways to input data into the index: indexing and storing. Indexing a piece of data means that it is tokenized and placed in the inverted index, where it can be searched. Storing data means it is not tokenized or analyzed at all, and is not added to the inverted index; it is stored in an entirely separate area, in its full text form. It cannot be searched against, but can be retrieved, in its original form, by its document ID.
The typical Lucene query process is to query against the indexed data, get back the document IDs of matching documents, then use those document IDs to retrieve the stored data for those documents and display it to the user.
Data which is indexed but not stored is searchable, but cannot be retrieved in its original form.
Data which is stored but not indexed can be retrieved once you have found a hit, but is not searchable.
Data which is indexed and stored can be searched and retrieved.
Data which is neither cannot be added to the index at all.
This is covered a bit in the Lucene FAQ.
You are looking for the 'index' => 'not_analyzed' mapping option.
Also, if you use the _source, you do not have to specify the store => false option.
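For completeness, a sketch contrasting the two options mentioned in these answers (legacy pre-2.x Elasticsearch string mappings; the field names here are made up for illustration):

"properties": {
  // retrievable from _source, never searchable
  "url_stored_only": { "type": "string", "index": "no" },
  // searchable as a single untokenized term, e.g. by exact URL
  "url_exact_match": { "type": "string", "index": "not_analyzed" }
}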