How to specify an analyzer while creating an index in ElasticSearch - indexing

I'd like to specify an analyzer, name it, and use that name in a mapping while creating an index. I'm lost, my ES instance always returns me an error message.
This is, roughly, what I'd like to do:
"settings": {
"mappings": {
"alfedoc": {
"properties": {
"id": { "type": "string" },
"alfefield": { "type": "string", "analyzer": "alfeanalyzer" }
}
}
},
"analysis": {
"analyzer": {
"alfeanalyzer": {
"type": "pattern",
"pattern":"\\s+"
}
}
}
}
But this does not seem to work; the ES instance always returns me an error like
MapperParsingException[mapping [alfedoc]]; nested: MapperParsingException[Analyzer [alfeanalyzer] not found for field [alfefield]];
I tried putting the "analysis" branch of the dictionary at several places (inside the mapping etc.) but to no avail. I guess a working complete example (which I couldn't find up to now) would help me along as well. Probably I'm missing something rather basic.

"analysis" goes in the "settings" block, which goes either before or after the "mappings" block when creating an index.
"settings": {
"analysis": {
"analyzer": {
"alfeanalyzer": {
"type": "pattern",
"pattern": "\\s+"
}
}
}
},
"mappings": {
"alfedoc": { ... }
}
Here's a good complete, example: Example 1

Related

Can JSON Schema if statements handle nested $refs?

I have a JSON Schema using draft 2020-12 and I am trying to use an if-else subschema to check that a particular property exists based on the value of another property. Below is the if statement I am currently using. There are more properties but I have have omitted them for the sake of brevity. They are identical except the type of the property in the then statement is different. They are all wrapped in an allOf array:
{
"AValue": {
"allOf": [
{
"if": {
"$ref": "#/$defs/ValueItem/properties/dt",
"const": "type1"
},
"then": {
"properties": {
"string": { "type": "string" }
},
"required": ["string"]
}
}
]
}
}
The #/$defs/ValueItem/properties/dt being referred to is here:
{
"ValueItem": {
"properties": {
"value": {
"$ref": "#/$defs/AValue"
},
"dt": {
"$ref": "#/$defs/DT"
}
},
"additionalProperties": false
}
}
#/$defs/DT is here:
{
"DT": {
"type": "string",
"enum": [
"type1",
"type2",
"type3",
"type4"
]
}
}
I expected that when dt is encountered in a JSON instance document, the validator will check if the value of dt is type1 and then validate that an additional property called string is also present and is of type string. However, what actually happens is the validator complains that "Property 'string' has not been defined and the schema does not allow additional properties".
I assume that this is because the condition in the if statement evaluates to false so the subschema is never applied. However, I am unsure why that would be as I followed the example here when creating the if-then-else block. The only thing I can think of that is different is the use of $ref which I have in my schema but it is not in the example.
I found this answer which makes me think that it is possible to use $ref in an if statement but is it possible to use a ref that points to another ref or am I thinking about it incorrectly?
I have also tried removing the $ref from the if statement like below but it still doesn't work. Is it because of the $ref in the properties?
{
"AValue": {
"properties": {
"dt": {
"$ref": "#/$defs/DT"
}
},
"required": [
"dt"
],
"allOf": [
{
"if": {
"properties": {
"dt": {
"const": "type1"
}
}
},
"then": {
"properties": {
"string": {
"type": "string"
}
}
}
}
]
}
}
The problem is not cascading the $ref keywords. The const keyword at the if statement is not applied to the target of the $ref, but to the current location in the JSON input data. In this case to "AValue". To check if the property "dt" is of value "type1" at this point, you would need an if like this (simple solution with no $ref):
"if": {
"properties": {
"dt": {
"const": "type1"
}
},
"required": [ "dt" ]
}
Edit: Showing complete JSON Schema and error in JSONBuddy with $ref:

ajv-cli always says bad data is valid

Running ajv-cli as part of my automated testing scripts to make sure my mock data is up to date.
./node_modules/.bin/ajv -s ./test-data/manifest.schema.json -d ./test-data/fleet.manifest.json
./test-data/fleet.manifest.json valid
But the data isn't valid.
manifest.schema.json:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"definitions": {
"ManifestHistoryItem": {
"properties": {
"id": {
"default": [
"assetCatalog",
"Roster"
],
"items": {
"type": "string"
},
"type": "array"
},
"name": {
"default": "",
"type": "string"
}
},
"required": [
"id",
"name"
],
"type": "object"
}
}
}
fleet.manifest.json:
{
"namee": "Epic Space Battles"
}
(it's missing the required "id" property, and "name" is misspelled)
Schema is generated from "typescript-json-schema": "^0.54.0" from a typescript model and evaluated via "ajv-cli": "^5.0.0".
Your schema declares definitions, but it doesn't reference them anywhere. You need to add a "$ref": "#/definitions/ManifestHistoryItem" at the root.
{
"definitions": {
"ManifestHistoryItem": { ... }
},
"$ref": "#/definitions/ManifestHistoryItem"
}
Either that or you can just get rid of the definitions wrapper altogether and just have the { ... } part from above.
Effectively what's happening is you've defined an empty schema, which applies no constraints, meaning all instances (data) pass.

Deeply nested unevaluatedProperties and their expectations

I have been working on my own validator for JSON schema and FINALLY have most of how unevaluatedProperties are supposed to work,... I think. That's one tricky piece there! However I really just want to confirm one thing. Given the following schema and JSON, what is the expected outcome... I have tried it with a https://www.jsonschemavalidator.net and gotten an answer, but I was hoping I could get a more definitive answer.
The focus is the faz property is in fact being evaluated, but the command to disallow unevaluatedProperties comes from a deeply nested schema.
Thoguhts?
Here is the schema...
{
"type": "object",
"properties": {
"foo": {
"type": "object",
"properties": {
"bar": {
"type": "string"
}
},
"unevaluatedProperties": false
}
},
"anyOf": [
{
"properties": {
"foo": {
"properties": {
"faz": {
"type": "string"
}
}
}
}
}
]
}
Here is the JSON...
{
"foo": {
"bar": "test",
"faz": "test"
}
}
That schema will successfully evaluate against the provided data. The unevaluatedProperties keyword will be aware of properties evaluated in subschemas of adjacent keywords, and is evaluated after all other applicator keywords, so it will see the annotation produced from within the anyOf subschema, also.
Evaluating this keyword is easy if you follow the specification literally -- it uses annotations to decide what to do. You just need to make sure that all keywords either produce annotations correctly or propagate annotations correctly that were produced by other keywords, and then all the information is available to generate the correct result.
The result produced by my implementation is:
{
"annotations" : [
{
"annotation" : [
"faz"
],
"instanceLocation" : "/foo",
"keywordLocation" : "/anyOf/0/properties/foo/properties"
},
{
"annotation" : [
"foo"
],
"instanceLocation" : "",
"keywordLocation" : "/anyOf/0/properties"
},
{
"annotation" : [
"bar"
],
"instanceLocation" : "/foo",
"keywordLocation" : "/properties/foo/properties"
},
{
"annotation" : [],
"instanceLocation" : "/foo",
"keywordLocation" : "/properties/foo/unevaluatedProperties"
},
{
"annotation" : [
"foo"
],
"instanceLocation" : "",
"keywordLocation" : "/properties"
}
],
"valid" : true
}
This is not an answer but a follow up example which I feel is in the same vein. I feel this guides us to the answer.
Here we have a single object being validated. But the unevaluated command resides in two different schemas each a part of a different "adjacent keyword subschemas"(from the core spec http://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.11)
How should this be resolved. If all annotations must be evaluated then in what order do I evaluate? The oneOf first or the anyOf? According the spec an unevaluated command(properties or items) generate annotation results which means that that result would affect any other unevaluated command.
http://json-schema.org/draft/2020-12/json-schema-core.html#unevaluatedProperties
"The annotation result of this keyword is the set of instance property names validated by this keyword's subschema."
This is as far as I am understanding the spec.
According to the two validators I am using this fails.
Schema
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"type": "object",
"properties": {
"foo": {
"type": "string"
}
},
"oneOf": [
{
"properties": {
"faz": {
"type": "string"
}
},
"unevaluatedProperties": true
}
],
"anyOf": [
{
"properties": {
"bar": {
"type": "string"
}
},
"unevaluatedProperties": false
}
]
}
Data
{
"bar": "test",
"faz": "test",
}

Can Elasticsearch make suggestions for mapping?

Playing around with Elasticsearch I added a document to my index called "pets", that looks like this:
{
"name" : "Piper",
"type" : "dog"
}
Then I added a second document:
{
"name" : "Max",
"type" : "dog",
"breed": "Scottish Terrier"
}
Now, I understand that the mapping of my "pets" index is initially created based on my first document ( unless i define a mapping at some point ). However, I am curious to know if ES can suggest a mapping based on the existing data ( like MySQL's "Propose table structure" ) or maybe update the mapping automatically.
Yes, ElasticSearch will automatically update the mapping.
Sometimes the language in the ElasticSearch documentation makes it sound like once the mapping is set, it cannot be changed. This is only true for the existing fields. Any additional fields will be automatically assigned a type and added to the mapping.
Remember you can always check the mapping of an index with the get mapping API:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-get-mapping.html
For example, with the example you have above, after your first "pet" document the mapping is:
{
"my_index": {
"mappings": {
"pet": {
"properties": {
"name": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
And after the second "pet" document, your mapping is:
{
"my_index": {
"mappings": {
"pet": {
"properties": {
"breed": {
"type": "string"
},
"name": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
I'm not familiar with MySQL's propose table structure, so I can't comment on that...

Indexing with edge NGrams for typeahead

I'm trying to get Elasticsearch to index some documents for typeahead suggestions. As far as I can tell, the edge NGram handling in Elasticsearch is provided by Lucene underneath. Unfortunately, the documentation for Lucene in this regard is proving to be very tough for me to make sense of. The best I have come up with is based on https://gist.github.com/988923, but it doesn't seem to work (the index with these settings only returns matches on full words, as though the settings didn't exist):
{
"settings":{
"index":{
"analysis":{
"analyzer":{
"typeahead_analyzer":{
"type":"custom",
"tokenizer":"edgeNGram",
"filter":["typeahead_ngram"]
}
},
"filter":{
"typeahead_ngram":{
"type":"edgeNGram",
"min_gram":1,
"max_gram":8,
"side":"front"
}
}
}
}
}
}
I really don't know at all how analyzers, tokenizers, and filters go together - do I even want a filter? Should I just have a tokenizer? Do I have to reference these settings when I index the documents for them to be used? How can I find out what settings Lucene underneath is using for a given index? How do I debug this? Help :-)
I solved this using edgeNGram. Below are the mappings and analysis that I used to accomplish this.
{
"analysis": {
"analyzer": {
"str_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"str_index_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"substring"
]
}
},
"filter": {
"substring": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 10,
"side": "front"
}
}
}
}
{
"index_name": {
"properties": {
"location": {
"type": "geo_point"
},
"name": {
"type": "string",
"index": "analyzed",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer"
}
}
}
}
An important footnote is that I needed to use a match query with the AND operator to query against this properly.
Hope this helps.