Can Elasticsearch make suggestions for mapping? - lucene

Playing around with Elasticsearch I added a document to my index called "pets", that looks like this:
{
"name" : "Piper",
"type" : "dog"
}
Then I added a second document:
{
"name" : "Max",
"type" : "dog",
"breed": "Scottish Terrier"
}
Now, I understand that the mapping of my "pets" index is initially created based on my first document (unless I define a mapping at some point). However, I am curious to know whether ES can suggest a mapping based on the existing data (like MySQL's "Propose table structure"), or maybe update the mapping automatically.

Yes, Elasticsearch will automatically update the mapping.
Sometimes the language in the Elasticsearch documentation makes it sound as if the mapping cannot be changed once it is set. This is only true for existing fields. Any additional fields will be automatically assigned a type and added to the mapping.
Remember you can always check the mapping of an index with the get mapping API:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-get-mapping.html
For example, with the example you have above, after your first "pet" document the mapping is:
{
"my_index": {
"mappings": {
"pet": {
"properties": {
"name": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
And after the second "pet" document, your mapping is:
{
"my_index": {
"mappings": {
"pet": {
"properties": {
"breed": {
"type": "string"
},
"name": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
I'm not familiar with MySQL's propose table structure, so I can't comment on that...
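The rule described above (existing fields keep their type, unseen fields get a type assigned and are added) can be sketched as a toy model in Python. This is not Elasticsearch code: `guess_type` and `update_mapping` are hypothetical helpers, and the type names simply mirror the mapping output shown above.

```python
# Toy model of dynamic mapping. NOT Elasticsearch code; it only mimics the
# rule "existing fields are kept, new fields are added with a guessed type".


def guess_type(value):
    """Guess a field type name from a JSON value (hypothetical, simplified)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def update_mapping(properties, document):
    """Add unseen fields to the mapping; leave existing fields untouched."""
    for field, value in document.items():
        if field not in properties:
            properties[field] = {"type": guess_type(value)}
    return properties


properties = {}
update_mapping(properties, {"name": "Piper", "type": "dog"})
after_first = sorted(properties)
update_mapping(
    properties, {"name": "Max", "type": "dog", "breed": "Scottish Terrier"}
)
after_second = sorted(properties)

print(after_first)
print(after_second)
```

The second call only adds "breed"; "name" and "type" are left exactly as the first document defined them.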

Related

Deeply nested unevaluatedProperties and their expectations

I have been working on my own validator for JSON Schema and have FINALLY got most of how unevaluatedProperties is supposed to work, I think. That's one tricky piece! However, I just want to confirm one thing. Given the following schema and JSON, what is the expected outcome? I have tried it with https://www.jsonschemavalidator.net and gotten an answer, but I was hoping for a more definitive one.
The point is that the faz property is in fact being evaluated, but the instruction to disallow unevaluatedProperties comes from a deeply nested schema.
Thoughts?
Here is the schema...
{
"type": "object",
"properties": {
"foo": {
"type": "object",
"properties": {
"bar": {
"type": "string"
}
},
"unevaluatedProperties": false
}
},
"anyOf": [
{
"properties": {
"foo": {
"properties": {
"faz": {
"type": "string"
}
}
}
}
}
]
}
Here is the JSON...
{
"foo": {
"bar": "test",
"faz": "test"
}
}
That schema will successfully evaluate against the provided data. The unevaluatedProperties keyword will be aware of properties evaluated in subschemas of adjacent keywords, and is evaluated after all other applicator keywords, so it will see the annotation produced from within the anyOf subschema, also.
Evaluating this keyword is easy if you follow the specification literally -- it uses annotations to decide what to do. You just need to make sure that all keywords either produce annotations correctly or propagate annotations correctly that were produced by other keywords, and then all the information is available to generate the correct result.
The result produced by my implementation is:
{
"annotations" : [
{
"annotation" : [
"faz"
],
"instanceLocation" : "/foo",
"keywordLocation" : "/anyOf/0/properties/foo/properties"
},
{
"annotation" : [
"foo"
],
"instanceLocation" : "",
"keywordLocation" : "/anyOf/0/properties"
},
{
"annotation" : [
"bar"
],
"instanceLocation" : "/foo",
"keywordLocation" : "/properties/foo/properties"
},
{
"annotation" : [],
"instanceLocation" : "/foo",
"keywordLocation" : "/properties/foo/unevaluatedProperties"
},
{
"annotation" : [
"foo"
],
"instanceLocation" : "",
"keywordLocation" : "/properties"
}
],
"valid" : true
}
This is not an answer but a follow-up example in the same vein, which I think guides us toward the answer.
Here we have a single object being validated, but the unevaluatedProperties keyword appears in two different schemas, each inside a different adjacent keyword's subschema (see "adjacent keyword subschemas" in the core spec: http://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.11).
How should this be resolved? If all annotations must be collected first, in what order do I evaluate: the oneOf first, or the anyOf? According to the spec, an unevaluated keyword (unevaluatedProperties or unevaluatedItems) generates annotation results, which means that its result would affect any other unevaluated keyword.
http://json-schema.org/draft/2020-12/json-schema-core.html#unevaluatedProperties
"The annotation result of this keyword is the set of instance property names validated by this keyword's subschema."
This is as far as my understanding of the spec goes.
According to the two validators I am using, this fails.
Schema
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"type": "object",
"properties": {
"foo": {
"type": "string"
}
},
"oneOf": [
{
"properties": {
"faz": {
"type": "string"
}
},
"unevaluatedProperties": true
}
],
"anyOf": [
{
"properties": {
"bar": {
"type": "string"
}
},
"unevaluatedProperties": false
}
]
}
Data
{
"bar": "test",
"faz": "test"
}

Is there a way to set property value format requirements based on a condition of the property name?

I have a simple JSON schema:
{
"properties": {
"name": {
"type": "string"
}
},
"type": "object"
}
It requires that name property is a string. This schema does not restrict additional properties, e.g.
{
name: 'foo',
url: 'http://foo/'
}
The latter is a valid input.
Is there a way to set a property value format requirement based on a conditional property name match? For example, any property whose name contains the string url must conform to the following schema:
{
"type": "string",
"format": "url"
}
Therefore, an input:
{
name: 'foo',
location_url: 'not-a-valid-url'
}
would cause an error because location_url does not contain a valid URL?
I'd imagine, a schema for something like this would look like:
{
"properties": {
"name": {
"type": "string"
}
},
"matchProperties": {
"/url/i": {
"type": "string",
"format": "url"
}
},
"type": "object"
}
where matchProperties is a keyword I made up.
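For reference, JSON Schema's standard patternProperties keyword covers much of what the made-up matchProperties describes. Its regular expressions take no flags, so case-insensitivity has to be spelled out in the pattern itself. The sketch below is a stdlib-only Python imitation of that matching, not a real validator: `looks_like_url` and `url_errors` are hypothetical helpers, the "url" format name is kept from the question, and whether a real validator enforces "format" at all depends on the implementation.

```python
import re
from urllib.parse import urlparse

# Schema written with the standard "patternProperties" keyword. The pattern
# is unanchored, so it matches any property name containing "url" in any
# letter case.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "patternProperties": {
        "[uU][rR][lL]": {"type": "string", "format": "url"}
    },
}


def looks_like_url(value):
    """Rough URL check (an assumption): a string with a scheme and a host."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return bool(parts.scheme and parts.netloc)


def url_errors(instance, schema):
    """Return names of properties that match a pattern but hold no URL."""
    errors = []
    for pattern in schema["patternProperties"]:
        for key, value in instance.items():
            if re.search(pattern, key) and not looks_like_url(value):
                errors.append(key)
    return errors


bad = url_errors({"name": "foo", "location_url": "not-a-valid-url"}, schema)
good = url_errors({"name": "foo", "url": "http://foo/"}, schema)
print(bad, good)
```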

ElasticSearch for Attribute(Key) value data set

I am using Elasticsearch with Haystacksearch and Django and want to search the following structure:
[
{
"title": "book1",
"category" : ["Cat_1", "Cat_2"],
"key_values" :
[
{
"key_name" : "key_1",
"value" : "sample_value_1"
},
{
"key_name" : "key_2",
"value" : "sample_value_12"
}
]
},
{
"title": "book2",
"category" : ["Cat_3", "Cat_2"],
"key_values" :
[
{
"key_name" : "key_1",
"value" : "sample_value_1"
},
{
"key_name" : "key_3",
"value" : "sample_value_6"
},
{
"key_name" : "key_4",
"value" : "sample_value_5"
}
]
}
]
Right now I have set up an index model using Haystack with a "text" field that puts all the data together and runs a full-text search. In my opinion this is not a sound way to search, because it ignores the structure of my data set, which feels odd.
As an example if for an object I have a key-value
{
"key_name": "key_1",
"value": "sample_value_1"
}
and for another object I have
{
"key_name": "key_2",
"value": "sample_value_1"
}
When a query like "Key_1 sample_value_1" comes in, I get a thoroughly mixed result of objects that have these words anywhere in their fields, rather than matches that respect their structure.
P.S. I am totally new to Elasticsearch and, more generally, to search technologies and their challenges. I have searched the web and SO but didn't find anything satisfying. Please let me know if there is something wrong with my thinking or my expectations of these search engines, or if there is a duplicate SO question, and also whether there is a better approach to designing a database for this kind of search.
Read the Elasticsearch docs on nested mappings and do something like this:
"book_type" : {
"properties" : {
// title, cat mappings
"key_values" : {
"type" : "nested",
"properties": {
"key_name": {
"type": "string", "index": "not_analyzed"
},
"value": {
"type": "string"
}
}
}
}
}
Then query using a nested query
"nested" : {
"path" : "key_values",
"query" : {
"bool" : {
"must" : [
{
"term" : {"key_values.key_name" : "key_1"}
},
{
"match" : {"key_values.value" : "sample_value_1"}
}
]
}
}
}
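Why the nested type matters here can be sketched with a toy model in Python. Without nested, an array of objects is effectively flattened into one list of key_name values and one list of value values, so the pairing between a key and its value is lost. The code below is not Elasticsearch code: `flattened_match` and `nested_match` are hypothetical helpers, the two documents are made-up variants of the question's data, and exact string equality stands in for the analyzed match query.

```python
def flattened_match(doc, key_name, value):
    """Mimic a plain object array: keys and values are matched independently."""
    keys = [kv["key_name"] for kv in doc["key_values"]]
    values = [kv["value"] for kv in doc["key_values"]]
    return key_name in keys and value in values


def nested_match(doc, key_name, value):
    """Mimic a nested query: both conditions must hold on the SAME object."""
    return any(
        kv["key_name"] == key_name and kv["value"] == value
        for kv in doc["key_values"]
    )


# Hypothetical documents: book2 has key_1 and sample_value_1, but on
# different key/value objects.
docs = [
    {
        "title": "book1",
        "key_values": [{"key_name": "key_1", "value": "sample_value_1"}],
    },
    {
        "title": "book2",
        "key_values": [
            {"key_name": "key_2", "value": "sample_value_1"},
            {"key_name": "key_1", "value": "sample_value_6"},
        ],
    },
]

flat_hits = [
    d["title"] for d in docs if flattened_match(d, "key_1", "sample_value_1")
]
nested_hits = [
    d["title"] for d in docs if nested_match(d, "key_1", "sample_value_1")
]
print(flat_hits)
print(nested_hits)
```

The flattened check returns both books, which is the "mixed result" described in the question; the nested check returns only book1.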

How to specify an analyzer while creating an index in ElasticSearch

I'd like to specify an analyzer, name it, and use that name in a mapping while creating an index. I'm lost; my ES instance always returns an error message.
This is, roughly, what I'd like to do:
"settings": {
"mappings": {
"alfedoc": {
"properties": {
"id": { "type": "string" },
"alfefield": { "type": "string", "analyzer": "alfeanalyzer" }
}
}
},
"analysis": {
"analyzer": {
"alfeanalyzer": {
"type": "pattern",
"pattern":"\\s+"
}
}
}
}
But this does not seem to work; the ES instance always returns an error like
MapperParsingException[mapping [alfedoc]]; nested: MapperParsingException[Analyzer [alfeanalyzer] not found for field [alfefield]];
I tried putting the "analysis" branch of the dictionary in several places (inside the mapping etc.), but to no avail. I guess a complete working example (which I haven't been able to find so far) would help me along as well. Probably I'm missing something rather basic.
"analysis" goes inside the "settings" block, and "mappings" is a sibling of "settings" rather than a child of it. The two blocks can come in either order when creating an index.
"settings": {
"analysis": {
"analyzer": {
"alfeanalyzer": {
"type": "pattern",
"pattern": "\\s+"
}
}
}
},
"mappings": {
"alfedoc": { ... }
}
Here's a good, complete example: Example 1
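As a minimal sketch, the full request body can be assembled in Python by combining the "settings" block from this answer with the mapping from the question. `build_index_body` is a hypothetical helper, and the body shape (a mapping type named alfedoc, "string" fields) simply follows the snippets in this thread, so it assumes an Elasticsearch version that accepts them.

```python
import json


def build_index_body():
    """Build an index-creation body: "settings" and "mappings" are siblings."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Named analyzer; "\\s+" is the regex \s+ once unescaped.
                    "alfeanalyzer": {"type": "pattern", "pattern": "\\s+"}
                }
            }
        },
        "mappings": {
            "alfedoc": {
                "properties": {
                    "id": {"type": "string"},
                    # Refers to the analyzer by the name defined in settings.
                    "alfefield": {
                        "type": "string",
                        "analyzer": "alfeanalyzer",
                    },
                }
            }
        },
    }


body = build_index_body()
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the body of the index-creation request.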

How can I handle duplicate data in Elasticsearch?

I have used parent & child mapping to normalize data, but as far as I understand there is no way to get any fields from the _parent document.
Here is the mapping of my index:
{
"mappings": {
"building": {
"properties": {
"name": {
"type": "string"
}
}
},
"flat": {
"_parent": {
"type": "building"
},
"properties": {
"name": {
"type": "string"
}
}
},
"room": {
"_parent": {
"type": "flat"
},
"properties": {
"name": {
"type": "string"
},
"floor": {
"type": "long"
}
}
}
}
}
Now I'm trying to find the best way of storing flat_name and building_name in the room type. I won't query these fields, but I should be able to retrieve them when I query other fields such as floor.
There will be millions of rooms and I don't have much memory, so I suspect that these duplicate values may cause out-of-memory errors. For now, the flat_name and building_name fields have the "index": "no" property, and I turned on compression for the _source field.
Do you have any efficient suggestion for avoiding duplicate values, such as running multiple queries or some hacky way to get fields from the _parent document, or is denormalized data the only way to handle this kind of problem?