Specific analyzers for sub-documents in Lucene / Elasticsearch

After reading the documentation, testing and reading a lot of other questions here on stackoverflow:
We have documents that have titles and description in multiple languages. There are also tags that are translated to the same languages. There might be up to 30-40 different languages in the system, but probably only 3 or 4 translations for a single document.
This is the planned document structure:
{
"luck": {
"id": 10018,
"pub": 0,
"pr": 100002,
"loc": {
"lat": 42.7,
"lon": 84.2
},
"t": [
{
"lang": "en-analyzer",
"title": "Forest",
"desc": "A lot of trees.",
"tags": [
"Wood",
"Nature",
"Green Mouvement"
]
},
{
"lang": "fr-analyzer",
"title": "Forêt",
"desc": "A grand nombre d'arbre.",
"tags": [
"Bois",
"Nature",
"Mouvement Vert"
]
}
],
"dates": [
"2014-01-01T20:00",
"2014-06-06T20:00",
"2014-08-08T20:00"
]
}
}
Possible queries are "arbre" or "wood" or "forest" or "nature" combined with a date and a geo_distance filter, furthermore there will be some facets over the tags array (that obviously include counting).
We can produce any document structure that fits best for elasticsearch (or for lucene). It's crucial that each language is analyzed specifically, so we use "_analyzer" in order to distinguish the languages.
{
"luck": {
"properties": {
"id": {
"type": "long"
},
"pub": {
"type": "long"
},
"pr": {
"type": "long"
},
"loc": {
"type": "geo_point"
},
"t": {
"_analyzer": {
"path": "t.lang"
},
"properties": {
"lang": {
"type": "string"
},
"properties": {
"title": {
"type": "string"
},
"desc": {
"type": "string"
},
"tags": {
"type": "string"
}
}
}
}
}
}
}
A) Apparently, this idea does not work: after PUTting the mapping, we retrieve the same mapping ("GET") and it seems to ignore the specific analyzers (a test with a top-level "_analyzer" worked fine). Does "_analyzer" work for sub-documents and, if yes, how should we refer to it? We also tested declaring the sub-document as "object" or "nested". How is multi-language document indexing supposed to work?
B) One possibility would be to put each language in its own document: In that case how do we manage the id? Finally both documents should refer to the same id. For example if the user searches for "nature" (and we don't know if the user intends to find "nature" in English or French), this document would appear twice in the result set, and the counting and paging would be very wrong (also facet counting).
Any ideas?

Related

Is it possible to be agnostic on the properties' names?

Let's say I want to have a schema for characters from a superhero comics. I want the schema to validate json objects like this one:
{
"Name": "Roberta",
"Age": 15,
"Abilities": {
"Super_Strength": {
"Cost": 10,
"Effect": "+5 to Strength"
}
}
}
My idea is to do it like that:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "characters_schema.json",
"title": "Characters",
"description": "One of the characters for my game",
"type": "object",
"properties": {
"Name": {
"type": "string"
},
"Age": {
"type": "integer"
},
"Abilities": {
"description": "what the character can do",
"type": "object"
}
},
"required": ["Name", "Age"]
}
And use a second schema for abilities:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "abilities_schema.json",
"title": "Abilities",
"type": "object",
"properties": {
"Cost": {
"description": "how much mana the ability costs",
"type": "integer"
},
"Effect": {
"type": "string"
}
}
}
But I can't figure out how to merge Abilities into Characters. I could easily tweak the schema so that it validates characters formatted like:
{
"Name": "Roberta",
"Age": 15,
"Abilities": [
{
"Name": "Super_Strength",
"Cost": 10,
"Effect": "+5 to Strength"
}
]
}
But as I need the name of the ability to be used as a key I don't know what to do.
You need to use the additionalProperties keyword.
The behavior of this keyword depends on the presence and annotation
results of "properties" and "patternProperties" within the same schema
object. Validation with "additionalProperties" applies only to the
child values of instance names that do not appear in the annotation
results of either "properties" or "patternProperties".
https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.10.3.2.3
In layman's terms, if you don't define properties or patternProperties, the schema value of additionalProperties is applied to all values in the object at that instance location.
Often additionalProperties is only given a true or false value, but remember, booleans are valid schema values, and so is any other schema.
If you have constraints on the keys of the object, you may wish to use patternProperties together with additionalProperties: false.
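To make this concrete, here is a minimal Python sketch of how additionalProperties behaves when no properties are declared next to it. The validate function is a toy that handles only type, properties, required and additionalProperties; it is not a real JSON Schema implementation. The required list on the ability schema is an assumption added for the demo, and in a real schema the inline ability schema would typically be replaced by {"$ref": "abilities_schema.json"}.

```python
# Toy validator for a handful of keywords only; all names are illustrative.

ABILITY = {
    "type": "object",
    "properties": {
        "Cost": {"type": "integer"},
        "Effect": {"type": "string"},
    },
    "required": ["Cost", "Effect"],  # assumption: not stated in the question
}

CHARACTER = {
    "type": "object",
    "properties": {
        "Name": {"type": "string"},
        "Age": {"type": "integer"},
        "Abilities": {
            "type": "object",
            # No "properties" here, so this subschema applies to every key.
            "additionalProperties": ABILITY,
        },
    },
    "required": ["Name", "Age"],
}

TYPES = {"object": dict, "string": str, "integer": int}


def validate(instance, schema):
    """Return True if instance satisfies the keywords this toy understands."""
    if schema is True:   # boolean schemas: true accepts everything
        return True
    if schema is False:  # false rejects everything
        return False
    expected = schema.get("type")
    if expected is not None:
        if not isinstance(instance, TYPES[expected]):
            return False
        if expected == "integer" and isinstance(instance, bool):
            return False  # bool is a subclass of int in Python
    if not isinstance(instance, dict):
        return True
    if any(name not in instance for name in schema.get("required", [])):
        return False
    props = schema.get("properties", {})
    extra = schema.get("additionalProperties", True)
    for key, value in instance.items():
        # Declared properties use their own subschema; every other key
        # falls through to additionalProperties.
        sub = props[key] if key in props else extra
        if not validate(value, sub):
            return False
    return True


roberta = {
    "Name": "Roberta",
    "Age": 15,
    "Abilities": {
        "Super_Strength": {"Cost": 10, "Effect": "+5 to Strength"},
    },
}
print(validate(roberta, CHARACTER))
```

The ability name stays a free-form key, while every value under Abilities is still checked against the ability schema.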

How to use anyOf on different properties type?

In the schema below, I need items_list, price and variance as required keys. The condition is that price and variance may each be null, but they cannot both be null.
Though I'm able to achieve it, I'd like to know whether there is a shorter way to do this. Also, I'm not sure where exactly to put the required and additionalProperties keys.
Any help is greatly appreciated.
{
"type": "object",
"properties": {
"items_list": {
"type": "array",
"items": {
"type": "string"
}
},
},
"anyOf": [
{
"properties": {
"price": {
"type": "number",
"minimum": 0,
},
"variance": {
"type": [
"number",
"null"
],
"minimum": 0,
},
},
},
{
"properties": {
"price": {
"type": [
"number",
"null"
],
"minimum": 0,
},
"variance": {
"type": "number",
"minimum": 0,
},
},
},
],
# "required": [
# "items_list",
# "price",
# "variance",
# ],
# "additionalProperties": False,
}
To answer the question, "can it be shorter?", the answer is yes. The general rule of thumb is to never define anything in the boolean logic keywords. Use the boolean logic keywords only to add compound constraints. I use the term "compound constraint" to mean a constraint that is based on more than one value in a schema. In this case, the compound constraint is that price and variance can't both be null.
{
"type": "object",
"properties": {
"items_list": {
"type": "array",
"items": { "type": "string" }
},
"price": { "type": ["number", "null"], "minimum": 0 },
"variance": { "type": ["number", "null" ], "minimum": 0 }
},
"required": ["items_list", "price", "variance"],
"additionalProperties": false,
"allOf": [{ "$ref": "#/definitions/both-price-and-variance-cannot-be-null" }],
"definitions": {
"both-price-and-variance-cannot-be-null": {
"not": {
"properties": {
"price": { "type": "null" },
"variance": { "type": "null" }
},
"required": ["price", "variance"]
}
}
}
}
Not only do you not have to jump through hoops to get additionalProperties working properly, it's also easier to read. It even matches your description of the problem, "price and variance may or may not be null" (properties) but "both cannot be null" (not (compound constraint)). You could make this even shorter by inlining the definition, but I included it to show how expressive this technique can be while still being shorter than the original schema.
Looks like you have this mostly right. That's the right place to put required.
When using additionalProperties: false, you also need to define the properties at the top level, because additionalProperties cannot "see through" the *Of keywords (applicators).
You can declare each one at the top level as "properties": { "<prop>": true }, but you must list all of the properties.
You need to do this because additionalProperties only knows about properties within the same schema object at the same level.
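The compound constraint can also be read as plain logic. The following is a minimal Python sketch (the function names are hypothetical) that mirrors the subschema under "not": that subschema matches only when both keys are present and both are null, and "not" inverts the result.

```python
def both_null(doc):
    """Mirror of the subschema inside "not": both keys present and both null."""
    return (
        "price" in doc and doc["price"] is None
        and "variance" in doc and doc["variance"] is None
    )


def satisfies_compound_constraint(doc):
    # "not" inverts the result of its subschema.
    return not both_null(doc)


samples = [
    {"price": 5, "variance": None},
    {"price": None, "variance": 0.2},
    {"price": None, "variance": None},
]
for doc in samples:
    print(doc, satisfies_compound_constraint(doc))
```

Only the last sample, where both values are null, fails the check. The per-property rules (type and minimum) stay in properties and are not repeated here.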

How do I indicate which "oneOf" API response will use?

I have an API where the basic response of one key will have an array of identifiers. A user may pass an extra parameter so the array will turn to an array of objects from an array of strings (for actual details rather than having to make a separate call).
"children": {
"type": "array",
"items": {
"oneOf": [{
"type": "string",
"description": "Identifier of child"
}, {
"type": "object",
"description": "Contains details about the child"
}]
}
},
Is there a way to indicate that the first type is returned by default and the second only when requested via a parameter?
It's not entirely clear to me what you are trying to accomplish with the distinction. Really that sounds like documentation; maybe elaborate in the descriptions of each oneOf subschema.
You could add an additional boolean field at the top level (sibling of children) to indicate whether detailed responses are returned and provide a default value for that field. The next step is to couple the value of the boolean to the type of the array items, which I've done using oneOf.
I'm suggesting something along the lines of:
{
"type": "object",
"properties": {
"children": {
"type": "array",
"items": {
"oneOf": [
{
"type": "string",
"description": "Identifier of child",
"pattern": "^([A-Z0-9]-?){4}$"
},
{
"type": "object",
"description": "Contains details about the child",
"properties": {
"age": {
"type": "number"
}
}
}
]
}
},
"detailed": {
"type": "boolean",
"description": "If true, children array contains extra details.",
"default": false
}
},
"oneOf": [
{
"properties": {
"detailed": {
"enum": [
true
]
},
"children": {
"type": "array",
"items": {
"type": "object"
}
}
}
},
{
"properties": {
"detailed": {
"enum": [
false
]
},
"children": {
"type": "array",
"items": {
"type": "string"
}
}
}
}
]
}
The second oneOf places a further requirement on the response object that when "detailed": true the type of items of the "children" array must be "object". This refines the first oneOf restriction that describes the schema of objects in the "children" array.
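The coupling between the flag and the item type can be sketched in a few lines of Python. This is only an illustration of the rule, not a schema validator; the function name is hypothetical, and treating a missing "detailed" key as false is an assumption of the sketch (in JSON Schema, "default" is an annotation and validators do not apply it).

```python
def children_match_flag(response):
    """Check that the item type of "children" agrees with "detailed".

    Assumption: a missing "detailed" key is treated as False.
    """
    detailed = response.get("detailed", False)
    expected = dict if detailed else str
    return all(isinstance(child, expected) for child in response.get("children", []))


print(children_match_flag({"children": ["A1B2", "C3D4"]}))
print(children_match_flag({"detailed": True, "children": [{"age": 7}]}))
print(children_match_flag({"detailed": True, "children": ["A1B2"]}))
```

The first two responses are consistent; the third claims to be detailed but returns plain identifiers, so it is rejected.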

Is it possible to inline JSON schemas into a JSON document? [duplicate]

For example a schema for a file system, directory contains a list of files. The schema consists of the specification of file, next a sub type "image" and another one "text".
At the bottom there is the main directory schema. Directory has a property content which is an array of items that should be sub types of file.
Basically what I am looking for is a way to tell the validator to look up the value of a "$ref" from a property in the json object being validated.
Example json:
{
"name":"A directory",
"content":[
{
"fileType":"http://x.y.z/fs-schema.json#definitions/image",
"name":"an-image.png",
"width":1024,
"height":800
},
{
"fileType":"http://x.y.z/fs-schema.json#definitions/text",
"name":"readme.txt",
"lineCount":101
},
{
"fileType":"http://x.y.z/extended-fs-schema-video.json",
"name":"demo.mp4",
"hd":true
}
]
}
The "pseudo" schema (note that the "image" and "text" definitions are included in the same schema here, but they might be defined elsewhere):
{
"id": "http://x.y.z/fs-schema.json",
"definitions": {
"file": {
"type": "object",
"properties": {
"name": { "type": "string" },
"fileType": {
"type": "string",
"format": "uri"
}
}
},
"image": {
"allOf": [
{ "$ref": "#definitions/file" },
{
"properties": {
"width": { "type": "integer" },
"height": { "type": "integer"}
}
}
]
},
"text": {
"allOf": [
{ "$ref": "#definitions/file" },
{ "properties": { "lineCount": { "type": "integer"}}}
]
}
},
"type": "object",
"properties": {
"name": { "type": "string"},
"content": {
"type": "array",
"items": {
"allOf": [
{ "$ref": "#definitions/file" },
{ "$refFromProperty": "fileType" } // the magic thing
]
}
}
}
}
The validation parts of JSON Schema alone cannot do this - it represents a fixed structure. What you want requires resolving/referencing schemas at validation-time.
However, you can express this using JSON Hyper-Schema, and a rel="describedby" link:
{
"title": "Directory entry",
"type": "object",
"properties": {
"fileType": {"type": "string", "format": "uri"}
},
"links": [{
"rel": "describedby",
"href": "{+fileType}"
}]
}
So here, it takes the value from "fileType" and uses it to calculate a link with relation "describedby" - which means "the schema at this location also describes the current data".
The problem is that most validators do not take any notice of any links (including "describedby" ones). You need to find a "hyper-validator" that does.
UPDATE: the tv4 library has added this as a feature
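If no hyper-schema-aware validator is available, the same idea (let the data name the schema that describes it) can be approximated in application code. The following Python sketch is an illustration only: the CHECKERS registry and the checker functions are hypothetical stand-ins, and a real implementation would fetch and compile the schema found at each URI instead of hard-coding checks.

```python
def is_image(entry):
    # Stand-in for the "image" definition: integer width and height.
    return isinstance(entry.get("width"), int) and isinstance(entry.get("height"), int)


def is_text(entry):
    # Stand-in for the "text" definition: integer lineCount.
    return isinstance(entry.get("lineCount"), int)


# Hypothetical registry mapping a "fileType" URI to the matching checker.
CHECKERS = {
    "http://x.y.z/fs-schema.json#definitions/image": is_image,
    "http://x.y.z/fs-schema.json#definitions/text": is_text,
}


def validate_entry(entry):
    """Apply the base "file" check, then the subtype named by the entry itself."""
    if not isinstance(entry.get("name"), str):
        return False
    file_type = entry.get("fileType")
    if not isinstance(file_type, str):
        return False
    checker = CHECKERS.get(file_type)
    if checker is None:
        return False  # unknown schema URI: rejected in this sketch
    return checker(entry)


image = {
    "fileType": "http://x.y.z/fs-schema.json#definitions/image",
    "name": "an-image.png",
    "width": 1024,
    "height": 800,
}
print(validate_entry(image))
```

Rejecting unknown URIs is a design choice of the sketch; a hyper-validator would instead try to resolve the link and validate against whatever schema it finds there.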
I think cloudfeet's answer is a valid solution. You could also use the same approach described here.
You would have a file object type which could be "anyOf" all the subtypes you want to define. You would use an enum in order to be able to reference and validate against each of the subtypes.
If the sub-type schemas are in the same JSON Schema file, you don't need to reference the URI explicitly with "$ref". A correct draft4 validator will find the enum value and will try to validate against that "subschema" in the JSON Schema tree.
In draft5 (in progress) a "switch" statement has been proposed, which will allow alternatives to be expressed in a more explicit way.

Finding similar documents with Elasticsearch

I'm using Elasticsearch to develop a service that stores uploaded files or web pages as attachments (the file is one field in the document). This part works fine, as I can search these files using like_text as input. However, the second part of the service should compare a newly uploaded file with the existing files in order to find duplicates or very similar files, so that it doesn't recommend the same files or web pages to users. The problem is that I can't get the expected results for identical documents. The similarity between identical files varies, but is never more than 0.4. Even worse, sometimes files that are not the same score higher than two exactly identical files. The Java code given below always returns the same set of documents in the same order, regardless of the input. It looks as if the like_text extracted from the uploaded file is always the same.
String mapping = copyToStringFromClasspath("/org/prosolo/services/indexing/documents-mapping.json");
byte[] txt = org.elasticsearch.common.io.Streams.copyToByteArray(file);
Client client = ElasticSearchFactory.getClient();
client.admin().indices().putMapping(putMappingRequest(indexName).type(indexType).source(mapping)).actionGet();
IndexResponse iResponse = client.index(indexRequest(indexName).type(indexType)
.source(jsonBuilder()
.startObject()
.field("file", txt)
.field("title",title)
.field("visibility",visibilityType.name().toLowerCase())
.field("ownerId",ownerId)
.field("description",description)
.field("contentType",DocumentType.DOCUMENT.name().toLowerCase())
.field("dateCreated",dateCreated)
.field("url",link)
.field("relatedToType",relatedToType)
.field("relatedToId",relatedToId)
.endObject()))
.actionGet();
client.admin().indices().refresh(refreshRequest()).actionGet();
MoreLikeThisRequestBuilder mltRequestBuilder=new MoreLikeThisRequestBuilder(client, ESIndexNames.INDEX_DOCUMENTS, ESIndexTypes.DOCUMENT, iResponse.getId());
mltRequestBuilder.setField("file");
SearchResponse response = client.moreLikeThis(mltRequestBuilder.request()).actionGet();
SearchHits searchHits= response.getHits();
System.out.println("getTotalHits:"+searchHits.getTotalHits());
Iterator<SearchHit> hitsIter=searchHits.iterator();
while(hitsIter.hasNext()){
SearchHit searchHit=hitsIter.next();
System.out.println("FOUND DOCUMENT:"+searchHit.getId()+" title:"+searchHit.getSource().get("title")+" score:"+searchHit.score());
}
And the query from the browser, which looks like:
http://localhost:9200/documents/document/m2HZM3hXS1KFHOwvGY1pVQ/_mlt?mlt_fields=file&min_doc_freq=1
Gives me results:
{"took":120,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":4,
"max_score":0.41059873,
"hits":
[{"_index":"documents","_type":"document",
"_id":"gIe6NDEWRXWTMi4kMPRbiQ",
"_score":0.41059873,
"_source" :
{"file":"PCFET0NUWVBFIGh..._skiping_the_file_content_here...",
"title":"Univariate Analysis",
"visibility":"public",
"description":"Univariate Analysis Simple Tools for Description ",
"contentType":"webpage",
"dateCreated":"null",
"url":"http://www.slideshare.net/christineshearer/univariate-analysis"}}
This is exactly the same web page, so I'm expecting the score to be 1.0, not 0.41, as there is no difference between the two documents except the _id. The results are even worse with files.
Mapping I was using is:
{
"document":{
"properties":{
"title":{
"type":"string",
"store":true
},
"description":{
"type":"string",
"store":"yes"
},
"contentType":{
"type":"string",
"store":"yes"
},
"dateCreated":{
"store":"yes",
"type":"date"
},
"url":{
"store":"yes",
"type":"string"
},
"visibility": {
"store":"yes",
"type":"string"
},
"ownerId": {
"type": "long",
"store":"yes"
},
"relatedToType": {
"type": "string",
"store":"yes"
},
"relatedToId": {
"type": "long",
"store":"yes"
},
"file":{
"path": "full",
"type":"attachment",
"fields":{
"author": {
"type": "string"
},
"title": {
"store": true,
"type": "string"
},
"keywords": {
"type": "string"
},
"file": {
"store": true,
"term_vector": "with_positions_offsets",
"type": "string"
},
"name": {
"type": "string"
},
"content_length": {
"type": "integer"
},
"date": {
"format": "dateOptionalTime",
"type": "date"
},
"content_type": {
"type": "string"
}
} } } } }
Does anyone have an idea what could be wrong here?
Thanks