How can I handle duplicate data in Elasticsearch? - lucene

I have used parent & child mapping to normalize data but as far as I understand there is no way to get any fields from _parent document.
Here is the mapping of my index:
{
"mappings": {
"building": {
"properties": {
"name": {
"type": "string"
}
}
},
"flat": {
"_parent": {
"type": "building"
},
"properties": {
"name": {
"type": "string"
}
}
},
"room": {
"_parent": {
"type": "flat"
},
"properties": {
"name": {
"type": "string"
},
"floor": {
"type": "long"
}
}
}
}
}
Now, I'm trying to find the best way of storing flat_name and building_name in room type. I won't query these fields but I should be able to get them when I query other fields like floor.
There will be millions of rooms and I don't have much memory so I suspect that these duplicate values may cause out of memory. For now, flat_name and building_name fields are has "index": "no" property and I turned on compression for _source field.
Do you have any efficient suggestion for avoiding duplicate values like querying multiple queries or hacky way to get fields from _parent document or denormalized data is the only way to handle this kindle of problem?

Related

Put validation of two array fields in JSON Schema using oneOf

Can I put check on two fields in JSON schema ? Both field are of type array of objects. Conditions:
Either one of them can contain value at a time (i.e. other should be empty).
Both can be empty.
Any leads ?
// The schema
var schema = {
"id": "https://kitoutapi.lrsdedicated.com/v1/json_schemas/login-request#",
"$schema": "http://json-schema.org/draft-04/schema#",
"description": "Login request schema",
"type": "object",
"oneOf": [
{ "categories": {
"maxItems": 0
},
"positionedOffers": {
"minItems": 1
}},
{ "categories": {
"minItems": 1
},
"positionedOffers": {
"maxItems": 0
}}
],
"properties": {
"categories": {
"type": "array"
},
"positionedOffers": {
"type": "array"
}
},
"additionalProperties": false
};
// Test data 1
// This test should return a good result
var data1 = {
"positionedOffers":['hello'],
"categories":[],
}
For your requirement, I think I'd come at this from the other direction. Rather than saying
If one contains a value, the other must be empty, but both may be empty.
I'd say
At least one must be empty.
That leads you to use an anyOf with subschemas checking that each property is an empty array.
{
"id": "https://kitoutapi.lrsdedicated.com/v1/json_schemas/login-request#",
"$schema": "http://json-schema.org/draft-04/schema#",
"description": "Login request schema",
"type": "object",
"anyOf": [
{
"properties": {
"categories": {
"maxItems": 0
}
}
},
{
"properties": {
"positionedOffers": {
"maxItems": 0
}
}
}
],
"properties": {
"categories": {
"type": "array"
},
"positionedOffers": {
"type": "array"
}
},
"additionalProperties": false
}
Bonus Material
In your original post, you omitted the properties keywords under the oneOf. This may have been the cause of the schema's failure to validate. I've added it in the above.
Secondly, draft 4 is quite old at this point. You may be limited by the implementation you're using, but if you can, you should consider using a more recent version of JSON Schema. You can view available implementations and what versions they support on the JSON Schema implementations page.

Is it possible to inline JSON schemas into a JSON document? [duplicate]

For example a schema for a file system, directory contains a list of files. The schema consists of the specification of file, next a sub type "image" and another one "text".
At the bottom there is the main directory schema. Directory has a property content which is an array of items that should be sub types of file.
Basically what I am looking for is a way to tell the validator to look up the value of a "$ref" from a property in the json object being validated.
Example json:
{
"name":"A directory",
"content":[
{
"fileType":"http://x.y.z/fs-schema.json#definitions/image",
"name":"an-image.png",
"width":1024,
"height":800
}
{
"fileType":"http://x.y.z/fs-schema.json#definitions/text",
"name":"readme.txt",
"lineCount":101
}
{
"fileType":"http://x.y.z/extended-fs-schema-video.json",
"name":"demo.mp4",
"hd":true
}
]
}
The "pseudo" Schema note that "image" and "text" definitions are included in the same schema but they might be defined elsewhere
{
"id": "http://x.y.z/fs-schema.json",
"definitions": {
"file": {
"type": "object",
"properties": {
"name": { "type": "string" },
"fileType": {
"type": "string",
"format": "uri"
}
}
},
"image": {
"allOf": [
{ "$ref": "#definitions/file" },
{
"properties": {
"width": { "type": "integer" },
"height": { "type": "integer"}
}
}
]
},
"text": {
"allOf": [
{ "$ref": "#definitions/file" },
{ "properties": { "lineCount": { "type": "integer"}}}
]
}
},
"type": "object",
"properties": {
"name": { "type": "string"},
"content": {
"type": "array",
"items": {
"allOf": [
{ "$ref": "#definitions/file" },
{ *"$refFromProperty"*: "fileType" } // the magic thing
]
}
}
}
}
The validation parts of JSON Schema alone cannot do this - it represents a fixed structure. What you want requires resolving/referencing schemas at validation-time.
However, you can express this using JSON Hyper-Schema, and a rel="describedby" link:
{
"title": "Directory entry",
"type": "object",
"properties": {
"fileType": {"type": "string", "format": "uri"}
},
"links": [{
"rel": "describedby",
"href": "{+fileType}"
}]
}
So here, it takes the value from "fileType" and uses it to calculate a link with relation "describedby" - which means "the schema at this location also describes the current data".
The problem is that most validators do not take any notice of any links (including "describedby" ones). You need to find a "hyper-validator" that does.
UPDATE: the tv4 library has added this as a feature
I think cloudfeet answer is a valid solution. You could also use the same approach described here.
You would have a file object type which could be "anyOf" all the subtypes you want to define. You would use an enum in order to be able to reference and validate against each of the subtypes.
If the sub-types schemas are in the same Json-Schema file you don't need to reference the uri explicitly with the "$ref". A correct draft4 validator will find the enum value and will try to validate against that "subschema" in the Json-Schema tree.
In draft5 (in progress) a "switch" statement has been proposed, which will allow to express alternatives in a more explicit way.

Schema to load JSON to Google BigQuery

Suppose I have the following JSON, which is the result of parsing urls parameters from a log file.
{
"title": "History of Alphabet",
"author": [
{
"name": "Larry"
},
]
}
{
"title": "History of ABC",
}
{
"number_pages": "321",
"year": "1999",
}
{
"title": "History of XYZ",
"author": [
{
"name": "Steve",
"age": "63"
},
{
"nickname": "Bill",
"dob": "1955-03-29"
}
]
}
All the fields in top-level, "title", "author", "number_pages", "year" are optional. And so are the fields in the second level, inside "author", for example.
How should I make a schema for this JSON when loading it to BQ?
A related question:
For example, suppose there is another similar table, but the data is from different date, so it's possible to have different schema. Is it possible to query across these 2 tables?
How should I make a schema for this JSON when loading it to BQ?
The following schema should work. You may want to change some of the types (e.g. maybe you want the dob field to be a TIMESTAMP instead of a STRING), but the general structure should be similar. Since types are NULLABLE by default, all of these fields should handle not being present for a given row.
[
{
"name": "title",
"type": "STRING"
},
{
"name": "author",
"type": "RECORD",
"fields": [
{
"name": "name",
"type": "STRING"
},
{
"name": "age",
"type": "STRING"
},
{
"name": "nickname",
"type": "STRING"
},
{
"name": "dob",
"type": "STRING"
}
]
},
{
"name": "number_pages",
"type": "INTEGER"
},
{
"name": "year",
"type": "INTEGER"
}
]
A related question: For example, suppose there is another similar table, but the data is from different date, so it's possible to have different schema. Is it possible to query across these 2 tables?
It should be possible to union two tables with differing schemas without too much difficulty.
Here's a quick example of how it works over public data (kind of a silly example, since the tables contain zero fields in common, but shows the concept):
SELECT * FROM
(SELECT * FROM publicdata:samples.natality),
(SELECT * FROM publicdata:samples.shakespeare)
LIMIT 100;
Note that you need the SELECT * around each table or the query will complain about the differing schemas.

Can Elasticsearch make suggestions for mapping?

Playing around with Elasticsearch I added a document to my index called "pets", that looks like this:
{
"name" : "Piper",
"type" : "dog"
}
Then I added a second document:
{
"name" : "Max",
"type" : "dog",
"breed": "Scottish Terrier"
}
Now, I understand that the mapping of my "pets" index is initially created based on my first document ( unless i define a mapping at some point ). However, I am curious to know if ES can suggest a mapping based on the existing data ( like MySQL's "Propose table structure" ) or maybe update the mapping automatically.
Yes, ElasticSearch will automatically update the mapping.
Sometimes the language in the ElasticSearch documentation makes it sound like once the mapping is set, it cannot be changed. This is only true for the existing fields. Any additional fields will be automatically assigned a type and added to the mapping.
Remember you can always check the mapping of an index with the get mapping API:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-get-mapping.html
For example, with the example you have above, after your first "pet" document the mapping is:
{
"my_index": {
"mappings": {
"pet": {
"properties": {
"name": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
And after the second "pet" document, your mapping is:
{
"my_index": {
"mappings": {
"pet": {
"properties": {
"breed": {
"type": "string"
},
"name": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
I'm not familiar with MySQL's propose table structure, so I can't comment on that...

Specific analyzers for sub-documents in lucene / elasticsearch

After reading the documentation, testing and reading a lot of other questions here on stackoverflow:
We have documents that have titles and description in multiple languages. There are also tags that are translated to the same languages. There might be up to 30-40 different languages in the system, but probably only 3 or 4 translations for a single document.
This is the planned document structure:
{
"luck": {
"id": 10018,
"pub": 0,
"pr": 100002,
"loc": {
"lat": 42.7,
"lon": 84.2
},
"t": [
{
"lang": "en-analyzer",
"title": "Forest",
"desc": "A lot of trees.",
"tags": [
"Wood",
"Nature",
"Green Mouvement"
]
},
{
"lang": "fr-analyzer",
"title": "ForĂȘt",
"desc": "A grand nombre d'arbre.",
"tags": [
"Bois",
"Nature",
"Mouvement Vert"
]
}
],
"dates": [
"2014-01-01T20:00",
"2014-06-06T20:00",
"2014-08-08T20:00"
]
}
}
Possible queries are "arbre" or "wood" or "forest" or "nature" combined with a date and a geo_distance filter, furthermore there will be some facets over the tags array (that obviously include counting).
We can produce any document structure that fits best for elasticsearch (or for lucene). It's crucial that each language is analyzed specifically, so we use "_analyzer" in order to distinguish the languages.
{
"luck": {
"properties": {
"id": {
"type": "long"
},
"pub": {
"type": "long"
},
"pr": {
"type": "long"
},
"loc": {
"type": "geo_point"
},
"t": {
"_analyzer": {
"path": "t.lang"
},
"properties": {
"lang": {
"type": "string"
},
"properties": {
"title": {
"type": "string"
},
"desc": {
"type": "string"
},
"tags": {
"type": "string"
}
}
}
}
}
}
A) Apparently, this idea does not work: after PUTting the mapping, we retrieve the same mapping ("GET") and it seems to ignore the specific analyzers (A test with a top-level "_analyzer" worked fine). Does "_analyzer" work for sub-documents and if yes how to should we refer to it? We also tested declaring the sub-document as "object" or "nested". How is multi-language document indexing supposed to work.
B) One possibility would be to put each language in its own document: In that case how do we manage the id? Finally both documents should refer to the same id. For example if the user searches for "nature" (and we don't know if the user intends to find "nature" in English or French), this document would appear twice in the result set, and the counting and paging would be very wrong (also facet counting).
Any ideas?