Finding similar documents with Elasticsearch - lucene

I'm using Elasticsearch to develop a service that stores uploaded files or web pages as attachments (the file is one field in the document). This part works fine, as I can search these files using like_text as input. However, the second part of the service should compare a file that has just been uploaded with the existing files in order to find duplicates or very similar files, so that it doesn't recommend the same files or web pages to users. The problem is that I can't get the expected results for documents that are identical. The similarity between identical files varies, but is never more than 0.4. Even worse, I sometimes get better scores for files which are not the same than for two files that are exactly the same. The Java code given below always returns the same set of documents in the same order, regardless of the input. It looks as if the like_text extracted from the uploaded file is always the same.
// Load the attachment mapping and index the uploaded file
String mapping = copyToStringFromClasspath("/org/prosolo/services/indexing/documents-mapping.json");
byte[] txt = org.elasticsearch.common.io.Streams.copyToByteArray(file);
Client client = ElasticSearchFactory.getClient();
client.admin().indices().putMapping(putMappingRequest(indexName).type(indexType).source(mapping)).actionGet();
IndexResponse iResponse = client.index(indexRequest(indexName).type(indexType)
        .source(jsonBuilder()
                .startObject()
                    .field("file", txt)
                    .field("title", title)
                    .field("visibility", visibilityType.name().toLowerCase())
                    .field("ownerId", ownerId)
                    .field("description", description)
                    .field("contentType", DocumentType.DOCUMENT.name().toLowerCase())
                    .field("dateCreated", dateCreated)
                    .field("url", link)
                    .field("relatedToType", relatedToType)
                    .field("relatedToId", relatedToId)
                .endObject()))
        .actionGet();
client.admin().indices().refresh(refreshRequest()).actionGet();

// Run a More Like This query on the "file" field of the document that was just indexed
MoreLikeThisRequestBuilder mltRequestBuilder = new MoreLikeThisRequestBuilder(client, ESIndexNames.INDEX_DOCUMENTS, ESIndexTypes.DOCUMENT, iResponse.getId());
mltRequestBuilder.setField("file");
SearchResponse response = client.moreLikeThis(mltRequestBuilder.request()).actionGet();
SearchHits searchHits = response.getHits();
System.out.println("getTotalHits:" + searchHits.getTotalHits());
Iterator<SearchHit> hitsIter = searchHits.iterator();
while (hitsIter.hasNext()) {
    SearchHit searchHit = hitsIter.next();
    System.out.println("FOUND DOCUMENT:" + searchHit.getId() + " title:" + searchHit.getSource().get("title") + " score:" + searchHit.score());
}
And the query from the browser, which looks like:
http://localhost:9200/documents/document/m2HZM3hXS1KFHOwvGY1pVQ/_mlt?mlt_fields=file&min_doc_freq=1
gives me these results:
{"took":120,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":4,
"max_score":0.41059873,
"hits":
[{"_index":"documents","_type":"document",
"_id":"gIe6NDEWRXWTMi4kMPRbiQ",
"_score":0.41059873,
"_source" :
{"file":"PCFET0NUWVBFIGh..._skiping_the_file_content_here...",
"title":"Univariate Analysis",
"visibility":"public",
"description":"Univariate Analysis Simple Tools for Description ",
"contentType":"webpage",
"dateCreated":"null",
"url":"http://www.slideshare.net/christineshearer/univariate-analysis"}}
This is exactly the same web page, so I'm expecting the score to be 1.0, not 0.41, as there is no difference between the two documents except the _id. The results are even worse with files.
The mapping I was using is:
{
  "document": {
    "properties": {
      "title": {
        "type": "string",
        "store": true
      },
      "description": {
        "type": "string",
        "store": "yes"
      },
      "contentType": {
        "type": "string",
        "store": "yes"
      },
      "dateCreated": {
        "store": "yes",
        "type": "date"
      },
      "url": {
        "store": "yes",
        "type": "string"
      },
      "visibility": {
        "store": "yes",
        "type": "string"
      },
      "ownerId": {
        "type": "long",
        "store": "yes"
      },
      "relatedToType": {
        "type": "string",
        "store": "yes"
      },
      "relatedToId": {
        "type": "long",
        "store": "yes"
      },
      "file": {
        "path": "full",
        "type": "attachment",
        "fields": {
          "author": {
            "type": "string"
          },
          "title": {
            "store": true,
            "type": "string"
          },
          "keywords": {
            "type": "string"
          },
          "file": {
            "store": true,
            "term_vector": "with_positions_offsets",
            "type": "string"
          },
          "name": {
            "type": "string"
          },
          "content_length": {
            "type": "integer"
          },
          "date": {
            "format": "dateOptionalTime",
            "type": "date"
          },
          "content_type": {
            "type": "string"
          }
        }
      }
    }
  }
}
Does anyone have an idea what could be wrong here?
Thanks

Related

Put validation of two array fields in JSON Schema using oneOf

Can I put a check on two fields in a JSON schema? Both fields are of type array of objects. Conditions:
Only one of them can contain values at a time (i.e. the other should be empty).
Both can be empty.
Any leads?
// The schema
var schema = {
  "id": "https://kitoutapi.lrsdedicated.com/v1/json_schemas/login-request#",
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "Login request schema",
  "type": "object",
  "oneOf": [
    {
      "categories": {
        "maxItems": 0
      },
      "positionedOffers": {
        "minItems": 1
      }
    },
    {
      "categories": {
        "minItems": 1
      },
      "positionedOffers": {
        "maxItems": 0
      }
    }
  ],
  "properties": {
    "categories": {
      "type": "array"
    },
    "positionedOffers": {
      "type": "array"
    }
  },
  "additionalProperties": false
};

// Test data 1
// This test should return a good result
var data1 = {
  "positionedOffers": ['hello'],
  "categories": []
}
For your requirement, I think I'd come at this from the other direction. Rather than saying
If one contains a value, the other must be empty, but both may be empty.
I'd say
At least one must be empty.
That leads you to use an anyOf with subschemas checking that each property is an empty array.
{
  "id": "https://kitoutapi.lrsdedicated.com/v1/json_schemas/login-request#",
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "Login request schema",
  "type": "object",
  "anyOf": [
    {
      "properties": {
        "categories": {
          "maxItems": 0
        }
      }
    },
    {
      "properties": {
        "positionedOffers": {
          "maxItems": 0
        }
      }
    }
  ],
  "properties": {
    "categories": {
      "type": "array"
    },
    "positionedOffers": {
      "type": "array"
    }
  },
  "additionalProperties": false
}
Bonus Material
In your original post, you omitted the properties keyword under the oneOf. This may have been the cause of the schema's failure to validate; I've added it in the schema above.
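For comparison, a sketch of your original oneOf version with the properties keyword added back in would look like this (note that, unlike the anyOf form above, this variant still rejects the case where both arrays are empty, since neither branch matches then):
  "oneOf": [
    {
      "properties": {
        "categories": { "maxItems": 0 },
        "positionedOffers": { "minItems": 1 }
      }
    },
    {
      "properties": {
        "categories": { "minItems": 1 },
        "positionedOffers": { "maxItems": 0 }
      }
    }
  ]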
Secondly, draft 4 is quite old at this point. You may be limited by the implementation you're using, but if you can, you should consider using a more recent version of JSON Schema. You can view available implementations and what versions they support on the JSON Schema implementations page.

ajv-cli always says bad data is valid

I'm running ajv-cli as part of my automated testing scripts to make sure my mock data is up to date.
./node_modules/.bin/ajv -s ./test-data/manifest.schema.json -d ./test-data/fleet.manifest.json
./test-data/fleet.manifest.json valid
But the data isn't valid.
manifest.schema.json:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "definitions": {
    "ManifestHistoryItem": {
      "properties": {
        "id": {
          "default": [
            "assetCatalog",
            "Roster"
          ],
          "items": {
            "type": "string"
          },
          "type": "array"
        },
        "name": {
          "default": "",
          "type": "string"
        }
      },
      "required": [
        "id",
        "name"
      ],
      "type": "object"
    }
  }
}
fleet.manifest.json:
{
"namee": "Epic Space Battles"
}
(it's missing the required "id" property, and "name" is misspelled)
The schema is generated by "typescript-json-schema": "^0.54.0" from a TypeScript model and evaluated via "ajv-cli": "^5.0.0".
Your schema declares definitions, but it doesn't reference them anywhere. You need to add a "$ref": "#/definitions/ManifestHistoryItem" at the root.
{
  "definitions": {
    "ManifestHistoryItem": { ... }
  },
  "$ref": "#/definitions/ManifestHistoryItem"
}
Either that, or you can get rid of the definitions wrapper altogether and just have the { ... } part from above.
Effectively what's happening is you've defined an empty schema, which applies no constraints, meaning all instances (data) pass.
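Once the $ref is in place, ajv should report fleet.manifest.json as invalid (it lacks the required "id" and "name"), and a minimal instance that satisfies ManifestHistoryItem would look something like this (the values are only illustrative):
{
  "id": ["assetCatalog", "Roster"],
  "name": "Epic Space Battles"
}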

In VS Code, how can I get JSON intellisense based on the files in a given directory?

My application includes JSON files.
One JSON property common to all JSON files in my app is "type".
The value of the type property must be the file name of any handlebars (.hbs) file in my /templates/ directory.
My goal is to get intellisense as I type the value for a given "type" property in a JSON file. The intellisense should include a list of all handlebars file names in the /templates/ directory. Either an automatic solution, or one where I provide a list of values would be great.
Is there a way in the VS Code config or with TypeScript to make that happen?
You can add a JSON schema to your workspace settings.json in the .vscode directory.
This is an example that adds intellisense to package.json for ghooks settings:
"json.schemas": [
{
"fileMatch": [
"/package.json"
],
"schema": {
"type": "object",
"properties": {
"config": {
"properties": {
"ghooks": {
"type": "object",
"description": "Git hook configurations",
"properties": {
"applypatch-msg": {
"type": "string"
},
"pre-applypatch": {
"type": "string"
},
"post-applypatch": {
"type": "string"
},
"pre-commit": {
"type": "string"
},
"prepare-commit-msg": {
"type": "string"
},
"commit-msg": {
"type": "string"
},
"post-commit": {
"type": "string"
},
"pre-rebase": {
"type": "string"
},
"post-checkout": {
"type": "string"
},
"post-merge": {
"type": "string"
},
"pre-push": {
"type": "string"
},
"pre-receive": {
"type": "string"
},
"update": {
"type": "string"
},
"post-receive": {
"type": "string"
},
"post-update": {
"type": "string"
},
"pre-auto-gc": {
"type": "string"
},
"post-rewrite": {
"type": "string"
}
}
}
}
}
}
}
}
],
To get intellisense for specific property values, you can check out this documentation: https://github.com/json-schema/json-schema/wiki/anyOf,-allOf,-oneOf,-not#oneof
For your use case it could also be useful to dynamically generate a JSON schema that includes all possible template names and include that schema in your JSON files via the $schema property or via configuration:
"json.schemas": [
{
"fileMatch": [
"/json-files/*.json"
],
"url": "./templates.schema.json"
}
],
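The generated templates.schema.json could then be a sketch along these lines, with the enum listing whatever .hbs file names are currently in your /templates/ directory (the names below are just placeholders):
{
  "type": "object",
  "properties": {
    "type": {
      "type": "string",
      "enum": [
        "list.hbs",
        "detail.hbs",
        "form.hbs"
      ]
    }
  }
}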
I'm not aware of any VS Code feature to dynamically generate such a schema. You might want to use an approach like the one I used in this extension: https://github.com/skolmer/vscode-i18n-tag-schema - an extension that uses a Node module to dynamically generate a schema file.

Is it possible to inline JSON schemas into a JSON document? [duplicate]

For example, a schema for a file system: a directory contains a list of files. The schema consists of the specification of file, followed by a sub-type "image" and another one, "text".
At the bottom there is the main directory schema. Directory has a property content, which is an array of items that should be sub-types of file.
Basically, what I am looking for is a way to tell the validator to look up the value of a "$ref" from a property in the JSON object being validated.
Example JSON:
{
  "name": "A directory",
  "content": [
    {
      "fileType": "http://x.y.z/fs-schema.json#definitions/image",
      "name": "an-image.png",
      "width": 1024,
      "height": 800
    },
    {
      "fileType": "http://x.y.z/fs-schema.json#definitions/text",
      "name": "readme.txt",
      "lineCount": 101
    },
    {
      "fileType": "http://x.y.z/extended-fs-schema-video.json",
      "name": "demo.mp4",
      "hd": true
    }
  ]
}
The "pseudo" Schema note that "image" and "text" definitions are included in the same schema but they might be defined elsewhere
{
  "id": "http://x.y.z/fs-schema.json",
  "definitions": {
    "file": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "fileType": {
          "type": "string",
          "format": "uri"
        }
      }
    },
    "image": {
      "allOf": [
        { "$ref": "#definitions/file" },
        {
          "properties": {
            "width": { "type": "integer" },
            "height": { "type": "integer" }
          }
        }
      ]
    },
    "text": {
      "allOf": [
        { "$ref": "#definitions/file" },
        { "properties": { "lineCount": { "type": "integer" } } }
      ]
    }
  },
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "content": {
      "type": "array",
      "items": {
        "allOf": [
          { "$ref": "#definitions/file" },
          { *"$refFromProperty"*: "fileType" } // the magic thing
        ]
      }
    }
  }
}
The validation parts of JSON Schema alone cannot do this - it represents a fixed structure. What you want requires resolving/referencing schemas at validation-time.
However, you can express this using JSON Hyper-Schema, and a rel="describedby" link:
{
  "title": "Directory entry",
  "type": "object",
  "properties": {
    "fileType": { "type": "string", "format": "uri" }
  },
  "links": [{
    "rel": "describedby",
    "href": "{+fileType}"
  }]
}
So here, it takes the value from "fileType" and uses it to calculate a link with relation "describedby" - which means "the schema at this location also describes the current data".
The problem is that most validators do not take any notice of any links (including "describedby" ones). You need to find a "hyper-validator" that does.
UPDATE: the tv4 library has added this as a feature
I think cloudfeet's answer is a valid solution. You could also use the same approach described here.
You would have a file object type which could be an "anyOf" of all the subtypes you want to define. You would use an enum in order to be able to reference and validate against each of the subtypes.
If the sub-type schemas are in the same JSON Schema file, you don't need to reference the URI explicitly with "$ref". A correct draft-4 validator will find the enum value and will try to validate against that subschema in the JSON Schema tree.
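A minimal sketch of that idea, reusing the fileType URIs and definitions from the question (this only illustrates the enum-per-subtype pattern, it is not a drop-in schema):
"items": {
  "anyOf": [
    {
      "allOf": [
        { "properties": { "fileType": { "enum": ["http://x.y.z/fs-schema.json#definitions/image"] } } },
        { "$ref": "#/definitions/image" }
      ]
    },
    {
      "allOf": [
        { "properties": { "fileType": { "enum": ["http://x.y.z/fs-schema.json#definitions/text"] } } },
        { "$ref": "#/definitions/text" }
      ]
    }
  ]
}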
In draft 5 (in progress) a "switch" statement has been proposed, which will allow alternatives to be expressed in a more explicit way.

Specific analyzers for sub-documents in lucene / elasticsearch

After reading the documentation, testing, and reading a lot of other questions here on Stack Overflow:
We have documents that have titles and descriptions in multiple languages. There are also tags that are translated into the same languages. There might be up to 30-40 different languages in the system, but probably only 3 or 4 translations for a single document.
This is the planned document structure:
{
  "luck": {
    "id": 10018,
    "pub": 0,
    "pr": 100002,
    "loc": {
      "lat": 42.7,
      "lon": 84.2
    },
    "t": [
      {
        "lang": "en-analyzer",
        "title": "Forest",
        "desc": "A lot of trees.",
        "tags": [
          "Wood",
          "Nature",
          "Green Mouvement"
        ]
      },
      {
        "lang": "fr-analyzer",
        "title": "ForĂȘt",
        "desc": "A grand nombre d'arbre.",
        "tags": [
          "Bois",
          "Nature",
          "Mouvement Vert"
        ]
      }
    ],
    "dates": [
      "2014-01-01T20:00",
      "2014-06-06T20:00",
      "2014-08-08T20:00"
    ]
  }
}
Possible queries are "arbre", "wood", "forest" or "nature", combined with a date and a geo_distance filter; furthermore, there will be some facets over the tags array (which obviously include counting).
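A sketch of the kind of query we have in mind, assuming the field names from the planned structure above (the concrete distance, date and search term are just placeholders):
{
  "query": {
    "filtered": {
      "query": {
        "multi_match": {
          "query": "nature",
          "fields": ["t.title", "t.desc", "t.tags"]
        }
      },
      "filter": {
        "and": [
          { "range": { "dates": { "gte": "2014-06-01T00:00" } } },
          { "geo_distance": { "distance": "50km", "loc": { "lat": 42.7, "lon": 84.2 } } }
        ]
      }
    }
  },
  "facets": {
    "tags": {
      "terms": { "field": "t.tags" }
    }
  }
}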
We can produce any document structure that fits best for elasticsearch (or for lucene). It's crucial that each language is analyzed specifically, so we use "_analyzer" in order to distinguish the languages.
{
  "luck": {
    "properties": {
      "id": {
        "type": "long"
      },
      "pub": {
        "type": "long"
      },
      "pr": {
        "type": "long"
      },
      "loc": {
        "type": "geo_point"
      },
      "t": {
        "_analyzer": {
          "path": "t.lang"
        },
        "properties": {
          "lang": {
            "type": "string"
          },
          "properties": {
            "title": {
              "type": "string"
            },
            "desc": {
              "type": "string"
            },
            "tags": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
A) Apparently, this idea does not work: after PUTting the mapping, we retrieve the same mapping ("GET") and it seems to ignore the specific analyzers (a test with a top-level "_analyzer" worked fine). Does "_analyzer" work for sub-documents and, if yes, how should we refer to it? We also tested declaring the sub-document as "object" or "nested". How is multi-language document indexing supposed to work?
B) One possibility would be to put each language in its own document: in that case, how do we manage the id? In the end, both documents should refer to the same id. For example, if the user searches for "nature" (and we don't know whether the user intends to find "nature" in English or French), this document would appear twice in the result set, and the counting and paging would be very wrong (as would the facet counting).
Any ideas?