GraphDB Lucene connector: indexing rdfs:label values of a single language

I have a question about indexes in the GraphDB Lucene connector.
For a multilingual RDF resource, how can I index the rdfs:label values of a single language only (for example, English)?
I tried this:
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
PREFIX : <http://www.ontotext.com/connectors/lucene#>
INSERT DATA {
inst:lexicalEntryIndex :createConnector '''
{
"types": [
"http://www.w3.org/ns/lemon/ontolex#LexicalEntry"
],
"fields": [
{
"fieldName": "type",
"propertyChain": [
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
"http://www.w3.org/2000/01/rdf-schema#label"
],
"languages": [
"en"
]
}
]
}
''' .
}
but all the languages are indexed.
Thanks in advance,
Andrea

The GraphDB Lucene Connector documentation demonstrates how to index a single language.
Here is a sample snippet showing how to do it:
PREFIX luc: <http://www.ontotext.com/connectors/lucene#>
PREFIX luc-index: <http://www.ontotext.com/connectors/lucene/instance#>
INSERT DATA {
luc-index:my_index luc:createConnector '''
{
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
"fieldName": "name",
"propertyChain": [
"http://www.ontotext.com/example#name"
]
},
{
"fieldName": "nameLanguage",
"propertyChain": [
"http://www.ontotext.com/example#name",
"lang()"
]
}
], "entityFilter":"?nameLanguage in (\\"en\\")"
}
''' .
}
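The doubled backslashes in the entityFilter line above are easy to get wrong: the JSON needs \" around en, and inside a SPARQL ''' long string every backslash has to be doubled in turn. A minimal Python sketch of generating the update programmatically, assuming the connector options from the snippet above (the helper name create_connector_update is hypothetical, not part of GraphDB):

```python
import json


def create_connector_update(index_name: str, config: dict) -> str:
    """Build a SPARQL INSERT DATA update that creates a Lucene connector."""
    config_json = json.dumps(config, indent=2)
    # Inside a SPARQL '''long string''' a backslash starts an escape sequence,
    # so every backslash that JSON produced must be doubled.
    sparql_literal = config_json.replace("\\", "\\\\")
    return (
        "PREFIX luc: <http://www.ontotext.com/connectors/lucene#>\n"
        "PREFIX luc-index: <http://www.ontotext.com/connectors/lucene/instance#>\n"
        "INSERT DATA {\n"
        f"luc-index:{index_name} luc:createConnector '''\n"
        f"{sparql_literal}\n"
        "''' .\n"
        "}\n"
    )


# Same options as the snippet above, written as a plain Python dict.
config = {
    "types": ["http://www.ontotext.com/example#gadget"],
    "fields": [
        {
            "fieldName": "name",
            "propertyChain": ["http://www.ontotext.com/example#name"],
        },
        {
            "fieldName": "nameLanguage",
            "propertyChain": ["http://www.ontotext.com/example#name", "lang()"],
        },
    ],
    "entityFilter": '?nameLanguage in ("en")',
}

update = create_connector_update("my_index", config)
print(update)
```

The printed update can then be sent to the repository as a regular SPARQL update; how it is sent is left out here on purpose.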

Related

Deeply nested unevaluatedProperties and their expectations

I have been working on my own validator for JSON Schema and FINALLY have most of how unevaluatedProperties is supposed to work... I think. That's one tricky piece! However, I really just want to confirm one thing. Given the following schema and JSON, what is the expected outcome? I have tried it with https://www.jsonschemavalidator.net and gotten an answer, but I was hoping I could get a more definitive one.
The focus is that the faz property is in fact being evaluated, but the keyword that disallows unevaluated properties comes from a deeply nested schema.
Thoughts?
Here is the schema...
{
"type": "object",
"properties": {
"foo": {
"type": "object",
"properties": {
"bar": {
"type": "string"
}
},
"unevaluatedProperties": false
}
},
"anyOf": [
{
"properties": {
"foo": {
"properties": {
"faz": {
"type": "string"
}
}
}
}
}
]
}
Here is the JSON...
{
"foo": {
"bar": "test",
"faz": "test"
}
}
That schema will successfully evaluate against the provided data. The unevaluatedProperties keyword will be aware of properties evaluated in subschemas of adjacent keywords, and is evaluated after all other applicator keywords, so it will see the annotation produced from within the anyOf subschema, also.
Evaluating this keyword is easy if you follow the specification literally -- it uses annotations to decide what to do. You just need to make sure that all keywords either produce annotations correctly or propagate annotations correctly that were produced by other keywords, and then all the information is available to generate the correct result.
The result produced by my implementation is:
{
"annotations" : [
{
"annotation" : [
"faz"
],
"instanceLocation" : "/foo",
"keywordLocation" : "/anyOf/0/properties/foo/properties"
},
{
"annotation" : [
"foo"
],
"instanceLocation" : "",
"keywordLocation" : "/anyOf/0/properties"
},
{
"annotation" : [
"bar"
],
"instanceLocation" : "/foo",
"keywordLocation" : "/properties/foo/properties"
},
{
"annotation" : [],
"instanceLocation" : "/foo",
"keywordLocation" : "/properties/foo/unevaluatedProperties"
},
{
"annotation" : [
"foo"
],
"instanceLocation" : "",
"keywordLocation" : "/properties"
}
],
"valid" : true
}
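The bookkeeping behind that result can be sketched in a few lines of Python. This is only an illustration of the reasoning above, using the "properties" annotations copied from the output; the helper names are hypothetical, and the sketch assumes every annotation at the same instance location is visible to the keyword. Deciding which annotations a given unevaluatedProperties keyword may actually see is the part a real validator has to get right, and it is not modelled here.

```python
# "properties" annotations copied from the output above.
annotations = [
    {"annotation": ["faz"], "instanceLocation": "/foo",
     "keywordLocation": "/anyOf/0/properties/foo/properties"},
    {"annotation": ["foo"], "instanceLocation": "",
     "keywordLocation": "/anyOf/0/properties"},
    {"annotation": ["bar"], "instanceLocation": "/foo",
     "keywordLocation": "/properties/foo/properties"},
    {"annotation": ["foo"], "instanceLocation": "",
     "keywordLocation": "/properties"},
]

instance = {"foo": {"bar": "test", "faz": "test"}}


def evaluated_names(annotations, instance_location):
    """Union of the property names annotated at one instance location."""
    names = set()
    for entry in annotations:
        if entry["instanceLocation"] == instance_location:
            names.update(entry["annotation"])
    return names


def unevaluated_names(obj, annotations, instance_location):
    """Property names of obj that no annotation at this location covers."""
    return sorted(set(obj) - evaluated_names(annotations, instance_location))


# Both bar and faz are covered at /foo, so nothing is left unevaluated.
print(unevaluated_names(instance["foo"], annotations, "/foo"))
```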
This is not an answer but a follow-up example that I feel is in the same vein and may guide us to the answer.
Here we have a single object being validated, but the unevaluatedProperties keyword resides in two different schemas, each part of a different "adjacent keyword subschema" (from the core spec, http://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.11).
How should this be resolved? If all annotations must be evaluated, in what order do I evaluate them: the oneOf first or the anyOf? According to the spec, an unevaluated keyword (unevaluatedProperties or unevaluatedItems) generates annotation results, which means that its result would affect any other unevaluated keyword.
http://json-schema.org/draft/2020-12/json-schema-core.html#unevaluatedProperties
"The annotation result of this keyword is the set of instance property names validated by this keyword's subschema."
This is as far as my understanding of the spec goes.
According to the two validators I am using, this fails.
Schema
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"type": "object",
"properties": {
"foo": {
"type": "string"
}
},
"oneOf": [
{
"properties": {
"faz": {
"type": "string"
}
},
"unevaluatedProperties": true
}
],
"anyOf": [
{
"properties": {
"bar": {
"type": "string"
}
},
"unevaluatedProperties": false
}
]
}
Data
{
"bar": "test",
"faz": "test"
}

Why ElasticSearch is not able to search when special characters are available?

I have an ElasticSearch index with the configuration below:
{
"my_ind": {
"settings": {
"index": {
"mapping": {
"total_fields": {
"limit": "10000000"
}
},
"number_of_shards": "3",
"provided_name": "my_ind",
"creation_date": "1539773409246",
"analysis": {
"analyzer": {
"default": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
},
"number_of_replicas": "1",
"uuid": "3wC7i-E_Q9mSDjnTN2gxrg",
"version": {
"created": "5061299"
}
}
}
}
}
I want to find the following content with a plain search:
DL-1234170386456
This content is stored in the following field:
DNumber
This field has the following mapping:
{
"DNumber": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
I am trying to implement this in Java. I came across the ElasticSearch analyzers and tokenizers, so I used the "whitespace" tokenizer.
I am trying to search with the query below:
{
"query": {
"multi_match": {
"query": "DL-1234170386456",
"fields": [
"_all"
],
"type": "best_fields",
"operator": "OR",
"analyzer": "default",
"slop": 0,
"prefix_length": 0,
"max_expansions": 50,
"lenient": false,
"zero_terms_query": "NONE",
"boost": 1
}
}
}
What am I doing wrong?
After a lot of research and trial and error, I found the answer!
Some basic but important points:
Analyzers and tokenizers need to be specified when the index is created, before the data is indexed.
The string "DL-1234170386456" contains a special character ("-"), and ElasticSearch uses the Standard Analyzer by default.
The Standard Analyzer contains the Standard Tokenizer, which is based on the Unicode Text Segmentation algorithm.
Actual problem:
ElasticSearch splits the string "DL-1234170386456" into two separate tokens, "DL" and "1234170386456".
Solution:
We need to specify the Whitespace Analyzer, which contains the Whitespace Tokenizer.
It splits text only where whitespace is encountered, so the string "DL-1234170386456" is kept as it is by ElasticSearch and we are able to find it.
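The difference between the two tokenizers can be illustrated with a rough Python sketch. This is not ElasticSearch code: whitespace_tokenize mimics splitting on whitespace only, and approx_standard_tokenize is a deliberately simplified stand-in for the Standard Tokenizer that just keeps runs of ASCII letters and digits (the real tokenizer follows the Unicode Text Segmentation rules). Both function names are hypothetical, and the lowercase filter is left out.

```python
import re
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    """Split on runs of whitespace only, leaving punctuation inside tokens."""
    return text.split()


def approx_standard_tokenize(text: str) -> List[str]:
    """Simplified stand-in: keep runs of ASCII letters and digits as tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


value = "DL-1234170386456"
print(whitespace_tokenize(value))       # ['DL-1234170386456']
print(approx_standard_tokenize(value))  # ['DL', '1234170386456']
```

A search for the full value only matches when the indexed token and the query token are the same, which is why the whitespace variant finds the document and the hyphen-splitting variant does not.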

GraphDB Elasticsearch Connector

Is there a working example of mapping lat/long properties from GraphDB to geo_point objects in Elasticsearch?
{
"fieldName": "location",
"propertyChain": [
"http://example.com/coordinates"
],
"objectFields": [
{
"fieldName": "lat",
"propertyChain": [
"http://www.w3.org/2003/01/geo/wgs84_pos#lat"
]
},
{
"fieldName": "lon",
"propertyChain": [
"http://www.w3.org/2003/01/geo/wgs84_pos#long"
]
}
]
}
thanks
The only way to index data as geo_point with the current version of GraphDB and the Elasticsearch connector is to have the latitude and the longitude in a single literal, e.g. with the property http://www.w3.org/2003/01/geo/wgs84_pos#lat_long. The connector would look like this:
PREFIX : <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
inst:geopoint :createConnector '''
{
"elasticsearchNode": "localhost:9300",
"types": ["http://geopoint.ontotext.com/Point"],
"fields": [
{
"fieldName": "location",
"propertyChain": [
"http://www.w3.org/2003/01/geo/wgs84_pos#lat_long"
],
"datatype": "native:geo_point"
}
]
}
''' .
}
Note that datatype: "native:geo_point" is important as it tells Elasticsearch what type of data this is.
We are currently looking into possible ways to introduce support for latitude and longitude coming from separate literals.
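Until then, data that stores latitude and longitude separately has to be combined into one literal before it is loaded. A minimal Python sketch of that preprocessing step, assuming the comma-separated "lat,lon" string form (one of the formats Elasticsearch accepts for geo_point); the helper names and the subject IRI are hypothetical:

```python
LAT_LONG = "http://www.w3.org/2003/01/geo/wgs84_pos#lat_long"


def to_lat_long(lat: float, lon: float) -> str:
    """Return a single "lat,lon" literal from separate latitude and longitude."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return f"{lat},{lon}"


def lat_long_triple(subject_iri: str, lat: float, lon: float) -> str:
    """Format one N-Triples statement carrying the combined literal."""
    return f'<{subject_iri}> <{LAT_LONG}> "{to_lat_long(lat, lon)}" .'


# Hypothetical subject; it would also need the type listed in the connector.
print(lat_long_triple("http://example.com/place/1", 42.6977, 23.3219))
```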

How to query microsoft academic graph for citation and co-citation?

Reading through:
https://www.microsoft.com/cognitive-services/en-us/Academic-Knowledge-API/documentation/GraphSearchMethod
The meaning of "path" is a bit obscure:
"path": "/paper/AuthorIDs/author" - I don't see an AuthorIDs object in the returned results.
# post data query
{
"path": "/paper/AuthorIDs/author",
"paper": {
"type": "Paper",
"NormalizedTitle": "graph engine",
"select": [
"OriginalTitle"
]
},
"author": {
"return": {
"type": "Author",
"Name": "bin shao"
}
}
}
#results
{
"Results": [
[
{
"CellID": 2160459668,
"OriginalTitle": "Trinity: a distributed graph engine on a memory cloud"
},
{
"CellID": 2093502026
}
],
[
{
"CellID": 2171539317,
"OriginalTitle": "A distributed graph engine for web scale RDF data"
},
{
"CellID": 2093502026
}
],
[
{
"CellID": 2411554868,
"OriginalTitle": "A distributed graph engine for web scale RDF data"
},
{
"CellID": 2093502026
}
],
[
{
"CellID": 73304046,
"OriginalTitle": "The Trinity graph engine"
},
{
"CellID": 2093502026
}
]
]
}
Which is the correct path (or data to post) to query for citation and co-citation of an article, and paginate results?
You will find AuthorIDs in the graph schema from Microsoft Academic Search.
Assuming you know the ID of the source paper (2118322263 in the following example), here is the POST part of the request:
{
"path": "/paper/CitationIDs/citation",
"paper": {
"type": "Paper",
"id": [ 2118322263 ],
"select": [
"OriginalTitle"
]
},
"citation": {
"return": {
"type": "Paper"
},
"select": [
"OriginalTitle"
]
}
}
This returns 634 results in one response, while a query to the paper itself shows a citation count of 732. I have no idea why there is a difference, nor how to do pagination.

ElasticSearch - how to give priority to the matching from the same row

I have the following documents in ElasticSearch 0.19.11:
{ "title": "dogs species",
"col_names": [ "name", "description", "country_of_origin" ],
"rows": [
{ "row": [ "Boxer", "good dog", "Germany" ] },
{ "row": [ "Irish Setter", "great dog", "Ireland" ] }
]
}
{ "title": "Misc stuff",
"col_names": [ "foo" ],
"rows": [
{ "row": [ "Setter is impotant" ] },
{ "row": [ "Ireland is green" ] }
]
}
The mapping is as follows:
{
"table" : {
"properties" : {
"title" : {"type" : "string"},
"col_names" : {"type" : "string"},
"rows" : {
"properties" : {
"row" : {"type" : "string"}
}
}
}
}
}
Question: I'm now searching for "Ireland Setter" and I need to have a higher score for documents that have search terms in the same row.
Currently the second document gets a score of 0.22, while the first one gets 0.14.
I want the first document to get a higher score in this case, since it has both "Ireland" and "Setter" in the same row. How can it be done?
With great cooperation from members of the ElasticSearch Google group, a solution was found.
Here is the link to the discussion: https://groups.google.com/forum/?fromgroups#!topic/elasticsearch/4O9dff2SNhg