Numbers returned by MarkLogic do not use scientific notation - sparql

Write the following RDF data into MarkLogic:
<http://John> <http://have> 2.1123519E7 .
Then execute the following query:
SELECT *
WHERE { ?s ?p ?o. }
The query result is:
?s | ?p | ?o
<http://John> | <http://have> | "21123519"^^xsd:double
The expected query result is:
?s | ?p | ?o
<http://John> | <http://have> | "2.1123519E7"^^xsd:double
The written data is expressed in scientific notation, but the data returned in the query result is not. Will this cause inconsistencies?
Executing the SPARQL query with the Java Client API gives the unexpected result, but executing it in Query Console gives the expected result.
The query also returns the expected result in Apache Jena and RDF4J.
Can someone give me an answer or a hint about this?

As Mads points out: the numbers are the same.
However, this is an interesting one.
MarkLogic does keep and understand the scientific notation. The REST API also handles this for many return types. I tested your query against the /v1/sparql endpoint:
Accept Header: application/sparql-results+xml
...
<result>
  <binding name="s">
    <uri>http://John</uri>
  </binding>
  <binding name="p">
    <uri>http://have</uri>
  </binding>
  <binding name="o">
    <literal datatype="http://www.w3.org/2001/XMLSchema#double">2.1123519E7</literal>
  </binding>
</result>
...
Accept Header: text/csv
s,p,o
http://John,http://have,2.1123519E7
The same holds for HTML, etc.
However, for JSON, things are different:
{
  "s": {
    "type": "uri",
    "value": "http://John"
  },
  "p": {
    "type": "uri",
    "value": "http://have"
  },
  "o": {
    "datatype": "http://www.w3.org/2001/XMLSchema#double",
    "type": "literal",
    "value": "21123519"
  }
}
This matches the fact that scientific notation appears to be lost when the value is treated as a JSON double:
JSON.parse('{ "foo" : 2.1123519E7}')
// returns:
{
"foo": 21123519
}
So, it all comes down to how you are requesting your results in your call to MarkLogic. Some response types return what you expect. At least one (JSON) does not. At this point, I suggest opening a ticket under the Java API project: https://github.com/marklogic/java-client-api/issues
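In the meantime, one workaround is to ask the REST API for a non-JSON serialization, as in the XML test above. A minimal sketch with Python requests; the host, port, and credentials are placeholders, and the endpoint shown is MarkLogic's standard /v1/graphs/sparql:
import requests
from requests.auth import HTTPDigestAuth

# Request XML results instead of JSON so the scientific
# notation survives serialization.
resp = requests.post(
    "http://localhost:8000/v1/graphs/sparql",
    auth=HTTPDigestAuth("admin", "admin"),  # placeholder credentials
    headers={
        "Content-Type": "application/sparql-query",
        "Accept": "application/sparql-results+xml",
    },
    data="SELECT * WHERE { ?s ?p ?o . }",
)
print(resp.text)  # the <literal> element carries 2.1123519E7 unchanged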

Related

Query Wikidata REST API with related identifier

I am attempting to do alignments for a set of known VIAF IDs. I would like to query the Wikidata REST API with a VIAF ID (P214) and get back a set of one or more Wikidata entity IDs (QXXXXX) that correspond to that VIAF entity. I am unable to find any examples of this either in the Wikidata API documentation or otherwise online.
I've noodled around with various permutations of queries using action=wbsearchentities and action=query, all to no avail.
Could anyone kindly point me to a set of docs or example code that enumerates the correct query parameters for such a search?
Let's suppose you want to find the item whose VIAF ID is "113230702" (i.e. Douglas Adams).
Solution 1
Use action=query:
https://www.wikidata.org/w/api.php?action=query&format=json&list=search&srsearch=haswbstatement:P214=113230702
URL response:
{"batchcomplete":"","query":{"searchinfo":{"totalhits":1},"search":[{"ns":0,"title":"Q42","pageid":138,"size":319024,"wordcount":1204,"snippet":"Douglas Adams\nDouglas Adams\n\u0414\u0443\u0433\u043b\u0430\u0441 \u0410\u0434\u0430\u043c\u0441\nDouglas Adams\nDouglas Adams\nDouglas Adams\nDouglas Adams\nDouglas Adams\nDouglas Adams\nDouglas Adams\nDouglas Adams","timestamp":"2021-08-13T22:04:39Z"}]}}
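The same lookup can be scripted; here is a minimal sketch with Python requests, using exactly the parameters from the URL above:
import requests

# Solution 1, scripted: full-text search on the haswbstatement keyword.
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": "haswbstatement:P214=113230702",
    },
)
for hit in resp.json()["query"]["search"]:
    print(hit["title"])  # -> Q42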
Solution 2
Use Wikidata Query Service:
https://query.wikidata.org/bigdata/namespace/wdq/sparql?format=json&query=SELECT%20DISTINCT%20%3Fx%0AWHERE%20{%0A%20%20%3Fx%20wdt%3AP214%20%22113230702%22%0A}
This last URL comes from the following SPARQL query:
SELECT DISTINCT ?x
WHERE {
?x wdt:P214 "113230702"
}
URL response:
{
  "head" : {
    "vars" : [ "x" ]
  },
  "results" : {
    "bindings" : [ {
      "x" : {
        "type" : "uri",
        "value" : "http://www.wikidata.org/entity/Q42"
      }
    } ]
  }
}
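For a scripted version of Solution 2, here is a minimal sketch with Python requests against the canonical https://query.wikidata.org/sparql path (the bigdata namespace URL above resolves to the same service):
import requests

viaf_id = "113230702"  # Douglas Adams, as above
query = 'SELECT DISTINCT ?x WHERE { ?x wdt:P214 "%s" }' % viaf_id

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "viaf-alignment-sketch/0.1"},  # WDQS policy asks for a UA
)
for binding in resp.json()["results"]["bindings"]:
    print(binding["x"]["value"])  # -> http://www.wikidata.org/entity/Q42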

SPARQL filter xsd:Date fails

I have the following SPARQL query
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?d
WHERE {
  GRAPH <https:/my/triples> {
    ?s <http://my/ontology/v1.0#hasTimestamp> ?d .
    FILTER(?d > "1945-01-01"^^xsd:date)
  }
}
When I execute the above, it fails to filter the results by date correctly. Actually, it gives me no results at all, and no errors.
If I remove the date filter, I get this response:
"bindings": [
  {
    "s": { "type": "uri", "value": "seo:S2A_MSI_2019_11_28_09_33_31_T33SXB_t_dogliotti" },
    "d": { "datatype": "xsd:date", "type": "typed-literal", "value": "2019-11-28" }
  }
]
What could be wrong?
I tried all of the suggestions in the comments but I couldn't make it work properly.
I stumbled upon a post from 2011 (!!) that partially gave me the solution.
The post is here, and it states that the solution below (i.e. "stringifying" the date) works only for equality:
FILTER(str(?d) >= "2018") .
and that casting to xsd:dateTime works for comparisons.
In my case the first solution worked for comparisons (including equality), while casting failed miserably.
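For reference, a sketch of the original query rewritten with the string-based filter that worked here (graph and property IRIs unchanged from the question):
SELECT DISTINCT ?d
WHERE {
  GRAPH <https:/my/triples> {
    ?s <http://my/ontology/v1.0#hasTimestamp> ?d .
    # Lexicographic comparison on the lexical form; xsd:date's
    # YYYY-MM-DD layout happens to sort chronologically.
    FILTER(str(?d) > "1945-01-01")
  }
}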

how to programmatically get all available information from a Wikidata entity?

I'm really new to Wikidata. I just figured out that Wikidata uses a lot of reification.
Suppose we want to get all the information available for Obama. With DBpedia, we would just use a simple query:
select * where { <http://dbpedia.org/resource/Barack_Obama> ?p ?o . }
This returns all the properties and values with Obama as the subject. Essentially the result is the same as this page: http://dbpedia.org/page/Barack_Obama, but in a format I can work with.
I'm wondering how to do the same thing with Wikidata. This is the Wikidata page for Obama: https://www.wikidata.org/wiki/Q76. Let's say I want all the statements on this page. But almost all the statements on this page are reified, in that they have ranks and qualifiers, etc. For example, the "educated at" part not only has the school, but also a "start time" and "end time", and all schools are ranked as normal since Obama is not at these schools anymore.
I could just get all the schools by getting the truthy statements (using https://query.wikidata.org):
SELECT ?school ?schoolLabel WHERE {
  wd:Q76 wdt:P69 ?school .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
The above query will simply return all the schools.
If I want to get the start time and end time of the school, I need to do this:
SELECT ?school ?schoolLabel ?start ?end WHERE {
  wd:Q76 p:P69 ?school_statement .
  ?school_statement ps:P69 ?school .
  ?school_statement pq:P580 ?start .
  ?school_statement pq:P582 ?end .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
But the thing is, without looking at the actual page, how would I know that ?school_statement has pq:P580 and pq:P582, namely the "start time" and "end time"? It all comes down to one question: how do I get all the information (including reification) from https://www.wikidata.org/wiki/Q76?
Ultimately, I would expect a table like this:
||predicate||object||objectLabel||qualifier1||qualifier1Value||qualifier2||qualifier2Value||...
You should probably go for the Wikidata API (more specifically, the wbgetentities module) instead of the SPARQL endpoint.
In your case:
https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q76
There you should find all the qualifier data you were looking for; for example, entities.Q76.claims.P69.1 looks like this:
{ mainsnak:
    { snaktype: 'value',
      property: 'P69',
      datavalue:
        { value: { 'entity-type': 'item', 'numeric-id': 3273124, id: 'Q3273124' },
          type: 'wikibase-entityid' },
      datatype: 'wikibase-item' },
  type: 'statement',
  qualifiers:
    { P580:
        [ { snaktype: 'value',
            property: 'P580',
            hash: 'a1db249baf916bb22da7fa5666d426954435256c',
            datavalue:
              { value:
                  { time: '+1971-01-01T00:00:00Z',
                    timezone: 0,
                    before: 0,
                    after: 0,
                    precision: 9,
                    calendarmodel: 'http://www.wikidata.org/entity/Q1985727' },
                type: 'time' },
            datatype: 'time' } ],
      P582:
        [ { snaktype: 'value',
            property: 'P582',
            hash: 'a065bff95f5cb3026ebad306b3df7587c8daa2e9',
            datavalue:
              { value:
                  { time: '+1979-01-01T00:00:00Z',
                    timezone: 0,
                    before: 0,
                    after: 0,
                    precision: 9,
                    calendarmodel: 'http://www.wikidata.org/entity/Q1985727' },
                type: 'time' },
            datatype: 'time' } ] },
  'qualifiers-order': [ 'P580', 'P582' ],
  id: 'q76$464382F6-E090-409E-B7B9-CB913F1C2166',
  rank: 'normal' }
Then you might be interested in ways to extract readable results from that response.
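Building on that response shape, a minimal Python sketch that flattens every statement and its qualifiers into rows roughly matching the table layout sketched in the question (labels are omitted; fetching them would need extra wbgetentities or SPARQL label calls):
import requests

# Fetch Q76 and walk claims -> statements -> qualifiers.
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetentities", "format": "json", "ids": "Q76"},
)
claims = resp.json()["entities"]["Q76"]["claims"]

for prop, statements in claims.items():
    for st in statements:
        # .get() guards against novalue/somevalue snaks, which have no datavalue.
        value = st["mainsnak"].get("datavalue", {}).get("value")
        row = [prop, str(value)]
        for qprop, qsnaks in st.get("qualifiers", {}).items():
            for qsnak in qsnaks:
                row += [qprop, str(qsnak.get("datavalue", {}).get("value"))]
        print(" || ".join(row))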

Elasticsearch - How to normalize score when combining regular query and function_score?

Ideally, what I am trying to achieve is to assign weights to queries such that query1 constitutes 30% of the final score and query2 constitutes the other 70%, so that to achieve the maximum score a document has to have the highest possible score on both query1 and query2. My study of the documentation did not yield any hints as to how to achieve this, so let's try to solve a simpler problem.
Consider a query in following form:
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": { "match_all": {} },
            "script_score": {
              "script": "<some_script>"
            }
          }
        },
        {
          "match": {
            "message": "this is a test"
          }
        }
      ]
    }
  }
}
The script can return an arbitrary number (it could return something like 12392002).
How do I make sure that the result from the script will not dominate the overall score?
Is there any way to normalize it? For example, instead of the script score, return the ratio to max_script_score (achieved by the document with the highest score)?
Recently I have been working on a problem like this too. I couldn't find any formal documentation about this issue, but when I investigated the results with the Explain API, it seemed that "queryNorm" is not applied to the score coming directly from the "functions" field. This means that you cannot directly normalize the script value.
However, I think I found a somewhat tricky solution to this problem. If you combine the function with a query, as you do (a match_all query), and give a boost to that query, normalization is applied to that query. That is, the multiplication of the two scores (from the normalized query and from the script) gives us a total normalization. For a better explanation, the query would look like this:
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": { "match_all": { "boost": 1 } },
            "functions": [
              {
                "script_score": {
                  "script": "<some_script>"
                }
              }
            ],
            "score_mode": "sum",
            "boost_mode": "multiply"
          }
        },
        {
          "match": {
            "message": "this is a test"
          }
        }
      ]
    }
  }
}
This answer is not a proper solution to your problem, but I think you can play with this query to obtain the required result. My suggestion is to use the Explain API, try to understand what it returns, examine the parameters affecting the final score, and play with the script and boost values to get an optimized solution.
By the way, a "rescore query" may help a lot in obtaining that 30%/70% ratio on the final score:
Official documentation
As far as I have searched, there is no way to get a normalized score out of Elasticsearch. You will have to hack around it by making two queries. The first is a pilot query (preferably with size 1, but all other attributes the same) that fetches the max_score. Then you can fire your actual query and use function_score to normalize the score: pass the max_score you got from the pilot query in params to function_score and use it to normalize every score. Refer: this article snippet
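To make the two-query idea concrete, here is a hedged sketch of the second query, assuming the pilot query reported a max_score of 42.0 and a pre-5.x Elasticsearch where script_score accepts a script string plus params (newer versions nest the script differently):
{
  "query": {
    "function_score": {
      "query": { "match": { "message": "this is a test" } },
      "script_score": {
        "script": "_score / max_score",
        "params": { "max_score": 42.0 }
      },
      "boost_mode": "replace"
    }
  }
}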

elasticsearch splits by space in facets

I am trying to do a simple facet request over a field containing more than one word (e.g. 'Name1 Name2', sometimes with dots and commas inside), but what I get is...
"terms" : [{
"term" : "Name1",
"count" : 15
},
{
"term" : "Name2",
"count" : 15
}]
so my field value is split on spaces before the facet request runs...
Query example:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
  "query": {
    "query_string": {
      "fields": ["dataset"],
      "query": "2",
      "default_operator": "AND"
    }
  },
  "facets": {
    "test": {
      "terms": {
        "field": ["speciesName"],
        "size": 50000
      }
    }
  }
}'
Your field shouldn't be analyzed, or at least not tokenized. You need to update your mapping and then reindex if you want to index the field without tokenizing it.
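For reference, a sketch of what such a mapping could look like in the pre-2.0 API this question targets (index, type, and field names taken from the query above); note that changing an existing field's analysis generally means recreating the index and reindexing:
curl -XPUT http://my_server:9200/idx_occurrence/Occurrence/_mapping -d '{
  "Occurrence": {
    "properties": {
      "speciesName": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}'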
First of all, javanna provided a very good answer from a practical perspective. However, for the sake of completeness, I want to mention that in some cases there is a way to do it without reindexing the data.
If the speciesName field is stored and your queries produce a relatively small number of results, you can use script_field to retrieve the stored field values:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
  "query": {
    "query_string": {
      "fields": ["dataset"],
      "query": "2",
      "default_operator": "AND"
    }
  },
  "facets": {
    "test": {
      "terms": {
        "script_field": "_fields['\''speciesName'\''].value",
        "size": 50000
      }
    }
  }
}'
As a result of this query, elasticsearch will retrieve the speciesName field for every record in your result set and it will construct facets from these values. Needless to say, if your result set contains millions of records, performance of this query might be sluggish.
Similarly, if the field is not stored but the record source is, you can use a script_field facet to retrieve field values from the source:
......
"script_field": "_source['\''speciesName'\'']",
......
Again, the source for each record in the result list will be retrieved and parsed, so you might need some patience when running this query on a large set of records.