Retrieve Wikidata ID candidates based on a partial name match

I have some entities in a specific language and I am trying to retrieve the possible IDs from Wikidata that match those names.
For example, I have some German name, let's say "Ministerium für Auswärtige Angelegenheiten" and I can get the top N candidate IDs that correspond to the name like this:
SELECT ?item
WHERE
{
?item rdfs:label "Ministerium für Auswärtige Angelegenheiten"@de
}
LIMIT 2
and this will give me 2 candidate IDs.
The issue that I have is, if I have a name that contains some inflection, then the exact match won't be in the database and nothing will be returned.
Even in the current example with the name: "Ministerium für Auswärtige Angelegenheiten", if I remove the word "für", I won't get any results returned.
Is there a way to make the search more flexible and return the closest results to the query, even if they are incorrect?
P.S. I am doing it through Python, using the SPARQLWrapper

Not using the Wikidata Query Service SPARQL endpoint, if I am not mistaken.
For similar use cases, the full-text search engine might be workable. Take a look at a search query in the API Sandbox, which returns some relevant results.
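As a rough illustration of that suggestion, here is a minimal Python sketch that sends the name to the MediaWiki search module (`action=query` with `list=search`) on wikidata.org instead of to the SPARQL endpoint. The helper names (`build_search_params`, `extract_ids`, `search_candidates`) and the User-Agent string are made up for this example, and it assumes the default search namespace returns items, so that each hit's `title` is a Q-id.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_params(name, limit=5):
    """Build query parameters for the MediaWiki full-text search (list=search)."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": name,   # free-text query; does not need to match a label exactly
        "srlimit": limit,   # top-N candidates
        "format": "json",
    }


def extract_ids(payload):
    """Collect the page titles (Q-ids, assuming item pages) from a search response."""
    return [hit["title"] for hit in payload.get("query", {}).get("search", [])]


def search_candidates(name, limit=5):
    """Return up to `limit` candidate IDs for a possibly inexact name."""
    url = API_URL + "?" + urllib.parse.urlencode(build_search_params(name, limit))
    # Placeholder User-Agent for this sketch; replace it with one describing your client.
    request = urllib.request.Request(
        url, headers={"User-Agent": "candidate-lookup-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_ids(json.load(response))


# Example call (performs a network request), here with the word "für" dropped:
# search_candidates("Ministerium Auswärtige Angelegenheiten", limit=2)
```

The returned IDs are only candidates ranked by the search engine, so they may still need to be checked against labels or types afterwards.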

Related

SPARQL triple filter is not an exact match

I am using an IBM wrapper solution around SPARQL to get information from our database. I set a triple variable to act as a filter, but it doesn't return an exact match, only a 'contains' match.
More specifically, we are looking at requirements that live in a collection. The SPARQL query returns all requirement objects and the collection that they live in. Each collection has a unique identifier associated with it which is accessed via the predicate 'dcterms:identifier'. The exact line in the SPARQL code which does this is:
?oslc_rm_RequirementCollection1_uri dcterms:identifier ?oslc_rm_RequirementCollection1_identifier
This works as expected. In the output I get a table containing each collection with the list of requirements associated with each one.
The problem arises when I want to look at requirements in only a specific collection. To do this I set the variable oslc_rm_RequirementCollection1_identifier in IBM's wrapper, and it generally works. If I enter '18732' it only shows me requirements from the collection with id 18732. However, this is not an exact match, only a 'contains' match. For example, if I enter '867', I am shown two collections: 867 and 38674.
How can I modify this to exclude 38674 and only show the exact match? I cannot use a string literal because the wrapper does not allow this.

Is there a way to sort SPARQL query results by relevance score in MarkLogic 8?

We're running SPARQL queries on some clinical ontology data in our MarkLogic server. Our queries look like the following:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cts: <http://marklogic.com/cts#>
SELECT *
FROM <http://example/ontologies/snomedct>
WHERE {
?s rdfs:label ?o .
FILTER cts:contains(?o, cts:word-query("Smoke*", "wildcarded"))
}
LIMIT 10
We expected to get results sorted by relevance score, but instead they seemed to be in some random order. We tried many variations of the query but nothing worked. After some research we found this statement in the MarkLogic docs:
When understanding the order an expression returns in, there are two
main rules to consider:
cts:search expressions always return in relevance order (the most relevant to the least relevant).
XPath expressions always return in document order.
Does this mean that cts:contains is an XPath expression that always returns in document order? If that's the case, how can we construct a SPARQL query that returns in relevance order?
Thanks,
Kevin

In the example you have, the language you are using is SPARQL, with cts:contains acting as a fragment filter.
In this case, the cts:contains is only useful in isolating fragment IDs that match, thus filtering the candidate documents used in the SPARQL query. Therefore, I do not believe that the cts relevance is taken into account.
However, you could possibly get the results you are looking for in a different way: do an actual cts:search on the documents in question, then filter them using a cts:triple-range-query.
https://docs.marklogic.com/cts:triple-range-query

Linked Geo Data seems to contain only a few of all housenumbers of OSM. How can it be?

I have a question regarding the retrieval of house numbers from OpenStreetMap through the Linked Geo Data SPARQL endpoint:
If I run a query to select all the buildings with a house number located in, for example, the city of Esino Lario, I get only 11 results, whereas in OSM it seems that many buildings in that village have an explicitly indicated house number (compare the query result with the actual view from OSM).
Even if I expand the search to the whole world, to include all possibilities, I get only 1,176,404 results, which does not seem reasonable at all (here is the query result).
The queries used are the following:
Select *
Where {
?s
<http://linkedgeodata.org/ontology/addr%3Acity> "Esino Lario";
<http://linkedgeodata.org/ontology/addr%3Ahousenumber> ?housenumber.
}
and
Select (count(?housenumber) as ?n)
Where { ?s <http://linkedgeodata.org/ontology/addr%3Ahousenumber> ?housenumber }
It seems that the LGD database is either incomplete or not updated as of Jan 2016 as declared. Or, more plausibly, I am making a mistake in the query or do not properly understand the ontology used.
Can somebody please help me unravel this mystery?

Query DBpedia for countries with human development index

I'm trying to query for all Countries in DBpedia and get their human development index.
The query I am trying is:
SELECT *
WHERE {
?Country a <http://dbpedia.org/ontology/Country> .
?Country <http://dbpedia.org/ontology/humanDevelopmentIndex> ?humanDevelopmentIndex .
}
LIMIT 1000
Would anyone be able to explain why this query isn't returning any results? It seems straightforward to me.

You're not getting anything back because apparently, none of the countries in DBpedia actually have a humanDevelopmentIndex property associated with them.
You can verify this for yourself. If you simplify your query to just get back countries:
SELECT *
WHERE {
?Country a <http://dbpedia.org/ontology/Country> .
}
LIMIT 1000
You will get back a list of countries, so clearly it is the addition of the other property pattern that causes the query to not match any results. Also, if you take a look at the data for, say, Australia in DBpedia, you will not find the property you want there.
The reason it doesn't appear is that the data you want is probably located in the ontology_infobox_properties or the ontology_infobox_properties_specific dataset. These are not exposed in the public endpoint, but you can download them.

Lucene exact ordering

I've had a long-term issue with not quite understanding how to implement a decent Lucene sort or ranking. Say I have a list of cities and their populations. If someone searches "new" or "london", I want the list of prefix matches ordered by population, and I have that working with a prefix search and a reversed sort on the population field, i.e. New Mexico, New York; or London, Londonderry.
However I also always want the exact matching name to be at the top. So in the case of "London" the list should show "London, London, Londonderry" where the first London is in the UK and the second London is in Connecticut, even if Londonderry has a higher population than London CT.
Does anyone have a single query solution?

dlamblin, let me see if I get this correctly: you want to make a prefix-based query, then sort the results by population, and maybe combine the sort order with a preference for exact matches.
I suggest you separate the search from the sort and use a CustomSorter for the sorting:
Here's a blog entry describing a custom sorter.
The classic Lucene book describes this well.

The API for SortComparator says:
"There is a distinct Comparable for each unique term in the field - if some documents have the same term in the field, the cache array will have entries which reference the same Comparable"
You can apply a FieldSortedHitQueue to the SortComparator, which has a Comparator field for which the API says:
"Stores a comparator corresponding to each field being sorted by."
Thus the term can be sorted accordingly.

My current solution is to create an exact searcher and a prefix searcher, both sorted by reverse population, and then copy out all my hits starting from the exact hits, moving to the prefix hits. It makes paging my results slightly more annoying than I think it should be.
Also I used a hash to eliminate duplicates but later changed the prefix searcher into a boolean query of a prefix search (MUST) with an exact search (MUST NOT), to have Lucene remove the duplicates. Though this seemed even more wasteful.
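The ordering this workaround produces (exact hits first, then the remaining prefix hits, each group by descending population) can be written as one composite sort key. The sketch below is plain Python over in-memory records, not the Lucene API; the function name `rank_cities`, the record layout, and the population figures are invented for illustration only.

```python
def rank_cities(cities, term):
    """Order prefix matches so exact name matches come first, then by population."""
    needle = term.casefold()
    matches = [c for c in cities if c["name"].casefold().startswith(needle)]
    # Composite key: exact matches first (False sorts before True),
    # then larger populations first within each group.
    return sorted(
        matches,
        key=lambda c: (c["name"].casefold() != needle, -c["population"]),
    )


# Illustrative data only; the population numbers are made up.
cities = [
    {"name": "Londonderry", "region": "Northern Ireland", "population": 300},
    {"name": "London", "region": "Connecticut", "population": 100},
    {"name": "London", "region": "UK", "population": 900},
    {"name": "Paris", "region": "France", "population": 500},
]

ranked = rank_cities(cities, "london")
for city in ranked:
    print(city["name"], city["region"], city["population"])
```

Because the exact-match flag is part of the key, no separate deduplication step is needed, and paging can slice the single sorted list.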
Edit: Moved to a comment (since the feature now exists): Yuval F Thank you for your blog post ... How would the sort comparator know that the name field "london" exactly matches the search term "london" if it cannot access the search term?

