OpenRefine: after reconciliation, I want to retrieve the title of Wikidata items in other languages - openrefine

In the same way that I can obtain geographic coordinates for, say, a list of cities, I'd like to create columns in OpenRefine with the names of those cities in different languages (Venice, Venezia, Wenecja, Venedig…). Is it possible? Apparently there is no property like that in Wikidata.

You can fetch those using special property codes in OpenRefine. To fetch the label in a given language, find out the language code for it, and then prepend L to it. For instance, Lit will give you the label in Italian.
You can also do Dfr (French description) or Ade (German aliases).
This is documented at https://wikidata.reconci.link/
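To make the naming rule concrete, here is a minimal Python sketch that only assembles the property codes described above (prefix L, D or A followed by the language code). The helper name special_property and the list of languages are my own choices for illustration; the resulting strings are what you would type into OpenRefine's "Add columns from reconciled values" dialog.

```python
# Prefixes taken from the answer above: L = label, D = description, A = aliases.
PREFIXES = {"label": "L", "description": "D", "aliases": "A"}


def special_property(kind: str, language_code: str) -> str:
    """Build a special property code such as 'Lit' (hypothetical helper)."""
    return PREFIXES[kind] + language_code


# Codes for the city-name columns from the question (English, Italian, Polish, German).
codes = [special_property("label", lang) for lang in ("en", "it", "pl", "de")]
print(codes)  # ['Len', 'Lit', 'Lpl', 'Lde']
```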

Related

Is it possible to find the Wikidata ID by property, e.g. Grid ID

Wikidata stores the Grid ID of educational institutions in property P2427 (https://www.wikidata.org/wiki/Property:P2427). Is it possible to retrieve the Wikidata ID for a given Grid ID?
Example: I know the value "grid.1012.2" as Grid ID.
I would like to get Q1517021 back!
I tried
https://www.wikidata.org/w/api.php?action=wbgetclaims&format=json&property=P2427&claim=grid.1012.2
without success...
You can use the search API with the haswbstatement keyword: action=query&format=json&list=search&srsearch=haswbstatement%3AP2427%3Dgrid.1012.2
That said, as others have suggested, SPARQL might be easier for this kind of lookup.
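As a rough sketch, the request URL can be assembled in Python like this. It only builds strings and does not call the API. The function names are mine; the SPARQL text is an assumption about how the equivalent lookup would look on the Wikidata Query Service, where the wdt: prefix is predefined.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def build_search_url(property_id: str, value: str) -> str:
    """Return a search URL that uses the haswbstatement keyword."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": f"haswbstatement:{property_id}={value}",
    }
    # urlencode percent-encodes ':' and '=' inside the srsearch value.
    return f"{API_ENDPOINT}?{urlencode(params)}"


def build_sparql_query(property_id: str, value: str) -> str:
    """Return a SPARQL query for the same lookup (assumed equivalent)."""
    return f'SELECT ?item WHERE {{ ?item wdt:{property_id} "{value}" . }}'


url = build_search_url("P2427", "grid.1012.2")
sparql = build_sparql_query("P2427", "grid.1012.2")
print(url)
print(sparql)
```

The matching item ID should then appear as the title of each entry under query.search in the JSON response.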

SPARQL best approach for generating flat data for search

We have a triple store of information such as drugs and I'm unsure how I can extract this information to make it available so that it can be indexed by our search engine Elasticsearch. I had envisaged that I would run a SPARQL query to extract the following information:
Title
Body
Href
Please note that the triple store does not contain the above structure; it is a lot more complicated than that.
One of the requirements is to be able to format the Titles using different triples from the triple store so for example for drugs something like this would be needed:
Paracetamol | Introduction | Drug
(Paracetamol refers to the drug name, Introduction is a subsection, and Drug refers to the type.)
For the body I was thinking of extracting all the text values from all the triples related to drugs.
And for the href, I would simply use the URI of the resource (the drug).
I would then convert this information to JSON-LD so that it can be indexed by Elasticsearch. In the end the JSON-LD will simply contain the title, body and href.
So my question is: is SPARQL the right approach for what I want to do, or should I look at a different approach to extract the data I need, given the requirements above?
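To illustrate the target shape, here is a minimal Python sketch of the flattening step, assuming the query results have already been reduced to simple (subject, predicate, object) rows. The predicate names (name, section, type, text), the example.org URI and the sample text are all made up for illustration; a real store would use its own vocabulary, and this emits plain JSON rather than full JSON-LD.

```python
import json
from collections import defaultdict

# Hypothetical rows, as they might come back from a SPARQL SELECT.
TRIPLES = [
    ("http://example.org/drug/paracetamol", "name", "Paracetamol"),
    ("http://example.org/drug/paracetamol", "section", "Introduction"),
    ("http://example.org/drug/paracetamol", "type", "Drug"),
    ("http://example.org/drug/paracetamol", "text", "Paracetamol is used to treat pain."),
]


def flatten(triples):
    """Group triples by subject and build one {title, body, href} dict per resource."""
    by_subject = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        by_subject[subject][predicate].append(obj)

    docs = []
    for uri, props in by_subject.items():
        # Title: first value of each title part, joined with ' | '.
        parts = [props[key][0] for key in ("name", "section", "type") if props.get(key)]
        docs.append({
            "title": " | ".join(parts),
            "body": " ".join(props.get("text", [])),
            "href": uri,
        })
    return docs


docs = flatten(TRIPLES)
print(json.dumps(docs, indent=2))
```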

Hibernate Search: how to query for embedded entities

I would like to use Hibernate Search to implement a sophisticated autosuggestion feature across multiple input fields on a web page.
Each input field is for its own entity, let's say Country and City. There is a many-to-one relationship between the two entities (countries contain cities).
The autosuggestion should work such that when you type, e.g., a country name prefix and the city field is already filled, you get only suggestions for countries that have such a city (and vice versa).
The server-side autosuggestion service should return a list of projections (entityId, entityName), which are rendered into the input field (dropdown, whatever).
According to the schema and after having read the manual I tried the following index schema:
SearchMapping mapping = new SearchMapping();
mapping.analyzerDef(...
    .entity(City.class).indexed().indexName("MyIndex")
        .property("cityId", ElementType.FIELD)
            .documentId()
                .name("id")
        .property("name", ElementType.FIELD)
            .documentId()
                .name("id")
        .property("country", ElementType.METHOD)
            .indexEmbedded()
    .entity(Country.class).indexed()
        .property("id", ElementType.FIELD)
            .documentId()
                .name("id")
        .property("name", ElementType.METHOD)
            .field()
                .name("name")
This mapping defines City to be the main entity, right?
I have indexed all cities and am able to query for them (also by combining both fields). However, I only get matches when querying for cities, i.e. when querying like this:
fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(City.class).get();
This is not useful for the country field, because when I type in "Spain", I get a single row for each city of Spain (Spain, Spain, Spain, Spain, ... ;-)).
The question is: How is it possible to search for country entities? Changing the index structure? The indexing procedure? Or how to query?
The only way I found was to set up a facet for country and use the different facet values as autosuggestions. However, this is also not perfect, since it is not possible to sort facets alphabetically.
Of course, in this example I could switch both entities in the mapping, but consider scenarios with more complex entity graphs.
UPDATE: adding queries requested in comment
For building queries, I employ the QueryBuilder. The following produces a result set like in the Spain example:
fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(City.class).get();
with query:
country.name:Spain
If I try to use a query builder for countries
fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(Country.class).get();
and query:
name:Spain
I get no results.
You are not showing your actual query. You don't have to use the query DSL; you can also write native Lucene queries. In both cases (DSL or native Lucene) you can combine queries via boolean logic. Embedded entities follow the JavaBean notation: in a city query, for example, the country name is reached as country.name. Again, without your actual query it is hard to give more specific feedback.
Last but not least, facets can also be sorted alphabetically. Check the FacetSortOrder enum: besides COUNT_DESC, it offers FIELD_VALUE, which sorts facets by field value.
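To show what the boolean combination over dotted field paths looks like in native Lucene query syntax, here is a small Python sketch that only assembles the query string. The field names country.name and name come from the question; the helper all_of is made up, and it assumes plain single-word terms with no characters that need escaping.

```python
def all_of(clauses: dict) -> str:
    """Join field:term clauses so that every clause is required (Lucene '+' prefix)."""
    return " ".join(f"+{field}:{term}" for field, term in clauses.items())


# Cities named Madrid whose embedded country is Spain.
query = all_of({"country.name": "Spain", "name": "Madrid"})
print(query)  # +country.name:Spain +name:Madrid
```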

Pattern for searching entire DB record, not specific field

More and more, I'm seeing searches that not only find a substring in a specific column, but appear to search in all columns. An example is Amazon, where you can search for "Arnold" and it finds both the movie The Running Man starring Arnold Schwarzenegger and the Gund toy Arnold the Snoring Pig. I don't know what the term is for this type of search (Wide search? Global search?), and that bugs me. But what I really want to know is what the normal pattern is for accomplishing this type of search in a QUICK way.
The obvious, and slow, way to do it would be to search for the substring "Arnold" in the title, "Arnold" in the author, "Arnold" in the description, etc.
The first quick solution that comes to mind is to store a mapping for each word used to describe a product to the product itself, and then search that word mapping. That could be quick, but doesn't seem very space-efficient to me.
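The word-to-product mapping described here is usually called an inverted index. A minimal in-memory Python sketch, with made-up product records and helper names, might look like this; note that it matches whole words rather than arbitrary substrings.

```python
import re
from collections import defaultdict

# Hypothetical product records; every field is searchable.
PRODUCTS = {
    1: {"title": "The Running Man", "description": "Movie starring Arnold Schwarzenegger"},
    2: {"title": "Arnold the Snoring Pig", "description": "Gund plush toy"},
    3: {"title": "Garden Hose", "description": "Flexible hose for the garden"},
}


def build_index(products):
    """Map each lowercased word to the set of product ids that contain it."""
    index = defaultdict(set)
    for product_id, fields in products.items():
        for text in fields.values():
            for word in re.findall(r"\w+", text.lower()):
                index[word].add(product_id)
    return index


def search(index, word):
    """Return the sorted ids of products containing the word in any field."""
    return sorted(index.get(word.lower(), set()))


index = build_index(PRODUCTS)
print(search(index, "Arnold"))  # [1, 2]
```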
There are probably a hundred ways to accomplish this, some of which probably don't even use a database. But what is the norm?
I've done this in the past by storing an XML version of items in an XML column in the table, then searching in that column instead of the others.
Maybe they're not storing the data the way you expect.
They could, for example, store all titles, authors, descriptions, and every other searchable field in one table with an attribute to distinguish the field's type.
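A minimal sketch of that single-table layout, using Python's built-in sqlite3 module with an in-memory database. The table name, column names and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per searchable value; field_type records which field it came from.
conn.execute(
    "CREATE TABLE searchable_text (product_id INTEGER, field_type TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO searchable_text VALUES (?, ?, ?)",
    [
        (1, "title", "The Running Man"),
        (1, "actor", "Arnold Schwarzenegger"),
        (2, "title", "Arnold the Snoring Pig"),
        (2, "brand", "Gund"),
    ],
)


def search(term):
    """Return ids of products with the term in any searchable field."""
    rows = conn.execute(
        "SELECT DISTINCT product_id FROM searchable_text "
        "WHERE content LIKE ? ORDER BY product_id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


print(search("Arnold"))  # [1, 2]
```

A single query now covers every field, and the field_type column still lets you filter or weight by field when needed.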

Searching and sorting by language

I am testing Lucene.NET for our search requirements, and I've got a couple of questions.
We have documents in XML format. Every document contains multi-language text. The number of languages and the languages themselves vary from document to document. See the example below:
<document>This is a sample document, which is describing a <word lang="de">tisch</word>, a <word lang="en">table</word> and a <word lang="en">desk</word>.</document>
The keywords of a document are tagged with a special element and language attribute.
When creating the Lucene index, I extract the text content from the XML along with pairs of language and keyword (I am not sure if I have to), like this:
This is a sample document, which is describing a tisch, a table and a desk.
de - tisch
en - table
en - desk
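The extraction step above can be sketched in Python with the standard library's xml.etree.ElementTree (this is an illustration of the idea, not Lucene.NET code; the function name extract is made up):

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<document>This is a sample document, which is describing a '
    '<word lang="de">tisch</word>, a <word lang="en">table</word> '
    'and a <word lang="en">desk</word>.</document>'
)


def extract(xml_text):
    """Return the plain text and the (language, keyword) pairs of one document."""
    root = ET.fromstring(xml_text)
    text = "".join(root.itertext())  # all text, tags stripped
    keywords = [(word.get("lang"), word.text) for word in root.iter("word")]
    return text, keywords


text, keywords = extract(SAMPLE)
print(text)
print(keywords)  # [('de', 'tisch'), ('en', 'table'), ('en', 'desk')]
```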
I don't know exactly how to create an index that lets me search for, for example:
- all the documents that contain the word tisch in German (and not documents that contain the word tisch in other languages).
I also want to specify sorting at runtime:
I want to sort by a user-specified language order (depending on the user interface). For example, if we have two documents:
<document>This is a sample document, which is describing a <word lang="de">tisch</word>.</document>
<document>This is another sample document, which is describing a <word lang="en">table</word>.</document>
and a user on an English interface searches for "tisch OR table", I want to get the second result first.
Any information or advice is appreciated.
Many thanks!
You have a design decision to make, where the options are:
Use a single index, where each document has a field for each language it uses, or
Use M indexes, M being the number of languages in the corpus.
If you use the multi-index approach, it will be easier to restrict search to a specific language or set of languages: just search the indexes for those languages and skip the others. Sorting by language also becomes easier. Therefore, if you do not have an "AND" search that requires keywords from different languages to appear in the same document, I would suggest the M-index approach.
Based on your example, I assume that the part of the documents not specially tagged is in English. If so, you can add the document text to the English index as a separate field; the other indexes need only store a document id, which will make them lighter.
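To illustrate the M-index idea without tying it to the Lucene.NET API, here is a small in-memory Python sketch: one keyword index per language, an OR search restricted to the requested languages, and results ordered by the user's language preference. All names are made up, and a real implementation would use one Lucene index per language instead of dictionaries.

```python
from collections import defaultdict

# indexes[lang][keyword] -> set of document ids (one "index" per language)
indexes = defaultdict(lambda: defaultdict(set))


def add_document(doc_id, keywords):
    """Register (language, keyword) pairs of a document in the per-language indexes."""
    for lang, word in keywords:
        indexes[lang][word.lower()].add(doc_id)


def search(words, language_order):
    """OR-search the given languages; hits from earlier languages come first."""
    results = []
    for lang in language_order:
        hits = set()
        for word in words:
            hits |= indexes[lang].get(word.lower(), set())
        for doc_id in sorted(hits):
            if doc_id not in results:
                results.append(doc_id)
    return results


add_document(1, [("de", "tisch")])   # first sample document
add_document(2, [("en", "table")])   # second sample document

# English interface: English hits first, then German.
print(search(["tisch", "table"], ["en", "de"]))  # [2, 1]
# Only German: "tisch" tagged in other languages would not match.
print(search(["tisch"], ["de"]))  # [1]
```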