Fetching Wikidata labels in other languages from reconciled column - OpenRefine

I want to use Wikidata reconciliation to translate a column of terms into various languages by fetching the labels in those languages. Using SPARQL, I'd filter a query for labels by language (this is the approach suggested in various similar cases). I don't see how to do the same using OpenRefine reconciliation, however.
Maybe the problem is that the Wikidata API is language-specific?

Say that you want to fetch labels in Italian, which has language code it. You can do that by entering Lit in the property input. You can also fetch descriptions with Dit or aliases with Ait. To fetch these terms in other languages, replace it with other language codes.
This is only documented at https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation so far. I acknowledge that we need more visible documentation for this (ideally it should be easily accessible from OpenRefine's user interface, given that the reconciliation service comes preconfigured in OpenRefine).
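For reference, here is a minimal sketch of fetching labels in several languages directly from the Wikidata API (action=wbgetentities) with Python, outside of OpenRefine. The API takes a languages parameter, so a single request can cover all target languages; the item id and language codes below are just examples.

    import requests

    def fetch_labels(qid, languages=("it", "de", "fr")):
        """Return {language_code: label} for one Wikidata item."""
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={
                "action": "wbgetentities",
                "ids": qid,
                "props": "labels",
                "languages": "|".join(languages),
                "format": "json",
            },
        )
        labels = resp.json()["entities"][qid]["labels"]
        return {lang: info["value"] for lang, info in labels.items()}

    # fetch_labels("Q220") -> {"it": "Roma", "de": "Rom", "fr": "Rome"}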

Related

Wikipedia API Extraction of abstracts in 2 languages

I am trying to connect 2 API queries.
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=&explaintext=&titles=Albert+Einstein&format=json
Where I search for article descriptions and
https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&lllang=de&titles=Companion%20dog
Where I retrieve the name of the article in another language (here German).
Is there a way to connect them to retrieve description data both in English and German?
I have tried connecting them via "generators" but I don't seem to understand how to apply them here.
I also tried issuing another query after extracting names in the two languages (searching for descriptions). However, the names are sometimes formatted in ways that I cannot reuse in the query.
No. The description is a snippet from the start of the article. If you want a German description, you need to get it from the German Wikipedia (i.e. a different API endpoint).
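A hedged sketch of chaining the two calls in Python: resolve the German title via langlinks on the English endpoint, then fetch the extract from the German endpoint. The langlinks response contains the exact German title, so it can be reused in the second query without reformatting.

    import requests

    def get_extract(host, title):
        # prop=extracts with exintro/explaintext returns the plain-text intro.
        resp = requests.get(f"https://{host}/w/api.php", params={
            "action": "query", "prop": "extracts", "exintro": 1,
            "explaintext": 1, "titles": title, "format": "json",
        })
        pages = resp.json()["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")

    def get_german_title(title):
        # prop=langlinks with lllang=de returns the article's German title.
        resp = requests.get("https://en.wikipedia.org/w/api.php", params={
            "action": "query", "prop": "langlinks", "lllang": "de",
            "titles": title, "format": "json",
        })
        pages = resp.json()["query"]["pages"]
        links = next(iter(pages.values())).get("langlinks", [])
        return links[0]["*"] if links else None

    title = "Albert Einstein"
    english = get_extract("en.wikipedia.org", title)
    german_title = get_german_title(title)
    german = get_extract("de.wikipedia.org", german_title) if german_title else ""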

SpaCy different Language Models

I'm making some progress :) developing my little OCR project.
I was wondering if my idea is possible in this case!
After extracting the text from images (OCR), I use NLP (spaCy) to identify two entity types (LOCation and PERson). I write them to a dictionary and later to JSON data. That works well.
Now I'm wondering if I can improve my identified entities.
One way I can imagine is to use the right language model for each text.
I have various texts in German, English, Spanish and French.
At the moment I'm using a single fixed model, but I have no idea how to fit langdetect into this.
Have a great week!
Greets
Here is a link that you might find useful when it comes to detecting a language (there are multiple options, including langdetect): How to detect language.
You can create a dictionary of the languages you plan to detect and match it against langdetect's output. I guess you have the rest sorted out.
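A minimal sketch of that approach, assuming the four small spaCy models below are installed (the model names and the LOC/PER filter are illustrative, and label sets differ between models, e.g. the English model uses PERSON/GPE):

    from langdetect import detect
    import spacy

    # Map langdetect ISO codes to spaCy model names.
    MODELS = {
        "de": "de_core_news_sm",
        "en": "en_core_web_sm",
        "es": "es_core_news_sm",
        "fr": "fr_core_news_sm",
    }

    _loaded = {}  # cache, so each model is loaded only once

    def extract_entities(text):
        lang = detect(text)                        # e.g. "de"
        name = MODELS.get(lang, "en_core_web_sm")  # fall back to English
        if name not in _loaded:
            _loaded[name] = spacy.load(name)
        doc = _loaded[name](text)
        wanted = {"PER", "PERSON", "LOC", "GPE"}
        return [(ent.label_, ent.text) for ent in doc.ents if ent.label_ in wanted]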

REST API filter operator best practice

I am building a REST API that uses a filter parameter to control search results. E.g., one could search for a user by calling:
GET /users/?filter=name%3Dfoo
Now, my API should allow many different filter operators: numeric operators such as equals, greater than, and less than; string operators like contains, begins with, or ends with; and date operators such as year of or timediff. Moreover, AND and OR combinations should be possible.
Basically, I want to support a subset of the underlying MySQL database operators.
I found a lot of different implementations (two good examples are Google Analytics and LongJump) that seem to use custom syntax.
Looking at my requirements, I would probably design a custom syntax pretty similar to the MySQL operator syntax.
However, I was wondering if there are any best practices established that I should follow and whether I should consider anything else. Thanks!
Use an already existing query language; don't try to reinvent the wheel! With REST this is a complicated and not fully solved issue. There are some REST constraints your application must fulfill:
uniform interface / hypermedia as the engine of application state:
You have to send hypermedia responses to your clients, and they have to follow the hyperlinks given in those responses, instead of building the requests on their own. So you can decouple the clients from the structure of the URI.
uniform interface / self-descriptive messages:
You have to send messages annotated with semantics, so you can decouple the clients from the data structure. The best solution for this is RDF with, for example, linked open data vocabs. If you don't want to use RDF, the second-best solution is a vendor-specific MIME type; your messages will be self-descriptive, but the clients need to know how to parse your custom MIME type.
To describe simple search links, you can use URI templates; for example, GET /users/{?name} expects a name parameter in the query string. You can use hydra:IriTemplateMapping from the Hydra vocab to add semantics to parameters like name.
Describing ad-hoc queries is a hard task. You have to describe somehow what your query can contain.
You can choose a URI query language and stick with URI templates and probably Hydra annotation. There are many already existing URI query languages, like HTSQL, OData query (people don't like that one), etc.
You can choose an existing query language and send it in a single URI param. This can be anything you want, for example SQL, SPARQL, etc. You have to teach your client to generate that param. You can create your own vocab to describe the constraints of the actual query. If you don't need complicated things, this should not be a problem. I don't know of any existing vocabs for describing query structures, but I never looked for them...
You can choose an existing query language and send it in the body of a SEARCH request. AFAIK SEARCH is not cached or supported by recent HTTP clients; it was defined by WebDAV. You can describe your query with the proper MIME type, and you can use the same vocab as in the previous solution.
You can use an RDF query solution, for example a SPARQL endpoint, or triple pattern fragments, etc. That way your queries will contain the semantic metadata, not your link descriptions. With SPARQL you don't necessarily need triple data storage; you can translate the queries on the server side to SQL, or whatever you use. You can probably use SPIN to describe query constraints and query templates, but that is new to me too. There might be other solutions to describe SPARQL query structures...
So to summarize: if you want a real REST solution, you have to describe to your clients how they can construct the queries and what parameters and logical operators they can use. Without query descriptions they won't be able to generate, for example, an HTML form for the user. If you don't want a REST solution, then pick a query language, write a builder on the client, write a parser on the server, and that's all.
The Open Data Protocol (OData).
You can check BreezeJS too and see how this protocol is implemented for node.js + MongoDB with the breeze-mongodb module, and for a .NET project using Web API and Entity Framework with the Breeze.ContextProvider dll.
By embracing a set of common, accepted delimiters, equality comparison can be implemented in a straightforward fashion. Setting the value of the filter query-string parameter to a string using those delimiters creates a list of name/value pairs which can be parsed easily on the server side and used to enhance database queries as needed. You can use the delimiters of your choice, say "|" to separate individual filter phrases for OR, "&" to separate individual filter phrases for AND, and a double colon ("::") to separate names from values. This provides a unique-enough set of delimiters to support the majority of use cases and creates a user-readable query-string parameter.
A simple example will serve to clarify the technique. Suppose we want to request users with the name "Todd" who live in "Denver" and have the title of "Grand Poobah". The request URI, complete with query string, might look like this:
GET http://www.example.com/users?filter="name::todd&city::denver&title::grand poobah"
The double colon ("::") separates the property name from the comparison value, enabling the comparison value to contain spaces, which makes it easier to parse the delimiter from the value on the server. Note that the property names in the name/value pairs match the names of the properties that would be returned by the service in the payload.
Case sensitivity is certainly up for debate on a case-by-case basis, but in general, filtering works best when case is ignored. You can also offer wildcards as needed, using the asterisk ("*") as the value portion of a name/value pair.
For queries that require more than simple equality or wildcard comparisons, introducing operators is necessary. In this case, the operators themselves should be part of the value and parsed on the server side, rather than part of the property name. When complex query-language-style functionality is needed, consider introducing the query concepts from the Open Data Protocol (OData) Filter System Query Option specification (http://www.odata.org/documentation/odata-version-4-0/).
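A minimal server-side parsing sketch for the delimiter scheme above (Python; the function name and the AND/OR handling are illustrative, assuming a filter uses only one combinator at a time):

    from urllib.parse import unquote

    def parse_filter(filter_param):
        """Parse 'name::todd&city::denver' into ([(name, value), ...], combinator)."""
        filter_param = unquote(filter_param).strip('"')
        combinator = "OR" if "|" in filter_param else "AND"
        sep = "|" if combinator == "OR" else "&"
        pairs = []
        for phrase in filter_param.split(sep):
            name, _, value = phrase.partition("::")
            pairs.append((name.strip(), value.strip()))
        return pairs, combinator

    # parse_filter('name::todd&city::denver&title::grand poobah')
    # -> ([('name', 'todd'), ('city', 'denver'), ('title', 'grand poobah')], 'AND')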
There seem to be a lot of standards (like OData), but many are quite complicated in that they introduce new syntax.
For simple multi-filtering, the following format avoids polluting the parameter namespace while still standing on top of existing web technology:
GET /users?filter[name]=John&filter[title]=Manager
It's easily readable, and on the backend languages like PHP will receive it as an array of filters to apply.
A possible standard would be SCIM, which is adopted by some commercial products. But it's not distinguished by brevity. For a pet project I used this operator set:
= equal
! not equal
* like
< smaller
> greater
& bitwise and 
| bitwise or
^ bitwise xor
~ in comma separated value list
Examples
So GET /user?name=*An* would get all users whose name starts with An, and GET /user?name=~Anna,Bertha would get those two users.
Not yet a standard but who knows...
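A hedged sketch of translating that operator set into parameterized SQL conditions (Python; column names must still be whitelisted separately, and the bitwise operators are omitted because & and | collide with query-string syntax):

    def to_sql_condition(column, raw):
        """Turn e.g. ('name', '*An*') into ('name LIKE %s', ['An%'])."""
        prefixes = ("=", "!", "*", "<", ">", "~")
        op, value = (raw[0], raw[1:]) if raw[:1] in prefixes else ("=", raw)
        if op == "~":  # value list: ~Anna,Bertha -> IN (%s, %s)
            items = value.split(",")
            return f"{column} IN ({', '.join(['%s'] * len(items))})", items
        if op == "*":  # leading '*' marks LIKE; remaining '*' become SQL '%'
            return f"{column} LIKE %s", [value.replace("*", "%")]
        sql_op = {"=": "=", "!": "<>", "<": "<", ">": ">"}[op]
        return f"{column} {sql_op} %s", [value]

    # to_sql_condition("name", "*An*")         -> ("name LIKE %s", ["An%"])
    # to_sql_condition("name", "~Anna,Bertha") -> ("name IN (%s, %s)", ["Anna", "Bertha"])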

How to design a REST API with LIKE criteria?

I'm designing a REST API and have an entity for "people":
GET http://localhost/api/people
Returns a list of all the people in the system
GET http://localhost/api/people/1
Returns the person with id 1.
GET http://localhost/api/people?forename=john&surname=smith
Returns all the people with matching forenames and surnames. But I have a further requirement: what is the cleanest / best-practice way of allowing API consumers to retrieve, for example, all the people whose forename starts with "jo"?
I've seen some APIs do this like:
GET http://localhost/api/people?forename=jo~&surname=smith
where the tilde signifies a "fuzzy" match. On the other hand, I've seen it implemented with totally different criteria, e.g.
GET http://localhost/api/people?forename-startswith=jo&surname=smith
which seems a bit cumbersome considering I might have -endswith, -contains, -soundslike (for some sort of soundex match).
Can anyone suggest from experience which works better, and point to examples of well-designed REST APIs with similar functionality?
IMHO it does not matter whether you have fuzzy matches or -endswith, -contains, etc. What matters is whether your REST API permits easy parsing of such parameters, so that you can define functions to fetch data from your data source (DB, XML file, etc.) accordingly.
If you are using PHP: from my experience, the Slim Framework is a great lightweight, easy-to-get-started solution.
I would recommend the OData protocol, which provides Query String Options. What you did is OK and follows REST conventions.
But the OData protocol describes a $expand parameter and even a $filter parameter. The $ prefix denotes "System Query Options", and you will be interested in the last one because it allows you to write the following URI:
http://services.odata.org/Northwind/Northwind.svc/Customers?$filter=tolower(CompanyName) eq 'foobar'&$select=FirstName,LastName&$orderby=Name desc
It allows you to pass SQL like data, it can be a nice alternative to what you described (both solutions are fine, it's just a matter of taste).
AFAIK, neither of the above is quite RESTful. Both rely on a priori knowledge on the client's part of how to invoke queries (in the first case a query pattern, in the second a query DSL). In the second example, in fact, the API is reduced to a mere wrapper around the data store. As such, the API does not define a server domain; it is a data provider. This is in contrast to the client-server constraint of REST.
If you need to expose a full-blown data store with all its various querying capabilities, you had better stick to known standards, such as OData. OData has been sold as REST, but many REST-heads have problems with it. Anyhow, at the end of the day it works, and REST discussions can commonly lead to analysis paralysis.
If I were doing this, I would probably constrain the API to a common use case, so something more like the second one but without defining a query DSL (hence forenameStartsWith rather than forename-startswith).
Having said that, if you need to query based on many fields and various conditions, I would use OData.
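A minimal sketch of that constrained approach (Python/Flask with a hypothetical SQLite people table; the parameter and column names are illustrative):

    from flask import Flask, request, jsonify
    import sqlite3

    app = Flask(__name__)

    @app.route("/api/people")
    def people():
        prefix = request.args.get("forenameStartsWith")
        surname = request.args.get("surname")
        query, params = "SELECT id, forename, surname FROM people WHERE 1=1", []
        if prefix:
            query += " AND forename LIKE ?"
            params.append(prefix + "%")  # escape '%' and '_' in real code
        if surname:
            query += " AND surname = ?"
            params.append(surname)
        rows = sqlite3.connect("people.db").execute(query, params).fetchall()
        return jsonify([{"id": r[0], "forename": r[1], "surname": r[2]} for r in rows])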
Both examples use query parameters for filtering. I don't think it matters what these query parameters are called or if some wildcard syntax is used.
Both approaches are equally RESTful.

What is the easiest way to implement terms association mining in Solr?

Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a co-occurrence matrix of docs × terms and find terms that occur in the same documents most often. In my previous projects I implemented it directly in Lucene by iterating over TermDocs (obtained by calling IndexReader.termDocs(Term)). But I can't see anything similar in Solr.
So, my needs are:
To retrieve the most associated terms within a particular field.
To retrieve the term that is closest to the specified one within a particular field.
I will rate answers in the following way:
Ideally I would like to find a Solr component that directly covers the specified needs, that is, something to get associated terms directly.
If this is not possible, I'm seeking a way to get co-occurrence matrix information for a specified field.
If this is not an option either, I would like to know the most straightforward way to 1) get all terms and 2) get the ids (numbers) of the documents these terms occur in.
You can export a Lucene (or Solr) index to Mahout, and then use Latent Dirichlet Allocation. If LDA is not close enough to LSA for your needs, you can just take the correlation matrix from Mahout, and then use Mahout to take the singular value decomposition.
I don't know of any LSA components for Solr.
Since there are still no answers to my question, I have to write up my own thoughts and accept them. Nevertheless, if someone proposes a better solution, I'll happily accept it instead of mine.
I'll go with the co-occurrence matrix, since it is the principal part of association mining. In general, Solr provides all the functions needed to build this matrix in some way, though they are not as efficient as direct access with Lucene. To construct the matrix we need:
All terms, or at least the most frequent ones, because rare terms won't affect the result of association mining by their nature.
Documents where these terms occur; again, at least the top documents.
Both these tasks may be easily done with standard Solr components.
To retrieve terms, the TermsComponent or faceted search may be used. We can get only the top terms (by default) or all terms (by setting the max number of terms to take; see the documentation of the particular feature for details).
Getting documents containing a given term is simply a search for that term. The weak point here is that we need one request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides information about the count of occurrences of the term in a found document.
Having this, it is easy to build the co-occurrence matrix, as sketched below. To mine associations, it is possible to use other software like Weka, or write your own implementation of, say, the Apriori algorithm.
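A hedged sketch of the matrix-building step, assuming the term-to-document-id postings have already been fetched from Solr by one of the means above:

    from itertools import combinations
    from collections import Counter

    def cooccurrence(postings):
        """postings maps each term to the set of ids of documents containing it.
        Returns a Counter over term pairs, counting shared documents."""
        matrix = Counter()
        for t1, t2 in combinations(sorted(postings), 2):
            shared = len(postings[t1] & postings[t2])
            if shared:
                matrix[(t1, t2)] = shared
        return matrix

    # cooccurrence({"solr": {1, 2}, "lucene": {2, 3}, "mahout": {3}})
    # -> Counter({('lucene', 'mahout'): 1, ('lucene', 'solr'): 1})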
You can get the count of occurrences of the current term in a found document with the following query:
http://ip:port/solr/someinstance/select?defType=func&fl=termfreq(field,xxx),*&fq={!frange l=1}termfreq(field,xxx)&indent=on&q=termfreq(field,xxx)&sort=termfreq(field,xxx) desc&wt=json