How to assign a suitable science category to keywords, using DBPedia and SPARQL? - sparql

I have some keywords like emotion perception ability, students’ motivation, self-efficacy. The goal is to map these keywords to a corresponding category(-ies) of psychology. In this case I know apriori that the answer is Educational psychology, however I want to get the same answer using DBPedia ontologies.
Using the following query I am able to extract different branches of psychology and corresponding abstracts:
SELECT DISTINCT ?subject ?abstract
WHERE {
?concept rdfs:label "Branches of psychology"#en .
?concept ^dct:subject ?subject .
?subject dbo:abstract ?abstract .
}
LIMIT 100
Now I want to add some OPTIONAL clause that would compare my keywords (using OR) with terms from the abstract (dbo:abstract). Is it possible to do this using SPARQL? Or should I use SPARQL just to obtain the abstracts and then to make all further text processing using e.g. Java or Python?
Also, the ideas of some other approaches that might be useful to reach the goal are highly appreciated.

You could retrieve data as text using sparql, but deciding if a text match a query should be done with text data analysis technics, or text mining
This is a whole science, but fortunately lots of libraries for many languages (including Java and Python) exist in order to implement related algorithms. Here is a list of software on wikipedia. NLTK is well-known for the job, and have a Python binding.
In your case, i think of many ways, but i'm far from an expert, so my idea can be wrong:
Create a corpus of abstracts of each desired categories (educational psychology,…), and, for a given abstract A, compare A to each abstract of each abstract of each category C. The result of the comparison will give for each category a score/likelihood that A belongs to C. (cf fuzzy sets)
The comparison could be implemented with vector space model, that works on vocabulary similarities.
Named Entities Recognition could help to detect names of authors, technics or tools related to particular category.
Main idea is the following: once you defined what is the particular trait of each category, by using its vocabulary, authors, references or whatever, you can decide, for any abstract, a score of membership to all categories.
So, the real question to ask is which scoring function should i use ?.
The answer depends heavily of the data, and the results you want. When you say an abstract is about educational psychology, you have to know why. And then implement it as a scoring function.
As a side node, i add that neural networks, by training on the corpus, could, maybe, bypass the scoring, by automatic learning. I don't know enough in that field to say more.

Related

Retrieving all Wikipedia articles about people

I'm trying to retrieve all articles from Wikipedia that are about people. More specifically, I'm looking for:
only the page title (and perhaps the page ID)
of articles that are about people,
separated by gender (for the sake of simplicity, male and female),
from the current English Wikipedia.
There are several things I've tried, none of which have worked out:
The Wikipedia API lets me search for all pages in a given category. However, searching in "Men" or "Women" fetches mostly subcategory pages, and pages about actual people are buried further down the subcategory hierarchy. I can't find a way to auto-traverse the hierarchy.
PetScan lets me specify a hierarchy depth, but requests time out with a depth of more than 3. Also, like the Wikipedia API, results include articles that aren't about people.
Wikidata lets me write SPARQL queries to search for entities that have a gender of "male" or "female". This example seems to work, but once I include entity names in the query, it times out. Also, I'm not sure how exactly this data corresponds to Wikipedia articles — is this data guaranteed to be the same as on Wikipedia?
What's the best way to achieve what I'm looking for?
I've created a SPARQL-query doing the work. It's important to keep the query as simple as possible (for query optimisation read: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization). Here is the query for SPARQL: https://w.wiki/JhK
For articles of woman this might work with the Wikidata Query Service (WQS), though it's hard on the edge of timing out. So for the male articles (there are more) you need to add a LIMIT and step through it by adding an increasing OFFSET. The WQS seams to still timeout, but there are other endpoints to Wikidata, this one is limited to 100.000 results, but works with increasing OFFSET: https://wikidata.demo.openlinksw.com/sparql
The resulting SPARQL query is something like this:
SELECT ?sitelink
WHERE {
?item wdt:P21 wd:Q6581097;
wdt:P31 wd:Q5.
?sitelink schema:about ?item;
schema:isPartOf <https://en.wikipedia.org/>.
}
LIMIT 100000 OFFSET 100000

WikiData, locating entities with a specific type or subtype, located in a specific city

My Specific problem is that I have a place called "Beacon Theater". What I want to find is the best match for this in Wikidata.
A Wikidata Search will give me three results:
Live at the Beacon Theater (Q6656601)
Beacon Theatre (Q264186): performing arts venue
Beacon Theaters (Q19110809)
The first one is a movie, the second it the correct result, and the third is a Supreme Court decision.
Using this API call, I can find the id's for all three:
https://www.wikidata.org/w/api.php?action=query&format=json&list=search&srsearch=Beacon Theater
Next step is getting the details for each of these. I use this call to get the information for all three entities
"https://www.wikidata.org/w/api.php?action=wbgetentities&props=descriptions|labels|claims&ids=Q6656601|Q264186|Q19110809&languages=en&format=json"
At this point, I want to iterate over them and find the one that is a building. I may also later want to add a way to find the one located in New York.
My problem is that the correct answer is not a building (Q41176). The P31 value is Q3469910, which is a Performance Arts Venue, so I can't really sort on that (Imagine in the future I use this code to search for a museum. A museum is also a building, but not a Performing Arts Venue. The search for Beacon Theater is just an example.
So question: How can I find the correct entry, which for the purposes of this question I define as:
Being a building (or perhaps being derived from a Building)
Optional Answer: Being located in New York (In case of multiple hits, this would further limit the results)
I think I need to do a SPARQL query as the second query to do this, but from the examples I could not quite figure out how, or if that would be the correct/easiest way. Maybe even a SPARQL query that would do all of the above in one query?
In your case an exact label match might already do
SELECT DISTINCT ?loc ?locLabel ?locDescription
WHERE
{
values ?locLabel {"Beacon Theater"#en }
?loc rdfs:label ?locLabel .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"}
}
try it!
I have a project where I meet the same kind of problems but for books, which can also be comic books, or manga etc. The easiest solution I found was to keep a list of "alias entities", that is, entities that could be considered at match when looking for a book. It's not as dynamic as a SPARQL query and requires periodical updates - adding newly found matching entities, removing problematic ones - but that's so much faster and satisfies most of my needs.

How do triple stores use linked data?

Lets say I have the following scenario:
I have some different ontology files hosted somewhere on the web on different domains like _http://foo1.com/ontolgy1.owl#, _http://foo2.com/ontology2.owl# etc.
I also have a triple store in which I want to insert instances based on the ontology files mentioned like this:
INSERT DATA
{
<http://foo1.com/instance1> a <http://foo1.com/ontolgy1.owl#class1>.
<http://foo2.com/instance2> a <http://foo2.com/ontolgy2.owl#class2>.
<http://foo2.com/instance2x> a <http://foo2.com/ontolgy2.owl#class2x>.
}
Lets say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
And after the insert if I run a SPARQL query like this:
select ?a
where
{
?a rdf:type ?type.
?type rdfs:subClassOf* <http://foo2.com/ontolgy2.owl#class2> .
}
the result would be:
<http://foo2.com/instance2>
and not:
<http://foo2.com/instance2>
<http://foo2.com/instance2x>
as it should be. This is happening because the ontology file _http://foo2.com/ontolgy2.owl# is not imported into the triple store.
My question is:
Can we talk in this example about "linked" data? Because it seems to me that it is not linked at all. It has to be imported locally into a triple store, and after that you can start querying.
Lets say if you want to run a query on some complex data that is described by 20 ontology files, then all 20 ontology files would need to be imported.
Isn't this a bit disappointing?
Do I misunderstood triple stores and linked data and how they work together?
as it should be.
I'm not certain that should is the right term here. The semantics of the SPARQL query is to query the data stored in a particular graph stored at the endpoint. IRIs are more or less opaque identifiers; just because they might also be URLs from which additional data can be retrieved doesn't obligate any particular system to actually do that kind of retrieval. Doing that would easily make query behavior unpredictable: "this query worked yesterday, why doesn't it work today? oh, a remote website is no longer available…".
Lets say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
Remember, since IRIs are opaque, anyone can define a term in any ontology. It's always possible for someone else to come along and say something else about a resource. You have no way of tracking all that information. For instance, if I go and write an ontology, I can declare http://foo2.com/ontolgy2.owl#class2x as a class and assert that it's equivalent to http://dbpedia.org/ontology/Person. Should the system have some way to know about what I did someplace else, and even if it did, should it be required to go and retrieve information from it? What if I made an ontology that's 2GB in size? Surely your endpoint can't be expected to go and retrieve that just to answer a quick query?
Can we talk in this example about "linked" data? Because it seems to
me that it is not linked at all. It has to be imported locally into a
triple store, and after that you can start querying.
Lets say if wan to run a query on some complex data that is describe
by 20 ontology files, in this case I have to import all 20 ontology
files.
This is usually the case, and the point about linked data is that you have a way to get more information if you choose to, and that you don't have to do as much work in negotiating how to identify resources in that data. However, you can use the service keyword in SPARQL to reference other endpoints, and that can provide a type of linking. For instance, knowing that DBpedia has a SPARQL endpoint, I can run a local query that incorporates DBpedia with something like this:
select ?person ?localValue ?publicName {
?person :hasLocalValueOfInterest ?localValue
service <http://dbpedia.org/sparql> {
?person foaf:name ?publicName
}
}
You can use multiple service blocks to aggregate data from multiple endpoints; you're not limited to just one. That seems pretty "linked" to me.

Linking related topics IR

How to link terms(keywords entities) which have some relation among them through text documents . Example is of google when you search for a person it shows recommendations of other people related to that person .
In this picture it figured out spouse , presidential candidate , and equal designation
I am using frequency count technique . The more two terms occur in same document the more chance of them to have some relation. But this also links unrelated terms like pagemarks , verbs and page refences in a text document .
How should I improve it and is there any other easy but reliable technique ?
You should look a few techniques
1.) Stop word filtering: it is common in text mining two filter words which are typically not very important as they are two frequent. Like the, a, is and so on. There are predefined dictionaries.
2.) TF/IDF: TF/IDF re-weights words on how much they separate documents.
3.) Named Entity Recognition: For your task at hand it might be sufficient to just focus on the names. Named entity recognition can extract names from documents
4.) Linear Dirichlet Allocation: LDA finds concept in documents. A concept is a set of words which frequently appear together.

Yago ontology for entity disambiguation

I am using the propriety rdfs:type equal to dbpedia-owl:Organisation for selecting (obviously) organizations on my SPARQL query:
SELECT ?s
WHERE {
?s a dbpedia-owl:Organisation .
} LIMIT 10
I would like to consider the YAGO ontology for increasing my performance on getting real organizations. For example, the FBI (http://dbpedia.org/resource/Federal_Bureau_of_Investigation) is not considered as a dbpedia-owl:Organisation but is tagged as yago:Organization108008335 .
Note the "random" (at least for me) number in the end of the class name. Does anyone know what this number stands for? How do I suppose to know it a priori?
Moreover, when I look for more classes with this format (using the query below), I can find two more classes: http://dbpedia.org/class/yago/Organization108008335, http://dbpedia.org/class/yago/Organization101008378, http://dbpedia.org/class/yago/Organization101136519
SELECT DISTINCT ?t WHERE {
?s a ?t
FILTER(regex(str(?t), "http://dbpedia.org/class/yago/Organization\\d+"))
}
Are they different? Why aren't they all "yago:Organization". Should I expect "new" organization classes as new versions of YAGO ontologies are made available? Is there any other class I should consider when selecting Organizations?
I have been digging into this lately, so I'll try to answer all your questions one by one:
Note the "random" (at least for me) number in the end of the class name. Does anyone know what this number stands for? How do I suppose to know it a priori?
That number corresponds to the synset id of the word in Wordnet. For example, if you look up wordnet_organization_101136519 in wordnet (the URI in the dbpedia is not resolvable at this moment, maybe they have changed something in the last releases), you will see that it has a synsetID "101136519". I don't think you can know it a priori without looking into wordnet.
Are they different? Why aren't they all "yago:Organization".
They are different because they have a different definition in wordnet. For example:
wordnet_organization_101136519: "the activity or result of distributing or disposing persons or things properly or methodically 'his organization of the work force was very efficient'". Example of an instance: Bogo-Indian_Defence. See more details here
wordnet_organization_101008378: "the act of organizing a business or an activity related to a business 'he was brought in to supervise the organization of a new department'". Example of an instance: Adam_Smith_Foundation. See more details here
If you follow the links I provided you can see more differences and common similarities.
Should I expect "new" organization classes as new versions of YAGO ontologies are made available?
When they generated Yago they associated every word in wordnet to a URI. If more words about organizations are added, then I guess that you'll have more definitions. However it is impossible to know beforehand.
Is there any other class I should consider when selecting Organizations?
You can look for all the classes with the label "organization" in wordnet and then add optionals to your query (or issue one query per class retrieving the different organizations you are interested in). These are the classes with the "organization" label in Wordnet.
I hope it helps.