Extract all properties and values from entities from a large set of heterogeneous RDF datasets - semantic-web

Is there existing work that can extract all properties and values of entities from a large set of heterogeneous RDF datasets, such as the LOD cloud?
Example: I would like to extract all properties and values of all cities from all datasets of the whole LOD cloud.
Problem: ontology alignment. For example, a city is represented in DBpedia by http://dbpedia.org/ontology/City, but in Wikidata by https://www.wikidata.org/wiki/Q515.

Written on the website https://lod-cloud.net/#about
The raw data is available here
You got downvotes because SO is not a place for questions that show no research or thinking; otherwise, we'd be spammed with thoughtless questions.
Answering comment
The ontologies have to be merged, perhaps by declaring mappings with owl:equivalentClass (for classes) or owl:sameAs (for individuals).
http://schema.org is an established project that attempts to solve this problem by providing one shared master vocabulary.
The Open Geospatial Consortium (OGC) also tries to standardize geospatial terms.
There is also data mining research on merging data (including ontologies), but I don't know it well enough to give exact advice.
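To make the merging idea concrete, here is a minimal sketch in plain Python, assuming the class equivalences have already been collected as pairs (standing in for owl:equivalentClass statements). The function names, the `ex:` identifiers and the toy triples are all made up for illustration. The class IRIs are the ones quoted in the question; real Wikidata RDF typically uses the http://www.wikidata.org/entity/ form and its own "instance of" property rather than rdf:type, so treat this as an outline rather than a working LOD pipeline.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBO_CITY = "http://dbpedia.org/ontology/City"
WD_CITY = "https://www.wikidata.org/wiki/Q515"

# Hypothetical equivalence declarations (stand-ins for owl:equivalentClass).
EQUIVALENT_CLASSES = [(DBO_CITY, WD_CITY)]

# Toy triples from two imaginary datasets; subjects and values are invented.
TRIPLES = [
    ("ex:a", RDF_TYPE, DBO_CITY),
    ("ex:a", "ex:name", "A-town"),
    ("ex:b", RDF_TYPE, WD_CITY),
    ("ex:b", "ex:label", "B-ville"),
    ("ex:c", RDF_TYPE, "ex:River"),
    ("ex:c", "ex:name", "C-stream"),
]


def build_canonical_map(pairs):
    """Union-find over class IRIs: map every IRI to one representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smaller IRI as the representative.
            parent[max(ra, rb)] = min(ra, rb)
    return {iri: find(iri) for iri in list(parent)}


def properties_by_entity(triples, target_class, canonical):
    """Collect (property, value) pairs for every instance of target_class
    or of any class declared equivalent to it."""
    target = canonical.get(target_class, target_class)
    members = {
        s for s, p, o in triples
        if p == RDF_TYPE and canonical.get(o, o) == target
    }
    result = defaultdict(list)
    for s, p, o in triples:
        if s in members and p != RDF_TYPE:
            result[s].append((p, o))
    return dict(result)


canonical = build_canonical_map(EQUIVALENT_CLASSES)
cities = properties_by_entity(TRIPLES, DBO_CITY, canonical)
```

Asking for either class IRI returns the same entities, which is the point of declaring the equivalence once instead of special-casing each dataset.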

Related

Where can I find various existing ontologies regarding certain aspects?

I need to create, possibly by reusing (parts of) existing ontologies, an ontological model of certain aspects (data communication, data processing, data storage, etc.) of a distributed system (platform, framework, ...) used in the context of big data. Significant concepts, relations, restrictions, and individuals should be considered; a real software product such as Hadoop or Git Large File Storage might be taken into account as an example. Does anyone know of ontologies that describe one of the above or any other distributed system?
I don't know a specific vocabulary for that, but there are sites out there that can help you find what you need, e.g. http://lov.okfn.org/dataset/lov/

How to mix an RDBMS with a graph DB

I am developing a website using Django and PostgreSQL that is expected to hold a huge amount of data, like that gathered by social networking sites.
I need to use an RDBMS with SQL for tabular data, where query complexity is low, and also a graph DB with Cypher for large data, where query complexity is high.
Please let me know how to go about this, and whether it is feasible.
EDIT: Clarification, as asked in the comments:
The database structure can be similar to that of a social network like Facebook. I've checked the FB Engineering page for their open graph. For a graph DB, I can find only Neo4j with proper ACID guarantees, though I would prefer an open source graph DB. I need the graph DB basically to summarize a huge volume of data about relationships such as friends, updates, and daily user-related updates as individual relations. Horizontal scalability is important to me for future upgrades.
I intend to use PostgreSQL for base informational data and push the relational data updates to the graph DB, much as Facebook uses both MySQL and open graph.
Based on your replies to my questions, I would first suggest looking at TitanDB. I believe it fulfills many of your requirements:
It is open source.
It scales horizontally.
In addition to meeting your requirements, it has existed for quite some time and many companies are using it in production. The only thing you would have to get used to is that it uses TinkerPop traversals, not Cypher queries. Also note that I believe Titan is not ACID for most backends. This is a result of it being horizontally scalable.
If you would like a more structured (but significantly less mature) approach to graph DBs, you can look at the stack that some colleagues and I are working on, MindmapsDB, which sits on top of Titan but uses a more SQL-like query language.
OrientDB Gremlin is also a very good option but lacks the maturity and support of Titan.
There are many other graph vendors out there, such as DSE Graph, IBM Graph, etc., but the ones I have listed above are the open source ones I have worked with.

Best practice for storing a neural network in a database

I am developing an application that uses a neural network. Currently I am looking at putting it into either an SQL-based relational database (probably SQL Server) or a graph database.
From a performance viewpoint, the neural net will be very large.
My questions:
Do relational databases suffer a performance hit when dealing with a neural net in comparison to graph databases?
What graph-database technology would be best suited to dealing with a large neural net?
Can a geospatial database such as PostGIS be used to represent a neural net efficiently?
That depends on how you intend the model to evolve.
Do you have a fixed idea of an immutable structure for the network, like a Kohonen map or an off-the-shelf model?
Do you have several relationship structures you need to test out, so that you want to be able to flip a switch to alternate between them?
Does your model treat the nodes as fluid automatons, free to seek their own neighbours, where each automaton develops unique characteristic values for a common set of parameters and you need to analyse how those values affect their "choice" of neighbours?
Do you have a fixed set of parameters for a fixed number of types/classes of nodes? Or is a node expected to develop a unique range of attributes and relationships?
Do you frequently need to access each node, especially those embedded deep in the network layers, to analyse and correlate them?
Is your network perceivable as, or quantizable into, a set of state machines?
Disclaimer
First of all, I need to disclaim that I am familiar only with Kohonen maps. (And I admit to having been derided because Kohonen maps are regarded as entry-level, barely neural networks at all.) The above questions are the result of personal thought experiments over the years, prompted by random and not very well-informed reading about various neural schemes.
Category vs Parameter vs Attribute
Can we classify vehicles by the number of wheels or by tonnage? Should wheel quantity or tonnage be attributes, parameters, or category characteristics?
Understanding this debate is a crucial step in structuring your repository. This debate is especially relevant to disease and patient vectors. I have seen relational schemata for patient information, designed by medical experts but obviously without much training in information science, that presume a common set of parameters for every patient, with thousands of columns, mostly unused, for each patient record. And when they exceed the column limit for a table, they create a new table with thousands more sparsely used columns.
Type 1: All nodes have a common set of parameters and hence a node can be modeled into a table with a known number of columns.
Type 2: There are various classes of nodes. There is a fixed number of classes of nodes. Each class has a fixed set of parameters. Therefore, there is a characteristic table for each class of node.
Type 3: There is no intent to pigeon-hole the nodes. Each node is free to develop and acquire its own unique set of attributes.
Type 4: There is a fixed number of classes of nodes. Each node within a class is free to develop and acquire its own unique set of attributes. Each class has a restricted set of attributes a node is allowed to acquire.
Read up on the EAV (entity-attribute-value) model to understand the issue of parameters vs attributes. In an EAV table, a node needs only three characterising columns:
node id
attribute name
attribute value
However, under the constraints of the technology, an attribute could be a number, string, enumerable or category. Therefore, there would be four more attribute tables, one for each value type, plus the node table:
node id
attribute type
attribute name
attribute value
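As an illustration of the three-column EAV layout, here is a minimal sketch using Python's built-in sqlite3 module. The table name, function names and sample attributes are invented for the example. SQLite is dynamically typed, so a single untyped value column can hold both numbers and strings; a stricter engine would need the per-type attribute tables described above.

```python
import sqlite3


def make_eav_store():
    """Create an in-memory EAV table: one row per (node, attribute)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE node_attributes (
            node_id         INTEGER NOT NULL,
            attribute_name  TEXT    NOT NULL,
            attribute_value,
            PRIMARY KEY (node_id, attribute_name)
        )""")
    return conn


def set_attribute(conn, node_id, name, value):
    """Add or overwrite a single attribute of a node."""
    conn.execute(
        "INSERT OR REPLACE INTO node_attributes VALUES (?, ?, ?)",
        (node_id, name, value),
    )


def attributes_of(conn, node_id):
    """Return all attributes a node has acquired, as a dict."""
    rows = conn.execute(
        "SELECT attribute_name, attribute_value "
        "FROM node_attributes WHERE node_id = ?",
        (node_id,),
    )
    return dict(rows.fetchall())


conn = make_eav_store()
set_attribute(conn, 1, "bias", 0.5)
set_attribute(conn, 1, "activation", "sigmoid")
set_attribute(conn, 2, "bias", -1.0)
```

Node 1 and node 2 carry different attribute sets without any unused columns, which is what distinguishes this layout from the wide sparse tables criticised earlier.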
Sequential/linked access versus hashed/direct-address access
Do you have to access individual nodes directly rather than traversing the structural tree to get to a node quickly?
Do you need to find a list of nodes that have acquired a particular trait (set of attributes) regardless of where they sit topologically on the network? Do you need to perform classification or dimensionality reduction (e.g. principal component analysis) on the nodes of your network?
State-machine
Do you wish to perceive the regions of your network as a collection of state-machines?
State machines are very useful quantization entities. State-machine quantization helps you form empirical entities over a range of nodes based on neighbourhood similarities and relationships.
Instead of trying to understand and track the individual behaviour of millions of nodes, why not clump them into regions of similarity and track the state-machine flow of those regions?
Conclusion
This is my recommendation. You should start with a fully relational database. The reason is that a relational database and its associated SQL provide a very liberal view of relationships. With SQL on a relational model, you can inquire into or correlate relationships that you did not know existed.
As your experiments progress, you might find certain relationships better suited to modeling in a network-graph repository; you should then move those parts of the schema to such a repository.
In the final state of affairs, I would maintain a dual-mode information repository. You store the dynamically mutating structure in a network-graph repository, but each node refers to a node id in a relational database, which keeps track of nodes and their attributes and lets you query nodes by attributes and their values. For example,
SELECT a.id
FROM Nodes a
JOIN NumericAttributes b ON a.id = b.id
WHERE a.attributeName = $name
AND b.value BETWEEN $low AND $high
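A minimal sketch of that dual-mode idea follows, with several assumptions: a plain dict of adjacency sets stands in for the network-graph repository, an in-memory SQLite table stands in for the relational side, and the schema is simplified to a single NumericAttributes table with an attributeName column. All names and sample values are illustrative only.

```python
import sqlite3

# Graph side: node id -> set of neighbour ids (stand-in for a graph database).
EDGES = {1: {2, 3}, 2: {3}, 3: set()}


def make_attribute_db():
    """Relational side: numeric attributes keyed by the same node ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE NumericAttributes "
        "(id INTEGER, attributeName TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO NumericAttributes VALUES (?, ?, ?)",
        [(1, "weight", 0.2), (2, "weight", 0.7), (3, "weight", 0.9)],
    )
    return conn


def nodes_in_range(conn, name, low, high):
    """Ids of nodes whose attribute `name` lies in [low, high]."""
    rows = conn.execute(
        "SELECT id FROM NumericAttributes "
        "WHERE attributeName = ? AND value BETWEEN ? AND ? ORDER BY id",
        (name, low, high),
    )
    return [row[0] for row in rows]


def neighbours_of_matches(conn, edges, name, low, high):
    """Select nodes relationally, then look up their structure in the graph."""
    return {
        node_id: sorted(edges.get(node_id, ()))
        for node_id in nodes_in_range(conn, name, low, high)
    }


attribute_db = make_attribute_db()
matches = neighbours_of_matches(attribute_db, EDGES, "weight", 0.5, 1.0)
```

The relational query answers "which nodes have this trait", regardless of topology, and the shared node id is the only coupling between the two stores.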
I wonder whether Hadoop could be used instead of a traditional network-graph database, but I don't know how well Hadoop adapts to dynamically changing relationships. My understanding is that Hadoop is good for write-once, read-many workloads, so a dynamic neural network with frequent relationship changes may not perform well on it, whereas a relational table modeling network relationships is not efficient either.
Still, I believe I have only raised questions you need to consider rather than given you a definite answer, especially as my knowledge of many of these concepts is rusty.
Trees can be stored in a table by using self-referencing foreign keys. I'm assuming the only two things that need to be stored are topology and the weights; both of these can be stored in a flattened tree structure. Of course, this can require a lot of recursive selects, which depending on your RDBMS may be a pain to implement natively (thus requiring many SQL queries to achieve). I cannot comment on the comparison, but hopefully that helps with the relational point of view :)
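Where the database supports recursive common table expressions, the "many recursive selects" can be collapsed into one query. The following is a minimal sketch with Python's built-in sqlite3 module, assuming, as the answer does, that the topology is tree-shaped and that each row stores the weight of the connection from its parent. The table and column names are invented for the example.

```python
import sqlite3


def make_tree_db():
    """Nodes table with a self-referencing foreign key (parent_id -> id)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE nodes (
            id        INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES nodes(id),
            weight    REAL
        )""")
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [(1, None, None), (2, 1, 0.5), (3, 1, -0.25), (4, 2, 1.0)],
    )
    return conn


def subtree(conn, root_id):
    """Return (id, depth) for root_id and all of its descendants."""
    rows = conn.execute("""
        WITH RECURSIVE sub(id, depth) AS (
            SELECT id, 0 FROM nodes WHERE id = ?
            UNION ALL
            SELECT n.id, sub.depth + 1
            FROM nodes n JOIN sub ON n.parent_id = sub.id
        )
        SELECT id, depth FROM sub ORDER BY depth, id""", (root_id,))
    return rows.fetchall()


tree_db = make_tree_db()
whole_tree = subtree(tree_db, 1)
```

A general neural network is a graph rather than a tree, so a node with several incoming connections would need a separate edge table instead of a single parent_id column.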

Graph Database: TinkerPop/Blueprints vs W3C Linked data

Looking for an infrastructure for network analysis of heterogeneous networks (multiple node types (multi-mode), multiple edge types (multi-relation), and multiple descriptive features (multi-featured)), I've noticed that there are two standard stacks in the graph database world:
On one hand we have the TinkerPop/Blueprints property graph model. It is supported by Neo4j, OrientDB GraphDB, Dex, Titan, InfiniteGraph, etc.
The TinkerPop stack includes the Blueprints property graph model interface, the Gremlin graph traversal language, and the Furnace graph algorithms package.
On the other hand we have W3C's Linked Data technology stack, which is supported by AllegroGraph, 4store, Oracle Database Semantic Technologies, OWLIM, SYSTAP Bigdata, etc.
Semantic data is represented using RDF/RDFS/OWL and can be queried using SPARQL. On top of that, it offers rules and reasoning capabilities.
Now, suppose that I want to represent heterogeneous data in a graph database, and analyse such data (statistics, relations discovery, structure, evolution, etc.) (I know these terms are wide and vague) - What are the relative strengths of each model for various types of network analysis tasks? Do these two models complement each other?
A couple of things: your examples of linked data stacks are all triple stores. You would start building a linked data application by first getting your triple store set up, but calling a database a linked data stack is incorrect, in my opinion. That is also an incomplete list of triple stores; there are also Sesame, Jena, Mulgara, and Stardog. Sesame and Jena pull double duty: they are the two de facto standard Java APIs for the semantic web, and both provide triple stores that come bundled with the APIs. I also know that both Cray and IBM are working on triple stores, but I don't know much about either at this point. I do know that Stardog works well with the TinkerPop stack: it is basically a drop-in, after which you can start writing Gremlin queries against the RDF.
I think the strengths of RDF/OWL are that 1) you get a real query language, 2) they are W3C standards, and 3) you get reasoning, if the triple store supports it, for free (more or less -- you still have to write an ontology).
Because RDF/OWL/SPARQL are standards, it is quite easy to pick up and move to a new triple store with a different feature set should you need to: your data is already in a common format that everyone understands, and any application logic encoded as queries is completely portable. In most cases you would be writing against either the Sesame or Jena APIs, or working over the SPARQL protocol, so you might only need to change your config/init. I think that's a big win in the early prototyping phases.
I also think that RDF/OWL, especially combined with reasoning and the kinds of complex queries that SPARQL 1.1 allows, is well suited to building complicated analytic applications. Also, I think the common impression that RDF triple stores don't scale is no longer correct. Most triple stores at this point easily scale into the billions of triples and have very competitive throughput numbers as well.
So based on what I think you might be doing, I think semweb might be a better bet for you. I did a similar project a few years back using RDF & RDFS for the backend fronted by a simple Pylons based webapp and was very happy with the results.

Freely available example datasets of hierarchical information, and realistic names

I'm about to write some example applications and accompanying documents comparing ways of accessing information stored in relational databases. To demonstrate real-life requirements, I need to include a realistic dataset of hundreds of thousands of facts.
Is anyone aware of publicly available, free datasets of that magnitude, of datasets of human names with human-level variance, or hierarchical datasets of either large organizational hierarchies, or large hierarchical, categorized, product catalogues?
Please point me in the right direction, if you are.
Part 1, human names: http://timecenter.cs.aau.dk/software.htm
Part 2, hierarchical data: no answer yet
The Wikipedia dump is pretty massive: obligatory Wikipedia link.
Your own PC's directory tree is a large hierarchical structure with lots of facts. You probably have a few thousand "facts", which are file names, modification dates, sizes, extra OS info, etc.
If that's not large enough, find a server that you can log in to. That will be larger.
Not large enough? Get a web crawler and start crawling a big web site. That can be as large as you have the patience to crawl.
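For the directory-tree suggestion, a minimal sketch of turning a file system into hierarchical "fact" rows with Python's standard library might look like this. The function name and the choice of columns are assumptions; each row keeps the parent path so the hierarchy can be loaded into a self-referencing table.

```python
import os


def directory_facts(root):
    """Yield (path, parent_path, kind, size_bytes, mtime) for every
    directory and file found below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        for kind, names in (("dir", dirnames), ("file", filenames)):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    info = os.lstat(path)  # do not follow symlinks
                except OSError:
                    continue  # entry vanished or is unreadable; skip it
                yield (path, dirpath, kind, info.st_size, info.st_mtime)
```

Something like `rows = list(directory_facts(os.path.expanduser("~")))` would then give a dataset whose size depends entirely on the machine it runs on.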
http://dev.mysql.com/doc/sakila/en/sakila.html