Protocol for Web Description Resources - semantic-web

Is POWDER (http://www.w3.org/TR/powder-dr/) meant to be used for the Trust/Proof layers of the Semantic Web? Are there any other W3C standards made for the same layers?
Thank you in advance

Well, the upper layers are pretty much still science fiction, and no one knows whether they will work out exactly in the way suggested by the usual layer cake diagrams.
There are some standards that might be useful as building blocks for supporting these higher layers. POWDER is one of them. Named Graphs (as standardised in SPARQL, and currently under consideration as a native feature of RDF 1.1) is another one. The W3C Provenance Ontology is yet another one.
Then there are plenty of proposals and research ideas that are not yet ready for standardisation.
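As a toy illustration of why named graphs are a plausible building block for trust, here is a minimal, hypothetical quad store in plain Python. The class, method names, and example IRIs are all invented for the sketch; the point is only that grouping triples into named graphs lets you attach provenance to a *set* of statements rather than to each triple individually.

```python
# Toy quad store: triples are grouped into named graphs, and provenance
# (a building block for trust) is attached to the graph name.

class QuadStore:
    def __init__(self):
        self.quads = []          # (subject, predicate, object, graph)
        self.graph_meta = {}     # graph name -> provenance dict

    def add(self, s, p, o, graph):
        self.quads.append((s, p, o, graph))

    def describe_graph(self, graph, **provenance):
        self.graph_meta[graph] = provenance

    def triples_with_provenance(self):
        """Yield each triple together with the provenance of its graph."""
        for s, p, o, g in self.quads:
            yield (s, p, o), self.graph_meta.get(g, {})

store = QuadStore()
store.add("ex:Alice", "ex:knows", "ex:Bob", graph="ex:g1")
store.describe_graph("ex:g1",
                     source="http://example.org/feed",
                     asserted_by="ex:Crawler")

for triple, prov in store.triples_with_provenance():
    print(triple, prov)
```

A real system would of course use an RDF quad store and a vocabulary such as PROV-O for the provenance terms; this only shows the shape of the idea.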

Related

Is topic coherence (gensim CoherenceModel) calculated based exclusively on my corpus or external data as well?

I'm topic modeling a corpus of English 20th century correspondence using LDA and I've been using topic coherence (as well as silhouette scores) to evaluate my topics. I use gensim's CoherenceModel with c_v coherence and the highest I've ever gotten was a 0.35 score in all the models I've tested, even in the topics that make the most sense to me in qualitative evaluation, even after extensive pre-processing and hyperparameter comparison.
So I basically accepted that that's the best I'd get, but in order to write about it now I've been reading up on topic coherence, and I understand it's a pipeline that models human judgement. One thing I can't seem to find clear info on, though: is it based exclusively on calculations made on my corpus, or does it use some external data as well, like being trained on external corpora that might have nothing to do with my domain? Should I use u_mass instead?
Yes: except for u_mass, they all use external reference datasets. However, that may not be a bad thing, as those reference datasets provide richer information.
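For what it's worth, the u_mass measure can be computed from nothing but document co-occurrence counts in your own corpus, which is why it needs no external reference data. Here is a minimal sketch of the usual formulation (not gensim's exact implementation; the toy documents are invented):

```python
import math
from itertools import combinations

def u_mass_coherence(topic_words, documents, eps=1.0):
    """UMass coherence: sum over word pairs of
    log((D(w_i, w_j) + eps) / D(w_j)), where D counts documents in
    *your* corpus -- no external reference corpus involved."""
    docsets = [set(doc) for doc in documents]

    def df(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in docsets if all(w in d for w in words))

    score = 0.0
    # Conventionally w_j is the earlier (more probable) word in the topic list.
    for wj, wi in combinations(topic_words, 2):
        denom = df(wj)
        if denom:
            score += math.log((df(wi, wj) + eps) / denom)
    return score

docs = [["cat", "dog", "pet"], ["cat", "dog"], ["car", "road"]]
print(u_mass_coherence(["cat", "dog"], docs))  # log(3/2) ~ 0.405
```

By contrast, c_v relies on a sliding-window co-occurrence statistic that gensim computes over a reference corpus, which is where the external data comes in.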

Why aren't TripleStore implemented as Native Graph Store as Property-Graph Store are?

SPARQL-based stores, or put another way, triple stores, are reputed to be less efficient than property graph stores, and to be unable to be distributed while maintaining performance the way property graphs can.
I understand that there is a lot at stake here, such as inferencing and whatnot. Putting distribution and inferencing aside (where we could limit ourselves to RDFS, which can be fully captured via SPARQL), I am wondering why that is.
More specifically, why is storage the issue? What stops a SPARQL-based store from storing data the way a property graph store does, and performing traversals instead of massive join queries? Can't SPARQL simply be translated to Gremlin steps, for instance? What is the limitation there? Can't the joins be avoided?
My assumption is that if SPARQL can be translated into efficient traversal steps, and the data is stored the way property graph stores store it, as JanusGraph does (https://docs.janusgraph.org/latest/data-model.html), then the performance gap would be bridged while maintaining some inference such as RDFS.
That being said, SPARQL is not Turing-complete of course, but at least for what it does, it would do it fast and possibly at scale as well. The goal in my view is not to compete, but to benefit from SPARQL's ease of use while using a traversal language like Gremlin for the things that really require it, e.g. OLAP.
Is there any project in that direction? Has Apache Jena considered any of this?
I saw that Graql of Grakn seems to be going down that road for the reasons I explain above, hence: what's stopping the triple store community?
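To make the join-versus-traversal distinction in the question concrete, here is a toy Python sketch (not how any real store is implemented, and deliberately unfair to both sides): the same two-hop query answered once by scanning and joining a triple table, and once by following precomputed adjacency lists, which is roughly what "native graph storage" refers to above.

```python
# Toy data: a triple table.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "dave"),
]

# 1) Set/join style: two scans over the triple table, joined on a variable.
#    SPARQL-ish: SELECT ?x WHERE { "alice" knows ?y . ?y knows ?x }
step1 = {o for s, p, o in triples if s == "alice" and p == "knows"}
join = {o for s, p, o in triples if p == "knows" and s in step1}

# 2) Traversal style: precomputed adjacency lists, followed pointer-wise,
#    one hop at a time (Gremlin-ish: g.V("alice").out("knows").out("knows")).
adj = {}
for s, p, o in triples:
    adj.setdefault((s, p), []).append(o)

traversal = {o2
             for o1 in adj.get(("alice", "knows"), [])
             for o2 in adj.get((o1, "knows"), [])}

assert join == traversal == {"carol"}
print(join)
```

Both styles return the same answers; the difference is purely the data-structure access pattern. Real triple stores do not scan, of course: they keep several triple indexes (SPO, POS, OSP, ...), so an indexed merge join can look a lot like a traversal in practice.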
@Michael, I am happy that you stepped in, as you definitely know more than me on this :). I am on a learning journey at this point. At your request, here is one of the papers that inspired my understanding: arxiv.org/abs/1801.02911 (SPARQL querying of Property Graphs using Gremlin Traversals). I quote them:
"We present a comprehensive empirical evaluation of Gremlinator and
demonstrate its validity and applicability by executing SPARQL queries
on top of the leading graph stores Neo4J, Sparksee and Apache
TinkerGraph and compare the performance with the RDF stores Virtuoso,
4Store and JenaTDB. Our evaluation demonstrates the substantial
performance gain obtained by the Gremlin counterparts of the SPARQL
queries, especially for star-shaped and complex queries."
They explain, however, that things depend somewhat on the type of query.
Or, as another Stack Overflow answer put it, Comparison of Relational Databases and Graph Databases would also help in understanding the issue between sets and paths. My understanding is that triple stores work with sets too. That being said, I am definitely not aware of all the optimization techniques implemented in triple stores lately, and I saw several papers describing techniques to significantly prune set join operations.
On distribution it is more of a gut feeling. For instance, doing join operations in a distributed fashion sounds very, very expensive to me. I don't have the papers and my research is not exhaustive on the matter, but from what I have read (and I will have to dig into my Evernote :) to back it up), that's the fundamental problem with distribution. Automated smart sharding does not seem to alleviate the issue.
@Michael, this is a very, very complex subject. I'm definitely on the journey, and that's why I am using Stack Overflow to guide my research. You probably have an idea as to why. So feel free to provide pointers indeed.
That being said, I am not saying that there is a problem with RDF and that property graphs are better. I am saying that, when it comes to graph traversal, there are ways of implementing a backend that make it fast. The data model is not the issue here; the data structure used to support the traversal is the issue. The second thing I am saying is that the choice of query language seems to influence how the "traversal" is performed, and hence which data structure is used to back the data model.
That's my understanding so far, and yes, I do understand that there are a lot of other factors at play; feel free to enumerate some of them to guide my journey.
In short, my question comes down to this: is it possible to have RDF stores backed by so-called native graph storage, implementing SPARQL in terms of traversal steps rather than joins over sets as per its algebra? Wouldn't that make things a bit faster? It seems to me that this is somewhat the approach taken by https://github.com/graknlabs/grakn, which is primarily backed by JanusGraph for graph-like storage. Although it is not RDF, Graql is the same idea as having RDFS++ plus SPARQL. They claim to just do it better, about which I have my reservations, but that's not the fundamental question of this thread. The bottom line is that they back knowledge representation with the information-retrieval (path traversal) and storage approach that property graphs championed. Let me be clear on this: I am not saying that native graph storage belongs to property graphs. It is just, in my mind, a storage approach optimized for graph structures where information retrieval involves (path) traversal: https://docs.janusgraph.org/latest/data-model.html.
First, I'd love to see the references that back up your claim that RDF-based systems are inherently less efficient than property graph ones, because frankly it's a nonsensical claim. Further, there have been distributed, and I'm assuming you mean scale-out, RDF stores, so the claim that they are not able to be distributed is simply incorrect.
The Property Graph model, and Gremlin, can easily be implemented on top of an RDF-based system. This has been done at least twice to my knowledge, and in one of those implementations reasoning was supported at the Gremlin/Property Graph layer. So you don't need to be a Property Graph based system to support that model. There are a myriad of reasons why systems, RDF and Property Graph alike, make specific implementation choices, from storage to execution and beyond, and those choices are guided by the "native" model, the technology chosen for the implementation, and perhaps most importantly, the use cases for the system and the problems it aims to solve.
Further, it's unclear what you recommend the authors of RDF-based systems actually do. Are you suggesting scale-out is beneficial? Are you stating that your preference for the Property Graph model should be taken as gospel, such that RDF-based systems give up and switch data models? Do you want Property Graph systems to retrofit RDFS?
Finally, to the initial question you asked, I think you have it exactly backwards: the Property Graph model is a hybrid model mixing elements of graph and key-value models, whereas the RDF model is a pure, i.e. native, graph model. Gremlin will be adopting the RDF model, albeit with syntactic sugar around what the RDF world calls reification and everyone else calls edge properties. So in a world where your exemplar of the Property Graph model is abandoning said model, I'm not sure what more to tell you, other than that you should do a bit more background research.

Semantic techniques in IOT

I am trying to use semantic technologies in IoT. For the last two months I have been doing a literature survey, during which I came to know some of the required tools (Protégé, Apache Jena). Now, along with reading papers, I want to play with semantic techniques like annotation, linking data, etc., so that I can get a better understanding of the concepts involved. To that end I have laid out the roadmap as:
Collect data manually (using sensors) or use some data set already on the web.
Annotate the dataset, possibly using an ontology (not sure)
Apply linked open data principles
I am not sure whether this roadmap is correct or not. I am asking for suggestions on the following points:
Is this roadmap correct?
How should I approach steps 2 and 3? In other words, which tools should I use for these steps?
I hope you can help me find a proper way of handling this. Thanks
Semantics and IoT (or the semantic sensor web [1]) is a hot topic. Congratulations on choosing an interesting and worthwhile research topic.
In my opinion, your three-step approach looks good. I would recommend you build a quick prototype so you can learn about the possible challenges early.
In addition to the implementation technologies (Protégé, etc.), there are some important works that might be useful for you:
Open Geospatial Consortium (OGC) Sensor Web Enablement (SWE). [2] It is an important work for sharing and exchanging sensor observation data. Many large organizations (NOAA, NASA, NRCan, AAFC, ESA, etc.) have adopted this standard. It defines a conceptual data model/ontology (O&M, ISO 19156). Note: this is a very comprehensive standard, hence it's very BIG and can be time-consuming to read. I recommend reading [2] mentioned below.
OGC SensorThings API (http://ogc-iot.github.io/ogc-iot-api/), an IoT cloud API standard based on OGC SWE. This might be the most relevant to you. It is a lightweight protocol in the SWE family, designed specifically for IoT. Some early research work has been done on using JSON-LD to annotate SensorThings.
W3C Spatial Data on the Web (http://www.w3.org/2015/spatial/wiki/Main_Page). It is an ongoing joint effort between W3C and OGC. Part of its goal is to mature the SSN (Semantic Sensor Network) ontology. Once it's ready, the new SSN could be used to annotate the SensorThings API, for example. A work worth monitoring.
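As a tiny illustration of the JSON-LD annotation idea mentioned above (step 2 of the roadmap), here is a sketch of annotating a single sensor observation. The vocabulary terms are simplified SSN/SOSA-style terms, and the sensor URI and values are invented for the example:

```python
import json

# Illustrative JSON-LD annotation of one sensor observation.
# The @context maps short keys to (SOSA/SSN-style) vocabulary URIs,
# so plain JSON keys become unambiguous RDF properties.
observation = {
    "@context": {
        "sosa": "http://www.w3.org/ns/sosa/",
        "madeBySensor": {"@id": "sosa:madeBySensor", "@type": "@id"},
        "hasSimpleResult": "sosa:hasSimpleResult",
        "resultTime": "sosa:resultTime",
    },
    "@type": "sosa:Observation",
    "madeBySensor": "http://example.org/sensors/thermometer-1",
    "hasSimpleResult": 21.5,
    "resultTime": "2016-05-01T12:00:00Z",
}

print(json.dumps(observation, indent=2))
```

The payload stays ordinary JSON that any IoT client can consume, while a linked-data-aware consumer can expand it into RDF triples, which is exactly what makes step 3 (applying linked data principles) possible later.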
[1] Sheth, Amit, Cory Henson, and Satya S. Sahoo. "Semantic sensor web." Internet Computing, IEEE 12.4 (2008): 78-83.
[2] Bröring, Arne, et al. "New generation sensor web enablement." Sensors 11.3 (2011): 2652-2699.

Suitability of Naive Bayes classifier in Mahout to classifying websites

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.
In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5 minute job, so I'm doing plenty of research.
From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.
I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroup test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.
Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.
Alternatively, if I'm barking up entirely the wrong tree please let me know!
You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.
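As a sketch of the "inject new features" idea, here is a minimal multinomial Naive Bayes in plain Python (this is not Mahout's implementation, and the training data and feature names are invented). The classifier only sees a bag of string features, so URL tokens can simply be added as extra, prefixed features alongside body words:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over arbitrary string features."""
    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, features, label):
        self.class_counts[label] += 1
        self.feature_counts[label].update(features)
        self.vocab.update(features)

    def predict(self, features):
        def log_prob(label):
            total = sum(self.feature_counts[label].values())
            lp = math.log(self.class_counts[label] /
                          sum(self.class_counts.values()))
            for f in features:
                # Laplace smoothing over the shared vocabulary.
                lp += math.log((self.feature_counts[label][f] + 1) /
                               (total + len(self.vocab)))
            return lp
        return max(self.class_counts, key=log_prob)

def featurize(url, body_words):
    # Body words plus URL tokens injected as extra, prefixed features.
    return body_words + ["url:" + t for t in url.split(".")]

nb = NaiveBayes()
nb.train(featurize("cnn.com", ["election", "report"]), "news")
nb.train(featurize("espn.com", ["match", "score"]), "sports")
print(nb.predict(featurize("bbc.com", ["election"])))  # prints "news"
```

Prefixing (`url:`) keeps feature spaces from colliding, so a word appearing in the URL and the same word in the body count as distinct evidence; the same trick works for header tags, anchor text, and so on.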

Swarm Intelligence - what kinds of problems are effectively solved?

I am looking for examples of practical problems (or implementations, applications) that are effectively solved using swarm intelligence. I found that multi-criteria optimization is one example. Are there any others?
IMHO swarm-intelligence should be added to the tags
Are you looking for toy problems or more for real-world applications?
In the latter category, I know variants of swarm intelligence algorithms are used in Hollywood for CGI animations, such as large (animated) armies riding across battlefields.
Related, but more towards the toy-problem end of the spectrum: you can model large crowds with similar algorithms and use that, for example, to simulate disaster scenarios. AFAIK the Dutch institute TNO has research groups on this topic, though I couldn't find an English link just by googling.
One suggestion for a place to start further investigation would be this PDF book:
http://www.cs.vu.nl/~schut/dbldot/collectivae/sci/sci.pdf
That book also has an appendix (B) with some sample projects you could try and work on.
If you want to get a head start, there are several frameworks (for scientific use) for multi-agent systems and swarm intelligence (most of them written in Java, I think). Some of them include sample apps too. For example, have a look at these:
Repast:
http://repast.sourceforge.net/repast_3/
Swarm.org:
http://swarm.org/
Netlogo:
http://ccl.northwestern.edu/netlogo
I will take your question as: what kinds of real-world problems can SI solve?
There are a lot. Swarm intelligence is based on the complex behaviour of swarms, where agents coordinate and cooperate by executing very simple rules, generating emergent, complex, self-organized behaviour. The agents often deliberate in order to make efficient decisions, and the emergent behaviour of the swarm allows it to find patterns, learn, and adapt to its environment. Therefore, real-world applications of SI are those that typically require coordination and cooperation techniques, optimization processes, exploratory analysis, dynamic problems, etc. Some of these are:
Optimization techniques (mathematical functions for example)
Coordination of a swarm of robots (to organize inventory for example)
Routing in communication networks. (This is also dynamic combinatorial optimization)
Data analysis (usually exploratory, like clustering). SI has a lot of applications in data mining and machine learning, where SI algorithms can find interesting patterns in big data sets.
NP problems in general
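As an illustration of the first item, here is a minimal particle swarm optimization sketch in Python, using the textbook velocity update (inertia, plus a cognitive pull towards each particle's personal best, plus a social pull towards the swarm's global best) with toy parameters, minimizing the sphere function:

```python
import random

def pso(f, dim=2, n_particles=20, iters=200, seed=42):
    """Minimal particle swarm optimization of f over [-5, 5]^dim."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive, social weights
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = pbest_val.index(min(pbest_val))
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:          # update personal best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:         # update global best
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

sphere = lambda x: sum(v * v for v in x)
best, best_val = pso(sphere)
print(best_val)  # very close to 0
```

Each particle only follows these simple local rules, yet the swarm as a whole converges on the minimum, which is exactly the emergent behaviour described above.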
I'm sure there are a lot more. You should check the book:
"Swarm Intelligence: from natural to artificial systems". This is the basic book.
Take care.