Extract specific triples from turtle file using SPARQL

I am in the learning phase of SPARQL. I am using rdflib to extract some triples from a graph. I have loaded the triple file and stored it in a graph object. The turtle file looks like below.
https://ontology/meddra_10047786 http://formats/oboInOwl#hasDbXref umls:C0520587
https://ontology/meddra_10047786 http://formats/oboInOwl#hasExactSynonym buschke-löwenstein tumor
https://ontology/efo_12343 http://formats/oboInOwl#hasDbXref umls:C454654
https://ontology/meddra_10047786 http://formats/oboInOwl#hasDbXref mesh:D487584
I would like to extract the triples with medra as the subject, where the predicate is hasDbXref and the object contains mesh, and in the end I would like to save them in a dataframe. The expected output is:
https://ontology/meddra_10047786 http://formats/oboInOwl#hasDbXref mesh:D487584
I am using following lines of code but it is very slow.
for s, p, o in g:
    if "meddra" in s and str(p) == "http://formats/oboInOwl#hasDbXref" and "mesh" in o:
        print(s, p, o)
Any help is highly appreciated.

extract the triples with medra as a subject
You do not have "medra" as any subject, but rather the string "medra" within subject IRIs. This means you can't query directly for "medra" but can only look up subjects like <https://ontology/meddra_10047786> that contain "medra". So you are converting between RDFLib objects and plain strings, etc. This is error-prone and will be slow if you have lots of triples.
You should tag things with "medra" in some way that you can then use in queries, for example:
Case 1:
<https://ontology/meddra_10047786>
dcat:keyword "medra" ;
.
or
Case 2:
<https://ontology/meddra_10047786>
dcterms:type <http://example.com/medra> ;
.
Now you can filter on that like this, for Case 1:
for s, o in g.subject_objects(predicate=DCAT.keyword):
    if str(o) == "medra":
        for o2 in g.objects(subject=s, predicate=URIRef("http://formats/oboInOwl#hasDbXref")):
            print(s, o2)  # prints subjects & objects
For Case 2:
for s in g.subjects(predicate=DCTERMS.type, object=URIRef("http://example.com/medra")):
    for o2 in g.objects(subject=s, predicate=URIRef("http://formats/oboInOwl#hasDbXref")):
        print(s, o2)  # prints subjects & objects


Cypher query for gremlin traversal

I am new to Cypher queries.
g.V().has('Entity1','id',within(id1)).in('Entity2').
where(__.out('Entity3').where(__.out('Entity4').has('name',within(name))))
How do I convert the above Gremlin to Cypher and return the adjacent Entity2 in-vertex?
Here the conditions are:
out('Entity3') should be out of Entity2
out('Entity4') should be out of Entity3, and name should be in the provided list of values
Return the adjacent in-vertex of Entity2
Straight answer:
MATCH (m:Entity1 )<-[:Entity2]-(n)
WHERE (n)-[:Entity3]->()-[:Entity4]->({name: "ABC"})
AND m.id in ["id1"]
RETURN n
// Assuming id is a property here.
// If id is the actual ID of the node:
MATCH (m:Entity1 )<-[:Entity2]-(n)
WHERE (n)-[:Entity3]->()-[:Entity4]->({name: "ABC"})
AND ID(m) in ["id1"]
RETURN n
I tried to create the graph for you use-case using this query:
CREATE (a:Entity2)-[:Entity2]->(b:Entity1 {id:"id1"}),
(a)-[:Entity3]->(:Entity3)-[:Entity4]->(:Entity4 {name:"ABC"})
However, I think while writing your Gremlin traversal you intended to specify the label of the vertex rather than the label of the edge. That is why, in the query I wrote to create the graph, each relationship and the vertex it points to have the same label.
If that is your intention, then your Cypher query would look like this:
MATCH (:Entity1 {id:"id1"})<--(n:Entity2)
WHERE (n)-->(:Entity3)-->(:Entity4 {name: "ABC"})
RETURN n
I'm not 100% sure what you are looking for, as the Gremlin above seems incomplete compared to the description, but I think what you are looking for is something like this:
MATCH (e1:Entity1)<-[:Entity2]-(e2)-[:Entity3]->(e3)-[:Entity4]->(e4 {name: 'ABC'})
WHERE e1.id IN ['id1']
RETURN e2

Analysis with SPARQL

I am trying to accomplish some relatively simple analysis with a specific graph.
In Marklogic SPARQL path are created with the following patterns
path+ (one or more duplicate path links)
path* (zero or more duplicate path links)
path? (zero or one path link)
path1/path2 (traversing through 2 different links)
From here, one analysis I would like to achieve is retrieving all nodes that fulfill a specific condition between node X and node Y. Based on this, my query would be something like:
?nodeX <nodeID> 1
?nodeY <nodeID> 250
?nodeX <nodeLink>* ?nodeY
Which does not really seem correct to me, as I don't think this allows me to retrieve the path linking nodeX to nodeY.
I would also like to know if it is possible to compute things such as:
Betweenness centrality, which is a measure of the number of times a vertex is found on the shortest path between each vertex pair in a graph.
Closeness centrality, which is a measure of the distance from one vertex to all other reachable vertices in the graph.
==Update==
Based on the suggestion I have managed to retrieve the path using the following query.
?nodeX <nodeID> "1"
?nodeY <nodeID> "250"
?nodeX <nodeLink>* ?v
?v ?p ?u
?u <nodeLink>* ?nodeY
When I attempted to use <p> | !<p> in my query, an error occurred stating that ! was not a valid expression. However, I believe I can still do the same by using ?path, which will accept any predicate.
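For reference, the fragments from the update fit together into one complete query. This is only a sketch using the same placeholder IRIs (<nodeID>, <nodeLink>) as the question:

```sparql
SELECT ?v ?p ?u
WHERE {
  ?nodeX <nodeID> "1" .
  ?nodeY <nodeID> "250" .
  # any ?v reachable from nodeX, any ?u one hop further, still able to reach nodeY
  ?nodeX <nodeLink>* ?v .
  ?v ?p ?u .
  ?u <nodeLink>* ?nodeY .
}
```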

SPARQL path traversing

I am trying to create a query using SPARQL on a ttl file where I have part of the graph representing links as follows:
Is it possible to search for the type Debit and get all the literals associated with its parent, i.e.: R494Vol1D2, Salvo, Vassallo?
Do I need to use paths?
As AKSW correctly said, RDF is about directed graphs. So I created a small n-triples file based on your image of the graph. I assume that the dataset looks like this:
<http://natarchives.com.mt/deed/R494Vol1-D2> <http://purl.org/dc/terms/type> "Debit".
<http://natarchives.com.mt/deed/R494Vol1-D2> <http://purl.org/dc/terms/identifier> "R494Vol1D2".
<http://natarchives.com.mt/deed/R494Vol1-D2> <http://data.archiveshub.ac.uk/def/associatedWith> <http://natarchives.com.mt/person/person796>.
<http://natarchives.com.mt/person/person796> <http://xmlns.com/foaf/0.1/firstName> "Salvo".
<http://natarchives.com.mt/person/person796> <http://xmlns.com/foaf/0.1/family_name> "Vassallo".
Also, I did not know the prefix locah, but according to http://prefix.cc it stands for http://data.archiveshub.ac.uk/def/.
So if this dataset is correct you could use the following query:
1 SELECT ?literal WHERE{
2 ?start <http://purl.org/dc/terms/type> "Debit".
3 ?start <http://data.archiveshub.ac.uk/def/associatedWith>* ?parent.
4 ?parent ?hasLiteral ?literal.
5 FILTER(isLiteral(?literal) && ?literal != "Debit" )
6 }
In line 2 we define the starting point of our path, which is every vertex that has the type "Debit". Then we look for all vertices that are connected to ?start with an edge labelled with <http://data.archiveshub.ac.uk/def/associatedWith>. These vertices are then bound to ?parent. After that we look for all triples that have ?parent as subject and store the object in ?literal. In Line 6 we filter everything that is not a literal or is "Debit" from ?literal resulting in the desired outcome.
If I modeled the direction of <http://data.archiveshub.ac.uk/def/associatedWith> wrongly, you could change line 3 of the query to:
?start ^<http://data.archiveshub.ac.uk/def/associatedWith>* ?parent
This would change the direction of the edge.
And to answer the question if you need to use paths: If you do not know how long the path of edges labeled with <http://data.archiveshub.ac.uk/def/associatedWith> will be, then in my opinion yes, you will have to use either * or + of property paths.

Neo4j: How to pass a variable to Neo4j Apoc (apoc.path.subgraphAll) Property

I am new to Neo4j and trying to do a POC by implementing a graph DB for an Enterprise Reference / Integration Architecture (an architecture showing all enterprise applications as nodes, underlying tables / APIs logically grouped as nodes, and integrations between apps as relationships).
The objective is to seamlessly achieve 'Impact Analysis' using the strength of a graph DB. (Note: I understand this may be an incorrect approach to achieve whatever I am trying to achieve, so suggestions are welcome.)
Let me briefly state my question now.
There are four Apps - A1, A2, A3, A4. A1 has a set of tables (represented by a node A1TS1) that is updated by Integration 1 (a relationship in this case), and the same set of tables is read by Integration 2. So the data model looks like below:
(A1TS1)<-[:INT1]-(A1)<-[:INT1]-(A2)
(A1TS1)-[:INT2]->(A1)-[:INT2]->(A4)
I have the underlying application table names captured as a List property in A1TS1 node.
Let's say one of the app tables is altered for a new column or data type, and I want to understand all impacted integrations and applications. Now I am trying to write a query as below to retrieve all nodes & relationships that are associated/impacted because of this table alteration, but I am not able to achieve this.
Expected Result is - all impacted nodes (A1TS1, A1, A2, A4) and relationships (INT1, INT2)
Option 1 (Using APOC)
MATCH (a {TCName:'A1TS1',AppName:'A1'})-[r]-(b)
WITH a as STRTND, Collect(type(r)) as allr
CALL apoc.path.subgraphAll(STRTND, {relationshipFilter:allr}) YIELD nodes, relationships
RETURN nodes, relationships
This fails with the error: Failed to invoke procedure 'apoc.path.subgraphAll': Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
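The exception indicates that relationshipFilter expects a pipe-delimited string (e.g. "INT1|INT2"), not a list. A sketch of a corrected Option 1, assuming APOC's apoc.text.join is available:

```cypher
MATCH (a {TCName:'A1TS1', AppName:'A1'})-[r]-(b)
WITH a AS STRTND, COLLECT(DISTINCT type(r)) AS allr
// relationshipFilter takes a string like "INT1|INT2", so join the collected types
CALL apoc.path.subgraphAll(STRTND, {relationshipFilter: apoc.text.join(allr, '|')})
YIELD nodes, relationships
RETURN nodes, relationships
```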
Option 2 (Using with, unwind, collect clause)
MATCH (a {TCName:'A1TS1',AppName:'A1'})-[r]-(b)
WITH a as STRTND, Collect(r) as allr
UNWIND allr as rels
MATCH p=()-[rels]-()-[rels]-()
RETURN p
This fails with the error "Cannot use the same relationship variable 'rels' for multiple patterns", but if I use [rels] once, like p=()-[rels]-(), it works without yielding me all nodes.
Any help/suggestion/lead is appreciated. Thanks in advance
Update
Trying to give more context
Showing the Underlying Data
MATCH (TC:TBLCON) RETURN TC
"TC"
{"Tables":["TBL1","TBL2","TBL3"],"TCName":"A1TS1","AppName":"A1"}
{"Tables":["TBL4","TBL1"],"TCName":"A2TS1","AppName":"A2"}
MATCH (A:App) RETURN A
"A"
{"Sponsor":"XY","Platform":"Oracle","TechOwnr":"VV","Version":"12","Tags":["ERP","OracleEBS","FinanceSystem"],"AppName":"A1"}
{"Sponsor":"CC","Platform":"Teradata","TechOwnr":"RZ","Tags":["EDW","DataWarehouse"],"AppName":"A2"}
MATCH ()-[r]-() RETURN distinct r.relname
"r.relname"
"FINREP" │ (runs between A1 to other apps)
"UPFRNT" │ (runs between A2 to different Salesforce App)
"INVOICE" │ (runs between A1 to other apps)
With this, here is what I am trying to achieve:
Assume "TBL3" is getting altered in App A1. I want to write a query specifying the table "TBL3" in the match pattern and get all associated relationships and connected nodes (upstream).
Maybe I need to achieve this in 3 steps:
Step 1 - Write a match pattern to find the start node and associated relationship(s)
Step 2 - Store that relationship(s) from step 1 in a Array variable / parameter
Step 3 - Pass the start node from step 1 & parameter from step 2 to apoc.path.subgraphAll to see all the impacted nodes
This may conceptually sound valid, but how to do that technically in a Neo4j Cypher query is the question.
Hope this helps
This query may do what you want:
MATCH (tc:TBLCON)
WHERE $table IN tc.Tables
MATCH p=(tc)-[:Foo*]-()
WITH tc,
REDUCE(s = [], x IN COLLECT(NODES(p)) | s + x) AS ns,
REDUCE(t = [], y IN COLLECT(RELATIONSHIPS(p)) | t + y) AS rs
UNWIND ns AS n
WITH tc, rs, COLLECT(DISTINCT n) AS nodes
UNWIND rs AS rel
RETURN tc, nodes, COLLECT(DISTINCT rel) AS rels;
It assumes that you provide the name of the table of interest (e.g., "TBL3") as the value of a table parameter. It also assumes that the relationships of interest all have the Foo type.
It first finds tc, the TBLCON node(s) containing that table name. It then uses a variable-length non-directional search for all paths (with non-repeating relationships) that include tc. It then uses COLLECT twice: to aggregate the list of nodes in each path, and to aggregate the list of relationships in each path. Each aggregation result would be a list of lists, so it uses REDUCE on each outer list to merge the inner lists. It then uses UNWIND and COLLECT(DISTINCT x) on each list to produce a list with unique elements.
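The REDUCE flattening step described above can be tried in isolation. This standalone sketch (with hypothetical literal lists, runnable in the Neo4j browser) folds each inner list into the accumulator:

```cypher
// s starts as []; each inner list x is concatenated onto it, flattening the list of lists
RETURN REDUCE(s = [], x IN [[1, 2], [3, 4]] | s + x) AS flattened
```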
[UPDATE]
If you differentiate between your relationships by type (rather than by property value), your Cypher code can be a lot simpler by taking advantage of APOC functions. The following query assumes that the desired relationship types are passed via a types parameter:
MATCH (tc:TBLCON)
WHERE $table IN tc.Tables
CALL apoc.path.subgraphAll(
tc, {relationshipFilter: apoc.text.join($types, '|')}) YIELD nodes, relationships
RETURN nodes, relationships;
With some lead from cybersam's response, the below query gets me what I want. The only constraint is that this result is limited to 3 layers (the 3rd layer through the OPTIONAL MATCH).
MATCH (TC:TBLCON) WHERE 'TBL3' IN TC.Tables
CALL apoc.path.subgraphAll(TC, {maxLevel:1}) YIELD nodes AS invN, relationships AS invR
WITH TC, REDUCE (tmpL=[], tmpr IN invR | tmpL+type(tmpr)) AS impR
MATCH FLP=(TC)-[]-()-[FLR]-(SL) WHERE type(FLR) IN impR
WITH FLP, TC, SL,impR
OPTIONAL MATCH SLP=(SL)-[SLR]-() WHERE type(SLR) IN impR RETURN FLP,SLP
This works for my needs, hope this might also help someone.
Thanks everyone for the responses and suggestions
****Update****
Enhanced the query to get rid of Optional Match criteria and other given limitations
MATCH (initTC:TBLCON) WHERE $TL IN initTC.Tables
WITH Reduce(O="",OO in Reduce (I=[], II in collect(apoc.node.relationship.types(initTC)) | I+II) | O+OO+"|") as RF
MATCH (TC:TBLCON) WHERE $TL IN TC.Tables
CALL apoc.path.subgraphAll(TC,{relationshipFilter:RF}) YIELD nodes, relationships
RETURN nodes, relationships
Thanks all (especially cybersam)

Pig - comparing two similar statement : one working, the other not

I am beginning to be really annoyed with Pig: the language seems really unstable, the documentation is poor, there are not that many examples on the internet, and any small change in the code can give radically different results, from failure to the expected result. Here is another instance of this last theme:
grunt> describe actions_by_unite;
actions_by_unite: {
group: chararray,
nb_actions_by_unite_and_action: {
(
unite: chararray,
lib_type_action: chararray,
double
)
}
}
-- works :
z = foreach actions_by_unite {
generate group, SUM(nb_actions_by_unite_and_action.$2);};
-- doesn't work :
z = foreach actions_by_unite {
x = SUM(nb_actions_by_unite_and_action.$2);
generate group, x;};
-- error :
2015-05-08 14:43:44,712 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 107, column 16> Invalid scalar projection: x : A column needs to be projected from a relation for it to be used as a scalar
Details at logfile: /private/tmp/pig-err.log
And so :
-- doesn't work neither:
z = foreach actions_by_unite { x = SUM(nb_actions_by_unite_and_action.$2);
generate group, x.$0;};
--error :
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (AC,EMAIL,1.1186133550060547E-4), 2nd :(AC,VISITE,6.25755280560356E-4)
at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:120)
Would anyone know why?
Do you have some nice blogs / resources to propose, with examples, to master this language?
I have the O'Reilly book, but it seems a bit old; I have the 'Agile Data Science' and "Hadoop: The Definitive Guide" books with some examples in them. I found this page really interesting: https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/
Any good videos on Coursera or other inputs? Do you guys also have problems with this language, or am I simply dumb?
That thing in particular is not because of Pig being unstable, it's because what you are trying to do is correct in the first approach, but wrong in the others.
When you make a group by, you have for each group a bag that contains X tuples. Inside a nested foreach, you have one group with its bag for each iteration, which means that a SUM inside there will yield a scalar value: the sum of the bag you are currently working with. Apache Pig does not work with scalars, it works with relations, therefore you cannot assign a scalar value to an alias, which is exactly what you are doing in the second and third approach.
Therefore, the error comes from attempting something like:
A = foreach B {
x = SUM(bag.$0);
}
However, if you want to emit for each of the groups a scalar, you can perfectly do this as long as you never assign a scalar to an alias. That is why it works perfectly if you do the sum at the end of the foreach, because you are returning for each of the groups a tuple with two values: the group and the sum.
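In other words, the working pattern keeps the aggregate inside GENERATE, so the scalar is emitted as part of the output tuple rather than being bound to an alias:

```pig
-- Works: the per-group SUM is projected directly in GENERATE,
-- never assigned to a relation alias inside the nested FOREACH
z = FOREACH actions_by_unite GENERATE group, SUM(nb_actions_by_unite_and_action.$2);
```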