Neo4j: How to pass a variable to an APOC procedure (apoc.path.subgraphAll)

I am new to Neo4j and am doing a POC: implementing a graph DB for an Enterprise Reference / Integration Architecture (an architecture showing all enterprise applications as nodes, underlying tables / APIs logically grouped as nodes, and integrations between apps as relationships).
The objective is to achieve seamless 'impact analysis' using the strengths of a graph DB. (Note: I understand this may be the wrong approach for what I am trying to achieve, so suggestions are welcome.)
Let me state my question briefly now.
There are four apps - A1, A2, A3, A4. A1 has a set of tables (represented by a node A1TS1) that is updated by Integration 1 (a relationship in this case), and the same set of tables is read by Integration 2. So the data model looks like below:
(A1TS1)<-[:INT1]-(A1)<-[:INT1]-(A2)
(A1TS1)-[:INT2]->(A1)-[:INT2]->(A4)
I have the underlying application table names captured as a list property on the A1TS1 node.
Let's say one of the app tables is altered for a new column or data type, and I want to understand all impacted integrations and applications. I am trying to write a query as below to retrieve all nodes & relationships that are associated/impacted by this table alteration, but am not able to achieve it.
The expected result is all impacted nodes (A1TS1, A1, A2, A4) and relationships (INT1, INT2).
Option 1 (Using APOC)
MATCH (a {TCName:'A1TS1',AppName:'A1'})-[r]-(b)
WITH a as STRTND, Collect(type(r)) as allr
CALL apoc.path.subgraphAll(STRTND, {relationshipFilter:allr}) YIELD nodes, relationships
RETURN nodes, relationships
This fails with the error: Failed to invoke procedure 'apoc.path.subgraphAll': Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
Option 2 (Using with, unwind, collect clause)
MATCH (a {TCName:'A1TS1',AppName:'A1'})-[r]-(b)
WITH a as STRTND, Collect(r) as allr
UNWIND allr as rels
MATCH p=()-[rels]-()-[rels]-()
RETURN p
This fails with the error "Cannot use the same relationship variable 'rels' for multiple patterns", but if I use [rels] once, as in p=()-[rels]-(), it works - it just does not yield all the nodes.
Any help/suggestion/lead is appreciated. Thanks in advance
Update
Trying to give more context
Showing the Underlying Data
MATCH (TC:TBLCON) RETURN TC
"TC"
{"Tables":["TBL1","TBL2","TBL3"],"TCName":"A1TS1","AppName":"A1"}
{"Tables":["TBL4","TBL1"],"TCName":"A2TS1","AppName":"A2"}
MATCH (A:App) RETURN A
"A"
{"Sponsor":"XY","Platform":"Oracle","TechOwnr":"VV","Version":"12","Tags":["ERP","OracleEBS","FinanceSystem"],"AppName":"A1"}
{"Sponsor":"CC","Platform":"Teradata","TechOwnr":"RZ","Tags":["EDW","DataWarehouse"],"AppName":"A2"}
MATCH ()-[r]-() RETURN distinct r.relname
"r.relname"
"FINREP" │ (runs between A1 to other apps)
"UPFRNT" │ (runs between A2 to different Salesforce App)
"INVOICE" │ (runs between A1 to other apps)
With this, here is what I am trying to achieve:
Assume "TBL3" is being altered in app A1. I want to write a query that specifies the table "TBL3" in the match pattern and gets all associated relationships and connected nodes (upstream).
Maybe I need to achieve this in three steps:
Step 1 - Write a match pattern to find the start node and associated relationship(s)
Step 2 - Store the relationship(s) from step 1 in an array variable / parameter
Step 3 - Pass the start node from step 1 & parameter from step 2 to apoc.path.subgraphAll to see all the impacted nodes
This may be conceptually valid, but how to do it technically in a Neo4j Cypher query is the question.
Hope this added context helps.

This query may do what you want:
MATCH (tc:TBLCON)
WHERE $table IN tc.Tables
MATCH p=(tc)-[:Foo*]-()
WITH tc,
REDUCE(s = [], x IN COLLECT(NODES(p)) | s + x) AS ns,
REDUCE(t = [], y IN COLLECT(RELATIONSHIPS(p)) | t + y) AS rs
UNWIND ns AS n
WITH tc, rs, COLLECT(DISTINCT n) AS nodes
UNWIND rs AS rel
RETURN tc, nodes, COLLECT(DISTINCT rel) AS rels;
It assumes that you provide the name of the table of interest (e.g., "TBL3") as the value of a table parameter. It also assumes that the relationships of interest all have the Foo type.
It first finds tc, the TBLCON node(s) containing that table name. It then uses a variable-length non-directional search for all paths (with non-repeating relationships) that include tc. It then uses COLLECT twice: to aggregate the list of nodes in each path, and to aggregate the list of relationships in each path. Each aggregation result would be a list of lists, so it uses REDUCE on each outer list to merge the inner lists. It then uses UNWIND and COLLECT(DISTINCT x) on each list to produce a list with unique elements.
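As a standalone illustration of that flattening idiom (a toy example, not part of the answer's data model):
RETURN REDUCE(acc = [], x IN [[1, 2], [3, 4]] | acc + x) AS flat
// flat: [1, 2, 3, 4]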
[UPDATE]
If you differentiate between your relationships by type (rather than by property value), your Cypher code can be a lot simpler by taking advantage of APOC functions. The following query assumes that the desired relationship types are passed via a types parameter:
MATCH (tc:TBLCON)
WHERE $table IN tc.Tables
CALL apoc.path.subgraphAll(
tc, {relationshipFilter: apoc.text.join($types, '|')}) YIELD nodes, relationships
RETURN nodes, relationships;
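For reference, either query's parameters can be set in the Neo4j Browser (or cypher-shell) before running it; the values below are just examples drawn from this question:
:param table => 'TBL3'
:param types => ['INT1', 'INT2']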

With some leads from cybersam's response, the query below gets me what I want. The only constraint is that the result is limited to 3 layers (the 3rd layer obtained through OPTIONAL MATCH).
MATCH (TC:TBLCON) WHERE 'TBL3' IN TC.Tables
CALL apoc.path.subgraphAll(TC, {maxLevel:1}) YIELD nodes AS invN, relationships AS invR
WITH TC, REDUCE (tmpL=[], tmpr IN invR | tmpL+type(tmpr)) AS impR
MATCH FLP=(TC)-[]-()-[FLR]-(SL) WHERE type(FLR) IN impR
WITH FLP, TC, SL, impR
OPTIONAL MATCH SLP=(SL)-[SLR]-() WHERE type(SLR) IN impR RETURN FLP,SLP
This works for my needs, hope this might also help someone.
Thanks everyone for the responses and suggestions
Update
Enhanced the query to get rid of the OPTIONAL MATCH criteria and the other limitations mentioned above:
MATCH (initTC:TBLCON) WHERE $TL IN initTC.Tables
WITH Reduce(O="",OO in Reduce (I=[], II in collect(apoc.node.relationship.types(initTC)) | I+II) | O+OO+"|") as RF
MATCH (TC:TBLCON) WHERE $TL IN TC.Tables
CALL apoc.path.subgraphAll(TC,{relationshipFilter:RF}) YIELD nodes, relationships
RETURN nodes, relationships
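A possibly tidier variant of the same idea (an untested sketch: it assumes the APOC functions apoc.coll.toSet and apoc.text.join, and avoids the trailing '|' that the nested REDUCE leaves in RF):
MATCH (initTC:TBLCON) WHERE $TL IN initTC.Tables
// flatten the per-node type lists, de-duplicate, then pipe-join them
WITH apoc.coll.toSet(REDUCE(acc = [], ts IN COLLECT(apoc.node.relationship.types(initTC)) | acc + ts)) AS types
MATCH (TC:TBLCON) WHERE $TL IN TC.Tables
CALL apoc.path.subgraphAll(TC, {relationshipFilter: apoc.text.join(types, '|')}) YIELD nodes, relationships
RETURN nodes, relationships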
Thanks all (especially cybersam)

Related

How to use multiple union steps in a Gremlin query?

I want to convert the following Cypher query to Gremlin but am facing an issue while using union.
match d=(s:Group{id:123})<-[r:Member_Of*]-(p:Person)
with d, RELATIONSHIPS(d) as rels
WHERE NONE(rel in rels WHERE EXISTS(rel.`Ceased On`))
return *
UNION
match d=(s:Group{id:123})<-[r:Member_Of*]-(p:Group)-[r1:Member_Of*]->(c:Group)
with d, RELATIONSHIPS(d) as rels
WHERE NONE(rel in rels WHERE EXISTS(rel.`Ceased On`))
return *
UNION
match d=(c:Group{id:123})-[r:Member_Of*]->(p:Group)
with d, RELATIONSHIPS(d) as rels
WHERE NONE(rel in rels WHERE EXISTS(rel.`Ceased On`))
return *
UNION
match d=(s:Group{id:123})<-[r:Member_Of*]-(p:Group)<-[r1:Member_Of*]-(c:Person)
with d, RELATIONSHIPS(d) as rels
where NONE(rel in rels WHERE EXISTS(rel.`Ceased On`))
return *
In the above Cypher query, the source vertex is a Group with id '123', so for the incoming and outgoing edges I have created the following Gremlin query.
g.V().hasLabel('Group').has('id',123).
  union(
    __.inE('Member_Of').values('Name'),
    __.outE('Member_Of').values('Name')).
  path()
Now I have to traverse the incoming edges of the vertex that is itself an incoming vertex to the source vertex in the above query, and this is where I am confused by the union syntax.
Please help, Thanks :)
This part of the Cypher query
match d=(s:Group{id:123})<-[r:Member_Of*]-(p:Group)<-[r1:Member_Of*]-(c:Person)
can be expressed in Gremlin as
g.V().has('Group','id',123).
repeat(inE('Member_Of').outV()).until(hasLabel('Group')).
repeat(inE('Member_Of').outV()).until(hasLabel('Person')).
path().
by(elementMap())
If there are not really multiple hops involved (i.e. in Cypher if you do not really need the '*') you can remove the repeat construct and just keep the inE and outV steps. Anyway, this bit of Gremlin will get you the nodes and edges (relationships) along with all their properties etc. as the path will contain their elementMap.
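For instance, under the assumption that each level really is a single Member_Of hop, the repeat-free version might look like this sketch:
// single-hop variant of the repeat/until form above (assumes exactly one hop per level)
g.V().has('Group','id',123).
  inE('Member_Of').outV().hasLabel('Group').
  inE('Member_Of').outV().hasLabel('Person').
  path().
  by(elementMap())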
Note that a straight port from Cypher to Gremlin may not take full advantage of the target database. For example, many Gremlin enabled stores allow user provided (real) ID values. This makes querying a lot more efficient as you can use something like this:
g.V('123')
to directly find a vertex.
Please add a comment below if this does not fully unblock you.
UPDATED: 2021-10-29
I used the air routes data set to test that the query pattern works, using this query:
g.V().has('airport','code','AUS').
repeat(inE('contains').outV()).until(hasLabel('country')).limit(1).
repeat(outE('contains').inV()).until(hasLabel('airport')).limit(3).
path().
by(elementMap())
The following is the full replacement of the Cypher query with a Gremlin query.
g.V().has('Group','id',123). //source vertex
  union(
    //Following represents the incoming Member_Of edges towards the source vertex from Person nodes, until the last node in the chain
    repeat(inE('Member_Of').outV()).until(hasLabel('Person')),
    //Following represents the incoming Member_Of edge from a Group to the source vertex and the edge outwards from the same Group, until the last node in the chain
    repeat(inE('Member_Of').outV().hasLabel('Group').simplePath()).until(inE().count().is(0)).
    repeat(outE('Member_Of').inV().hasLabel('Group').simplePath()).until(inE().count().is(0)),
    //Following represents the outgoing Member_Of edges from the source vertex to other Group nodes, until the last node in the chain
    repeat(outE('Member_Of').inV().hasLabel('Group').simplePath()).until(outE().count().is(0)),
    //Following represents the incoming Member_Of edge from a Group to the source vertex and the incoming edge from a Person to that Group, until the last node in the chain
    repeat(inE('Member_Of').outV().hasLabel('Group').simplePath()).until(inE().count().is(0)).
    repeat(inE('Member_Of').outV().hasLabel('Person').simplePath()).until(inE().count().is(0))
  ).
  path().by(elementMap())

Create subgraph query in Gremlin around single node with outgoing and incoming edges

I have a large JanusGraph database and I'd like to create a subgraph centered around one node type, including incoming and outgoing nodes of specific types.
In Cypher, the query would look like this:
MATCH (a:Journal)<-[:PublishedIn]-(b:Paper{paperTitle:'My Paper Title'})<-[:AuthorOf]-(c:Author)
RETURN a,b,c
This is what I tried in Gremlin:
sg = g.V().outE('PublishedIn').subgraph('j_p_a').has('Paper','paperTitle', 'My Paper Title')
.inE('AuthorOf').subgraph('j_p_a')
.cap('j_p_a').next()
But I get a syntax error. 'AuthorOf' and 'PublishedIn' are not the only edge types ending at 'Paper' nodes.
Can someone show me how to correctly execute this query in Gremlin?
As written in your query, the outE step yields edges, and the has step will check properties on those edges; after that, the query processor will expect an inV, not another inE. Without your data model it is hard to know exactly what you need; however, looking at the Cypher, I think this is what you want.
sg = g.V().outE('PublishedIn').
       subgraph('j_p_a').
       inV().
       has('Paper','paperTitle', 'My Paper Title').
       inE('AuthorOf').
       subgraph('j_p_a').
       cap('j_p_a').
       next()
Edited to add:
As I do not have your data I used my air-routes graph. I modeled this query on yours and used some select steps to limit the data size processed. This seems to work in my testing. Hopefully you can see the changes I made and try those in your query.
sg = g.V().outE('route').as('a').
inV().
has('code','AUS').as('b').
select('a').
subgraph('sg').
select('b').
inE('contains').
subgraph('sg').
cap('sg').
next()
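As a usage note (standard TinkerPop behavior): the subgraph step materializes an in-memory TinkerGraph, so the extracted subgraph can itself be traversed by spawning a traversal source on it:
// 'sg' is the Graph produced by cap('sg').next() above
g2 = sg.traversal()
g2.V().elementMap()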

Is there a more efficient way to return all records in a "traversed" linked list?

I have a generic (non-V) class with a LINKMAP field that links a chain of time-series data. Each record links to the next record based on a key of the next month/day/hour/etc. and a value of the #rid of the next record. I could use edges (non-lightweight, with a single property the same as the LINKMAP key), but I don't want to: I only need a one-direction link and want to use the least amount of storage space (plus my assumption is that while edges might offer slightly easier querying, they wouldn't offer any benefit in terms of performance or storage space).
I currently have the following batch query:
begin;
let r = select from data where key='AAA';
let y = select expand(links['2017']) from $r;
let m = select expand(links['07']) from $y;
let d = select expand(links['15']) from $m;
let h = select expand(links['10']) from $d;
return (select $r[0], $y[0], $m[0], $d[0], $h[0])
This works, in the sense that I get a result set with the #rids of each record in the linked list.
I have two questions:
Is there a more efficient way to do this query? I've tried various TRAVERSE options with $path and such, but nothing else I attempted worked at all. I did not try MATCH because I'm not using edges.
The current result set is flat, with each value prop ($r, $y, etc.) containing the corresponding #rid. That's OK for my purposes, but I'd like to know if there's a way to return the actual records instead of just the #rids. Nothing I've tried works.
PS - I'm using orientjs, but the batch runs identically from the console as well.
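One possibility, if OrientDB's dotted field navigation can be chained through LINKMAP lookups (an untested, heavily hedged sketch - the chained links['...'] navigation is an assumption, not a verified API):
begin;
let r = select from data where key='AAA';
-- hypothetical: collapses the hops into a single expand over chained LINKMAP lookups
let h = select expand(links['2017'].links['07'].links['15'].links['10']) from $r;
return $h
If the chaining works, expand() should also give back the full records rather than just their #rids.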

Django ORM Cross Product

I have three models:
class Customer(models.Model):
    pass

class IssueType(models.Model):
    pass

class IssueTypeConfigPerCustomer(models.Model):
    customer = models.ForeignKey(Customer)
    issue_type = models.ForeignKey(IssueType)

    class Meta:
        unique_together = [('customer', 'issue_type')]
How can I find all tuples of (customer, issue_type) where there is no IssueTypeConfigPerCustomer object?
I want to avoid a loop in Python. A solution which solves this in the DB would be preferred.
Background: for every customer and for every issue-type, there should be a config in the DB.
If you can afford to make one database trip for each issue type, try something like this untested snippet:
def lacking_configs():
    for issue_type in IssueType.objects.all():
        # customers with no config for this particular issue type
        for customer in Customer.objects.exclude(
            issuetypeconfigpercustomer__issue_type=issue_type
        ):
            yield customer, issue_type

missing = list(lacking_configs())
This is probably OK unless you have a lot of issue types or you are doing this several times per second, but you may also consider having a sensible default instead of making a config object mandatory for each combination of issue type and customer (IMHO it is a bit of a design smell).
[update]
I updated the question: I want to avoid a loop in Python. A solution which solves this in the DB would be preferred.
In Django, every Queryset is either a list of Model instances or a dict (values querysets), so it is impossible to return the format you want (a list of tuples of Model) without some Python (and possibly multiple trips to the database).
The closest thing to a cross product would be using the "extra" method without a where parameter, but it involves raw SQL and knowing the underlying table name for the other model:
missing = Customer.objects.extra(
    select={"issue_type_id": "appname_issuetype.id"},
    tables=["appname_issuetype"],
)
As a result, each Customer object will have an extra attribute, "issue_type_id", containing the id of one IssueType. You can use the where parameter to filter based on NOT EXISTS (SELECT 1 FROM appname_issuetypeconfigpercustomer WHERE issuetype_id=appname_issuetype.id AND customer_id=appname_customer.id). Using the values method you can have something close to what you want - this is probably enough information to verify the rule and create the missing records. If you need other fields from IssueType just include them in the select argument.
In order to assemble a list of (Customer, IssueType) you need something like:
cross_product = [
    (customer, IssueType.objects.get(pk=customer.issue_type_id))
    for customer in Customer.objects.extra(
        select={"issue_type_id": "appname_issuetype.id"},
        tables=["appname_issuetype"],
        where=["""
            NOT EXISTS (
                SELECT 1
                FROM appname_issuetypeconfigpercustomer
                WHERE issuetype_id=appname_issuetype.id
                AND customer_id=appname_customer.id
            )
        """],
    )
]
Not only does this require the same number of trips to the database as the "generator"-based version, but IMHO it is also less portable, less readable, and violates DRY. I guess you can lower the number of database queries to a couple using something like this:
missing = Customer.objects.extra(
    select={"issue_type_id": "appname_issuetype.id"},
    tables=["appname_issuetype"],
    where=["""
        NOT EXISTS (
            SELECT 1
            FROM appname_issuetypeconfigpercustomer
            WHERE issuetype_id=appname_issuetype.id
            AND customer_id=appname_customer.id
        )
    """],
)
issue_list = dict(
    (issue.id, issue)
    for issue in IssueType.objects.filter(
        pk__in=set(m.issue_type_id for m in missing)
    )
)
cross_product = [(c, issue_list[c.issue_type_id]) for c in missing]
Bottom line: in the best case you make two queries at the cost of legibility and portability. Having sensible defaults is probably a better design compared to mandatory config for each combination of Customer and IssueType.
This is all untested, sorry if some homework was left for you.
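For completeness, a portable alternative not in the original answers (an untested sketch): compute the set difference in Python with three fixed queries, trading the raw SQL for a small in-memory loop.
from itertools import product

# one query for the (customer_id, issue_type_id) pairs that already have a config
existing = set(
    IssueTypeConfigPerCustomer.objects.values_list("customer_id", "issue_type_id")
)

# one query per model for the full cross product, filtered in memory
missing = [
    (customer, issue_type)
    for customer, issue_type in product(
        Customer.objects.all(), IssueType.objects.all()
    )
    if (customer.id, issue_type.id) not in existing
]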

Hadoop Pig - Replace strings in a relation with their corresponding values in a map

I have a relation called conversations_grouped made up of bags of tuples of varying sizes, like so:
DUMP conversations_grouped:
...
({(L194),(L195),(L196),(L197)})
({(L198),(L199)})
({(L200),(L201),(L202),(L203)})
({(L204),(L205),(L206)})
({(L207),(L208)})
({(L271),(L272),(L273),(L274),(L275)})
({(L276),(L277)})
({(L280),(L281)})
({(L363),(L364)})
({(L365),(L366)})
({(L666256),(L666257)})
({(L666369),(L666370),(L666371),(L666372)})
({(L666520),(L666521),(L666522)})
Each L[0-9]+ is a tag corresponding to a string. For example, L194 might be "Hello, how are you doing?" and L195 might be "fine, how are you?". This correspondence is maintained by a map called line_map. Here's a sample:
DUMP line_map;
...
([L666324#Do you think she might be interested in someone?])
([L666264#Well that's typical of Her Majesty's army. Appoint an engineer to do a soldier's work.])
([L666263#Um. There are rumours that my Lord Chelmsford intends to make Durnford Second in Command.])
([L666262#Lighting COGHILL' 5 cigar: Our good Colonel Dumford scored quite a coup with the Sikali Horse.])
([L666522#So far only their scouts. But we have had reports of a small Impi farther north, over there. ])
([L666521#And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?])
([L666520#Well I assure you, Sir, I have no desire to create difficulties. 45])
([L666372#I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.])
([L666371#Lord Chelmsford seems to want me to stay back with my Basutos.])
([L666370#I'm to take the Sikali with the main column to the river])
([L666369#Your orders, Mr Vereker?])
([L666257#Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot])
([L666256#Colonel Durnford... William Vereker. I hear you 've been seeking Officers?])
What I'm trying to do now is parse through each line and replace the L[0-9]+ tags with their corresponding text from line_map. Is it possible to make references to line_map from within a Pig FOREACH statement, or is there something else I have to do?
The first issue with this is that in a map the key must be a quoted string, so you can't use a schema value to access the map. E.g., this will not work:
C: {foo: chararray, M: [value:chararray]}
D = FOREACH C GENERATE M#foo ;
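By contrast, a literal quoted key is fine (a one-line illustration):
-- literal keys work for map dereferencing
D = FOREACH C GENERATE M#'foo' ;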
The solution that comes to mind is to FLATTEN conversations_grouped. Then do a join between conversations_grouped and line_map on the L[0-9]+ tag. You'll probably want to project out some of the extra fields (like the L[0-9]+ tag after the join) to make the next step faster. After that you'll have to regroup the data, and massage it into the correct format.
This won't work unless each bag has its own unique ID for the regrouping, but if each of the L[0-9]+ tags appears in only one bag (conversation) you can use this to create a unique id.
-- A is dumped conversations_grouped
B = FOREACH A {
-- Pulls out an element from the bag to use as the id
id = LIMIT tags 1 ;
-- Flattens into (id, tag) form. Each group of tags will have the same id.
GENERATE FLATTEN(id), FLATTEN(tags) ;
}
The schema and output for B is:
B: {id: chararray,tags::tag: chararray}
(L194,L194)
(L194,L195)
(L194,L196)
(L194,L197)
(L198,L198)
(L198,L199)
(L200,L200)
(L200,L201)
(L200,L202)
(L200,L203)
(L204,L204)
(L204,L205)
(L204,L206)
(L207,L207)
(L207,L208)
(L271,L271)
(L271,L272)
(L271,L273)
(L271,L274)
(L271,L275)
(L276,L276)
(L276,L277)
(L280,L280)
(L280,L281)
(L363,L363)
(L363,L364)
(L365,L365)
(L365,L366)
(L666256,L666256)
(L666256,L666257)
(L666369,L666369)
(L666369,L666370)
(L666369,L666371)
(L666369,L666372)
(L666520,L666520)
(L666520,L666521)
(L666520,L666522)
Assuming that the tags are unique, the rest is done like:
-- A2 is line_map, loaded in tag/message pairs instead of a map
-- Joins conversations_grouped and line_map on tag
C = FOREACH (JOIN B by tags::tag, A2 by tag)
-- This generate removes the tag
GENERATE id, message ;
-- Regroups C on the id created in B
D = FOREACH (GROUP C BY id)
-- This step limits the output to just messages
GENERATE C.(message) AS messages ;
Schema and output from D:
D: {messages: {(A2::message: chararray)}}
({(Colonel Durnford... William Vereker. I hear you 've been seeking Officers?),(Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot)})
({(Your orders, Mr Vereker?),(I'm to take the Sikali with the main column to the river),(Lord Chelmsford seems to want me to stay back with my Basutos.),(I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.)})
({(Well I assure you, Sir, I have no desire to create difficulties. 45),(And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?),(So far only their scouts. But we have had reports of a small Impi farther north, over there. )})
NOTE: In the worst case (if the L[0-9]+ tags aren't unique), you can give each line of the input file(s) a sequential integer id before you load it into Pig.
UPDATE: If you are using Pig 0.11, then you can also use the RANK operator.
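A minimal sketch of that idea (untested, assuming Pig 0.11+; RANK prepends a sequential rank column, so the id no longer depends on the tags being unique):
-- RANK assigns a sequential id to each record (here, each bag of tags)
ranked = RANK conversations_grouped;
-- use the rank as the group id instead of the LIMIT 1 trick
B = FOREACH ranked GENERATE $0 AS id, FLATTEN($1) AS tag;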