Cypher BFS with multiple relationship types in a path

I'd like to model autonomous systems and their relationships in a graph database (Memgraph).
There are two different kinds of relationships that can exist between nodes:
undirected peer2peer relationships (edges without arrows in the image)
directed provider2customer relationships (arrows pointing to the provider in the image)
The following image shows valid paths that I want to find with some query.
They can be described as
(s)-[:provider*0..n]->()-[:peer*0..n]-()<-[:provider*0..n]-(d)
or in other words
0-n c2p edges followed by 0-n p2p edges followed by 0-n p2c edges
I can fix the first and last node and would like to find a (shortest/cheapest) path between them. As I understand it, I can only do a BFS if there is ONE relationship type along the path.
Is there a way to query for paths of such form in Cypher?
As an alternative, I could run individual queries where I specify the length of each segment, issuing one query per path length until a path is found.
i.e.
MATCH (s)-[]-(d) // all one-hop paths
MATCH (s)-[:provider]->()-[:peer]-(d)
MATCH (s)-[:provider]->()<-[:provider]-(d)
...

Since a valid path can be made of 7 different segment combinations (each of the three segments may or may not be empty), I don't see how simply chaining 3 BFS patterns (... BFS*0..n) would yield a valid solution: a chained pattern can't produce an empty segment because there are intermediate nodes between the segments (I have to double-check that).
Writing individual patterns is not great.
Some options are:
MATCH path=(s)-[:BFS*0..n]-(d) WHERE {{filter_expression}} -> the filter expression has to be quite complex in order to yield only valid paths.
MATCH path=(s)-[:BFS*0..n]-(d) CALL module.filter_procedure(path) -> module.filter_procedure(path) could be implemented in Python or C/C++ (please take a look here). I would recommend starting with Python since it's much easier, and Python should be fine for the PoC. I would also recommend starting with this option because I'm pretty confident the solution will work, and it's modular: the filter_procedure can easily be extended later while the query stays the same. A sketch of such a procedure is given below.
Could you please provide a sample dataset in the form of a Cypher query (a couple of nodes and edges / a small graph)? I'd be glad to come up with a solution.
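Purely as an illustration, here is a minimal sketch of what such a filter procedure could look like as a Memgraph query module written in Python. It assumes Memgraph's mgp API and edge types named provider and peer as in the pattern above, and it checks the customer-to-provider / peer / provider-to-customer segment order; it is a sketch of the idea, not a finished solution.

import mgp


@mgp.read_proc
def filter_procedure(context: mgp.ProcCtx,
                     path: mgp.Path) -> mgp.Record(valid=bool):
    """valid=True iff `path` is 0..n customer->provider edges, then 0..n peer
    edges, then 0..n provider->customer edges."""
    # Segments: 0 = "up" over :provider, 1 = sideways over :peer, 2 = "down" against :provider.
    segment = 0
    vertices = path.vertices
    for i, edge in enumerate(path.edges):
        forward = edge.from_vertex.id == vertices[i].id  # traversed along its direction?
        if edge.type.name == "provider" and forward:
            step = 0
        elif edge.type.name == "peer":
            step = 1
        elif edge.type.name == "provider" and not forward:
            step = 2
        else:
            return mgp.Record(valid=False)
        if step < segment:  # segments may only advance, never go back
            return mgp.Record(valid=False)
        segment = step
    return mgp.Record(valid=True)

The query from option 2 would then simply filter on the yielded valid flag, and the procedure can later be extended (e.g. with costs) without touching the query.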

Related

Implementations of (fully) dynamic connectivity data structures

The dynamic connectivity problem for graphs consists in maintaining a graph data structure that allows for adding and deleting edges of the graph.
Moreover, the data structure should support connectivity queries.
Typically, such a query is of the form "Are the nodes u and v connected in the graph?"
There are variants of the dynamic connectivity problem that also support different connectivity queries like 2-edge-connectivity or biconnectivity.
My question is: Are there existing efficient implementations of dynamic connectivity data structures?
By efficient I mean data structures with low amortized operation costs.
In particular, I am NOT interested in trivial implementations with a complexity of O(n) per operation!
Below I describe in more detail what I am looking for and what I already know.
If only edge insertions are allowed, the dynamic connectivity problem can be solved by the well-known disjoint-set (a.k.a. union-find) data structure.
For this data structure there are implementations available in many different programming languages.
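For concreteness, a minimal union-find sketch in Python (path compression plus union by size); this is purely illustrative of the incremental case and not one of the ready-made implementations I am asking about.

class DisjointSet:
    """Incremental-only dynamic connectivity: add edges, query connectivity."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, v):
        # Path compression: hang nodes closer to the root while searching.
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def add_edge(self, u, v):
        # Union by size: attach the smaller tree below the larger one.
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]

    def connected(self, u, v):
        return self.find(u) == self.find(v)

With both optimizations, each operation takes near-constant amortized time (inverse Ackermann).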
Unfortunately, such readily available implementations do not seem to exist for the dynamic connectivity problem that also allows edge deletions.
The situation is even worse for data structures that also allow other connectivity queries like 2-edge- or biconnectivity.
To the best of my knowledge the algorithms presented in Holm et al. (2001) are still state of the art for many dynamic connectivity problems.
This publication was accompanied by an experimental study; however, as far as I can tell, the code was never made publicly available. Moreover, only implementations for the regular connectivity problem are discussed there, not for 2-edge-connectivity or biconnectivity.
The algorithms by Holm et al. (and also by other authors) are highly non-trivial.
Even though the algorithms are described in much detail, it requires a lot of expertise to implement them in practice.
Because of this I am looking for existing implementations of different dynamic connectivity data structures.
The table below summarizes the (currently underwhelming) implementations of different combinations of supported manipulations and queries.
Graph Manipulations               | Connectivity  | 2-edge-connectivity | Biconnectivity
incremental (adding edges)        | disjoint-set  |                     |
decremental (deleting edges)      | Rafael Glikis |                     |
fully (adding and deleting edges) |               |                     |
I have searched for implementations in different places: I have looked on GitHub, I have looked through the external links in the relevant Wikipedia articles, and I have skimmed through a lot of literature, all without any success.
I expect we will need a framework for trying things out so that we can discuss this in concrete terms.
I have implemented a small Windows application that accepts user queries to read, build, edit and query the connectivity of a graph, showing the time taken to execute each query.
Sample run:
Supported queries
add v1 v2 : add link to graph
delete v1 v2 : remove link from graph
reach src dst : find path between vertices
read filepath : input graph links from file
help : this help display
type query> read ../dat/3elt.graph.seq.txt
4720 vertices 27444 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.539246 0.539246 query
type query> delete 23 20
4720 vertices 27443 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.004432 0.004432 query
type query> add 23 20
4720 vertices 27444 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.0046639 0.0046639 query
The complete application is at https://github.com/JamesBremner/graphConnectivity
To demonstrate how this application can be used, I built it with the graph engine at https://github.com/JamesBremner/PathFinderFeb2023 and ran it on a couple of the test datasets from https://dyngraphlab.github.io/
dataset            | edge count | delete | add
3elt.graph.seq.txt | 27,443     | 5ms    | 5ms
144.graph.seq.txt  | 2,148,787  | 13ms   | 13ms
To get the average time to perform multiple queries, use the random command, like this:
Supported queries
add v1 v2 : add link to graph
add random n : add n random links to graph
delete v1 v2 : remove link from graph
reach src dst : find path between vertices
read filepath : input graph links from file
help : this help display
type query> read ../dat/3elt.graph.seq.txt
4720 vertices 27444 edges
type query> add random 10
4720 vertices 27454 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
10 1.62e-06 1.62e-05 randomAdd

Can RDF/SPARQL be used for sub-graph matching?

I would like to build a knowledge graph of a set of instances, where each instance is itself a collection of ordered sub-instances. As a simple example, let's assume my instances are chains of marbles {CHAIN1, CHAIN2, CHAIN3, ...} and the sub-instances are colored marbles {CHAIN1: YELLOW-RED-BLUE-RED; CHAIN2: BLUE-YELLOW-GREEN; CHAIN3: GREEN-RED-BLUE-RED}.
Just to clarify, an incorrect approach would be to define CHAIN1 something like this:
:CHAIN1 :has_marble :YELLOW, :RED, :BLUE, :RED
but querying this would clearly only yield a "bag of marbles" situation.
I would like to be able to:
Query the knowledge graph such that I can get back the marbles for each chain in the correct order.
Match sequences of marbles between different chains. For example, I might want to get all the chains that have the sequence :RED-:BLUE-:RED as a sub-sequence (i.e., CHAIN1 and CHAIN3).
Questions:
What would be the best way of building this knowledge graph? Should I store the marbles as RDF sequences using rdf:first/rdf:rest? Or is there a better, more flexible option? If possible, I would like to be able to define the type of relation between the marbles, say :RED :is_followed_by :BLUE.
Is the type of graph matching I'm after possible? And how about if I'd like to match the sequences using some properties that describe each marble? Say, :BLUE :has_shape :SQUARE, and match the sequence of marbles by their shape?
Note: What I really want to model are chains of DNA and protein sequences, so if anyone has specific recommendations for such applications, that would be even more helpful.
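To make the requirements concrete, here is the kind of modelling I have in mind, sketched in Python with rdflib. The predicates (ex:first_marble, ex:is_followed_by, ex:has_colour) are made up purely for illustration; the SPARQL 1.1 property-path query at the end shows the sub-sequence match I am hoping for.

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

def add_chain(g, chain_name, colours):
    """Each position in a chain gets its own marble node, so the same colour
    can occur more than once; positions are linked with ex:is_followed_by."""
    chain = EX[chain_name]
    previous = None
    for i, colour in enumerate(colours):
        marble = EX[f"{chain_name}_m{i}"]
        g.add((chain, EX.has_marble, marble))
        g.add((marble, EX.has_colour, EX[colour]))
        if previous is None:
            g.add((chain, EX.first_marble, marble))
        else:
            g.add((previous, EX.is_followed_by, marble))
        previous = marble

add_chain(g, "CHAIN1", ["YELLOW", "RED", "BLUE", "RED"])
add_chain(g, "CHAIN2", ["BLUE", "YELLOW", "GREEN"])
add_chain(g, "CHAIN3", ["GREEN", "RED", "BLUE", "RED"])

# Which chains contain the sub-sequence RED -> BLUE -> RED?
query = """
PREFIX ex: <http://example.org/>
SELECT DISTINCT ?chain WHERE {
    ?chain ex:first_marble/ex:is_followed_by* ?m1 .
    ?m1 ex:has_colour ex:RED .
    ?m1 ex:is_followed_by ?m2 . ?m2 ex:has_colour ex:BLUE .
    ?m2 ex:is_followed_by ?m3 . ?m3 ex:has_colour ex:RED .
}
"""
for row in g.query(query):
    print(row.chain)   # CHAIN1 and CHAIN3

Matching by shape instead of colour would only change the predicate tested on ?m1-?m3 (e.g. ex:has_shape).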

Apache Lucene: Creating an index between strings and doing intelligent searching

My problem is as follows: let's say I have three files, A, B, and C. Each of these files contains 100-150M strings (one per line). Each string is a hierarchical path like /e/d/f. For example:
File A (RTL):
/arbiter/par0/unit1/sigA
/arbiter/par0/unit1/sigB
...
/arbiter/par0/unit2/sigA
File B (SCH):
/arbiter_sch/par0/unit1/sigA
/arbiter_sch/par0/unit1/sigB
...
/arbiter_sch/par0/unit2/sigA
File C (Layout):
/top/arbiter/par0/unit1/sigA
/top/arbiter/par0/unit1/sigB
...
/top/arbiter/par0/unit2/sigA
We can think of file A as corresponding to circuit signals in a hardware modeling language, file B to circuit signals in a schematic netlist, and file C to circuit signals in a layout (for manufacturing).
Now a signal will have a mapping between File A <-> File B <-> File C. For example in this case, /arbiter/par0/unit1/sigA == /arbiter_sch/par0/unit1/sigA == /top/arbiter/par0/unit1/sigA. Of course, this association (equivalence) is established by me, and I don't expect the matcher to figure this out for me.
Now say I search for '/arbiter/par0/unit1/sigA'. In this case the matcher should return a direct match from file A, since the string is found there. For files B/C a direct match is not possible, so it should return the best possible matches (by edit distance?). In this example it would give /arbiter_sch/par0/unit1/sigA from file B and /top/arbiter/par0/unit1/sigA from file C.
Instead of searching for a full string, I could also give something like *par0*unit1*sigA, and it should return all possible matches from files A/B/C.
I am looking for solutions, and came across Apache Lucene. However, I am not totally sure if this would work. I am going through the docs to get some idea.
My main requirements are the following:
There will be 3 text files with full paths to signals (I can adjust the format to make it more compact if that helps build the index more quickly).
Building the index should be fairly fast (a couple of hours at most). The files above are static (no modifications).
Searching should be comprehensive. It is OK if it takes ~1 s per search, but the matching should support direct matches, regex matches, and edit-distance matching. The main challenge is that each file can have 100-150 million signals.
Can someone tell me if such a use case can be easily addressed by Lucene? What would be the correct way to go about building an index and doing quick/fast searching? I would like to write some proof-of-concept code and test the performance. Thanks.
I think based on your requirements the best solution would be a PoC with a given test set of entries. Based on this it should be possible to evaluate the indexing time you would like to achieve. Because you only use static information it's easier: you don't have to care about topics like NRT (near-real-time search).
Personally I have never used Lucene for such a big data set, but I think Lucene is able to handle it.
How I would do it:
Read tutorials and best practices about Lucene indexing and searching, and understand how it works.
Define a data set for indexing, let's say 1000 lines from each file.
Define your Lucene document structure. This is really important because your searches will be built on it. Take care with analyzer tasks like tokenization (whether you need it, and how). If you need full-text search, use a TextField.
Write code for simple indexing.
Run small tests with indexing and inspect your index with Luke.
Write code for simple searching.
Define queries and your expected results, then execute the searches and check the results.
Try to structure your code: separate indexing and searching, so it will be easier to refactor.
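Before committing to Lucene it can also help to make the matching requirements concrete with a tiny prototype. The sketch below is plain Python, not Lucene: it splits each path into its hierarchical components (the equivalent of the tokenization step above), answers *par0*unit1*sigA-style wildcard queries with fnmatch, and ranks near matches by component similarity. All names in it are made up for illustration, and a linear scan obviously will not scale to 100-150 million paths; that is exactly where Lucene's inverted index comes in.

import fnmatch
from difflib import SequenceMatcher

def load_paths(filename):
    """One signal path per line, e.g. /arbiter/par0/unit1/sigA."""
    with open(filename) as f:
        return [line.strip() for line in f if line.strip()]

def wildcard_matches(paths, pattern):
    """Glob-style matching, e.g. '*par0*unit1*sigA'."""
    return [p for p in paths if fnmatch.fnmatch(p, pattern)]

def best_matches(paths, query, top_n=3):
    """Rank paths by similarity of their '/'-separated components."""
    q_parts = query.strip("/").split("/")
    def score(path):
        return SequenceMatcher(None, q_parts, path.strip("/").split("/")).ratio()
    return sorted(paths, key=score, reverse=True)[:top_n]

# Tiny in-memory example instead of the real 100-150M line files:
sch = ["/arbiter_sch/par0/unit1/sigA", "/arbiter_sch/par0/unit1/sigB"]
print(wildcard_matches(sch, "*par0*unit1*sigA"))
print(best_matches(sch, "/arbiter/par0/unit1/sigA", top_n=1))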

Aligning metagenomic reads (OTU representatives) to my phylogenetic tree ~AND~ evaluating how multiple samples align

So...
I am working with amino acid sequences.
I got a fasta file containing reference sequences and a txt file giving the categorization (domain, subdomain, nickname, etc.) for each sequence. First, I need to construct a tree with these sequences to check that they group according to the categories they belong to. Example: sequences belonging to domain A group together, sequences belonging to subdomain A1 group together, and so on.
Then... I got a fasta file containing OTU representatives derived from metagenomic reads AND an OTU count table declaring how many of each of these representatives are present in each of my 20 samples. I need to align these OTU sequences to the previously constructed tree ~AND~ take the per-sample counts into account, to check which categories each sample is enriched for.
What software should I use?
Thank you!
We used MAFFT for alignment and FastTree to build phylogenetic trees, but there are many other options.
We used pplacer to add short amino acid sequences derived from metagenomic reads to a tree. Not sure about the abundance part, but you could get started from there.

Recursive Hierarchy Ranking

I have no idea if I wrote that correctly. I want to start learning higher-end data-mining techniques, and I'm currently using SQL Server and Access 2016.
I have a system that tracks ID cards. Each ID is tagged to one particular level of a security hierarchy, which has many branches.
For example
Root
- Maintenance
  - Management
    - Supervisory
    - Manager
    - Executive
  - Vendors
    - Secure
    - Per Diem
    - Inside Trades
There are many other departments like Maintenance, some simple, some with much more convoluted hierarchies.
Each ID card is tagged to a level, so in the Maintenance example a card might be tagged to Per Diem:Vendors:Maintenance:Root. Others may be tagged just to Vendors, some to Maintenance itself (no one has Root, thank god).
So let's say I have 20 ID cards selected; these are available personnel I can task to a job, but since they have different areas of security I want to find commonalities they can all work on together as a 20-person group, or whatever other groupings I can make.
So the intended output would be
CommonMatch = - Per Diem
CardID = 1
CardID = 3
CommonMatch = Vendors
CardID = 1
CardID = 3
CardID = 20
So in the example above, while I could have two people working on Per Diem work, because that is their lowest common security level, card holder #20 also has rights to the parent group (Vendors) that 1 and 3 share, so I could have all three of them work at that level.
I'm not looking for anyone to do the work for me (although examples are always welcome), more to point me in the right direction on what I should be studying, what the thing I'm trying to do is called, etc. I know CTEs are one way to go, but that seems like only a single tool in a much bigger process.
Thank you all in advance
Well, it is not so much a graph-theory or data-mining problem but rather a data-structure problem and one that has almost solved itself.
The objective is to be able to partition the set of card IDs into disjoint subsets given a security clearance level.
So, the main idea here would be to lay out the hierarchy tree and then assign each card ID to the path implied by its security clearance level. For this purpose, each node of the hierarchy tree now becomes a container of card IDs (i.e. each node of the hierarchy tree holds (a) its own name, as a unique identifier, (b) pointers to other nodes, and (c) a list of card IDs assigned to its name).
Then, retrieving the set of cards with clearance UP TO a specific security level is simply a case of traversing the tree from that specific level down to the tree's leaves, collecting the card IDs from the node containers as they are encountered.
Suppose that we have the access tree:
A
+--B
+--C
+--D
   +--E
And card ID assignments:
B:[1,2,3]
C:[4,8]
E:[10,12]
At the moment B, C, E only make sense as tags; there is no structural information associated with them. We therefore need to "build" the tree first. The following example uses Networkx, but the same thing can be achieved in many other ways:
import networkx
G = networkx.DiGraph() #Establish a directed graph
G.add_edge("A","B")
G.add_edge("A","C")
G.add_edge("A","D")
G.add_edge("D","E")
Now, assign the card IDs to the node containers (in Networkx each node carries an attribute dictionary, so I am going to store a very simple list under a "cards" key):
G.nodes["B"]["cards"] = [1, 2, 3]
G.nodes["C"]["cards"] = [4, 8]
G.nodes["E"]["cards"] = [10, 12]
So, now, to get everybody working under "A" (the root of the tree), you can traverse the tree from that level downwards, either via Depth First Search (DFS) or Breadth First Search (BFS), and collect the card IDs from the containers. I am going to use DFS here, purely because Networkx has a function that returns the visited nodes in visiting order directly.
# dfs_preorder_nodes returns a generator, an efficient way of iterating very large collections in Python, but I am casting it to a list here so that we get the actual list of nodes back.
vis_nodes = list(networkx.dfs_preorder_nodes(G, "A"))  # Start from node "A" and DFS downwards
cardIDs = []
# I could do the following with a one-line reduce but it might be clearer this way
for aNodeID in vis_nodes:
    if "cards" in G.nodes[aNodeID]:
        cardIDs.extend(G.nodes[aNodeID]["cards"])
At the end of the above iteration, cardIDs will contain all card IDs from branch "A" downwards in one convenient list.
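If you also want the per-level groupings from the question's intended output (CommonMatch = ..., CardID = ...), the same traversal can simply be repeated from every node. A small illustrative extension of the code above (the cards_under helper is my own naming, not part of Networkx):

def cards_under(G, level):
    """All card IDs with clearance at `level` or anywhere below it."""
    ids = []
    for node in networkx.dfs_preorder_nodes(G, level):
        ids.extend(G.nodes[node].get("cards", []))
    return ids

# One candidate grouping per hierarchy level, mirroring the CommonMatch output:
for level in G.nodes:
    group = cards_under(G, level)
    if group:
        print(f"CommonMatch = {level}: CardIDs {sorted(group)}")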
Of course, this example is ultra simple, but since we are talking about trees, the tree can be as large as you like and you are still traversing it in the same way requiring only a single point of entry (the top level branch).
Finally, just as a note, the fact that you are using Access as your backend is not necessarily an impediment, but relational databases do not handle graph-type data with great ease. You might get away with it for something simple like a tree (what you have here, for example), but the hassle of supporting this probably justifies doing the processing outside of the database (e.g. use the database just for retrieving the data and carry out the graph-type processing in a different environment; doing a DFS in SQL is the sort of hassle I am referring to).
Hope this helps.