Detect unnecessary explicit relation between nodes - cypher

I have a group node structure where a node inherits permissions from the previous node:
Manager ---Implies---> PowerUser ---Implies---> User
But the structure is not clean, and sometimes I have two edges from Manager: one to PowerUser and another directly to User, which is already implied by PowerUser:
Manager ---Implies---> PowerUser ---Implies---> User
Manager ---Implies----------------------------> User
How can I query the nodes to detect that an implicit relation already exists, so that the extra explicit relation is not needed?

It looks like PowerUser is a label in your structure. You can write a variable-length traversal [1] that considers paths of any length so long as this label appears somewhere:
MATCH (entity)-[*0..]->(:PowerUser)
Would be the broadest traversal to accomplish this, matching any node connected to the PowerUser label by 0 or more edges.
Given that PowerUser is a permission, however, it seems like a more appropriate design would be to treat it as a property. Since Cypher is schema-less, properties are not scoped to specific labels, so it can be set and filtered on nodes with the Manager or User labels. This approach would allow more succinct expressions like:
MATCH (entity {PowerUser: true})-[]->()
If this doesn't match your use case, feel free to provide more details about your graph structure!
[1] https://oss.redislabs.com/redisgraph/commands/#variable-length-relationships
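For the redundancy check itself, here is a minimal sketch in Python that runs outside the database. It assumes the Implies edges have been exported as (source, target) pairs; the function name redundant_edges is hypothetical and not part of any graph library. An edge is reported as redundant when its target can still be reached from its source through some longer path.

```python
def redundant_edges(edges):
    """Return the edges whose target is also reachable via a longer path."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)

    def reachable_without_direct_edge(src, dst):
        # Start from every neighbour of src except dst, so the direct
        # src -> dst edge is never used; src is marked as seen so a cycle
        # cannot lead back to it and sneak the direct edge in.
        stack = [n for n in graph.get(src, ()) if n != dst]
        seen = set(stack) | {src}
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return sorted(e for e in set(edges) if reachable_without_direct_edge(*e))


edges = [
    ("Manager", "PowerUser"),
    ("PowerUser", "User"),
    ("Manager", "User"),  # already implied via PowerUser
]
extra = redundant_edges(edges)
```

With the three edges from the question, extra contains only the direct Manager to User edge.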

Cypher BFS with multiple Relations in Path

I'd like to model autonomous systems and their relationships in Graph Database (memgraph-db)
There are two different kinds of relationships that can exist between nodes:
undirected peer2peer relationships (edges without arrows in image)
directed provider2customer relationships (arrows pointing to provider in image)
The following image shows valid paths that I want to find with some query
They can be described as
(s)-[:provider*0..n]->()-[:peer*0..n]-()<-[:provider*0..n]-(d)
or in other words
0-n c2p edges followed by 0-n p2p edges followed by 0-n p2c edges
I can fix the first and last node and would like to find a (shortest/cheapest) path. As I understand I can do BFS if there is ONE relation on the path.
Is there a way to query for paths of such form in Cypher?
As an alternative I could do individual queries where I specify the length of each of the segments and then do a query for every length of path until a path is found.
i.e.
MATCH (s)<-[]->(d) // All one hop paths
MATCH (s)-[:provider]->()-[:peer]-(d)
MATCH (s)-[:provider]->()<-[:provider]-(d)
...
Since it's viable to have 7 different path sections, I don't see how 3 BFS patterns (... BFS*0..n) would yield a valid solution. It's impossible to have an empty path because the pattern contains some nodes between them (I have to double-check that).
Writing individual patterns is not great.
Some options are:
MATCH path=(s)-[:BFS*0..n]-(d) WHERE {{filter_expression}} -> The expression has to be quite complex in order to yield only valid paths.
MATCH path=(s)-[:BFS*0..n]-(d) CALL module.filter_procedure(path) -> The module.filter_procedure(path) could be implemented in Python or C/C++. Please take a look here. I would recommend starting with Python since it's much easier, and Python should be fine for the PoC. I would also recommend starting with this option because I'm pretty confident the solution will work, and it's modular: the filter_procedure could be extended easily, while the query stays the same.
Could you please provide a sample dataset in a format of a Cypher query (a couple of nodes and edges / a small graph)? I'm glad to come up with a solution.
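As a rough illustration of what such a filter could check, here is a sketch in plain Python. It deliberately does not use the Memgraph query module API; it assumes the path has already been turned into a list of (edge_type, direction) steps, where direction is "out" when the edge is traversed along its arrow and "in" when traversed against it. The function name is_valid_as_path and this step representation are hypothetical. The check is a small state machine: provider edges going up, then peer edges, then provider edges going down, never stepping back to an earlier phase.

```python
def is_valid_as_path(steps):
    """Check the pattern: 0-n c2p edges, then 0-n p2p edges, then 0-n p2c edges."""
    # Phase 0: customer -> provider, phase 1: peer, phase 2: provider -> customer
    provider_phase = {"out": 0, "in": 2}
    phase = 0
    for edge_type, direction in steps:
        if edge_type == "peer":
            step_phase = 1  # peer edges are undirected, direction is ignored
        elif edge_type == "provider" and direction in provider_phase:
            step_phase = provider_phase[direction]
        else:
            return False  # unknown edge type or direction
        if step_phase < phase:
            return False  # the path went back to an earlier phase
        phase = step_phase
    return True


up_peer_down = [("provider", "out"), ("peer", "any"), ("provider", "in")]
down_then_up = [("provider", "in"), ("provider", "out")]
ok = is_valid_as_path(up_peer_down)
bad = is_valid_as_path(down_then_up)
```

Because each segment may be empty, a path made only of peer edges, or an empty path, also passes the check.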

O.O.P and Class Properties

I'm new to O.O.P and would like advice on best practice.
Say, for example, I have a Course class which holds course information, and a Location class which holds location details. Both classes have corresponding repository classes. Each Course HAS A Location, which I have added as a property.
When I am pulling the details of a Course from the database, is it best practice to:
A – Populate the Location object from within the CourseRepository Class meaning SQL would return both course and location details
B – Only populate Course object, returning the Location ID, then use the LocationRepository class to find the location details
I’m leaning more towards B as this is a separation of responsibility, however, the thing that’s getting me is performance. Say I need a List instead which returns a result of 50. Would it be wise to query SQL 50 times to seek location details? Would appreciate your thoughts on this.
Lewis
In part, you're thinking in the wrong conceptual direction. It should be: one location can have many courses, not the reverse.
That said, theoretically, a Course domain object should not contain a Location as a class member, just a location id. On the other hand, a Location domain object could contain an array of Course objects as a class member, if needed. Do you see the difference?
Now, in your case, do pass a Location as an argument to a Course object. In the Course repository, define a method like fetchCoursesWithLocations() in which you run only one SQL query to fetch the 50 courses TOGETHER WITH the corresponding location details, based on your criteria, into an array. Then loop through the array of records. For each record, build a Location object and a Course object (to which you pass the Location object as an argument). Then add each Course object created this way to another array holding all resulting Course objects, or to a CourseCollection object (which I recommend). In the end, return the Courses array (or the CourseCollection content) from the method.
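The single-query approach described above could look roughly like the following Python sketch. The table names (course, location), column names, and sample rows are assumptions made up for illustration, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Location:
    id: int
    name: str


@dataclass
class Course:
    id: int
    title: str
    location: Location


class CourseRepository:
    def __init__(self, conn):
        self.conn = conn

    def fetch_courses_with_locations(self, limit=50):
        # One JOIN query returns the course and location columns together,
        # so no per-course lookup of the location is needed.
        rows = self.conn.execute(
            "SELECT c.id, c.title, l.id, l.name "
            "FROM course AS c JOIN location AS l ON l.id = c.location_id "
            "ORDER BY c.id LIMIT ?",
            (limit,),
        ).fetchall()
        # Build a Location and a Course from each record.
        return [
            Course(cid, title, Location(lid, lname))
            for cid, title, lid, lname in rows
        ]


# Hypothetical schema and sample data, only for demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course (
    id INTEGER PRIMARY KEY,
    title TEXT,
    location_id INTEGER REFERENCES location(id)
);
INSERT INTO location VALUES (1, 'Leeds'), (2, 'York');
INSERT INTO course VALUES (10, 'Algebra', 1), (11, 'Biology', 2), (12, 'Chemistry', 1);
""")
courses = CourseRepository(conn).fetch_courses_with_locations()
```

A plain list plays the role of the CourseCollection here; a dedicated collection class could wrap it if needed.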
All of this is somewhat too complex to present here, but here are three great articles (a series) that will make the whole process very clear to you. You'll also find out there what a CourseCollection should look like. The articles (from the second one onwards) use the term "Mapper", which I'm pretty sure is the same as your "repository". Actually, there are two abstraction layers for data access in the db: mappers and repositories. Plus the adapters.
Look at the part with the PostMapper and the CommentMapper. They are the parallels to your CourseRepository and LocationRepository, respectively. The Post and Comment models (domain objects!) play the same roles: they are the parallels to your Course and Location.
The articles are:
Building a Domain Model - An Introduction to Persistence Agnosticism
Building a Domain Model - Integrating Data Mappers
Handling Collections of Aggregate Roots - the Repository Pattern

Recursive Hierarchy Ranking

I have no idea if I wrote that correctly. I want to start learning higher end data mining techniques and I'm currently using SQL server and Access 2016.
I have a system that tracks ID cards. Each ID is tagged to one particular level of a security hierarchy, which has many branches.
For example
Root
  - Maintenance
    - Management
      - Supervisory
      - Manager
      - Executive
    - Vendors
      - Secure
      - Per Diem
      - Inside Trades
There are many other departments like Maintenance, some simple, some with much more convoluted, hierarchies.
Each ID card is tagged to a level so in the Maintenance example, - Per Diem:Vendors:Maintenance:Root. Others may be just tagged to Vendors, Some to general Maintenance itself (No one has root, thank god).
So let's say I have 20 ID cards selected. These are available personnel I can task to a job, but since they have different areas of security, I want to find the commonalities that let them all work together as a 20-person group, or whatever other groupings I can make.
So the intended output would be
CommonMatch = - Per Diem
CardID = 1
CardID = 3
CommonMatch = Vendors
CardID = 1
CardID = 3
CardID = 20
So in the example above, while I could have 2 people working on -Per Diem work, because that is their lowest common security similarity, there is also card holder #20 who has rights to the predecessor group (Vendors), that 1 and 3 share, so I could have three of them work at that level.
I'm not looking for anyone to do the work for me (Although examples always welcome), more to point me in the right direction on what I should be studying, what I'm trying to do is called, etc. I know CTE's are a way to go but that seems like only a tool in a much bigger process that needs to be done.
Thank you all in advance
Well, it is not so much a graph-theory or data-mining problem but rather a data-structure problem and one that has almost solved itself.
The objective is to be able to partition the set of card IDs into disjoint subsets given a security clearance level.
So, the main idea here would be to lay out the hierarchy tree and then assign each card ID to the path implied by its security clearance level. For this purpose, each node of the hierarchy tree now becomes a container of card IDs: each node holds a) its own name (as unique identification), b) pointers to other nodes, and c) a list of card IDs assigned to its "name".
Then, retrieving the set of cards with clearance UP TO a specific security level is simply a case of traversing the tree from that specific level downwards to the tree's leaves, collecting the card IDs from the node containers as they are encountered.
Suppose that we have access tree:
A
+-B
+-C
+-D
  +-E
And card ID assignments:
B:[1,2,3]
C:[4,8]
E:[10,12]
At the moment, B, C and E only make sense as tags; there is no structural information associated with them. We therefore need to first "build" the tree. The following example uses Networkx, but the same thing can be achieved in a multitude of ways:
import networkx
G = networkx.DiGraph() #Establish a directed graph
G.add_edge("A","B")
G.add_edge("A","C")
G.add_edge("A","D")
G.add_edge("D","E")
Now, assign the card IDs to the nodes. In Networkx, each node carries a dictionary of attributes, so I am going to store a very simple list under a "cards" key:
G.nodes["B"]["cards"] = [1, 2, 3]
G.nodes["C"]["cards"] = [4, 8]
G.nodes["E"]["cards"] = [10, 12]
So, now, to get everybody working under "A" (the root of the tree), you can traverse the tree from that level downwards either via Depth First Search (DFS) or Breadth First Search (BFS) and collect the card IDs from the containers. I am going to use DFS here, purely because Networkx has a function that directly returns the visited nodes in visiting order.
#dfs_preorder_nodes returns a generator, which is an efficient way of iterating over very large collections in Python, but I am casting it to a "list" here so that we get the actual list of nodes back.
vis_nodes = list(networkx.dfs_preorder_nodes(G, "A"))  #Start from node "A" and DFS downwards
cardIDs = []
#I could do the following with a one-line reduce but it might be clearer this way
for aNodeID in vis_nodes:
    #Nodes without any assigned cards contribute an empty list
    cardIDs.extend(G.nodes[aNodeID].get("cards", []))
At the end of the above iteration, cardIDs will contain all card IDs from branch "A" downwards in one convenient list.
Of course, this example is ultra simple, but since we are talking about trees, the tree can be as large as you like and you are still traversing it in the same way requiring only a single point of entry (the top level branch).
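To connect this back to the output asked for in the question, the same downward traversal can be repeated for every node. The sketch below uses plain dictionaries instead of Networkx so it has no dependencies; the part of the hierarchy shown and the card assignments (cards 1 and 3 on Per Diem, card 20 on Vendors) are taken from the question's example, while the names children, cards_at and cards_at_or_below are hypothetical.

```python
# Part of the hierarchy from the question: parent -> list of children
children = {
    "Root": ["Maintenance"],
    "Maintenance": ["Vendors"],
    "Vendors": ["Per Diem"],
    "Per Diem": [],
}

# Cards tagged directly to a level
cards_at = {"Per Diem": [1, 3], "Vendors": [20]}


def cards_at_or_below(node):
    """Collect the cards tagged to this node or to any node beneath it."""
    found = list(cards_at.get(node, []))
    for child in children.get(node, []):
        found.extend(cards_at_or_below(child))
    return sorted(found)


# One "CommonMatch" group per level of the hierarchy
common = {node: cards_at_or_below(node) for node in children}
```

Here common["Per Diem"] holds cards 1 and 3, and common["Vendors"] holds cards 1, 3 and 20, matching the groupings in the question.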
Finally, just as a note, the fact that you are using Access as your backend is not necessarily an impediment, but relational databases do not handle graph-type data with great ease. You might get away with it for something like a simple tree (like what you have here, for example), but the hassle of supporting this probably justifies doing the processing outside of the database: use the database just for retrieving the data and carry out the graph-type processing in a different environment. Doing a DFS in SQL is the sort of hassle I am referring to.
Hope this helps.

ServerSide Sorting control prevents retrieval of operational attributes

If I use a ServerSideSort control, the entries are sorted nicely for me, but I can't retrieve any operational attributes, even if I specifically request them, e.g. "entryUUID", or "+". All I get are the ordinary attributes. If I remove the SSS control, I get the operational attributes as I always did before.
Is this a known feature of the SSS specification? or a known problem in OpenLDAP 2.4.30?
This condition, and those described in your comments, sounds like a server software defect. An LDAP compliant server should either:
return the entries with the attributes as requested, even if unsorted (criticality false), or
return unavailableCriticalExtension and no entries (criticality true)
As for sorting on an operational attribute like entryUUID, several servers I tested refused to sort, but did return results (with criticality false).
Perhaps you could export the data to an LDIF file and deal with your entries with single digit entryUUID, and re-import the data.

What do people use for CN with inetOrgPerson in LDAP directories

I've been using givenName+" "+surname for the CN field and I woke up screaming last night 'what about John Smith'? I can imagine any large organization employing multiple people with the same name. So of course this isn't going to work. What do people use instead?
EDIT Note: in inetOrgPerson the CN is part of the DN.
EDIT Note: in this situation I am expecting to grow to hundreds of thousands of user entries.
In an LDAP directory, whether it's OpenLDAP or Active Directory, the rule is that a Distinguished Name (DN) must be unique, independently of the attribute (or attributes) used to form the Relative Distinguished Name (RDN).
How do people make sure that it's unique:
I would say that in a small business the person who creates the entry in the directory guarantees that it's unique, first by knowledge, second by a preliminary search. If a duplicate appears, they find a solution like 'John E Smith'. With this approach, if the name changes (marriage, divorce, etc.), the LDAP record has to "move" from one DN to another. It's better to avoid changing the DN of an entry whenever possible, but in a small directory it's not important.
In a medium business, uniqueness is most often provided by the employee ID coming from human resources, for example FR12345678. In big companies, I have seen people logging in with their employee ID. For what I describe here, it's more standard to use the uid attribute instead of cn to name an object (but some directories don't let you choose the naming attribute; I think it's an X.500 feature).
In most directories (not in AD) you can use more than one attribute to compose the RDN. For example, sn=Assin+TelephoneNumber=1234 is a valid RDN in OpenLDAP, and it can make sense in a PBX.
One more thing
In some directories (designed for system administration), the server enforces that certain attributes are unique across the whole tree. That's the case for sAMAccountName and userPrincipalName in Active Directory, which are used for login purposes. Using the CN attribute with "givenName surname" obliges the administrators to guarantee uniqueness themselves. In OpenLDAP you can use the unique overlay for that; in the database definition in slapd.conf, add:
# index since the unique overlay will search for matching mail attributes
index mail eq
overlay unique
unique_attributes mail
If the unique overlay is not compiled in, you'll need to recompile with:
./configure ... --enable-unique
Adding to JPBlanc's answer with some of my experience. We have several LDAP servers/trees where I work. Our AD server uses the DisplayName as the value of the CN. Out of 4K+ users we have only had a few instances where duplicates have occurred. I believe the default action there is to tack a 1 onto the value if there is a dupe. It is surprisingly rare, even with a high turnover rate in the largest section of that user base. We have two different eDirectory trees that are linked to each other, and those use the username. The username is first initial + last name. Any duplicates there have an incrementing number attached to them. As you can imagine, that happens a lot with the Browns and the Smiths and other common names. Another tree, an AD LDS (formerly ADAM) directory, uses a uniquely generated number for each new entry as the CN. It is basically an auto-incremented number that is controlled by an external loading process. Lastly, we have a directory for external partners (think independent agents) that uses a combination of email address + an ID number as the CN.
I do a lot of maintenance work on the user bases and my least favorite scheme is the externally generated number. If I get a support call about Joe Brown in all of the other systems I can at least have an idea of where I need to browse to find him. Sure a simple search filter will give me all of the Browns but I still have to write it and execute it. So my advice is to use some part of the name for the CN and ensure uniqueness somehow. From an administration point of view it will be a bit easier. Really the CN is important but you'll be dealing with the rest of the user attributes far more so don't sweat it too badly.
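As an illustration of the "first initial + last name, with an incrementing number for duplicates" scheme mentioned above, here is a minimal Python sketch. The function name make_unique_cn and the in-memory set of taken values are hypothetical; in a real deployment the existence check would be a search against the directory, and the server should still enforce uniqueness.

```python
def make_unique_cn(given_name, surname, existing):
    """Build first-initial + surname and append a number if it is taken."""
    base = (given_name[0] + surname).lower()
    candidate = base
    suffix = 1
    while candidate in existing:
        candidate = f"{base}{suffix}"
        suffix += 1
    existing.add(candidate)  # remember the value so later calls see it
    return candidate


taken = set()
first = make_unique_cn("John", "Smith", taken)
second = make_unique_cn("Jane", "Smith", taken)
third = make_unique_cn("Jack", "Smith", taken)
```

The value still contains part of the person's name, which keeps entries easy to find when browsing.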