DAG: Minimizing distance between entries in grouped nodes - optimization

I have a directed acyclic graph with nodes that are lists of entries that connect to entries in other nodes. Kind of like this:
entry ]
entry--| ] node 1
entry | ]
----- |
entry<-| ] node 2
entry | ]
----- |
entry | ] node 3
entry--| ]
The ordering of entries within a node is fixed. The entries are stored in an array with absolute indexes to the entries they link to. There is a maximum of 1 link per entry, and every node has at least 1 link. (in other words, this is a highly connected graph). The graph contains approximately 100,000 entries grouped in 40,000 nodes.
What I need to do is minimize the maximum distance between entries by reordering nodes so that I can use relative indexes for the links and compress the underlying data structure.
Because compression and performance are the goal, solutions that add external data (jump tables, special jump elements in the list) are unacceptable. I really need an algorithm for reordering nodes that minimizes the maximum distance between linked entries. Any thoughts?

The problem you are describing is minimizing the maximum distance. I think it's NP-hard, so a simple solution won't be very good. You could, however, model it as an ILP (integer linear programming) problem and hand it to a solver.
You would then minimize M as the objective.
The constraints would be M >= abs(s_i - e_i) for all links l_i, where s_i and e_i are the absolute indices of the start and end entries of link l_i.
These indices can be rewritten in terms of the node they belong to as s_i = n_i + c_i, with n_i the index of the node that s_i belongs to and c_i the fixed offset of the entry within that node; e_i is rewritten similarly. Then you let the solver optimize the n_i.
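For illustration, here is a minimal sketch of that formulation using the PuLP library (the toy data, variable names and the choice of PuLP are my own assumptions, not part of the question). Note that the constraints forcing the nodes to occupy disjoint, consecutive index ranges (i.e. to form a valid reordering) are deliberately omitted; they are the hard part and would need a big-M or assignment-style encoding.

import pulp

# Toy data (hypothetical): node sizes, and links between (node, offset) entries.
node_sizes = {0: 3, 1: 2, 2: 2}
links = [((0, 1), (1, 0)),   # entry 1 of node 0 links to entry 0 of node 1
         ((2, 1), (1, 0))]   # entry 1 of node 2 links to entry 0 of node 1

prob = pulp.LpProblem("min_max_link_distance", pulp.LpMinimize)

# n[i] = absolute index of the first entry of node i after reordering.
n = {i: pulp.LpVariable(f"n_{i}", lowBound=0, cat="Integer") for i in node_sizes}
M = pulp.LpVariable("M", lowBound=0)
prob += M  # objective: minimize the maximum link distance

# M >= |(n_a + c_a) - (n_b + c_b)|, linearised as two constraints per link.
for (a, ca), (b, cb) in links:
    prob += (n[a] + ca) - (n[b] + cb) <= M
    prob += (n[b] + cb) - (n[a] + ca) <= M

# NOTE: constraints making the n[i] a valid non-overlapping packing of the nodes
# into positions 0..total_entries-1 are intentionally left out of this sketch.
prob.solve()
print({i: n[i].value() for i in node_sizes}, M.value())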

Hash tables Time Complexity Confusion

I just started learning about hash dictionaries. Currently we are implementing a hash dictionary with separate chaining: each bucket holds a chain (linked list). The book posed this problem and I am having a lot of trouble figuring it out. Imagine we have an initial table size of 10, i.e. 10 buckets. If we want to know the time complexity for n insertions and a single lookup, how do we figure this out? (Assume a pointer access is one unit of time.)
It poses three scenarios:
A hash dictionary that does not resize, what is the time complexity for n insertions and 1 lookup?
A hash dictionary that resizes by 1 when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
A hash dictionary that resizes by doubling the table size when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
My initial thoughts had me really confused. I couldn't quite figure out how to know the length of a given chain for an insertion. Assuming a chain of length k (I thought), there is the pointer access of the for loop going through the whole chain, so k units of time. Then, in each iteration, the insert checks whether the current node's data is equivalent to the key being inserted (if it exists, overwrite it), so either 2k units of time if not found, or 2k+1 if found. Then, it does 5 pointer accesses to prepend the element. So, 2k+5 (not found) or 2k+1 (found) to insert once. Thus, O(kn) for the first scenario for n insertions. To look up, it seems to be 2k+1 or 2k, so for 1 lookup, O(k). I don't have a clue how to approach the other two scenarios. Some help would be great. Once again, to clarify: k isn't mentioned in the problem. The only facts given are an initial size of 10 and the information in the scenarios, so k can't appear in the results for the time complexity of n insertions or 1 lookup.
If you have a hash dictionary, then your insert, delete and search operations will take O(n) time for 1 key in the worst-case scenario. For n insertions it would be O(n^2). It doesn't matter what the size of your table is, because in the worst case every key lands in the same bucket:
|--------|
|element1| -> element2 -> element3 -> element4 -> element5
|--------|
| null |
|--------|
| null |
|--------|
| null |
|--------|
| null |
|--------|
Now for the average case.
Scenario one has a fixed number of slots (say m), so the load factor is n/m. Therefore, one insert will be O(1 + n/m): 1 for the hash function computation and n/m for walking the chain.
For the 2nd and 3rd scenarios it should be O(1 + n/m + 1) and O(1 + n/2m) respectively.
As for your confusion, ask yourself what the expected chain length is for an arbitrary set of keys. The answer is that we can't be sure at all.
That's where the idea of the load factor comes in to define the average case: we give each slot equal probability of receiving a key, so chains form evenly once the number of keys exceeds the slot count.
Imagine we have an initial table size of 10 ie 10 buckets. If we want to know the time complexity for n insertions and a single lookup, how do we figure this out?
When we talk about time complexity, we're looking at the steepness of the n-vs-time-for-operation curve as n approaches infinity. In the case above, you're saying there are only ten buckets, so - assuming the hash function scatters the insertions across the buckets with near-uniform distribution (as it should), n insertions will result in 10 lists of roughly n/10 elements.
During each insertion, you can hash to the correct bucket in O(1) time. Now - a crucial factor here is whether you want your hash table implementation to protect you against duplicate insertions.
If you simply trust there will be no duplicates, or the hash table is allowed to have duplicates (e.g. C++'s unordered_multiset), then the insertion itself can be done without inspecting the existing bucket content, at an accessible end of the bucket's list (i.e. using a head or tail pointer), also in O(1) time. That means the overall time per insertion is O(1), and the total time for n insertions is O(n).
If the implementation must identify and avoid duplicates, then for each insertion it has to search along the existing linked list, the size of which is related to n by a constant #buckets factor (1/10) and varies linearly during insertion from 1 to 1/10 of the final number of elements, so on average is n/2/10, which - removing constant factors - simplifies to n. In other words, each insertion is O(n).
Presumably the question intends to ask the time for a single lookup done after all elements are inserted: in that case you have the 10 linked lists of ~n/10 length, so the lookup will hash to one of those lists and then on average have to look half way along the list before finding the desired value: that's roughly n/20 elements searched, but as /20 is a constant factor it can be dropped, and we can say the average complexity is O(n).
A hash dictionary that does not resize, what is the time complexity for n insertions and 1 lookup?
Well, we discussed that above with our hash table size stuck at 10.
A hash dictionary that resizes by 1 when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
Say the table has 100 buckets and 80 elements, and you insert an 81st element: it resizes to 101, the load factor is then about .802 - should it immediately resize again, or wait until doing another insertion? Anyway, ignoring that - each resize operation involves visiting, rehashing (unless the elements or nodes cache the hash values), and "rewiring" the linked lists for all existing elements: that's O(s) where s is the size of the table at that point in time. And you're doing that once or twice (depending on your answer to the "immediately resize again" behaviour above) for s values from 1 to n, so s averages n/2, which simplifies to n. The insertion itself may or may not involve another iteration of the bucket's linked list (you could optimise to search while resizing). Regardless, the overall time complexity is O(n^2).
The lookup then takes O(1), because the resizing has kept the load factor below a constant amount (i.e. the average linked list length is very, very short, even ignoring the empty buckets).
A hash dictionary that resizes by doubling the table size when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
If you consider the resultant hash table there with n elements inserted, about half the elements will have been inserted without needing to be rehashed, while about a quarter will have been rehashed once, an eighth rehashed twice, a sixteenth rehashed 3 times, a 32nd rehashed 4 times: if you sum up that series - 1/4 + 2/8 + 3/16 + 4/32 + 5/64 + 6/128... - it approaches 1 as n goes to infinity. In other words, the average amount of repeated rehashing/relinking work done per element in the final table doesn't increase with n - it's constant. So, the total time to insert is simply O(n). Then, because the load factor is kept below 0.8 - a constant rather than a function of n - the lookup time is O(1).
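To make the difference between the last two scenarios concrete, here is a small counting sketch (my own illustration, not part of the answers above): it tallies only the element-rehash work caused by resizing under the 0.8 load-factor trigger, for the grow-by-1 and doubling policies.

def rehash_work(n, policy):
    # Count element moves caused by resizing while inserting n elements,
    # starting from 10 buckets and resizing when the load factor exceeds 0.8.
    size, count, work = 10, 0, 0
    for _ in range(n):
        count += 1
        while count / size > 0.8:
            work += count                      # every element is rehashed/relinked
            size = size + 1 if policy == "by_one" else size * 2
    return work

for n in (1_000, 10_000, 100_000):
    print(n, rehash_work(n, "by_one"), rehash_work(n, "doubling"))
# "by_one" grows roughly quadratically with n, "doubling" roughly linearly.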

Does anybody know of an efficient algorithm that removes unconnected single edges from an edge list?

I've got a directed graph with a large number of unconnected edges; essentially, both the source and the destination appear a single time in the data set. In graph-theory terms, removing one of these edges from the dataset decreases the number of connected components by 1. I'm trying to find an algorithm that might be more efficient than what I'm currently doing in preprocessing (distributed environment), for which the pseudocode is:
Group By Count Sources
Group By Count Destinations
Join SourceCount and DestCount to edgelist
Filter on edges where (SourceCount + DestCount) > 1
Anybody know of a more efficient way of doing this?
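For reference, a single-machine sketch of the counting-and-filtering approach described above, using Python's collections.Counter (the edge representation as (src, dst) tuples is my own assumption):

from collections import Counter

def drop_isolated_edges(edges):
    # Remove edges whose source and destination each appear only once in the
    # whole edge list, i.e. edges that form their own connected component.
    occurrences = Counter()
    for src, dst in edges:
        occurrences[src] += 1
        occurrences[dst] += 1
    return [(src, dst) for src, dst in edges
            if occurrences[src] > 1 or occurrences[dst] > 1]

edges = [(1, 2), (2, 3), (7, 8)]      # (7, 8) is an isolated edge
print(drop_isolated_edges(edges))      # [(1, 2), (2, 3)]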

Is a given key a member of a binary tree - probabilistic answer

The Problem:
Given a BST with N nodes, with a domain of cardinality D (domain being the possible values for the node keys).
Given a key that is in the domain but may or may not be a member of the BST.
At the start, our confidence that the node is in the tree should be 1/D, but as we go deeper into the tree both D and N are split approximately in half. That would suggest that our confidence that our key is a member of the tree should remain constant until we hit the bottom or discover the key. However, I'm not sure if that reasoning is complete, since it seems more like we are choosing N nodes from D.
I was thinking something along the lines of this, but the reasoning here still doesn't seem complete. Can somebody point me in the right direction?
A priori, the probability that your key is in the tree is N/D.
Without loss of generality, let's assume that the node values range over [1..D].
When you walk down the tree, either:
The current node matches your key, hence P = 1
The current node has value C which is larger than your key, you go left, but you don't know how many items are in the left sub-tree. Now you can make one of these assumptions:
The tree is balanced. The range in the subtree is [1..C-1], and there are about (N-1)/2 nodes in the subtree. Hence, P = ((N-1)/2)/(C-1)
The tree is not balanced. The range in the subtree is [1..C-1], and the maximum likelihood estimation for the number of nodes in the subtree is N * (C-1)/D. Hence, P = (N*(C-1)/D)/(C-1) = N/D. (no change)
If you know more about how the tree was constructed - you can make a better MLE for the number of nodes in the subtree.
The current node has value C which is smaller than your key, you go right, but you don't know how many items are in the right sub-tree.
...
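As a rough illustration of the balanced-tree estimate above, here is a small sketch (my own, with a hypothetical Node class; under the MLE assumption the estimate would instead stay at N/D the whole way down):

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def confidence_path(root, key, n, lo, hi):
    # Yield the running estimate of P(key is in the tree) after each step down,
    # assuming a balanced tree: node count and value range both roughly halve.
    node = root
    while node is not None:
        if node.key == key:
            yield 1.0                              # found the key
            return
        if key < node.key:
            n, hi, node = (n - 1) / 2, node.key - 1, node.left
        else:
            n, lo, node = (n - 1) / 2, node.key + 1, node.right
        if hi < lo:
            yield 0.0                              # empty range: key cannot be present
            return
        yield min(1.0, n / (hi - lo + 1))          # estimated nodes / remaining range
    # Fell off the tree without a match: the key is definitely absent.
    yield 0.0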

Recursive Hierarchy Ranking

I have no idea if I wrote that correctly. I want to start learning higher end data mining techniques and I'm currently using SQL server and Access 2016.
I have a system that tracks ID cards. Each ID is tagged to one particular level of a security hierarchy, which has many branches.
For example
Root
  - Maintenance
    - Management
      - Supervisory
      - Manager
      - Executive
    - Vendors
      - Secure
      - Per Diem
      - Inside Trades
There are many other departments like Maintenance, some simple, some with much more convoluted hierarchies.
Each ID card is tagged to a level, so in the Maintenance example a card might be tagged to Per Diem:Vendors:Maintenance:Root. Others may be tagged just to Vendors, some to general Maintenance itself (no one has Root, thank god).
So let's say I have 20 ID cards selected; these are available personnel I can task to a job, but since they have different areas of security I want to find commonalities they can all work on together, as a 20-person group or whatever other groupings I can make.
So the intended output would be
CommonMatch = - Per Diem
CardID = 1
CardID = 3
CommonMatch = Vendors
CardID = 1
CardID = 3
CardID = 20
So in the example above, while I could have 2 people working on Per Diem work, because that is their lowest common security level, there is also card holder #20 who has rights to the predecessor group (Vendors) that 1 and 3 share, so I could have all three of them work at that level.
I'm not looking for anyone to do the work for me (although examples are always welcome), more to be pointed in the right direction on what I should be studying, what the thing I'm trying to do is called, etc. I know CTEs are a way to go, but that seems like only one tool in a much bigger process that needs to be done.
Thank you all in advance
Well, it is not so much a graph-theory or data-mining problem but rather a data-structure problem and one that has almost solved itself.
The objective is to be able to partition the set of card IDs into disjoint subsets given a security clearance level.
So, the main idea here would be to lay out the hierarchy tree and then assign each card ID to the path implied by its security clearance level. For this purpose, each node of the hierarchy tree now becomes a container of card IDs (e.g. each node of the hierarchy tree holds (a) its own name, as unique identification, (b) pointers to other nodes, and (c) a list of the card IDs assigned to its "name").
Then, retrieving the set of cards with clearance UP TO a specific security level is simply a case of traversing the tree from that specific level downwards to the tree's leaves, all along collecting the card IDs from the node containers as they are encountered.
Suppose that we have this access tree:
A
+-B
+-C
+-D
  +-E
And card ID assignments:
B:[1,2,3]
C:[4,8]
E:[10,12]
At the moment, B, C, E only make sense as tags; there is no structural information associated with them. We therefore need to first "build" the tree. The following example uses NetworkX, but the same thing can be achieved in a multitude of ways:
import networkx

G = networkx.DiGraph()  # establish a directed graph
G.add_edge("A", "B")
G.add_edge("A", "C")
G.add_edge("A", "D")
G.add_edge("D", "E")
Now, assign the card IDs to the node containers (in NetworkX, data is attached to nodes via attribute dictionaries, so I am going to store a plain list of card IDs under a "cards" attribute):
G.node["B"]=[1,2,3]
G.node["C"]=[4,8]
G.node["E"]=[10,12]
So, now, to get everybody working under "A" (the root of the tree), you can traverse the tree from that level downwards either via Depth First Search (DFS) or Breadth First Search (BFS) and collect the card IDs from the containers. I am going to use DFS here, purely because Networkx has a function that returns the visited nodes depending on visiting order, directly.
# dfs_preorder_nodes returns a generator, which is an efficient way of iterating
# very large collections in Python, but I am casting it to a list here so that
# we get the actual list of nodes back.
vis_nodes = list(networkx.dfs_preorder_nodes(G, "A"))  # start from node "A" and DFS downwards
cardIDs = []
# I could do the following with a one-line reduce but it might be clearer this way
for aNodeID in vis_nodes:
    if "cards" in G.nodes[aNodeID]:
        cardIDs.extend(G.nodes[aNodeID]["cards"])
At the end of the above iteration, cardIDs will contain all card IDs from branch "A" downwards in one convenient list.
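As a hypothetical usage example (not part of the original answer), the same traversal can be run from every node to produce groupings similar to the CommonMatch output sketched in the question:

# For every level in the hierarchy, report which cards could work together
# at that level (i.e. the cards assigned to that node or any of its descendants).
for level in networkx.dfs_preorder_nodes(G, "A"):
    cards = []
    for descendant in networkx.dfs_preorder_nodes(G, level):
        cards.extend(G.nodes[descendant].get("cards", []))
    if cards:
        print(f"CommonMatch = {level}: {sorted(cards)}")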
Of course, this example is ultra simple, but since we are talking about trees, the tree can be as large as you like and you are still traversing it in the same way requiring only a single point of entry (the top level branch).
Finally, just as a note, the fact that you are using Access as your backend is not necessarily an impediment, but relational databases do not handle graph-type data with great ease. You might get away with it for something like a simple tree (like what you have here, for example), but the hassle of supporting this probably justifies undertaking the process outside of the database (e.g., use the database just for retrieving the data and carry out the graph-type processing in a different environment; doing a DFS in SQL is the sort of hassle I am referring to).
Hope this helps.

Manhattan distance with n dimensions in oracle

I have a table with around 5 million rows, and each row has 10 columns representing 10 dimensions.
I would like, when a new input comes in, to be able to search the table and return the closest rows using Manhattan distance.
The distance is the sum of abs(Ai-Aj)+abs(Bi-Bj)...
The problem is that for the moment, if I do a query, it does a full scan of the entire table to calculate the distance to every row, and then sorts them to find the top X.
Is there a way to speed the process and make the query more efficient?
I looked at the distance function for SDO_GEOMETRY, but I couldn't find it for more than 4 dimensions.
Thank you
If you are inserting a point A and you want to look for points that are within a neighbourhood of radius r (i.e., are less than r away, on any metric), you can do a really simple query:
select x1, x2, ..., xn
from points
where x1 between a1 - r and a1 + r
and x2 between a2 - r and a2 + r
...
and xn between an - r and an + r
...where A = (a1, a2, ..., an), to find a bounding region. If you have an index over all the x1, ..., xn fields of points, then this query shouldn't require a full scan. Now, this result may include points that are outside the neighbourhood (i.e., the bits in the corners), but it is an easy win to find an appropriate subset: you can now check the exact distance against the records returned by this query, rather than against every point in your table.
You may be able to refine this query further because, with the Manhattan metric, a neighbourhood will be square shaped (although at 45 degrees to the above) and squares are relatively easy to work with! (Even in 10 dimensions.) However, the more complicated logic required may be more of an overhead than an optimisation, ultimately.
I suggest using a function-based index. You need this distance calculated, so pre-calculate it using a function-based index.
You may want to read the following question and its links. A function-based index creates a hidden column for you. This hidden column would hold the Manhattan distance, so sorting becomes easier.
Thanks to @Xophmeister's comment: a function-based index will not help you for an arbitrary point. I do not know of any SQL function to help you here. But if you are willing to use a machine-learning/data-mining algorithm:
I suggest clustering your 5 million rows using k-means. Let's say you found 1000 cluster centers. Put these cluster centers in another table.
By the definition of clustering, your points will be assigned to cluster centers. Because of this you know which points are nearest to each cluster center, say
cluster 1 contains 20,000 points, ... cluster 987 contains 10,000 points ...
Your arbitrary point will be near one cluster. Say you find that it is nearest to cluster 987; then run your SQL using only the points that belong to that cluster center - those 10,000 points.
You need to add several tables/columns to your schema to make this effective. If your 5,000,000 rows change continuously, you will need to re-run the k-means clustering as they change. But if the values are fairly constant, one clustering per week or per month will be enough.
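A minimal sketch of that idea outside the database (my own illustration, assuming scikit-learn and NumPy, with random data standing in for the real table and smaller sizes than the ones discussed above):

import numpy as np
from sklearn.cluster import KMeans

# Offline step: cluster the 10-dimensional rows into groups (stand-in data here).
rows = np.random.rand(100_000, 10)
kmeans = KMeans(n_clusters=100, n_init=10).fit(rows)
labels, centers = kmeans.labels_, kmeans.cluster_centers_

def closest_rows(query, k=10):
    # Search only the cluster whose centre is nearest to the query point
    # (by Manhattan distance), instead of scanning the whole table.
    nearest_cluster = np.argmin(np.abs(centers - query).sum(axis=1))
    candidates = rows[labels == nearest_cluster]
    dists = np.abs(candidates - query).sum(axis=1)
    return candidates[np.argsort(dists)[:k]]

print(closest_rows(np.random.rand(10)))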