Representing multiply-linked lists in SQL

I have a data structure consisting of a set of objects which are arranged into a multiply-linked list which is also (isomorphically) a valid DAG. It could be viewed as one single multiply-linked list, or as a series of n doubly-linked lists which may share members. (This is the same data structure from Algorithm for quickly obtaining a partial ordering over multiple linked lists, for those of you following my questions.)
I am looking for a general technique, in no specific SQL dialect, for expressing this multiply-linked list/DAG in SQL, such that it's easy to take a given node and obtain:
The previous and next links in the DAG, given a topological ordering of the DAG
The previous and next links in each doubly-linked list to which this node belongs
Using the example data from that other question:
first = [a, b, d, f, h, i];
second = [a, b, c, f, g, i];
third = [a, e, f, g, h, i];
I'd want to be able to, given node f, obtain [(c|d|e), g] from the overall DAG's topology, and also {first: [d, h], second: [c, g], third: [e, g]} from each list's ordering.
Here's the fun part: n, the number of doubly-linked lists, is not fixed and may grow at any time. I'd rather not redo the schema each time that happens.
All of the algorithms I've come up with so far either (a) stuff a big pickle into the DB and pull it out in order to calculate orderings, or (b) require that the lists be explicitly enumerated as recursive relations in the DB.
I'll go with an option in (b) if I can't find something better but I'm hoping that there's something magical out there to make this easier.
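For concreteness, here is roughly what I mean by option (b), sketched with the table modeled as plain Python tuples (the table layout and names are hypothetical): one row per (list, node) membership, so a new list is just new rows, never a schema change.

# Hypothetical rows of a single membership table: (list_name, node, prev, next).
links = [
    ("first",  "f", "d", "h"),
    ("second", "f", "c", "g"),
    ("third",  "f", "e", "g"),
    # ... one row per (list, node) pair ...
]

def neighbours(node):
    # Per-list (prev, next) pairs for a node.
    return {lst: (prev, nxt) for lst, n, prev, nxt in links if n == node}

print(neighbours("f"))  # {'first': ('d', 'h'), 'second': ('c', 'g'), 'third': ('e', 'g')}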

Preface:
This is a question-and-answer forum, not a 'let's sit down, group-think for a bit, and solve the whole problem' forum.
I think what you want to investigate is a technique called 'modified preorder tree traversal'. A mouthful, I know, but it allows storing hierarchical data in a flat database with individual entries. Sadly, you do have to do some rewriting on inserts, but the selects can be done in a single query, so it's best for 'many views / few changes' situations like a website. Luckily, you rarely have to rewrite the whole dataset (only the parts you changed and those hierarchically after them).
I remember a good article on the basics of it (from a couple of years ago) but can't find the bookmark at the moment, so start with just a Google search.
EDIT/UPDATE:
link: http://www.sitepoint.com/hierarchical-data-database/
No matter what, from dealing with this issue extensively: you will have to choose where to put the brunt of the work, on view or on change. Depending on the size of the 'master' tree, you may (like me) decide to break the tree up into parts and use a tree of trees, limiting the update cost.
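To illustrate the numbering scheme the technique relies on, here is a minimal sketch in Python (the tree shape and names are made up); in the database, each node would then simply store its (lft, rgt) pair as two integer columns:

# Assign nested-set (modified preorder) numbers by depth-first traversal:
# each node gets a (lft, rgt) pair, and X is a descendant of Y exactly
# when Y.lft < X.lft < Y.rgt.
tree = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}

def number(node, counter=None, out=None):
    counter = counter if counter is not None else [0]
    out = out if out is not None else {}
    counter[0] += 1
    lft = counter[0]
    for child in tree[node]:
        number(child, counter, out)
    counter[0] += 1
    out[node] = (lft, counter[0])
    return out

print(number("root"))  # {'c': (3, 4), 'a': (2, 5), 'b': (6, 7), 'root': (1, 8)}

Selecting a whole subtree is then a single range query on those two columns, which is why reads are cheap while inserts need partial renumbering.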

Related

Storing DNA in Swift

I'm going to write an application for dealing with raw DNA data samples, i.e. the files you get from MyHeritage, Ancestry, FamilyTreeDNA, 23andMe, etc. Each of these files is basically a CSV file with some quirks; I asked about decoding them in another question I posted earlier.
Now for the next part. When I have parsed/decoded those files, I want to put the DNA data in a database, so that I can compare one person's DNA to that of another person. It's a lot of data, but not more than most computers can handle.
In memory, I can have the full DNA for both persons, compare them, and then create ArraySlices for the segments of DNA data that overlap, but ArraySlices aren't suitable for storage. In memory an ArraySlice can't exist by itself; it's just a reference into the full array, so if I were to flatten the ArraySlice I would still get the whole array, even for the segments that don't match.
Each person's full DNA will be in backing store and can be read into memory, but how would you store the matching segments?
I'm thinking of something like:
// Bucket is the term FamilyTreeDNA uses to indicate whether the DNA match
// is on the maternal or the paternal chromosome.
enum Bucket {
    case maternal
    case paternal
}

// The first 22 chromosome pairs are just numbered from 1 to 22, but the last
// pair is X and Y, so I use a String for storing the chromosome "number".
struct SharedSegment {
    let onChromosome: String
    let bucket: Bucket
    let range: Range<UInt>
}
I don't care if it takes more disk space, but I want lightning-fast comparisons of DNA, so that I can compare all the DNA for all the individuals in a database without it taking months to do so. I also need storage space for the full DNA to make those comparisons.
At the first stage I'm just building an app for storing the DNA kits I administer, but I already have plans for a service along the lines of GEDmatch and DNAPainter, if you have tried them. That means a service where people can upload their DNA to be compared to other people's DNA; let's say a million people upload their DNA to this service, then each of them should have their DNA compared to the other 999,999 people. The number of comparisons will be huge, so my primary focus is on performance. Each file with raw DNA data will contain about 400-950 thousand lines of DNA data.
Each line will contain the chromosome number, the RSID, the position within the chromosome, and a genotype. The latter is two letters: "AA", "AC", "CT", etc. There are four different letters: A, C, G and T. The reason there are two letters for each position is that you have chromosome pairs, where one chromosome is inherited from the father and one from the mother, and there is one letter from each of those two chromosomes. Of course I could store them as just a string of characters, but there are chances of errors, so I would like to represent them in code as
// UInt8 raw values so that two bases can later be packed into one byte.
enum Nucleotide: UInt8 {
    case noCall = 0
    case A
    case C
    case G
    case T
}
When sequencing DNA there are sometimes problems, and the sequencing equipment can't determine which base is in a certain position. This is called a "no call", hence the case noCall in the enum. In the raw DNA file this is represented by a dash, so the results can say "-A", which means that one of the parents had an A in that position, and the other could not be determined.
Is there any possibility to squeeze them together into 4 bits (a nybble), so that I can store two of these letters per byte? It's even possible to squeeze them into 3 bits, but I can't get three letters into a byte anyway; it would be two letters at 3 bits each with two bits wasted in every byte, so I could just as well use 4 bits per base. There are UInt64, UInt32, UInt16 and UInt8 in Swift, but no UInt4, which would be ideal for this case. I'm also thinking about whether to store the two letters from the maternal and paternal chromosomes together, or whether I should split them into separate arrays (one array for maternal DNA and one for paternal). There is a problem with that approach: it's impossible to tell whether the first letter on each row is maternal or paternal until you have the DNA from at least one of the parents to compare with. In the absence of their DNA, I would have to have a third array to store both letters in, until I can determine which one is maternal and which paternal. I'm trying to come up with the most effective way of storing this, to make the comparisons super fast.
In one way I don't like using enums, and that is because I will have to convert them to rawValue, so I can do something like
var genotype = Nucleotide.A.rawValue << 4 | Nucleotide.G.rawValue
As far as I can see that's the best way to squeeze two of these into one byte, since there's no UInt4.
I'm not so fond of having lots of .rawValue all over my code. I would like to have only Nucleotide.A << 4 | Nucleotide.G, but unfortunately I don't think this is possible. Maybe there is a better way to store these sequences of bases in the database, like enums with associated values or something. I don't know how efficient associated values will be when working with such large data sets.
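For what it's worth, the packing arithmetic itself is tiny and language-agnostic; here is a sketch in Python (the codes mirror the raw values above: 0 = no call, 1 = A, 2 = C, 3 = G, 4 = T), and the same shifts work on a Swift UInt8:

# Pack two 4-bit base codes into one byte and get them back out.
def pack(hi, lo):
    return (hi << 4) | lo

def unpack(byte):
    return byte >> 4, byte & 0x0F

assert unpack(pack(1, 3)) == (1, 3)  # "AG" round-trips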
If there is anyone out there who wants to collaborate on this project: so far it's just a hobby project, but I have plans for making a business out of it eventually. This means I can't employ anyone to do this, but if you're working on similar projects then let me know; we can make better things together. Just be aware that I'm writing in Swift, and I'm going to deploy on macOS, but Swift is also available for other platforms, so coders for Linux and Windows are equally welcome to work on a joint project.
This became a little off topic. My question was about storage of raw DNA and shared segments in a way that is optimal for fast search and comparison of huge amounts of DNA. I probably won't use Core Data for storage, since I would like to keep the option of porting to platforms other than Apple's. At the moment I'm using Core Data to experiment a little with storing DNA in different ways.

Recursive Hierarchy Ranking

I have no idea if I wrote that correctly. I want to start learning higher-end data mining techniques, and I'm currently using SQL Server and Access 2016.
I have a system that tracks ID cards. Each ID is tagged to one particular level of a security hierarchy, which has many branches.
For example
Root
- Maintenance
  - Management
    - Supervisory
    - Manager
    - Executive
  - Vendors
    - Secure
    - Per Diem
    - Inside Trades
There are many other departments like Maintenance, some simple, some with much more convoluted hierarchies.
Each ID card is tagged to a level, so in the Maintenance example: Per Diem:Vendors:Maintenance:Root. Others may be tagged just to Vendors, some to general Maintenance itself (no one has Root, thank god).
So let's say I have 20 ID cards selected; these are available personnel I can task to a job, but since they have different areas of security I want to find commonalities they can all work on together, as a 20-person group or whatever other groupings I can make.
So the intended output would be:
CommonMatch = Per Diem
  CardID = 1
  CardID = 3
CommonMatch = Vendors
  CardID = 1
  CardID = 3
  CardID = 20
So in the example above, while I could have two people working on Per Diem work, because that is their lowest common security similarity, there is also card holder #20, who has rights to the predecessor group (Vendors) that 1 and 3 share, so I could have three of them work at that level.
I'm not looking for anyone to do the work for me (although examples are always welcome), more to point me in the right direction on what I should be studying, what what I'm trying to do is called, etc. I know CTEs are a way to go, but that seems like only one tool in a much bigger process that needs to be done.
Thank you all in advance
Well, it is not so much a graph-theory or data-mining problem but rather a data-structure problem and one that has almost solved itself.
The objective is to be able to partition the set of card IDs into disjoint subsets given a security clearance level.
So, the main idea here would be to lay out the hierarchy tree and then assign each card ID to the path implied by its security clearance level. For this purpose, each node of the hierarchy tree now becomes a container of card IDs (i.e. each node of the hierarchy tree holds (a) its own name (as unique identification), (b) pointers to other nodes, and (c) a list of card IDs assigned to its "name").
Then, retrieving the set of cards with clearance UP TO a specific security level is simply a case of traversing the tree from that specific level downwards to the tree's leaves, all along collecting the card IDs from the node containers as they are encountered.
Suppose that we have this access tree:
A
+- B
+- C
+- D
   +- E
And card ID assignments:
B:[1,2,3]
C:[4,8]
E:[10,12]
At the moment, B, C, E only make sense as tags; there is no structural information associated with them. We therefore need to first "build" the tree. The following example uses Networkx, but the same thing can be achieved in a multitude of ways:
import networkx

G = networkx.DiGraph()  # establish a directed graph
G.add_edge("A", "B")
G.add_edge("A", "C")
G.add_edge("A", "D")
G.add_edge("D", "E")

Now, assign the card IDs to the node containers (in Networkx, arbitrary data can be attached to a node as attributes, so I am going to store a very simple list under a "cards" key; in Networkx 1.x this was written G.node[...]):

G.nodes["B"]["cards"] = [1, 2, 3]
G.nodes["C"]["cards"] = [4, 8]
G.nodes["E"]["cards"] = [10, 12]
So now, to get everybody working under "A" (the root of the tree), you can traverse the tree from that level downwards, either via Depth-First Search (DFS) or Breadth-First Search (BFS), and collect the card IDs from the containers. I am going to use DFS here, purely because Networkx has a function that returns the visited nodes, depending on visiting order, directly.
# dfs_preorder_nodes returns a generator -- an efficient way of iterating
# very large collections in Python -- but I am casting it to a list here,
# so that we get the actual list of nodes back.
vis_nodes = list(networkx.dfs_preorder_nodes(G, "A"))  # start from node "A" and DFS downwards

cardIDs = []
# I could do the following with a one-line reduce, but it might be clearer this way:
for aNodeID in vis_nodes:
    cardIDs.extend(G.nodes[aNodeID].get("cards", []))
At the end of the above iteration, cardIDs will contain all card IDs from branch "A" downwards in one convenient list.
Of course, this example is ultra simple, but since we are talking about trees, the tree can be as large as you like and you are still traversing it in the same way requiring only a single point of entry (the top level branch).
Finally, just as a note: the fact that you are using Access as your backend is not necessarily an impediment, but relational databases do not handle graph-type data with great ease. You might get away easily with something like a simple tree (like what you have here, for example), but the hassle of supporting this probably justifies undertaking the process outside of the database (e.g., use the database just for retrieving the data and carry out the graph-type data processing in a different environment; doing a DFS in SQL is the sort of hassle I am referring to).
Hope this helps.

In this minimal perfect hashing function, what is meant by FirstLetter and Predecessor?

I'm implementing a Minimal Acyclic Finite State Automaton (MA-FSA; a specific kind of DAG) in Go, and would like to associate some extra data with nodes that indicate EOW (end of word). With an MA-FSA, the traditional approach is not possible because there are multiple words that might end at a given node. So I'm looking into minimal perfect hashing functions as an alternative.
In the "Correction" box at the top of his blog post, Steve Hanov says that he used the method described in this paper by Lucchesi and Kowaltowski. In looking at Figure 12 (page 19), it describes the hashing function.
On line 8, it refers to FirstLetter and Predecessor(), but it doesn't describe what they are. Or I'm not seeing it. What are they?
All I can figure out is that it's just traversing the tree, adding up Number from each node as it goes, but that can't possibly be right. It produces numbers that are too large and it's not one-to-one, like the paper says. Am I misreading something?
The paper says:
Let us assume that the representation of our automaton includes, for each state, an integer which gives the number of words that would be accepted by the automaton starting from that state.
So I believe this: for C <- FirstLetter to Predecessor(Word[I]) do
Means: for (c = 'a'; c < word[i]; c++)
(They're just trying to be alphabet-independent.)
Think of it this way: enumerate all accepted words. Sort them. Find your word in the list. Its index is the word's hash value.
Their algorithm avoids storing the complete list by keeping track of how many words are reachable from a given node. So you get to a node, and check all the outgoing edges to other nodes that involve a letter of the alphabet before your next letter. All of the words reachable from those nodes must be on the list before your word, so you can calculate what position your word must occupy in the list.
I have updated my DAWG example to show using it as a Map from keys to values. Each node stores the number of final nodes reachable from it (including itself). Then when the trie is traversed, we add up the counts of any that we skip over. That way, each word in the trie has a unique number. You can then look up the number in an array to get the data associated with the word.
https://gist.github.com/smhanov/94230b422c2100ae4218
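To make the counting concrete, here is a toy sketch in Python of the same idea over a plain trie (a real MA-FSA shares suffix nodes, but the arithmetic is identical): each node stores how many words are reachable from it, and a word's hash is the number of accepted words that sort strictly before it.

# Build a trie where every node carries `count`: the number of words
# reachable from that node (including the node itself if it ends a word).
class Node:
    def __init__(self):
        self.children = {}
        self.final = False
        self.count = 0

def build(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.final = True
    def fill(node):
        node.count = int(node.final) + sum(fill(c) for c in node.children.values())
        return node.count
    fill(root)
    return root

def index_of(root, word):
    # Smaller-letter branches and word-ending prefixes we pass on the way
    # down are exactly the accepted words that sort before `word`.
    idx, node = 0, root
    for ch in word:
        if node.final:
            idx += 1  # a proper prefix of `word` is itself a word
        for letter in sorted(node.children):
            if letter < ch:
                idx += node.children[letter].count
        node = node.children[ch]
    return idx  # 0-based rank of `word` in the sorted word list

root = build(["cat", "cow", "dog"])
assert [index_of(root, w) for w in ["cat", "cow", "dog"]] == [0, 1, 2]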

RapidMiner FP-growth operator not returning any results

I'm running into a problem with the FP-Growth operator in RapidMiner. I'm processing about 20 text files that are, all in all, <1MB in size. I used the Process Documents operator and, within that, Tokenize, Filter Stopwords, Transform Cases, Generate n-Grams, and Filter Tokens. From there I used the Numerical to Binominal operator. Up until this point everything works fine, but when I run the FP-Growth operator it just processes indefinitely with no result. I tried tweaking the min support parameter, but to no avail. Would you have any suggestions on how to troubleshoot this? I'd really appreciate any guidance.
FP-Growth is an algorithm to find Frequent Item Sets within a number of transactions that contain multiple items. In your scenario, the items are probably the words occurring in the text, while each text is a transaction.
Unfortunately, the problem of frequent item sets is exponential: say you have a frequent item set containing the items {A, B, C}; that means there are enough transactions that contain all three items. But that also means that all the subsets are frequent as well, because the subset {A, B} is contained in at least all the transactions that contain {A, B, C}. So if {A, B, C} is frequent, then {A, B}, {B, C}, {A, C} and {A}, {B} and {C} are as well. The number of such sets is 2^n - 1, so for a set of four items we already have 15 subsets, for five 31, and so on.
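You can see the blowup directly with a couple of lines of Python (item names made up):

# Every non-empty subset of a frequent 4-item set is itself frequent.
from itertools import chain, combinations

items = ["A", "B", "C", "D"]
subsets = list(chain.from_iterable(combinations(items, k) for k in range(1, 5)))
print(len(subsets))  # 15 == 2**4 - 1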
So the question is: What makes a set frequent and why might there be so many frequent sets that RapidMiner takes so long to compute them all?
The most important factor is of course the min_support. This defines the threshold, as a fraction of all transactions, in which an item set must occur to be considered frequent. If you increase min_support towards 1, there will be far fewer item sets and the computation will be faster.
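If you want to see the effect outside RapidMiner, here is a small sketch using the third-party Python library mlxtend (the toy transactions are made up); the same min_support knob governs both tools:

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

# One-hot encoded transactions: rows are texts, columns are words.
transactions = pd.DataFrame(
    [[True, True, True],
     [True, True, False],
     [True, False, False]],
    columns=["alpha", "beta", "gamma"])

print(len(fpgrowth(transactions, min_support=0.3)))  # 7 frequent item sets
print(len(fpgrowth(transactions, min_support=0.9)))  # just 1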
However, don't get tricked by the "find min number of itemsets" parameter. If it is checked, RapidMiner will always try to find the specified minimal number of item sets and will automatically lower the min_support if it can't find enough. My advice: switch it off.
Another thing you should make sure of is that the right value is recognized as "positive", i.e. as the indicator that an item is part of the transaction. If you used a Numerical to Binominal operator before, this is "true", so you should enter "true" into the positive_value parameter of RapidMiner. This parameter is only visible in expert mode. If you are not in expert mode, a line will show up below the parameters telling you that "4 hidden expert parameters" are available; you can click on the line to switch to expert mode.
In your specific scenario, where you are getting your 'transactions' from text files, you will have special problems:
You will have thousands of attributes, especially since you generated n-grams, as in your case. A high number of attributes will also result in a massively increased runtime.
If you don't remove frequent words by applying reasonable pruning in the Process Documents operator, words that occur very frequently will explode the number of frequent item sets. Say you didn't filter stop words; then the words "a", "the", "is" will occur all over the place, causing the other frequent words to co-occur with them. So the frequent set {A, B, C} will always be extended to {A, B, C, a, the, is}, and we now have 2^6 - 1 = 63 subsets instead of just 7...
Hope this helps!

Building a ranking with a genetic algorithm

Question after BIG edit:
I need to build a ranking using a genetic algorithm. I have data like this:
P(a>b)=0.9
P(b>c)=0.7
P(c>d)=0.8
P(b>d)=0.3
Now, let's interpret a, b, c, d as the names of football teams, and P(x>y) as the probability that x beats y. We want to build a ranking of the teams, but we lack some observations: P(a>d) and P(a>c) are missing due to a lack of matches between a vs. d and a vs. c.
The goal is to find the ordering of team names which best describes the current situation in that four-team league.
If we have only 4 teams then the solution is straightforward: first we compute the probabilities for all 4! = 24 orderings of the four teams; ignoring missing values, we have:
P(abcd)=P(a>b)P(b>c)P(c>d)P(b>d)
P(abdc)=P(a>b)P(b>c)(1-P(c>d))P(b>d)
...
P(dcba)=(1-P(a>b))(1-P(b>c))(1-P(c>d))(1-P(b>d))
and we choose the ranking with highest probability. I don't want to use any other fitness function.
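For small n the brute force is only a few lines; a sketch in Python, with the pairs and probabilities from above:

# Score each permutation by the product over observed pairs (x, y) of
# P(x>y) if x precedes y in the ordering, else 1 - P(x>y).
from itertools import permutations

obs = {("a", "b"): 0.9, ("b", "c"): 0.7, ("c", "d"): 0.8, ("b", "d"): 0.3}

def score(order):
    rank = {team: i for i, team in enumerate(order)}
    prob = 1.0
    for (x, y), p in obs.items():
        prob *= p if rank[x] < rank[y] else 1.0 - p
    return prob

best = max(permutations("abcd"), key=score)  # infeasible once n! explodes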
My question:
As the number of permutations of n elements is n!, calculating the probabilities for all orderings is impossible for large n (my n is about 40). I want to use a genetic algorithm for this problem.
The mutation operator is simple: switch the places of two (or more) elements of the ranking.
But how do I make a crossover of two orderings?
Could P(abcd) be interpreted as the cost of the path 'abcd' in an asymmetric TSP, where the cost of travelling from x to y is different from the cost of travelling from y to x, with P(x>y) = 1 - P(y>x)? There are many crossover operators for the TSP, but I think I have to design my own crossover operator, because my problem is slightly different from TSP. Do you have any ideas for a solution, or a frame for conceptual analysis?
The easiest way, on a conceptual and implementation level, is to use a crossover operator which exchanges suborderings between two solutions:
CrossOver(ABcD, AcDB) = AcBD
For a random subset of elements (in this case a, b, d, shown in capital letters) we copy the first parent's subordering of those elements, the sequence a, b, d, into the second ordering.
Edit: an asymmetric TSP could be turned into a symmetric TSP, but with forbidden suborderings, which makes the GA approach unsuitable.
It's definitely an interesting problem, and it seems most of the answers and comments have focused on the semantic aspects of the problem (i.e., the meaning of the fitness function, etc.).
I'll chip in some information about the syntactic elements -- how do you do crossover and/or mutation in ways that make sense. Obviously, as you noted with the parallel to the TSP, you have a permutation problem. So if you want to use a GA, the natural representation of candidate solutions is simply an ordered list of your points, being careful to avoid repetition -- that is, a permutation.
TSP is one such permutation problem, and there are a number of crossover operators (e.g., Edge Assembly Crossover) that you can take from TSP algorithms and use directly. However, I think you'll have problems with that approach. Basically, the problem is this: in TSP, the important quality of solutions is adjacency. That is, abcd has the same fitness as cdab, because it's the same tour, just starting and ending at a different city. In your example, absolute position is much more important than this notion of relative position. abcd means, in a sense, that a is the best point -- it's important that it came first in the list.
The key thing you have to do to get an effective crossover operator is to account for what the properties are in the parents that make them good, and try to extract and combine exactly those properties. Nick Radcliffe called this "respectful recombination" (note that paper is quite old, and the theory is now understood a bit differently, but the principle is sound). Taking a TSP-designed operator and applying it to your problem will end up producing offspring that try to conserve irrelevant information from the parents.
You ideally need an operator that attempts to preserve absolute position in the string. The best one I know of offhand is known as Cycle Crossover (CX). I'm missing a good reference off the top of my head, but I can point you to some code where I implemented it as part of my graduate work. The basic idea of CX is fairly complicated to describe, and much easier to see in action. Take the following two parents:
abcdefgh
cfhgedba
Pick a starting point in parent 1 at random. For simplicity, I'll just start at position 0 with the "a".
Now drop straight down into parent 2, and observe the value there (in this case, "c").
Now search for "c" in parent 1. We find it at position 2.
Now drop straight down again, and observe the "h" in parent 2, position 2.
Again, search for this "h" in parent 1, found at position 7.
Drop straight down and observe the "a" in parent 2.
At this point note that if we search for "a" in parent one, we reach a position where we've already been. Continuing past that will just cycle. In fact, we call the sequence of positions we visited (0, 2, 7) a "cycle". Note that we can simply exchange the values at these positions between the parents as a group and both parents will retain the permutation property, because we have the same three values at each position in the cycle for both parents, just in different orders.
Make the swap of the positions included in the cycle.
Note that this is only one cycle. You then repeat this process starting from a new (unvisited) position each time until all positions have been included in a cycle. After the one iteration described in the above steps, you get the following strings (where an "X" denotes a position in the cycle where the values were swapped between the parents):
cbhdefga
afcgedbh
X X    X
Just keep finding and swapping cycles until you're done.
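To make the cycle-hunting concrete, here is a short Python sketch of CX (it swaps the first cycle, matching the walkthrough above; conventions differ on which cycle to start with):

# Cycle crossover: partition the positions into cycles, then swap the
# values of alternating cycles between the two parents.
def cycle_crossover(p1, p2):
    pos = {v: i for i, v in enumerate(p1)}  # value -> index in parent 1
    child1, child2 = list(p1), list(p2)
    visited = [False] * len(p1)
    swap = True  # swap the very first cycle, as in the example above
    for start in range(len(p1)):
        if visited[start]:
            continue
        cycle, i = [], start
        while not visited[i]:  # drop down to p2, look the value up in p1, repeat
            visited[i] = True
            cycle.append(i)
            i = pos[p2[i]]
        if swap:
            for i in cycle:
                child1[i], child2[i] = child2[i], child1[i]
        swap = not swap
    return "".join(child1), "".join(child2)

print(cycle_crossover("abcdefgh", "cfhgedba"))  # ('cbhdefga', 'afcgedbh')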
The code I linked from my github account is going to be tightly bound to my own metaheuristics framework, but I think it's a reasonably easy task to pull the basic algorithm out from the code and adapt it for your own system.
Note that you can potentially gain quite a lot from doing something more customized to your particular domain. I think something like CX will make a better black box algorithm than something based on a TSP operator, but black boxes are usually a last resort. Other people's suggestions might lead you to a better overall algorithm.
I've worked on a somewhat similar ranking problem and followed a technique similar to what I describe below. Does this work for you:
Assume the unknown value of an object diverges from your estimate via some distribution, say, the normal distribution. Interpret a ranking statement such as "a > b, 0.9" as the statement "the value of a lies at the 90th percentile of the distribution centered on b".
For every statement:
  realArrival = calculate a's location on the distribution centered on b
  arrivalGap = |realArrival - expectedArrival|
Then, summed over all statements:
  fitness = Σ arrivalGap
The fitness function is MIN(fitness).
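A sketch of that fitness in Python (here `values` maps each object to its current estimated value, and the unit standard deviation is my assumption):

# statements: list of (a, b, p) triples meaning "P(a > b) = p".
from statistics import NormalDist

def fitness(values, statements):
    total = 0.0
    for a, b, p in statements:
        real = NormalDist(mu=values[b], sigma=1.0).cdf(values[a])
        total += abs(real - p)
    return total  # the GA minimizes this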
FWIW, my problem was actually a bin-packing problem, where the equivalent of your "rank" statements were user-provided rankings (1, 2, 3, etc.). So not quite TSP, but NP-hard. OTOH, bin-packing has a pseudo-polynomial solution proportional to the accepted error, which is what I eventually used. I'm not quite sure that would work with your probabilistic ranking statements.
What an interesting problem! If I understand it, what you're really asking is:
"Given a weighted, directed graph, with each edge-weight in the graph representing the probability that the arc is drawn in the correct direction, return the complete sequence of nodes with maximum probability of being a topological sort of the graph."
So if your graph has N edges, there are 2^N graphs of varying likelihood, with some orderings appearing in more than one graph.
I don't know if this will help (very brief Google searches did not enlighten me, but maybe you'll have more success with more perseverance), but my thought is that searching for "topological sort" in conjunction with any of "probabilistic", "random", "noise", or "error" (because the edge weights can be considered a reliability factor) might be helpful.
I strongly question your assertion, in your example, that P(a>c) is not needed, though. You know your application space best, but it seems to me that specifying P(a>c) = 0.99 will give a different fitness for f(abc) than specifying P(a>c) = 0.01.
You might want to throw in "Bayesian" as well, since you might be able to start to infer values for (in your example) P(a>c) given your conditions and hypothetical solutions. The problem is, "topological sort" plus "Bayesian" is going to give you a whole bunch of hits related to Markov chains and Markov decision problems, which may or may not be helpful.