Storing DNA in Swift - optimization

I'm going to write an application for dealing with raw DNA data samples, such as the files you get from MyHeritage, Ancestry, FamilyTreeDNA, 23&me etc. Each of these files is basically a CSV file with some quirks, and I asked about decoding them in another question I posted earlier.
Now for the next part. When I have parsed/decoded those files, I want to put the DNA data in a database, so that I can compare one person's DNA to that of another person. It's a lot of data, but not more than most computers can handle.
In memory, I can have the full DNA for both persons, compare them, and then create ArraySlices for the segments of DNA data that overlap, but ArraySlices aren't suitable for storage. In memory the ArraySlice can't exist by itself. It's just a reference into the full array, so if I flattened the ArraySlice I would still get the whole array, even for the segments that don't match.
Each person shall have their full DNA on backing store, from where it can be read into memory, but how would you store the matching segments?
I'm thinking of something like:
// Bucket is the term FamilyTreeDNA uses for indicating whether the DNA match is on the maternal or paternal chromosome.
enum Bucket {
    case maternal
    case paternal
}
// The first 22 chromosome pairs are just numbered from 1 to 22, but the last two are X and Y, so I use a String for storing the chromosome "number".
struct SharedSegment {
let onChromosome: String
let bucket: Bucket
let range: Range<UInt>
}
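A minimal sketch of how slice bounds could be persisted instead of the slice itself. The sample data and variable names are hypothetical; the only assumption is that the full genotype list is held in an Array. An ArraySlice keeps the indices of its base array, so its bounds can be stored as a Range, and Array(slice) copies only the sliced elements.

```swift
// Hypothetical sample data: one genotype string per position.
let fullDNA = ["AA", "AC", "CT", "GG", "TT", "AG"]

// A slice shares storage with fullDNA and keeps the original indices.
let slice = fullDNA[2..<5]

// The bounds alone are enough to persist the matching segment.
let storedRange: Range<Int> = slice.indices

// Array(slice) copies only the sliced elements, not the whole array.
let copy = Array(slice)

// Later, the segment can be rebuilt from the full array and the stored range.
let rebuilt = fullDNA[storedRange]
```

Storing only the range, as SharedSegment does, keeps each match small regardless of how long the segment is.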
I don't care if it takes more disk space, but I want to have lightning fast comparisons of DNA, so that I can compare all the DNA for all the individuals in a database without it taking months to do so. The same goes for the storage of the full DNA used in the comparisons.
At the first stage I'm just building an app for storing the DNA kits I administer, but I already have plans for a service of the Gedmatch and DNAPainter type, if you have tried them. That is a service where people can upload their DNA to be compared to other people's DNA. Let's say a million people upload their DNA to this service, and each of them should have their DNA compared to the other 999'999 people. The number of comparisons will be huge, so my primary focus is on performance. Each file with raw DNA data will contain about 400-950 thousand lines of DNA data.
Each line will contain the chromosome number, the RSID, the position within the chromosome and a genotype. The latter is two letters: "AA", "AC", "CT" etc. There are four different letters: A, C, G and T. The reason there are two letters for each position is that you have chromosome pairs, where one chromosome is inherited from the father and one from the mother, and there is one letter from each of those two chromosomes. Of course I can store them as just a string of characters, but there are chances of errors, so I would like to represent them in code as
enum Aminoacid: UInt8 {
    case noCall = 0
    case A
    case C
    case G
    case T
}
When sequencing DNA there are sometimes problems, and the sequencing equipment can't determine which amino acid is in a certain position. This is called a "no call", hence the case noCall in the enum. In the raw DNA file this is represented by a dash, so the results can say "-A", which means that one of the parents had an A in that position, and the other could not be determined.
Is there any way to squeeze them into 4 bits (a nybble) each, so that I can store two of these letters per byte? It's even possible to squeeze them into 3 bits, but I can't get three letters into a byte anyway. It would be two letters at 3 bits each with two bits wasted in every byte, so I could just as well use 4 bits per amino acid. There are UInt64, UInt32, UInt16 and UInt8 in Swift, but no UInt4, which would be ideal for this case. I'm also thinking about whether to store the two letters from the maternal and paternal chromosomes together or whether I should split them into separate arrays (one array for maternal DNA and one for paternal). There is a problem with that approach: it's impossible to tell whether the first letter on each row is maternal or paternal until you have the DNA from at least one of the parents to compare with. In the absence of their DNA, I would need a third array to store both letters in, until I can determine which one is maternal and which is paternal. I'm trying to come up with the most effective way of storing this, to make the comparisons super fast.
In one way I don't like using enums, and that is because I will have to convert them to rawValue, so that I can do something like
var genotype = Aminoacid.A.rawValue << 4 + Aminoacid.G.rawValue
As far as I can see that's the best way to squeeze two of these into one byte, since there's no UInt4.
I'm not so fond of having lots of .rawValue all over my code. I would like to have only Aminoacid.A << 4 + Aminoacid.G, but unfortunately I don't think this is possible. Maybe there is a better way to store these sequences of amino acids in the database, like enums with associated values or something. I don't know how efficient associated values will be when working with such large data sets.
If anyone out there wants to collaborate on this project, let me know. It is so far just a hobby project, although I have plans to make a business out of it eventually, which means I can't employ anyone to do this. If you're working on similar projects, we can make better things together. Just be aware that I'm writing in Swift, and I'm going to deploy on macOS, but Swift is also available for other platforms, so coders for Linux and Windows are equally welcome to work on a joint project.
This became a little off-topic. My question was about storage of raw DNA and shared segments in a way that is optimal for fast search and comparison of huge amounts of DNA. I probably won't use CoreData for storage, since I would like to keep the option of porting to platforms other than Apple's. At the moment I'm using CoreData to experiment a little with storing DNA in different ways.
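One possible way to keep .rawValue out of the calling code is to hide the packing inside a small wrapper type. This is only a sketch: Genotype and its members are hypothetical names, and the enum keeps the name Aminoacid from the question even though A, C, G and T are strictly nucleotide bases.

```swift
// Name kept from the question; the raw values fit in 4 bits.
enum Aminoacid: UInt8 {
    case noCall = 0
    case A, C, G, T
}

// Hypothetical wrapper: two 4-bit codes packed into a single byte.
struct Genotype: Equatable {
    let packed: UInt8

    init(_ first: Aminoacid, _ second: Aminoacid) {
        // High nybble holds the first letter, low nybble the second.
        packed = (first.rawValue << 4) | second.rawValue
    }

    // Unpacking cannot fail, because packed is only ever built from valid cases.
    var first: Aminoacid { Aminoacid(rawValue: packed >> 4)! }
    var second: Aminoacid { Aminoacid(rawValue: packed & 0x0F)! }
}

let genotype = Genotype(.A, .G)
```

With this shape, an array of Genotype values can be compared element by element through the packed bytes, and the shifting and masking live in one place.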

Related

Oracle String Conversion - Alpha String to Numeric Score, Fuzzy Match

I'm working with a lot of name data where the following events are happening:
In one stream the data is submitted as "Sung" and in the other stream as "Snug". My initial thought was to convert Sung and Snug so that each character equals a number; the sums would then be the same, so even if two characters are transposed, I'd be able to bucket these appropriately.
The other case is where a name comes in as "Lillly" in one stream as opposed to "Lilly" in the other stream. I'd like to figure out how to fuzzy match these so that I can identify them. I'm not sure if this is possible in Oracle.
I'm working with many millions of data points and trying to figure out how to write these classification buckets such that I can stop having so much noise in my primary task of finding where people are truly different people as opposed to a clerical error.
Any thoughts would be very appreciated.
A common measure for such distance is called Levenshtein distance (see Wikipedia). This measures the "edit" distance between two strings: the number of edit operations needed to convert one into the other.
That's the good news. More good news is that Oracle even has an implementation in the UTL_MATCH library.
The bad news is that it is really, really expensive on millions of data points. Unfortunately, I cannot help you there so much. One idea is to determine which names are "close enough" because they already share a certain minimum number of characters.
Another method is to convert the strings to what they sound like. That is called soundex. You may be able to use the two together, assuming your names are predominantly English (the soundex algorithm was patented in the US in 1918 and has long been used to index US census records, so it would work best on names in America).
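The bucketing idea from the question (an order-independent value per name) can be sketched with sorted letters instead of a character sum. This is a variation, not the asker's method: a sum also collides for unrelated names whose letters happen to add up to the same total, whereas a sorted-letter key only matches names made of exactly the same letters. The helper name anagramKey is hypothetical, and the sketch is in Swift rather than Oracle SQL.

```swift
// Hypothetical helper: an order-independent key for a name.
// Names that differ only by transposed letters get the same key.
func anagramKey(_ name: String) -> String {
    String(name.lowercased().sorted())
}

let sameBucket = anagramKey("Sung") == anagramKey("Snug")
let differentBucket = anagramKey("Lilly") == anagramKey("Lillly")
```

Such a key catches transpositions like Sung/Snug cheaply, but not insertions like Lilly/Lillly, which still need an edit-distance or sound-based comparison.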

Table VS xml / json / yaml - table requires less storage if data is any related? more efficient than compression

To add a field to a XML object it takes the length of the fieldname +
3 characters (or 7 when nested) and for JSON 4 (or 6 when nested)
<xml>xml</xml> xml="xml"
{"json":json,} "json": json,
Assume the average overhead is 4 and the average fieldname length is 11. To justify the use of XML/JSON over a table in terms of storage, each field must on average appear in fewer than 1/15 of the objects. In other words, there must be ~15 times more different fields within the whole related group of objects than one object has on average.
(Yet a table may very well still allow faster computation when this ratio is higher and the table is bigger in storage.) I have not yet seen a use of XML/JSON with a very high ratio.
Aren't most real uses of XML/JSON forced and inefficient?
Shouldn't related data be stored and queried in relations (tables)?
What am I missing?
Example conversion XML to table
Object1
<aaaaahlongfieldname>1</aaaaahlongfieldname>
<b>B
<c>C</c>
</b>
Object2
<aaaaahlongfieldname>2</aaaaahlongfieldname>
<b><c><d>D</d></c></b>
<ba>BA</ba>
<ba "xyz~">BA</ba>
<c>C</c>
Both converted to a csv like table (delimiter declaration,head,line1,line2)
delimiter=,
aaaaahlongfieldname,b,b/c,b/c/d,ba,ba-xyz~,c
1,B,C,,,,
2,,,D,BA,BA,C
/ and - symbols in values will need to be escaped only in the head
but ,,,, could also be \4 escaped number of delimiters in a row (when an escape symbol or string is declared as well - worth it at large numbers of empty fields ) and since escape character and delimiter will need to be escaped when they appear in values, they could automatically be declared rare symbols that usually hardly appear
escape=~
delimiter=°
aaaaahlongfieldname°b°b/c°b/c/d°ba°ba-xyz~~°c
1°B°C~4
2°°°D°BA°BA°C
Validation/additional info: XML/JSON omits all empty fields, so missing "fields" in "rows" cannot be noticed. A line of a table is only valid when the number of fields is correct, so faulty lines must be noticed. And because columns have different datatypes, missing delimiters can usually be repaired easily.
Edit:
On readability/editability: a good thing of course. The first time one reads XML and JSON it may be self-explanatory, having read HTML and JS already, but is that all? Most of the time it is machines reading it and sometimes developers, both of which may not be entertained by the verbosity.
The CSV in your example is quite inefficient use of 8 bit encoding. You're hardly even using 5 bits of entropy, clearly wasting 3 bits. Why not compress it?
The answer to all of these is people make mistakes, and stronger typing trades efficiency for safety. It is impossible for machine or human to identify a transposed column in a CSV stream, however both JSON & XML would automatically handle it, and (assuming no hierarchy boundaries got crossed) everything would still work. 30 years ago when storage space was scarce & instructions per second were sometimes measured 100s per second, using minimal amounts of decoration in protocols made sense. These days even embedded systems have relatively vast amounts of power & storage, thus the tradeoff for a little extra safety is much easier to make.
For tightly controlled data transfer, say between modules that my development team is working on, JSON works great. But when data needs to go between different groups, I strongly prefer XML, simply because it helps both sides understand what is happening. If the data needs to go across a "slow" pipe, compression will remove 98% of the XML "overhead".
The designers of XML were well aware that there was a high level of redundancy in the representation, and they considered this a good thing (I'm not saying they were right). Essentially (a) redundancy costs nothing if you use data compression, (b) redundancy (within limits) helps human readability, and (c) redundancy makes it easier to detect and diagnose errors, especially important when XML is being hand-authored.

Storing trillions of document similarities

I wrote a program to compute similarities among a set of 2 million documents. The program works, but I'm having trouble storing the results. I won't need to access the results often, but will occasionally need to query them and pull out subsets for analysis. The output basically looks like this:
1,2,0.35
1,3,0.42
1,4,0.99
1,5,0.04
1,6,0.45
1,7,0.38
1,8,0.22
1,9,0.76
.
.
.
Columns 1 and 2 are document ids, and column 3 is the similarity score. Since the similarity scores are symmetric I don't need to compute them all, but that still leaves me with 2000000*(2000000-1)/2 ≈ 2,000,000,000,000 lines of records.
A text file with 1 million lines of records is already 9MB. Extrapolating, that means I'd need 17 TB to store the results like this (in flat text files).
Are there more efficient ways to store these sorts of data? I could have one row for each document and get rid of the repeated document ids in the first column. But that'd only go so far. What about file formats, or special database systems? This must be a common problem in "big data"; I've seen papers/blogs reporting similar analyses, but none discuss practical dimensions like storage.
DISCLAIMER: I don't have any practical experience with this, but it's a fun exercise and after some thinking this is what I came up with:
Since you have 2.000.000 documents you're kind of stuck with an integer for the document id's; that makes 4 bytes + 4 bytes; the comparison seems to be between 0.00 and 1.00, I guess a byte would do by encoding the 0.00-1.00 as 0..100.
So your table would be : id1, id2, relationship_value
That brings it to exactly 9 bytes per record. Thus (without any overhead) ((2 * 10^6)^2) * 9 / 2 bytes are needed, that's about 17 TB.
Of course that's if you have just a basic table. Since you don't plan on querying it very often I guess performance isn't that much of an issue. So you could go 'creative' by storing the values 'horizontally'.
Simplifying things, you would store the values in a 2 million by 2 million square and each 'intersection' would be a byte representing the relationship between their coordinates. This would "only" require about 3.6 TB, but it would be a pain to maintain, and it also doesn't make use of the fact that the relations are symmetrical.
So I'd suggest to use a hybrid approach, a table with 2 columns. First column would hold the 'left' document-id (4 bytes), 2nd column would hold a string of all values of documents starting with an id above the id in the first column using a varbinary. Since a varbinary only takes the space that it needs, this helps us win back some space offered by the symmetry of the relationship.
In other words,
record 1 would have a string of (2.000.000-1) bytes as value for the 2nd column
record 2 would have a string of (2.000.000-2) bytes as value for the 2nd column
record 3 would have a string of (2.000.000-3) bytes as value for the 2nd column
etc
That way you should be able to get away with something like 2 TB (inc overhead) to store the information. Add compression to it and I'm pretty sure you can store it on a modern disk.
Of course the system is far from optimal. In fact, querying the information will require some patience as you can't approach things set-based and you'll pretty much have to scan things byte by byte. A nice 'benefit' of this approach would be that you can easily add new documents by adding a new byte to the string of EACH record + 1 extra record in the end. Operations like that will be costly though as they will result in page-splits; but at least it will be possible without having to completely rewrite the table. But it will cause quite a bit of fragmentation over time and you might want to rebuild the table once in a while to make it more 'aligned' again. Ah.. technicalities.
Selecting and Updating will require some creative use of SubString() operations, but nothing too complex..
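The row-per-document layout described above can be sketched as a few index calculations. The function names are hypothetical, document ids are assumed to be 1-based and consecutive, and the code is Swift rather than SQL; it only illustrates where a pair's byte would sit, not how a database would store it.

```swift
// Encode a similarity in 0.0...1.0 as one byte in 0...100.
func encode(_ score: Double) -> UInt8 {
    UInt8((min(max(score, 0), 1) * 100).rounded())
}

// Row id1 stores the scores for documents id1+1, id1+2, ..., so the score
// for the pair (id1, id2) sits at this zero-based byte offset within the row.
func offsetInRow(id1: Int, id2: Int) -> Int {
    precondition(id1 < id2, "store each pair once, smaller id first")
    return id2 - id1 - 1
}

// Total payload across all rows: (n-1) + (n-2) + ... + 0 = n(n-1)/2 bytes.
func totalBytes(documentCount n: Int) -> Int {
    n * (n - 1) / 2
}
```

For 2 million documents the payload comes to just under 2 * 10^12 bytes, which is where the estimate of roughly 2 TB comes from.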
PS: Strictly speaking, for 0..100 you only need 7 bits, so if you really want to squeeze the last bit out of it you could actually store 8 values in 7 bytes and save another eighth of the space (roughly 250 GB out of 2 TB), but it would make things quite a bit more complex... then again, it's not like the data is going to be human-readable anyway =)
PS: this line of thinking is completely geared towards reducing the amount of space needed while remaining practical in terms of updating the data. I'm not saying it's going to be fast; in fact, if you'd go searching for all documents that have a relation-value of 0.89 or above the system will have to scan the entire table and even with modern disks that IS going to take a while.
Mind you that all of this is the result of half an hour brainstorming; I'm actually hoping that someone might chime in with a neater approach =)

VB.NET Comparing files with Levenshtein algorithm

I'd like to use the Levenshtein algorithm to compare two files in VB.NET. I know I can use an MD5 hash to determine if they're different, but I want to know HOW MUCH different the two files are. The files I'm working with are both around 250 megs. I've experimented with different ways of doing this and I've realized I really can't load both files into memory (all kinds of string-related issues). So I figured I'd just stream the bytes I need as I go. Fine. But the implementations that I've found of the Levenshtein algorithm all dimension a matrix that's length 1 * length 2 in size, which in this case is impossible to work with. I've heard there's a way to do this with just two vectors instead of the whole matrix.
How can I compute Levenshtein distance of two large files without declaring a matrix that's the product of their file sizes?
Note that the values in each row of the Levenshtein matrix depend only on the values in the row above it. This means that you only need two one-dimensional arrays: one contains the values of the current row; the other is populated with the new values that you can compute from the current row. Then, you swap their roles (the "new" row becomes the "current" row and vice versa) and continue.
Note that this approach only lets you compute the Levenshtein distance (which seems to be what you want); it cannot tell you which operations must be done in order to transform one string into the other. There exists a very clever modification of the algorithm that lets you reconstruct the edit operations without using O(nm) memory, but I've forgotten how it works.
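The two-row idea can be sketched as follows. The example is in Swift rather than VB.NET and works on byte arrays already in memory; streaming the bytes from disk, as the question intends, is left out, and the function name is just an illustration.

```swift
// Levenshtein distance using two rows instead of the full matrix.
func levenshtein(_ a: [UInt8], _ b: [UInt8]) -> Int {
    // previous[j] is the distance between the processed prefix of a
    // and the first j bytes of b; initially the prefix of a is empty.
    var previous = Array(0...b.count)
    var current = [Int](repeating: 0, count: b.count + 1)

    for (i, x) in a.enumerated() {
        current[0] = i + 1
        for (j, y) in b.enumerated() {
            let substitution = previous[j] + (x == y ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current[j + 1] = min(substitution, deletion, insertion)
        }
        // The freshly computed row becomes the reference row for the next byte.
        swap(&previous, &current)
    }
    return previous[b.count]
}

let distance = levenshtein(Array("kitten".utf8), Array("sitting".utf8))
```

Memory is now proportional to the length of the second input only, but the running time is still proportional to the product of the two lengths, which remains very large for two files of around 250 MB.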

Problem 98 - Project Euler

The problem is as follows:
By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 36². What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 96². We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter.
Using words.txt (right click and 'Save Link/Target As...'), a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself).
What is the largest square number formed by any member of such a pair?
NOTE: All anagrams formed must be contained in the given text file.
I don't understand the mapping of CARE to 1296? How does that work? or are all permutation mappings meant to be tried i.e. all letters to 1-9?
All assignments of digits to letters are allowed. So C=1, A=2, R=3, E=4 would be a possible assignment ... except that 1234 is not a square, so that would be no good.
Maybe another example would help make it clear? If we assign A=6, E=5, T=2, then TEA = 256 = 16² and ATE = 625 = 25². So (TEA=256, ATE=625) is a square anagram word pair.
(Just because all assignments of digits to letters are allowed, does not mean that actually trying out all such assignments is the best way to solve the problem. There may be some other, cleverer, way to do it.)
In short: yes, all permutations need to be tried.
If you test all substitutions of letters for digits, then you are looking for pairs of squares with these properties:
they have the same length
they have the same digits, with the same number of occurrences as the letters in the input string.
It is faster to find all these pairs of squares. There are 68 squares with length 4, 217 squares with length 5, ... Filtering all squares of the same length by the properties above will generate a 'small' number of pairs, which are the solutions you are looking for.
This data is 'static', and doesn't depend on the input strings. It can be calculated once and used for all input strings.
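The precomputation described above could look like the following sketch, which groups all squares of a given length by their sorted digits. The function name is hypothetical, and matching the groups against the word list is deliberately left out.

```swift
// Group all squares with the given number of digits by their sorted digits.
// Squares in the same group are digit anagrams of each other.
func squaresBySignature(digits: Int) -> [String: [Int]] {
    precondition(digits >= 1, "a number has at least one digit")
    var lower = 1
    for _ in 1..<digits { lower *= 10 }   // smallest number with that many digits
    let upper = lower * 10                // first number with one digit more

    var groups: [String: [Int]] = [:]
    var root = 1
    while root * root < upper {
        let square = root * root
        if square >= lower {
            let signature = String(String(square).sorted())
            groups[signature, default: []].append(square)
        }
        root += 1
    }
    return groups
}

let fourDigitGroups = squaresBySignature(digits: 4)
let fourDigitCount = fourDigitGroups.values.reduce(0) { $0 + $1.count }
```

Only groups with at least two members can produce a pair, so most squares drop out before any word is looked at.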
Hmm. How to put this. The people who put together Project Euler promise that there is a solution that is under one minute for every problem, and there is only one problem that I think might fail this promise, but this is not it.
Yes, you could permute the digits, and try all permutations against all squares, but that would be a very large search space, not at all likely to be the (TM) right thing. In general, when you see that your "look" at the problem is going to generate a search that will take too long, you need to search something else.
Like, suppose you were asked to determine what numbers would be the result of multiplying two primes between 1 and a zillion. You could factor every number between 1 and a zillion, but it might be faster to take all combinations of two primes and multiply them. Since you are looking at combinations, you can start with two and go until your results are too large, then do the same with three, etc. By comparison, this should be much faster - and you don't have to multiply all the numbers out, you could take logs of all the primes and then just add them and find the limit for every prime, giving you a list of numbers you could add up.
There are a bunch of innovative solutions, but the first one you think of - especially the one you think of when Project Euler describes the problem, is likely to be wrong.
So, how can you approach this problem? There are probably too many permutations to look at, but maybe you can figure out something with mappings and comparing mappings?
(Trying to avoid giving it all away.)