How to obtain an inode's data block numbers - block

I was wondering if there is an easy way to obtain the numbers of the blocks that a specific file is stored to in minix. eg stat() has the information of how many blocks a specific inode has but not the addresses of the actual blocks.

Related

Storing DNA in Swift

I'm going to write an application for dealing with raw DNA data samples, as the files you get from MyHeritage, Ancestry, FamilyTreeDNA, 23&me etc. Each of these files are basically a CSV-file with some quirks, and asked about decoding them in another question I posted earlier.
Now for the next part. When I have parsed/decoded those files, I want to put the DNA data in a database, so that I can compare one persons DNA to that of another person. It's a lot of data, but not more than most computers can handle.
In memory, I can have the full DNA for both persons, and compare them, and then create ArraySlices for the segments of DNA data that overlap, but ArraySlices aren't suitable for storage, In memory the ArraySlice can't exist by itself. It's just a reference into the full array, so if I would flatten the ArraySlice I would still get the whole array, even for the segments that don't match.
Each person shall have their full DNA on backing store, and can be read into memory, but how would you store the matching segments?
I'm thinking of something like:
//Bucket is the term FamilyTreeDNA for indicating whether the DNA match is on the maternal or paternal chromosome.
enum Bucket {
case .maternal
case .paternal
}
//THe first 22 chromosome pairs are just numbered from 1 to 22, but the last two are X and Y, so I use a String for storing chromosome "number"
struct SharedSegment {
let onChromosome: String
let bucket: Bucket
let range: Range<UInt>
}
I don't care if it takes more disk space, but I want to have lightning fast comparisons of DNA, so that I can compare all the DNA for all the individuals in a datavase, without it taking months to do so. Also storage space for the full DNA to make comparisons.
At the first stage I'm just building an app for storing the DNA kits I administer, but I already have plans for a services of type Gedmatch and DNAPainter if you have tried them. This means it's a services where people can upload their DNA to be compared to other peoples DNA, and lets says a million people upload their DNA to this service, and each of them should have their DNA compared to the other 999'999 people. The number of comparison will be huge, so my primary focus is on performance. Each file with raw DNA data will contain about 400-950 thousand lines of DNA data.
Each line will contain the chromosome number, the RSID, the position within the chromosome and a genotype. The latter is two letters "AA", "AC", "CT" etc. There are four different letters A, C, G and T. The reason there are two letter for each position is that you have chromosome pairs, where there is one chromosome inherited from the father and one from the mother, and there is one letter from each of those two chromosomes. Of course I can store them as just a string of characters, but there are chances of errors, so I would like to represent them in code as
enum Aminoacid {
noCall = 0
case A
case C
case G
case T
}
When sequencing DNA there are sometimes problem, and the sequencing equipment can't determine which amino acid it is in a certain position. This is called a "no call", therefore the case noCall in the enum. in the raw DNA file this is represented by a dash, so it can say in the results "-A"m which means that one of the parents had an A in that position, and the other could not be determined.
Is there any possibility to squeeze them together in 4 bits (nybble), so that I can store two of these letter per byte?It's even possible to squeeze into 3 bits, but I can't get three letters into a byte anyway. It solve be two letter รก 3 birs each and two bits wasted in every byte, so I could just as well use 4 bits per amino acid. There are Uint64, Uint32, Uint16 and Uint8 in Swift, but no Uint4, which would be ideal for this case. I'm also thinking about whether to store the two letter from the maternal and paternal chromosome together or if U should split them into separate array One array for maternal DNA and one for paternal). There is a problem with that approach, and that is it's impossible to tell if the first letter on each row is maternal or paternal, until you have the DNA from at least one of the parents to compare with. In absence of their DNA, I would have to have a third array to store both letters in, until I can determine swhich one is maternal end paternal respectively. I'm trying to come up with the most effective way of storing this, to make the comparisons super fast.
In one way I don't like using enums, and that is because I will have to convert them to rawValue, do I can do something like
var genotype = Aminoacid.A.rawValue << 4 + Aminoacid.G.rawValue
As far as I can see that's the best way to squeeze two of these into ont byte, since there's no UInt4.
I'm not so fond of having lots of .rawValue all over my code. I would like to have only Aminoacid.A << 4 + Aminoacid.G, but unfortunately I don't think this is possible. Maybe there is a beter way to store these sequences of amino acids in the database, like enums with associated values or something. I don't know how efficient associated values till be, when working with such large data sets.
If there is anyone out there, that wants to collaborate on this project, that is so far just a hobby project, but I have plans for making a business out of it eventually. This means I can't employ anyone to do this, but if you're working on similar projects then let me know. We can make better things together. Just be aware that I'm writing in Swift, and I'm going to deploy on macOS, but Swift is also available for other platforms, so coders for Linux and Windows are equally welcome to work on a joint project.
This became a little offtopic. My question was about storage of raw DNA and shared segments in a way that is optimal for fast search and comparison of huge amounts of DNA.I probably won't use CoreData for storage, since I would like to keep the options for porting to other platforms than Apples. At the moment I'm using CoreData to experiment a little with storing DNA in different ways.

Elm: Search Number in Bytes

I'm trying to find some exif data in an image.
So first I need to find the number 0x45786966 ('Exif' as unsignedInt32) and store the offset.
The next two bytes should be zeros and after that the endianness as unsignedInt16 (either 0x4d4d or 0x4949) which should be stored too.
I can get the image as Bytes with the elm/file module.
But how do I search the 'Exif' start and parse the endianness in those Bytes?
I looked at the loop-example from elm/bytes but do not fully understand it.
First it reads the length of a list (unsignedInt32) and then it reads byte by byte?
How would this work if I want to read unsignedInt32s instead of bytes?
How do I set an offset to indicate where functions like unsignedInt32 should read next?
The example is talking about structured data with a known size field at the start. In your case, what you want to do is a search, so it is a rather different problem.
The problem is elm/bytes isn't really designed to handle searching. If you can guarantee the part you are looking for will be byte aligned, it may well be possible to do this, but given just what you have said, there isn't an easy way, as you can't iterate bit-by-bit.
You would have to read in values without alignment and then manually search for the part of the number you want within that. Given the difficulty and inefficiency of that approach, I would recommend using ports instead for that use case.
If you can guarantee that what you are searching for will be byte-aligned (or better yet, aligned to the length of your number), you can decode a byte at a time until you find what you are looking for. There is no way to read from a given offset, if you want to read to a certain point, you'd need to read and throw away values.
To do this, you would want to set up a loop where your state contains how much of the value you are looking for you have found. Each step, you check if you have the whole thing (success), you have the next part (continue), or you have something different (reset the state to search from the start again). If you reach the end without finding it, you have failed.

How to know the configured size of the Chronicle Map?

We use Chronicle Map as a persisted storage. As we have new data arriving all the time, we continue to put new data into the map. Thus we cannot predict the correct value for net.openhft.chronicle.map.ChronicleMapBuilder#entries(long). Chronicle 3 will not break when we put more data than expected, but will degrade performance. So we would like to recreate this map with new configuration from time to time.
Now it the real question: given a Chronicle Map file, how can we know which configuration was used for that file? Then we can compare it with actual amount of data (source of this knowledge is irrelevant here) and recreate a map if needed.
entries() is a high-level config, that is not stored internally. What is stored internally is the number of segments, expected number of entries per segment, and the number of "chunks" allocated in the segment's entry space. They are configured via ChronicleMapBuilder.actualSegments(), entriesPerSegment() and actualChunksPerSegmentTier() respectively. However, there is no way at the moment to query the last two numbers from the created ChronicleMap, so it doesn't help much. (You can query the number of segments via ChronicleMap.segments().)
You can contribute to Chronicle-Map by adding getters to ChronicleMap to expose those configurations. Or, you need to store the number of entries separately, e. g. in a file along with the ChronicleMap persisted file.

Notating large batch of files

I have about 30,000 different files all with different file formats names. I want to put together a list of "unique" files given that the dates/etc. are replaced by generic characters/symbols.
For example:
20160105asdf_123456_CODE.txt
Would be notated into:
YYYYMMDD*_######_XXXX.txt
Any ideas on how to do this efficiently on a large scale? I thought about parsing it out per delimiter ("_"), but I'm sure there's something a lot easier out there.

Reading lines from a data file

Problem: Reading data file with multiple entries on a single line
The easiest way I have found to do this is to read the whole line as string and then use internal reads to extract the non-blank values.
~~~~~~~~~~~~
Problems with that solution:
Requires you to know the maximum length of any given line in the data file which is often not possible.
or
Requires you to create an arbitrary and excessively long string variable which wastes memory.
~~~~~~~~~~~~
Is there any other way of doing this?
You can directly read multiple items from a one or multiple lines. For example:
read (5, *) a, b, c, d
will read four values from one to many lines.
Using deferred length character and non-advancing reads avoids the problems you mention in your question.
Continuing to parse of the resulting line using internal IO with explicit formats then avoids the potential for user "surprise" associated with the more obscure features of list directed formatting and allows far more scope and control over input error detection and reporting.