Why does the ext4 filesystem use HTree as its extent tree structure?

From the ext4 Wikipedia introduction, I found that HTree is used in ext4 for both directory organization and extent organization.
In the directory organization scenario, a hashed tree can help balance the tree and improve search performance,
but what is the benefit of using HTree for extent organization?
Thanks for your wisdom :)

The i_block field in the ext4_inode structure can only hold 60 bytes. Each extent is 12 bytes long, and the extent header is also 12 bytes, so i_block can contain only 4 extents plus 1 header. Ext4 switches to a tree if it needs to store more than 4 extents.
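For reference, this is what the on-disk layout looks like. The structures below follow the Linux kernel's fs/ext4/ext4_extents.h; the typedefs are just stand-ins for the kernel's little-endian types so the sketch compiles on its own:

#include <stdint.h>

typedef uint16_t __le16;   /* on disk these are little-endian; the kernel uses <linux/types.h> */
typedef uint32_t __le32;

/* 12-byte header at the start of i_block (and of every extent tree block) */
struct ext4_extent_header {
    __le16 eh_magic;       /* magic number, 0xF30A */
    __le16 eh_entries;     /* number of valid entries following the header */
    __le16 eh_max;         /* capacity of entries this block can hold */
    __le16 eh_depth;       /* 0 = leaf (entries are extents), >0 = index node */
    __le32 eh_generation;  /* generation of the tree */
};

/* 12-byte leaf entry: maps a contiguous run of logical blocks to physical blocks */
struct ext4_extent {
    __le32 ee_block;       /* first logical block the extent covers */
    __le16 ee_len;         /* number of blocks covered by this extent */
    __le16 ee_start_hi;    /* high 16 bits of the starting physical block */
    __le32 ee_start_lo;    /* low 32 bits of the starting physical block */
};

With a 12-byte header and 12 bytes per entry, the 60-byte i_block area holds the header plus exactly four extents; a fifth extent forces ext4 to move the entries into separate tree blocks and increase eh_depth.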

Locality sensitive hashing - what happens when a bucket is empty?

Assume I've constructed an LSH database according to some set of hashes, and I'm now beginning to query the database to find approximate nearest neighbors.
Are there any guidelines to what happens when you compute the hash for a query point, and the corresponding bucket is empty? Similarly, say I want to find the 5 approximate nearest neighbors, and the bucket has only 4 other data points?
I believe that retrieving too few points means you have too many buckets for your training data, and that is application dependent, of course. Take a look at the LSH toolbox implementation by Greg Shakhnarovich and its README file. In that implementation, fewer hash functions (smaller k) means fuller buckets, and that in turn means slower LSH.
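As a rough back-of-the-envelope illustration (the numbers below are made up, not taken from the toolbox): with k binary hash functions per table you get up to 2^k buckets, so n uniformly hashed training points give an expected bucket occupancy of roughly n / 2^k:

n = 100,000, k = 10  ->  2^10 = 1,024 buckets    ->  ~98 points per bucket
n = 100,000, k = 17  ->  2^17 = 131,072 buckets  ->  ~0.8 points per bucket (most buckets empty)

So if queries often land in empty or nearly empty buckets, reducing k (or probing more tables / neighboring buckets) is the usual trade-off, at the cost of slower lookups.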

Best Firebird blob size page size relation

I have a small Firebird 2.5 database with a blob field called "note" declared as this:
BLOB SUB_TYPE 1 SEGMENT SIZE 80 CHARACTER SET UTF8
The database page size is:
16384 (which I suspect is too high)
I have run this select in order to discover the average size of the blob fields available:
select avg(octet_length(items.note)) from items
and got this information:
2671
As a beginner, I would like to know the best segment size for this blob field and the best database page size, in your opinion (I know that this depends on other information, but I still don't know how to figure it out).
Blobs in Firebird are stored in separate pages of your database. The exact storage format depends on the size of your blob. As described in Blob Internal Storage:
Blobs are created as part of a data row, but because a blob could be
of unlimited length, what is actually stored with the data row is a
blobid, the data for the blob is stored separately on special blob
pages elsewhere in the database.
[..]
A blob page stores data for a blob. For large blobs, the blob page
could actually be a blob pointer page, i.e. be used to store pointers
to other blob pages. For each blob that is created a blob record is
defined, the blob record contains the location of the blob data, and
some information about the blobs contents that will be useful to the
engine when it is trying to retrieve the blob. The blob data could be
stored in three slightly different ways. The storage mechanism is
determined by the size of the blob, and is identified by its level
number (0, 1 or 2). All blobs are initially created as level 0, but
will be transformed to level 1 or 2 as their size increases.
A level 0 blob is a blob that can fit on the same page as the blob
header record, for a data page of 4096 bytes, this would be a blob of
approximately 4052 bytes (Page overhead - slot - blob record header).
In other words, if the average size of your blobs is 2671 bytes (and most larger ones are still smaller than roughly 4000 bytes), then a page size of 4096 is likely optimal, as it will reduce the wasted space on a blob page from on average 16340 - 2671 = 13669 bytes to 4052 - 2671 = 1381 bytes.
However, for performance itself this is hardly going to matter, and smaller page sizes have other effects that you will need to take into account. For example, a smaller page size will also reduce the maximum size of a CHAR/VARCHAR index key, indexes might become deeper (more levels), and fewer records will fit in a single page (or wider records will be split over multiple pages).
Without measuring and testing it is hard to say if using 4096 for the page size is the right size for your database.
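If you do decide to try a different page size, keep in mind that the page size can only be set when a database is created or restored. A sketch of the usual route, a backup/restore cycle with gbak and its -page_size switch on restore (file names and credentials below are placeholders):

gbak -b -user SYSDBA -password masterkey items.fdb items.fbk
gbak -c -page_size 4096 -user SYSDBA -password masterkey items.fbk items_4k.fdb

That also gives you both databases side by side to measure against each other before switching over.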
As to segment sizes: the segment size is a historical artifact that is best ignored (and left out). Sometimes applications or drivers incorrectly assume that blobs need to be written or read in the specified segment size; in those rare cases, specifying a larger segment size might improve performance. If you leave it out, Firebird will default to a value of 80.
From Binary Data Types:
Segment Size: Specifying the BLOB segment is throwback to times past,
when applications for working with BLOB data were written in C
(Embedded SQL) with the help of the gpre pre-compiler. Nowadays, it is
effectively irrelevant. The segment size for BLOB data is determined
by the client side and is usually larger than the data page size, in
any case.

Is SHA-512 collision resistant?

According to the books that I have read, SHA (Secure Hash Algorithm) is collision resistant. But if the input space is a 1024-bit number and the output space is a 512-bit message digest, then shouldn't there be
(2^1024)/(2^512) collisions? Since the range is smaller than the domain being mapped, there should be collisions. Please explain where I am going wrong.
The chance of a collision does not depend on the input size. For a 512-bit hash you would need on the order of 1.4×10^77 hash computations before a collision becomes likely (a 50% chance); see the probability table in the Wikipedia article on birthday attacks.
Maybe your book has also mentioned the definition of collision resistance? It does not mean that no collisions exist (which is clearly impossible), but that you are not able to easily find two messages that produce the same hash.
a hash function H is collision resistant if it is hard to find two
inputs that hash to the same output; that is, two inputs a and b such
that H(a) = H(b), and a ≠ b
From Wikipedia
As you describe: since the input space (arbitrary size) is larger than the output space (e.g. 512 bits for SHA-512), collisions always exist.
"Collision resistant" means it is adequately unlikely that a collision will ever be found.
Your confusion is resolved by considering how large the output space "512 bits" really is:
2^512 (the number of possible configurations of a 512-bit array) is of the order of 10^154.
For comparison: the number of atoms in the visible universe is somewhere in the range of 10^80.
A million is 10^6.
So a million of our 'visible universes' contains 10^86 atoms.
A million times a million universes contains 10^92 atoms.
If you could store a single 512-bit value on a single atom, how many universes would you need to store all possible 512-bit hash values? On the order of 10^154 / 10^80 = 10^74 of them.
Starting with a specific 512-bit number (and assuming the hash function is not broken), the probability p of obtaining a collision, given that you can produce new hashes at a rate R and have a total time t to do so, is:
p = R*t/(2^(512/2))
(The exponent is halved, see "birthday attack": the expected number of hashes needed to find a collision for an n-bit hash is on the order of 2^(n/2).)
Let's plug in some example numbers:
The hash rate of the bitcoin network is currently about R = 200*10^18 / s (200 million terahashes per second).
Consider the situation that, since the beginning of the universe, the bitcoin network's current hashing capacity had been available for the sole purpose of finding a collision for a specific hash value, i.e. for an available time of t = 13.787*10^9 years.
Then the probability that a collision would have been found by now is about 8 × 10^-38 %.
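Spelling out that arithmetic (values rounded):
t = 13.787*10^9 years ≈ 4.35*10^17 s
R*t ≈ 200*10^18 /s * 4.35*10^17 s ≈ 8.7*10^37 hashes
2^(512/2) = 2^256 ≈ 1.16*10^77
p ≈ 8.7*10^37 / 1.16*10^77 ≈ 7.5*10^-40, i.e. about 8 × 10^-38 %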
Again, it is hard to appreciate how small this number is.
Edit: A similar question with a good answer is found here: https://crypto.stackexchange.com/questions/89558/are-sha-256-and-sha-512-collision-resistant

How to get the LBA (logical block address) of a file from the MFT on an NTFS file system?

I accessed the $MFT file and extracted the file attributes.
Given the file attributes from the MFT, how do I get the LBA of a file from its MFT record on an NTFS file system?
To calculate the LBA, I know the cluster number of the file.
Is it possible to calculate it using the cluster number?
I'm not entirely sure of your question, but if you're simply trying to find the logical location of a file on disk, there are various IOCTLs that will achieve this.
For instance, MFT File records: FSCTL_GET_NTFS_FILE_RECORD
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364568(v=vs.85).aspx
Location on disk of a specific file via HANDLE: FSCTL_GET_RETRIEVAL_POINTERS
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364572(v=vs.85).aspx
If you're trying to parse NTFS on your own, you'll need to follow the $DATA attribute, which will always consist of non-resident data runs (unless it's a small file whose data is resident within the MFT). Microsoft's data runs are fairly simple structures: the two nibbles of each run's first byte specify how many bytes encode the length and the offset of that run of data.
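As a rough sketch of the IOCTL route (error handling trimmed, the path is just a placeholder), FSCTL_GET_RETRIEVAL_POINTERS returns the VCN-to-LCN extent mapping of a file:

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    HANDLE hFile = CreateFileW(L"C:\\some\\file.bin", GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                               OPEN_EXISTING, 0, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    STARTING_VCN_INPUT_BUFFER in = {0};   /* start enumerating at VCN 0 */
    BYTE buf[4096];                       /* room for many extents; loop on ERROR_MORE_DATA in real code */
    RETRIEVAL_POINTERS_BUFFER *rp = (RETRIEVAL_POINTERS_BUFFER *)buf;
    DWORD bytes = 0;

    if (DeviceIoControl(hFile, FSCTL_GET_RETRIEVAL_POINTERS,
                        &in, sizeof(in), rp, sizeof(buf), &bytes, NULL))
    {
        for (DWORD i = 0; i < rp->ExtentCount; i++) {
            /* An Lcn of -1 marks a sparse (unallocated) run. */
            printf("extent %lu: runs up to VCN %lld, starts at LCN %lld\n",
                   i, rp->Extents[i].NextVcn.QuadPart, rp->Extents[i].Lcn.QuadPart);
        }
    }
    CloseHandle(hFile);
    return 0;
}

The LCNs are relative to the start of the volume; to turn them into absolute LBAs you still need the cluster size and the volume's starting sector on the physical disk (for example from IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS, or from the hidden-sectors field of the boot sector).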
IMHO you should write the code by doing some basic arithmetic rather than using IOCTLs and FSCTLs for everything. You should know the size of your disk and the offset at which the volume starts (or every extent, by using IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS) and store those values somewhere. Then just add the LCN times the size of a cluster to the offset of the extent on the disk.
Most of the time you only have to deal with one extent. When you have multiple extents, you can figure out which extent the cluster is on by multiplying the LCN by the size of a cluster and then subtracting the size of each extent returned by the IOCTL, in the order they are returned; when the next number to subtract is greater than your current number, that LCN is on that extent.
A file is a single, virtually contiguous unit consisting of virtual clusters. These virtual clusters map onto extents (fragments) of logical clusters, where LCN 0 is the boot sector of the volume. A logical cluster may be remapped to a different logical cluster if there are bad clusters. The logical cluster is then translated to a physical cluster (PCN), or to an LBA (the first sector of that physical cluster), by adding the number of hidden sectors (the sector number of the boot sector relative to the start of the disk):
PCN = hidden sectors / (sectors per cluster in the volume) + LCN
LBA = hidden sectors + LCN * (sectors per cluster in the volume)
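A minimal C sketch of that last formula (identifiers are illustrative, not from any particular codebase); the hidden-sector count and sectors-per-cluster come from the volume boot sector or the disk-extent IOCTL mentioned above:

#include <stdint.h>

/* Translate a logical cluster number within a volume to an absolute LBA on the disk. */
static uint64_t lcn_to_lba(uint64_t lcn,
                           uint64_t hidden_sectors,       /* sector where the volume starts on the disk */
                           uint32_t sectors_per_cluster)
{
    return hidden_sectors + lcn * (uint64_t)sectors_per_cluster;
}

/* Example: a volume starting at sector 2048 with 8 sectors per cluster
 * (4 KiB clusters on 512-byte sectors): lcn_to_lba(1000, 2048, 8) == 10048. */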

DAG: Minimizing distance between entries in grouped nodes

I have a directed acyclic graph with nodes that are lists of entries that connect to entries in other nodes. Kind of like this:
entry ]
entry--| ] node 1
entry | ]
----- |
entry<-| ] node 2
entry | ]
----- |
entry | ] node 3
entry--| ]
The ordering of entries within a node is fixed. The entries are stored in an array with absolute indexes to the entries they link to. There is a maximum of 1 link per entry, and every node has at least 1 link. (in other words, this is a highly connected graph). The graph contains approximately 100,000 entries grouped in 40,000 nodes.
What I need to do is minimize the maximum distance between entries by reordering nodes so that I can use relative indexes for the links and compress the underlying data structure.
Because compression and performance are the goal, solutions that add external data (jump tables, special jump elements in the list) are unacceptable. I really need an algorithm for reordering nodes that minimizes the maximum distance between entries. Any thoughts?
The problem you are describing is how to minimize the maximum distance. I think it's NP-hard, so a simple solution won't be very good. You could, however, model it as an ILP problem and use a solver for it.
You would then minimize M as the objective.
The constraints would be M >= abs(s_i - e_i) for all links l_i, where s_i and e_i are the absolute indices of the start and end entry of the link. Since abs() is not linear, each such constraint is written as the pair M >= s_i - e_i and M >= e_i - s_i.
These entries can be rewritten in terms of the node they belong to as s_i = n_i + c_i, with n_i the starting index of the node that s_i belongs to and c_i the fixed offset of the entry within that node (among the other entries); e_i is rewritten similarly. Then you are set to optimize the n_i with the solver.
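As a small illustration of the linearized constraints, here is a sketch using GLPK as an example solver. The three-node instance is made up, and the sketch deliberately omits the extra constraints a full model needs to keep the node index ranges disjoint (i.e. to force the n_i to encode an actual reordering of the nodes):

/* Minimize M subject to M >= s_i - e_i and M >= e_i - s_i,
 * where s_i = n_a + c_a and e_i = n_b + c_b for a link from node a to node b. */
#include <stdio.h>
#include <glpk.h>

int main(void)
{
    glp_prob *mip = glp_create_prob();
    glp_set_obj_dir(mip, GLP_MIN);

    /* Columns: 1 = M, 2..4 = n_1..n_3 (starting index of each node). */
    glp_add_cols(mip, 4);
    glp_set_col_bnds(mip, 1, GLP_LO, 0.0, 0.0);   /* M >= 0 */
    glp_set_obj_coef(mip, 1, 1.0);                /* objective: minimize M */
    for (int j = 2; j <= 4; j++) {
        glp_set_col_bnds(mip, j, GLP_DB, 0.0, 1000.0);  /* bounded by total entry count */
        glp_set_col_kind(mip, j, GLP_IV);               /* node positions are integers */
    }

    /* Toy links: from node 1 offset 2 to node 2 offset 0, and from node 3 offset 1 to node 2 offset 1.
     * a and b are column indices of the source and target nodes. */
    struct { int a, b; double ca, cb; } links[] = { {2, 3, 2.0, 0.0}, {4, 3, 1.0, 1.0} };

    /* Two rows per link: M - n_a + n_b >= c_a - c_b  and  M + n_a - n_b >= c_b - c_a. */
    glp_add_rows(mip, 4);
    int ia[1 + 12], ja[1 + 12]; double ar[1 + 12]; int ne = 0;
    for (int l = 0; l < 2; l++) {
        int r1 = 2 * l + 1, r2 = 2 * l + 2;
        glp_set_row_bnds(mip, r1, GLP_LO, links[l].ca - links[l].cb, 0.0);
        glp_set_row_bnds(mip, r2, GLP_LO, links[l].cb - links[l].ca, 0.0);
        ia[++ne] = r1; ja[ne] = 1;          ar[ne] =  1.0;   /* +M   */
        ia[++ne] = r1; ja[ne] = links[l].a; ar[ne] = -1.0;   /* -n_a */
        ia[++ne] = r1; ja[ne] = links[l].b; ar[ne] =  1.0;   /* +n_b */
        ia[++ne] = r2; ja[ne] = 1;          ar[ne] =  1.0;   /* +M   */
        ia[++ne] = r2; ja[ne] = links[l].a; ar[ne] =  1.0;   /* +n_a */
        ia[++ne] = r2; ja[ne] = links[l].b; ar[ne] = -1.0;   /* -n_b */
    }
    glp_load_matrix(mip, ne, ia, ja, ar);

    glp_iocp parm;
    glp_init_iocp(&parm);
    parm.presolve = GLP_ON;        /* let glp_intopt solve the LP relaxation itself */
    glp_intopt(mip, &parm);

    printf("max distance M = %g\n", glp_mip_obj_val(mip));
    for (int j = 2; j <= 4; j++)
        printf("n_%d = %g\n", j - 1, glp_mip_col_val(mip, j));
    glp_delete_prob(mip);
    return 0;
}

For your real 40,000-node instance you would generate the columns and rows programmatically from the link array; the key point is only that each abs() constraint becomes the pair of rows shown above.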