How do range queries work on an LSM (log-structured merge) tree? - indexing

Recently I've been studying common indexing structures in databases, such as B+-trees and LSM. I have a solid handle on how point reads/writes/deletes/compaction would work in an LSM.
For example (in RocksDB/LevelDB), on a point read we would first check the in-memory index (memtable), followed by some number of SST files, from most to least recent. On each level in the LSM we would use binary search to speed up finding the SST file that may hold the given key. For a given SST file, we can use Bloom filters to quickly check whether the key might exist, saving us further time.
What I don't see is how a range read specifically works. Does the LSM have to open an iterator on every SST level (including the memtable), and iterate in lockstep across all levels, to return a final sorted result? Is it implemented as just a series of point queries (almost definitely not)? Are all potential keys pulled first and then sorted afterwards?
I haven't been able to find much documentation on the subject, so any insight would be helpful here.

RocksDB has a variety of iterator implementations like Memtable Iterator, File Iterator, Merging Iterator, etc.
During a range read, a series of iterators is created: one for each memtable, one for each Level-0 file (because SSTs in L0 may overlap) and one for each later level. Each iterator seeks to the start of the range much like a point lookup (using binary search within SSTs) via a Seek() call. A merging iterator then collects keys from each of these iterators and returns the data in sorted order until the end of the range is reached.
Refer to this documentation on iterator implementation.
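To make the merging step concrete, here is a minimal Python sketch of the idea, not RocksDB's actual code. It assumes each source (memtable or SST run) is just a sorted list of (key, value) pairs, that sources are passed newest first, and that a value of None stands for a tombstone; the name range_scan is made up for this illustration.

```python
import heapq
from bisect import bisect_left

def range_scan(sources, start, end):
    """Yield (key, value) for start <= key < end across sorted runs.

    sources: list of sorted lists of (key, value), newest run first.
    A value of None is treated as a deletion marker (assumption).
    """
    heap = []
    for age, run in enumerate(sources):
        # "Seek": binary search for the first entry with key >= start.
        # (start,) sorts before any (start, value) tuple.
        pos = bisect_left(run, (start,))
        if pos < len(run):
            heap.append((run[pos][0], age, pos))
    heapq.heapify(heap)

    last_key = None
    while heap:
        # Smallest key first; ties go to the newest run (lowest age).
        key, age, pos = heapq.heappop(heap)
        if key >= end:
            break
        run = sources[age]
        if pos + 1 < len(run):
            heapq.heappush(heap, (run[pos + 1][0], age, pos + 1))
        if key == last_key:
            continue  # an older version shadowed by a newer one
        last_key = key
        value = run[pos][1]
        if value is not None:
            yield key, value

memtable = [("b", "B2"), ("d", None)]
level0 = [("a", "A"), ("b", "B1"), ("c", "C")]
level1 = [("d", "D"), ("e", "E")]
print(list(range_scan([memtable, level0, level1], "b", "e")))
```

So no sorting after the fact is needed: each source is already sorted, and the heap only ever holds one entry per source.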


Natural way of indexing elements in Flink

Is there a built-in way to index and access indices of individual elements of DataStream/DataSet collection?
Like in typical Java collections, where you know that e.g. the 3rd element of an ArrayList can be obtained by ArrayList.get(2) and, vice versa, ArrayList.indexOf(elem) gives us the index of (the first occurrence of) the specified element. (I'm not asking about extracting elements out of the stream.)
More specifically, when joining DataStreams/DataSets, is there a "natural"/easy way to join elements that came (were created) first, second, etc.?
I know there is a zipWithIndex transformation that assigns sequential indices to elements. I suspect the indices always start with 0? But I also suspect that they aren't necessarily assigned in the order the elements were created in (i.e. by their Event Time). (It also exists only for DataSets.)
This is what I currently tried:
DataSet<Tuple2<Long, Double>> tempsJoIndexed = DataSetUtils.zipWithIndex(tempsJo);
DataSet<Tuple2<Long, Double>> predsLinJoIndexed = DataSetUtils.zipWithIndex(predsLinJo);
DataSet<Tuple3<Double, Double, Double>> joinedTempsJo = tempsJoIndexed
And it seems to create wrong pairs.
I see some possible approaches, but they're either non-Flink or not very nice:
1. I could of course assign an index to each element upon the stream's creation and have e.g. a stream of Tuples.
2. Work with event-time timestamps. (I suspect there isn't a way to key by timestamps, and even if there was, it wouldn't be useful for joining multiple streams like this unless the timestamps are actually assigned as indices.)
3. We could try "collecting" the stream first but then we wouldn't be using Flink anymore.
The first approach seems like the most viable one, but it also seems redundant given that the stream should by definition be a sequential collection and as such, the elements should have a sense of orderliness (e.g. `I'm the 36th element because 35 elements already came before me.`).
I think you're going to have to assign index values to elements, so that you can partition the data sets by this index, and thus ensure that two records which need to be joined are being processed by the same sub-task. Once you've done that, a simple groupBy(index) and reduce() would work.
But assigning increasing ids without gaps isn't trivial if you want to read your source data with parallelism > 1. In that case I'd create a RichMapFunction that uses the sub-task id and the number of sub-tasks from the runtime context to calculate non-overlapping, monotonically increasing indexes.
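The index arithmetic can be sketched in a few lines. This is plain Python rather than Flink code, and make_indexer is a made-up name; in Flink the two inputs would come from the runtime context inside the RichMapFunction. Each sub-task hands out subtask_id, subtask_id + n, subtask_id + 2n, and so on, so no two sub-tasks ever produce the same index.

```python
def make_indexer(subtask_id, num_subtasks):
    """Return a function that yields this sub-task's next index on each call."""
    counter = 0

    def next_index():
        nonlocal counter
        # Interleave index ranges: sub-task k only ever emits k mod num_subtasks.
        index = subtask_id + counter * num_subtasks
        counter += 1
        return index

    return next_index

first = make_indexer(0, 2)
second = make_indexer(1, 2)
print([first() for _ in range(3)])   # indexes handed out by sub-task 0
print([second() for _ in range(3)])  # indexes handed out by sub-task 1
```

Note that the result is only gap-free if every sub-task sees the same number of elements, and the indexes reflect per-sub-task arrival order, not a global order.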

Multithreaded grouping algorithm

I have a collection of circles, each of which may or may not intersect one or more other circles in the collection. I want to group these circles such that each "group" contains all circles such that every member of the group intersects at least one other member of the group, and such that no member of any group intersects any member of any other group. I have come up with the following VB.NET/pseudocode algorithm to solve this problem on a single thread:
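Dim groups As New List(Of List(Of Circle))
For Each circleToClassify In allCircles
    ' Reset explicitly: a Dim inside a loop keeps its value between iterations
    Dim added As Boolean = False
    For Each group In groups
        For Each circle In group
            If circleToClassify.Intersects(circle) Then
                ' Add to the first group that contains an intersecting circle
                group.Add(circleToClassify)
                added = True
                Exit For
            End If
        Next
        If added Then
            Exit For
        End If
    Next
    If Not added Then
        ' No intersection found: start a new group with this circle
        Dim newGroup As New List(Of Circle)
        newGroup.Add(circleToClassify)
        groups.Add(newGroup)
    End If
Next
Return groups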
Or in English:
1. Take each item from the collection of circles
2. Check if it intersects with any member of any existing group (bear in mind a "group" may only contain a single circle)
3. If the circle does intersect in the aforementioned manner, add it to the appropriate group
4. Otherwise create a new group with this circle as its only member
5. Go to step 1.
What I want to be able to do is perform this task using an arbitrary number of threads. However, I haven't got very far at all as all solutions I've come up with so far will just end up executing serially due to locking.
Can anyone provide any tips on what I want to be thinking about to achieve this multithreading?
The best multithreaded solutions avoid sharing or perform read-only sharing. (And hence don't need locks.)
Consider partitioning your work so that threads don't share result data, and then merging each thread's results.
Note that when you strip away the detail of detecting whether groups of circles intersect, you are really dealing with a connected components graph theory problem. There's plenty of useful material on this subject online. And in fact you may find it much easier and sufficiently fast to simply apply a breadth first search algorithm to find connected components.
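As a rough illustration of the breadth-first approach, here is a single-threaded Python sketch. It assumes circles are (x, y, radius) tuples and that "intersects" means the two discs touch or overlap (so a circle contained in another counts too); adjust the test if only crossing outlines should count.

```python
from collections import deque
from math import hypot

def intersects(a, b):
    """True if the two discs touch or overlap (assumed meaning of 'intersect')."""
    (ax, ay, ar), (bx, by, br) = a, b
    return hypot(ax - bx, ay - by) <= ar + br

def group_circles(circles):
    """Return connected components as lists of indexes into `circles`."""
    seen = [False] * len(circles)
    groups = []
    for start in range(len(circles)):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        group = []
        while queue:
            i = queue.popleft()
            group.append(i)
            # Visit every not-yet-grouped circle that intersects circle i.
            for j in range(len(circles)):
                if not seen[j] and intersects(circles[i], circles[j]):
                    seen[j] = True
                    queue.append(j)
        groups.append(group)
    return groups

circles = [(0, 0, 1), (1.5, 0, 1), (10, 0, 1), (3, 0, 1)]
print(group_circles(circles))
```

Unlike the one-pass loop in the question, this also joins two groups when a later circle bridges them.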
When doing multi-threaded development, first prize is to implement the threads in such a way as to minimise the number of locks. In the most trivial case, if they don't share any data, they don't need locks at all. Likewise, if you can guarantee that the shared data won't be modified while the threads are running, you don't need locks in that case either.
In your question, there's no need for your input list of circles to be modified. The problem you have is that you're building up a shared list of circle groups. Basically you're sharing your result space and need locks to ensure the integrity of the results.
One technique in this situation is to "partition and merge". As a trivial example consider finding the maximum of a large list of numbers. The naive (and ideal single-threaded) solution is to:
keep a single "current maximum" found;
compare each element to this value;
and update the "current maximum" if it's higher.
The problem for multithreading occurs in updating of the shared result. One solution is to:
partition the list for each of p threads;
find the maximum within each partition;
once all threads finish their work, the final result is trivially obtained by finding the maximum of the p partitioned maximums.
The trade-off against a single-threaded solution involves weighing up the ease with which the workload can be partitioned and the per-thread results merged versus the often much simpler single-threaded approach.
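The maximum example might look like this in Python. It is only a sketch of the partition-and-merge shape: parallel_max and the default of 4 threads are arbitrary choices, and in CPython the threads will not actually speed up a CPU-bound max because of the global interpreter lock. The point is that no thread writes to shared state, so no lock is needed.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_max(values, p=4):
    """Find max(values) by splitting the list into at most p partitions."""
    if not values:
        raise ValueError("parallel_max() of an empty list")
    size = -(-len(values) // p)  # ceiling division: elements per partition
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        # Each thread reads only its own chunk and returns a private result.
        partial_maxima = list(pool.map(max, chunks))
    # Merge step: the maximum of the per-partition maxima.
    return max(partial_maxima)

print(parallel_max([3, 17, 4, 42, 8, 15, 16, 23]))
```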
Applying partition and merge to circle clusters
As a side note: Observe that your question is essentially a graph theory question such that: Each circle is a node; where if any 2 circles intersect, there's an undirected edge between them; and you're trying to determine the connected components of the graph.
Obviously this provides an area that you can research for more ideas/information. But more importantly it makes it easier to analyse the problem with a simple boolean assessment of whether 2 circles intersect.
Also note the potential performance improvements by first pre-processing your circles into a suitable graph structure.
Assume you have 8 circles (A-H) where 1's in the table below indicate the 2 circles intersect.
One partitioning idea involves determining what's connected by only considering a subset of circles and all their immediate connections.
   A B C D E F G H
A  1 1 0 0 0 1 1 0   p1 [AB]
C  0 0 1 0 0 0 0 0   p2 [CD]
E  0 0 0 0 1 1 1 0   p3 [EF]
G  1 0 0 0 1 0 1 0   p4 [GH]
NB Even though threads are sharing data (e.g. 2 threads may consider the intersection between circles A and F concurrently), the share is read-only and doesn't require a lock.
Assume 4 partitions (and 4 threads) of [AB][CD][EF][GH]. Connected components per partition would be broken down as follows:
You now have a list of potentially overlapping connected components. Merging involves iterating the list to find overlaps. If two components overlap, their union is a new connected component. This will finally produce: ABFGDHE and C
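A small Python sketch of the merge step, using the per-circle results listed in this answer (Connected(A) = ABFG, Connected(D) = DFH, and so on) as input. The function names are made up, and the simple pairwise overlap check is only meant to show the idea; it is not tuned for large inputs.

```python
from concurrent.futures import ThreadPoolExecutor

def merge_components(components):
    """Union any components that share a circle; returns disjoint sets."""
    merged = []
    for comp in components:
        comp = set(comp)
        untouched = []
        for other in merged:
            if comp & other:
                comp |= other          # overlap found: absorb the other set
            else:
                untouched.append(other)
        untouched.append(comp)
        merged = untouched
    return merged

def partition_and_merge(partitions):
    """Merge within each partition on its own thread, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_partition = list(pool.map(merge_components, partitions))
    flat = [comp for part in per_partition for comp in part]
    return merge_components(flat)

# Partitions [AB] [CD] [EF] [GH], each holding its circles' immediate connections.
partitions = [["ABFG", "B"], ["C", "DFH"], ["EFG", "F"], ["G", "H"]]
result = sorted("".join(sorted(c)) for c in partition_and_merge(partitions))
print(result)  # ['ABDEFGH', 'C']
```

Each thread builds its own list of sets, so the only shared data is the read-only input.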
Some optimisation techniques to consider:
The bottom left of the matrix mirrors the top-right. So you should be able to avoid duplicating processing of the inverse connections.
The merging of partitions can itself be partitioned and merged.
In fact in the extreme case you could start out partitioning a single circle per partition.
Connected(A) = ABFG
Connected(B) = B
Connected(AB) = ABFG
Connected(C) = C
Connected(D) = DFH
Connected(CD) = C,DFH
Connected(ABCD) = ABFGDH,C
Connected(E) = EFG
Connected(F) = F
Connected(EF) = EFG
Connected(G) = G
Connected(H) = H
Connected(GH) = G,H
Connected(EFGH) = EFG,H
Very NB You need to ensure appropriate selection of data structures and algorithms or suffer extremely poor performance. E.g. a naive intersection implementation might require O(n^2) operations to determine if two intermediate connected components intersect and totally destroy the goal that led to all this additional complexity.
One approach is to divide the image into blocks, run the algorithm for each block independently, on different threads (i.e. considering only the circles whose center is in that block), and afterwards join the groups from different blocks that have intersecting circles.
Another approach is to formulate the problem using a graph, where the nodes represent circles, and an edge exists between two nodes if the corresponding circles are intersecting. We need to find the connected components of this graph. This disregards the geometric aspects of the problem, however, there are general algorithms which may be useful (e.g. you could consider the last slides from this link).

The best way to search millions of fuzzy hashes

I have the spamsum composite hashes for about ten million files in a database table and I would like to find the files that are reasonably similar to each other. Spamsum hashes are composed of two CTPH hashes of maximum 64 bytes and they look like this:
384:w2mhnFnJF47jDnunEk3SlbJJ+SGfOypAYJwsn3gdqymefD4kkAGxqCfOTPi0ND:wemfOGxqCfOTPi0ND
They can be broken down into three sections (split the string on the colons):
Block size: 384 in the hash above
First signature: w2mhnFnJF47jDnunEk3SlbJJ+SGfOypAYJwsn3gdqymefD4kkAGxqCfOTPi0ND
Second signature: wemfOGxqCfOTPi0ND
Block size refers to the block size for the first signature, and the block size for the second signature is twice that of the first signature (here: 384 x 2 = 768). Each file has one of these composite hashes, which means each file has two signatures with different block sizes.
The spamsum signatures can be compared only if their block sizes correspond. That is to say that the composite hash above can be compared to any other composite hash that contains a signature with a block size of 384 or 768. The similarity of signature strings for hashes with the same block size can be taken as a measure of similarity between the files represented by the hashes.
So if we have:
file1.blk2 = 768
file1.sig2 = wemfOGxqCfOTPi0ND
file2.blk1 = 768
file2.sig1 = LsmfOGxqCfOTPi0ND
We can get a sense of the degree of similarity of the two files by calculating some weighted edit distance (like Levenshtein distance) for the two signatures. Here the two files seem to be pretty similar.
leven_dist(file1.sig2, file2.sig1) = 2
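For reference, the distance above can be reproduced with a plain (unweighted) Levenshtein implementation. This is a generic textbook sketch in Python, not spamsum's own weighted scoring; the optional max_dist cutoff is an addition for illustration and returns max_dist + 1 as soon as the limit can no longer be met.

```python
def levenshtein(a, b, max_dist=None):
    """Unweighted edit distance between strings a and b.

    If max_dist is given and the distance must exceed it,
    max_dist + 1 is returned early instead of the exact value.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        # The row minimum never decreases, so it bounds the final distance.
        if max_dist is not None and min(cur) > max_dist:
            return max_dist + 1
        prev = cur
    return prev[-1]

print(levenshtein("wemfOGxqCfOTPi0ND", "LsmfOGxqCfOTPi0ND"))  # 2
```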
One can also calculate a normalized similarity score between two hashes (see the details here).
I would like to find any two files that are more than 70% similar based on these hashes, and I have a strong preference for using the available software packages (or APIs/SDKs), although I am not afraid of coding my way through the problem.
I have tried breaking the hashes down and indexing them using Lucene (4.7.0), but the search seems to be slow and tedious. Here is an example of the Lucene queries I have tried (for each single signature -- twice per hash and using the case-sensitive KeywordAnalyzer):
(blk1:768 AND sig1:wemfOGxqCfOTPi0ND~0.7) OR (blk2:768 AND sig2:wemfOGxqCfOTPi0ND~0.7)
It seems that Lucene's incredibly fast Levenshtein automaton does not accept edit distance limits above 2 (I need it to support up to 0.7 x 64 ≃ 19) and that its normal edit distance algorithm is not optimized for long search terms (the brute-force method used does not cut off the calculation once the distance limit is reached). That said, it may be that my query is not optimized for what I want to do, so don't hesitate to correct me on that.
I am wondering whether I can accomplish what I need using any of the algorithms offered by Lucene, instead of directly calculating the editing distance. I have heard that BK-trees are the best way to index for such searches, but I don't know of the available implementations of the algorithm (Does Lucene use those at all?). I have also heard that a probable solution is to narrow down the search list using n-gram methods but I am not sure how that compares to editing distance calculation in terms of inclusiveness and speed (I am pretty sure Lucene supports that one). And by the way, is there a way to have Lucene run a term search in the parallel mode?
Given that I am using Lucene only to pre-match the hashes and that I calculate the real similarity score using the appropriate algorithm later, I just need a method that is at least as inclusive as Levenshtein distance used in similarity score calculation -- that is, I don't want the pre-matching method to exclude hashes that would be flagged as matches by the scoring algorithm.
Any help/theory/reference/code or clue to start with is appreciated.
This is not a definitive answer to the question, but I have tried a number of methods since asking it. I am assuming the hashes are saved in a database, but the suggestions remain valid for in-memory data structures as well.
Save all signatures (2 per hash) along with their corresponding block sizes in a separate child table. Since only signatures with the same block size can be compared with each other, you can filter the table by block size before starting to compare the signatures.
Reduce all the repetitive sequences of more than three characters to three characters ('bbbbb' -> 'bbb'). Spamsum's comparison algorithm does this automatically.
Spamsum uses a rolling window of 7 to compare signatures, and won't compare any two signatures that do not have a 7-character overlap after eliminating excessive repetitions. If you are using a database that supports lists/arrays as fields, create a field with a list of all possible 7-character sequences extracted from each signature. Then create the fastest exact-match index you have access to on this field. Before trying to find the distance between two signatures, first try to do exact matches over this field (any seven-gram in common?).
The last step I am experimenting with is to save signatures and their seven-grams as the two modes of a bipartite graph, projecting the graph into single mode (composed of hashes only), and then calculating Levenshtein distance only on adjacent nodes with similar block sizes.
The above steps do a good job of pre-matching and substantially reduce the number of signatures each signature has to be compared with. It is only after these steps that the modified Levenshtein/Damerau distance has to be calculated.
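The pre-matching steps above can be sketched in memory with Python. The function names and the (file_id, block_size, signature) record layout are assumptions for illustration; a database version would replace the dictionary with an indexed child table.

```python
import re
from collections import defaultdict
from itertools import combinations

def squeeze(signature):
    """Collapse runs of more than three identical characters to three."""
    return re.sub(r"(.)\1{3,}", r"\1\1\1", signature)

def seven_grams(signature, n=7):
    """All n-character substrings of the squeezed signature."""
    s = squeeze(signature)
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def candidate_pairs(records):
    """Pairs of file ids sharing a seven-gram at the same block size.

    records: iterable of (file_id, block_size, signature), with two
    records per file (one per signature).
    """
    index = defaultdict(set)  # exact-match index on (block size, seven-gram)
    for file_id, block_size, signature in records:
        for gram in seven_grams(signature):
            index[(block_size, gram)].add(file_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

records = [
    ("file1", 768, "wemfOGxqCfOTPi0ND"),
    ("file2", 768, "LsmfOGxqCfOTPi0ND"),
    ("file3", 768, "zzzzzzzzzzzzqqqq"),
    ("file4", 384, "wemfOGxqCfOTPi0ND"),
]
print(candidate_pairs(records))
```

Only the pairs returned here would then go on to the real edit-distance scoring.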

Additional PlanningEntity in CloudBalancing - bounded-space situation

I successfully amended the nice CloudBalancing example to include the fact that I may only have a limited number of computers open at any given time (thanks, OptaPlanner team - easy to do). I believe this is referred to as a bounded-space problem. It works dandy.
The processes come in groupwise, say 20 processes in a given order per group. I would like to amend the example to have optaplanner also change the order of these groups (not the processes within one group). I have therefore added a class ProcessGroup in the domain with a member List<Process>, the instances of ProcessGroup being stored in a List<ProcessGroup>. The desired optimisation would shuffle the members of this List, causing the instances of ProcessGroup to be placed at different indices of the List List<ProcessGroup>. The index of ProcessGroup should be ProcessGroup.index.
The documentation states that "if in doubt, the planning entity is the many side of the many-to-one relationship." This would mean that ProcessGroup is the planning entity, the member index being a planning variable, getting assigned to (hopefully) different integers. After every new assignment of indices, I would have to re-sort the list List<ProcessGroup> in ascending order of ProcessGroup.index. This seems very odd and cumbersome. Any better ideas?
Thank you in advance!
The current design has a few disadvantages:
It requires 2 (genuine) entity classes (each with 1 planning variable): probably increases search space (= longer to solve, more difficult to find a good or even feasible solution) + it increases configuration complexity. Don't use multiple genuine entity classes if you can avoid it reasonably.
That Integer variable of ProcessGroup needs to be all different and somehow sequential. That smells like a chained planning variable (see the docs about chained variables and the Vehicle Routing example), in which case the entire problem could be represented as a simple VRP with just 1 variable, but does that really apply here?
Train of thought: there's something off in this model:
ProcessGroup has an Integer variable: What does that Integer represent? Shouldn't that Integer variable be on Process instead? Are you ordering Processes or ProcessGroups? If it should be on Process instead, then both of Process's variables can be replaced by a chained variable (like VRP), which will be far more efficient.
ProcessGroup has a list of Processes, but that is a problem property, which means it doesn't change during planning. I suspect that's correct for your use case, but do assert it.
If none of the reasoning above applies (which would surprise me), then the original model might be valid nonetheless :)

VBA: Performance of multidimensional List, Array, Collection or Dictionary

I'm currently writing code to combine two worksheets containing different versions of data.
First I want to sort both by a key column, combine them, and subsequently mark the changes between the versions in the output worksheet.
As the data already amounts to several tens of thousands of rows and might some day exceed Excel's rows-per-worksheet limit, I want these calculations to run outside of a worksheet. It should also perform better.
Currently I'm thinking of a Quicksort of first and second data and then comparing the data sets per key/line. Using the result of the comparison to subsequently format the cells accordingly.
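The sort-then-compare idea can be sketched independently of VBA. The Python below is only an illustration of the algorithm: it assumes each row is a (key, values) pair with unique keys per version, and diff_by_key and the status labels are made-up names.

```python
def diff_by_key(old_rows, new_rows):
    """Compare two versions row by row; returns a list of (key, status)."""
    old_sorted = sorted(old_rows, key=lambda row: row[0])
    new_sorted = sorted(new_rows, key=lambda row: row[0])
    i = j = 0
    result = []
    # Walk both sorted lists in step, like the merge phase of a merge sort.
    while i < len(old_sorted) and j < len(new_sorted):
        old_key, old_values = old_sorted[i]
        new_key, new_values = new_sorted[j]
        if old_key == new_key:
            status = "unchanged" if old_values == new_values else "changed"
            result.append((old_key, status))
            i += 1
            j += 1
        elif old_key < new_key:
            result.append((old_key, "removed"))
            i += 1
        else:
            result.append((new_key, "added"))
            j += 1
    result.extend((key, "removed") for key, _ in old_sorted[i:])
    result.extend((key, "added") for key, _ in new_sorted[j:])
    return result

old = [(3, ["c"]), (1, ["a"]), (2, ["b"])]
new = [(2, ["B"]), (4, ["d"]), (1, ["a"])]
print(diff_by_key(old, new))
```

The status list is what would drive the cell formatting afterwards.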
I'd just love to know, whether I should use:
List OR Array OR Collection OR Dictionary
OF Lists OR Arrays OR Collections OR Dictionaries
So far I have been unable to determine the differences in codability and performance between these 16 possibilities. Currently I'm implementing an Array OF Arrays approach, constantly wondering whether this makes sense at all.
Thanks in advance, appreciate your input and wisdom!
Some time ago, I had the same problem with a client's macro. In addition to the really big number of rows (over 50000 and growing), it became tremendously slow from a certain row number (around 5000) when a "standard approach" was taken, that is, when the inputs for the calculations on each row were read from the same worksheet (a couple of rows above). This reading and writing was what made the process slower and slower (apparently, Excel starts from row 1, and the further down the row is, the longer it takes to reach it).
I improved this situation by relying on two different solutions: firstly, setting a maximum number of rows per worksheet, once reached, a new worksheet was created and the reading/writing continued there (from the first rows). The other change was moving the reading/writing in Excel to reading from temporary .txt files and writing to Excel (all the lines were read right at the start to populate the files). These two modifications improved the speed a lot (from half an hour to a couple of minutes).
Regarding your question, I wouldn't rely too much on arrays with a macro (although I am not sure how much information each of these 10000 lines contains), but I guess that this is a personal decision. I don't like collections too much because they are less efficient than arrays, and the same goes for dictionaries.
I hope that this "short" comment will be of any help.