How to use bitmask operations in SQL

I'm already quite familiar with the concept of bitmasks, and also with the bitwise operators used to modify them. However, I have one specific problem that I can't solve. Here's the problem:
I have a large number of relatively large bitmasks (around 10,000,000 of them, each 256 bits long). Generating an SQL index that lets me look up a specific one in log(n) time is simple enough. What I need, however, is to match a given 256-bit query against the entire dataset and find the N (variable) items that are "least different" from the query, meaning the number of mismatched bits should be minimal. For example, if the database contains {0110, 1101, 0000, 1110}, then the closest match to 0100 is either 0110 or 0000 (each differs in a single bit).
Given the number of entries, a linear search would be very inefficient, and that, I believe, is what would happen if I used aggregate operators. I'm looking for a way to speed the search up, but so far I haven't found one. Any ideas would be highly appreciated.
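For reference, the per-pair quantity being minimized is the Hamming distance, i.e. the popcount of the XOR of the two masks. A minimal Python sketch of that cost (the naive linear scan is shown only to illustrate what the question is trying to avoid):

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two masks differ."""
    return (a ^ b).bit_count()  # Python 3.10+; use bin(a ^ b).count("1") on older versions

def nearest(query: int, masks, n: int):
    """Naive O(len(masks)) scan -- the cost the question wants to beat."""
    return sorted(masks, key=lambda m: hamming(query, m))[:n]

print(nearest(0b0100, [0b0110, 0b1101, 0b0000, 0b1110], 2))  # [6, 0], i.e. 0110 and 0000
```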

Related

Optimising table assignment to guests for an event based on criteria

66 guests at an event, 8 tables. Each table has a "theme". We want to optimize various criteria: e.g., even number of men/women at the table, people get to discuss the topic they selected, etc.
I formulated this as a gradient-free optimisation problem: I wrote a function that calculates the goodness of an arrangement (i.e., a cost for the imbalance of men and women, a cost for a non-preferred theme, etc.), and I randomly perturb the arrangement by swapping people between tables, keeping the "best so far" arrangement. This seems to work, but it cannot guarantee optimality.
I am wondering if there is a more principled way to go about this. There (intuitively) seems to be no useful gradient in the operation of "swapping" people between tables, so random search is the best I came up with. However, brute-forcing by evaluating all possibilities seems infeasible: with 66 people there are factorial(66) possible orders, a ridiculously large number (about 10^92 according to Python). Since swapping two people at the same table yields the same arrangement, the number of distinct arrangements is actually smaller and can be obtained by dividing out the repeats, i.e. fact(66) / (fact(number of people at table 1) * fact(number of people at table 2) * ...), which in my problem still comes out to about 10^53 possible arrangements, far too many to consider.
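For what it's worth, that count is easy to check numerically; the table sizes below are a hypothetical split of 66 guests over 8 tables, so the exact magnitude depends on the real sizes:

```python
from math import factorial, prod

table_sizes = [9, 9, 8, 8, 8, 8, 8, 8]   # hypothetical split of 66 guests over 8 tables
assert sum(table_sizes) == 66

total_orders = factorial(66)              # ~5.4e92 orderings of the guest list
distinct = total_orders // prod(factorial(s) for s in table_sizes)
print(f"{distinct:.2e}")                  # ~1e54 distinct arrangements for this split
```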
But is there something better that I can do than random search? I thought about evolutionary search but I don't know if it would provide any advantages.
Currently I am swapping a random number of people on each evaluation and keeping it only if it gives a better value. The random number of people is selected from an exponential distribution to make it more probable to swap 1 person than 6, for example, to make small steps on average but to keep the possibility of "jumping" a bit further in the search.
I don't know how to prove it but I have a feeling this is an NP-hard problem; if that's the case, how could it be reformulated for a standard solver?
Update: I have been comparing random search with a random "greedy search" and a "simulated annealing"-inspired approach in which swaps are kept with a probability that depends on the measured improvement and that anneals over time. So far the greedy search surprisingly outperforms the probabilistic approach by a clear margin. Adding the annealing schedule seems to help.
What I am confused by is exactly how to think about the "space" of the domain. I realize that it is a discrete space and that distances are best described in terms of Levenshtein edit distance, but I can't see how to "map" it onto some gradient-friendly continuous space. Possibly, if I dropped the exact number of people per table and made it continuous, but strongly penalized deviations from the number I want at each table, the association matrix would become more "flexible" and might map better to a gradient space? Not sure. A seating assignment could then be a probability spread over more than one table.
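For reference, a minimal sketch of the annealing variant described in the update; the cost function and the swap move are placeholders for whatever you already have, and the geometric cooling schedule is just one common choice:

```python
import math
import random

def anneal(assignment, cost, swap, steps=100_000, t0=1.0, t_end=1e-3):
    """Simulated annealing over table assignments.

    assignment -- any arrangement (e.g. a list mapping guest -> table)
    cost       -- function scoring an arrangement (lower is better)
    swap       -- function returning a perturbed *copy* of an arrangement
    """
    best = current = assignment
    best_cost = current_cost = cost(current)
    for step in range(steps):
        t = t0 * (t_end / t0) ** (step / steps)   # geometric cooling schedule
        candidate = swap(current)
        c = cost(candidate)
        # always accept improvements; accept worse moves with probability exp(-delta/t)
        if c < current_cost or random.random() < math.exp(-(c - current_cost) / t):
            current, current_cost = candidate, c
            if c < best_cost:
                best, best_cost = candidate, c
    return best, best_cost
```

The variable-size move from the question would live inside `swap`, e.g. drawing the number of people to exchange from an exponential/geometric distribution so that small steps dominate but larger jumps remain possible.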

Match a large number of records in a database?

I need to match about 25 million address records against about 200,000 other address records. I would also like to allow a small degree of fuzziness, so comparing for exact matches is pretty much out. The addresses are parsed into components in both data sets, and both sets are stored in a SQL Server 2008 database.
I had an idea to do comparisons in batches (grouping the batches by state) until I reached the end, dumping matches into a temporary database. This would be done in a .NET application, but I don't think this is too efficient since I have to pull the data from SQL into the application and iterate over it one by one. Threading could speed up the process, but I don't know by how much.
I also thought about indexing the 25 million records into a Lucene index and using its filtering to narrow down potential matches.
Are either of these a good approach? What other options are there?
For a first pass, do an exact match.
For fuzzy matching you can use Levenshtein distance.
Levenshtein Distance TSQL
You can also run Levenshtein in .NET (see the sketch below for the general shape of the comparison).
It might make sense to bring the 200,000 records into a .NET collection
and then compare the 25 million one at a time against the 200,000.
I assume the .NET implementation is faster, I just don't know how much faster.
C# Levenshtein
MSSQL has SOUNDEX but it is way too fuzzy.
Hopefully you have a valid state to filter on.
Hopefully you also have a valid zip code.
If the zip code is valid, then filter to only that zip code.
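A rough sketch of that approach, written in Python rather than .NET just to show the shape: block on a parsed component such as the zip code, then run Levenshtein only within each block. The record layout and the distance threshold are made up:

```python
from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match(big, small, max_dist=2):
    """big / small: iterables of (zip_code, street) tuples -- a made-up layout."""
    by_zip = defaultdict(list)
    for rec in small:                         # hold the 200k set in memory, keyed by zip
        by_zip[rec[0]].append(rec)
    for rec in big:                           # stream the 25M set, compare only within a zip
        for cand in by_zip.get(rec[0], ()):
            if levenshtein(rec[1], cand[1]) <= max_dist:
                yield rec, cand
```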

Postgresql -- All else equal, is querying for (small) integer or float values faster than querying for (small) string values?

I'm about to retroactively mark maybe 100,000 records with category-indicating string or integer values, and there are more to come. The categories reflect a scalar continuum of category types, running essentially from "looser" to "tighter". I was thinking about using string values instead of integers, so that if I come back to this one day I will still know which value means what.
So that's the reasoning for using strings: readability.
But I'll be relying on these columns pretty significantly, selecting swaths of records based on this criterion.
Obviously, whichever type I pick, I'm going to put an index on the column; but even with an index, I'm not sure how much faster querying on integers is than querying on strings. I've noticed how quick boolean columns are, and from that I'd assume small integers can be queried more quickly than strings.
I've been pondering this trade off for some time now so thought I'd fire off a question. Thanks
If it's really a string representing some ordered level between "looser" and "tighter", consider using an enum:
http://www.postgresql.org/docs/current/static/datatype-enum.html
That way, you'll get the best of both worlds.
One tiny note, though: ideally, make sure you nail down all possible values in advance. Changing an enum is of course possible, but once the order of its numeric representation (each value's OID, a 32-bit integer) no longer matches its declared order, Postgres has to perform an extra internal lookup and sort on a separate 32-bit float field. (The performance difference is minor, but one to keep in mind should your data ever grow to billions of rows. And, again, it only applies when you alter the order of an existing enum.)
Regarding the second part of your question: in my own admittedly limited testing from a few years back, sorting small integers (16-bit) was a bit slower than sorting normal integers (32-bit), presumably because they are manipulated as 32-bit integers anyway. And sorting or querying integers, as in the case of enums, is faster than sorting arbitrary strings. Ergo, use enums if you don't need the flexibility of adding arbitrary values down the road: they'll give you the best of both worlds.
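A minimal sketch of the enum route, run here through psycopg2; the type name, labels, and connection string are placeholders. The point is simply that the labels stay readable while comparisons and the index follow the declared order:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn, conn.cursor() as cur:
    # declare the scale once, ideally in its final order
    cur.execute("""
        CREATE TYPE tightness AS ENUM ('loosest', 'loose', 'tight', 'tightest');
        CREATE TABLE records (
            id        bigserial PRIMARY KEY,
            tightness tightness NOT NULL
        );
        CREATE INDEX ON records (tightness);
    """)
    # comparisons use the enum's declared order, not alphabetical order
    cur.execute("SELECT count(*) FROM records WHERE tightness >= 'tight'")
    print(cur.fetchone()[0])
```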

Best practices for schemas with many boolean variables

I'm creating a Postgresql database where we have many (10-40) variables that will have boolean values. I'd like to figure out what the best way to store this data is, given moderate numbers of updates and lots of multi-column searches.
It seems pretty straightforward to just create the 30 or so boolean columns and create multi-column indexes where necessary. Alternatively, someone suggested creating a bit string that combines all of the booleans. It seems like the second option should be faster, but the answers other people have given online seem to be contradictory (see below).
Any suggestions or explanations would be helpful. The data is tens of millions of rows, but not larger, and I expect selects to return somewhere between 1/100 to 1/4 of the data.
https://stackoverflow.com/questions/14067969/optimized-sql-using-bitwise-operator
alternative to bitmap index in postgresql
UPDATE:
I found a related discussion at the Database Administrators site that suggests using int or bigint once you have more than a few variables (below that, use separate columns) but fewer than 33 or so (above that, switch to bit strings). This seems to be motivated more by storage size than by ease of searching.
https://dba.stackexchange.com/questions/25073/should-i-use-the-postgresql-bit-string
First, I would define/analyze what is "best" in your context. Are you just looking for speed? What is your search pattern? Is data/disk volume an issue?
What alternatives do you have? Besides bit strings, you could use ordinary text strings, integer arrays, or separate columns. To get at the data quickly, you have to think about indexing. You mentioned multi-column indexes. Would it make sense to store/index the same bit variable in several indices?
40 bits without too many duplicate records means up to 2^40 ≈ 1.1E12 records. This makes a full-table scan a lengthy affair. On the other hand, indexing is not really helpful if you have a lot of duplicate keys.
If you are expecting a result set of some 25%, you would have to transfer about 2.7E11 (partial) records between database and application. Assuming 10,000 records/s, this would take roughly 7,700 hours, or about 10 months.
My conclusion is that you should think about storing the data in big BLOBs (1.1E12 records x 40 bits comes to roughly 5.5 TByte). You could partition your data, read the interesting part into memory, and do the search there. This is more or less what a big-data or data-warehouse system does.
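As a rough illustration of that partition-and-scan-in-memory idea, assuming (hypothetically) that each record's 40 flags are packed into the low bits of a little-endian uint64 and a partition is dumped to a binary file:

```python
import numpy as np

# hypothetical layout: one uint64 per record, flags in the low 40 bits
partition = np.fromfile("partition_0001.bin", dtype="<u8")

MASK = (1 << 3) | (1 << 17) | (1 << 39)          # require flags 3, 17 and 39 to be set
hits = np.flatnonzero((partition & MASK) == MASK)

print(f"{hits.size} of {partition.size} records match")
```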

Lucene: Query at least

I'm trying to find out whether there is a way in Lucene to find all documents where there is at least one word that does not match a particular word.
E.g. I want to find all documents where there is at least one word besides "test". i.e. "test" may or may not be present but there should be at least one word other than "test". Is there a way to do this in Lucene?
thanks,
Purushotham
Lucene could do this, but this wouldn't be a good idea.
The performance of query execution is bound to two factors:
the time to intersect the query with the term dictionary,
the time to retrieve the docs for every matching term.
Performant queries are the ones which can be quickly intersected with the term dictionary, and match only a few terms so that the second step doesn't take too long. For example, in order to prohibit too complex boolean queries, Lucene limits the number of clauses to 1024 by default.
With a TermQuery, intersecting the term dictionary requires (by default) O(log(n)) operations (where n is the size of the term dictionary) in memory and then one random access on disk plus the streaming of at most 16 terms. Another example is this blog entry from Lucene committer Mike McCandless which describes how FuzzyQuery performance improved when a brute-force implementation of the first step was replaced by something more clever.
However, the query you are describing would require examining every single term of the term dictionary, only to dismiss the documents that appear in the "test" document set and nowhere else!
You should give more details about your use-case so that people can think about a more efficient solution to your problem.
If you need a query with a single negative condition, then use a BooleanQuery that combines a MatchAllDocsQuery with a TermQuery added as MUST_NOT. There is no way to additionally enforce the existential constraint ("must contain at least one term that is not excluded"); you'll have to check that separately once you retrieve Lucene's results (see the sketch below). Depending on the ratio of favorable results to all the results returned from Lucene, this kind of solution can range from perfectly fine to a performance disaster.
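A small sketch of that separate check, assuming you can recover each hit's tokens (e.g. from a stored field): keep only the documents that contain at least one token other than the excluded one:

```python
def has_other_term(tokens, excluded="test"):
    """True if the document contains at least one token that is not `excluded`."""
    return any(t != excluded for t in tokens)

# e.g. tokens pulled back from a stored field for each Lucene hit
docs = [["test"], ["test", "foo"], ["bar"], []]
print([d for d in docs if has_other_term(d)])   # [['test', 'foo'], ['bar']]
```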