Optimising table assignment to guests for an event based on criteria - optimization

66 guests at an event, 8 tables. Each table has a "theme". We want to optimize various criteria: e.g., an even split of men and women at each table, people getting to discuss the theme they selected, etc.
I formulated this as a gradient-free optimisation problem: I wrote a function that calculates the goodness of an arrangement (i.e., a cost for the men/women imbalance, a cost for a non-preferred theme, etc.), and I am basically randomly perturbing the arrangement by swapping people between tables and keeping the "best so far" arrangement. This seems to work, but cannot guarantee optimality.
I am wondering if there is a more principled way to go about this. There (intuitively) seems to be no useful gradient in the operation of "swapping" people between tables, so random search is the best I came up with. However, brute-forcing by evaluating all possibilities seems to be difficult: with 66 people there are factorial(66) possible orders, which is a ridiculously large number (about 10^92 according to Python). Since swapping two people at the same table gives the same arrangement, there are actually fewer distinct arrangements, which I think can be calculated by dividing out the repeats, i.e. fact(66)/(fact(number of people at table 1) * fact(number of people at table 2) * ...), which in my problem still comes out to about 10^53 possible arrangements, way too many to consider.
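For scale, a quick sanity check of that count (assuming, say, six tables of 8 and two tables of 9; my actual table sizes may differ slightly, so this only confirms the order of magnitude):
import math

table_sizes = [8] * 6 + [9] * 2            # assumed split of 66 guests over 8 tables
distinct = math.factorial(66)
for size in table_sizes:
    distinct //= math.factorial(size)       # divide out within-table orderings
print(f"{distinct:.3e}")                    # ~1e54 for this split, same ballpark as above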
But is there something better that I can do than random search? I thought about evolutionary search but I don't know if it would provide any advantages.
Currently I am swapping a random number of people on each evaluation and keeping the result only if it gives a better value. The number of people to swap is drawn from an exponential distribution, so that swapping 1 person is more probable than swapping 6, for example: small steps on average, while keeping the possibility of "jumping" a bit further in the search.
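Roughly, the loop looks like this (a minimal sketch; cost and the guest-to-table encoding stand in for my real ones):
import random

def random_swap_search(assignment, cost, n_iters=100_000, mean_extra_swaps=0.5, rng=random):
    # assignment: list mapping guest index -> table index; cost: lower is better
    best = list(assignment)
    best_cost = cost(best)
    for _ in range(n_iters):
        candidate = list(best)
        # one swap is most likely, but occasionally several (exponentially distributed)
        n_swaps = 1 + int(rng.expovariate(1.0 / mean_extra_swaps))
        for _ in range(n_swaps):
            i, j = rng.randrange(len(candidate)), rng.randrange(len(candidate))
            candidate[i], candidate[j] = candidate[j], candidate[i]
        new_cost = cost(candidate)
        if new_cost < best_cost:            # greedy: keep only strict improvements
            best, best_cost = candidate, new_cost
    return best, best_cost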
I don't know how to prove it but I have a feeling this is an NP-hard problem; if that's the case, how could it be reformulated for a standard solver?
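The closest I can see to a standard-solver formulation is a mixed-integer linear program over binary assignment variables, provided the costs can be written linearly. A hedged sketch with PuLP (the capacities, genders, and theme-preference matrix below are made-up placeholders, and the weighting of the two objectives is up to you):
import pulp

n_guests, n_tables = 66, 8
capacity = [9, 9] + [8] * 6                          # assumed table sizes
is_man = [g % 2 == 0 for g in range(n_guests)]       # placeholder genders
pref = [[1 if (g + t) % 3 == 0 else 0                # placeholder: 1 if guest g
         for t in range(n_tables)]                   # likes table t's theme
        for g in range(n_guests)]

prob = pulp.LpProblem("seating", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (range(n_guests), range(n_tables)), cat="Binary")
dev = pulp.LpVariable.dicts("dev", range(n_tables), lowBound=0)   # |men - women| per table

for g in range(n_guests):                            # every guest sits at exactly one table
    prob += pulp.lpSum(x[g][t] for t in range(n_tables)) == 1
for t in range(n_tables):                            # tables are filled to capacity
    prob += pulp.lpSum(x[g][t] for g in range(n_guests)) == capacity[t]
    men = pulp.lpSum(x[g][t] for g in range(n_guests) if is_man[g])
    women = pulp.lpSum(x[g][t] for g in range(n_guests) if not is_man[g])
    prob += dev[t] >= men - women                    # linearize the absolute deviation
    prob += dev[t] >= women - men

balance_cost = pulp.lpSum(dev[t] for t in range(n_tables))
theme_reward = pulp.lpSum(pref[g][t] * x[g][t] for g in range(n_guests) for t in range(n_tables))
prob += balance_cost - theme_reward                  # objective: balance minus satisfied preferences
prob.solve()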
Update: I have been comparing random search with a random "greedy search" and a "simulated annealing"-inspired approach where the probability of keeping a swap depends on the measured improvement and anneals over time. So far the greedy search surprisingly outperforms the probabilistic approach by a clear margin. Adding the annealing schedule does seem to help.
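The annealing-inspired acceptance rule is essentially the standard Metropolis criterion; a sketch (the geometric cooling factor is an arbitrary choice):
import math
import random

def accept(delta, temperature, rng=random):
    # delta = new_cost - old_cost; always keep improvements,
    # keep a worsening move with probability exp(-delta / T)
    return delta <= 0 or rng.random() < math.exp(-delta / temperature)

# a common schedule is geometric cooling, e.g. temperature *= 0.999 each iteration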
What I am confused by is exactly how to think about the "space" of the domain. I realize that it is a discrete space, and that distances are best described in terms of Levenshtein edit distance, but I can't see how I could "map" it to some gradient-friendly continuous space. Possibly, if I drop the exact number of people per table and make it continuous, but strongly penalize it so it is pushed towards the number that I want at each table, this would make the association matrix more "flexible" and possibly map better to a gradient space? Not sure. A seating assignment could be a probability spread over more than one table.
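To make that last idea concrete (purely a sketch; the weights are arbitrary): give each guest a row of logits, let a softmax turn them into a soft assignment matrix, and write the costs as differentiable functions of that matrix, with the table sizes enforced by a penalty instead of a hard constraint. Any gradient-based optimizer (autodiff, or even finite differences) could then be applied before rounding back to a hard assignment.
import numpy as np

def soft_cost(logits, is_man, capacity, balance_w=1.0, size_w=10.0):
    # logits: (n_guests, n_tables); is_man: per-guest booleans; capacity: target table sizes
    is_man = np.asarray(is_man, dtype=bool)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                # soft assignment, each row sums to 1
    men = p[is_man].sum(axis=0)                      # expected number of men per table
    women = p[~is_man].sum(axis=0)                   # expected number of women per table
    size = p.sum(axis=0)                             # expected occupancy per table
    return (balance_w * np.abs(men - women).sum()
            + size_w * np.square(size - np.asarray(capacity)).sum())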

Related

Fuzzy Matching in Different Tables with No Cross Join (Snowflake)

There are two tables A and B.
They both contain titles referencing the same thing, but the naming conventions are always different and cannot be predicted.
The only way to match titles is to find low difference scores on a number of columns, but for now only the title is important.
There are only about 10,000 records in each table currently. Using the standard Cross Join and EditDistance combination works fine for now, but I've already noticed performance decreasing as the number of records grows.
Is there a more performant way of achieving the desired result of finding partial matches between strings in different tables?
I apologize if there is an obvious answer. The few posts that deviate from the EditDistance solution still assume cross joining: https://community.snowflake.com/s/question/0D50Z00008zPLLxSAO/join-with-partial-string-match
You should use a blocking key strategy to help cut down on the number of pairs generated. This document explains this strategy and other techniques for Fuzzy Matching on Snowflake. https://drive.google.com/file/d/1FuxZnXojx71t-1kNOaqg1ErrEiiATdsM/view?usp=sharing
As per Ryan's point, the way to avoid comparing all values is to prune which values get joined.
In other domains (spatial), we found that quantizing the GPS coordinates and then joining against the 8 surrounding buckets, while it made for more comparisons between things a human could see were near each other, eliminated all the comparisons for things that are clearly very far away.
Like most expensive computation, you want to prune as much as you can without missing things you want to include. Which is to say, false positives are fine, but false negatives are very bad.
So how you batch/bucket/prune your data is very specific to your application's data.
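To show the shape of the blocking-key idea outside of SQL: group both sides by a cheap key (here, the first 3 characters of the normalized title, an arbitrary choice) and only run the expensive similarity comparison within matching blocks. The threshold and similarity measure are placeholders, and a key this crude will miss pairs whose titles differ in their first characters, which is exactly the false-negative risk mentioned above.
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(title):
    return "".join(title.lower().split())[:3]      # cheap blocking key

def fuzzy_match(titles_a, titles_b, threshold=0.8):
    blocks = defaultdict(list)
    for b in titles_b:
        blocks[block_key(b)].append(b)
    matches = []
    for a in titles_a:
        for b in blocks.get(block_key(a), []):     # only compare within the same block
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                matches.append((a, b))
    return matches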

How to use bitmask operations in SQL

I'm already quite familiar with the concept of bitmasks, and also with the bitwise operators used to modify them. However, I have one specific problem that I can't solve. Here's the problem:
I have a large number of relatively large bitmasks (something around 10,000,000 bitmasks, each 256 bits long). Generating an SQL index that will allow me to search for a specific one in log(n) time is simple enough. However, what I need to do is to match a given 256-bit query against the entire dataset and find the N (variable) data items that are "least different" from the given query, least different meaning that the number of non-matching bits should be minimal. For example, if the database contains {0110, 1101, 0000, 1110}, then the closest match to 0100 is either 0110 or 0000.
Given the number of entries, a linear search would be very inefficient, which is, I believe, what would happen if I were to use aggregate operators. I'm looking for a way to improve the search, but have found no way to do it as of now. Any ideas would be highly appreciated.
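For clarity, this is the brute-force baseline I want to beat, expressed in application code (XOR plus popcount, then take the N smallest); the function names here are just illustrative:
import heapq

def hamming(a: int, b: int) -> int:
    # number of differing bits between two bitmasks held as Python ints
    return (a ^ b).bit_count()          # Python 3.10+; otherwise bin(a ^ b).count("1")

def n_closest(query: int, masks, n: int):
    return heapq.nsmallest(n, masks, key=lambda m: hamming(query, m))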

Performance Impact of turning Columns into Rows

I'm planning to use JavaDB (Derby) or PostgreSQL.
I have the following problem: I need to store a large set of vectors. Currently all vectors contain a fixed number of elements. Hence the appropriate way of storing the set is using one row per vector and a column per element. However, the number of elements might change over time. Additionally, in my case, from a software engineering perspective, having a fixed number of columns reflects knowledge about a software component which the general model should be unaware of.
Therefore I'm thinking about "linearizing" the layout and using a general table that stores elements instead of vectors.
The first element of the vector 5 could then be queried like this:
SELECT value FROM elements WHERE v_id = 5 AND e_id = 1;
In general, I do not need full table reads, and only a relatively small subset of the vectors is accessed.
Maybe database-savvy people can judge what the performance impact will be?
Many thanks in advance.
This is a variant of what's referred to in general database terms as Entity-Attribute-Value or EAV design. It's a bit of a relational database design anti-pattern and should be avoided in most cases. Performance tends to be poor due to the need for many self-joins, and queries are ugly at best.
In PostgreSQL, look into the intarray extension; it should solve your problem pretty much ideally if the values are simple integers. Otherwise, consider PostgreSQL's standard array types. They've got their own issues, but they're generally a lot better than EAV, though not lovely to work with from JDBC.
Otherwise, if all you're storing is these vectors, maybe consider a non-relational DB.
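A minimal sketch of the array-column alternative, shown here with psycopg2 for brevity (the table and column names are made up; the same shape applies from JDBC). psycopg2 adapts Python lists to PostgreSQL arrays:
import psycopg2

# one row per vector, elements in a single array column (vs. one row per element)
conn = psycopg2.connect("dbname=test")
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS vectors (v_id int PRIMARY KEY, elems float8[])")
    cur.execute("INSERT INTO vectors (v_id, elems) VALUES (%s, %s)", (5, [1.0, 2.5, 3.0]))
    # element access is still a single indexed lookup (PostgreSQL arrays are 1-based)
    cur.execute("SELECT elems[1] FROM vectors WHERE v_id = %s", (5,))
    print(cur.fetchone()[0])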

Determining search results quality in Lucene

I have been researching score normalization in Lucene for a few days (I now know this can't be done), using the mailing list, wiki, blog posts, etc. I'm going to describe my problem because I'm not sure that score normalization is what our project needs.
Background:
In our project, we are using Solr on top of Lucene with custom RequestHandlers and SearchComponents. For a given query, we need to detect when it got poor results, in order to trigger different actions.
Assumptions:
Immutable index (once indexed, it is not updated) and same query typology (dismax qparser with the same field boosting, without boost functions or boost queries).
Problem:
We know that score normalization is not implementable. But is there any way to determine (given the TF/IDF and field-boost assumptions) when the match quality of the search results is poor?
Example: we've got one index with science papers and another with medical care centres' info. When a user queries the first index and gets poor results (inferring this from the scores?), we want to query the second index and merge the results using some threshold (a score threshold?).
Thanks in advance
You're right that normalization of scores across different queries doesn't make sense, because nearly all similarity measures are based on term frequency, which is of course local to a query.
However, I think it is viable to compare the scores in the very special case you are describing, provided you override the default similarity to use an IDF calculated jointly over both indexes. For instance, you could achieve this easily by keeping all the documents in one index and adding an extra (hidden to users) 'type' field. Then you could compare the absolute values returned by these queries.
Generally, it could be possible to determine low quality results by looking at some features, like for example very small number of results, or some odd distributions of scores, but I don't think it actually solves your problem. It looks more similar to the issue of merging of isolated search results, which is discussed for example in this paper.
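To illustrate the kind of feature-based heuristic meant above, a sketch (every threshold is a placeholder to be tuned against queries you have labelled as good or poor):
def looks_poor(scores, min_hits=3, min_top_score=0.0, min_drop_ratio=1.2):
    # scores: result scores for one query, sorted in descending order
    if len(scores) < min_hits:
        return True                        # very few hits
    if scores[0] <= min_top_score:
        return True                        # even the best hit scored poorly
    # a flat score curve (no hit clearly better than the rest) is often a bad sign
    tail = scores[min(len(scores), 10) - 1]
    return scores[0] / max(tail, 1e-9) < min_drop_ratio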

How does OLAP address dimensions that are numeric ranges?

To preface this, I'm not familiar with OLAP at all, so if the terminology is off, feel free to offer corrections.
I'm reading about OLAP and it seems to be all about trading space for speed, wherein you precalculate (or calculate on demand) and store aggregations about your data, keyed off by certain dimensions. I understand how this works for dimensions that have a discrete set of values, like { Male, Female } or { Jan, Feb, ... Dec } or { #US_STATES }. But what about dimensions that have completely arbitrary values like (0, 1.25, 3.14156, 70000.23, ...)?
Does the use of OLAP preclude the use of aggregations in queries that hit the fact tables, or is it merely used to bypass things that can be precalculated? Like, arbitrary aggregations on arbitrary values still need to be done on the fly?
Any other help regarding learning more about OLAP would be much appreciated. At first glance, both Google and SO seem to be a little dry (compared to other, more popular topics).
Edit: Was asked for a dimension on which there are arbitrary values.
VELOCITY of experiments: 1.256 m/s, -2.234 m/s, 33.78 m/s
VALUE of transactions: $120.56, $22.47, $9.47
Your velocity and value column examples are usually not the sort of columns that you would query in an OLAP way - they are the values you're trying to retrieve, and would presumably be in the result set, either as individual rows or aggregated.
However, I said usually. In our OLAP schema, we have a good example of the kind of column you're thinking of: event_time (a date-time field, with granularity to the second). In our data it is nearly unique - no two events happen during the same second - but since we have years of data in our table, that still means there are hundreds of millions of potentially distinct values, and when we run our OLAP queries, we almost always want to constrain based on time ranges.
The solution is to do what David Raznick has said - you create a "bucketed" version of the value. So, in our table, in addition to the event_time column, we have an event_time_bucketed column - which is merely the date of the event, with the time part being 00:00:00. This reduces the count of distinct values from hundreds of millions to a few thousand. Then, in all queries that constrain on date, we constrain on both the bucketed and the real column (since the bucketed column will not be accurate enough to give us the real value), e.g.:
WHERE event_time BETWEEN '2009.02.03 18:32:41' AND '2009.03.01 18:32:41'
AND event_time_bucketed BETWEEN '2009.02.03' AND '2009.03.01'
In these cases, the end user never sees the event_time_bucketed column - it's just there for query optimization.
For floating point values like you mention, the bucketing strategy may require a bit more thought, since you want to choose a method that will result in a relatively even distribution of the values and that preserves contiguity. For example, if you have a classic bell distribution (with tails that could be very long) you'd want to define the range where the bulk of the population lives (say, 1 or 2 standard deviations from mean), divide it into uniform buckets, and create two more buckets for "everything smaller" and "everything bigger".
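A sketch of that bucketing recipe for a roughly bell-shaped measure (the 2-sigma cutoff and the bucket count are assumptions to tune for your data):
import numpy as np

def bucket_edges(values, n_buckets=20, n_sigma=2.0):
    # uniform buckets across mean +/- n_sigma standard deviations,
    # plus two catch-all buckets for the tails
    mean, std = np.mean(values), np.std(values)
    inner = np.linspace(mean - n_sigma * std, mean + n_sigma * std, n_buckets + 1)
    return np.concatenate(([-np.inf], inner, [np.inf]))

def bucket_of(value, edges):
    # index of the bucket a value falls into (0 = "everything smaller" tail)
    return int(np.searchsorted(edges, value, side="right")) - 1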
I have found this link to be handy http://www.ssas-info.com/
Check out the webcasts section, where they walk you through different aspects, starting from what BI and warehousing are, through to designing a cube, dimensions, calculations, aggregations, KPIs, perspectives, etc.
In OLAP, aggregations help reduce query response time by providing pre-calculated values that queries can use. However, the flip side is an increase in storage space, as more space is needed to store the aggregations on top of the base data.
SQL Server Analysis Services has Usage Based Optimization Wizard which helps in aggregation design by analyzing queries that have been submitted by clients (reporting clients like SQL Server Reporting Services, Excel or any other) and refining the aggregation design accordingly.
I hope this helps.
cheers