Greedy Vertex Cover on Bipartite Graphs - sql

I have the following problem. All suggestions appreciated.
Given a set of questions Q and a set of subjects S, select questions from Q such that every subject in S is covered. A question might cover multiple subjects.
I built a greedy algorithm on top of SQL Server: I pick the first question associated with subject s1, remove all the other subjects it covers, and continue with the next uncovered subject, checking whether a solution exists.
Here's the catch: now I want to improve the algorithm and create several question sets that still cover all the subjects but are no more than 50% similar to each other.
Any advice on the formal name of this problem? Any algorithms? Should this problem be solved for all sets together, or can it be solved one set at a time?
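For what it's worth, the procedure described above is the greedy heuristic for the set cover problem (questions are the sets, subjects the elements to cover). A minimal Python sketch of it, with made-up data:

questions = {                       # made-up data: question -> subjects it covers
    'q1': {'s1', 's2'},
    'q2': {'s2', 's3'},
    'q3': {'s3', 's4'},
}

def greedy_cover(questions, subjects):
    uncovered = set(subjects)
    chosen = []
    while uncovered:
        s = next(iter(uncovered))   # current uncovered subject
        # first question associated with this subject, as in the description above
        q = next((q for q in questions if s in questions[q]), None)
        if q is None:
            return None             # no question covers subject s: no solution
        chosen.append(q)
        uncovered -= questions[q]   # remove all the other subjects it covers
    return chosen

print(greedy_cover(questions, {'s1', 's2', 's3', 's4'}))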

Related

What is the general name of this vector sum problem?

Given the USDA nutrient database: n vectors where each dimension is a particular nutrient,
find a set S of foods F whose vectors sum to at least the RDA and stay below any toxic value, in every dimension. Add various other constraints to the model, e.g., calories, mass.
Solve for any combination of vectors that meet the constraints.
Currently available websites allow one to choose foods one at a time and build a "recipe". I'm looking for a computational solution. I suspect that this is a trivial problem that someone has already solved. I am looking for the search terms that describe this sort of scenario.
"Deep learning" looks for patterns, but the goal "pattern" is an input. Probability is not involved, so a sizable chunk of ML is not applicable. I intuit that some sort of tree-traversal might be useful.
This is a combination of set theory and vector math. I expect that there exists a large solution set of sets.
I can set up the input vectors as parameterized SQL queries. I have downloaded the USDA nutrient database and loaded it into MariaDB.
pseudocode:
-- aggregates go in HAVING, not WHERE; totals are per nutrient across the chosen foods
SELECT n.nut_id
FROM subset_nutrients n
JOIN rda_nutrs r ON n.nut_id = r.nut_id
JOIN toxicity t ON n.nut_id = t.nut_id
GROUP BY n.nut_id, r.rda_scalar, t.toxicity_scalar
HAVING SUM(n.nut_scalar) >= r.rda_scalar
   AND SUM(n.nut_scalar) < t.toxicity_scalar
SQL might actually solve the problem all by itself?
I am looking for human-suggested search terms to find original sources of information. Thank you for your suggestions.
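For what it's worth, this model is essentially the classic "diet problem" of linear programming (a useful search term), and an off-the-shelf LP solver handles it directly. A minimal SciPy sketch with made-up nutrient data (rows of A are foods, columns are nutrients; the strict < is relaxed to <= because LP solvers use non-strict inequalities):

import numpy as np
from scipy.optimize import linprog

A = np.array([[0.10, 2.0],          # made-up nutrients per unit of each food
              [0.40, 0.5],
              [0.05, 3.0]])
rda   = np.array([0.50, 4.0])       # required minimum totals per nutrient
toxic = np.array([5.00, 9.0])       # totals must stay below these
cost  = np.ones(3)                  # objective: minimize total mass (or calories, price, ...)

# linprog accepts only <= constraints: A.T x <= toxic and -(A.T) x <= -rda
res = linprog(c=cost,
              A_ub=np.vstack([A.T, -A.T]),
              b_ub=np.concatenate([toxic, -rda]),
              bounds=[(0, None)] * 3)
print(res.x)                        # amounts of each food, if a feasible mix exists

Note this yields continuous amounts of each food; requiring whole servings, or a set of distinct foods, turns it into an integer program.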

Machine Learning text comparison model

I am creating a machine learning model that essentially scores how well one text matches another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:
Words occurring in many documents (say, 90% of your sentences/documents contain the conjunction 'and') receive much less emphasis, essentially giving more weight to the document-specific phrasing (this is the IDF part).
Ordering does not matter in Term Frequency (TF), as opposed to methods using sliding windows etc.
It is very lightweight compared to representation-oriented methods like the one mentioned above.
Big drawback: your data, depending on the size of the corpus, may have too many dimensions (as many dimensions as there are unique words); you could use stemming/lemmatization to mitigate this problem to some degree.
You can calculate the similarity between two TF-IDF vectors using cosine similarity, for example.
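A minimal sketch of this pipeline with scikit-learn, assuming that library is available:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["the cat and a dog", "a dog and the cat"]

# stop_words drops low-information words like "a"/"the" outright;
# across a real corpus, IDF also down-weights words that appear everywhere.
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(texts)
print(cosine_similarity(tfidf[0], tfidf[1]))  # ~1.0: same words, order ignored

With only two sentences the IDF part is degenerate; in practice you would fit the vectorizer on your whole corpus.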
EDIT: Whoops, this question is 8 months old; sorry for the bump. Maybe it will be of use to someone else though.

Limiting chosen variables solved for in opensolver [closed]

I've got a linear program with 17 equations and 506 variables that minimizes the sum of the variables. This works fine so far, but the solution is a combination of 19 variables.
In the end, though, I want to limit the number of chosen variables to 10, without knowing in advance which ones are the optimal ones (the solver figures that out for me, as well as their ratio).
I figured I could set a boolean to 1 if a variable's value becomes larger than 0 (meaning the variable is picked), and to 0 if the variable is not picked for the optimal solution,
and then require the sum of the booleans to be at most 10.
However this seems a bit elaborate, and I was wondering whether there is a built-in option in OpenSolver, for I think it is quite a common problem to solve a large set with a subset.
So does anyone have a suggestion on:
Whether my elaborate way drastically decreases performance? (I have no intrinsic comprehension of the OpenSolver algorithms yet.)
A way, more easily or within the OpenSolver options, to account for my desire of at most 10 solution variables?
Based on the information provided below, I first scaled down the size of the problem:
I have three lists of data with 18 entries in columns:
W7:W23, AC7:AD23
which manually (with W28 = 6000, AC28 = 600, W29 = 1, AC29 = 1), in a linear combination, equal or exceed the target list:
EGM34:EGM50
So what I did was put the decision variables in W28:W29 and AC28:AD29,
where I added the constraint W28, AC28:AD28 = integer in the solver (both the original Excel solver and OpenSolver),
and I added the constraint W29, AC29:AD29 = boolean in the solver (both the original Excel solver and OpenSolver).
Then I have a multiplication of integer * boolean = the actual multiplication factor for the lists above (W7:W23 etc.).
In order to limit the number of chosen variables I have also tried, in addition to the described constraints, to limit the cell with =SUM(W29,AC29:AD29) to <= 10 (effectively keeping the number of booleans set to true below 11, or so I thought, but the booleans aren't evaluated as booleans by the solver).
These new multiplied lists are placed in W34:W50 and AC34:AD50, and the summation is situated in EGY34:EGY50. Hence the final check is added as a constraint:
EGY34:EGY50 >= EGM34:EGM50
And I had a question about how the linear solver evaluates these constraints. Does it:
a. think the sum of EGY34:EGY50 must be larger than or equal to the sum of EGM34:EGM50,
or
b. think: "for every row x, EGYx must be larger than or equal to EGMx"?
So far I've observed b., but I would like to make sure.
But my main question concerns:
After using the evolutionary algorithm, as was kindly suggested in the comments below: how/why does it try values such as 0.99994 for the decision variables designated as booleans?
The introduction of binary variables is indeed the standard way to implement such constraints. Unfortunately, it transforms the problem from a linear programming problem into an integer programming problem (specifically, a mixed integer linear programming problem).

A standard approach to such problems is the branch and bound algorithm. This is what Excel's built-in solver seems to use; I'm not sure about the OpenSolver that you are using. In the best case (where there is a lot of bounding) it will run fairly rapidly, even with problems of your size. In the worst case, for your problem, it could be little better than what you would get by running the simplex algorithm C(506,10) = 2.8 x 10^20 times (once for each possible set of 10 decision variables). In other words, it might be infeasible. Integer programming is known to be NP-hard.
If an exact solution is infeasible, you could always use a heuristic algorithm such as an evolutionary approach.
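For readers who want to see the binary-variable formulation spelled out, here is a minimal sketch in Python with the PuLP library (made-up data; x are the original variables, y the indicators, M an upper bound on any x):

import pulp

N, K, M = 6, 2, 1e4                 # made-up size, cardinality limit, big-M bound
prob = pulp.LpProblem("limited_support", pulp.LpMinimize)
x = [pulp.LpVariable(f"x{i}", lowBound=0) for i in range(N)]
y = [pulp.LpVariable(f"y{i}", cat="Binary") for i in range(N)]

prob += pulp.lpSum(x)               # objective: minimize the total of the variables
prob += pulp.lpSum(x) >= 10         # stand-in for the 17 real constraints
for i in range(N):
    prob += x[i] <= M * y[i]        # x_i may be nonzero only if y_i = 1
prob += pulp.lpSum(y) <= K          # at most K variables are picked

prob.solve()                        # branch and bound via the bundled CBC solver
print([pulp.value(v) for v in x])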

Text questions with multiple choices randomisation

I have a text file that contains over 11 thousand multiple-choice and matching questions. The questions have different sizes, besides having different numbers of given choices. The following block is a sample matching question with five given choices taken from that text file:
Type: MT
1) Can you match each of these cities to their location? Drag the cities on the right to match them with the locations on the left.
~ Correct. You got all these matches correct.
# Incorrect. You got some of these wrong.
a. North = Turin
b. Center = Rome
c. South = Naples
d. Sicily = Palermo
e. Sardinia = Cagliari
Before processing this file with an HTML-generating engine, I need to shuffle all those questions, i.e. randomly change the position of each question, as a block, in the file, so the final product will be thoroughly unpredictable. Each question's number (as mentioned under Type:) is insignificant.
I found some Word VBA code at this link, but it would need lots of expert alterations to accommodate the varying sizes of the questions.
Expert assistance in this matter is deeply appreciated. Thanks in advance.
First, I agree with Tim Williams in the comments above that this is not exactly the level of specificity that is expected in a StackOverflow posting.
That said, if I were you, I would break this question down into two components.
First - figure out if there is a text string that can be used to identify the blocks that constitute the "question." For example, if each question starts with "Type:", then you can find the first instance of this in the file, then find the second, and everything between them constitutes a "question". Then, you can place that question in an array.
Second - randomize the array. There are probably a ton of ways to do this. One might be to use a randbetween function between 0 and the length of the array of questions twice, and switch the questions at those two random positions. Then, repeat that a number of times relative to the total number of items in the array (for example, if you have 100 questions, perform the "switch" 125 times) to sufficiently randomize the output. Then print the array back to the original file.
For the approach above, you need some delimiter in your file (I assumed the delimiter was "Type:") to break the questions above. If a delimiter like this doesn't exist, you may need some more complicated logic.
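As a concrete illustration of the two steps, here is a minimal Python sketch (it assumes each question block starts with a line beginning "Type:", and uses a hypothetical file name). Note that random.shuffle already implements the standard Fisher-Yates shuffle, so the repeated-swap bookkeeping isn't needed:

import random

with open("questions.txt", encoding="utf-8") as f:
    lines = f.readlines()

# Step 1: group lines into blocks, each starting at a "Type:" line.
blocks, current = [], []
for line in lines:
    if line.startswith("Type:") and current:
        blocks.append(current)
        current = []
    current.append(line)
if current:
    blocks.append(current)

# Step 2: randomize the blocks and write them back out.
random.shuffle(blocks)
with open("questions_shuffled.txt", "w", encoding="utf-8") as f:
    for block in blocks:
        f.writelines(block)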

Building a ranking with a genetic algorithm

Question after a big edit:
I need to build a ranking using a genetic algorithm. I have data like this:
P(a>b)=0.9
P(b>c)=0.7
P(c>d)=0.8
P(b>d)=0.3
Now, let's interpret a, b, c, d as the names of football teams, and P(x>y) as the probability that x beats y. We want to build a ranking of the teams, but we lack some observations: P(a>d) and P(a>c) are missing due to a lack of matches between a vs d and a vs c.
The goal is to find the ordering of the team names which best describes the current situation in that four-team league.
If we have only 4 teams then the solution is straightforward: first we compute the probabilities of all 4! = 24 orderings of the four teams while ignoring missing values; we have:
P(abcd)=P(a>b)P(b>c)P(c>d)P(b>d)
P(abdc)=P(a>b)P(b>c)(1-P(c>d))P(b>d)
...
P(dcba)=(1-P(a>b))(1-P(b>c))(1-P(c>d))(1-P(b>d))
and we choose the ranking with the highest probability. I don't want to use any other fitness function.
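For concreteness, the brute force for small n is a few lines of Python (a sketch using the numbers above; missing pairs simply contribute no factor):

from itertools import permutations

P = {('a','b'): 0.9, ('b','c'): 0.7, ('c','d'): 0.8, ('b','d'): 0.3}

def likelihood(order):
    # product of P(x>y), oriented by whether x precedes y in the ordering
    pos = {team: i for i, team in enumerate(order)}
    L = 1.0
    for (x, y), p in P.items():
        L *= p if pos[x] < pos[y] else 1.0 - p
    return L

best = max(permutations('abcd'), key=likelihood)
print(''.join(best), likelihood(best))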
My question:
As the number of permutations of n elements is n!, calculating the probabilities of all orderings is impossible for large n (my n is about 40). I want to use a genetic algorithm for this problem.
The mutation operator is a simple switch of the places of two (or more) elements of the ranking.
But how do I make a crossover of two orderings?
Could P(abcd) be interpreted as the cost function of the path 'abcd' in an asymmetric TSP, where the cost of travelling from x to y is different from the cost of travelling from y to x, with P(x>y) = 1 - P(y>x)? There are so many crossover operators for the TSP, but I think I have to design my own crossover operator, because my problem is slightly different from TSP. Do you have any ideas for a solution, or a frame for conceptual analysis?
The easiest way, on both the conceptual and the implementation level, is to use a crossover operator which exchanges suborderings between two solutions:
CrossOver(ABcD, AcDB) = AcBD
For a random subset of elements (in this case a, b, d, shown in capital letters) we copy the subordering (the sequence of elements a, b, d as it appears in the first ordering) and paste it into the second ordering; a sketch follows.
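A minimal Python sketch of this operator (checked against the ABcD/AcDB example above):

import random

def subordering_crossover(p1, p2):
    # copy the relative order of a random subset of elements from p1 into p2
    subset = set(random.sample(p1, k=max(1, len(p1) // 2)))
    order = iter([x for x in p1 if x in subset])
    return [next(order) if x in subset else x for x in p2]

# With subset {'a','b','d'}, crossing ['a','b','c','d'] with ['a','c','d','b']
# yields ['a','c','b','d'], i.e. CrossOver(ABcD, AcDB) = AcBD.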
Edit: an asymmetric TSP can be turned into a symmetric TSP, but with forbidden suborderings, which makes the GA approach unsuitable.
It's definitely an interesting problem, and it seems most of the answers and comments have focused on the semantic aspects of the problem (i.e., the meaning of the fitness function, etc.).
I'll chip in some information about the syntactic elements -- how do you do crossover and/or mutation in ways that make sense. Obviously, as you noted with the parallel to the TSP, you have a permutation problem. So if you want to use a GA, the natural representation of candidate solutions is simply an ordered list of your points, careful to avoid repetition -- that is, a permutation.
TSP is one such permutation problem, and there are a number of crossover operators (e.g., Edge Assembly Crossover) that you can take from TSP algorithms and use directly. However, I think you'll have problems with that approach. Basically, the problem is this: in TSP, the important quality of solutions is adjacency. That is, abcd has the same fitness as cdab, because it's the same tour, just starting and ending at a different city. In your example, absolute position is much more important than this notion of relative position. abcd means in a sense that a is the best point -- it's important that it came first in the list.
The key thing you have to do to get an effective crossover operator is to account for what the properties are in the parents that make them good, and try to extract and combine exactly those properties. Nick Radcliffe called this "respectful recombination" (note that paper is quite old, and the theory is now understood a bit differently, but the principle is sound). Taking a TSP-designed operator and applying it to your problem will end up producing offspring that try to conserve irrelevant information from the parents.
You ideally need an operator that attempts to preserve absolute position in the string. The best one I know of offhand is known as Cycle Crossover (CX). I'm missing a good reference off the top of my head, but I can point you to some code where I implemented it as part of my graduate work. The basic idea of CX is fairly complicated to describe, and much easier to see in action. Take the following two points:
abcdefgh
cfhgedba
Pick a starting point in parent 1 at random. For simplicity, I'll just start at position 0 with the "a".
Now drop straight down into parent 2, and observe the value there (in this case, "c").
Now search for "c" in parent 1. We find it at position 2.
Now drop straight down again, and observe the "h" in parent 2, position 2.
Again, search for this "h" in parent 1, found at position 7.
Drop straight down and observe the "a" in parent 2.
At this point note that if we search for "a" in parent one, we reach a position where we've already been. Continuing past that will just cycle. In fact, we call the sequence of positions we visited (0, 2, 7) a "cycle". Note that we can simply exchange the values at these positions between the parents as a group and both parents will retain the permutation property, because we have the same three values at each position in the cycle for both parents, just in different orders.
Make the swap of the positions included in the cycle.
Note that this is only one cycle. You then repeat this process starting from a new (unvisited) position each time until all positions have been included in a cycle. After the one iteration described in the above steps, you get the following strings (where an "X" denotes a position in the cycle where the values were swapped between the parents):
cbhdefga
afcgedbh
X X    X (the swapped positions: 0, 2, and 7)
Just keep finding and swapping cycles until you're done.
The code I linked from my github account is going to be tightly bound to my own metaheuristics framework, but I think it's a reasonably easy task to pull the basic algorithm out from the code and adapt it for your own system.
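In the same spirit, here is a compact, framework-free sketch of CX in Python. It alternates which cycles get swapped, since swapping every cycle would just exchange the two parents wholesale:

def cycle_crossover(p1, p2):
    # CX for two permutations of the same elements: find position cycles
    # and swap the values of alternating cycles between the parents.
    n = len(p1)
    c1, c2 = list(p1), list(p2)
    visited = [False] * n
    swap = True                        # swap the first cycle, as in the example
    for start in range(n):
        if visited[start]:
            continue
        cycle, pos = [], start
        while not visited[pos]:        # trace one cycle of positions
            visited[pos] = True
            cycle.append(pos)
            pos = p1.index(p2[pos])    # "drop down, then search in parent 1"
        if swap:
            for i in cycle:
                c1[i], c2[i] = c2[i], c1[i]
        swap = not swap
    return c1, c2

print(cycle_crossover(list("abcdefgh"), list("cfhgedba")))  # the example above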
Note that you can potentially gain quite a lot from doing something more customized to your particular domain. I think something like CX will make a better black box algorithm than something based on a TSP operator, but black boxes are usually a last resort. Other people's suggestions might lead you to a better overall algorithm.
I've worked on a somewhat similar ranking problem and followed a technique similar to what I describe below. Does this work for you:
Assume the unknown value of an object diverges from your estimate via some distribution, say the normal distribution. Interpret a ranking statement such as "a > b, 0.9" as the statement "the value of a lies at the 90th percentile of the distribution centered on b".
For every statement:
def realArrival = calculate a's location on the distribution centered on b
def arrivalGap = | realArrival - expectedArrival |
Then, over all statements: def fitness = Σ arrivalGap
The fitness function is MIN(fitness).
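A sketch of that fitness in Python, assuming a normal distribution with an arbitrary fixed spread (the per-team values are what the optimizer would evolve):

from scipy.stats import norm

P = {('a','b'): 0.9, ('b','c'): 0.7, ('c','d'): 0.8, ('b','d'): 0.3}
SIGMA = 1.0                            # assumed spread of each value distribution

def fitness(values):
    # sum of |percentile implied by the values - observed percentile|; lower is better
    return sum(abs(norm.cdf(values[x], loc=values[y], scale=SIGMA) - p)
               for (x, y), p in P.items())

print(fitness({'a': 1.3, 'b': 0.0, 'c': -0.5, 'd': -1.0}))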
FWIW, my problem was actually a bin-packing problem, where the equivalent of your "rank" statements were user-provided rankings (1, 2, 3, etc.). So not quite TSP, but NP-Hard. OTOH, bin-packing has a pseudo-polynomial solution proportional to accepted error, which is what I eventually used. I'm not quite sure that would work with your probabilistic ranking statements.
What an interesting problem! If I understand it, what you're really asking is:
"Given a weighted, directed graph, with each edge-weight in the graph representing the probability that the arc is drawn in the correct direction, return the complete sequence of nodes with maximum probability of being a topological sort of the graph."
So if your graph has N edges, there are 2^N graphs of varying likelihood, with some orderings appearing in more than one graph.
I don't know if this will help (very brief Google searches did not enlighten me, but maybe you'll have more success with more perseverance) but my thoughts are that looking for "topological sort" in conjunction with any of "probabilistic", "random", "noise," or "error" (because the edge weights can be considered as a reliability factor) might be helpful.
I strongly question your assertion, in your example, that P(a>c) is not needed, though. You know your application space best, but it seems to me that specifying P(a>c) = 0.99 will give a different fitness for f(abc) than specifying P(a>c) = 0.01.
You might want to throw in "Bayesian" as well, since you might be able to start to infer values for (in your example) P(a>c) given your conditions and hypothetical solutions. The problem is, "topological sort" plus "Bayesian" is going to give you a whole bunch of hits related to Markov chains and Markov decision processes, which may or may not be helpful.