Discrete optimisation: large number of optimal solutions

TL;DR version: Is there a way to cope with optimisation problems where a large number of optimal solutions exists (solutions that achieve the best objective value)? That is, finding an optimal solution is fairly quick (though obviously dependent on the size of the problem), but so many such solutions exist that the solver runs endlessly trying to find a better one (endlessly because it keeps finding other feasible solutions whose objective value equals the current best).
Not TL;DR version:
For a university project, I need to implement a scheduler that outputs the schedule for every university programme per year of study. I'm provided some data and, for the purpose of this question, will stick to a general but not so rare example.
In many sections, you have mandatory courses and optional courses. Sometimes those optional courses are divided into modules, and the student needs to choose one of them; often they have to select two modules, and some combinations arise more often than others. Clearly, if you count the number of courses (mandatory + optional) without taking the subdivision into modules into account, you end up with more courses than time slots to schedule them in. My model is quite simple. I have constraints stating that every course should be scheduled to one and only one time slot (a period of 2 hours) and that a professor should not give two courses at the same time. Those are hard constraints. In a perfect world, I should also add hard constraints stating that a student cannot have two courses at the same time. But because I don't have enough data and every combination of modules is possible, there is no point in creating one student per combination (mandatory + module 1 + module 2) and applying the hard constraints to each of these students, since it is basically identical to having one student (mandatory + all optionals) and trying to fit the hard constraints, which will fail.
This is why I decided to move those hard constraints into the objective. I simply define my objective function as minimising, for each student, the number of courses he/she takes that are scheduled simultaneously.
If I run this simple model with only one student (22 courses) and 20 time slots, I should get an objective value of 4 (since 2 time slots each contain 2 courses). But, using Gurobi, the relaxed objective is 0 (since you can have fractions of courses inside a time slot). Therefore, when the solver does reach a solution of cost 4, it cannot prove optimality directly. The real trouble is that, for this simple case, there exists a huge number of optimal solutions (22! maybe...). Therefore, to prove optimality, it will go through all the other solutions (which share the same objective), desperately trying to find a solution with a smaller gap between the relaxed objective (0) and the current one (4). Obviously, no such solution exists...
Do you have any idea how I could tackle this problem? I thought of analysing the existing database to figure out which combinations of modules are likely to happen, so that I can put the hard constraints back, but it seems hazardous (I might select a combination that leads to a conflict and therefore find no solution, or omit a valid combination). The current workaround I use is a time threshold to stop the optimisation...
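One workaround, sketched here rather than taken from the question: derive a combinatorial lower bound by hand and hand it to the solver as a stopping criterion, so it no longer has to close the gap all the way down to the LP relaxation. The sketch below assumes the objective counts, for each course, how many other courses of the same student run in its slot (which is consistent with the value 4 for 22 courses in 20 slots); the Gurobi parameter names mentioned in the comments are real, but their use here is only a suggestion.

```python
# Sketch of a hand-derived lower bound (assumes the objective is
# sum over slots of k*(k-1), where k is the number of courses in a slot).

def clash_lower_bound(n_courses: int, n_slots: int) -> int:
    """Best achievable clash count when n_courses <= 2 * n_slots:
    spreading evenly leaves (n_courses - n_slots) slots holding 2 courses,
    each contributing 2 clashing courses."""
    assert n_courses <= 2 * n_slots, "sketch only covers the <= 2 per slot case"
    return 2 * max(0, n_courses - n_slots)

print(clash_lower_bound(22, 20))  # 4, the value the solver reaches but cannot prove
# With Gurobi one could then set model.Params.BestObjStop = 4 (or use Cutoff
# or MIPGapAbs) so the search stops at the first incumbent hitting the bound.
```

The same bound also tells you when the time-threshold workaround has in fact already found an optimum.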

Related

OptaPlanner: Is the "constraint match" associated with a score just a semantical thing?

I have a question about the OptaPlanner constraint stream API. Are the constraint matches only used to calculate the total score, and meant to help the user see how the score comes about, or is this information also used to find a better solution?
With "used to find a better solution" I mean the information is used to get the next move(s) in the local search phase.
So does it matter which planning entity I penalize?
Currently, I am working on an examination scheduler. One requirement is to distribute the exams of a single student optimally.
The number of exams per student varies. Therefore, I wrote a cost function that gives a normalized value, indicating how well the student's exams are distributed.
Let's say the examination schedule in the picture has costs of 80. Now, I need to break down this value to the individual exams. There are two different ways to do this:
Option A: Penalize each of the exams with 10 (10*8 = 80).
Option B: Penalize each exam according to its actual impact. => Only the exams in the last week are penalized, as the distribution of exams in week one and week two is fine.
Obviously, option B is semantically correct. But does the choice of the option affect the solving process?
The constraint matches are there to help explain the score to humans. They do not, in any way, affect how the solver moves or what solution you are going to get. In fact, ScoreManager has the capability to calculate constraint matches after the solver has already finished, or for a solution that's never even been through the solver before.
(Note: constraint matching does affect performance, though. It slows everything down, due to all the object iteration and creation.)
To your second question: yes, it does matter which entity you penalize. In fact, you want to penalize every entity that breaks your constraints. Ideally, an entity should be penalized more if it breaks the constraints more than some other entity does; this way, you avoid score traps.
EDIT based on an edit to the question:
In this case, since you want to achieve fairness per student, I suggest your constraint does not penalize the exam, but rather the student. Per student, group your exams and apply some fairness ConstraintCollector. If you do it like that, you will be able to create a per-student fairness function and use its value as your penalty.
The OptaPlanner Tennis example shows one way of doing fairness. You may also be interested in a larger fairness discussion on the OptaPlanner blog.
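To illustrate what such a per-student fairness function could look like, here is a plain-Python sketch (the function name, the gap threshold, and the day numbers are all made up; in OptaPlanner this value would be computed per student group and fed to the penalty):

```python
# Hypothetical per-student fairness penalty: for each pair of consecutive
# exams, charge the shortfall below a desired minimum gap in days.

def spread_penalty(exam_days: list[int], min_gap: int = 7) -> int:
    """Sum of shortfalls below min_gap between consecutive exam days."""
    days = sorted(exam_days)
    return sum(max(0, min_gap - (b - a)) for a, b in zip(days, days[1:]))

print(spread_penalty([1, 8, 15]))  # 0: evenly spread, one week apart
print(spread_penalty([1, 8, 10]))  # 5: last two exams only 2 days apart
```

A function shaped like this matches option B from the question: only the badly placed exams contribute to the value.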

Prioritising scores in the VRP solution with OptaPlanner

I am using OptaPlanner to solve my VRP problem. I have several constraint providers, for example one to enforce the capacities and another to enforce the TW regarding the arrival time, both HARD. At the end of the optimisation it returns a route with a negative score, and when I analyse the ConstraintMatch I find that it is a product of a vehicle capacity constraint. However, I consider that in my problem there is no point in the vehicle arriving on time (meeting the TW constraint) if it will not be able to satisfy the customer's demands. That's why I require that the constraints I have defined for the capacities (Weight and Volume) have more weight/priority than the Time Window constraint.
Question: How can I configure the solver or what should I consider to apply all the hard constraints, but make some like the capacity ones have more weight than others?
Always grateful for your suggestions and help
I am by no means an expert on OptaPlanner, but every constraint penalty (or reward) is divided into two parts if you use penalizeConfigurable(...) instead of penalize(...). Each constraint score is then evaluated as the ConstraintWeight that you declare in a ConstraintConfiguration class, multiplied by the MatchWeight, which is how you implement the deviation from the desired result. For example, the number of failed stops might be squared, turning the penalty into a quadratic one instead of a linear one.
ConstraintWeights can be reconfigured between solutions to tweak the importance of a penalty, and setting one to zero disables it completely. MatchWeight is, in my view, an implementation detail that you tweak during development. At least that is how I see it.
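A small plain-Python illustration of why squaring the match weight matters (the numbers are made up; in OptaPlanner the squaring would happen inside the constraint stream before calling penalizeConfigurable):

```python
def linear_penalty(failed_stops: int, weight: int = 10) -> int:
    return weight * failed_stops

def squared_penalty(failed_stops: int, weight: int = 10) -> int:
    # Squaring punishes concentrating failures on one vehicle.
    return weight * failed_stops ** 2

# Two vehicles failing 1 stop each cost the same as one vehicle failing
# 2 stops under the linear scheme, but not under the squared one.
print(linear_penalty(1) + linear_penalty(1), linear_penalty(2))    # 20 20
print(squared_penalty(1) + squared_penalty(1), squared_penalty(2)) # 20 40
```

This is the mechanism that lets you keep two constraints both HARD while still making one violation dominate the other at equal counts.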

Getting the optimal number of employees for a month (rostering)

Is it possible to get the optimal number of employees in a month for a given number of shifts?
I'll explain myself a little further taking the nurse rostering as an example.
Imagine that we don't know the number of nurses to plan in a given month with a fixed number of shifts. Also, imagine that each time you insert a new nurse into the plan it decreases your score, and that each nurse has a limited number of normal hours and a limited number of extra hours. Extra hours decrease the score more than normal ones.
So, the problem consists of getting the optimal number of nurses needed and their planning. I've come up with two possible solutions:
Fix the number of nurses clearly above the number needed and treat the problem as an overconstrained one, so some nurses will not be assigned to any shifts.
Launch multiple instances of the same problem in parallel, with an incremental number of nurses for each instance. This solution has the problem that you have to estimate, beforehand, an approximate range of nurses below and above the number needed.
Both solutions are a little inefficient. Is there a better approach to tackle this problem?
I call option 2 doing simulations. Typically in simulations, they don't just play with the number of employees, but also the #ConstraintWeights etc. It's useful for strategic "what if" decisions (What if we ... hire more people? ... focus more on service quality? ... focus more on financial gain? ...)
If you really just need to minimize the number of employees, and you can clearly weigh that against all the other hard and soft constraints (probably as a weight in between both, similar to overconstrained planning), then option 1 is good enough, and less CPU costly.
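If you do go with option 2, a rough lower bound derived from total demand can narrow the range of instances to launch. A back-of-the-envelope sketch with made-up numbers:

```python
import math

def min_nurses(total_shift_hours: int, normal_hours: int, extra_hours: int) -> int:
    """Smallest headcount that could cover the demand even if every nurse
    worked all normal hours plus all (more expensive) extra hours."""
    return math.ceil(total_shift_hours / (normal_hours + extra_hours))

# e.g. 620 hours of shifts in the month, 160 normal + 20 extra hours per nurse:
print(min_nurses(620, 160, 20))  # 4, so start the parallel instances at 4 nurses
```

No instance below this bound can be feasible, so only the upper end of the range still needs guessing.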

Approximation to Large Linear Program

I have a simple LP with linear constraints. There are many decision variables, roughly 24 million. I have been using lpSolve in R to play with small samples, but this solver isn't scaling well. Are there ways to get an approximate solution to the LP?
Edit:
The problem is a scheduling problem. There are 1 million people who need to be scheduled into one of 24 hours, hence 24 million decision variables. There is a reward $R_{ij}$ for scheduling person $i$ into hour $j$. The constraint is that each person needs to be scheduled into some hour, but each hour only has a finite number of appointment slots $c_j$.
One good way to approach LPs/IPs with a massive number of variables and constraints is to look for ways to group the decision variables in some logical way. Since you have only given a sketch of your problem, here are a couple of solution ideas.
Approach 1: Group people into smaller batches
Instead of 1M people, think of them as 100 units of 10K people each. So now you only have 2400 (24 x 100) variables. This will get you part of the way there, and note that this won't be the optimal solution, but a good approximation. You can of course make 1000 batches of 1000 people and get a more fine-grained solution. You get the idea.
Approach 2: Grouping into cohorts, based on the Costs
Take a look at your R_ij's. Presumably you don't have a million different costs. There will typically be only a few unique cost values. The idea is to group many people with the same cost structure into one 'cohort'. Now you solve a much smaller problem - which cohorts go into which hour.
Again, once you get the idea you can make it very tractable.
Update Based on OP's comment:
By its very nature, making these groups is an approximation technique. There is no guarantee that the optimal solution will be obtained. However, the whole idea of careful grouping (by looking at cohorts with identical or very similar cost structures) is to get solutions as close to the optimal as possible, with far less computational effort.
I should have also added that when scaling down (grouping is just one way to scale down the problem size), the other constants should also be scaled. That is, $c_j$ should also be in the same units (10K).
If persons A,B,C cannot be fit into time slot j, then the model will squeeze in as many of those as possible in the lowest cost time slot, and move the others to other slots where the cost is slightly higher, but they can be accommodated.
Hope that helps you going in the right direction.
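A minimal sketch of Approach 2 with hypothetical data: collapse people whose reward rows are identical into cohorts before building the LP, so the variable count depends on the number of distinct cost structures rather than the number of people.

```python
from collections import Counter

# Each person's reward row over the hours (3 hours instead of 24 for brevity);
# people with identical rows form one cohort.
people_rewards = [
    (5, 3, 1), (5, 3, 1), (2, 2, 2),
    (5, 3, 1), (2, 2, 2),
]

cohorts = Counter(people_rewards)  # reward row -> number of people sharing it
for row, size in cohorts.items():
    print(row, size)

# The LP then only needs x[cohort, hour] = how many of the cohort's members
# go into that hour, with sum over hours == cohort size, and the hour
# capacities c_j unchanged.
print(len(cohorts))  # 2 cohorts instead of 5 individuals
```

With 1M people but only a few hundred distinct reward rows, the 24M-variable LP shrinks to a few thousand variables.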
Assuming you have a lot of duplicate people, you are now using way too many variables.
Suppose you only have 1000 different kinds of people, and that some of these occur 2000 times whilst others occur 500 times.
Then you just have to optimize the fraction of each kind of people that you allocate to each hour. (Note that you do have to adjust the objective function and constraints a bit, using the 2000 or 500 counts as multipliers.)
The good news is that this should give you the optimal solution with just a 'few' variables; depending on your problem, though, you will probably need to round the results to get whole people as an outcome.

Is it preferred to use end-time or duration for events in sql? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
My gut tells me that start time and end time would be better than start time and duration in general, but I'm wondering if there are some concrete advantages or disadvantages to the differing methods.
The advantage of start time and end time that I am seeing is that if you want to find all events active during a certain time period, you don't have to look outside that period.
(this is for events that are not likely to change much after initial input and are tied to a specific time, if that makes a difference)
I do not see it as a preference or a personal choice. Computer Science is, well, a science, and we are programming machinery, not a sensitive child.
Re-inventing the Wheel
Entire books have been written on the subject of Temporal Data in Relational Databases, by giants of the industry. Codd has passed on, but his colleague and co-author C J Date, and more recently H Darwen, carry on the work of progressing and refining the Relational Model, in The Third Manifesto. The seminal book on the subject is Temporal Data & the Relational Model by C J Date, Hugh Darwen, and Nikos A Lorentzos.
There are many who post opinions and personal choices re CS subjects as if they were choosing ice cream. This is due to not having had any formal training, and thus treating their CS task as if they were the only person on the planet who had come across that problem, and found a solution. Basically they re-invent the wheel from scratch, as if there were no other wheels in existence. A lot of time and effort can be saved by reading technical material (that excludes Wikipedia and MS publications).
Buy a Modern Wheel
Temporal Data has been a problem that has been worked with by thousands of data modellers following the RM and trying to implement good solutions. Some of them are good and others not. But now we have the work of giants, seriously researched, and with solutions and prescribed treatment provided. As before, these will eventually be implemented in the SQL Standard. PostgreSQL already has a couple of the required functions (the authors are part of TTM).
Therefore we can take those solutions and prescriptions, which will be (a) future-proofed and (b) reliable (unlike the thousands of not-so-good Temporal databases that currently exist), rather than relying on either personal opinion, or popular votes on some web-site. Needless to say, the code will be much easier as well.
Inspect Before Purchase
If you do some googling, beware that there are also really bad "books" available. These are published under the banner of MS and Oracle, by PhDs who spend their lives at the ice cream parlour. Because they did not read and understand the textbooks, they have a shallow understanding of the problem, and invent quite incorrect "solutions". Then they proceed to provide massive solutions, not to Temporal data, but to the massive problems inherent in their "solutions". You will be locked into problems that have been identified and solved, and into implementing triggers and all sorts of unnecessary code. Anything available free is worth exactly the price you paid for it.
Temporal Data
So I will try to simplify the Temporal problem, and paraphrase the guidance from the textbook, for the scope of your question. Simple rules, taking both Normalisation and Temporal requirements into account, as well as usage that you have not foreseen.
First and foremost, use the correct Datatype for any kind of Temporal column. That means DATETIME or SMALLDATETIME, depending on the resolution and range that you require. Where only the DATE or TIME portion is required, you can use that. This allows you to perform date & time arithmetic using SQL functions directly in your WHERE clause.
Second, make sure that you use really clear names for the columns and variables.
There are three types of Temporal Data. It is all about categorising them properly, so that the treatment (planned and unplanned) is easy (which is why yours is a good question, and why I provide a full explanation). The advantage is much simpler SQL using inline Date/Time functions (you do not need the planned Temporal SQL functions). Always store:
Instant as SMALL/DATETIME, eg. UpdatedDtm
Interval as INTEGER, clearly identifying the Unit in the column name, eg. IntervalSec or NumDays
There are some technicians who argue that Interval should be stored in DATETIME, regardless of the component being used, as (eg) seconds or months since midnight 01 Jan 1900, etc. That is fine, but requires more unwieldy (not complex) code both in the initial storage and whenever it is extracted.
Whatever you choose, be consistent.
Period or Duration. This is defined as the time period between two separate Instants. Storage depends on whether the Period is conjunct or disjunct.
For conjunct Periods, as in your Event requirement: use one SMALL/DATETIME for EventDateTime; the end of the Period can be derived from the beginning of the Period of the next row, and EndDateTime should not be stored.
For disjunct Periods, with gaps in between: yes, you need 2 x SMALL/DATETIMEs in the same row, eg. a RentedFrom and a RentedTo.
A Period or Duration across rows merely needs the ending Instant to be stored in some other row: ExerciseStart is the Event.DateTime of the X1 Event row, and ExerciseEnd is the Event.DateTime of the X9 Event row.
Therefore, storing a Period or Duration as an Interval is simply incorrect, not subject to opinion.
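A small sketch of the conjunct-Period rule above, using SQLite through Python (the table and column names are hypothetical): the end of each Period is derived from the start of the next row rather than stored.

```python
import sqlite3

# Hypothetical Event table with conjunct Periods: no EndDateTime column;
# each Period ends where the next one for the same Product begins.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Event (ProductId INTEGER, StartDtm TEXT)")
con.executemany("INSERT INTO Event VALUES (?, ?)", [
    (1, "2024-01-01 09:00"), (1, "2024-01-01 11:00"), (1, "2024-01-01 14:00"),
])

# LEAD() takes the next row's start as this row's end (window functions
# need SQLite >= 3.25, which ships with current Python builds).
rows = con.execute("""
    SELECT StartDtm,
           LEAD(StartDtm) OVER (PARTITION BY ProductId ORDER BY StartDtm) AS EndDtm
    FROM Event
""").fetchall()
for start, end in rows:
    print(start, "->", end)  # the last Period is open-ended (NULL)
```

Nothing duplicated, so there is nothing to keep in sync.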
Data Duplication
Separately, in a Normalised database, ie. where EndDateTime is not stored (unless disjunct, as per above), storing a datum that can be derived will introduce an Update Anomaly where there was none.
With one EndDateTime, you have one version of the truth in one place; whereas with duplicated data, you have a second version of the fact in another column:
which breaks 1NF
the two facts need to be maintained (updated) together, transactionally, and are at the risk of being out of synch
different queries could yield different results, due to two versions of the truth
All easily avoided by maintaining the science. The return (an insignificant increase in the speed of a single query) is not worth destroying the integrity of the data for.
Response to Comments
could you expand a little bit on the practical difference between conjunct and disjunct and the direct practical effect of these concepts on db design? (as I understand the difference, the exercise and temp-basal in my database are disjunct because they are distinct events separated by whitespace.. whereas basal itself would be conjunct because there's always a value)
Not quite. In your Db (as far as I understand it so far):
All the Events are Instants, not conjunct or disjunct Periods
The exceptions are Exercise and TempBasal, for which the ending Instant is stored, and therefore they have Periods, with whitespace between the Periods; thus they are disjunct.
I think you want to identify more Durations, such as ActiveInsulinPeriod and ActiveCarbPeriod, etc, but so far they only have an Event (Instant) that is causative.
I don't think you have any conjunct Periods (there may well be, but I am hard pressed to identify any). I retract what I said (when they were Readings, they looked conjunct, but we have progressed).
For a simple example of conjunct Periods, that we can work with re practical effect, please refer to this time-series question. The text and perhaps the code may be of value, so I have linked the Q/A, but I particularly want you to look at the Data Model. Ignore the three implementation options; they are irrelevant in this context.
Every Period in that database is Conjunct. A Product is always in some Status. The End-DateTime of any Period is the Start-DateTime of the next row for the Product.
It entirely depends on what you want to do with the data. As you say, you can filter by end time if you store that. On the other hand, if you want to find "all events lasting more than an hour" then the duration would be most useful.
Of course, you could always store both if necessary.
The important thing is: do you know how you're going to want to use the data?
EDIT: Just to add a little more meat, depending on the database you're using, you may wish to consider using a view: store only (say) the start time and duration, but have a view which exposes the start time, duration and computed end time. If you need to query against all three columns (whether together or separately) you'll want to check what support your database has for indexing a view column. This has the benefits of convenience and clarity, but without the downside of data redundancy (having to keep the "spare" column in sync with the other two). On the other hand, it's more complicated and requires more support from your database.
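A minimal sketch of that view idea in SQLite via Python (table and column names are made up): only the start time and duration are stored, and the end time is computed by the view.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Event (Name TEXT, StartDtm TEXT, DurationMin INTEGER)")
con.execute("""
    CREATE VIEW EventWithEnd AS
    SELECT Name, StartDtm, DurationMin,
           datetime(StartDtm, '+' || DurationMin || ' minutes') AS EndDtm
    FROM Event
""")
con.execute("INSERT INTO Event VALUES ('Lecture', '2024-01-01 09:00', 120)")

# The end time is queryable without being stored, so there is no sync risk;
# note SQLite cannot index a view column, so check your engine's support
# (e.g. indexed/materialized views) if you need to filter on EndDtm often.
print(con.execute("SELECT EndDtm FROM EventWithEnd").fetchone()[0])
```

The same pattern works with a computed/generated column on engines that support indexing one.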
End - Start = Duration.
One could argue you could even use End and Duration, so there really is no difference between any of the combinations.
Except for the triviality that you need a column included in order to filter on it, so include:
duration: if you need to filter by duration of execution time
start + end: if you need to trap for events that both start and end within a timeframe