How to model domain index with subset of set in MiniZinc

I am working on a nurse scheduling problem with the constraint that a nurse must not be assigned a day shift on the day after a night shift.
The relevant sets look like this:
set of int: shift = 1..7;
set of int: dayshift = {1,3,4};
set of int: nightshift = {2,5,6,7};
How do I model this constraint in MiniZinc?
I've tried:
constraint
forall(e in Employee, d in Day where d != Day[card(Day)])(
Assign[e, d, sh in NightShift] + Assign[e, d+1, sh in DayShift] < 1
);
error: MiniZinc: type error: undefined identifier `sh'

The solution to your problem is to compute the sum of the "Assign" variables over each of the two shift sets:
constraint forall(e in Employee, d in Day where d < max(Day))(
sum(sh in NightShift)(Assign[e, d, sh]) + sum(sh in DayShift)(Assign[e, d+1, sh]) <= 1
);
As a side note, I would like to remark that using 0/1 variables for these kinds of problems is really only a good way of modelling for mathematical optimisation (MIP) solvers. Constraint Programming (CP) and Lazy Clause Generation (LCG) solvers, although otherwise a great fit for these kinds of problems, will not handle a 0/1 encoding efficiently.
My recommendation would be to model the kind of shift worked on each day as an enum, and assign one enum value per employee per day. Constraints like the one you're expressing here then often fit well into a regular constraint, which performs well on both MIP and CP/LCG solvers (in the case of MIP it gets automatically transformed into a flow model); see the sketch below.
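To make the regular-constraint idea concrete: MiniZinc's regular global constraint checks that each employee's sequence of shifts is accepted by a finite automaton. Here is an illustrative Python sketch (all names hypothetical, not from the question) of the automaton such a constraint would encode for "no day shift immediately after a night shift":

DAY, NIGHT, OFF = "day", "night", "off"

# The state encodes what the employee worked the previous day.
# The transition ("after_night", DAY) is deliberately absent: it is forbidden.
TRANSITIONS = {
    ("start", DAY): "after_day",
    ("start", NIGHT): "after_night",
    ("start", OFF): "start",
    ("after_day", DAY): "after_day",
    ("after_day", NIGHT): "after_night",
    ("after_day", OFF): "start",
    ("after_night", NIGHT): "after_night",
    ("after_night", OFF): "start",
}

def roster_ok(roster):
    """True iff the sequence never has a day shift right after a night shift."""
    state = "start"
    for shift in roster:
        state = TRANSITIONS.get((state, shift))
        if state is None:
            return False
    return True

print(roster_ok([NIGHT, OFF, DAY]))  # True
print(roster_ok([NIGHT, DAY]))       # False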


Closest position between randomly moving objects

I have a large database table that contains grid references (X and Y) associated with various objects (each with a unique object identifier) as they move with time. The objects move at approximately constant speed but in random directions.
The table looks something like this:
CREATE TABLE positions (
objectId INTEGER,
x_coord INTEGER,
y_coord INTEGER,
posTime TIMESTAMP);
I want to find which two objects got closest to each other and at what time.
Finding the distance between two fixes is relatively easy – simple Pythagoras for the differences between the X and Y values should do the trick.
The first problem seems to be one of volume. The grid itself is large, 100,000 possible X co-ordinates and a similar number of Y co-ordinates. For any given time period the table might contain 10,000 grid reference positions for 1000 different objects – 10 million rows in total.
That’s not in itself a large number, but I can’t think of a way of avoiding doing a ‘product query’ to compare every fix to every other fix. Doing this with 10 million rows will produce 100 million million results.
The next issue is that I’m not just interested in the closest two fixes to each other, I’m interested in the closest two fixes from different objects.
Another issue is that I need to match time as well as position – I’m not just interested in two objects that have visited the same grid square, they need to have done so at the same time.
The other point (may not be relevant) is that the items are unlikely to ever occupy exactly the same location at the same time.
I’ve got as far as a simple product query with a few sample rows, but I’m not sure of my next steps. I’m beginning to think this isn’t going to be something I can pull off with a single SQL query (please prove me wrong), and I’m likely to have to extract the data and subject it to some procedural programming.
Any suggestions?
I’m not sure what SE forum this best suited for – database SQL? Programming? Maths?
UPDATE - Another issue to add to the complexity: the timestamping for each object and position is irregular; one item might have a position recorded at 14:10:00 and another at 14:10:01. If these two positions are right next to each other and one second apart, then they may actually represent the closest approach even though the times don't match!
In order to reduce the number of tested combinations you should segregate the rows by postime using subqueries. It's also recommended to create an index on postime to improve performance.
create index ix1_time on positions (postime);
Since you didn't mention a specific database I assumed PostgreSQL, as it's easy to use (for me). The solution should look like this:
with t as (
select distinct postime as pt from positions
)
select distinct on (x.pt)
x.pt, x.aid, x.bid, x.dist
from (
select
t.pt,
a.objectid as aid, b.objectid as bid,
a.x_coord + a.y_coord + b.x_coord + b.y_coord as dist -- fix here!
from t
join positions a on a.postime = t.pt
join positions b on b.postime = t.pt
where a.objectid < b.objectid
) x
order by x.pt, x.dist;
This SQL compares positions only against other positions sharing the same postime value, never across different postime values; the condition a.objectid < b.objectid ensures each pair of distinct objects is tested once rather than twice.
Please note: I used a.x_coord + a.y_coord + b.x_coord + b.y_coord as the distance formula. I leave the correct Pythagorean version for you to implement here.
With roughly 1,000 objects per timestamp, that is about half a million pairs for each of the roughly 10,000 distinct postime values: billions of comparisons in total, but far fewer than the naive all-against-all product. The distinct on (x.pt) clause keeps only the closest pair for each postime, so the result has one row per distinct timestamp.
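If you do end up extracting the data for procedural processing, as the question anticipates, the same per-timestamp grouping is easy to express. A minimal Python sketch (row format and names are hypothetical; brute force within each group, with the real Pythagorean distance):

import math
from collections import defaultdict

def closest_pairs(rows):
    """rows: iterable of (object_id, x, y, pos_time) tuples."""
    by_time = defaultdict(list)
    for obj, x, y, t in rows:
        by_time[t].append((obj, x, y))

    best = {}  # pos_time -> (distance, object_a, object_b)
    for t, fixes in by_time.items():
        for i in range(len(fixes)):
            for j in range(i + 1, len(fixes)):
                a, ax, ay = fixes[i]
                b, bx, by = fixes[j]
                if a == b:
                    continue  # skip duplicate fixes of the same object
                d = math.hypot(ax - bx, ay - by)  # plain Pythagoras
                if t not in best or d < best[t][0]:
                    best[t] = (d, a, b)
    return best

rows = [(1, 0, 0, "14:10:00"), (2, 3, 4, "14:10:00"), (3, 1, 1, "14:10:00")]
print(closest_pairs(rows))  # {'14:10:00': (1.414..., 1, 3)}

Handling the irregular timestamps from the UPDATE would mean bucketing nearby times together rather than grouping on exact equality, but the structure stays the same.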

Heterogeneous resources in shift rostering

In a shift rostering problem, how would you model a situation in which the number of employees needed depends on how good the employees are?
The advice given in the OptaPlanner documentation and elsewhere is that you should divide a many-to-many relationship into a many-to-one and a one-to-many. In the nurserostering example, this results in Shift, ShiftAssignment and Employee classes.
But in nurserostering, Shift has a fixed requiredEmployeeSize property. In my problem, I can't have a fixed value here. The number of employees required is determined by the capacity of the employees.
How would you do this?
Thanks!
Firstly, define a numeric capacity variable in your Employee class and a numeric need variable in your Shift class.
Then, you can use a rule like the following one. For each shift, this rule will apply a penalty if insufficient total capacity is assigned.
rule "Insufficient Capacity Assignment"
when
$shift : Shift(need > 0, $need : need)
$totalCapacity : Number() from accumulate(
$assignment : ShiftAssignment(employee != null, shift == $shift, $capacity : employee.getCapacity() ),
sum($capacity)
)
eval($totalCapacity.intValue() < $need)
then
scoreHolder.addHardConstraintMatch(kcontext, 1, -10);
end
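For readers not fluent in DRL, here is a plain-Python restatement (classes and names are hypothetical) of what the rule above scores: one fixed hard penalty of -10 per shift whose total assigned capacity falls short of its need:

from dataclasses import dataclass

@dataclass
class Shift:
    name: str
    need: int

def hard_score(shifts, assignments):
    """assignments: (shift, employee_capacity) pairs for assignments with an employee."""
    score = 0
    for shift in shifts:
        total_capacity = sum(cap for s, cap in assignments if s is shift)
        if shift.need > 0 and total_capacity < shift.need:
            score -= 10  # mirrors addHardConstraintMatch(kcontext, 1, -10)
    return score

day, night = Shift("day", need=5), Shift("night", need=2)
print(hard_score([day, night], [(day, 2), (day, 2), (night, 3)]))  # -10: day has 4 < 5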

What is the use case that makes EAVT index preferable to EATV?

From what I understand, EATV (which Datomic does not have) would be a great fit for as-of queries. On the other hand, I see no use case for EAVT.
This is analogous to row/primary-key access. From the docs: "The EAVT index provides efficient access to everything about a given entity. Conceptually this is very similar to row access style in a SQL database, except that entities can possess arbitrary attributes rather than being limited to a predefined set of columns."
The immutable time/history side of Datomic is a motivating use case for it, but in general, it's still optimized around typical database operations, e.g. looking up an entity's attributes and their values.
Update:
Datomic stores datoms (in segments) in the index tree. So you navigate to a particular E's segment using the tree and then retrieve the datoms about that E in the segment, which are EAVT datoms. From your comment, I believe you're thinking of this as the navigation of more b-tree like structures at each step, which is incorrect. Once you've navigated to the E, you are accessing a leaf segment of (sorted) datoms.
You are not looking for a single value at a specific point in time. You are looking for a set of values up to a specific point in time T. History is on a per value basis (not attribute basis).
For example, assert X, retract X then assert X again. These are 3 distinct facts over 3 distinct transactions. You need to compute that X was added, then removed and then possibly added again at some point.
You can do this with SQL:
create table Datoms (
E bigint not null,
A bigint not null,
V varbinary(1536) not null,
T bigint not null,
Op bit not null --assert/retract
)
select E, A, V
from Datoms
where E = 1 and T <= 42
group by E, A, V
having 0 < sum(case Op when 1 then +1 else -1 end)
The fifth component Op of the datom tells you whether the value is asserted (1) or retracted (0). By summing over this value (as +1/-1) we arrive at either 1 or 0.
Asserting the same value twice does nothing, and you always retract the old value before you assert a new one. The latter is a prerequisite for the sums to work out this nicely.
With an EAVT index, this is a very efficient query and it's quite elegant. You can build a basic Datomic-like system in just 150 lines of SQL like this. It is the same pattern repeated for any permutation of EAVT index that you want.
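The same pattern is easy to play with outside a database. A tiny in-memory Python sketch (made-up data) of the as-of query above, with op = 1 for assert and 0 for retract:

def as_of(datoms, e, t):
    """Return the (A, V) pairs asserted for entity e as of transaction t."""
    tally = {}
    for E, A, V, T, op in datoms:
        if E == e and T <= t:
            # same +1/-1 tally as the SQL's HAVING 0 < SUM(...)
            tally[(A, V)] = tally.get((A, V), 0) + (1 if op else -1)
    return {av for av, n in tally.items() if n > 0}

datoms = [
    (1, "color", "red", 10, 1),  # assert
    (1, "color", "red", 20, 0),  # retract
    (1, "color", "red", 30, 1),  # assert again
]
print(as_of(datoms, 1, 25))  # set(): the value was retracted as of T=25
print(as_of(datoms, 1, 42))  # {('color', 'red')}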

Fox-Goat-Cabbage Transportation

My question is about an old transportation problem -- carrying three items across a river with a boat only capable of transferring one item at a time. A constraint is that certain items cannot be left together, such as the cabbage with the goat, or the wolf with the goat. This problem should be solvable using Integer Programming, or another optimization approach. The goal is to have all items on the other side of the river, and the trips required to get there could be the output from Simplex (?) as it tries out different feasible solutions. I was wondering if anyone has the Integer Programming (or Linear Programming) formulation of this problem, and / or Matlab, Octave, or Python based code that can offer the solution programmatically, including a trace of the solver trying out all paths -- our boat rides.
There was some interesting stuff here
http://www.zib.de/Publications/Reports/SC-95-27.pdf
Thanks,
I recommend using binary variables x_i,t to model the positions of your items, i.e. they are zero if the item is located on the left shore after trip t and one otherwise. At most one of these variables can change during a trip. This can be modeled by
x_wolf,1 + x_cabbage,1 + x_goat,1 <= 1 + x_wolf,0 + x_cabbage,0 + x_goat,0 and
x_wolf,1 >= x_wolf,0
x_cabbage,1 >= x_cabbage,0
x_goat,1 >= x_goat,0
Similar constraints are required for trips in the other direction.
Furthermore, after an odd number of trips the boat is on the right shore, so you need constraints to check the items left unattended on the left shore; similarly, after an even number of trips you have to check the right shore. For instance:
x_wolf,1 + x_goat,1 >= 1 and
x_wolf,2 + x_goat,2 <= 1 ...
Use an upper bound for t, such that a solution is surely possible.
Finally, introduce the binary variable z_t and let
z_t <= 1/3 (x_wolf,t + x_cabbage,t + x_goat,t)
and maximize sum_t (z_t).
(Most probably sum_t (x_wolf,t + x_cabbage,t + x_goat,t) should work too.)
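Here is a runnable sketch of this formulation in Python with the PuLP library (the tool choice is mine, not the answer's; any MILP interface would do). Like the constraints above, it assumes the boat crosses on every trip, alternating shores, so odd trips move items left to right and even trips right to left:

from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum, value

items = ["wolf", "goat", "cabbage"]
T = 7  # upper bound on the number of trips; the classic solution needs 7

prob = LpProblem("river_crossing", LpMaximize)

# x[i, t] = 1 if item i is on the right shore after trip t (t = 0 is the start)
x = {(i, t): LpVariable(f"x_{i}_{t}", cat=LpBinary)
     for i in items for t in range(T + 1)}
z = {t: LpVariable(f"z_{t}", cat=LpBinary) for t in range(T + 1)}

for i in items:
    prob += x[i, 0] == 0  # everything starts on the left shore

for t in range(1, T + 1):
    if t % 2 == 1:
        # odd trips go left -> right: items may only move 0 -> 1, at most one moves
        for i in items:
            prob += x[i, t] >= x[i, t - 1]
        prob += lpSum(x[i, t] - x[i, t - 1] for i in items) <= 1
        # the boat ends on the right, so check the unattended left shore
        prob += x["goat", t] + x["wolf", t] >= 1
        prob += x["goat", t] + x["cabbage", t] >= 1
    else:
        # even trips go right -> left
        for i in items:
            prob += x[i, t] <= x[i, t - 1]
        prob += lpSum(x[i, t - 1] - x[i, t] for i in items) <= 1
        # the boat ends on the left, so check the unattended right shore
        prob += x["goat", t] + x["wolf", t] <= 1
        prob += x["goat", t] + x["cabbage", t] <= 1

# z_t can only be 1 once all three items are across; maximising rewards finishing
for t in range(T + 1):
    prob += 3 * z[t] <= lpSum(x[i, t] for i in items)
prob += lpSum(z.values())

prob.solve()
for t in range(T + 1):
    print(t, [i for i in items if value(x[i, t]) > 0.5])

Solving it yields one of the two symmetric seven-trip plans (wolf and cabbage are interchangeable): goat over, return empty, wolf over, goat back, cabbage over, return empty, goat over.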
You are right that this formulation will require integer variables. The traditional way of solving a problem like this would be to formulate a binary variable model and pass the formulation onto a solver. MATLAB in this case would not work unless you have access to the Optimization Toolbox.
http://www.mathworks.com/products/optimization/index.html
In your formulation you would need to address the following:
Decision Variables
In your case this would look something like:
x_it (choose [yes=1 no=0] to transport item i during boat trip number t)
Objective Function
I'm not quite sure what this is from your description but there should be a cost, c_t, associated with each boat trip. If you want to minimize total time, each trip would have a constant cost of 1. So your objective should look something like:
minimize SUM((i,t),c_t*x_it) (so you are minimizing the total cost over all trips)
Constraints
This is the tricky part for your problem. The complicating constraint is the exclusivity that you identified. Remember, x_it is binary.
For each pair of items (i1,i2) that conflict with each other you have a constraint that looks like this
x_(i1 t) + x_(i2 t) <= 1
For example:
x_("cabbage" "1") + x_("goat" "1") <= 1
x_("wolf" "1") + x_("goat" "1") <= 1
x_("cabbage" "2") + x_("goat" "2") <= 1
x_("wolf" "2") + x_("goat" "2") <= 1
etc.
You see how this prevents conflict. A boat schedule that assigns "cabbage" and "goat" to the same trip will violate this binary exclusivity constraint since "1+1 > 1"
Tools like GAMS, AMPL and GLPK will allow you to express this group of constraints very concisely.
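For instance, in Python with PuLP (purely my choice of tool, for illustration) the whole group of exclusivity constraints above is a double loop:

from pulp import LpProblem, LpVariable, LpBinary

items, trips = ["wolf", "goat", "cabbage"], range(1, 8)
conflicts = [("cabbage", "goat"), ("wolf", "goat")]

prob = LpProblem("exclusivity_only")  # just this constraint group, not the full model
x = {(i, t): LpVariable(f"x_{i}_{t}", cat=LpBinary) for i in items for t in trips}

for i1, i2 in conflicts:
    for t in trips:
        prob += x[i1, t] + x[i2, t] <= 1  # conflicting items never share trip t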
Hope that helps.

table design + SQL question

I have a table foodbar, created with the following DDL. (I am using MySQL 5.1.x.)
CREATE TABLE foodbar (
id INT NOT NULL AUTO_INCREMENT,
user_id INT NOT NULL,
weight double not null,
created_at date not null,
PRIMARY KEY (id)
);
I have four questions:

1. How may I write a query that returns a result set that gives me the following information: user_id, weight_gain, where weight_gain is the difference between a weight and a weight that was recorded 7 days ago?

2. How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.

3. Since in question 2 (and indeed question 1) I am searching the records in the table using a calculated field, indexing would be preferable to optimise the query. However, since it is a calculated field, it is not clear which field to index (I'm guessing the 'weight' field is the one that needs indexing). Am I right in that assumption?

4. Assuming I had another field in the foodbar table (say 'height') and I wanted to select records from the table based on (say) the product (i.e. multiplication) of 'height' and 'weight', would I be right in assuming again that I need to index 'height' and 'weight'? Do I also need to create a composite key (say (height, weight))? If this question is not clear, I would be happy to clarify.
I don't see why you should need the synthetic key, so I'll use this table instead:
CREATE TABLE foodbar (
user_id INT NOT NULL
, created_at date not null
, weight double not null
, PRIMARY KEY (user_id, created_at)
);
How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY
;
the INTERVAL syntax varies between databases (the form above is MySQL's) but you get the idea
How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
see above, add ORDER BY curr.weight - prev.weight DESC and LIMIT N
for the last two questions: don't speculate, examine execution plans (PostgreSQL has EXPLAIN ANALYZE, MySQL has EXPLAIN). You'll probably find you need to index the columns that participate in WHERE and JOIN clauses, not the ones that form the result set.
I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.
For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:
X Y
1 8
2 2
4 4
The products would order them as:
X Y Product
2 2 4
1 8 8
4 4 16
If MySQL supported calculated columns in a table and allowed indexing on those columns, then that might help; MySQL 5.1 does not.
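The mismatch is easy to see in Python, using the rows from the table above:

rows = [(1, 8), (2, 2), (4, 4)]
print(sorted(rows))                             # [(1, 8), (2, 2), (4, 4)]: by X, then Y
print(sorted(rows, key=lambda r: r[0] * r[1]))  # [(2, 2), (1, 8), (4, 4)]: by product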
I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:
CREATE TABLE foodbar (
user_id INT NOT NULL,
created_at date not null,
weight_delta double not null,
PRIMARY KEY (user_id, created_at)
);
It means you'd store the user's initial weight in, say, the user table, and when you write records to the foodbar table a user could supply their weight at that time, but what gets stored is the change since the previous weighing. So you'd see values like:
user_id weight_delta
------------------------
1 2
1 5
1 -3
Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.
This way you could use SUM over whatever window you like, because someone may weigh in every day or only occasionally - just somebody's curr.weight - prev.weight equation needs rows at exactly the right dates, regardless of the time span you pick.
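The point is that deltas compose by simple addition over any window, as a quick Python check of the sample rows shows:

deltas = [2, 5, -3]      # user 1's weight_delta rows, in date order
print(sum(deltas))       # 4: total gain, matching the example above
print(sum(deltas[-2:]))  # 2: gain over just the last two weighings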
Getting the top x is easy in MySQL - use the LIMIT clause, but mind that you provide an ORDER BY to make sure the limit is applied correctly.
It's not obvious, but there's some important information missing from the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to have a consistent, regular daily record of users' weights. So you need to clarify a couple of rules around determining 'current weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:
The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
The most recent weight reading more than x days ago will be the weight assumed at x days ago. (Even though for example a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining weight 7 days ago.)
Now to answer the questions:
1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:
Current weights:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Similarly for the x days ago reading:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
where Created_at < CURDATE() - INTERVAL 7 DAY
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Now simply join these results as subqueries
select cur.User_id,
cur.Weight as Cur_weight,
prev.Weight as Prev_weight,
cur.Weight - prev.Weight as Weight_change
from (
/*Insert query #1 here*/
) cur
inner join (
/*Insert query #2 here*/
) prev on
prev.User_id = cur.User_id
If I remember correctly, the MySQL syntax to get the top N weight gains would be to simply add:
ORDER BY cur.Weight - prev.Weight DESC LIMIT N
2&3: Choosing indexes requires a little understanding of how the query optimiser will process the query:
The important thing when it comes to index selection is which columns you are filtering by or joining on. The optimiser will use an index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective, returning < 1% of the data, to be considered useful). There's always a trade-off between the slow disk seeks of navigating indexes and simply processing all the data in memory.
3: Although weights feature significantly in what you display, their only relevance in terms of filtering (or selection) is in #2, to get the top N weight gains. That is a complex calculation based on a number of queries and a lot of processing that has gone before, so Weight will provide zero benefit as an index.
Another note is that even for #2 you have to calculate the weight change of all users in order to determine which have gained the most. Therefore, unless you have a very large number of readings per user, you will read most of the table (i.e. a table scan will be used to obtain the bulk of the data).
Where indexes can benefit:
You are trying to identify specific Foodbar rows based on User_id and Created_at.
You are also joining back to the Foodbar table again using User_id and Created_at.
This implies an index on (User_id, Created_at) would be useful (more so if this is the clustered index).
4: No, unfortunately it is mathematically impossible to recover the ordering of the product from the individual values of H and W. E.g. H=3 and W=3 are each less than 5, yet the product 3*3 = 9 is greater than 5*1 = 5.
You would have to actually store the calculation in its own column and index that additional column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.