From the optaplanner examples: which rule makes sure every team gets assigned at least once?

In the OptaPlanner tennis solution, there's a rule that makes sure every team gets assigned an equal number of times (if I understand correctly, at least). That's the fairAssignmentCountPerTeam rule.
What I don't see is how a team doesn't get totally excluded from being assigned.
My actual problem: I want to do a similar thing, though different ;)
In my case, I want to assign players to a field (4 since it's about organising doubles in padel or tennis).
But suppose I have 4 fields, so 16 players are needed, and I have 25 players.
There will be multiple confrontations (different timeslots) for playing.
So how can I make sure that all players will be taken into account?
And that it doesn't turn into continuous play among, e.g., the same set of 16 players.
I would like to create a rule similar to fairAssignmentCountPerTeam (maybe four of those rules, one for each player on the field). But I don't see a constraint saying that all teams (players) should get assigned across all time slots.
Don't know if my question is clear enough.
Let me know what extra information could help. (I am new to optaplanner)

In the tennis example, every assignment (day + indexInDay) must be assigned to a team, because the tennis example does not use overconstrained planning (it does not have nullable = true on the @PlanningVariable(...) annotation). So it cannot leave TeamAssignment.team = null.
Now, let's look at that constraint:
Constraint fairAssignmentCountPerTeam(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(TeamAssignment.class)
            .groupBy(loadBalance(TeamAssignment::getTeam))
            .penalize("fairAssignmentCountPerTeam", HardMediumSoftScore.ONE_MEDIUM,
                    result -> (int) result.getZeroDeviationSquaredSumRootMillis());
}
Given 40 assignments and 10 teams, it can:
A) Assign all 40 assignments to team 1, resulting in a penalty of 40 * 40 = 1600
B) Assign 20 assignments to team 1 and 20 to team 2, resulting in a penalty of 20 * 20 + 20 * 20 = 800
C) Assign 4 assignments to each of the 10 teams, resulting in a penalty of 10 * (4 * 4) = 160
Obviously C) is the lowest penalty. By not assigning any assignments to one team, the penalty from another team grows quadratically, so it won't do that.

Related

How to use normalization to set levels of confidence between a rating and the number of ratings in Python or SQL?

I have a list of about 800 sales items that have a rating (from 1 to 5), and the number of ratings. I'd like to list the items that are most probable of having a "good" rating in an unbiased way, meaning that 1 person voting 5.0 isn't nearly as good as 50 people having voted and the rating of the item being a 4.5.
Initially I thought about taking the smallest number of votes (which will be zero 99% of the time) and the highest number of votes for an item on the list and factoring that into the ratings, giving me a confidence level of 0 to 100%; however, I'm thinking that this approach would be too simplistic.
I've heard about Bayesian probability but I have no idea on how to implement it. My list of items, ratings and number of ratings is on a MySQL view, but I'm parsing the code using Python, so I can make the calculations on either side (but preferably at the SQL view).
Is there any practical way that I can normalize this voting with SQL, considering the rating and number of votes as parameters?
|----------|--------|--------------|
| itemCode | rating | numOfRatings |
|----------|--------|--------------|
| 12330    | 5.00   | 2            |
| 85763    | 4.65   | 36           |
| 85333    | 3.11   | 9            |
|----------|--------|--------------|
I've started off trying to assign percentiles to the rating and numOfRatings, this way I'd be able to do normalization (sum them with an initial 50/50 weight). Here's the code I've attempted:
SELECT p.itemCode AS itemCode,
       (p.rating - min(p.rating)) / (max(p.rating) - min(p.rating)) AS percentil_rating,
       (p.numOfRatings - min(p.numOfRatings)) / (max(p.numOfRatings) - min(p.numOfRatings)) AS percentil_qtd_ratings
FROM products p
WHERE p.available = 1
GROUP BY p.itemCode
However that's only bringing me a result for the first itemCode on the list, not all of them.
Clearly the issue here is the low number of observations in your data. A Bayesian approach is the way to go: it gives a sensible probability estimate for rating applications, especially when there are few observations, and it lets you weight each item's rating by how much evidence you have for it (this article provides an excellent explanation of Bayesian probability for beginners).
I would suggest storing your data in CSV files so it becomes easier to manipulate in Python. Denormalizing the data via joins is the first task to do before analyzing your ratings.
This is the simplified Bayesian-average (weighted rating) formula to use in your Python code:
weighted rating = (v / (v + m)) * R + (m / (v + m)) * C
where:
R – the average rating of the individual product
v – the number of votes for that product
C – the average vote across all products
m – a tunable parameter: the minimum number of votes required for a product to be considered (how many votes you want before it is displayed)
Since this is the simplified formula, this article explains how it has been derived from the original formula. This article is also helpful in explaining the parameters.
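As a hedged illustration (not from the original answer), the same formula can also be evaluated directly in SQL against the products table and rating / numOfRatings columns from the question; the cutoff m = 10 is an arbitrary choice, and C is computed as the average rating over all available items:
-- Hypothetical sketch: weighted (Bayesian-average) rating per item,
-- with m = 10 as an arbitrary vote cutoff and C = AVG(rating) over all items.
SELECT p.itemCode,
       (p.numOfRatings * p.rating
        + 10 * (SELECT AVG(rating) FROM products WHERE available = 1))
       / (p.numOfRatings + 10) AS weightedRating
FROM products p
WHERE p.available = 1
ORDER BY weightedRating DESC;
This is algebraically the same as (v / (v + m)) * R + (m / (v + m)) * C, just rearranged so it reads naturally in a single SELECT.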
Knowing the formula pretty much gets 50% of your work done; the rest is just importing your data and working with it. Below are examples similar to your problem, in case you need a full demonstration:
Github example 1
Github example 2

Database table design for a user referencing a particular outcome

Scenario:
The user inputs name, goal, and size. Goals are "long", "average", "short". Sizes are "big", "medium", "small". There are 3 tables. User stores the name, goal, and size of each user. Final stores the outcome for each goal/size combination. UserFinal serves as an associative table that links each user to the corresponding outcome. A certain combination of goal and size results in a specific outcome (refer to the table below) that is assigned to that user.
GOAL     SIZE    OUTCOME
Long     Big     12
Long     Medium  14
Long     Small   18
Average  Big     13
Average  Medium  16
Average  Small   19
Short    Big     15
Short    Medium  17
Short    Small   20
Objective: Table Final should not grow and is only intended to serve as a 'lookup' table for when a User submits their goal and size (not sure if a table that isn't expected to grow is a good design?). The outcome value will be used for further calculations for each particular user.
Question: Is the illustrated table design the correct way to reflect this scenario? If it isn't, what is the best way for accomplishing this?
It seems Goal and Size identify Outcome, and each User selects one Outcome.
See the following data model:
Sample data:
Goal
--------
L Long
A Average
S Short
Size
--------
B Big
M Medium
S Small
Outcome
--------
L B 12
L M 14
L S 18
A B 13
A M 16
A S 19
S B 15
S M 17
S S 20
I deleted my answer and am submitting another one, because the edits changed the clarity of the question significantly.
Having a Final table purely as a look-up table does not seem like good DB design.
I would do a simple one to many relationship.
Every user has many outcomes like so:
Users
user_id [pk]
name
Outcomes
outcome_id [pk]
user_id [fk]
goal
size
outcome
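A hedged DDL sketch of that one-to-many design (the column types and lengths are assumptions, not from the original answer):
-- Hypothetical sketch of the Users / Outcomes relationship described above.
CREATE TABLE Users (
    user_id INT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL
);

CREATE TABLE Outcomes (
    outcome_id INT PRIMARY KEY,
    user_id    INT NOT NULL REFERENCES Users (user_id),
    goal       VARCHAR(10) NOT NULL,
    size       VARCHAR(10) NOT NULL,
    outcome    INT NOT NULL
);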

Efficient SELECT with complex WHERE condition - do I need to store a column with the calculated value?

Suppose I have a large table to store ranges of integers. I can do this with two fields:
start|end
10 |210 (represents 10 to 210)
5 |55 (represents 5 to 55)
(quick to select by end column), or:
start|length
10 | 200 (represents 10 to 210)
5 | 50 (represents 5 to 55)
(quick to select by length column).
What if sometimes I need to select by end, and sometimes by length, and both queries need to be fast? I could store both:
start|length|end
10 | 200 |210
5 | 50 |55
But then this is not normalised and everyone has to remember to update both fields, and is just bad design.
I know I can select by start + length or end - start but for a very large table, isn't this extremely slow?
How can I query by calculated values quickly without storing redundant data - or should I just store the extra column?
Depending on the database type you are using, you might want to use a trigger to calculate the derived field. That way, they can never get out of synch.
This means that the field (length) could be re-calculated every time start or end changes.
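A minimal sketch of such a trigger, assuming PostgreSQL and a hypothetical ranges table holding the start, end, and length columns from the question ("end" is quoted because END is a reserved word):
-- Hypothetical sketch: recalculate length whenever a row is inserted or updated.
CREATE OR REPLACE FUNCTION sync_length() RETURNS trigger AS $$
BEGIN
    NEW.length := NEW."end" - NEW.start;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ranges_sync_length
    BEFORE INSERT OR UPDATE ON ranges
    FOR EACH ROW EXECUTE PROCEDURE sync_length();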
I'd store the length, but I'd make sure the calculation was done in my insert and update sprocs so that as long as everyone uses your sprocs there is no more overhead for them.
Unfortunately neither of your target databases support computed columns. I would do the following:
First, determine whether you really have a performance problem. It is true that WHERE end - start = ? will perform more slowly than WHERE length = ?, but you don't define what a "really big table" is in your application, nor what the required performance is. No need to optimize away a problem that may not exist.
Determine whether you can support any latency in your searches. If so, you can add the calculated column to the table but dedicate a separate task, running every five minutes, each hour, or whatever, to fill in the values.
In PostgreSQL you could consider a materialized view, which I believe is supported at the engine level (see Catcall's comment below, and the sketch after this list).
Finally, if all else fails, consider using a trigger to maintain the calculated column.
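As a hedged illustration of the materialized-view idea mentioned above (the ranges table name is an assumption; start, end, and length come from the question):
-- Hypothetical PostgreSQL sketch: expose the derived length without storing it in the base table.
CREATE MATERIALIZED VIEW ranges_with_length AS
SELECT start, "end", "end" - start AS length
FROM ranges;

-- Refresh on whatever schedule fits your latency budget.
REFRESH MATERIALIZED VIEW ranges_with_length;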

How to tally and store votes for a web site?

I am using SQL Server 2005.
I have a site where people can vote on awesome motorcycles. Each time a user votes, there is one vote for the first bike and one vote against the second bike. Two votes are stored in the database. The vote table looks like this:
VoteID  VoteDate    BikeID  Vote
1       2012-01-12  123     1
2       2012-01-12  125     0
3       2012-01-12  126     0
4       2012-01-12  129     1
I want to tally the votes for each bike quite frequently, say each hour. My idea is to store the tally as a percentage of contests won versus lost on the bike table as an attribute of the bike. So, if a bike won 10 contests and lost 20 contests, it would have a score (tally) of 33. I would tally up daily, weekly, and monthly scores.
BikeID  BikeName   DailyTally  WeeklyTally  MonthlyTally
1       Big Dog    5           10           50
2       Big Cat    3           15           40
3       Small Dog  9           8            0
4       Fish Face  19          21           0
Right now, there are about 500 votes per day being cast. We anticipate 2500 - 5000 per day in the next month or so.
What is the best way to tally the data and what is the best way to store it? Should the tallies be on their own table? Should a trigger be used to run a new tally each time a bike is voted on? Should a stored procedure be run hourly to get all tallies?
Any ideas would be very helpful!
Store your VoteDate as a datetime value instead of just date.
For your tallies, you can just make that a view and calculate it on the fly. This should be very simple to do using GROUP BY and DATEPART functions. If you need exact code for how to do this, please open a new question.
For that low volume of rows it doesn't make any sense to store aggregations in a table when you can just calculate them whenever you want to see them and get accurate and immediate results that are up-to-date.
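A hedged sketch of what such a view might look like, assuming the Vote table from the question, that Vote = 1 means a win for that BikeID, and SQL Server's DATEPART (the view name is made up):
-- Hypothetical sketch: win percentage per bike per day, computed on the fly.
CREATE VIEW dbo.DailyTally AS
SELECT BikeID,
       DATEPART(year, VoteDate)      AS VoteYear,
       DATEPART(dayofyear, VoteDate) AS VoteDayOfYear,
       100.0 * SUM(CASE WHEN Vote = 1 THEN 1 ELSE 0 END) / COUNT(*) AS WinPercent
FROM dbo.Vote
GROUP BY BikeID, DATEPART(year, VoteDate), DATEPART(dayofyear, VoteDate);
Weekly and monthly views would follow the same pattern with DATEPART(week, ...) or DATEPART(month, ...) in the grouping.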
I agree with @JNK: try a view or just a normal stored proc to calculate the outputs on the fly. If you find it becomes too slow as your data grows, I would investigate other routes then (like caching the data in another table, etc.). It's probably worth keeping it simple to start with; you can always reuse the logic from the SP/view later if you do want to set up a scheduled task.
Edit:
Removed the indexed view as per @Damien_The_Unbeliever's comments: it's not deterministic and I'm stupid :)

Database schema for searching by age groups

I've struggled with this for a while now trying to figure out how to do this most efficiently.
The problem is as follows. I have items in a database to be marketed to specific age groups, such as ages 10 to 20 or ages 16+, and I need to be able to make a query like: find items suitable for a 17-year-old.
Here are my two best ideas (but I don't like either, as I think they're both inefficient).
Have a csv column with values like 10-20 and 16+ , retrieve the entire list, and parse through it (Bad idea, I know, I'm fresh out of ideas here though)
Have a csv column with values like 10,11,12,13...20 for ranges, so I can look for it using WHERE ages LIKE "%17%", and for cases like 16+ I'd have to retrieve those special cases using something like WHERE ages LIKE "%+%" and parse through those.
I'm of course leaning towards the second option, but even in the best scenario I'm running two queries: one for regular ranges and one for special cases like 16+.
Is there a better way? If not, do you think you could make either of my models more efficient? Thanks.
You can do it like this:
1. Add lower_age and upper_age columns to your table, both integers that allow NULLs.
2. If lower_age is NULL then there is no lower bound.
3. If upper_age is NULL then there is no upper bound.
4. Combine COALESCE and BETWEEN for your queries.
To clarify (4), you want to say things like this:
select *
from your_table
where $n between coalesce(lower_age, $n) and coalesce(upper_age, $n)
where $n is the age you're looking for. BETWEEN uses inclusive bounds so coalesce(lower_age, $n) ignores $n if lower_age is not NULL and gives you $n >= $n (i.e. an automatic true on that bound) if lower_age is NULL; similarly for the upper_age.
If something is suitable for only 11 year olds, then your [lower_age,upper_age] closed interval would be [11, 11], 16+ would be [16, NULL], six and lower would be [NULL, 6], everyone would be [NULL, NULL], and no one would be [23, 11] or anything else with lower_age > upper_age (or, more likely, invalid data that a CHECK constraint would throw a hissy fit over).
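A minimal DDL sketch of that scheme, assuming a hypothetical items table (the table name and the item_id column are made up):
-- Hypothetical sketch: nullable age bounds plus a CHECK constraint rejecting
-- intervals where lower_age > upper_age.
CREATE TABLE items (
    item_id   INT PRIMARY KEY,
    lower_age INT NULL,
    upper_age INT NULL,
    CONSTRAINT chk_age_range
        CHECK (lower_age IS NULL OR upper_age IS NULL OR lower_age <= upper_age)
);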
You can do this a number of ways. If you store the age of the user (or whatever) in the row, then you can query against that age with > 16, < 30, BETWEEN 10 AND 20, or whatever. The other option is to store it as a bitwise value. Have a reference table that stores your different ranges; if a row can belong to multiple ranges, you just add the corresponding values together.
1  = 10
2  = 16+
4  = 10-20
8  = 20-30
16 = 20+
32 = 30+
...
Then, in the table that stores the person's info, you can set the column to an int or bigint (take your preference), and for whatever groups they belong to you can determine this by the number, for example:
Table of Users
ID  Name       BitWise
1   test       2
2   something  6 (2+4)
3   blah       24 (8+16)
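As a hedged illustration (not part of the original answer), membership in a given range can then be tested with a bitwise AND, e.g. flag 4 for the 10-20 group:
-- Hypothetical sketch: users whose BitWise column includes the 10-20 flag (value 4).
SELECT ID, Name
FROM Users
WHERE BitWise & 4 <> 0;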
However, I think the bitwise approach may be a bit of overkill; you might be best off just storing the age as a number and running queries against that. More than likely this will be the most efficient.
You have a range of options (no pun intended). For age recommendations, the easiest way is to store a min_age and max_age and query like this:
select * from item where :age between min_age and max_age
where you have to decide whether you allow nulls for these columns (then you need to use coalesce() or nvl() or whatever function your database provides for dealing with comparisons with nulls), or set boundary values for these columns where you can be sure :age will always fall in between.
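A hedged sketch of the NULL-tolerant variant described above, reusing the item table and :age parameter from the query before it:
-- Hypothetical sketch: treat a NULL bound as "no bound" via COALESCE.
SELECT *
FROM item
WHERE :age BETWEEN COALESCE(min_age, :age) AND COALESCE(max_age, :age);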
Alternatively, you can use a m:n table
create table item_ages (item_id int not null, age int not null, constraint item_ages_pk primary key (item_id, age));
and fill it with explicit values:
item_id | age
-------------
1 | 16
1 | 17
1 | 18
and so on. This is more cumbersome than using a range, but also more flexible, and since your database can index the table and probably keep that index in memory, queries should be fast. You only have to touch this table when a new item is entered or the age range for a particular item changes.
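A hedged sketch of the corresponding lookup, assuming the item table from the earlier query is keyed by item_id:
-- Hypothetical sketch: all items explicitly marked as suitable for a 17-year-old.
SELECT i.*
FROM item i
JOIN item_ages ia ON ia.item_id = i.item_id
WHERE ia.age = 17;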
Note that CBRRacer's answer has similar properties: both share the idea that you prepare a datastructure that can easily be indexed, and answer the filter question from that index. This is a popular method for storing marketing data in ecommerce applications. The extreme end of that range would be to use a dedicated package for storing inverted indexes for that purpose. But for a simple age recommendation that's of course overkill.
Something like this:
SELECT *
FROM tablename
WHERE 17 BETWEEN start_age AND end_age