Stuck too early in a local minimum with tabu search and simulated annealing - OptaPlanner

I'm trying to generate a roster of employees, but going down to the level of individual hours. The resource requirements per hour look like 1 at 8am, 2 at 9am, 7 at 1pm, and so on.
After assigning the first 3 employees, the solver keeps exploring solutions around them by assigning/reassigning them to slots, without ever trying to assign the other employees.
How do I troubleshoot this problem? Could it be the weights of each constraint/violation? Would it speed things up if I implemented a quick construction heuristic that fills the slots before handing the solution to local search?
The current configuration consists of FIRST_FIT as the construction heuristic, then hill climbing as the first local search phase until it gets stuck, followed by tabu search with simulated annealing.

Normally a CH assigns all employees before LS starts, and LS then just moves them around but never unassigns anyone. See the general phase sequence diagram in the "Optimization algorithms" chapter of the docs. That presumes you don't apply overconstrained planning (nullable=true or a null in the value range).
If you do apply overconstrained planning, you need to make sure that the score cost of leaving an employee unassigned is worse than the score cost of any way he could be assigned.
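To make that rule concrete, here is a toy sketch in plain Python (not OptaPlanner's scoring API) of the invariant to check; the weights and the cost function are made-up stand-ins for your real constraint weights:

UNASSIGNED_PENALTY = 1000   # penalty per employee left unassigned

def assignment_cost(employee, slot):
    # Stand-in for the soft-constraint cost of one real assignment; in a well-tuned
    # config its worst possible value stays below UNASSIGNED_PENALTY.
    return 10 * abs(employee["preferred_hour"] - slot["hour"])

def score(assignments):
    # assignments is a list of (employee, slot) pairs; slot is None if unassigned.
    penalty = 0
    for employee, slot in assignments:
        penalty += UNASSIGNED_PENALTY if slot is None else assignment_cost(employee, slot)
    return -penalty

roster = [({"preferred_hour": 9}, {"hour": 13}),   # assigned, costs 40
          ({"preferred_hour": 8}, None)]           # unassigned, costs 1000
print(score(roster))                               # -1040

If an unassigned employee can ever be cheaper than an expensive assignment, the solver will prefer leaving people out, which can look exactly like being "stuck" on a handful of employees.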
Also set up a benchmarker config, so you have benchmark report graphs to help you understand what's going on.

Related

Why is score calculation speed faster for Construction Heuristics than Local Search?

Getting started with OptaPlanner (v.23.0.Final), I am experimenting with the CloudBalancing example. Using the IncrementalScoreCalculator Java class, I notice that the score calculation speed is much higher in the construction phase (>1M/sec) than in the local search phase (~50k/sec). How can this happen? Is the time spent in the algorithm outside the score calculation included? That could explain the difference, since the local search algorithm will spend much more time outside the score calculator than the construction algorithm.
Two reasons:
1) The Construction Heuristic starts with no processes assigned to a computer, so Process.getComputer() is null for all of them. Most constraints match on processes for which computer != null, so they short-circuit and don't do any expensive joins, groupBys, accumulates, etc. So an empty or partially initialized solution evaluates much faster than a fully initialized one (which Local Search works with); see the sketch after this list.
2) The CHs only do ChangeMoves. LS also does more expensive moves, including swap moves (twice as big) and pillar moves (n times as big). So the amount of delta impact to calculate per move is bigger in LS too.
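A toy illustration of point 1 in plain Python (not the actual CloudBalancing code); the class and field names are simplified stand-ins for the example's real ones:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Computer:
    cpu_capacity: int

@dataclass
class Process:
    required_cpu: int
    computer: Optional[Computer] = None

def cpu_overload_penalty(processes, computers):
    # The filter on "computer is not None" is the short circuit: during the
    # Construction Heuristic most processes are still unassigned, so the
    # aggregation below runs over very few rows and is cheap to evaluate.
    assigned = [p for p in processes if p.computer is not None]
    penalty = 0
    for computer in computers:
        used = sum(p.required_cpu for p in assigned if p.computer is computer)
        penalty += max(0, used - computer.cpu_capacity)
    return penalty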

Data Science: Using Inferential Statistics to label train dataset

The lack of high schools in remote areas is a problem for students in developing countries. Students in some locations are better than those in others, so I have to find those locations. Now, the main problem is defining "BETTER". I have made some rules that define the profile of a location.
Right now, I am concerned with the good students.
So, what I have done is:
1. Used some inferential statistics to make rules that lead to the conclusion that locations A, B, C, etc. are the most promising locations for high schools, because according to my rules these locations contain quality students.
I did all of the above to label the data, because I needed to define "BETTER" and label the data so that I can now use a machine learning algorithm to learn the factors that make a location a promising one; then, if I give the model a data point from the test data, it will instantly tell whether the location is better or not.
Overview of the method:
For each location, I have these 4 pieces of information:
total_students_staying_for_high_school_education(A),
total_students_leaving_for_high_school_education_in_another_place(B),
mean_grade_point_of_students_of_type_B,
ratio (calculated as B/A).
For locations whose ratio > 1, I applied a chi-squared significance test to produce a statistic telling me whether students are leaving that place in significantly greater numbers than they are staying. I then used ANOVA followed by a Tukey test to compare mean grade points and to find pairs of locations whose means differ significantly, and which of the two is greater.
I then wrote a Python program with a custom comparator that first checks whether the mean grades of two locations differ significantly and returns the one with the greater mean; if the means don't differ, the comparator returns the location whose chi-squared statistic is greater (a sketch of such a comparator is shown below).
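For illustration, a minimal sketch of that comparator in Python, assuming the Tukey and chi-squared results have already been computed elsewhere; the location names and numbers are hypothetical:

from functools import cmp_to_key

# Pairs of locations whose mean grades differ significantly according to the Tukey test.
tukey_significant = {("A", "B"), ("B", "A")}

locations = [
    {"name": "A", "mean_grade": 3.4, "chi2": 18.2},
    {"name": "B", "mean_grade": 2.9, "chi2": 25.7},
    {"name": "C", "mean_grade": 3.1, "chi2": 12.0},
]

def compare(loc1, loc2):
    # 1) If the means differ significantly, the greater mean grade wins.
    if (loc1["name"], loc2["name"]) in tukey_significant:
        return -1 if loc1["mean_grade"] > loc2["mean_grade"] else 1
    # 2) Otherwise fall back to the chi-squared statistic (stronger outflow wins).
    return loc2["chi2"] - loc1["chi2"]

ranked = sorted(locations, key=cmp_to_key(compare))
print([loc["name"] for loc in ranked])   # "BETTER" locations first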
This is how the whole process comes up with a few suggested locations, and I call those locations "BETTER".
What I am concerned about is:
1. How do I verify whether my rules are valid? Or do I even need to verify them?
2. Most importantly, is mingling statistics with machine learning as described above an appropriate approach? Is there any major leakage in the method? Can anyone suggest a more general method?

Can PROC SQL embedded in SAS macros dynamically merge two datasets, simulating residential treatment placement decisions for troubled youth?

Good afternoon and happy Friday, folks
I'm trying to automate a simulation that places youth into the residential treatment where they will have the highest likelihood of success. Success is operationalized as "not recidivating" within 3 years of entering treatment. Equations predicting recidivism have been generated for each location, and the equations have been applied to each individual in the scenario (based on youth characteristics like risk, age, LOS, etc.). Each youth therefore has a predicted success rate for every location, which throws in a wrench: youth are not qualified for all of the treatment facilities for which they have predicted success rates. Indeed, treatment locations have differing, yet overlapping, qualifications.
Let's take a made-up example. Johnny (ID # 5, below) is a 15-year-old boy with drug charges. He might have "predicted success rates" of 91% for location A, 88% for location B, 50% for location C, and 75% for location D. Johnny is most likely to be successful (i.e., not to recidivate within three years of entering treatment) if he is treated at location A; unfortunately, location A only accepts youth who are 17 years old or older, so Johnny would not qualify for treatment there. For Johnny, location B is the next best location. Let us assume that Johnny is qualified for location B, but that all of location B's beds are filled; so we must now look to location D, as it is now Johnny's "best available" option at 75%.
The score so far: we are matching youth to available beds in locations for which they qualify and where they might enjoy the greatest likelihood of success. Unfortunately, each location only has a certain number of available beds, and the number of available beds differs across locations. The qualifications for entry into treatment facilities differ, yet overlap (e.g., 12-17 year-olds vs. 14-20 year-olds).
In order to simulate what placement decisions might look like based on success rates, I went through the scenario described above for over 400 youth, by hand, in Excel. It took me about a week. I'd like to use PROC SQL embedded in a SAS macro to automate these placement scenarios, with the ultimate goals of a) obtaining the ability to bootstrap iterations in order to examine effect sizes across distributions, b) saving time, and c) preventing further brain damage from banging my head against desk and wall in frustration whilst doing this by hand. Whilst never having had the necessity, nay, the privilege, of using SQL in my typical role as a researcher, I believe that this time has now come to pass and I'm excited about it! Honestly. I believe it has the capacity I'm looking for. Unfortunately, it is beating the devil out of me!
Here's what I've got cookin' so far: I want to create and automate the placement simulation with the clever use of merging/joining/switching or something like that.
I have two datasets (tables). The first dataset contains all of the youth information (one row per youth; several columns with demographics and location ranks, which correspond to the predicted success rates). The order of rows in the youth dataset was/will be randomly generated (to simulate the randomness with which youth enter the system and are subsequently placed into treatment). Note that I will be "cleaning" the youth dataset prior to merging, such that rank-column cells will only be populated for programs for which the respective youth qualifies. This should take the "does the youth even qualify for the program" problem out of the equation.
However, it still leaves the issue of availability to be contended with in the scenario.
The second dataset contains the treatment facility beds, with each row corresponding to an available bed in one of the treatment locations; two columns contain bed numbers and location names. Each bed (row) has only one location cell populated, but a location will appear in several rows.
Thus, in descending order of rank, I want to merge each youth row with the available bed that represents his/her best chance of success, and so the merge/join/switch should take place
on youth.Rank1 = distinct TF.Location;
if youth.Rank1 ≠ TF.Location, then
merge on youth.Rank2 = TF.Location;
if youth.Rank2 ≠ TF.Location, then merge on
youth.Rank3 = TF.Location, etc.
Put plainly: merge on rank1 unless the rank1 location is no longer available, then merge on rank2, unless the rank2 location is no longer available, and on down the line, until all options are exhausted and foster care (i.e., alternative services) is the only option.
I've had no success getting this to work. I haven't even been successful getting the UNION operator to work. About the only successful thing I've done in SQL so far is create a view of a single dataset. It's pretty sad. I've been following this guidance, but I get hung up around the "where" clause:
proc sql;           /* Calls the SQL procedure */
create table x as   /* Tells SAS to create a table called x */
select              /* Specifies the column(s) to be selected */
from                /* Specifies the table(s) (data sets) to be queried */
where               /* Subsets the data based on a condition */
group by            /* Classifies the data into groups based on the specified column(s) */
order by            /* Sorts the resulting rows (observations) by the specified column(s) */
;
quit;               /* Ends the PROC SQL procedure */
Frankly, I'm stuck and I could use some advice. The greenhorn in me is in way over his head.
I appreciate any help or guidance anyone might lend.
Cheers!
P
The process you describe (and to be honest I skipped to the end, so I might have missed something) does not lend itself to SQL, because each step can affect the results of the next one. However, you want to get the best results for the most kids. (I think a lot of that text was to convince us how important it is to help out.) You don't actually give us anything we can really use to help, since you don't give any details of your data model, your data, or expected results, so there is really no way to answer this question precisely. But I don't care -- I'm going to go forward with some suggestions, because it is a Friday and I've never done a stream-of-consciousness answer to a stream-of-consciousness question before. I suggest you don't formulate your solution purely in SQL, but instead use a higher-level program and engage in a process like the one described below; because this is a DB question, I've noted the places where the DB might be involved.
1. Generate a list of kids (this can be in a table -- call it NEEDY-KID).
2. Have a list of locations to assign (this can also be a table, LOCATION).
3. Run your matching for best fit from kid to location -- at this point don't worry about assigning more than one kid to a location; there can be duplicates (put this in a table called KID2LOC using a query).
4. Check KID2LOC for locations assigned twice -- use some method to remove the duplicates so each location is only assigned once (remove them from KID2LOC using a query).
5. Prune the LOCATION list to remove assigned locations (once again -- a query).
6. If kids exist without a location, go back to step 3 with the new, pruned location list.
7. Done.
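To show the shape of that loop, here is a minimal sketch in Python rather than SQL; the toy data, the names (kid_prefs, open_beds) and the tie-breaking rule are hypothetical stand-ins for the real tables:

# kid_prefs: for each kid, locations ordered from best to worst predicted success,
# already restricted to the locations the kid qualifies for (the "cleaned" youth table).
kid_prefs = {
    1: ["A", "B", "D"],
    2: ["B", "A", "C"],
    5: ["B", "D", "C"],   # e.g. Johnny: not qualified for A, so A is absent
}
open_beds = {"A": 1, "B": 1, "C": 2, "D": 1}   # available beds per location

placements = {}               # the KID2LOC table
unplaced = set(kid_prefs)     # the NEEDY-KID table

while unplaced:
    # Step 3: best-fit match for every unplaced kid; duplicates are allowed for now.
    tentative = {kid: next((loc for loc in kid_prefs[kid] if open_beds.get(loc, 0) > 0), None)
                 for kid in unplaced}
    tentative = {kid: loc for kid, loc in tentative.items() if loc is not None}
    if not tentative:
        break   # remaining kids go to foster care / alternative services
    # Step 4: where a location is over-subscribed, keep only as many kids as it has beds
    # (lowest ID wins here, standing in for the random intake order).
    by_loc = {}
    for kid, loc in sorted(tentative.items()):
        by_loc.setdefault(loc, []).append(kid)
    for loc, kids in by_loc.items():
        winners = kids[:open_beds[loc]]
        for kid in winners:
            placements[kid] = loc
            unplaced.discard(kid)
        open_beds[loc] -= len(winners)   # step 5: prune remaining capacity
    # Step 6: loop back with the pruned bed list.

print(placements)   # e.g. {1: 'A', 2: 'B', 5: 'D'}
print(unplaced)     # kids left for alternative services

Each query in the numbered steps maps onto one of these in-memory operations; one way to do the same thing in SAS would be a DATA step or hash object driving the queries inside a macro %DO loop.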

OptaPlanner: how to handle a minimum number of consecutive assignments

Let's assume a variation on the Nurse Rostering example in which, instead of assigning a nurse to a shift on a day, the nurse is assigned to a variable number of timeblocks on that day (which consists of 24 timeblocks), e.g. Nurse1 is assigned to timeblocks [8,9,10,11,12,13,14]. Let's call such a run of consecutive assignments a ShiftPeriod. There is a hard minimum and maximum on the length of these shift periods. However, OptaPlanner has difficulties finding a feasible solution.
When there are hard consecutiveness constraints, is it better to model the planning entity as a startTimeBlock with a duration, instead of my current way of assigning to a timeblock and a day and then imposing min/max consecutive constraints?
Take a look at the meeting scheduling example on GitHub master for 6.4.0.Beta1 (but the example will work perfectly with 6.3.0.Final too). Video and docs coming soon. That example uses the TimeGrain design pattern, which is what you're looking for, I think.
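To see why the second model is easier on the solver, here is a conceptual sketch in plain Python (not OptaPlanner API); the names ShiftPeriod, start_block and duration, and the 4/8 block limits, are hypothetical:

MIN_BLOCKS, MAX_BLOCKS = 4, 8   # hard min/max length of a shift period

# Option A: one assignment per (nurse, timeblock). The rule has to scan the whole
# day, find the runs of consecutive blocks, and penalize the ones out of range.
def consecutive_violations(assigned_blocks):
    blocks = sorted(assigned_blocks)          # e.g. [8, 9, 10, 11, 12, 13, 14]
    if not blocks:
        return 0
    runs, run = [], 1
    for prev, cur in zip(blocks, blocks[1:]):
        if cur == prev + 1:
            run += 1
        else:
            runs.append(run)
            run = 1
    runs.append(run)
    return sum(1 for r in runs if r < MIN_BLOCKS or r > MAX_BLOCKS)

# Option B: a ShiftPeriod entity with a start block and a duration (the TimeGrain
# idea): every move keeps the assignment consecutive by construction, and the
# min/max rule collapses to a single bound check.
class ShiftPeriod:
    def __init__(self, nurse, start_block, duration):
        self.nurse = nurse
        self.start_block = start_block
        self.duration = duration

    def violates_length(self):
        return not (MIN_BLOCKS <= self.duration <= MAX_BLOCKS)

With option A, many random moves break consecutiveness somewhere, so the solver wades through a sea of infeasible neighbours; with option B those states simply cannot be expressed.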

Neural Network Input and Output Data formatting

Hi, and thanks for reading my thread.
I have read some of the previous posts on formatting/normalising input data for a Neural Network, but cannot find something that addresses my queries specifically. I apologise for the long post.
I am attempting to build a radial basis function network for analysing horse racing data. I realise that this has been done before, but the data that I have is "special" and I have a keen interest in racing/sportsbetting/programming so would like to give it a shot!
Whilst I think I understand the principles for the RBFN itself, I am having some trouble understanding the normalisation/formatting/scaling of the input data so that it is presented in a "sensible manner" for the network, and I am not sure how I should formulate the output target values.
For example, in my data I look at the "Class change", which compares the class of race that the horse is running in now to the race before, and can have a value between -5 and +5. I expect that I need to rescale these to between -1 and +1 (right?!), but I have noticed that many more runners have a class change of 1, 0 or -1 than any other value, so I am worried about "over-representation". It is not possible to gather more data for the higher/lower class changes because that's just 'the way the data comes'. Would it be best to use the data as-is after scaling, or should I trim extreme values, or something else?
Similarly, there are "continuous" inputs - like the "Days Since Last Run". It can have a value between 1 and about 1000, but values in the range of 10-40 vastly dominate. I was going to scale these values to be between 0 and 1, but even if I trim the most extreme values before scaling, I am still going to have a huge representation of a certain range - is this going to cause me an issue? How are problems like this usually dealt with?
Finally, I am having trouble understanding how to present the "target" values for training to the network. My existing results data has the "win/lose" outcome (0 or 1?) and the odds at which the runner won or lost. If I just use "win/lose", it treats all wins and losses the same when really they're not - I would be quite happy with a network that ignored all the small winners but was highly profitable from picking 10-1 shots. Similarly, a network could be forgiven for "losing" on a 20-1 shot, but losing a bet at 2/5 would be a bad loss. I considered making the target (+1 * odds) for a winner and (-1 / odds) for a loser to capture the issue above, but this means my targets are not a continuous function, as there will be a "discontinuity" between short-priced winners and short-priced losers.
Should I have two outputs to cover this - one for bet/no bet, and another for "stake"?
I am sorry for the flood of questions and the long post, but this would really help me set off on the right track.
Thank you for any help anyone can offer me!
Kind regards,
Paul
The documentation that came with your RBFN is a good starting point to answer some of these questions.
Trimming data, aka "clamping" or "winsorizing", is something I use for similar data. For example, "days since last run" for a horse could be anything from just one day to several years, but tends to centre in the region of 20 to 30 days. Some experts use a figure of, say, 63 days to indicate a "spell", so you could have an indicator variable like "> 63 = 1, else 0", for example. One approach is to look at the outliers, say the upper or lower 5% of any variable, and clamp these.
If you use odds/dividends anywhere, make sure you use the implied probabilities, i.e. 1/(odds+1), and a useful idea is to normalize these to sum to 100%.
The odds or parimutuel prices tend to swamp other predictors, so one technique is to develop separate models: one for the market variables (the market model) and another for the non-market variables (often called the "fundamental" model).
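A rough sketch of those preprocessing steps in Python; the 5%/95% clamping limits, the 63-day cut-off and the toy numbers are just the examples from this thread:

import numpy as np

def clamp(values, lower_pct=5, upper_pct=95):
    # Winsorize: pull outliers in to the chosen percentiles before scaling.
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

def scale_01(values):
    # Min-max scale a clamped column into the 0..1 range.
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

days_since_run = np.array([7, 14, 21, 28, 35, 63, 400, 900])
x_days  = scale_01(clamp(days_since_run))       # clamped and scaled input
x_spell = (days_since_run > 63).astype(float)   # indicator: "had a spell"

# Turn starting prices into implied probabilities and normalize them to sum to
# 100% within the race, which removes the bookmaker's over-round.
odds = np.array([2.0, 4.0, 9.0, 20.0])          # odds-against for one race
implied = 1.0 / (odds + 1.0)
market_prob = implied / implied.sum()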