R-error: "number of levels of each grouping factor must be < number of observations" - error-handling

I'm relatively new to R and have to perform a Linear Mixed Model-Analysis on some data for my university studies.
To describe my data ("data_complete_group"):
I tested flow (variable: "fss_M") for individual soccer players deriving from 4 different teams. Each team (and therefore each player in the related team) was allocated to one of two conditions of the variable "group".
Each person also completed three different surveys on three different days and therefore had a personalized "ID" (which is also a variable in the model).
The variable "team_num" represents the related team for each player.
I now want test whether the group-factor has a significant influence on the flow-score.
The model looks as follows:
model1 <- lmer(fss_M ~ group + (1 + group|team_num/ID), data = data_complete_group)
summary(model1)
If I understood it correctly, this means that "fss_M ~ group" is the fixed effect with "1 + group|team_num/ID" as random effect.
Unfortunately I get an error message when I want to run the code:
Eror: number of levels of each grouping factor must be < number of observations (problems: ID:team_num)
In contrast to that, the analysis works when I remove the term for the random effect.
How can I understand this? What's wrong with the code for the analysis with fixed + random effect?
I'm glad for every answer to this, thanks a lot!

Related

Select the best "team" of 9 "players" based on overall "team" performance only

I have 9 bins named A through I containing the following number of objects:
A(8), B(7), C(6), D(7), E(5), F(6), G(6), H(6), I(6)
Objects from each bin fulfill a specific role and cannot be interchanged. I am selecting one object from each bin at random forming a "team" of 9 "players":
T_ijklmnopq = {a_i, b_j, c_k, d_l, e_m, f_n, g_o, h_p, i_q}
There are 15,240,960 such combinations - a huge number. I have means of evaluating performance of each "team" via a costly objective function, F(T_ijklmnopq). Thus, I can feasibly sample a limited number of random combinations, say no more than 500 samples.
Having results of such sampling, I want to predict the most likely best combination of "players". How to do it?
Keep in mind this is different from classical team selection because there is no meaningful evaluation of F() based on individual performance. For example, "player" a_6 may be good individually, but he may not "like" e_2 and therefore the performance of "team" containing the two suffers. Conversely, three mediocre players b_1, f_5, i_2 may be a part of an awesome "team". What's know is the whole "team" performance, that's all.
One more detail: contributions of the individual roles A through I are not weighted equally. Position of, say, E may be more important than, say, H. Unfortunately, these weights are not known upfront.
The described problem must be know to combinatorial analysts, but I haven't found anything exactly like it. Linear programming solutions with known individual "player" scores do not apply here. I will be most grateful for a specific name under which this problem is known to experts.
So far I have collected 400 samples. Here is a graph of the sorted F(T) values vs. a (arbitrary) sample number to illustrate that F(T) is "reasonable".
F(T) graph of sorted samples

Traveling salesman ampl

I am working on a Traveling salesman problem and can't figure how to solve it. The problem contains ten workers, ten workplaces where they should be driven to and one car driving them one by one. There is a cost of $1.5 per km. Also, all the nodes (both workers and workplaces) are positioned in a 10*10 matrix and the distance between each block in the matrix is 1 km.
The problem should be solved using AMPL.
I have already calculated the distances between each coordinate in excel and have copy pasted the matrix to the dat.file in AMPL.
This is my mod.file so far (without the constrains):
param D > 0;
param D > 0;
set A = 1..W cross 1..D;
var x{A}; # 1 if the route goes from person p to work d,
# 0 otherwise
param cost;
param distance;
minimize Total_Cost:
sum {(w,d) in A} cost * x[w,d];
OK, so your route looks like: start-worker 1-job 1-worker 2-job 2-worker 3-job-3-...-job 10-end (give or take start & end points, depending on how you formulate the problem.
That being the case, the "worker n-job n" parts of your route are predetermined. You don't need to include "worker n-job n" costs in the route optimisation, because there's no choice about those parts of the route (though you do need to remember them for calculating total cost, of course).
So what you have here is really a basic TSP with 10 "destinations" (each representing a single worker and their assigned job) but with an asymmetric cost matrix (because cost of travel from job i to worker j isn't the same as cost of travel from job j to worker i).
If you already have an implementation for the basic TSP, it should be easy to adapt. If not, then you need to write one and make that small change for an asymmetric cost matrix. I've seen two different approaches to this in AMPL.
2-D decision matrix with subtour elimination
Decision variable x{1..10,1..10} is defined as: x[i,j] = 1 if the route goes from job i to job j, and 0 otherwise. Constraints require that every row and column of this matrix has exactly one 1.
The challenging part with this approach is preventing subtours (i.e. the "route" produced is actually two or more separate cycles instead of one large cycle). It sounds like your current attempt is at this stage.
One solution to the problem of subtours is an iterative approach:
Write an implementation that includes all requirements except for subtour prevention.
Solve with this implementation.
Check the resulting solution for subtours.
If no subtours are found, return the solution and end.
If you do find subtours, add a constraint which prevents that particular subtour. (Identify the arcs involved in the subtour, and set a constraint which implies they can't all be selected.) Then go to #2.
For a small exercise you may be able to do the subtour elimination by hand. For a larger exercise, or if your lecturer doesn't like that approach, you can create a .run that automates it. See Bob Fourer's post of 31/7/2013 in this thread for an example of implementation.
3-D decision matrix with time dimension
Under this approach, you set up a decision variable x{1..10,1..10,1..10} where x[i,j,t] = 1 if the route goes from job i to worker j at time t, and 0 otherwise. Your constraints then require that the route goes to and from each job/worker combination exactly once, that if it goes to worker i at time t then it must go from job i at time t+1 (excepting first/last issues), that it's doing exactly one thing at time t, and that the endpoint at time 10 matches the startpoint at time 1 (assuming you want a circuit).
This prevents subtours, because it forces a route that starts at some point at time 1, returns to that point at time 10, and doesn't visit any other point more than once - meaning that it has to go through all of them exactly once.

Can proc sql embedded in sas macros dynamically merge to data-sets, simulating residential treatment placement decisions for trouble youth?

Good afternoon and happy Friday, folks
I’m trying to automate a placement simulation of youth into residential treatment where they will have the highest likelihood of success. Success is operationalized as “not recidivating” within 3 years of entering treatment. Equations predicting recidivism have been generated for each location, and the equations have been applied to each individual in the scenario (based on youth characteristics like risk, age, etc., LOS). Each youth has predicted success rates for every location, which throws in a wrench: youth are not qualified for all of the treatment facilities for which they have predicted success rates. Indeed, treatment locations have differing, yet overlapping qualifications.
Let’s take a made-up example. Johnny (ID # 5, below) is a 15-year-old boy with drug charges. He could have “predicted success rates” of 91% for location A, 88% for location B, 50% for location C, and 75% for location D. Johnny is most likely to be successful (i.e., not recidivate within three years of entering treatment) if he is treated at location A; unfortunately, location A only accepts youth who are 17 years old or older; therefore, Johnny would not qualify for treatment here. Alternatively, for Johnny, location B is the next best location. Let us assume that Johnny is qualified for location B, but that all of location-B beds are filled; so, we must now look to location D, as it is now Johnny’s “best available” option at 75%.
The score so far: We are matching youth to available beds in location for which they qualify and might enjoy the greatest likelihood of success. Unfortunately, each location only has a certain number of available beds, and the number of available beds different across locations. The qualifications of entry into treatment facilities differ, yet overlap (e.g., 12-17 year-olds vs 14-20 year-olds).
In order to simulate what placement decisions might look like based on success rates, I went through the scenario describe above for over 400 youth, by hand, in excel. It took me about a week. I’d like to use PROC SQL imbedded in a SAS MACRO to automate these placement scenarios with the ultimate goals of a) obtain the ability to bootstrap iterations in order to examine effect sizes across distributions, b) save time, and c) prevent further brain damage from banging my head again desk and wall in frustration whilst doing this by hand. Whilst never having had the necessity—nay—the privilege of using SQL in my typical roll as a researcher, I believe that this time has now come to pass and I’m excited about it! Honestly. I believe it has the capacity I’m looking for. Unfortunately, it is beating the devil out of me!
Here’s what I’ve got cookin’ so far: I want to create and automate the placement simulation with the clever use of merging/joining/switching/or something like that.
I have two datasets (tables). The first dataset contains all of the youth information (one row per youth; several columns with demographics, location ranks, which correspond to the predicted success rates). The order of rows in the youth dataset (was/will be randomly generated (to simulate the randomness with which youth enter the system and are subsequently place into treatment). Note that I will be “cleaning” the youth dataset prior to merging such that rank-column cells will only be populated for programs for which a respective youth qualifies. This should take the “does the youth even qualify for the program” problem out of the equation.
However, it still leaves the issue of availability left to be contended with in the scenario.
The second dataset containing the treatment facility beds, with each row corresponding to an available bed in one of the treatment location; two columns contain bed numbers and location names. Each bed (row) has only one location cell populated, but locations will populate several cells.
Thus, in descending order, I want to merge each youth row with the available bed that represents his/her best chance of success, and so the merge/join/switch/thing should take place
on youth.Rank1= distinct TF.Location,
and if youth.Rank1≠ TF.location then
merge on youth.Rank2= TF.location,
if youth.Rank2≠ TF.location then merge at
youth.Rank3 = TF.location, etc.
Put plainly: “Merge on rank1 unless rank1 location is no longer available, then merge on rank2, unless rank2 location is no longer available, and on down the line, etc., etc., until all option are exhausted and foster care (i.e., alternative services). Is the only option.
I’ve had no success getting this to work. I haven’t even been successful getting the union function to work. About the only successful thing I’ve done in SQL so far is create a view of a single dataset. It’s pretty sad. I’ve been following this guidance, but I get hung up around the “where” command:
proc sql; /Calls the SQL procedure*/;
create table x as /*Tells SAS to create a table called x*/
select /*Specifies the column(s) to be selected*/
from /*Specificies the tables(s) (data sets) to be queried*/
where /*Subjests the data based on a condition*/
group by /*Classifies the data into groups based on the specified
column(s)*/
order by /*Sorts the resulting rows observations) by the specified
column(s)*/
; quit; /*Ends the proc sql procedure*/
Frankly, I’m stuck and I could use some advice. This greenhorn in me is in way over his head.
I appreciate any help or guidance anyone might lend.
Cheers!
P
The process you describe (and to be honest I skiped to the end so I might of missed something) does not lend itself to SQL because each step could affect the results of the next one. However, you want to get the most best results for the most kids. (I think a lot of that text was to convince us how important it is to help out). You don't actually give us anything we can really use to help since you don't give any details of your data model, your data, or expected results. There really is no way to answer this question. But I don't care -- I'm going to go forward with some suggestions because it is a friday and I've never done a stream of consciousness answer to a stream of consciousness question before. I will suggest you don't formulate your solution just in sql, but instead use a higher level program and engage is a process like the one described below -- because this a DB questions I've noted the locations where the DB might be involved.
Generate a list kids (this can be in a table -- called NEEDY-KID)
Have a list of locations to assign (this can also be a table LOCATION)
Run your matching for best fit from KID to location -- at this point don't worry about assign more than one kid to a location -- there can be duplicates (put this in table called KID2LOC using a query)
Check KID2LOC for locations assigned twice -- use some method to remove the duplicate ones so each loc is only assigned once. (remove from the KID2LOC using a query)
Prune the LOCATION list to remove assigned locations (once again -- a query)
If kids exist without a location go to 3 with new pruned location list.
Done.

Neural Network Input and Output Data formatting

and thanks for reading my thread.
I have read some of the previous posts on formatting/normalising input data for a Neural Network, but cannot find something that addresses my queries specifically. I apologise for the long post.
I am attempting to build a radial basis function network for analysing horse racing data. I realise that this has been done before, but the data that I have is "special" and I have a keen interest in racing/sportsbetting/programming so would like to give it a shot!
Whilst I think I understand the principles for the RBFN itself, I am having some trouble understanding the normalisation/formatting/scaling of the input data so that it is presented in a "sensible manner" for the network, and I am not sure how I should formulate the output target values.
For example, in my data I look at the "Class change", which compares the class of race that the horse is running in now compared to the race before, and can have a value between -5 and +5. I expect that I need to rescale these to between -1 and +1 (right?!), but I have noticed that many more runners have a class change of 1, 0 or -1 than any other value, so I am worried about "over-representation". It is not possible to gather more data for the higher/lower class changes because thats just 'the way the data comes'. Would it be best to use the data as-is after scaling, or should I trim extreme values, or something else?
Similarly, there are "continuous" inputs - like the "Days Since Last Run". It can have a value between 1 and about 1000, but values in the range of 10-40 vastly dominate. I was going to scale these values to be between 0 and 1, but even if I trim the most extreme values before scaling, I am still going to have a huge representation of a certain range - is this going to cause me an issue? How are problems like this usually dealt with?
Finally, I am having trouble understanding how to present the "target" values for training to the network. My existing results data has the "win/lose" (0 or 1?) and the odds at which the runner won or lost. If I just use the "win/lose", it treats all wins and loses the same when really they're not - I would be quite happy with a network that ignored all the small winners but was highly profitable from picking 10-1 shots. Similarly, a network could be forgiven for "losing" on a 20-1 shot but losing a bet at 2/5 would be a bad loss. I considered making the results (+1 * odds) for a winner and (-1 / odds) for a loser to capture the issue above, but this will mean that my results are not a continuous function as there will be a "discontinuity" between short price winners and short price losers.
Should I have two outputs to cover this - one for bet/no bet, and another for "stake"?
I am sorry for the flood of questions and the long post, but this would really help me set off on the right track.
Thank you for any help anyone can offer me!
Kind regards,
Paul
The documentation that came with your RBFN is a good starting point to answer some of these questions.
Trimming data aka "clamping" or "winsorizing" is something I use for similar data. For example "days since last run" for a horse could be anything from just one day to several years but tends to centre in the region of 20 to 30 days. Some experts use a figure of say 63 days to indicate a "spell" so you could have an indicator variable like "> 63 =1 else 0" for example. One clue is to look at outliers say the upper or lower 5% of any variable and clamp these.
If you use odds/dividends anywhere make sure you use the probabilities ie 1/(odds+1) and a useful idea is to normalize these to 100%.
The odds or parimutual prices tend to swamp other predictors so one technique is to develop separate models, one for the market variables (the market model) and another for the non-market variables (often called the "fundamental" model).

Design Problem

Recently I was faced with this interview question (K-Means Clustering solution). The design I came up with did not meet the expectations of the interviewer (to put simply I didnt get the job because I lost to another candidate on this design problem). I am wondering how many different / efficient / simply solutions can the SO community come up with (by doing this I am hoping to hone my skills):
To implement a simple algorithm to cluster people according to their weight and height. The
data set includes a list of people with their weights and heights like so:
Person Weight Height
(kg) (inches)
Person 1 70 70
Person 2 75 80
Person 3 120 85
You can plot the data as a 2 dimensional data. Weight being one dimension and height being
the other dimension. Weight can range from a minimum of 50kg to 150kg. Height can range
from a minimum of 38inches to 90inches
Algorithm:
The algorithm (called K-means clustering) will cluster data into K groups goes as such:
Start with K clusters. Each cluster is defined by its center point which will start of as
random weight and random height. Pick random numbers from within the
corresponding ranges defined above.
For each person
Calculate distance to center of each cluster using formula
distance = sqrt(pow((wperson−wcenter), 2) + (pow(hperson−hcenter),2))
where wperson = weight of person,
hperson = height of person
wcenter = weight of cluster center point,
hcenter = height of cluster center point
Assign the person to the cluster with the shortest distance to center point of cluster
After end of step 2, you will end up with K clusters each assigned with a set of people
For each cluster, set the weight and height of the center point to the average of the
people in the cluster
wcenter = (sum of weight of each person in cluster)/(number of people in cluster)
hcenter = (sum of height of each person in cluster)/number of people in cluster)
Repeat steps 2 to 5 for 1000 iterations, then print out following information for each
cluster.
weight and height of center of cluster.
list of people in cluster.
I am not looking for a implementation/solution but for a high level design. can you list the interfaces / classes etc.
I dont want to give my solution now, but will post it later in the day?
This is my attempt at the design. I only show the static diagram since the algorithm is pretty much laid out already. I would have a plan to have a visitor for the representation of the clusters, could allow different types of output (xml, strings, csv..etc). Maybe the visitor is overkill, if it was then I'd just have something like a ToString method that could be overridden.
Note: the Cluster creates a CenterClusterItem on the SetCenter and FindNewCenter methods. The CenterClusterItem is not a PersonClusterItem, it just holds the same amount of AClusterValues as a PersonClusterItem would (since the average isn't really a person).
Also, I forgot to make a method on the KCluster to begin the process, but that's implied.
Class Diagram http://img11.imageshack.us/img11/499/kcluster.png
Well, I would first tackle all the constants/magic numbers that reduce the reusability of the algorithm:
instead of a fixed number of iterations, use a stopping criterion (e.g., if clusters don't change too much, terminate)
don't restrict yourself to 2-dim data, use vectors
let the user define the number of clusters to be found
Then, you could hide some specifics behind interfaces, e.g. the distance might be calculated differently (for example, it might at some point have to cope with values other than double).
On the other hand, if you really have this simple problem, some of these generalizations might well be overkill - but that's what I would discuss with someone telling me to implement this algorithm.
You can create the following classes:
Person to store data about persons and centers. Properties: id, weight and height. Method: calculateDistance
Cluster to store one center and a list of persons: Properties: center and list of Person. Method: calculateCenter.
KCluster to hold your algorithm and store a list of clusters: Property: list of Cluster. Methods: generateClusters.
I'm not sure what your question actually is, the steps you point out effectively define the algorithm you're talking about.
A better idea may be to include exactly what you did then people can give you some hints / tips as to where you might have gone wrong or what they would have done differently.
That sounds like a really good way to do it. K-means will usually converge quickly (though not necessarily to the global optimum), so my one suggestion would be to run the algorithm until no more changes occur, rather than a fixed number of 1000 iterations. You could then repeat the entire process a few times with different random starting points.
One weakness of k-means is that it does require you to specify (i.e. guess) an appropriate value for k up-front. I think you would get points for asking the interviewer what an appropriate value for k would be, or, if there is no way to know, describing some goodness-of-fit measure and then calculating that measure for different values of k to find a "just low enough" value.