Predicting game outcomes at a point in time - sql-server-2005

Hi I am trying to gauge from past information I have on an MSSQL database the predicted outcomes of soccer games (win, tie or loss for the home team) at any point in time based on the minutes played and the scoreline
What I had envisaged as output was something like fangraphs does for baseball
http://www.fangraphs.com/scoreboard.aspx?date=2010-11-01
although with two lines as there are three rather than two possible outcomes
From the data and the existing tables I can create game records like this
Time TeamID Venue MatchID Result
6 TOT H 5 W
27 ASV A 5 W
58 ASV A 5 W
66 TOT H 5 W
77 TOT H 5 W
So for the graph for this game the home team TOT would start with the win line at around 45% (based on the historical probability of a home win) it would spike when they score their goal, dip significantly after ASV score twice but be probably above 90% when they score to go 3-2 up and then rise gently to 100% at the cloing 90 minute mark
So I want to go through the 7500 games I have data on and based on them establish for every minute of a 90 minute game what are the chances of a win, tie or loss for the home team based on the these results
For instance, in the simplest situation after 1 minute of play in actuality 44 of the home teams scored, 33 of them went on to win, 6 tied and 5 lost. The corresponding case where the away team scored has been 9 wins, 8 ties 23 losses for the home team. However, I am having trouble getting my head around how to get all 90 minutes scorelines and compare them with the final result (Only one goal can be scored in any specific minute)
TIA for any help

There will always be things you can add to the model, but the first thing I would do is, for each game, pull out the score at each minute, and assume that the probability of winning doesn't depend on when the goal was scored, but depends on what the score is now.
So now you would have 90 data points per game.
game1:
Minute: 0 10 20 30 40 50 60 70 80 90
Score :[0 0] [0 0] [0 1] [0 1] [1 1] [1 1] [1 1] [2 1] [2 1] [3 1]
The next thing I would do is, for each minute slice, add up the number of wins, losses, and draws over all games, for each configuration of scores.
So each entry in that table might correspond to something like this:
#minute 27, for score {home:5, away:2} : {homeWins: 9, draws:1, homeLosses:0}
you might want to try using the difference in score instead of the actual score values..
Either way, Once you have the data formatted that way, getting a reasonable solution is easy.
If a game is on, and it's minute 77, and the score is {home:5, away:2}, the (MLE) estimate is 90% wins, 10% draw, 0% looses (according to the example table entry above).
So you see already how it will help to include "laplacian smoothing": adding +1 to the final values of each of those win/lose/draw counters. This way if you've never seen a loss in this exact situation you don't say it's impossible, impossible is a very strong word (look for beta or drichlet distributions for background).
The obvious problem with this approach is that if you've never seen a particular score combination before it will predict (33%,33%,33%), which is obviously wrong in some cases.
The simplest fix would be to enforce a rule like "leading by 6 points is at least as good as leading by 5 points". It's ugly, but it's a start.
To avoid that sort of special-case logic you could try averaging that approach with a monte-carlo approximation.
The simplest of those approaches is to say : over all my data I expect about a 1 in 30 chance of a goal by each team in each minute of play -> simulate the game 10000 times form the current point, count the number of wins/losses/draws and you're done.
If that's too random, or processor intense, switch to Markov Chains.

Related

How to find frequency of element list in data frame using pandas?

I have a list and a data frame. I want to find the number of each word in the list (some words in the list are pair) for each "emotions" in the data frame.
Here is my list:
[(frozenset({'know'}), 16528),
(frozenset({'im'}), 39047),
(frozenset({'feeling'}), 99455),
(frozenset({'like'}), 49332),
(frozenset({'feel', 'im'}), 16602),
(frozenset({'feeling', 'im'}), 23488),
(frozenset({'feel'}), 202985),
(frozenset({'feel', 'like'}), 42162),
(frozenset({'time'}), 17203),
(frozenset({'really'}), 17247)]
and this is my data frame:
Unnamed: 0 id text emotions
0 0 27383 [feel, awful, job, get, position, succeed, hap... sadness
1 1 110083 [im, alone, feel, awful] sadness
2 2 140764 [ive, probably, mentioned, really, feel, proud... joy
3 3 100071 [feeling, little, low, day, back] sadness
4 4 2837 [beleive, much, sensitive, people, feeling, te... love
Here is the expected output:
6 columns for six existed emotions and the last column is for totall count.

Weighted Activity Selection Problem with allowing shifting starting time

I have some activities with weights, and I would like to select non overlapping activities by maximizing the total weight. This is known problem and solution exists.
In my case, I am allowed to shift the start time of activities in some extend while duration remains same. This will give me some flexibility and I might increase my utilization.
Example scenario is something like the following where all activities are supposed to be in interval (0-200):
(start, end, profit)
a1: 10 12 120
a2: 10 13 100
a3: 14 18 150
a4: 14 20 100
a5: 120 125 100
a6: 120 140 150
a7: 126 130 100
Without shifting flexibility, I would choose (a1, a3, a6) and that is it. On the other hand I have shifting flexibility to the left/right by at most t units for any task where t is given. In that case I might come up with this schedule and all tasks can be selected except a7 since conflict cannot be avoided by shift .
t: 5
a1: 8 10 120 (shifted -2 to left)
a2: 10 13 100
a3: 14 18 150
a4: 18 24 100 (shifted +4 to right)
a5: 115 120 100 (shifted -5 to left)
a6: 120 140 150
In my problem, total time I have is very big with respect to activity duration. While activities are like 10sec on average, total time I have would even be 10000sec. However that does not mean all of activities can be selected since shifting flexibility would not be enough for some activities to non-overlap.
Also in my problem, there are clusters of activities which overlaps and very big empty space where no activities and there comes another cluster of overlapping activities i.e a1, a2, a3 and a4 are let say cluster1 and a5, a6 and a7 is cluster2. Each cluster can be expanded in time by shifting some of them to left and right. By doing that, I can select more activities than the original activity selection problem. However, I do not know how to decide which tasks to be shifted to left or right.
My expectation is to find an near-optimal solution where total profit would be somehow local optima. I do not need global optimum value. Also I do not have any criteria about cluster utilization., i.e I do not have a guarantee about a minimum number of activity per cluster etc. Actually, these clusters something I visually describe. There is not defined cluster. However, in time domain, activities are separated as clusters somehow.
Also activity start and end times are all integers since I can dismiss fractions. I would have around 50 activities whose duration would be 10 on average. And time window is like 10000.
Are there any feasible solution to this problem?
You mentioned that you can partition the activities into clusters that don't overlap even if activities within them are shifted to the extent. Each of these clusters can be considered independently, and the optimal results computed for each cluster simply summed up for the final answer. So the first step of the algorithm could be a trial run that extends all activities in both directions, finds which ones form clusters, and process each cluster independently. In the worst case, all of the activities might form a single cluster.
Depending on the maximum size of the remaining clusters, there are several approaches. If it's under 20 (or even 30, depending on whether you want your program to run in seconds or minutes), you could combine search over all subsets of activities in the given cluster with a greedy approach. In other words: if you are processing a subset of N elements, try every one of its 2^N possible subsets (okay, 2^N-1 if we forget the empty subset), check whether the activities in this specific subset can be scheduled in non-overlapping manner, and pick the subset that is eligible and has maximum sum.
How do we check that a given subset of activities can be scheduled in non-overlapping manner? Let's sort them in ascending order of their end and process them from left to right. For every activity, we try to schedule it as early as possible, making sure it does no intersect with activities we already considered. So, the first activity in the cluster is always started time t earlier than originally planned, the second one is started either when the first one ends, or t earlier than originally planned, whichever is larger, and so on. If at any point we can't schedule the next activity in a way that it does not overlap with previous one, then there is no way to schedule the activities in this subset in a non-overlapping manner. This algorithm takes O(NlogN) time, and overall each cluster is processed in O(2^N * NlogN). Once again, note that this function grows very quickly, so if you are dealing with large enough clusters, this approach goes out the window.
===
Another approach is specific to the additional restrictions you provided. If the activities' starts and ends and parameter t are all measured in integer number of seconds, and t is about 2 minutes, then the problem for each cluster is set in a small discrete space. Even though you could position a task to start at a non-integer second value, there always is an optimal solution that uses only integers. (To prove it, consider an optimal solution that does not use integers - since t is integer, you can always shift tasks, starting from the leftmost, to the left a bit so that it starts at an integer value.)
Knowing that the start and end times are discrete, you can build a DP solution: process the activities in the ascending order of their end*, and memoize the maximum possible sum of weights you can obtain from the first 1, 2, ..., N activities for each x from activity_start - t to activity_start + t if a given activity ends at time x. If we denote this memoized function as f[activity][end_time], then the recurrence relation is f[a][e] = weight[a] + max(f[i][j] over all i < a, j <= e - (end[a] - start[a]), which roughly translates to "if activity a ended at time e, the previous activity must have ended at or before start of a - so let's pick the maximum total weight over previous activities and their ends, and add the current activity's weight".
*Again, we can prove that there is at least one optimal answer where this ordering is preserved, even though there might be other optimal answers which do not possess this property
We could go further and eliminate the iteration over previous activities, instead encoding this information in f. Its definition would then change to "f[a][e] is the maximum possible total weight of the first a activities if none of them ends after e", and recurrence relation would become f[a][e] = max(f[a-1][e], weight[a] + max(f[a-1][i] over i <= e - (end[a] - start[a])])), and its computational complexity would be O(X * N), where X is the total span of the discrete space where task starts/ends are placed.
I assume you need to compute not just the maximum possible weight, but also the activities you need to select to obtain it, and possibly even the exact time each of them needs to be started. Thankfully, we can derive all of this from the values of f, or compute it at the same time as we compute f. The latter is easier to reason about, so let's introduce a second function g[activity][end]. g[activity][end] returns a pair (last_activity, last_activity_end), essentially pointing us to the exact activity and its timing that the optimal weight in f[activity][end] uses.
Let's go through the example you provided to illustrate how this works:
(start, end, profit)
a1: 10 12 120
a2: 10 13 100
a3: 14 18 150
a4: 14 20 100
a5: 120 125 100
a6: 120 140 150
a7: 126 130 100
We order the activities by their end time, thereby swapping a7 and a6.
We initialize the values of f and g for the first activity:
f[1][7] = 120, f[1][8] = 120, ..., f[1][17] = 120, meaning that the first activity could end anywhere from 7 to 17, and costs 120. f[1][i] for all other i should be set to 0.
g[1][7] = (1, 7), g[1][8] = (1, 8), ..., g[1][17] = (1, 17), meaning that the last activity that was included in f[1][i] values was a1, and it ended at i. g[1][i] for all i outside [7, 17] is undefined/irrelevant.
That's where something interesting begins. For each i such that a2 cannot end at time i, let's assign f[2][i] = f[1][i], g[2][i] = g[1][i], which essentially means that we wouldn't be using activity a2 in those answers. For all other i, namely, in [8..18] interval, we have:
f[2][8] = max(f[1][8], 100 + max(f[1][0..5])) = f[1][8]
f[2][9] = max(f[1][9], 100 + max(f[1][0..6])) = f[1][9]
f[2][10] = max(f[1][10], 100 + max(f[1][0..7])). This is the first time when the second clause is not just plain 100, as f[1][7]>0. It is, in fact, 100+f[1][7]=220, meaning that we can take activity a2, shift it in a way that puts its end at time 10, and get a total weight of 220. We continue computing f[2][i] this way for all i <= 18.
The values of g are: g[2][8]=g[1][8]=(1, 8), g[2][9]=g[1][9]=(1, 9), g[2][10]=(2, 10), because it was optimal to take activity a2 and end it at time 10 in this case.
I hope the pattern of how this continues is visible - we compute all the values of f and g through the end, and then pick the maximum f[N][e] over all possible end times e of the last activity. Armed with the auxiliary function g, we can traverse the values backwards to figure out the exact activities and times. Namely, the last activity we use and its timing is in g[N][e]. Let's call them A and T. We know that A began at T-(end[A]-start[A]). Then, the previous activity must have ended at that point or before - so let's look at g[A-1][T-(end[A]-start[A]) for it, and so on.
Note that this approach works even if you don't partition anything into clusters, but with the partitioning, the size of the space in which tasks can be scheduled is reduced, and with it the runtime.
You might notice that neither of these solutions is polynomial in the size of input. I have a feeling that your problem doesn't have a general polynomial solution, but I was unable to prove it by reducing another NP-complete problem to it. Would be really curious to read a reduction / better general solution!

Plotting data from two sets with different shapes in the same plot

I am using data collected from two different instruments which have different resolution because of the sampling rate of each instrument. For a specific time, one of the sets have >10k entries while the other has ~2.5k. They however capture data over the same time interval, and I want to plot them on top of each other even though they have different resolution in data. The minimum and maximum x of both sets are the same however one of them have more entries.
Simplified it could look like this:
1st set from instrument with higher sampling rate:
time(s) value
0.0 10
0.2 11
0.4 12
0.6 13
0.8 14
... ..
100 50
2nd set from instrument with lower sampling rate:
time(s) value
0 100
1 120
2 125
3 128
4 130
. ...
100 430
They are measuring different things, but I would like to display them in the same plot. How can I accomplish this?
I found the mistake.. I was trying to plot both datasets using the time data from the first instrument. Of course they need to be plotted with their respective time data and I put the first time data in the second plot by mistake..

Google Spreadsheet with SQL query - finding best combination

I have a google spreadsheet for my gaming information. It contains 2 sheets - one for monster information, another for team.
Monster information sheet contains the attack value, defend value, and the mana cost of monsters. It's almost like a database of monsters that I can summon.
Team sheet does the following:
Asks for the amount of mana I currently have.
Computes a list of up to 5 monsters that I can summon (it can be less than 5).
Each monster has their own mana cost, therefore total mana cost mustn't exceed the amount of mana I have given in point 1.
The tabulated list should give me a team that have the highest combined attack value. It does not matter how many monsters are summoned. Each monster cannot be summoned twice though.
I have been thinking of using query() function so that I can make use of SQL statements. (so that I can hopefully retrieve the tabulated list directly)
Sample: Monster Info
A B C D
1 Monster Attack Defense Cost
2 MonA 1200 1200 35
3 MonB 1400 1300 50
... ...
Sample: Team
A B C D
1 Mana 120
2
3 Attack Team
4 Monster Attack Cost Total Attack
5 MonB 1400 50 1400
6 MonA 1200 35 2600
7 ... ...
I have these formula in "Team" sheet
A5: =query('Monster Info'!$A$:$D,"SELECT A,B,D ORDER BY B DESC LIMIT 5")
B5: =CONTINUE(A5, 1, 2)
C5: =CONTINUE(A5, 1, 3)
D5: =C5
A6: =CONTINUE(A5, 2, 1)
B6: =CONTINUE(A5, 2, 2)
C6: =CONTINUE(A5, 2, 3)
D6: =D5+C6
That only gets the 5 best attack monsters, regardless of the mana cost consideration. How do I do that such that it takes consideration of both attack value and mana cost value? There is another problem shown in the example below:
Example: (simplified version, without defense value etc)
Monster Attack Cost
MonA 1400 50
MonB 1200 35
MonC 1100 30
MonD 900 25
MonE 500 20
MonF 400 15
MonG 350 10
MonH 250 5
If I have 160 mana, then the obvious team is A+B+C+D+E (5100 Attack).
If I have 150 mana, it becomes A+B+C+D+G (4950 Attack).
If I have 140 mana, it becomes A+B+C+D (4600 Attack).
If I have 130 mana, it becomes B+C+D+E+F (4100 Attack using 125 mana) or A+B+C+F (4100 Attack using all 130 mana).
If I have 120 mana, it becomes B+C+D+E+G (4050 Attack).
If I have 110 mana, it becomes B+C+D+F+H (3850 Attack).
As you can see, there isn't really a pattern within the results.
Any expert willing to share their insights on this?
I've played with the problem for an hour and I only have a workaround here. Your problem seems to be a standard linear programming task which should can easily be solved by a "Solver" software. There used to be a so called "Solver" in google spreadsheet, but unfortunately it was removed from the newest version. If you are not insisting on Google solution, you should try it in one of the Solver-supported spreadsheet manager softwares.
I tried MS Office (it has a Solver add-in, installation guide: http://office.microsoft.com/en-001/excel-help/load-the-solver-add-in-HP010342660.aspx).
Before you run the solver, you should prepare your original dataset a bit, with helper columns and cells.
Add a new column next to the "Cost" column (let's assume it is column "D"), and under it put each row either 0, or 1. This column will tell you if a monster is selected to the attack team or not.
Add two more columns ("E" and "F" respectively). These columns will be products of the Attack and of the Cost respectively. So you should write a function to the E2 cell: =b2*d2, and for the F2 cell: =c2*d2. With this way if a monster is selected (which is told by the D column, remember), the appropriate E and F cells will be non zero values, aotherwise they will be 0.
Create a SUM row under the last row, and create a summarizing function for the D,E,F columns respectively. So in my spreadsheet D10 cell gets its value like this: =sum(d2:d9), and so on.
I created a spreadsheet to show these steps: https://docs.google.com/spreadsheets/d/1_7XRlupEEwat3CthSSz8h_yJ44MysK9hMsj0ijPEn18/edit?usp=sharing
Remember to copy this worksheet to an MS Office worksheet, before you start the Solver.
Now, you are ready to start the Solver. (Data menu, Solver in MS Office). You can see a video here on using the Solver: https://www.youtube.com/watch?v=Oyc0k9kiD7o
It's not that hard as it looks like, but for this case I'll describe what to write where:
Set Objective: you should select the "E10" cell, as that represents the sum of all the attack points.
Check "Max" radiobutton as we would like to maximize the value of the attacks.
By Changing variable cells: Select the "d2:d9" interval as those cells are representing whether a monster is selected or not. The solver will try to adjust these values (0, or 1) in order to maximise the sum attack.
Subject to the Contraints: Here we should add some constraints. Click on the Add button, and then:
First we should ensure that d2:d9 are all binary values. So "Cell reference" should be "d2:d9" and from the dropdown menu, select "bin" as binary.
Another constraint should be that the sum of the selected monsters should not exceed 5. So select the cell where the sum of the selected monsters is represented (D10) and add "<=" and the value "5"
Finally we cannot use more manna that we have, so select the cell in which you store the sum of used manna (F2), and "<=", and add the whole amount of manna we can spend in my case it's in the I2 cell).
Done. It should work, in my case it worked at least.
Hope it helps anyway.

Power-law distribution in T-SQL

I basically need the answer to this SO question that provides a power-law distribution, translated to T-SQL for me.
I want to pull a last name, one at a time, from a census provided table of names. I want to get roughly the same distribution as occurs in the population. The table has 88,799 names ranked by frequency. "Smith" is rank 1 with 1.006% frequency, "Alderink" is rank 88,799 with frequency of 1.7 x 10^-6. "Sanders" is rank 75 with a frequency of 0.100%.
The curve doesn't have to fit precisely at all. Just give me about 1% "Smith" and about 1 in a million "Alderink"
Here's what I have so far.
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank] = ROUND(88799 * RAND(), 0)
But this of course yields a uniform distribution.
I promise I'll still be trying to figure this out myself by the time a smarter person responds.
Why settle for the power-law distribution when you can draw from the actual distribution ?
I suggest you alter the LastNames table to include a numeric column which would contain a numeric value representing the actual number of indivuduals with a name that is more common. You'll probably want a number on a smaller but proportional scale, say, maybe 10,000 for each percent of representation.
The list would then look something like:
(other than the 3 names mentioned in the question, I'm guessing about White, Johnson et al)
Smith 0
White 10,060
Johnson 19,123
Williams 28,456
...
Sanders 200,987
..
Alderink 999,997
And the name selection would be
SELECT TOP 1 [LastName]
FROM [LastNames] as LN
WHERE LN.[number_described_above] < ROUND(100000 * RAND(), 0)
ORDER BY [number_described_above] DESC
That's picking the first name which number does not exceed the [uniform distribution] random number. Note how the query, uses less than and ordering in desc-ending order; this will guaranty that the very first entry (Smith) gets picked. The alternative would be to start the series with Smith at 10,060 rather than zero and to discard the random draws smaller than this value.
Aside from the matter of boundary management (starting at zero rather than 10,060) mentioned above, this solution, along with the two other responses so far, are the same as the one suggested in dmckee's answer to the question referenced in this question. Essentially the idea is to use the CDF (Cumulative Distribution function).
Edit:
If you insist on using a mathematical function rather than the actual distribution, the following should provide a power law function which would somehow convey the "long tail" shape of the real distribution. You may wan to tweak the #PwrCoef value (which BTW needn't be a integer), essentially the bigger the coeficient, the more skewed to the beginning of the list the function is.
DECLARE #PwrCoef INT
SET #PwrCoef = 2
SELECT 88799 - ROUND(POWER(POWER(88799.0, #PwrCoef) * RAND(), 1.0/#PwrCoef), 0)
Notes:
- the extra ".0" in the function above are important to force SQL to perform float operations rather than integer operations.
- the reason why we subtract the power calculation from 88799 is that the calculation's distribution is such that the closer a number is closer to the end of our scale, the more likely it is to be drawn. The List of family names being sorted in the reverse order (most likely names first), we need this substraction.
Assuming a power of, say, 3 the query would then look something like
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 88799 - ROUND(POWER(POWER(88799.0, 3) * RAND(), 1.0/3), 0)
Which is the query from the question except for the last line.
Re-Edit:
In looking at the actual distribution, as apparent in the Census data, the curve is extremely steep and would require a very big power coefficient, which in turn would cause overflows and/or extreme rounding errors in the naive formula shown above.
A more sensible approach may be to operate in several tiers i.e. to perform an equal number of draws in each of the, say, three thirds (or four quarters or...) of the cumulative distribution; within each of these parts list, we would draw using a power law function, possibly with the same coeficient, but with different ranges.
For example
Assuming thirds, the list divides as follow:
First third = 425 names, from Smith to Alvarado
Second third = 6,277 names, from to Gainer
Last third = 82,097 names, from Frisby to the end
If we were to need, say, 1,000 names, we'd draw 334 from the top third of the list, 333 from the second third and 333 from the last third.
For each of the thirds we'd use a similar formula, maybe with a bigger power coeficient for the first third (were were are really interested in favoring the earlier names in the list, and also where the relative frequencies are more statistically relevant). The three selection queries could look like the following:
-- Random Drawing of a single Name in top third
-- Power Coef = 12
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 425 - ROUND(POWER(POWER(425.0, 12) * RAND(), 1.0/12), 0)
-- Second third; Power Coef = 7
...
WHERE LN.[Rank]
= (425 + 6277) - ROUND(POWER(POWER(6277.0, 7) * RAND(), 1.0/7), 0)
-- Bottom third; Power Coef = 4
...
WHERE LN.[Rank]
= (425 + 6277 + 82097) - ROUND(POWER(POWER(82097.0, 4) * RAND(), 1.0/4), 0)
Instead of storing the pdf as rank, store the CDF (the sum of all frequencies until that name, starting from Aldekirk).
Then modify your select to retrieve the first LN with rank greater than your formula result.
I read the question as "I need to get a stream of names which will mirror the frequency of last names from the 1990 US Census"
I might have read the question a bit differently than the other suggestions and although an answer has been accepted, and a very through answer it is, I will contribute my experience with the Census last names.
I had downloaded the same data from the 1990 census. My goal was to produce a large number of names to be submitted for search testing during performance testing of a medical record app. I inserted the last names and the percentage of frequency into a table. I added a column and filled it with a integer which was the product of the "total names required * frequency". The frequency data from the census did not add up to exactly 100% so my total number of names was also a bit short of the requirement. I was able to correct the number by selecting random names from the list and increasing their count until I had exactly the required number, the randomly added count never ammounted to more than .05% of the total of 10 million.
I generated 10 million random numbers in the range of 1 to 88799. With each random number I would pick that name from the list and decrement the counter for that name. My approach was to simulate dealing a deck of cards except my deck had many more distinct cards and a varing number of each card.
Do you store the actual frequencies with the ranks?
Converting the algebra from that accepted answer to MySQL is no bother, if you know what values to use for n. y would be what you currently have ROUND(88799 * RAND(), 0) and x0,x1 = 1,88799 I think, though I might misunderstand it. The only non-standard maths operator involved from a T-SQL perspective is ^ which is just POWER(x,y) == x^y.