How to find multiple subsets of numbers that are approximately equal to a given value? - vba

I am using VBA that gets data from an Excel 2013 spreadsheet. I have a couple years experience in computer science from a while back using VBA and java, but I'm by no means an expert.
I have a column of numbers ranging from 20 to 60 total. Each of those numbers represents 'minutes' and can range from 3 to 500 (normally 60 to 300). Each number has an assigner called a 'load number' (such as N03, N22 and etc.) and a date/time. All of these values are attributed to a 'load' that needs to be picked. 'Pickers' are the ones that have the loads or minutes assigned to them. They can only pick so many minutes per given day which ranges from 400-600 (8 hour shift = 400 minutes).
What I need to do is assign sets of loads that are equal to an approximate amount of total minutes (set number w/ threshold) to two groups of pickers (The groups are AM and PM, each have 3-5 pickers). Once one load is assigned to a picker, it can't be assigned to another UNLESS the loads for a given day have too many minutes and all the pickers can't be assigned an approximate amount of minutes.
Example: Out of 8 pickers, 6 can be assigned loads totaling between 380-420 minutes, but 2 can't be assigned between 380-420 because of the remaining loads.
In the case of the given example, for the remaining 2 pickers, a total of 760 - 840 minutes can be assigned to BOTH of them.
Loads also need to be assigned based on their date/time. If pickers are picking loads due on the same day, the earliest loads need to be assigned to the AM group of pickers and, accordingly, the latest to the PM group of pickers. If all loads to be assigned are for the next day, they can be assigned to anyone as long as the earliest loads are prioritized.
Example: AM shift starts at 5AM w/ 5 pickers. There are three loads that are 200 minutes (4 hours, actual) due at 9AM on the same day.
The three loads should be assigned to three different pickers, so the loads can be done on time. They would be marked as the #1 load, so each picker knows to do it first
Example: Another load is due at 9AM on the same day. It is 400 minutes though.
2 pickers can be assigned to this load as their #1 pick and 200 minutes would be assigned to both of them.
Once the loads are assigned to the pickers, the results will be displayed in a separate spreadsheet with each row having: AM/PM, Picker's name, Load number #'s 1-10 w/ load number and minutes to pick and the total minutes.
Example: PICKER | AM | Toby | 029-N10 (268), 030-N05 (93), 030-N04 (111) | 472 TOTAL
Any help / pointers on this problem would be appreciated. I've looked at similar questions posted on here and abroad, but couldn't find any that would give me enough to go by to start working on a solution. It's not too bad assigning loads manually, but it gets complex once there are over 30 loads and 4,000 minutes total, especially when most of them are larger. It would just be much easier to have a program assign everything and save 1-2 hours in the process every day.
The data, in Excel, is structured into 8 columns and up to 50 rows. Each row represents a 'load' and has only 3 useful cells. I got all the information into three arrays, which can be used to display the info for any load by using the same element (1-50) for each array.
Dim LoadNumbers(1 To 50) As String
Dim LoadTimes(1 To 50) As Double
Dim LoadMinutes(1 To 50) As Double
Dim C As Integer
C = 1
' Read rows 1 to 50; <= so that the last row is not skipped
Do While C <= 50
    LoadNumbers(C) = Cells(C, 2)
    LoadTimes(C) = Cells(C, 5) * 24   ' day fraction -> hours
    LoadMinutes(C) = Cells(C, 7)
    C = C + 1
Loop
For example:
LoadNumbers(5) & " # " & LoadTimes(5) & " Hours PST # " & LoadMinutes(5) & " Minutes"
Will return:
039-N06  # 9.5 Hours PST # 67.4 Minutes (9.5 hours = 9:30AM)
The LoadTimes and LoadMinutes arrays are the ones I need to assign loads. I will have another two cells where users will input the desired minutes (M) to be assigned and the threshold (T). I then need the VBA script to assign between M-T and M+T minutes to each picker.
Here's what the values in LoadMinutes look like:
There's 29 loads # 3,818 minutes total
Let's say the minutes need to be between 430 and 470. Out of those 29 loads, I need to assign sets of different numbers adding up to 430 to 470 based on their time. The times in LoadTimes range from 7 to 20 (7AM to 8PM).
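One possible starting point, sketched in Python rather than VBA for brevity: sort the loads by due time and fill each picker in turn until the total lands in the band M-T to M+T. The function name assign_loads, the sample loads, and the first-fit rule are all assumptions for illustration. The sketch ignores the AM/PM split and the splitting of one load across pickers described above, and it does not guarantee that every picker reaches the band.

```python
def assign_loads(loads, pickers, target, threshold):
    """First-fit sketch. loads: list of (load_number, due_hour, minutes)."""
    remaining = sorted(loads, key=lambda load: load[1])  # earliest due first
    plan = {}
    for picker in pickers:
        chosen, total = [], 0.0
        for load in list(remaining):
            # Take the load only if it keeps the picker at or below M + T.
            if total + load[2] <= target + threshold:
                chosen.append(load)
                total += load[2]
                remaining.remove(load)
            # Stop as soon as the picker is inside the band.
            if total >= target - threshold:
                break
        plan[picker] = (chosen, total)
    return plan, remaining


# Hypothetical loads, not taken from the spreadsheet in the question.
sample = [("N01", 7, 200), ("N02", 8, 250), ("N03", 9, 180),
          ("N04", 10, 240), ("N05", 11, 60)]
plan, leftover = assign_loads(sample, ["A", "B"], target=450, threshold=20)
for picker, (chosen, total) in plan.items():
    print(picker, [load[0] for load in chosen], total)
print("unassigned:", [load[0] for load in leftover])
```

Any picker whose total falls outside the band, and any leftover loads, would then need the manual or shared handling described in the question.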


Pandas - Extract all content based on certain keywords

I am trying to extract all the content from a Dataframe till a specific word appears. I am trying to extract the entire content till the following words appear:
high, medium, low
Sample view of the text in the Dataframe:
Ticket creation dropped in last 24 hours medium range for cust_a
Calls dropped in last 3 months high range for cust_x
Expected output:
text, new_text
Ticket creation dropped in last 24 hours medium range for cust_a, Ticket creation dropped in last 24 hours
Calls dropped in last 3 months high range for cust_x, Calls dropped in last 3 months
You need replace and regex.
The idea will be to match any words from your list and then replace it and anything after it.
We use .* to match anything until the end of a string:
words = 'high, medium, low'
match_words = '|'.join(words.split(', '))
df['new_text'] = df['text'].str.replace(f"({match_words}).*",'',regex=True)
0 Ticket creation dropped in last 24 hours
1 Calls dropped in last 3 months
Name: text, dtype: object
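The pattern above also leaves the space before the keyword in place and would cut at "high" inside a longer word. A minimal sketch of a stricter pattern follows, checked with the standard re module so it can be tried without a DataFrame. The helper name truncate is an assumption; the same pattern string could be passed to df['text'].str.replace(..., regex=True).

```python
import re

words = "high, medium, low"
# Escape each keyword and require word boundaries around it.
alternatives = "|".join(re.escape(w) for w in words.split(", "))
# \s* also removes the whitespace just before the keyword.
pattern = re.compile(rf"\s*\b(?:{alternatives})\b.*")


def truncate(text):
    """Return text up to (not including) the first keyword."""
    return pattern.sub("", text)


print(truncate("Ticket creation dropped in last 24 hours medium range for cust_a"))
print(truncate("Calls dropped in last 3 months high range for cust_x"))
```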

Weighted Activity Selection Problem with allowing shifting starting time

I have some activities with weights, and I would like to select non-overlapping activities while maximizing the total weight. This is a known problem and a solution exists.
In my case, I am allowed to shift the start time of activities to some extent while the duration remains the same. This gives me some flexibility and might increase my utilization.
Example scenario is something like the following where all activities are supposed to be in interval (0-200):
(start, end, profit)
a1: 10 12 120
a2: 10 13 100
a3: 14 18 150
a4: 14 20 100
a5: 120 125 100
a6: 120 140 150
a7: 126 130 100
Without shifting flexibility, I would choose (a1, a3, a6) and that is it. On the other hand, I have the flexibility to shift any task to the left or right by at most t units, where t is given. In that case I might come up with this schedule, in which all tasks can be selected except a7, since its conflict cannot be avoided by shifting.
t: 5
a1: 8 10 120 (shifted -2 to left)
a2: 10 13 100
a3: 14 18 150
a4: 18 24 100 (shifted +4 to right)
a5: 115 120 100 (shifted -5 to left)
a6: 120 140 150
In my problem, the total time I have is very large compared to the activity durations. While activities last about 10 seconds on average, the total time could be as much as 10000 seconds. However, that does not mean all activities can be selected, since the shifting flexibility would not be enough to keep some activities from overlapping.
Also in my problem, there are clusters of overlapping activities separated by very large empty gaps with no activities, i.e. a1, a2, a3 and a4 are, let's say, cluster1 and a5, a6 and a7 are cluster2. Each cluster can be expanded in time by shifting some of its activities to the left and right. By doing that, I can select more activities than in the original activity selection problem. However, I do not know how to decide which tasks should be shifted to the left or right.
My expectation is to find a near-optimal solution where the total profit is some kind of local optimum. I do not need the global optimum. Also, I do not have any criteria about cluster utilization, i.e. I do not have a guarantee about a minimum number of activities per cluster etc. Actually, these clusters are something I describe visually. There are no defined clusters. However, in the time domain, activities are separated into clusters somehow.
Also, activity start and end times are all integers since I can dismiss fractions. I would have around 50 activities whose duration would be 10 on average. And the time window is around 10000.
Is there any feasible solution to this problem?
You mentioned that you can partition the activities into clusters that don't overlap even if the activities within them are shifted to the full extent. Each of these clusters can be considered independently, and the optimal results computed for each cluster simply summed up for the final answer. So the first step of the algorithm could be a trial run that extends all activities in both directions, finds which ones form clusters, and processes each cluster independently. In the worst case, all of the activities might form a single cluster.
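A minimal sketch of that trial run, assuming each activity is a (name, start, end, weight) tuple and that activities which merely touch do not conflict. The function name split_into_clusters is made up for illustration. It widens every activity to [start - t, end + t] and merges the windows that overlap.

```python
ACTIVITIES = [
    ("a1", 10, 12, 120), ("a2", 10, 13, 100), ("a3", 14, 18, 150),
    ("a4", 14, 20, 100), ("a5", 120, 125, 100), ("a6", 120, 140, 150),
    ("a7", 126, 130, 100),
]


def split_into_clusters(activities, t):
    """Group activities whose widened windows [start - t, end + t] overlap."""
    clusters = []
    reach = None  # right edge (end + t) of the cluster being built
    for act in sorted(activities, key=lambda a: a[1]):  # by original start
        _, start, end, _ = act
        if clusters and start - t < reach:
            clusters[-1].append(act)
            reach = max(reach, end + t)
        else:
            clusters.append([act])
            reach = end + t
    return clusters


for cluster in split_into_clusters(ACTIVITIES, t=5):
    print([a[0] for a in cluster])
```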
Depending on the maximum size of the remaining clusters, there are several approaches. If it's under 20 (or even 30, depending on whether you want your program to run in seconds or minutes), you could combine search over all subsets of activities in the given cluster with a greedy approach. In other words: if you are processing a subset of N elements, try every one of its 2^N possible subsets (okay, 2^N-1 if we forget the empty subset), check whether the activities in this specific subset can be scheduled in non-overlapping manner, and pick the subset that is eligible and has maximum sum.
How do we check that a given subset of activities can be scheduled in a non-overlapping manner? Let's sort them in ascending order of their end and process them from left to right. For every activity, we try to schedule it as early as possible, making sure it does not intersect with the activities we have already considered. So, the first activity in the cluster is always started time t earlier than originally planned, the second one is started either when the first one ends, or t earlier than originally planned, whichever is larger, and so on. If at any point we can't schedule the next activity in a way that it does not overlap with the previous one, then there is no way to schedule the activities in this subset in a non-overlapping manner. This algorithm takes O(NlogN) time, and overall each cluster is processed in O(2^N * NlogN). Once again, note that this function grows very quickly, so if you are dealing with large enough clusters, this approach goes out the window.
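The subset search and the greedy check can be sketched as follows. The names can_schedule and best_subset are assumptions, and activities are (name, start, end, weight) tuples. The check follows the end-time ordering exactly as described, so treat it as a heuristic: when durations differ a lot (a long activity ending just after a short one), this ordering can reject a subset that a different order would fit.

```python
from itertools import combinations


def can_schedule(subset, t):
    """Greedy check: process by original end, start each as early as allowed."""
    cursor = float("-inf")  # time at which the previous activity finishes
    for _, start, end, _ in sorted(subset, key=lambda a: a[2]):
        begin = max(start - t, cursor)
        if begin > start + t:  # would need a shift larger than t
            return False
        cursor = begin + (end - start)
    return True


def best_subset(cluster, t):
    """Try every non-empty subset; keep the heaviest one that passes the check."""
    best, best_weight = (), 0
    for size in range(1, len(cluster) + 1):
        for subset in combinations(cluster, size):
            weight = sum(a[3] for a in subset)
            if weight > best_weight and can_schedule(subset, t):
                best, best_weight = subset, weight
    return best, best_weight


cluster2 = [("a5", 120, 125, 100), ("a6", 120, 140, 150), ("a7", 126, 130, 100)]
chosen, total = best_subset(cluster2, t=0)
print([a[0] for a in chosen], total)
```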
Another approach is specific to the additional restrictions you provided. If the activities' starts and ends and parameter t are all measured in integer number of seconds, and t is about 2 minutes, then the problem for each cluster is set in a small discrete space. Even though you could position a task to start at a non-integer second value, there always is an optimal solution that uses only integers. (To prove it, consider an optimal solution that does not use integers - since t is integer, you can always shift tasks, starting from the leftmost, to the left a bit so that it starts at an integer value.)
Knowing that the start and end times are discrete, you can build a DP solution: process the activities in the ascending order of their end*, and memoize the maximum possible sum of weights you can obtain from the first 1, 2, ..., N activities for each x from activity_end - t to activity_end + t if a given activity ends at time x. If we denote this memoized function as f[activity][end_time], then the recurrence relation is f[a][e] = weight[a] + max(f[i][j] over all i < a, j <= e - (end[a] - start[a])), which roughly translates to "if activity a ended at time e, the previous activity must have ended at or before start of a - so let's pick the maximum total weight over previous activities and their ends, and add the current activity's weight".
*Again, we can prove that there is at least one optimal answer where this ordering is preserved, even though there might be other optimal answers which do not possess this property
We could go further and eliminate the iteration over previous activities, instead encoding this information in f. Its definition would then change to "f[a][e] is the maximum possible total weight of the first a activities if none of them ends after e", and the recurrence relation would become f[a][e] = max(f[a-1][e], weight[a] + max(f[a-1][i] over i <= e - (end[a] - start[a]))), and its computational complexity would be O(X * N), where X is the total span of the discrete space where task starts/ends are placed.
I assume you need to compute not just the maximum possible weight, but also the activities you need to select to obtain it, and possibly even the exact time each of them needs to be started. Thankfully, we can derive all of this from the values of f, or compute it at the same time as we compute f. The latter is easier to reason about, so let's introduce a second function g[activity][end]. g[activity][end] returns a pair (last_activity, last_activity_end), essentially pointing us to the exact activity and its timing that the optimal weight in f[activity][end] uses.
Let's go through the example you provided to illustrate how this works:
(start, end, profit)
a1: 10 12 120
a2: 10 13 100
a3: 14 18 150
a4: 14 20 100
a5: 120 125 100
a6: 120 140 150
a7: 126 130 100
We order the activities by their end time, thereby swapping a7 and a6.
We initialize the values of f and g for the first activity:
f[1][7] = 120, f[1][8] = 120, ..., f[1][17] = 120, meaning that the first activity could end anywhere from 7 to 17, and is worth 120. f[1][i] for all other i should be set to 0.
g[1][7] = (1, 7), g[1][8] = (1, 8), ..., g[1][17] = (1, 17), meaning that the last activity that was included in f[1][i] values was a1, and it ended at i. g[1][i] for all i outside [7, 17] is undefined/irrelevant.
That's where something interesting begins. For each i such that a2 cannot end at time i, let's assign f[2][i] = f[1][i], g[2][i] = g[1][i], which essentially means that we wouldn't be using activity a2 in those answers. For all other i, namely, in [8..18] interval, we have:
f[2][8] = max(f[1][8], 100 + max(f[1][0..5])) = f[1][8]
f[2][9] = max(f[1][9], 100 + max(f[1][0..6])) = f[1][9]
f[2][10] = max(f[1][10], 100 + max(f[1][0..7])). This is the first time when the second clause is not just plain 100, as f[1][7]>0. It is, in fact, 100+f[1][7]=220, meaning that we can take activity a2, shift it in a way that puts its end at time 10, and get a total weight of 220. We continue computing f[2][i] this way for all i <= 18.
The values of g are: g[2][8]=g[1][8]=(1, 8), g[2][9]=g[1][9]=(1, 9), g[2][10]=(2, 10), because it was optimal to take activity a2 and end it at time 10 in this case.
I hope the pattern of how this continues is visible - we compute all the values of f and g through the end, and then pick the maximum f[N][e] over all possible end times e of the last activity. Armed with the auxiliary function g, we can traverse the values backwards to figure out the exact activities and times. Namely, the last activity we use and its timing is in g[N][e]. Let's call them A and T. We know that A began at T-(end[A]-start[A]). Then, the previous activity must have ended at that point or before - so let's look at g[A-1][T-(end[A]-start[A])] for it, and so on.
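A compact sketch of the "none of them ends after e" variant of f, with g kept as one row per activity. The function name max_weight_schedule and the tuple layout (name, start, end, weight) are assumptions. It relies on the same end-time ordering as the text, and it lets activities touch (one may start exactly when the previous one ends).

```python
def max_weight_schedule(activities, t):
    """Return (best total weight, [(name, begin, finish), ...])."""
    acts = sorted(activities, key=lambda a: a[2])  # ascending original end
    lo = min(a[1] for a in acts) - t               # earliest possible time
    hi = max(a[2] for a in acts) + t               # latest possible time
    width = hi - lo + 1
    f = [0] * width   # f[e - lo]: best weight so far, nothing ends after e
    g = []            # per activity: end time used in that optimum, or None
    for _, start, end, weight in acts:
        dur = end - start
        new_f, row = [0] * width, [None] * width
        for idx in range(width):
            e = lo + idx
            best, pick = f[idx], None              # option: skip the activity
            if idx > 0 and new_f[idx - 1] > best:  # same optimum as at e - 1
                best, pick = new_f[idx - 1], row[idx - 1]
            if end - t <= e <= end + t:            # option: end exactly at e
                cand = weight + f[idx - dur]
                if cand > best:
                    best, pick = cand, e
            new_f[idx], row[idx] = best, pick
        f = new_f
        g.append(row)
    schedule, idx = [], width - 1                  # walk g backwards
    for (name, start, end, _), row in zip(reversed(acts), reversed(g)):
        if row[idx] is not None:
            finish = row[idx]
            begin = finish - (end - start)
            schedule.append((name, begin, finish))
            idx = begin - lo
    return f[-1], schedule[::-1]


ACTIVITIES = [
    ("a1", 10, 12, 120), ("a2", 10, 13, 100), ("a3", 14, 18, 150),
    ("a4", 14, 20, 100), ("a5", 120, 125, 100), ("a6", 120, 140, 150),
    ("a7", 126, 130, 100),
]
print(max_weight_schedule(ACTIVITIES, t=5))
```

On the example data with t = 5 this sketch reports a total of 820 with all seven activities placed (a7 can end at 125 if a6 is shifted to start at 125), which is more than the schedule listed in the question. Whether that counts as valid depends on whether touching endpoints are acceptable in your setting.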
Note that this approach works even if you don't partition anything into clusters, but with the partitioning, the size of the space in which tasks can be scheduled is reduced, and with it the runtime.
You might notice that neither of these solutions is polynomial in the size of input. I have a feeling that your problem doesn't have a general polynomial solution, but I was unable to prove it by reducing another NP-complete problem to it. Would be really curious to read a reduction / better general solution!

how to calculate a rolling average based on a column in spotfire

I have a data set where you have a Document Property that Selects "items", each "item" has a particular "usage days". I want to calculate an output of "Moving Average" for 1 or more selected items. the data for the moving average lives under a column named "usage days".
How do I calculate this taking into account the "selected date of my choice" and the rolling average number of days of my choice.
Do you have particular ideas of how I can perform the calculation i.e. in a calculated column or a text field?
Car/ Trip / Start Date/ End Date / Days on trip
1 AB123 / 2 / 6/07/2013
1 AB234 / 29/07/2013 / 6/09/2013 / 42
1 AB345 /6/09/2013 /28/09/2013 /22
1 AB456 /29/09/2013 /21/10/2013 /23
2 AB567 / 26/10/2013 / 12/11/2013 / 22
2 AB678 /12/11/2013 /8/12/2013 /26
[The rows above have an example of the problem (sorry, I couldn't paste an image because I'm new). I want to calculate the % usage of the car or cars for a selected range of time, e.g. select the date range July to August, then ((# of days on trip for cars 1 and 2) / # of days in that period) / 2 * 100]
As phiver said, it is still difficult to see what you expect as a result... but I think I have something that might work. First, I slightly altered the dataset you provided, like so:
car trip startDate endDate daysOnTrip
1 AB123 7/6/2013 7/29/2013 23
1 AB234 7/29/2013 9/6/2013 42
1 AB345 9/6/2013 9/28/2013 22
1 AB456 9/29/2013 10/21/2013 23
2 AB567 10/26/2013 11/12/2013 22
2 AB678 11/12/2013 12/8/2013 26
I then added 2 document properties, "DateRangeFirst" and "DateRangeLast", to allow the user to select beginning and ending dates. Next I made input box property controls for each of the aforementioned document properties in a text area so the user can alter the date range. I then added a datatable visualization with a "Limit data using expression:" of "[startDate] >= Date(${DateRangeFirst}) and [endDate]<= Date(${DateRangeLast})" so we could see the trips selected. Finally, to get the average you appear to be looking for, I added a barchart set to % of total (daysOnTrip) / car with the same data limiting expression as above. The below screenshot should have everything you need to reproduce my results. I hope this gives you what you need.
NOTE: With this method if you select a date in the middle of a trip, an entire row and all of the days on that trip will be ignored.
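To illustrate the arithmetic from the question, here is a minimal sketch in Python rather than a Spotfire expression. It clips each trip to the selected range instead of dropping trips that straddle a boundary. The function name usage_percent is an assumption, and counting days as end minus start (the way daysOnTrip is computed in most rows) is also an assumption.

```python
from datetime import date

# (car, trip, startDate, endDate) from the altered dataset above
TRIPS = [
    (1, "AB123", date(2013, 7, 6), date(2013, 7, 29)),
    (1, "AB234", date(2013, 7, 29), date(2013, 9, 6)),
    (1, "AB345", date(2013, 9, 6), date(2013, 9, 28)),
    (1, "AB456", date(2013, 9, 29), date(2013, 10, 21)),
    (2, "AB567", date(2013, 10, 26), date(2013, 11, 12)),
    (2, "AB678", date(2013, 11, 12), date(2013, 12, 8)),
]


def usage_percent(trips, first, last, cars):
    """(days on trip inside [first, last]) / days in period / number of cars * 100."""
    period_days = (last - first).days
    used = 0
    for car, _, start, end in trips:
        if car in cars:
            # Clip the trip to the selected range; negative means no overlap.
            overlap = (min(end, last) - max(start, first)).days
            used += max(overlap, 0)
    return used / period_days / len(cars) * 100


print(round(usage_percent(TRIPS, date(2013, 7, 1), date(2013, 8, 31), {1, 2}), 1))
```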

Google Spreadsheet with SQL query - finding best combination

I have a google spreadsheet for my gaming information. It contains 2 sheets - one for monster information, another for team.
Monster information sheet contains the attack value, defend value, and the mana cost of monsters. It's almost like a database of monsters that I can summon.
Team sheet does the following:
Asks for the amount of mana I currently have.
Computes a list of up to 5 monsters that I can summon (it can be less than 5).
Each monster has their own mana cost, therefore total mana cost mustn't exceed the amount of mana I have given in point 1.
The tabulated list should give me a team that have the highest combined attack value. It does not matter how many monsters are summoned. Each monster cannot be summoned twice though.
I have been thinking of using query() function so that I can make use of SQL statements. (so that I can hopefully retrieve the tabulated list directly)
Sample: Monster Info
1 Monster Attack Defense Cost
2 MonA 1200 1200 35
3 MonB 1400 1300 50
... ...
Sample: Team
1 Mana 120
3 Attack Team
4 Monster Attack Cost Total Attack
5 MonB 1400 50 1400
6 MonA 1200 35 2600
7 ... ...
I have these formulas in the "Team" sheet:
A5: =query('Monster Info'!$A$:$D,"SELECT A,B,D ORDER BY B DESC LIMIT 5")
B5: =CONTINUE(A5, 1, 2)
C5: =CONTINUE(A5, 1, 3)
D5: =C5
A6: =CONTINUE(A5, 2, 1)
B6: =CONTINUE(A5, 2, 2)
C6: =CONTINUE(A5, 2, 3)
D6: =D5+C6
That only gets the 5 best attack monsters, regardless of the mana cost consideration. How do I do that such that it takes consideration of both attack value and mana cost value? There is another problem shown in the example below:
Example: (simplified version, without defense value etc)
Monster Attack Cost
MonA 1400 50
MonB 1200 35
MonC 1100 30
MonD 900 25
MonE 500 20
MonF 400 15
MonG 350 10
MonH 250 5
If I have 160 mana, then the obvious team is A+B+C+D+E (5100 Attack).
If I have 150 mana, it becomes A+B+C+D+G (4950 Attack).
If I have 140 mana, it becomes A+B+C+D (4600 Attack).
If I have 130 mana, it becomes B+C+D+E+F (4100 Attack using 125 mana) or A+B+C+F (4100 Attack using all 130 mana).
If I have 120 mana, it becomes B+C+D+E+G (4050 Attack).
If I have 110 mana, it becomes B+C+D+F+H (3850 Attack).
As you can see, there isn't really a pattern within the results.
Any expert willing to share their insights on this?
I've played with the problem for an hour and I only have a workaround here. Your problem seems to be a standard linear programming task (with 0/1 variables), which can easily be solved by "Solver" software. There used to be a so-called "Solver" in Google Spreadsheets, but unfortunately it was removed from the newest version. If you are not insisting on a Google solution, you should try it in one of the spreadsheet applications that support a Solver.
I tried MS Office (it has a Solver add-in, installation guide:
Before you run the solver, you should prepare your original dataset a bit, with helper columns and cells.
Add a new column next to the "Cost" column (let's assume it is column "D"), and under it put each row either 0, or 1. This column will tell you if a monster is selected to the attack team or not.
Add two more columns ("E" and "F" respectively). These columns will be the products of the Attack and of the Cost with the selection column, respectively. So you should write a formula in the E2 cell: =b2*d2, and in the F2 cell: =c2*d2. This way, if a monster is selected (which is told by the D column, remember), the corresponding E and F cells will be non-zero values, otherwise they will be 0.
Create a SUM row under the last row, and create a summarizing function for the D,E,F columns respectively. So in my spreadsheet D10 cell gets its value like this: =sum(d2:d9), and so on.
I created a spreadsheet to show these steps:
Remember to copy this worksheet to an MS Office worksheet, before you start the Solver.
Now, you are ready to start the Solver. (Data menu, Solver in MS Office). You can see a video here on using the Solver:
It's not as hard as it looks, but for this case I'll describe what to write where:
Set Objective: you should select the "E10" cell, as that represents the sum of all the attack points.
Check "Max" radiobutton as we would like to maximize the value of the attacks.
By Changing variable cells: Select the "d2:d9" interval as those cells are representing whether a monster is selected or not. The solver will try to adjust these values (0, or 1) in order to maximise the sum attack.
Subject to the Constraints: Here we should add some constraints. Click on the Add button, and then:
First we should ensure that d2:d9 are all binary values. So "Cell reference" should be "d2:d9" and from the dropdown menu, select "bin" as binary.
Another constraint should be that the sum of the selected monsters should not exceed 5. So select the cell where the sum of the selected monsters is represented (D10) and add "<=" and the value "5"
Finally, we cannot use more mana than we have, so select the cell in which you store the sum of used mana (F10), and "<=", and add the whole amount of mana we can spend (in my case it's in the I2 cell).
Done. It should work, in my case it worked at least.
Hope it helps anyway.
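For a table this small, the same model (a 0/1 choice per monster, at most 5 selected, total cost within the mana) can also be checked by brute force. This is only a sketch to cross-check the Solver result, not a spreadsheet formula, and the function name best_team is an assumption. The data is the simplified table from the question.

```python
from itertools import combinations

# (monster, attack, cost) from the simplified example in the question
MONSTERS = [
    ("MonA", 1400, 50), ("MonB", 1200, 35), ("MonC", 1100, 30),
    ("MonD", 900, 25), ("MonE", 500, 20), ("MonF", 400, 15),
    ("MonG", 350, 10), ("MonH", 250, 5),
]


def best_team(monsters, mana, max_size=5):
    """Try every team of 1..max_size monsters; keep the affordable one with most attack."""
    best, best_attack = (), 0
    for size in range(1, max_size + 1):
        for team in combinations(monsters, size):
            cost = sum(m[2] for m in team)
            attack = sum(m[1] for m in team)
            if cost <= mana and attack > best_attack:
                best, best_attack = team, attack
    return [m[0] for m in best], best_attack


print(best_team(MONSTERS, 150))
```

When two teams tie on attack (as with 130 mana in the question), this sketch simply keeps the first one it finds.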

Daemon to monitor query and send mail conditionally in SQL Server

I've been melting my brains over a peculiar request: execute every two minutes a certain query and if it returns rows, send an e-mail with these. This was already done and delivered, so far so good. The result set of query is like this:
| ID | last_update |
| 21 | 2011-07-20 13:03:21 |
| 32 | 2011-07-20 13:04:31 |
| 43 | 2011-07-20 13:05:27 |
| 54 | 2011-07-20 13:06:41 |
The trouble starts when the user asks me to modify the solution so that, e.g., the first time that ID 21 is caught being more than 5 minutes old, the e-mail is sent to a particular set of recipients; the second time, when ID 21 is between 5 and 10 minutes old, another set of recipients is chosen. So far it's ok. The gotcha for me is from the third time onwards: the e-mails are now sent each half-hour, instead of every five minutes.
How should I keep track of the status of Mr. ID = 43 ? How would I know if he has already received an e-mail, two or three? And how to ensure that from the third e-mail onwards, the mails are sent each half-hour, instead of the usual 5 minutes?
I get the impression that you think this can be solved with a simple mathematical formula. And it probably can be, as long as your system is reliable.
Every thirty minutes can be seen as 360 degrees, or 2 pi radians, on a harmonic function graph. That's 12 degrees = 1 minute. Let's take cosine for instance:
f(x) = cos(x)
f(x) = cos(elapsedMinutes * 12 degrees)
Where elapsed minutes is the time since the first 30 minute update was due to go out. This should be a constant number of minutes added to the value of last_update.
Since you have a two minute window of error, it will be time to transmit the 30 minute update if the value of f(x) (above) is greater than the value you would get one minute before or after the scheduled update. That value is cos(1 * 12 degrees) = 0.9781476007338056379285667478696.
Bringing it all together, it's time to send a thirty minute update if this SQL expression is true:
COS(RADIANS(12.0 * DATEDIFF(minute,
DATEADD(minute, constantNumberOfMinutesBetweenSecondAndThirdUpdate, last_update),
CURRENT_TIMESTAMP))) > 0.9781476007338056379285667478696
If you need a wider window than exactly two minutes, just lower this number slightly.
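The same test can be tried outside the database. A minimal sketch in Python, where the function name half_hour_update_due is made up and elapsed_minutes stands for the minutes since the first 30 minute update was due:

```python
import math

# cos(12 degrees): the value one minute before or after a scheduled update
THRESHOLD = math.cos(math.radians(12))


def half_hour_update_due(elapsed_minutes):
    """True when elapsed_minutes is within a minute of a multiple of 30."""
    return math.cos(math.radians(12 * elapsed_minutes)) > THRESHOLD


for minutes in (0, 5, 15, 29.5, 30):
    print(minutes, half_hour_update_due(minutes))
```

Note that this only decides when a half-hourly mail is due. Remembering how many mails each ID has already received would still need some stored state, which the formula does not provide.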