I'm trying to make a mathematical model for flexible job shop scheduling. In this model, operations can be divided into certain subgroups, and relations can be defined between operations inside a subgroup and those outside of it. If an operation belongs to subgroup y, the operation's y value is 1; otherwise it is set to 0. Whenever I create a relation between two operations, I can then use a sufficiently large number L so that
T1 + ((1 - y1) * L) <= T2 + ((1 - y2) * L)
This ensures that the T value of operation 1 is smaller than that of operation 2 when operation 1 belongs to subgroup y and operation 2 doesn't.
While this method works as an improvised if-statement, it seems incredibly clunky. Does anyone know of a better way to achieve the same effect?
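A quick way to sanity-check a big-M formulation like this is to enumerate the membership combinations on concrete numbers. A minimal pure-Python sketch (L and the T values are arbitrary illustrative choices, not from the actual model):

```python
L = 1000  # "sufficiently large": must exceed any feasible difference in T values

def big_m_holds(t1, t2, y1, y2):
    """The constraint T1 + (1 - y1)*L <= T2 + (1 - y2)*L."""
    return t1 + (1 - y1) * L <= t2 + (1 - y2) * L

# With T1 = 5 and T2 = 3 (so plain T1 <= T2 would be violated):
for y1, y2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(y1, y2, big_m_holds(5, 3, y1, y2))
# Only (y1=1, y2=0) relaxes the inequality; (0,0) and (1,1) reduce to
# plain T1 <= T2, and (0,1) tightens it to T1 + L <= T2.
```

As to a less clunky alternative: many MIP solvers (e.g. Gurobi, CPLEX) support indicator constraints, which express "if this binary is 1, enforce this linear constraint" directly, without hand-picking an L.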
I have three data sets of distance estimates made by humans under three different experimental methods (real life, virtual reality, and computer-based simulation). I want to compare how humans differ in estimating distances across these three methods. Which statistical analysis would be good to use? My dependent variable is the distance estimate and my independent variable is the experimental condition (three levels).
Thank you.
It depends on whether each individual took part in all 3 conditions, or whether you essentially have 3 different experiments with an entirely new subject pool each time.
In the first case, where a single individual took part in all conditions, you could use a one-way repeated-measures ANOVA, with method as a 3-level within-subjects factor. Alternatively, a mixed-effects regression with random slopes/intercepts for the subjects.
In the second case, where an individual only took part in 1 condition, you could simply use a one-way ANOVA, with method as a 3-level between-subjects factor.
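For the between-subjects case, the one-way ANOVA F statistic can be computed by hand; a pure-Python sketch (the three groups below are made-up distance estimates, one list per condition, not real data):

```python
def one_way_anova_f(groups):
    """One-way between-subjects ANOVA: F = MS_between / MS_within."""
    k = len(groups)                      # number of conditions
    n = sum(len(g) for g in groups)      # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-groups sum of squares: variation of group means around grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: variation of observations around group means
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

real_life = [1, 2, 3]
vr        = [2, 3, 4]
computer  = [3, 4, 5]
print(one_way_anova_f([real_life, vr, computer]))  # 3.0 for this toy data
```

In practice you would use a statistics package (e.g. `scipy.stats.f_oneway` gives the same F plus a p-value), but the hand computation shows what the test compares.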
An argument in favor of graph DBMSs with native storage over relational DBMSs, made by Neo4j (also in the Neo4j graph databases book), is that "index-free adjacency" is the most efficient means of processing data in a graph (due to the 'clustering' of the data/nodes in a graph-based model).
Based on some benchmarking I've performed, where 3 nodes are sequentially connected (A->B<-C) and, given the id of A, I query for C, the scaling is clearly O(n) when testing the same query on databases with 1M, 5M, 10M and 20M nodes. This is reasonable (with my limited understanding), since I am not limiting my query to one node, so all nodes need to be checked for a match. However, when I index the queried node property, the execution time for the same query is relatively constant.
The figure shows execution time by database node count before and after indexing. The orange plot is an O(n) reference line, while the blue plot shows the observed execution times.
Based on these results I'm trying to figure out where the advantage of index-free adjacency comes in. Is this advantageous when querying with a limit of 1 for deep(er) links? E.g. depth of 4 in A->B->C->D->E, and querying for E given A. Because in this case we know that there is only one match for A (hence no need to brute force through all the other nodes not part of this sub-network).
As this is highly dependent on the query, I'm listing an example of the Cypher query below for reference (where I'm matching entity labeled node with id of 1, and returning the associated node (B in the above example) and the secondary-linked node (C in the above example)):
MATCH (:entity{id:1})-[:LINK]->(result_assoc:assoc)<-[:LINK]-(result_entity:entity) RETURN result_entity, result_assoc
UPDATE / ADDITIONAL INFORMATION
This source states: "The key message of index-free adjacency is that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n)." This statement explains the O(n) results before indexing. I guess the O(1) performance after indexing is identical to hash-table performance(?). I'm not sure why using any other index gives O(n log n) complexity, when even a hash table's worst case is O(n).
From my understanding, the index-free aspect is only pertinent for adjacent nodes (that's why it's called index-free adjacency). What your plots are demonstrating is that once you find A, the additional time to find C is negligible, and the question of whether to use an index or not only concerns finding the initial queried node A.
To find A without an index takes O(n), because it has to scan through all the nodes in the database; with an index, it's effectively a hash-table lookup and takes O(1) (no clue why the book says O(n log n) either).
Beyond that, finding the adjacent nodes is not hard for Neo4j, because they are linked directly to A, whereas in the relational model the linkage is not as explicit - a join, which is expensive, and then a scan/filter are required. So to truly see the advantage, one should compare the performance of graph DBs and relational DBs while varying the depth of the relations/links. It would also be interesting to see how a query performs as the number of neighbours of the entity nodes increases (i.e., the graph becomes denser) - does Neo4j rely on graphs never being too dense? Otherwise the problem of looking through the neighbours to find the right one repeats itself.
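The distinction can be sketched in plain Python with hypothetical structures (this is an illustration of the idea, not Neo4j's actual storage format): each node holds direct references to its neighbours, so once the start node is found, traversal cost depends on path length and node degree, not on the total node count.

```python
class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.links = []  # direct references to adjacent nodes

def build_chain(n):
    """Build n nodes where node i links to node i+1 (A -> B -> C -> ...)."""
    nodes = [Node(i) for i in range(n)]
    for a, b in zip(nodes, nodes[1:]):
        a.links.append(b)
    return nodes

nodes = build_chain(100_000)

# Finding the start node WITHOUT an index: O(n) scan over all nodes.
start = next(node for node in nodes if node.id == 0)

# Finding it WITH an index (here a hash map): O(1) on average.
index = {node.id: node for node in nodes}
start = index[0]

# Index-free adjacency: hopping A -> B -> C is two pointer dereferences,
# regardless of whether the graph holds 100K or 20M nodes.
c = start.links[0].links[0]
print(c.id)  # 2
```

The plots in the question match this model: the pre-index O(n) curve is the scan for A, and the flat post-index curve is the hash lookup; the two adjacency hops cost the same in both cases.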
I have some data that I arrange into a collection of custom class objects.
Each object has a couple of properties aside from its unique name, which I will refer to as batch and exists
There are many objects in my collection, but only a few possible values of batch (although the number of possibilities is not pre-defined).
What is the easiest way to count occurrences of each possible value of batch?
Ultimately I want to create a userform something like this (values are arbitrary, for illustration):
Batch A 25 parts (2 missing)
Batch B 17 parts
Batch C 16 parts (1 missing)
One of my ideas was to make a custom "batch" class, which would have properties .count and .existcount and create a collection of those objects.
I want to know if there is a simpler, more straightforward way to count these values. Should I scrap the idea of a secondary collection and just create some loops and counter variables when I generate my userform?
You described well the two possibilities that you have:
Loop over your collection every time you need the count
Precompute the statistics, and access it when needed
This is a common choice one often has to make. I think here the tradeoff is performance vs. complexity.
Option 1 with a naive loop implementation takes O(n) time, where n is the size of your collection. And, unless your collection is static, you will have to recompute it every time you need your statistics. On the bright side, the naive loop is fairly trivial to write. Performance on frequent queries and/or large collections could suffer.
Option 2 is fast for retrieval, O(1) basically. But every time your collection changes, you need to update your statistics. This can be done incrementally, i.e. you do not have to go through the whole collection, just the changed items. But that means you need to handle every kind of update (new item, deleted item, updated item), so it's a bit more complex than the naive loop. If your collections are entirely new all the time and you query them only once, you have little to gain here.
So up to you to decide where to tradeoff according to the parameters of your problems.
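A sketch of option 2 in Python for illustration (the original context is VBA; class and method names here are placeholders): per-batch counts are kept up to date incrementally as items are added or removed, so reading the statistics is O(1).

```python
from collections import defaultdict

class BatchStats:
    def __init__(self):
        self.count = defaultdict(int)        # parts per batch
        self.exist_count = defaultdict(int)  # existing (non-missing) parts per batch

    def add(self, batch, exists):
        self.count[batch] += 1
        if exists:
            self.exist_count[batch] += 1

    def remove(self, batch, exists):
        self.count[batch] -= 1
        if exists:
            self.exist_count[batch] -= 1

    def summary(self, batch):
        missing = self.count[batch] - self.exist_count[batch]
        line = f"Batch {batch} {self.count[batch]} parts"
        return line + (f" ({missing} missing)" if missing else "")

stats = BatchStats()
for batch, exists in [("A", True), ("A", False), ("B", True)]:
    stats.add(batch, exists)
print(stats.summary("A"))  # Batch A 2 parts (1 missing)
print(stats.summary("B"))  # Batch B 1 parts
```

The equivalent in VBA would be the secondary collection of "batch" objects with `.count` and `.existcount` that the question describes, updated wherever items enter or leave the main collection.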
TL;DR version: Is there a way to cope with optimisation problems where a large number of optimal solutions exist (solutions that attain the best objective value)? That is, finding an optimal solution is pretty quick (though highly dependent on the size of the problem, obviously), but so many such solutions exist that the solver runs endlessly trying to find a better one (endlessly because it keeps finding other feasible solutions whose objective value equals the current best).
Not TL;DR version:
For a university project, I need to implement a scheduler that should output the schedule for every university programme per year of study. I'm provided some data and, for the purposes of this question, will simply stick to a general but not-so-rare example.
In many sections, you have mandatory courses and optional courses. Sometimes, those optional courses are divided into modules and the student needs to choose one of these modules. Often, they have to select two modules, but some combinations arise more often than others. Clearly, if you count the number of courses (mandatory + optional) without taking into account the subdivision into modules, you end up with more courses than time slots in which they need to be scheduled. My model is quite simple. I have constraints stating that every course should be scheduled to one and only one time slot (a period of 2 hours) and that a professor should not give two courses at the same time. Those are hard constraints. The thing is, in a perfect world, I should also add hard constraints stating that a student cannot have two courses at the same time. But because I don't have enough data and every combination of modules is possible, there is no point in creating one student per combination (mandatory + module 1 + module 2) and applying the hard constraints to each of these students, since it is basically identical to having one student (mandatory + all optionals) and trying to fit the hard constraints - which will fail.
This is why I decided to move those hard constraints into an optimisation problem. I simply define my objective function as minimising, for each student, the number of courses he/she takes that are scheduled simultaneously.
If I run this simple model with only one student (22 courses) and 20 time slots, I should get an objective value of 4 (since 2 time slots each hold 2 extra courses). But, using Gurobi, the relaxed objective is 0 (since you can have fractions of courses inside a time slot). Therefore, when the solver does reach a solution of cost 4, it cannot prove optimality directly. The real trouble is that, for this simple case, there exists a huge number of optimal solutions (22! maybe...). Therefore, to prove optimality, it will go through all the other solutions (which share the same objective), desperately trying to find a solution with a smaller gap between the relaxed objective (0) and the current one (4). Obviously, no such solution exists...
Do you have any idea how I could tackle this problem? I thought of analysing the existing database and trying to figure out which combinations of modules are very likely to happen, so that I can put back the hard constraints, but it seems hazardous (maybe I would select a combination that leads to a conflict, therefore finding no solution at all, or omit a valid combination). The current workaround I use is a time threshold to stop the optimisation...
I'm working on a project in which we will need to determine certain types of statuses for a large body of people, stored in a database. The business rules for determining these statuses are fairly complex and may change.
For example,
if a person is part of group X
and (if they have attribute O) has either attribute P or attribute Q,
or (if they don't have attribute O) has attribute P but not Q,
and don't have attribute R,
and aren't part of group Y (unless they also are part of group Z),
then status A is true.
Multiply by several dozen statuses and possibly hundreds of groups and attributes. The people, groups, and attributes are all in the database.
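The example rule above, written as a plain predicate (Python here purely for illustration; the group and attribute names are the placeholders from the text):

```python
def has_status_a(groups, attrs):
    """groups and attrs are sets of the group/attribute names a person holds."""
    if "X" not in groups:
        return False
    if "O" in attrs:
        attr_ok = "P" in attrs or "Q" in attrs       # with O: P or Q suffices
    else:
        attr_ok = "P" in attrs and "Q" not in attrs  # without O: P but not Q
    return (attr_ok
            and "R" not in attrs                     # must not have R
            and ("Y" not in groups or "Z" in groups))  # Y allowed only with Z

print(has_status_a({"X"}, {"O", "P"}))       # True
print(has_status_a({"X", "Y"}, {"O", "P"}))  # False (in Y without Z)
```

Keeping each rule in one small, testable unit like this, whatever the implementation language (SQL view, PL/SQL function, or rules engine), makes the "several dozen statuses" maintainable.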
Though this will be consumed by a Java app, we also want to be able to run reports directly against the database, so it would be best if the set of computed statuses were available at the data level.
Our current design plan, then, is to have a table or view that consists of a set of boolean flags (hasStatusA? hasStatusB? hasStatusC?) for each person. This way, if I want to query for everyone who has status C, I don't have to know all of the rules for computing status C; I just check the flag.
(Note that, in real life, the flags will have more meaningful names: isEligibleForReview?, isPastDueForReview?, etc.).
So a) is this a reasonable approach, and b) if so, what's the best way to compute those flags?
Some options we're considering for computing flags:
Make the set of flags a view, and calculate the flag values from the underlying data in real time using SQL or PL/SQL (this is an Oracle DB). This way the values are always accurate, but performance may suffer, and the rules would have to be maintained by a developer.
Make the set of flags consist of static data, and use some type of rules engine to keep those flags up-to-date as the underlying data changes. This way the rules can be maintained more easily, but the flags could potentially be inaccurate at a given point in time. (If we go with this approach, is there a rules engine that can easily manipulate data within a database in this way?)
In a case like this I suggest applying Ward Cunningham's question: ask yourself "What's the simplest thing that could possibly work?".
In this case, the simplest thing might be to come up with a view that looks at the data as it exists and does the calculations and computations to produce all the fields you care about. Now, load up your database and try it out. Is it fast enough? If so, good - you did the simplest possible thing and it worked out fine. If it's NOT fast enough, also good - the first attempt didn't work, but you've got the rules mapped out in the view code. Now you can try the next iteration of "the simplest thing" - perhaps you write a background task that watches for inserts and updates and then jumps in to recompute the flags. If that works, fine and dandy. If not, go on to the next iteration... and so on.
Share and enjoy.
I would advise against making the statuses column names; rather, use a status id and value, such as a customer status table with columns ID and Value.
I would have two methods for updating statuses. The first is a stored procedure that either has all the logic or calls separate stored procs to figure out each status. You could make all this dynamic by having a function for each status evaluation, which the one stored proc then calls. The second method is to have whatever stored proc(s) update user info also call a stored proc that updates all of that user's statuses based on the current data. These two methods give you both real-time updates for the data that changed and, if you add a new status, a way to update all statuses with the new logic.
Hopefully you have one point of updates to the user data, such as a user update stored proc, and you can put the status update stored proc call in that procedure. This would also save having to schedule a task every n seconds to update statuses.
An option I'd consider would be for each flag to be backed by a deterministic function that returns the up-to-date value given the relevant data.
The function might not perform well enough, however, if you're calling it for many rows at a time (e.g. for reporting). So, if you're on Oracle 11g, you can solve this by adding virtual columns (search for "virtual column") to the relevant tables based on the function. The Result Cache feature should improve the performance of the function as well.