I am new to OptaPlanner, but I have a reasonable, albeit somewhat dated, understanding of constraint solving.
I have a problem I want to model. On the one hand, the National Grid has requirements to save electricity between defined time slots on specific days in specific locations (post codes). On the other hand, individuals with static or mobile batteries charge their batteries at some point during a 24-hour cycle and need to get a specific amount of charge into those batteries. I need to model a set of constraints at the top (the grid) and constraints at the bottom (the individuals) to ensure the individuals get what they need and the grid saves what it requires.
What model should I pick and why?
I am just starting this so I have not tried anything yet. I would prefer a Java/SpringBoot solution.
Many thanks for any help.
Steve T
First read the domain modeling guide in the docs to understand my answer below.
https://www.optaplanner.org/docs/optaplanner/latest/design-patterns/design-patterns.html#domainModelingGuide
I think the maintenance scheduling quickstart might be a good start. Code is here:
https://github.com/kiegroup/optaplanner-quickstarts/tree/stable/use-cases/maintenance-scheduling
Motivation: it sounds like there could be gaps between charging sessions at the charging stations, so a chained through time model does not fit. You're not solving a VRP anyway. So I suspect a timegrain model is the right fit, which is what the maintenance scheduling quickstart actually uses.
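To make the timegrain idea concrete, here is a minimal sketch of what the domain classes could look like for your charging problem, assuming a 24-hour horizon sliced into discrete grains. The class and field names (TimeGrain, ChargingSession, requiredKwh, etc.) are my own illustration, not taken from the quickstart:

```java
// (Each class would live in its own file; getters/setters omitted for brevity.)
import org.optaplanner.core.api.domain.entity.PlanningEntity;
import org.optaplanner.core.api.domain.variable.PlanningVariable;

// Problem fact: one discrete slot on the 24-hour timeline (e.g. 15-minute granularity).
class TimeGrain {
    int grainIndex;              // 0..95 for 15-minute grains in a day
}

// Planning entity: the solver assigns a starting TimeGrain to each charging session.
@PlanningEntity
class ChargingSession {
    String batteryId;
    String postCode;             // where the battery plugs in
    int requiredKwh;             // energy the individual needs
    int durationInGrains;        // how long it takes to deliver that energy

    // Supplied by a @ValueRangeProvider (id "timeGrainRange") on the solution class.
    @PlanningVariable(valueRangeProviderRefs = "timeGrainRange")
    TimeGrain startingTimeGrain;
}
```

The grid-side saving windows per post code would then become constraints in a ConstraintProvider that penalize sessions overlapping the restricted grains, while each individual's required charge stays a hard constraint on the assigned session.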
I'd like to use OptaPlanner (or something comparable that allows for constraint based optimization) to do vegetable garden bed layout. Ideally this would take into account the space needed per plant, which plants do well next to each other and which don't, and then also how long it takes for the plant to produce food, how much food, etc. The idea would be to help plan a full season where as soon as a plant has been harvested something else can be dropped into its place to max out the productivity and the continuity of food availability. Last, I'm hoping to create aesthetic rules which bias designs towards more pleasing layouts. For example symmetry & repeated (rhythmic) elements vs randomness and ease of harvesting large contiguous groupings.
I think most of these I can figure out how to create constraints in OptaPlanner, but I'm not sure how to represent the physical space and the proximity of each location to its neighbors. Would I break each bed into a set of grid cells, and then assign plants of different cell sizes to regions? Or maybe this needs to be turned into a graph representation? Are there any other physical layout example scenarios I can build off of?
At the moment, we do not have any examples for using OptaPlanner on a bin packing problem. (Which I think your situation could be mapped onto.)
Assuming you can separate your garden bed into a predefined set of fixed slots, in which a plant either does or does not fit, I think OptaPlanner could be used. These slots would then have a fixed set of geographic data, like which is close to which.
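As a rough sketch of what that could look like (my own illustration; class and field names such as BedSlot and Planting are hypothetical):

```java
// (Each class would live in its own file; getters/setters omitted for brevity.)
import org.optaplanner.core.api.domain.entity.PlanningEntity;
import org.optaplanner.core.api.domain.variable.PlanningVariable;

import java.util.Set;

// Problem fact: one predefined slot in a bed, with fixed position and neighbours.
class BedSlot {
    int row;
    int column;
    Set<BedSlot> neighbours;   // precomputed adjacency, never changed by the solver
}

// Planning entity: the solver decides which slot each planting goes into.
@PlanningEntity
class Planting {
    String plantType;          // e.g. "tomato", "basil"
    int weeksToHarvest;

    // Supplied by a @ValueRangeProvider (id "slotRange") on the solution class.
    @PlanningVariable(valueRangeProviderRefs = "slotRange")
    BedSlot slot;
}
```

Companion-planting rules then become constraints that compare the plant types of plantings whose assigned slots appear in each other's neighbour sets; the proximity data stays a fixed problem fact and only the plant-to-slot assignment moves.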
If, on the other hand, you cannot subdivide the garden beds beforehand and would like the solver to do that for you as well, then the problem becomes much harder to solve. You are really solving two problems, one of which is spatial, and you may find some success treating them separately: first find the ideal layout based on the number and type of plants you want to plant, and then in a second step figure out which plants to put where.
I'm doing research on the author name disambiguation problem and want to run some experiments. I want to perform clustering on citation records. My dataset consists of 2000 XML records. I need testing data. The dataset I'm using is not popular, so I need to create the testing data manually, but I don't know how to do so. I need instructions on how to create testing data manually. Note: I want to compare the performance of a set of techniques at solving the author name disambiguation problem, so I must perform testing.
Even though it is not really clear what kind of testing you want to perform, the general answer to the issue at hand - trying to artificially create more data from the data you already have - is the bootstrap. It is a technique where you perform sampling with replacement from your dataset as many times as you want: it repeatedly picks a random element from your data until you get a sample of the size you want. The sample you get can be larger than your original dataset but should have similar statistical properties to it. Bootstrap sampling is available in sklearn.
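In sklearn, have a look at sklearn.utils.resample. The underlying idea is just sampling with replacement, which you could also sketch in a few lines of Java (illustrative only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public final class Bootstrap {

    // Draws sampleSize elements from data uniformly at random *with replacement*.
    public static <T> List<T> sample(List<T> data, int sampleSize, Random random) {
        List<T> sample = new ArrayList<>(sampleSize);
        for (int i = 0; i < sampleSize; i++) {
            sample.add(data.get(random.nextInt(data.size())));
        }
        return sample;
    }
}
```

Calling sample(records, records.size(), new Random()) repeatedly gives you as many bootstrap replicates of your 2000 records as you need.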
P.S. Keep in mind that this solution is not optimal - the best solution to this problem is to actually get more real data somehow.
Classification vs. Clustering
For author name disambiguation, I don't think you want clustering. What you want is classification.
You have features for each author/publication. Now you give the classifier two of those feature vectors, and it classifies them as "it is the same author" or "those are different authors".
Training / testing data
With a binary classification problem, the testing suddenly becomes simple: just use one of the measures common in the literature (accuracy, precision, recall, confusion matrix).
Getting the data might be a bit more complicated. You wrote that you have an XML file of 2000 records. I guess you can derive features from those records automatically and the authors have an identifier? Then you can simply generate positive examples from record pairs whose identifiers match and negative examples from pairs with different identifiers.
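For example (a minimal sketch; Citation and its fields are placeholders for whatever your XML records actually contain):

```java
import java.util.ArrayList;
import java.util.List;

// A citation record carrying the ground-truth author identifier (fields are illustrative).
record Citation(String authorId, String title, String venue, int year) {}

// A labelled example: a pair of records that either share an author or do not.
record LabelledPair(Citation a, Citation b, boolean sameAuthor) {}

public final class PairBuilder {

    // Builds all record pairs; the label is positive exactly when the identifiers match.
    public static List<LabelledPair> buildPairs(List<Citation> records) {
        List<LabelledPair> pairs = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                Citation a = records.get(i);
                Citation b = records.get(j);
                pairs.add(new LabelledPair(a, b, a.authorId().equals(b.authorId())));
            }
        }
        return pairs;
    }
}
```

In practice you would probably subsample the negative pairs, since they vastly outnumber the positive ones.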
Otherwise you can have a look at http://dblp.uni-trier.de/. Although there are likely many publications filed under the same author that actually belong to different people, they do distinguish authors not only by name but also give them identifiers.
Alternatively, you can train a classifier to classify each of the known authors with e.g. > 30 publications. Then remove the softmax layer and use those features to distinguish the authors.
I have a cube with a Source Currency Dimension and a Billing Currency Dimension. I set both of these to IsAggregatable = False (which seems to be recommended since I don't want an All level to automatically sum up over different currencies!). When the All level is taken away I am left with a single default currency, that I can set if I want.
Problem is the two dimensions now act as a sort of filter on each other. If I want a total of all Billing Amounts by billing currency and drag that dimension on its own onto the grid, the result is filtered to show only the transactions (if any) that also match the default Source Currency, and vice versa. It is only if I drag the other dimension onto the grid that I have the ability to show all the data.
Is there some setting that allows a default member of a dimension to represent all the members? It feels like Not Aggregatable makes sense in the context of that one dimension, but seems to make little sense in the context of other dimensions in terms of seeing all data. I.e. I want to see a summary of transactions by Billing Currency - and I DON'T CARE what the source currency was (i.e. you can consider All of them).
I am quite likely falling into some basic trap here, possibly design related - any clues would be appreciated.
Glenn
Finally found a reference to an article that describes this situation EXACTLY, and an interesting approach to resolving it. The main thing for me was recognition that the situation is real and not an issue with my design.
Issue described and resolved here
So, I've done a fair bit of reading into generation of a Sudoku puzzle. From what I can tell, the standard way to get a Sudoku puzzle of a desired difficulty is to generate a puzzle, grade it afterwards, and repeat until you have one of an acceptable rating. This can be refined by generating via backtracking using some of the more complex solving patterns (XY-wing, swordfish, etc.), but that's not quite what I'm wanting to do here.
What I want to do, but have been unable to find any real resource on, is generate a puzzle from a "difficulty value" (a 0-1.0 value, 0 being the easiest and 1.0 being the hardest).
For example, I want to create a moderately difficult puzzle, so the value .675 is selected. Now, using that value, I want to be able to generate a moderately difficult puzzle.
Anyone know of something like this? Or perhaps something with a similar methodology?
Adding another answer for generating a sudoku of desired difficulty on-the-fly.
This means that, unlike other approaches, the algorithm runs only once and returns a sudoku configuration matching the desired difficulty (either with high probability within a range, or with probability = 1).
Most existing solutions for generating (and rating) sudoku difficulty are based on human solving techniques and approaches, which can be rated easily.
One then (after having generated a sudoku configuration) re-solves the sudoku with the human-like solver and, depending on the techniques the solver used (e.g. pairs, x-wing, swordfish, etc.), assigns a difficulty rating.
Problems with this approach
(and requirements for the use case I had)
In order to generate a sudoku with a given difficulty, the previous method needs to solve the sudoku twice (once with the basic algorithm and once with the human-like solver).
One has to (pre-)generate many sudokus, which can only be rated for difficulty after being solved by the human-like solver. So one cannot generate a sudoku of the desired difficulty on the fly in a single run.
The human-like solver can be complicated and in most cases (if not all) is tightly coupled to 9x9 sudoku grids, so it does not generalise easily to other sudokus (e.g. 4x4, 16x16, 6x6, etc.).
The difficulty rating of the human-like techniques is very subjective. For example, why is x-wing taken to be more difficult than hidden singles? (I have personally solved many difficult published sudoku puzzles manually and never used such techniques.)
Another approach was used which has the following benefits:
Generalises well to arbitrary sudokus (9x9, 4x4, 6x6, 16x16, etc.)
The sudoku configuration, with desired difficulty, is generated once and on-the-fly
The difficulty rating is objective.
How does it work?
First of all, there is the simple fact that the more difficult the puzzle, the more time it takes to solve.
But solving time is closely correlated with both the number of clues (givens) and the average number of alternatives to be investigated per empty cell.
Extending my previous answer, it was mentioned that for any sudoku puzzle the minimum number of clues is an objective property of the puzzle (for example, for 9x9 grids the minimum number of clues for a valid sudoku is 17).
One can start from there and compute the minimum number of clues per difficulty level (using a linear correlation between difficulty and clue count).
Furthermore, at each step of the sudoku generation process, one can make sure the average number of alternatives (to be investigated) per empty cell stays within given bounds (as a function of the desired difficulty).
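As a concrete illustration of the clue-budget step only (my own sketch, not taken from sudoku.js; the easy-level maximum of 40 clues is an arbitrary choice):

```java
public final class ClueBudget {

    // For a classic 9x9 grid, 17 clues is the known minimum for a unique solution.
    private static final int MIN_CLUES = 17;
    // Illustrative upper bound for the easiest level; tune to taste.
    private static final int MAX_CLUES = 40;

    // Maps a difficulty in [0, 1] (1.0 = hardest) to a clue count by linear interpolation:
    // harder puzzles get fewer visible clues.
    public static int cluesForDifficulty(double difficulty) {
        double d = Math.max(0.0, Math.min(1.0, difficulty));
        return (int) Math.round(MAX_CLUES - d * (MAX_CLUES - MIN_CLUES));
    }
}
```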
Depending on whether the algorithm uses backtracking or not (for the use case discussed, the algorithm does no backtracking), the desired difficulty can be reached either with probability = 1 or with high probability within bounds (respectively).
Tests of the sudokus generated with this algorithm, with difficulty rated using the previous approaches (a human-like solver), show a correlation between desired and estimated difficulty rates, plus a greater ability to generalise to arbitrary sudoku configurations.
(I have used this online sudoku solver (and also this one) to correlate the difficulty rates of the test sudokus.)
The code is available for free on GitHub as sudoku.js (along with a sample demo application); it is a scaled-down version of CrossWord.js, a professional crossword builder in JavaScript by the same author.
The sudoku difficulty is related in an interesting way to the (minimum) amount of information needed to specify a unique solution for a given grid.
Sounds like information theory? Yes, it has applications here too.
Sudoku puzzles should have a unique solution. Furthermore, sudoku puzzles have certain symmetries, i.e. by row, by column and by sub-square.
These symmetries specify the minimum number of clues (and, more or less, their positions) needed for the solution to be unique (i.e. as verified with a sudoku compiler or an algorithm like backtrack search).
This would be the most difficult/hard sudoku puzzle level (i.e. the minimum needed number of clues). All other difficulty levels, from less hard down to easy, are then generated by allowing more clues than the minimum amount needed.
It should be noted that sudoku difficulty levels are not standard; as explained above, one can have as many or as few difficulty levels as one wants. What is standard is the minimum number (and position) of clues (which defines the hardest level and is related to the sudoku symmetries); from there one can generate as many difficulty levels as one wants simply by allowing extra/redundant clues to be visible as well.
It's not as elegant as what you ask, but you can simulate this behavior with caching:
Decide how many "buckets" you want for puzzles. For example, let's say you choose 20. Thus, your buckets will contain puzzles of different difficulty ranges: 0-.05, .05-.1, .1-.15, ..., .9-.95, .95-1
Generate a puzzle
Grade the puzzle
Put it in the appropriate bucket (or throw it away when the bucket is full)
Repeat till your buckets are "filled". The size of the buckets and where they are stored will be based on the needs of your application.
Then when a user requests a certain difficulty puzzle, give them a cached one from the bucket they choose. You might also want to consider swapping numbers and changing orientation of puzzles with known difficulties to generate similar puzzles with the same level of difficulty. Then repeat the above as needed when you need to refill your buckets with new puzzles.
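A minimal sketch of such a cache (the puzzle representation, bucket count and bucket sizes are placeholders):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public final class PuzzleCache {

    private final int bucketCount;
    private final int bucketCapacity;
    private final List<List<int[][]>> buckets;   // one list of puzzles per difficulty range

    public PuzzleCache(int bucketCount, int bucketCapacity) {
        this.bucketCount = bucketCount;
        this.bucketCapacity = bucketCapacity;
        this.buckets = new ArrayList<>();
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Maps a graded difficulty in [0, 1] to its bucket index.
    private int bucketFor(double difficulty) {
        return Math.min((int) (difficulty * bucketCount), bucketCount - 1);
    }

    // Called by the generate-and-grade loop; drops the puzzle if its bucket is already full.
    public void offer(int[][] puzzle, double gradedDifficulty) {
        List<int[][]> bucket = buckets.get(bucketFor(gradedDifficulty));
        if (bucket.size() < bucketCapacity) {
            bucket.add(puzzle);
        }
    }

    // Serves a cached puzzle for the user's requested difficulty, if one is available.
    public Optional<int[][]> request(double requestedDifficulty) {
        List<int[][]> bucket = buckets.get(bucketFor(requestedDifficulty));
        return bucket.isEmpty() ? Optional.empty() : Optional.of(bucket.remove(bucket.size() - 1));
    }
}
```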
Well, you can't know how complicated a puzzle is before you know how to solve it. And Sudoku solving (and therefore also the difficulty rating) belongs to the NP-complete complexity class, which means it's (most likely) impossible to find an algorithm that is (asymptotically) faster than the proposed randomly-guess-and-check.
However, if you can find one, you have solved the P versus NP problem and should clear a cupboard for the Fields Medal... :)