Regularizing the constraints in fairlearn's ThresholdOptimizer - fairlearn

Is there a way to have a lambda regularizer value on the constraints in the ThresholdOptimizer? For instance if we want to create accuracy vs SPD curves I want to have different thresholds enforced on the SPD/accuracy constraints that would indicate their importance (maybe initially accuracy is more important then gradually SPD gains importance).

Fairlearn maintainer here! [I can't comment on StackOverflow, so sadly these clarifying questions need to be in an "answer", but I'll update it once I understand your concern.]
What do you mean by SPD?
Can you describe a use case where it's clear what you mean by "initially accuracy is more important, then gradually SPD gains importance"? ThresholdOptimizer currently only supports the case where you satisfy your constraints 100%. One could think of ways to extend this to have some tolerance in constraint violation to improve the accuracy (or other performance measures).
You might have come across the built-in charts fairlearn provides for ThresholdOptimizer: https://fairlearn.org/v0.6.1/api_reference/fairlearn.postprocessing.html#fairlearn.postprocessing.plot_threshold_optimizer
The chart depends on your constraint, of course, but those may prove to be helpful in explaining how it arrived at the threshold(s).
If you have a concrete feature request feel free to open an issue directly in the repository as well! Thanks!

Related

Neural network hyperparameter tuning - is setting random seed a good idea? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
I am trying to tune a basic neural network as practice. (Based on an example from a coursera course: Neural Networks and Deep Learning - DeepLearning.AI)
I face the issue of the random weight initialization. Lets say I try to tune the number of layers in the network.
I have two options:
1.: set the random seed to a fixed value
2.: run my experiments more times without setting the seed
Both version has pros and cons.
My biggest concern is that if I use a random seed (e.g.: tf.random.set_seed(1)) then the determined values can be "over-fitted" to the seed and may not work well without the seed or if the value is changed (e.g.: tf.random.set_seed(1) -> tf.random.set_seed(2). On the other hand, if I run my experiments more times without random seed then I can inspect less option (due to limited computing capacity) and still only inspect a subset of possible random weight initialization.
In both cases I feel that luck is a strong factor in the process.
Is there a best practice how to handle this topic?
Has TensorFlow built in tools for this purpose? I appreciate any source of descriptions or tutorials. Thanks in advance!
Tuning hyperparameters in deep learning (generally in machine learning) is a common issue. Setting the random seed to a fixed number ensures reproducibility and fair comparison. Repeating the same experiment will lead to the same outcomes. As you probably know, best practice to avoid over-fitting is to do a train-test split of your data and then use k-fold cross-validation to select optimal hyperparameters. If you test multiple values for a hyperparameter, you want to make sure other circumstances that might influence the performance of your model (e.g. train-test-split or weight initialization) are the same for each hyperparameter in order to have a fair comparison of the performance. Therefore I would always recommend to fix the seed.
Now, the problem with this is, as you already pointed out, the performance for each model will still depend on the random seed, like the particular data split or weight initialization in your case. To avoid this, one can do repeated k-fold-cross validation. That means you repeat the k-fold cross-validation multiple times, each time with a different seed, select best parameters of that run, test on test data and average the final results to get a good estimate of performance + variance and therefore eliminate the influence the seed has in the validation process.
Alternatively you can perform k-fold cross validation a single time and train each split n-times with a different random seed (eliminating the effect of weight initialization, but still having the effect of the train-test-split).
Finally TensorFlow has no build-in tool for this purpose. You as practitioner have to take care of this.
There is no an absolute right or wrong answer to your question. You are almost answered your own question already. In what follows, however, I will try to expand more, via the following points:
The purpose of random initialization is to break the symmetry that makes neural networks fail to learn:
... the only property known with complete certainty is that the
initial parameters need to “break symmetry” between different units.
If two hidden units with the same activation function are connected to
the same inputs, then these units must have different initial
parameters. If they have the same initial parameters, then a
deterministic learning algorithm applied to a deterministic cost and
model will constantly update both of these units in the same way...
Deep Learning (Adaptive Computation and Machine Learning series)
Hence, we need the neural network components (especially weights) to be initialized by different values. There are some rules of thumb of how to choose those values, such as the Xavier initialization, which samples from normal distribution with mean of 0 and special variance based on the number of the network layer. This is a very interesting article to read.
Having said so, the initial values are important but not extremely critical "if" proper rules are followed, as per mentioned in point 2. They are important because large or improper ones may lead to vanishing or exploding gradient problems. On the other hand, different "proper" weights shall not hugely change the final results, unless they are making the aforementioned problems, or getting the neural network stuck at some local maxima. Please note, however, the the latter depends also on many other aspects, such as the learning rate, the activation functions used (some explode/vanish more than others: this is a great comparison), the architecture of the neural network (e.g. fully connected, convolutional ..etc: this is a cool paper) and the optimizer.
In addition to point 2, bringing a good learning optimizer into the bargain, other than the standard stochastic one, shall in theory not let a huge influence of the initial values to affect the final results quality, noticeably. A good example is Adam, which provides a very adaptive learning technique.
If you still get a noticeably-different results, with different "proper" initialized weights, there are some ways that "might help" to make neural network more stable, for example: use a Train-Test split, use a GridSearchCV for best parameters, and use k-fold cross validation...etc.
At the end, obviously the best scenario is to train the same network with different random initial weights many times then get the average results and variance, for more specific judgement on the overall performance. How many times? Well, if can do it hundreds of times, it will be better, yet that clearly is almost impractical (unless you have some Googlish hardware capability and capacity). As a result, we come to the same conclusion that you had in your question: There should be a tradeoff between time & space complexity and reliability on using a seed, taking into considerations some of the rules of thumb mentioned in previous points. Personally, I am okay to use the seed because I believe that, "It’s not who has the best algorithm that wins. It’s who has the most data". (Banko and Brill, 2001). Hence, using a seed with enough (define enough: it is subjective, but the more the better) data samples, shall not cause any concerns.

Complexity of Integer vs. Binary Constraints in CPLEX

Recently, I have been trying to learn a bit about CPLEX and was hoping someone could help me understand the complexity when solving for integer vs. binary constraints.
For example, say we are trying to allocate a pie around 10 people for maximum utility, where each person has a utility that is linear with the amount of pie they receive. However, we want to introduce the constraint that at least 3 people have to get a bit of pie.
What's the difference between thinking of this as a single integer constraint (number_of_people_with_pie >= 3) vs. 10 binary variables (person_1_has_pie + person_2_has_pie + ... person_10_has_pie >= 3)? I would imagine the former is simplest but wonder if there is any benefits to forming the problem in terms of binary variables?
In addition to this, any recommended reading for better understanding MIP and CPLEX would be greatly appreciated, especially in better understanding where the problem becomes NP or in what situations simplex struggles to find the global minima.
Thanks!
I agree with Alex and Erwin's comment that this really depends on what you want to model. For this particular model I disagree with Alex: to me it makes more sense to use one decision variable per person, otherwise it may become hard to figure out which person gets how much of the pie.
A problem becomes NP hard as soon as you add integrality or SOS constraints. A good reading for MIP in general is Alex Schrijver's "Theory of Integer and Linear Programming". That should cover all the topics you need for an in-depth understanding of things.
It really depends on the case but in yours I would use 1 decision variable rather than 10.
Sometimes, that's not obvious and trying and measuring can prove oneself right or wrong. And that's one of the reason why using high modeling languages can help. (Abstract modeling languages such as OPL)
I recommend a MOOC on cognitive class : https://cognitiveclass.ai/courses/mathematical-optimization-for-business-problems/
and the OPL language manual : https://www.ibm.com/support/knowledgecenter/SSSA5P_12.7.0/ilog.odms.studio.help/pdf/opl_languser.pdf

How does "minimize" work in Z3

I'm using the minimize function in Z3 a lot and I'm worrying about some scalability issues (when the number of variables I'm minimizing grows). What is the underlying algorithm of "minimize" and is there a general way to speed things up?
See this paper for details on the optimization algorithms used in Z3. Regarding your question about "general way to speed things up:" Impossible to tell without seeing exactly what you're trying to do and how you are encoding it. Posting a concrete example where things don't "scale" might be helpful.

Synopsys: Repeated compiles produce different results. How to automate iterated compile?

I'm new to using Design Compiler. In the past, I've done mostly FPGA work. Right now, I'm using Synopsys to determine the minimum are necessary to represent some circuits (using the Nangate 45nm library). I'm not doing P&R right now; I'm just trying to determine transistor area.
My only optimization constraint is to minimize area. I've noticed that if I tell DC to compile more than one time in a row, it produces different (and usually smaller) results each time.
I've looked and looked and failed to see if this is mentioned in a manual or anywhere in any discussion. Is it meant to work this way?
This suggests that optimization is stopping earlier than it could, so it's not REALLY minimizing area. Any idea why?
Is there a way I can tell it to increase the effort and/or tell it to automatically iterate compiles so that it will converge on the smallest design?
I'm guessing that DC is expecting to meet timing constraints, but I've given it a purely combinatorial block and no timing constraint. Did they never consider the usage scenario when all you want to do is work out the minimum gate area for a combinatorial circuit?
On a pure combinatorial circuit you can use a set_max_delay constraint and DC will attempt to meet that.
For reduced area you can use -map_effort high or -map_effort ultra to get it to work harder.
DC is a funny beast, and the algorithms it uses change as processes advance and make certain activities more or less useful. A lot of pre-layout optimization is less useful since the whole situation can change once the gates are actually placed and routed.
I filed a support ticket with Synopsys. I was using a 2010 version of design compiler. Apparently, area optimization has been improved since then, and the 2014 version will minimize area in one compiler pass.

What are the most useful software development metrics? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I would like to track metrics that can be used to improve my team’s software development process, improve time estimates, and detect special case variations that need to be addressed during the project execution.
Please limit each answer to a single metric, describe how to use it, and vote up the good answers.
(source: osnews.com)
ROI.
The total amount of revenue brought in by the software minus the total amount of costs to produce the software. Breakdown the costs by percentage of total cost and isolate your poorest performing and most expensive area in terms of return-on-investment. Improve, automate, or eliminate that problem area if possible. Conversely, find your highest return-on-investment area and find ways to amplify its effects even further. If 80% of your ROI comes from 20% of your cost or effort, expand that particular area and minimize the rest by comparison.
Costs will include payroll, licenses, legal fees, hardware, office equipment, marketing, production, distribution, and support. This can be done on a macro level for a company as whole or a micro level for a team or individual. It can also be applied to time, tasks, and methods in addition to revenue.
This doesn't mean ignore all the details, but find a way to quantify everything and then concentrate on the areas that yield the best (objective) results.
Inverse code coverage
Get a percentage of code not executed during a test. This is similiar to what Shafa mentioned, but the usage is different. If a line of code is ran during testing then we know it might be tested. But if a line of code has not been ran then we know for sure that is has not been tested. Targeting these areas for unit testing will improve quality and takes less time than auditing the code that has been covered. Ideally you can do both, but that never seams to happen.
"improve my team’s software development process": Defect Find and Fix Rates
This relates to the number of defects or bugs raised against the number of fixes which have been committed or verified.
I'd have to say this is one of the really important metrics because it gives you two things:
1. Code churn. How much code is being changed on a daily/weekly basis (which is important when you are trying to stabilize for a release), and,
2. Shows you whether defects are ahead of fixes or vice-versa. This shows you how well the development team is responding to defects raised by the QA/testers.
A low fix rate indicates the team is busy working on other things (features perhaps). If the bug count is high, you might need to get developers to address some of the defects.
A low find rate indicates either your solution is brilliant and almost bug free, or the QA team have been blocked or have another focus.
Track how long is takes to do a task that has an estimate against it. If they were well under, question why. If they are well over, question why.
Don't make it a negative thing, it's fine if tasks blow out or were way under estimated. Your goal is to continually improve your estimation process.
Track the source and type of bugs that you find.
The bug source represents the phase of development in which the bug was introduced. (eg. specification, design, implementation etc.)
The bug type is the broad style of bug. eg. memory allocation, incorrect conditional.
This should allow you to alter the procedures you follow in that phase of development and to tune your coding style guide to try to eliminate over represented bug types.
Velocity: the number of features per given unit time.
Up to you to determine how you define features, but they should be roughly the same order of magnitude otherwise velocity is less useful. For instance, you may classify your features by stories or use cases. These should be broken down so that they are all roughly the same size. Every iteration, figure out how many stories (use-cases) got implemented (completed). The average number of features/iteration is your velocity. Once you know your velocity based on your feature unit you can use it to help estimate how long it will take to complete new projects based on their features.
[EDIT] Alternatively, you can assign a weight like function points or story points to each story as a measure of complexity, then add up the points for each completed feature and compute velocity in points/iteration.
Track the number of clones (similar code snippets) in the source code.
Get rid of clones by refactoring the code as soon as you spot the clones.
Average function length, or possibly a histogram of function lengths to get a better feel.
The longer a function is, the less obvious its correctness. If the code contains lots of long functions, it's probably a safe bet that there are a few bugs hiding in there.
number of failing tests or broken builds per commit.
interdependency between classes. how tightly your code is coupled.
Track whether a piece of source has undergone review and, if so, what type. And later, track the number of bugs found in reviewed vs. unreviewed code.
This will allow you to determine how effectively your code review process(es) are operating in terms of bugs found.
If you're using Scrum, the backlog. How big is it after each sprint? Is it shrinking at a consistent rate? Or is stuff being pushed into the backlog because of (a) stuff that wasn't thought of to begin with ("We need another use case for an audit report that no one thought of, I'll just add it to the backlog.") or (b) not getting stuff done and pushing it into the backlog to meet the date instead of the promised features.
http://cccc.sourceforge.net/
Fan in and Fan out are my favorites.
Fan in:
How many other modules/classes use/know this module
Fan out:
How many other modules does this module use/know
improve time estimates
While Joel Spolsky's Evidence-based Scheduling isn't per se a metric, it sounds like exactly what you want. See http://www.joelonsoftware.com/items/2007/10/26.html
I especially like and use the system that Mary Poppendieck recommends. This system is based on three holistic measurements that must be taken as a package (so no, I'm not going to provide 3 answers):
Cycle time
From product concept to first release or
From feature request to feature deployment or
From bug detection to resolution
Business Case Realization (without this, everything else is irrelevant)
P&L or
ROI or
Goal of investment
Customer Satisfaction
e.g. Net Promoter Score
I don't need more to know if we are in phase with the ultimate goal: providing value to users, and fast.
number of similar lines. (copy/pasted code)
improve my team’s software development process
It is important to understand that metrics can do nothing to improve your team’s software development process. All they can be used for is measuring how well you are advancing toward improving your development process in regards to the particular metric you are using. Perhaps I am quibbling over semantics but the way you are expressing it is why most developers hate it. It sounds like you are trying to use metrics to drive a result instead of using metrics to measure the result.
To put it another way, would you rather have 100% code coverage and lousy unit tests or fantastic unit tests and < 80% coverage?
Your answer should be the latter. You could even want the perfect world and have both but you better focus on the unit tests first and let the coverage get there when it does.
Most of the aforementioned metrics are interesting but won't help you improve team performance. Problem is your asking a management question in a development forum.
Here are a few metrics: Estimates/vs/actuals at the project schedule level and personal level (see previous link to Joel's Evidence-based method), % defects removed at release (see my blog: http://redrockresearch.org/?p=58), Scope creep/month, and overall productivity rating (Putnam's productivity index). Also, developers bandwidth is good to measure.
Every time a bug is reported by the QA team- analyze why that defect escaped unit-testing by the developers.
Consider this as a perpetual-self-improvement exercise.
I like Defect Resolution Efficiency metrics. DRE is ratio of defects resolved prior to software release against all defects found. I suggest tracking this metrics for each release of your software into production.
Tracking metrics in QA has been a fundamental activity for quite some time now. But often, development teams do not fully look at how relevant these metrics are in relation to all aspects of the business. For example, the typical tracked metrics such as defect ratios, validity, test productivity, code coverage etc. are usually evaluated in terms of the functional aspects of the software, but few pay attention to how they matter to the business aspects of software.
There are also other metrics that can add much value to the business aspects of the software, which is very important when an overall quality view of the software is looked at. These can be broadly classified into:
Needs of the beta users captured by business analysts, marketing and sales folks
End-user requirements defined by the product management team
Ensuring availability of the software at peak loads and ability of the software to integrate with enterprise IT systems
Support for high-volume transactions
Security aspects depending on the industry that the software serves
Availability of must-have and nice-to-have features in comparison to the competition
And a few more….
Code coverage percentage
If you're using Scrum, you want to know how each day's Scrum went. Are people getting done what they said they'd get done?
Personally, I'm bad at it. I chronically run over on my dailies.
Perhaps you can test CodeHealer
CodeHealer performs an in-depth analysis of source code, looking for problems in the following areas:
Audits Quality control rules such as unused or unreachable code,
use of directive names and
keywords as identifiers, identifiers
hiding others of the same name at a
higher scope, and more.
Checks Potential errors such as uninitialised or unreferenced
identifiers, dangerous type casting,
automatic type conversions, undefined
function return values, unused
assigned values, and more.
Metrics Quantification of code properties such as cyclomatic
complexity, coupling between objects
(Data Abstraction Coupling), comment
ratio, number of classes, lines of
code, and more.
Size and frequency of source control commits.