Do we always need to remove a column for one-hot encoding to prevent multicollinearity?
In the solution here (https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic/comments#138896) it mentions
#Kevin Chang You need to delete one column of the dummy variables to avoid the state of multicollinearity. It's a state of very high correlations among the columns (independent variables), meaning that one can be predicted from the others. It is therefore a type of disturbance in the data, and if it is present, the statistical conclusions made about the data may not be reliable.
In the solutions here, however, multicollinearity is not catered for:
https://www.kaggle.com/sharmasanthosh/allstate-claims-severity/exploratory-study-on-ml-algorithms
May I know: is it a must, or in what situations do we need to cater for it?
If I have to answer your question "Do we always need to remove a column for one-hot encoding to prevent multicollinearity?", the answer is yes.
The common way to prevent multicollinearity is to remove highly correlated predictors from the model. If you have two or more factors with a high VIF (variance inflation factor), remove one of them from the model. Because they supply redundant information, removing one of the correlated factors usually doesn't reduce the R-squared.
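For example, with one-hot encoding in pandas (assuming your features live in a pandas DataFrame, as in the linked Titanic kernel; the toy column below is made up), dropping one dummy column is a single flag:

import pandas as pd

# Hypothetical toy data standing in for a categorical column such as 'Embarked'
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

# drop_first=True removes one dummy column, so the remaining k-1 columns are no
# longer perfectly collinear; the dropped level becomes the implicit baseline
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", drop_first=True)
print(dummies.columns.tolist())   # ['Embarked_Q', 'Embarked_S']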
Alternatively, you could use Partial Least Squares Regression (PLS) or Principal Components Analysis (as in principal components regression), methods that reduce the predictors to a smaller set of uncorrelated components.
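If it helps, a rough scikit-learn sketch of the principal-components idea looks like this (the data is invented and the pipeline is only illustrative, not taken from either linked kernel):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Made-up predictors: the second column is nearly a copy of the first
rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=200), rng.normal(size=200)])
y = 2 * x1 + rng.normal(size=200)

# Principal components regression: fit the regression on uncorrelated components
pcr = make_pipeline(PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))   # R-squared on the training data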
Related
66 guests at an event, 8 tables. Each table has a "theme". We want to optimize various criteria: e.g., an even split of men and women at each table, people getting to discuss the topic they selected, etc.
I formulated this as a gradient-free optimisation problem: I wrote a function that calculates the goodness of the arrangement (i.e., a cost for the imbalance of men and women, a cost for a non-preferred theme, etc.), and I am basically randomly perturbing the arrangement by swapping people between tables and keeping the "best so far" arrangement. This seems to work, but it cannot guarantee optimality.
I am wondering if there is a more principled way to go about this. There (intuitively) seems to be no useful gradient in the operation of "swapping" people between tables, so random search is the best I came up with. However, brute-forcing by evaluating all possibilities seems infeasible: with 66 people there are factorial(66) possible orders, which is a ridiculously large number (about 10^92 according to Python). Since swapping two people at the same table gives the same arrangement, there are actually fewer, which I think can be calculated by dividing out the repeats, i.e. fact(66) / (fact(number of people at table 1) * fact(number of people at table 2) * ...), which in my problem still comes out to about 10^53 possible arrangements, far too many to enumerate.
But is there something better that I can do than random search? I thought about evolutionary search but I don't know if it would provide any advantages.
Currently I am swapping a random number of people on each evaluation and keeping the result only if it gives a better value. The number of people to swap is drawn from an exponential distribution, so that swapping 1 person is more probable than swapping 6, for example: small steps on average, but with the possibility of "jumping" a bit further in the search.
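Roughly, the loop I described looks like this (cost() stands in for my goodness function, which is not shown, and the assignment is simply a dict mapping each guest to a table):

import random

def perturb(assignment, n_swaps):
    # assignment maps guest -> table; exchange the tables of n_swaps random pairs
    new = dict(assignment)
    guests = list(new)
    for _ in range(n_swaps):
        a, b = random.sample(guests, 2)
        new[a], new[b] = new[b], new[a]
    return new

def random_greedy_search(assignment, cost, iterations=100_000):
    best, best_cost = assignment, cost(assignment)
    for _ in range(iterations):
        # exponential-ish step size: swapping 1 pair is most likely, 6 pairs is rare
        n_swaps = min(6, 1 + int(random.expovariate(1.0)))
        candidate = perturb(best, n_swaps)
        c = cost(candidate)
        if c < best_cost:              # keep only strict improvements
            best, best_cost = candidate, c
    return best, best_cost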
I don't know how to prove it but I have a feeling this is an NP-hard problem; if that's the case, how could it be reformulated for a standard solver?
Update: I have been comparing random search with a random "greedy search" and with a "simulated annealing"-inspired approach, where I keep a worsening swap with a probability based on the measured change in cost, and that probability anneals over time. So far the greedy search surprisingly strongly outperforms the probabilistic approach; adding the annealing schedule seems to help.
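The annealing variant only differs in the acceptance rule (this reuses the perturb() helper from the sketch above; the schedule parameters are just values I happened to try):

import math
import random

def simulated_annealing(assignment, cost, iterations=100_000,
                        t_start=1.0, t_end=0.01):
    current, current_cost = assignment, cost(assignment)
    best, best_cost = current, current_cost
    for i in range(iterations):
        temp = t_start * (t_end / t_start) ** (i / iterations)   # geometric cooling
        candidate = perturb(current, 1 + int(random.expovariate(1.0)))
        c = cost(candidate)
        # always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature anneals towards t_end
        if c < current_cost or random.random() < math.exp((current_cost - c) / temp):
            current, current_cost = candidate, c
            if c < best_cost:
                best, best_cost = candidate, c
    return best, best_cost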
What I am confused by is exactly how to think about the "space" of the domain. I realize that it is a discrete space, and that distances are best described in terms of something like Levenshtein edit distance, but I can't see how I could "map" it to a gradient-friendly continuous space. Possibly, if I drop the exact number of people per table and make it continuous, but strongly penalized so that it leans towards the number I want at each table, the association matrix would become more "flexible" and might map better to a gradient-based search? Not sure. A seating assignment could then be a probability spread over more than one table...
Much has been written about the performance of algorithms for minimizing DFAs. That's frustrating my Google-fu because that's not what I'm looking for.
Can we say anything generally about the performance characteristics of a non-minimal DFA? My intuition is that the run time of a non-minimal DFA will still be O(n) with respect to the length of the input. It seems that minimization would only affect the number of states and hence the storage requirements. Is this correct?
Can we refine the generalizations if we know something about the construction of the NFA from which the DFA was derived? For example, say the NFA was constructed entirely by applying concatenation, union, and Kleene star operations to primitive automatons that match either one input symbol or epsilon. With no way to remove a transition or create an arbitrary transition, I don't think it's possible to have any dead states. What generalizations can we make about the DFAs constructed from these NFAs? I'm interested in both theoretical and empirical answers.
Regarding your first question, on the runtime of the non-minimal DFA: purely theoretically, your intuition that it should still run in O(n) is correct. However, imagine (as an example) the following pseudo-code for the Kleene star operator:
# given that the Kleene-star part of the match starts at some index i
while string[i] == 'r':      # first copy of the looping state
    accepting = True
    i += 1
while string[i] == 'r':      # redundant second copy of the same state
    accepting = True
    i += 1
# the processing of the rest of the input string continues from here
As you can see, the two while loops are identical and can be understood as a redundant state. However, "splitting" the loop like this will, among other things, hurt your branch-prediction accuracy and therefore the overall runtime (see Mysticial's brilliant explanation of branch prediction for more details).
Many other, similar "practical" arguments can be made for why a non-minimal DFA will be slower; among them, as you mentioned, higher memory usage (and in many cases more memory means slower, since memory is, by comparison, a slow part of the computer); more "ifs", since each additional state requires input checking for its successors; and possibly more loops (as in the example), which would make the algorithm slower not only because of branch prediction, but simply because some programming languages are very slow on loops.
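To make the theoretical O(n) point concrete: if the DFA is run from a transition table rather than compiled into branching code, the scan performs exactly one lookup per input character, no matter how many redundant states the table contains. A tiny sketch (the automaton and names here are made up):

# Transition table for a deliberately non-minimal DFA over {'a', 'b'};
# states 1 and 2 are equivalent, but the scan below doesn't care.
delta = {
    (0, 'a'): 1, (0, 'b'): 0,
    (1, 'a'): 2, (1, 'b'): 0,
    (2, 'a'): 2, (2, 'b'): 0,
}
accepting = {1, 2}

def accepts(word, start=0):
    state = start
    for ch in word:                    # one table lookup per character: O(n)
        state = delta[(state, ch)]
    return state in accepting

print(accepts("aab"), accepts("ba"))   # False True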
Regarding your second question: here I am not sure what you mean. After all, if you do the conversion properly, you should derive a pretty optimal DFA in the first place.
EDIT:
In the discussion, the idea came up that several non-minimal DFAs constructed from one NFA could have different efficiencies (in whatever measure is chosen), not because of the implementation, but because of the structure of the DFA.
This is not possible, because there is only one optimal DFA. Here is the outline of a proof:
Assume that our procedure for creating and minimizing a DFA is optimal.
When applying the procedure, we start by constructing a DFA. In this step, we can create arbitrarily many equivalent states. These states are all connected to the graph of the NFA in some way.
In the next step, we eliminate all unreachable states. This makes no difference to performance, since an unreachable state corresponds to "dead code": it is never executed.
In the next step, we minimize the DFA by grouping equivalent states. This is where it becomes interesting, because the idea is that we could do this in different ways, resulting in different DFAs with different performance. However, the only "choice" we have is assigning a state to a different group.
So, for argument's sake, assume we could do that.
But, by the idea behind the minimization algorithm, we can only group equivalent states. So if we had different choices for grouping a particular state, then by transitivity of equivalence, not only would that state be equivalent to both groups, but the two groups would be equivalent to each other as well. So if we could group differently, the algorithm would not be optimal, because it should have grouped all the states of those groups into a single group in the first place.
Therefore, the assumption that there can be different minimizations must be wrong.
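A sketch of the grouping step makes it visible that there is no choice involved: the final blocks are forced by the equivalence itself. This is a plain Moore-style partition refinement, assuming a complete DFA given as a transition dict (it reuses the toy delta and accepting set from the earlier sketch; the function name is mine):

def minimize_groups(states, alphabet, delta, accepting):
    # Start from the coarsest distinction: accepting vs. non-accepting.
    block = {s: (s in accepting) for s in states}
    while True:
        # A state's signature: its own block plus the blocks its transitions reach.
        sig = {s: (block[s],) + tuple(block[delta[(s, a)]] for a in alphabet)
               for s in states}
        ids = {v: i for i, v in enumerate(sorted(set(sig.values())))}
        refined = {s: ids[sig[s]] for s in states}
        if len(set(refined.values())) == len(set(block.values())):
            return refined     # stable: states with the same id are equivalent
        block = refined

# On the toy DFA above, the two equivalent states land in the same group.
groups = minimize_groups({0, 1, 2}, "ab", delta, {1, 2})
print(groups[1] == groups[2])  # True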
The "runtime" for input acceptance will be the same, since exactly one character of the input is consumed per step. I have never heard the notion of "runtime" (in the sense of asymptotic runtime complexity) used in the context of DFAs; minimization aims at minimizing the number of states, i.e. at optimizing the "implementation size" of the DFA.
I'm planning to use JavaDB (Derby) or PostgreSQL.
I have the following problem: I need to store a large set of vectors. Currently all vectors contain a fixed number of elements. Hence the appropriate way of storing the set is using one row per vector and a column per element. However, the number of elements might change over time. Additionally, in my case, from a software engineering perspective, having a fixed number of columns reflects knowledge about a software component which the general model should be unaware of.
Therefore I'm thinking about "linearizing" the layout and use a general table that stores elements instead of vectors.
The first element of vector 5 could then be queried like this:
SELECT value FROM elements WHERE v_id = 5 AND e_id = 1;
In general, I do not need full table reads, and only a relatively small subset of the vectors is accessed.
Maybe database-savvy people can judge what the performance impact will be?
Many thanks in advance.
This is a variant of what's referred to in general database terms as Entity-Attribute-Value or EAV design. It's a bit of a relational database design anti-pattern and should be avoided in most cases. Performance tends to be poor due to the need for many self-joins, and queries are ugly at best.
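To make the self-join problem concrete, here is a throwaway example (SQLite only because it needs no server; the table and column names follow your question, the data is invented):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE elements (v_id INTEGER, e_id INTEGER, value REAL)")
con.executemany("INSERT INTO elements VALUES (?, ?, ?)",
                [(5, 1, 0.1), (5, 2, 0.2), (5, 3, 0.3)])

# Reassembling one 3-element vector as a single row already needs two self-joins;
# with wider vectors the join chain (and the query plan) keeps growing.
row = con.execute("""
    SELECT e1.value, e2.value, e3.value
    FROM elements e1
    JOIN elements e2 ON e2.v_id = e1.v_id AND e2.e_id = 2
    JOIN elements e3 ON e3.v_id = e1.v_id AND e3.e_id = 3
    WHERE e1.v_id = 5 AND e1.e_id = 1
""").fetchone()
print(row)   # (0.1, 0.2, 0.3)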
In PostgreSQL, look into the intarray extension; it should solve your problem pretty ideally if the values are simple integers. Otherwise, consider PostgreSQL's standard array types. They've got their own issues, but are generally a lot better than EAV, though they're not lovely to work with from JDBC.
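A rough sketch of the array-based layout, from Python rather than JDBC (this assumes a running PostgreSQL server and the psycopg2 driver; the connection string and names are placeholders):

import psycopg2

# Placeholder connection string -- adjust for your environment.
conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

# One row per vector; the elements live in a single array column.
cur.execute("CREATE TABLE IF NOT EXISTS vectors (v_id integer PRIMARY KEY, "
            "elems double precision[])")
cur.execute("INSERT INTO vectors VALUES (5, ARRAY[0.1, 0.2, 0.3])")

# PostgreSQL arrays are 1-indexed: elems[1] is the first element of vector 5.
cur.execute("SELECT elems[1] FROM vectors WHERE v_id = 5")
print(cur.fetchone()[0])   # 0.1
conn.commit()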
Otherwise, if all you're storing is these vectors, maybe consider a non-relational DB.
My question is about sizing columns in SQL: should the size of an nvarchar (varchar) column be a power of 2 (32, 64, 128), or does it not matter, so that we can simply use round numbers such as 100 or 50?
Thank you very much for answers with reasons.
Greetings to all.
It doesn't make any difference. Use the size appropriate for your data.
For instance, in SQL Server, if you look at the Anatomy of a Record you'll see that the declared size translates into record offsets that depend on the preceding data in the record, null values and other factors, especially once row compression and page compression are taken into account. By the time the field is accessed, any resemblance to the originally declared size, vis-a-vis powers of 2 or powers of 10, is long gone. Various elements higher up the query execution stack, such as join operators or sort operators, also would not benefit from power-of-2 sizes (I have no 'proof' links, but it's OK if you take my word for it...). Neither does the TDS protocol when marshaling data back to the client, and I see little benefit on the client side either.
There's no reason to use powers of 2. Set the field to match the estimated size of your data.
One number that is worth mentioning, though, is 255. Some database systems have a maximum varchar size of 255, though this is becoming rarer; I'm thinking mainly of what are now very old versions of MySQL. So sometimes developers will set the column size at 255 or lower to ensure more portability.
I vote that it does not matter. Pick what makes the most sense for your application. Use human readable values. Pick nice names for variables and columns. Only in really extreme cases will you need to tune. When you find out that you need to tune, tune. Until then, go with whatever makes the most sense from a business or human perspective.
There's no benefit in having size of (N)VARCHAR columns be a power of 2. Use whatever is suitable for your domain model.
It doesn't matter; the values of all table columns are structured by the engine to fit together on the physical page.
Is it when you're trying to get data and there is no apparent easy way of doing it?
When you find that something should be a table on its own?
What are the laws?
Check out Wikipedia. The article on database normalization covers the different normal forms (first, second, third, etc.). Most of the time you should be aiming for at least third normal form. There are times when you want to relax the rules a bit (it may be too expensive to join multiple tables together, so you might want to de-normalize a bit), but for the most part third normal form is good.
When you notice you have to repeat the same data, or when you start using single fields as arrays.
This is a somewhat snarky answer, but: when you discover that the data isn't sufficiently normalized. There are many resources on the web about the levels (or, more properly, "forms") of normalization, and they describe the forms more completely than I could here. First and second normal forms should be pretty much required. If you aren't at third (or, really, fourth) normal form, you need to have a strong justification as to why.
Check out the Wikipedia article on database normalization.
When you're starting to question whether an SQL database needs more normalization.
Whenever you have a relational database.... <grin/>
No, actually there are laws; check out the Wikipedia article on database normalization.
They are called the normal forms (five main ones, plus a few more). They originate with E. F. Codd, who invented the relational model of databases around 1970.
"The key, the whole key, and nothing but the key, so help me Codd."
This is a synopsis:
First normal form (1NF): The table faithfully represents a relation and has no repeating groups.
Second normal form (2NF): No non-prime attribute in the table is functionally dependent on a part (proper subset) of a candidate key.
Third normal form (3NF): Every non-prime attribute is non-transitively dependent on every key of the table.
Boyce-Codd normal form (BCNF): Every non-trivial functional dependency in the table is a dependency on a superkey.
Fourth normal form (4NF): Every non-trivial multivalued dependency in the table is a dependency on a superkey.
Fifth normal form (5NF): Every non-trivial join dependency in the table is implied by the superkeys of the table.
Domain/key normal form (DKNF): Every constraint on the table is a logical consequence of the table's domain constraints and key constraints.
Sixth normal form (6NF): The table features no non-trivial join dependencies at all (with reference to a generalized join operator).
Other people have pointed you to the formal rules for normalization. Here are some informal guidelines I use:
If you have columns in a table whose names differ only by a number (e.g. Phone1 and Phone2; there is a small sketch of this case after the list).
If you have any columns in a table that should be filled in only when another column in the table is filled in.
If updating a "fact" in the database (such as a street address) requires more than one UPDATE.
If the same question could ever get two different answers depending on which table you get your information from.
If the answer to any non-trivial question can be gotten from the database without JOINing at least two tables.
If you have any quantity-based restrictions in the database other than "only 1 of something is allowed" (that is, "only one address is allowed" is okay, but "only two addresses are allowed" indicates a normalization problem).
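As a toy illustration of the first guideline (the schema and data are invented; SQLite is used only so the snippet runs standalone):

import sqlite3

con = sqlite3.connect(":memory:")

# Smell: a repeating group encoded as numbered columns.
con.execute("CREATE TABLE person_flat (id INTEGER PRIMARY KEY, name TEXT, "
            "phone1 TEXT, phone2 TEXT)")

# Normalized: the repeating group becomes rows in its own table,
# so "how many phones are allowed" is no longer baked into the schema.
con.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE phone (person_id INTEGER REFERENCES person(id), "
            "number TEXT)")
con.execute("INSERT INTO person VALUES (1, 'Alice')")
con.executemany("INSERT INTO phone VALUES (1, ?)", [("555-0100",), ("555-0101",)])
print(con.execute("SELECT number FROM phone WHERE person_id = 1").fetchall())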
3NF is generally all you need and it follows three rules:
Every column in the table should be dependent on:
the key (1NF),
the whole key (2NF),
and nothing but the key (3NF) (so help me Codd is the way that quote usually ends).
You can often "downgrade" to 2NF for performance reasons, provided you understand the implications and only do so when you strike problems, but 3NF should be the initial goal for all your designs.
As everyone else has said, you know when you start having (too many) duplicate columns in multiple tables.
That being said, it is sometimes useful to have redundant columns across multiple tables. This can reduce the number of JOINs you have to do in complicated queries. Just be careful to keep all the tables in sync, or you're just asking for trouble.
This is a pretty good article. Getting normal is a science, not an art. Now knowing when to DEnormalize... that's an art.
http://www.alvechurchdata.co.uk/hints-and-tips/softnorm.html
See Description of the database normalization basics
What level of normalization are you currently at? If you can't answer that, I assume your database is a nasty mess. I always hit third normal form on the initial design and de-normalize or normalize further if and when needed.
I assume you're talking about a transactional database supporting an interactive application, but for what it's worth...
OLAP databases used exclusively for reporting and only updated by ETL processes may benefit from a less normalized structure. In these applications you accept the cost of redundant data storage and duplication for the performance benefit of fewer joins and the increased ease of use for (sometimes less technical) data analysts and business analysts.
Transactional databases should always be normalized to the extent practical (at least 3NF) and then selectively denormalized only as needed. And the need to denormalize should ideally be based on actual performance testing results.
When you have to search through huge amounts of data just to extract some basic info, e.g. what product categories there are, or something like that.