Dummy Variable Trap and removing one column - pandas

Can anyone explain to me exactly what is meant by the dummy variable trap? And why do we want to remove one column to avoid that trap? Please provide me some links or explain this. I am not clear about this process.

In regression analysis there's often talk about the issue of multicollinearity, which you might be familiar with already. The dummy variable trap is simply perfect collinearity between two or more variables. It can arise if, for one binary variable, two dummies are included: imagine that you have a variable x which is equal to 1 when something is True. If you were to include x in your regression model along with another variable z that is the opposite of x (i.e. 1 when that same thing is False), you would have two perfectly negatively correlated variables.
Here's a simple demonstration. Let's say your x is one column with True/False values in a pandas DataFrame. See what happens when you use pd.get_dummies(df.x) below. The two dummies that are created mirror each other, so one of them is redundant. In simpler terms, you only need one of them, since you can always infer the value of the other from the one you have.
import pandas as pd
df = pd.DataFrame({'x': [True, False]})
pd.get_dummies(df.x)
   False  True
0      0     1
1      1     0
The same applies if you have a categorical variable that can take on more than two values. Whether binary or not, there is always a "base scenario" that is fully determined by the variation in the other case(s). This "base scenario" is therefore redundant and will only introduce perfect collinearity into the model if included.
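In practice you can let pandas drop the redundant column for you: get_dummies has a drop_first argument that keeps k-1 dummies for a k-level variable, with the dropped level acting as the base scenario. Applied to the same toy example:
import pandas as pd
df = pd.DataFrame({'x': [True, False]})
# drop_first=True keeps only k-1 dummies;
# the dropped level becomes the implicit base scenario
pd.get_dummies(df.x, drop_first=True)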
So what's the issue with multicollinearity/linear dependence? The short answer is that if there is imperfect multicollinearity among your explanatory variables, your estimated coefficients can be distorted/biased. If there is perfect multicollinearity (which is the case with the dummy variable trap), you can't estimate your model at all. Think of it like this: if a variable can be perfectly explained by another variable, then your sample data only contains valuable information about one truly unique variable, not two, so it is impossible to obtain two separate coefficient estimates for what is effectively the same variable.
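To make the "can't estimate your model at all" point concrete, here is a small sketch (NumPy, not from the original answer): with an intercept plus both dummies, the design matrix is rank deficient, so ordinary least squares has no unique solution.
import numpy as np
x = np.array([1, 0, 1, 0])                 # dummy for "True"
z = 1 - x                                  # dummy for "False", perfectly collinear with x
X = np.column_stack([np.ones(4), x, z])    # intercept + both dummies
print(np.linalg.matrix_rank(X))            # prints 2, not 3 -> X'X is singular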
Further Reading
Multicollinearity
Dummy Variable Trap


Multi-objective optimization but the function equation is unknown?

Firstly, I am totally out of my expertise zone so please bear with me.
I developed a fluid dynamic engine with 5 exposed parameters (say A, B, C, D, E). When you give this engine these 5 parameters, it does its magic and gives out a value Z.
I want to write a script which can explore which combinations of A-E give the lowest (or close to the lowest) value of Z.
I know optimization algorithms exist, but all the examples I have found use an explicit function.
So I guess my function would simply be "minimize Z"? But where do A-E go?
Not really an answer, but some questions and ideas that might help you think through the best way to address this. We have no understanding of how big a range of values needs to be explored for those parameters, or how Z behaves, so this is very vague...
If you look at the values of Z for given values of A...E, does the value of Z jump around a lot for small changes in the parameter values, or does the Z value change reasonably smoothly?
If the Z value is not too erratic you could try some kind of gradient descent approach, using calculated values of Z at nearby parameter values to approximate the gradient. Suppose changing the value of A from 1 to 2 gives a bigger improvement in Z than a similar-sized change in the other parameters; then try other values of A while keeping the other parameters fixed until you find the value of A that gives the best Z. Then try changing the other parameter values to see which one gives the steepest descent, and find a better value for that parameter. Repeat this process until you can't find any improvement, and you will have found a (local) minimum. You could then start at a different place in your parameter space and try again - you will probably find several local minima, and may just choose the best of those. Not provably optimal, but it may be good enough. Of course you can get clever and use things like conjugate gradients, Newton-Raphson or similar if Z is smooth enough.
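A rough sketch of that one-parameter-at-a-time idea (evaluate_z here is a hypothetical stand-in for a call to your engine, not code from it):
# Hypothetical stand-in for the engine: replace the body with the real call that returns Z.
def evaluate_z(params):
    return sum((p - i) ** 2 for i, p in enumerate(params))

def coordinate_search(start, step=0.5, n_sweeps=20):
    # Nudge each of A..E up and down, keep any move that lowers Z,
    # and shrink the step once no single-parameter move improves.
    params = list(start)
    best = evaluate_z(params)
    for _ in range(n_sweeps):
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                z = evaluate_z(trial)
                if z < best:
                    params, best, improved = trial, z, True
        if not improved:
            step /= 2
    return params, best

print(coordinate_search([0.0] * 5))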
If the Z values are very erratic, then you might have to just sample the possible combinations of A...E to get values of Z and choose the best you can find. Again you might do that in some systematic way (e.g. points on a grid in your parameter space), entirely at random, or a combination of both.
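A minimal random-sampling version of that black-box idea (evaluate_z and the bounds are again made up for illustration):
import random

def evaluate_z(params):                    # hypothetical stand-in for the engine
    return sum((p - 3.0) ** 2 for p in params)

bounds = [(0.0, 10.0)] * 5                 # invented ranges for A..E

best_params, best_z = None, float('inf')
for _ in range(1000):
    candidate = [random.uniform(lo, hi) for lo, hi in bounds]
    z = evaluate_z(candidate)
    if z < best_z:
        best_params, best_z = candidate, z
print(best_params, best_z)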
If you find that there are 'clusters' of good solutions with similar values of the parameters then maybe some kind of local search would help - the idea is that there is often a better solution in the local neighbourhood of a known good solution. So maybe try perturbing your parameter values a bit from a known solution to see if that can lead to a better solution - either by some gradient descent method or by random sampling.
Unfortunately, if your Z calculation is complex, then any method using it as a black box will likely be slow as it will need to be re-evaluated many times.
You could use a Genetic Algorithm, where each chromosome is formed from the 5 candidate values of the variables you have to optimize (to minimize Z), and the optimization/fitness "function" is the simulation itself outputting Z.
Other viable alternatives are the Particle Swarm Optimization algorithm or Ant Colony Optimization. All of these are usable algorithms for this kind of optimization problem.
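If you would rather not code the evolutionary loop yourself, SciPy ships a related population-based method, differential evolution, that treats the objective as a black box. A small sketch, with a hypothetical objective and invented bounds standing in for the engine:
from scipy.optimize import differential_evolution

def evaluate_z(params):                    # hypothetical stand-in for the engine
    A, B, C, D, E = params
    return (A - 1) ** 2 + (B - 2) ** 2 + (C - 3) ** 2 + (D - 4) ** 2 + (E - 5) ** 2

bounds = [(0.0, 10.0)] * 5                 # invented ranges for A..E

result = differential_evolution(evaluate_z, bounds, seed=0)
print(result.x, result.fun)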

What is a use case of `SSWAP`?

In doing some stuff with BLAS operations I see the level 1 operation SSWAP.
I can't come up with a programming use case for this.
My thinking is, if you were passing y to a function but wanted it to have the values of x, why not simply pass x? Swapping the values seems rather convoluted.
This is just a question out of curiosity.
Sometimes swapping the contents of two (strided) vectors is exactly what you need. For instance, when doing row or column interchanges in pivoting during LU factorization -- the reference LAPACK uses xSWAP in xGBTRF. The pivoting algorithm for LU decomposition requires swapping the contents of two rows (or columns). These two rows (or columns) can be thought of as two vectors, possibly with non-unit stride between the elements. One needs to do many such interchanges along the way, and they change gradually, so there is no option to "just pass some other row to a function" at the end of the algorithm.
To sum up, as a basic building block of more complex algorithms, a (potentially) optimized routine for interchanging columns or rows of a matrix seems useful.
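If you want to try it from Python, SciPy exposes the low-level BLAS wrappers; a tiny sketch using sswap on single-precision vectors (the wrapper returns the swapped arrays):
import numpy as np
from scipy.linalg import blas

x = np.array([1.0, 2.0, 3.0], dtype=np.float32)
y = np.array([9.0, 8.0, 7.0], dtype=np.float32)

# SSWAP exchanges the contents of the two vectors
x, y = blas.sswap(x, y)
print(x)   # [9. 8. 7.]
print(y)   # [1. 2. 3.]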

python - pandas - dataframe - data padding multidimensional statistics

I have a dataframe with columns accounting for different characteristics of stars and rows accounting for measurements of different stars (something like this):
property    A    A_error    B    B_error    C    C_error    ...
star1
star2
star3
...
In some measurements the error for a specific property is -1.00, which means the measurement was faulty.
In such a case I want to discard the measurement.
One way to do so is by eliminating the entire row (along with the other properties whose error was not -1.00).
I think it's possible instead to fill in the faulty measurement with a value generated from the distribution of all the other measurements, meaning: given the other properties, which are fine, this property should take the value that reduces the error of the entire dataset.
Is there a proper name for the idea I'm referring to?
How would you apply such an algorithm?
I'm a student on a solo project, so I would really appreciate answers that also elaborate on the theory (:
Edit
After further reading, I think what I was referring to is called regression imputation.
So I guess my question is: how can I implement multidimensional linear regression on a dataframe in the most efficient way?
Thanks!
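Not a full answer, but one way to experiment with regression imputation is scikit-learn's IterativeImputer, which models each column with missing values as a regression on the other columns. A sketch with invented values matching the layout above (the -1.00 sentinels are first turned into NaN):
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# toy data shaped like the example above (values invented for illustration)
df = pd.DataFrame({
    'A': [1.2, 3.4, 2.2], 'A_error': [0.1, -1.00, 0.2],
    'B': [5.0, 4.1, 6.3], 'B_error': [0.3, 0.2, -1.00],
})

# treat measurements whose error is -1.00 as missing
for prop in ['A', 'B']:
    df.loc[df[prop + '_error'] == -1.00, prop] = np.nan

# each property with missing values is regressed on the other properties,
# iteratively, and the predictions fill the gaps
imputed = IterativeImputer(random_state=0).fit_transform(df[['A', 'B']])
print(imputed)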

how to get surrogate variables in rpart

I have looked everywhere I can, but I couldn't find an answer to my question regarding the rpart package.
I have built a regression tree using rpart, and I have around 700 variables. I want to get the variables actually used to build the tree, including the surrogates. I can find the actual variables used with tree$variable.importance, but I also have to get the surrogates because I need them to predict on the test set I have. I do not want to keep all 700 variables in the test set, as I have very big data (20 million observations) and I am running out of memory.
The list variable.importance in an rpart object does show the surrogate variables, but it only shows the top variables limited by a minimum importance value.
The matrix splits in an rpart object lists all of the split variables and their surrogate variables, along with some other data such as the index, the value on which it splits (for a continuous variable) or the categories that are split (for a categorical variable), and a count of how many observations the split applies to. It doesn't give a hierarchy of which surrogates apply to which split, but it does list every variable. To get the hierarchy, you have to use summary(rpart_object).

Why does pandas.apply() work differently for Series and DataFrame columns

Apologies if this is a silly question, but I am not quite sure why this behavior is the case, and/or whether I am misunderstanding it. I was trying to create a function for the 'apply' method, and noticed that if you run apply on a Series, the series is passed to the (u)func as an np.array, whereas if you pass the same series within a DataFrame of one column, it is passed as a Series.
This affects the way a simpleton like me writes the function (I prefer iloc indexing to integer-based indexing on the array), so I was wondering whether this is on purpose, or a historical accident?
Thanks,