Correlation between continuous independent variable and binary class dependent variable - pandas

Can someone please tell me whether it is correct to compute the correlation between a dependent variable with binary classes (0 or 1) and independent variables with continuous values using pandas df.corr()?
I do get correlation output when I use it, but I want to understand whether it is statistically sound to compute a Pearson correlation (via df.corr()) between a binary categorical output and continuous input variables.

Pearson correlation is for continuous data. If one variable is categorical (binary) and the other is continuous, you should use ANOVA to assess the relationship between the variables (reference).
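For instance, a one-way ANOVA compares the means of the continuous variable across the two classes. A minimal sketch with SciPy, assuming a DataFrame df with a continuous column 'x' and a binary target 'y' (both names are placeholders):

from scipy import stats

# Split the continuous values by the binary class (0/1).
group0 = df.loc[df['y'] == 0, 'x']
group1 = df.loc[df['y'] == 1, 'x']

# One-way ANOVA: a small p-value suggests the group means differ.
f_stat, p_value = stats.f_oneway(group0, group1)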

How to Set the Same Categorical Codes to Train and Test data? Python-Pandas

NOTE:
If someone else is wondering about this topic, I understand you're getting deeper into the data analysis world, and I asked this question back then to learn the following:
You encode categorical values as INTEGERS only if you're dealing with ordinal classes, e.g. college degrees or customer-satisfaction survey responses.
Otherwise, if you're dealing with nominal classes like gender, colors, or names, you MUST convert them with other methods, since they do not imply any numerical order; the best known are one-hot encoding and dummy variables.
I encourage you to read more about them, and I hope this has been useful.
Check the link below to see a nice explanation:
https://www.youtube.com/watch?v=9yl6-HEY7_s
This may be a simple question but I think it can be useful for beginners.
I need to run a prediction model on a test dataset, so to convert the categorical variables into categorical codes that the random forest model can handle, I use these lines on all of them:
Train:
data_['Col1_CAT'] = data_['Col1'].astype('category')  # convert column to pandas categorical dtype
data_['Col1_CAT'] = data_['Col1_CAT'].cat.codes  # replace each category with its integer code
So, before running the model, I have to apply the same procedure to both the train and test data.
And since both datasets have the same categorical variables/columns, I think it would be useful to apply the same categorical codes to each column respectively.
However, although I'm handling the same variables in each dataset, I get different codes every time I use these two lines.
So my question is: how can I get the same codes every time I convert the same categoricals in each dataset?
Thanks for your insights and feedback.
Usually, I do the categorical conversions before the train-test split, so that I get one consistently transformed dataset. Once that is done, I do the train-test split, train the model, and evaluate it on the test set.
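Alternatively, you can fix the category set once and reuse it for both DataFrames, so the codes always match. A minimal sketch, assuming DataFrames train and test that share a column 'Col1' (names are placeholders):

import pandas as pd

# Derive the category set once, here from the training data.
categories = train['Col1'].astype('category').cat.categories

# Encode both frames against the same fixed categories.
train['Col1_CAT'] = pd.Categorical(train['Col1'], categories=categories).codes
test['Col1_CAT'] = pd.Categorical(test['Col1'], categories=categories).codes

# Note: test-set values that never appear in the training data get code -1.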

Dealing with both categorical and numerical variables in a Multiple Linear Regression Python

So I have already performed a multiple linear regression in Python using LinearRegression from sklearn.
My independent variables were all numerical (and so was my dependent one).
But now I'd like to perform a multiple linear regression combining numerical and non-numerical independent variables.
Therefore I have several questions:
If I use dummy variables or One-Hot for the non-numerical ones, will I then be able to perform the LinearRegression from sklearn?
If yes, do I have to change some parameters?
If not, how should I perform the Linear Regression?
One thing that bothers me is that dummy/one-hot methods don't deal with ordinal variables, right? (Because those shouldn't be encoded the same way, in my opinion.)
The problem is: even if I want to encode nominal and ordinal variables differently, it seems impossible for Python to tell the difference between the two on its own.
This stuff might be easy for you, but right now, as you can tell, I'm a little confused, so I could really use your help!
Thanks in advance,
Alex
If I use dummy variables or One-Hot for the non-numerical ones, will I then be able to perform the LinearRegression from sklearn?
In fact, the model has to be fed exclusively numerical data, so you must use one-hot vectors for the categorical data in your input features. For that, you can take a look at Scikit-Learn's LabelEncoder and OneHotEncoder.
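A minimal sketch with pandas and scikit-learn, assuming a DataFrame df with a nominal column 'color', a numerical column 'size', and a target 'y' (all hypothetical names):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                   'size': [1.0, 2.0, 3.0, 4.0],
                   'y': [1.5, 2.5, 3.5, 4.5]})

# get_dummies replaces the nominal column with 0/1 indicator columns,
# so no LinearRegression parameters need to change.
X = pd.get_dummies(df[['color', 'size']], columns=['color'])
model = LinearRegression().fit(X, df['y'])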
One thing that bothers me is that dummy/one-hot methods don't deal with ordinal variables, right? (Because those shouldn't be encoded the same way, in my opinion.)
Yes, as you mention, one-hot methods don't deal with ordinal variables. One way to work with ordinal features is to create a scale map and map those features onto that scale. An ordinal encoder (for example, OrdinalEncoder from the category_encoders package) is a very useful tool for these cases: you can feed it a mapping dictionary that follows a predefined scale, as mentioned. Otherwise, it simply assigns integers to the different categories, as it has no knowledge from which to infer any order. From the documentation:
Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in, in this case we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.
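A minimal sketch of a scale map with plain pandas, using a hypothetical 'degree' column and an assumed ordering:

import pandas as pd

df = pd.DataFrame({'degree': ['BSc', 'PhD', 'MSc', 'BSc']})

# The scale map encodes the assumed order BSc < MSc < PhD.
scale_map = {'BSc': 1, 'MSc': 2, 'PhD': 3}
df['degree_encoded'] = df['degree'].map(scale_map)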
Hope this helps.

A tricky graph to solve in TensorFlow

As shown below, I built a graph with two big variables and two input placeholders.
Each time, I want to use the current values of the variables (partial values) and the input placeholders to calculate delta values. The delta values are then applied to the variables using scatter_add.
Problem: the two computation paths are not the same; one needs more computation. The TensorFlow execution engine seems to prefer one of the paths arbitrarily: it solves one path, then the other. For example, TF may update variable 0 first, then use this new variable 0 to solve the other path (updating variable 1). This is not what I need.
So, any ideas?
[Figure: the TensorFlow graph]
I found the solution: using tf.control_dependencies() solves this problem.
https://www.tensorflow.org/api_docs/python/tf/control_dependencies
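For reference, a minimal sketch of the idea (all variable, placeholder, and helper names below are hypothetical, and the TF 1.x graph-mode API from the question is assumed): compute both deltas from the current variable values first, then let the scatter_add updates run.

import tensorflow as tf

# Both deltas read the *current* values of var0 and var1
# (compute_delta0/compute_delta1 stand in for the two computing paths).
delta0 = compute_delta0(var0, var1, input0)
delta1 = compute_delta1(var0, var1, input1)

# Ops created inside this block run only after both deltas are computed,
# so neither update can change what the other path reads.
with tf.control_dependencies([delta0, delta1]):
    update0 = tf.scatter_add(var0, indices0, delta0)
    update1 = tf.scatter_add(var1, indices1, delta1)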

How to effectively use knn in Stata

I have two questions about executing discrim knn in Stata.
1) How do you properly code the command? I've tried various versions but always seem to get an error that too many variables are specified.
The vector with the correct result is buy.
I am trying: discrim knn buy, group(train test) k(1)
2) My understanding was that factor (binary) variables were fine for KNN, even encouraged. However, I get the error message that factor variables and time-series operators are not allowed.
Lastly, though I know this isn't the best place for this question: should each vector be normalized for KNN? I've heard conflicting answers.
I'm guessing that the error you're getting is
group(): too many variables specified
This is because you can only group by one variable with knn. knn performs discriminant analysis based on a single grouping variable; in your case, that variable distinguishes the training data from the test data. I imagine your train and test variables are binary, in which case using only one of them is enough, as they are merely logical opposites of each other: a single variable carries enough information to distinguish the two groups.

How to plot a Pearson correlation given a time series?

I am using the code from this website http://blog.chrislowis.co.uk/2008/11/24/ruby-gsl-pearson.html to compute a Pearson correlation given two time series, like so:
require 'gsl'
pearson_correlation = GSL::Stats::correlation(
  GSL::Vector.alloc(first_metrics), GSL::Vector.alloc(second_metrics)
)
This returns a number such as -0.2352461593569471.
I'm currently using the highcharts library and am feeding it two sets of timeseries data. Given that I have a finite time series for both sets, can I do something with this number (-0.2352461593569471) to create a third time series showing the slope of this curve? If anyone can point me in the right direction I'd really appreciate it!
No, the correlation coefficient doesn't tell you anything about the slope of the line of best fit. It just tells you approximately how much of the variability in one variable (or one time series, in this case) can be explained by the other (more precisely, the square of the correlation does). There is a reasonably good description here: http://www.graphpad.com/support/faqid/1141/.
How you deal with the data in your specific case depends heavily on what you're trying to achieve. Are you trying to show that variable X causes variable Y? If so, you could start by dropping the time-series aspect, treating the data as paired values, and using linear regression. If you're trying to find a model of how X and Y vary together over time, you could look at multivariate linear regression (I'm not very familiar with this, though).
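If the slope is what you're after, a regression gives it directly. As an illustration (in Python rather than Ruby, purely because it is compact; first_metrics and second_metrics are assumed to be plain numeric sequences):

from scipy import stats

# linregress fits y = slope * x + intercept; its rvalue is exactly
# the Pearson correlation coefficient between the two inputs.
result = stats.linregress(first_metrics, second_metrics)
print(result.slope, result.intercept, result.rvalue)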