MissingDataError: exog contains inf or nan while trying to fit an OLS model - missing-data

I am trying to fit an OLS model, but it is not working. I get two errors:
MissingDataError: exog contains inf or nan
olsmodel1 is not defined
Below is my code:
olsmodel1 = sm.OLS(y_train, x_train).fit()
print(olsmodel1.summary())
I have checked for missing values using:
df1.isna().sum()
Here is the output:
brand_name 0
os 0
screen_size 0
4g 0
5g 0
main_camera_mp 0
selfie_camera_mp 0
int_memory 0
ram 0
battery 0
weight 0
release_year 0
days_used 0
normalized_used_price 0
normalized_new_price 0
I would appreciate it if someone could help me out.
Thank you.
I was expecting the OLS results to be generated.
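For reference, isna() does not catch infinite values (which also trigger MissingDataError), and string/object columns cause their own errors; a minimal, hypothetical diagnostic along these lines (x_train and y_train as in the code above) might help narrow it down:
import numpy as np
import statsmodels.api as sm

# Hypothetical checks -- not the original poster's code.
print(x_train.dtypes)  # object-dtype columns would need encoding before OLS
numeric = x_train.select_dtypes(include="number")
print(np.isinf(numeric).sum())  # per-column count of +/-inf values

# One possible cleanup: drop rows containing inf or NaN before fitting.
x_clean = numeric.replace([np.inf, -np.inf], np.nan).dropna()
y_clean = y_train.loc[x_clean.index]
olsmodel1 = sm.OLS(y_clean, x_clean).fit()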

Related

Sklearn only predicts one class while dataset is fairly balanced (±80/20 split)

I am trying to come up with a way to find the most influential factors behind a person not paying back a loan (defaulting). I have worked with the sklearn library quite intensively, but I feel like I am missing something quite trivial...
The dataframe looks like this:
0 7590-VHVEG Female Widowed Electronic check Outstanding loan 52000 20550 108 0.099 288.205374 31126.180361 0 No Employed No Dutch No 0
1 5575-GNVDE Male Married Bank transfer Other 42000 22370 48 0.083 549.272708 26365.089987 0 Yes Employed No Dutch No 0
2 3668-QPYBK Male Registered partnership Bank transfer Study 44000 24320 25 0.087 1067.134272 26678.356802 0 No Self-Employed No Dutch No 0
The distribution of the "DefaultInd" column (target variable) is this:
0 0.835408
1 0.164592
Name: DefaultInd, dtype: float64
I have label-encoded the data to make it look like this:
CustomerID Gender MaritalStatus PaymentMethod SpendingTarget EstimatedIncome CreditAmount TermLoanMonths YearlyInterestRate MonthlyCharges TotalAmountPayments CurrentLoans SustainabilityIndicator EmploymentStatus ExistingCustomer Nationality BKR_Registration DefaultInd
0 7590-VHVEG 0 4 2 2 52000 20550 108 0.099 288.205374 31126.180361 0 0 0 0 5 0 0
1 5575-GNVDE 1 1 0 1 42000 22370 48 0.083 549.272708 26365.089987 0 1 0 0 5 0 0
2 3668-QPYBK 1 2 0 4 44000 24320 25 0.087 1067.134272 26678.356802 0 0 2 0 5 0
After that I removed NaNs and cleaned it up some more (removing capitalization, punctuation, etc.).
After that, I try to run this cell:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
y = df['DefaultInd']
X = df.drop(['CustomerID','DefaultInd'],axis=1)
X = X.astype(float)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))
Which results in this:
precision recall f1-score support
0 0.83 1.00 0.91 1073
1 0.00 0.00 0.00 213
accuracy 0.83 1286
macro avg 0.42 0.50 0.45 1286
weighted avg 0.70 0.83 0.76 1286
As you can see, the "1" class never gets predicted. I am wondering whether or not this behaviour is to be expected (I think it is not). I tried using class_weight='balanced', but that resulted in an average f1 score of 0.59 (instead of 0.76).
I feel like I am missing something, or is this kind of behaviour expected and should I rebalance the dataset before fitting? The split does not feel that skewed (±80/20), so there should not be this big of a problem.
Any help would be more than appreciated :)
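For what it's worth, here is a minimal sketch of the class_weight variant mentioned above, combined with feature scaling (unscaled columns such as EstimatedIncome can keep the default solver from converging); this is an illustration, not code from the question:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical variant: scale features, weight classes inversely to frequency,
# and give the solver more iterations.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))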

Handling features with multiple values per instance in Python for Machine Learning model

I am trying to handle my data set, which contains some features that have multiple values per instance, as shown in the image:
https://i.stack.imgur.com/D78el.png
I am trying to split each cell on the '|' symbol so I can apply one-hot encoding, but I can't find a suitable solution to my problem.
My idea is to keep all the multiple values in one row, or in other words to convert each cell to a list of integers.
Maybe this is what you want:
import pandas as pd
df = pd.DataFrame(['465', '444', '465', '864|857|850|843'], columns=['genre_ids'])
df
genre_ids
0 465
1 444
2 465
3 864|857|850|843
df['genre_ids'].str.get_dummies(sep='|')
444 465 843 850 857 864
0 0 1 0 0 0 0
1 1 0 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 1 1 1
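If the indicator columns should live alongside the original frame, one small illustrative addition (not part of the original answer) is to join them back on the index:
df = df.join(df['genre_ids'].str.get_dummies(sep='|'))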

To One-Hot encode or not to One-Hot encode

My data set has the day of the week number (Mon = 1, Tue = 2, Wed = 3 ...)
My data look like this
WeekDay Col1 Col2 Target
1 2.2 8 126
6 3.5 4 354
1 8.0 2 322
3 7.2 4 465
7 3.2 5 404
6 3.8 3 134
1 3.6 5 455
1 5.5 8 345
6 7.0 6 442
Shall I one-hot encode WeekDay so it will look like this?
WeekDay Col1 Col2 Target Mo Tu We Th Fr Sa Su
1 2.2 8 126 1 0 0 0 0 0 0
6 3.5 4 354 0 0 0 0 0 1 0
1 8.0 2 322 1 0 0 0 0 0 0
3 7.2 4 465 0 0 1 0 0 0 0
7 3.2 5 404 0 0 0 0 0 0 1
6 3.8 3 134 0 0 0 0 0 1 0
1 3.6 5 455 1 0 0 0 0 0 0
1 5.5 8 345 1 0 0 0 0 0 0
6 7.0 6 442 0 0 0 0 0 1 0
I am going to use Random Forest
You should not use one-hot encoding, since you are using a random forest model. An RF model can find the patterns from label encoding as well, and RF models generally perform worse with one-hot encoding because a tree may end up dropping some of the day columns when splitting. One-hot encoding also introduces the curse of dimensionality into your data, which is never good.
One-hot encoding is better for methods like linear regression or logistic regression, where a label-encoded 1 (Monday) and 6 (Saturday) would be treated as magnitudes and so get different weight, because these models multiply the feature value by a coefficient.
Generally, it's preferable to use one-hot encoding before using a random forest. If this is the only categorical variable in your dataset, go for one-hot encoding. If you use R's random forest then, as far as I know, the library deals with categorical variables itself; for scikit-learn that's not the case and you have to one-hot encode yourself. There is a trade-off: one-hot encoding introduces sparsity, which is undesirable for tree-based models if the cardinality of the categorical variable is big, in other words if there are many unique values in the categorical variable. Python's CatBoost, however, deals with categorical variables natively.
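As an illustration of the one-hot variant discussed above, here is a minimal sketch with pandas; the WeekDay column and df come from the question, while the mapping of numbers to day labels is assumed:
import pandas as pd

# Hypothetical sketch: expand WeekDay (1-7) into day-of-week indicator columns.
day_names = {1: 'Mo', 2: 'Tu', 3: 'We', 4: 'Th', 5: 'Fr', 6: 'Sa', 7: 'Su'}
dummies = pd.get_dummies(df['WeekDay'].map(day_names))
# Reindex so all seven days appear even if some are absent from the sample.
dummies = dummies.reindex(columns=list(day_names.values()), fill_value=0)
df = pd.concat([df, dummies], axis=1)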

How exactly does feature hashing work?

I have read many online articles on feature hashing of categorical variables for machine learning. Unfortunately, I still couldn't grasp the concept and understand how it works. I will illustrate my confusion through the sample dataset and hashing function below that I grabbed from another site:
>>> data
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 New York 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
5 1.8 Oregon 2003
>>> def hash_col(df, col, N):
...     cols = [col + "_" + str(i) for i in range(N)]
...     def xform(x): tmp = [0 for i in range(N)]; tmp[hash(x) % N] = 1; return pd.Series(tmp, index=cols)
...     df[cols] = df[col].apply(xform)
...     return df.drop(col, axis=1)
The calls below print the transformed output for different numbers of dimensions (or, in other words, hashed features):
>>> print(hash_col(data, 'state',4))
pop year state_0 state_1 state_2 state_3
0 1.5 2000 0 0 1 0
1 1.7 2001 0 0 1 0
2 3.6 2002 0 0 0 1
3 2.4 2001 0 1 0 0
4 2.9 2002 0 1 0 0
5 1.8 2003 0 0 0 1
>>> print(hash_col(data, 'state',5))
pop year state_0 state_1 state_2 state_3 state_4
0 1.5 2000 1 0 0 0 0
1 1.7 2001 1 0 0 0 0
2 3.6 2002 1 0 0 0 0
3 2.4 2001 0 0 1 0 0
4 2.9 2002 0 0 1 0 0
5 1.8 2003 0 0 0 0 1
>>> print(hash_col(data, 'state',6))
pop year state_0 state_1 state_2 state_3 state_4 state_5
0 1.5 2000 0 0 0 0 1 0
1 1.7 2001 0 0 0 0 1 0
2 3.6 2002 0 0 0 0 0 1
3 2.4 2001 0 0 0 1 0 0
4 2.9 2002 0 0 0 1 0 0
5 1.8 2003 0 0 0 0 0 1
What I can't understand is what each of the 'state_0', 'state_1', 'state_2', etc. columns represents. Also, since there are 4 unique states in my dataset (Ohio, New York, Nevada, Oregon), why are all the '1's allocated to just 3 'state_n' columns instead of 4, as in one-hot encoding? For example, when I set the number of dimensions to 6, the '1's only appeared in state_3, state_4 and state_5 (two in each), and there was no '1' in state_0, state_1 or state_2. Any feedback would be greatly appreciated!
Feature hashing is typically used when you don't know all the possible values of a categorical variable in advance. Because of this, we can't create a static mapping from categorical values to columns, so a hash function is used to determine which column each categorical value corresponds to.
This is not the best use case, because we know there are exactly 50 states and could just use one-hot encoding.
A hash function will also have collisions, where different values are mapped to the same output. That is what's happening here: two different state names hash to the same column after the modulus operation in the hash function.
One way to alleviate collisions is to make your feature space (number of columns) larger than the number of possible categorical values.
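To make the collision behaviour concrete, here is a small standalone sketch (not from the question); it uses hashlib as a deterministic stand-in, since the built-in hash() used in hash_col is randomized per process for strings:
import hashlib

def stable_bucket(value, n_buckets):
    # Deterministic replacement for hash(value) % n_buckets in hash_col.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

states = ["Ohio", "New York", "Nevada", "Oregon"]
for n in (4, 5, 6):
    # With only n buckets, two of the four states can land in the same column.
    print(n, {s: stable_bucket(s, n) for s in states})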

Pandas: how to use convert_objects to replace strings with NaN values

This is related to a previous question I've asked, here: Replace any string in columns with 1
However, since that question has been answered long ago, I've started a new question here. I am essentially trying to use convert_objects to replace string values with 1's in the following dataframe (abbreviated here):
uniq_epoch T_Opp T_Eval
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
...
I am using the following code to do this. I've actually tried using this code on the entire dataframe, and have also applied it to a particular column. The result each time is that there is no error message, but also no change to the data (no values are converted to NaN, and the dtype is still 'O').
df = df.convert_objects(convert_numeric = True)
or
df.T_Eval = df.T_Eval.convert_objects(convert_numeric=True)
Desired final output is as follows:
uniq_epoch T_Opp T_Eval
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
...
There may also be a step prior to this, where the strings first become NaN and fillna(1) is then used to insert 1s where the strings were.
I've already searched posts on stackoverflow, and looked at the documentation for convert_objects, but it is unfortunately pretty sparse. I wouldn't have known to even attempt to apply it this way if not for the previous post (linked above).
I'll also mention that there are quite a few strings (codes) in these columns, and that the codes can recombine, so doing this with a dict and replace() would take about as long as doing it by hand.
Based on the previous post and the various resources I've been able to find, I can't figure out why this isn't working - any help much appreciated, including pointing towards further documentation.
This works on 0.13.1 (see the pandas docs for convert_objects).
Maybe you have an older version; IIRC convert_objects was introduced in 0.11.
In [5]: df = read_csv(StringIO(data),sep='\s+',index_col=0)
In [6]: df
Out[6]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
[5 rows x 2 columns]
In [7]: df.convert_objects(convert_numeric=True)
Out[7]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 NaN
2 NaN 0
3 0 0
3 NaN 0
[5 rows x 2 columns]
In [8]: df.convert_objects(convert_numeric=True).dtypes
Out[8]:
T_Opp float64
T_Eval float64
dtype: object
In [9]: df.convert_objects(convert_numeric=True).fillna(1)
Out[9]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
[5 rows x 2 columns]
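As a side note, convert_objects was later deprecated and removed from pandas; on modern versions the same effect can be had with pd.to_numeric and errors='coerce'. A minimal sketch, assuming the same df as above:
import pandas as pd

# errors='coerce' turns anything non-numeric into NaN, matching convert_numeric=True.
converted = df.apply(pd.to_numeric, errors='coerce')
print(converted.fillna(1))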