I am trying to convert the blank values of my csv file to the mean of the columns but it is giving "could not convert string to float: '-' " error - pandas

import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
data = pd.read_csv("austin_weather.csv")
data = data.drop(['Events', 'Date'], axis = 1)
X = data.iloc[:, :-1].values
Y = data.iloc[:, 18].values
data = data.replace('T', 0.0)
imputer = Imputer(missing_values="-", strategy="mean", axis = 0)
imputer.fit(X[:])
The imputer is not able to convert the "-" placeholder values to the mean of the respective column.

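For the deprecated class sklearn.preprocessing.Imputer, the missing_values parameter must be either the string "NaN" or a number; it cannot be "-".
So first replace all "-" values with np.nan on the DataFrame, before extracting X (X is a NumPy array and has no replace method): data = data.replace('-', np.nan). Then call the imputer with missing_values="NaN".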
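The steps above can be sketched as follows. This is a minimal sketch, not the asker's actual data: the two-column frame and its column names are made up, and it assumes a scikit-learn version where SimpleImputer (in sklearn.impute) is available as the replacement for the removed Imputer class.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for a few rows of the weather data.
data = pd.DataFrame({
    "temp": ["74", "-", "70"],
    "precip": ["0.46", "T", "-"],
})

# Replace the placeholders first, then convert the columns to float.
# "T" is mapped to zero, as in the question.
data = data.replace({"T": "0.0", "-": np.nan}).astype(float)

# Only now extract the values and impute the column means.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imputer.fit_transform(data)
```

The key point is the order: the replacement has to happen on the DataFrame before the array is extracted, otherwise the imputer still sees the "-" strings.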

Related

Problem with merging multiple Excel files from Python

When I uncomment the foo lambda and the groupby line, the dtype changes and I get the error: we require a list, but not a 'str'.
What I want in the final output: if the value in the first column (Date, in my case) is the same for several rows, the text from the third column of those rows should be joined into one row, separated by ', '.
import os
import pandas as pd
import dateutil
from pandas import DataFrame
from datetime import datetime, timedelta
data_file_folder = r'.\Data'
df = []
# Load every .xlsx file in the folder into a list of DataFrames
for file in os.listdir(data_file_folder):
    if file.endswith('.xlsx'):
        print('Loading File {0}...'.format(file))
        df.append(pd.read_excel(os.path.join(data_file_folder, file), sheet_name='Sheet1'))
df_master = pd.concat(df,axis=0)
df_master['Date'] = df_master['Date'].dt.date
#foo = lambda a: ", ".join(a)
#df_master = df_master.groupby(by='Date').agg({'Tweet': foo}).reset_index()
#df_master.to_excel(r'.\NewFolder\example.xlsx',index=False)
#df_master
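One possible way to get the comma-joined text per date is sketched below. It is only a sketch under assumptions: the small frame is made up, the column names Date and Tweet are taken from the commented-out code, and casting to str before joining is a guess at what protects the join from non-string cells, not a confirmed diagnosis of the error.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated Excel data.
df_master = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-01-01 08:00", "2021-01-01 17:30", "2021-01-02 09:15"]
    ),
    "Tweet": ["first", "second", "third"],
})
df_master["Date"] = df_master["Date"].dt.date

# Join all tweets that share a date; astype(str) guards against
# non-string cells (numbers, NaN) breaking str.join.
joined = (
    df_master.groupby("Date")["Tweet"]
    .agg(lambda s: ", ".join(s.astype(str)))
    .reset_index()
)
```

Selecting the Tweet column before aggregating keeps the lambda applied to a single Series per group.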

pandas convert_dtypes does not work as expected

As shown in the example below, pandas convert_dtypes does not convert numbers stored as objects (strings). I found others mentioning the same problem, for example in the comments at https://stackoverflow.com/a/65915289/1086346, but I found no follow-up.
The working example creates the columns "int_x", "alpha_x", and "obj_x" (for x in [0, 1, 2, 3]); the objective is to convert the obj_x columns from string to integer.
import pandas as pd
import numpy as np
import string
import random
random.seed(10)
N = 1000
C = 4
df_int = pd.DataFrame(np.random.randint(0, 100, size=(N, C)), columns=["int_" + str(i) for i in range(C)])
XX = [random.choices(string.ascii_lowercase, k=N) for i in range(C)]
df_str = pd.DataFrame(np.array(XX).T, columns=['alpha_' + str(i) for i in range(C)])
YY = np.random.randint(0, 100, size=(N, C)).astype(str)
df_obj = pd.DataFrame(YY, columns=["obj_" + str(i) for i in range(C)])
X = pd.concat([df_int, df_str, df_obj], axis=1)
X.shape
X.dtypes
Xn = X.convert_dtypes()
Xn.dtypes
The dtypes in X[obj_x] are "object" and they are not converted by convert_dtypes.
However, column-by-column manual application of to_numeric does work correctly, e.g.,
X2 = X.apply(pd.to_numeric, errors="ignore")
X2.dtypes
Output shows the obj_ columns are "int64", as they should be.
Is there a bug in convert_dtypes?
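A likely explanation, offered as a hedged note rather than an official answer: convert_dtypes picks a dtype for the values as they are stored and does not parse the contents of strings, so a column of digit strings stays a string column. The sketch below uses a tiny made-up frame; to_numeric_if_possible is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Made-up miniature of the frame in the question.
df = pd.DataFrame({
    "obj_0": ["1", "22", "3"],
    "alpha_0": ["a", "b", "c"],
})

# convert_dtypes leaves the digit strings as strings.
converted = df.convert_dtypes()

def to_numeric_if_possible(col):
    """Parse a column as numbers, or return it unchanged if parsing fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Explicit parsing, column by column.
parsed = df.apply(to_numeric_if_possible)
```

The helper plays the same role as errors="ignore" in the question, without relying on that option.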

Dataframe column won't convert from integer string to an actual integer

I have a date string in microsecond resolution. I need it as an integer.
import numpy as np
import pandas as pd
data = ["20181231235959383171", "20181231235959383172"]
df = pd.DataFrame(data=data, columns=["A"])
df["A"].astype(np.int64)
Error:
File "pandas\_libs\lib.pyx", line 545, in pandas._libs.lib.astype_intsafe
OverflowError: Python int too large to convert to C long
Same problem if I try to cast it to standard Python int
Per my answer in your previous question:
import pandas as pd
data = ["20181231235959383171", "20181231235959383172"]
df = pd.DataFrame(data=data, columns=["A"])
# slow but big enough: arbitrary-precision Python ints in an object column
df["A_as_python_int"] = df["A"].apply(int)
# fast but has to be split into two integers that each fit in int64
df["A_seconds"] = (df["A_as_python_int"] // 1000000).astype("int64")
df["A_fractions"] = (df["A_as_python_int"] % 1000000).astype("int64")
You could do this:
import pandas as pd
data = ["20181231235959383171", "20181231235959383172"]
df = pd.DataFrame(data=data, columns=["A"])
before = df.A[0]
df.A = [int(x) for x in df.A.tolist()]
after = df.A[0]
before, after
Output:
The data has been cast into an integer. Showing: (before, after)
('20181231235959383171', 20181231235959383171)
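Since the question describes the value as a date string with microsecond resolution, another option (not taken from the answers above) is to parse it as a timestamp and store microseconds since the epoch. The raw 20-digit number exceeds the int64 maximum of 9223372036854775807, but the epoch offset fits. The sketch assumes the layout is YYYYmmddHHMMSS followed by six microsecond digits; the column names A_ts and A_epoch_us are made up.

```python
import pandas as pd

data = ["20181231235959383171", "20181231235959383172"]
df = pd.DataFrame(data=data, columns=["A"])

# Assumed layout: YYYYmmddHHMMSS followed by six microsecond digits.
df["A_ts"] = pd.to_datetime(df["A"], format="%Y%m%d%H%M%S%f")

# Microseconds since 1970-01-01 as a regular int64 column.
epoch = pd.Timestamp("1970-01-01")
df["A_epoch_us"] = (df["A_ts"] - epoch) // pd.Timedelta(microseconds=1)
```

This keeps the values sortable and usable in vectorized arithmetic, at the cost of no longer being the literal digit string.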

Getting a score of zero using cross_val_score

I am trying to use cross_val_score on my dataset, but I keep getting zeros as the score:
This is my code:
df = pd.read_csv("Flaveria.csv")
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)
# Extracting the target value from the dataset
X = df.iloc[:, df.columns != "Plant Weight(g)"]
y = np.array(df.iloc[:, 0], dtype="S6")
logreg = LogisticRegression()
loo = LeaveOneOut()
scores = cross_val_score(logreg, X, y, cv=loo)
print(scores)
The features are categorical values, while the target value is a float value. I am not exactly sure why I am ONLY getting zeros.
The data looks like this before creating dummy variables
N level,species,Plant Weight(g)
L,brownii,0.3008
L,brownii,0.3288
M,brownii,0.3304
M,brownii,0.388
M,brownii,0.406
H,brownii,0.3955
H,brownii,0.3797
H,brownii,0.2962
Updated code where I am still getting zeros:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import pandas as pd
# Creating dummies for the non numerical features in the dataset
df = pd.read_csv("Flaveria.csv")
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)
# Extracting the target value from the dataset
X = df.iloc[:, df.columns != "Plant Weight(g)"]
y = df.iloc[:, 0]
forest = RandomForestRegressor()
loo = LeaveOneOut()
scores = cross_val_score(forest, X, y, cv=loo)
print(scores)
cross_val_score splits the data into train and test folds with the given iterator, fits the model on each training fold, and scores it on the corresponding test fold. For regressors, the default score in scikit-learn is R2 (r2_score).
You have specified LeaveOneOut() as your cv iterator, so each test fold contains a single sample. In this case, R2 will always be reported as 0.
Looking at the formula for R2 on Wikipedia:
R2 = 1 - (SS_res/SS_tot)
And
SS_tot = sum((y - y_mean)^2)
For a single sample, y_mean equals y, so the denominator SS_tot is 0 and R2 is undefined (NaN). In this case, scikit-learn set the value to 0 instead of NaN.
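The arithmetic can be checked with a few lines of plain Python. This is an illustrative sketch: r2_terms is a hypothetical helper, the true values are taken from the sample data in the question, and the predictions are made up.

```python
def r2_terms(y_true, y_pred):
    """Return (SS_res, SS_tot) for two equally long lists of numbers."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return ss_res, ss_tot

# A leave-one-out test fold: one sample, so y_mean == y and SS_tot is 0.
single_res, single_tot = r2_terms([0.3008], [0.35])

# A fold with several samples has a non-zero SS_tot.
fold_res, fold_tot = r2_terms([0.3008, 0.3288, 0.3304], [0.31, 0.32, 0.33])
```

With single_tot equal to zero, the ratio SS_res/SS_tot cannot be evaluated, which is why a per-fold R2 is meaningless under leave-one-out.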
Changing the LeaveOneOut() to any other CV iterator like KFold, will give you some non-zero results as you have already observed.
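If leave-one-out is still wanted, one option is to score each fold with an error metric that is defined for a single sample. The sketch below uses scoring="neg_mean_absolute_error", which is an assumption about what suits the problem, and random made-up data in place of Flaveria.csv.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Made-up data: 12 samples, 3 dummy-encoded features, weights near 0.3-0.4.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(12, 3)).astype(float)
y = rng.uniform(0.29, 0.41, size=12)

forest = RandomForestRegressor(n_estimators=20, random_state=0)

# Absolute error is well defined even when the test fold has one sample.
scores = cross_val_score(
    forest, X, y, cv=LeaveOneOut(), scoring="neg_mean_absolute_error"
)
mean_abs_error = -scores.mean()
```

scikit-learn negates error metrics so that larger is always better, hence the sign flip when reporting the mean absolute error.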

Using pdist with string values in Python SciPy

I have the following code, and I want to calculate the Hamming distance between the strings:
from pandas import DataFrame
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
df = pd.read_csv("3d_printing.csv", encoding='utf-8', error_bad_lines=False, low_memory=False, names=['file_name', 'phash', 'dhash', 'file_date'])
def hamming_distance(s1, s2):
    # Number of positions at which the two strings differ
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(el1 != el2 for el1, el2 in zip(s1, s2))
df = df.sort_values(by='file_date', ascending=False)
x = pd.DataFrame(np.triu(squareform(pdist(df[['phash']], hamming_distance))),
                 columns=df.file_name.str.split('_').str[0],
                 index=df.file_name.str.split('_').str[0]).replace(0, np.nan)
z = x[x.apply(lambda col: col.index != col.name)].max(1).max(level=0)
z.to_csv("3d_printing_x.csv", mode='a')
When I run the code I get
ValueError: could not convert string to float: '002889898888b8a9'
I know that pdist requires float values, but at this point I don't know how to apply it to strings.
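One way around the float requirement, sketched under assumptions, is to turn each hash into a row of character codes and let pdist use its built-in "hamming" metric. That metric returns the fraction of differing positions, so multiplying by the string length gives the count. The first hash comes from the error message; the other two are made up, and all hashes are assumed to have the same length.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# First value is from the error message; the others are hypothetical.
hashes = ["002889898888b8a9", "002889898888b8a8", "ff2889898888b8a9"]

# One row per hash, one numeric column per character.
codes = np.array([[ord(c) for c in h] for h in hashes])

# "hamming" yields the proportion of differing positions;
# scale by the length to get the number of differing characters.
dist = np.rint(squareform(pdist(codes, metric="hamming")) * codes.shape[1])
```

This compares characters, like the hamming_distance function in the question, not the individual bits of the hex digits.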