Getting a score of zero using cross val score - pandas

I am trying to use cross_val_score on my dataset, but I keep getting zeros as the score:
This is my code:
df = pd.read_csv("Flaveria.csv")
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)
# Extracting the target value from the dataset
X = df.iloc[:, df.columns != "Plant Weight(g)"]
y = np.array(df.iloc[:, 0], dtype="S6")
logreg = LogisticRegression()
loo = LeaveOneOut()
scores = cross_val_score(logreg, X, y, cv=loo)
print(scores)
The features are categorical values, while the target value is a float value. I am not exactly sure why I am ONLY getting zeros.
The data looks like this before creating dummy variables
N level,species,Plant Weight(g)
L,brownii,0.3008
L,brownii,0.3288
M,brownii,0.3304
M,brownii,0.388
M,brownii,0.406
H,brownii,0.3955
H,brownii,0.3797
H,brownii,0.2962
Updated code where I am still getting zeros:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import pandas as pd
# Creating dummies for the non numerical features in the dataset
df = pd.read_csv("Flaveria.csv")
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)
# Extracting the target value from the dataset
X = df.iloc[:, df.columns != "Plant Weight(g)"]
y = df.iloc[:, 0]
forest = RandomForestRegressor()
loo = LeaveOneOut()
scores = cross_val_score(forest, X, y, cv=loo)
print(scores)

cross_val_score splits the data into train and test folds with the given iterator, fits the model on the training data, and scores it on the test fold. For regressors, the default score in scikit-learn is R^2 (r2_score).
You have specified LeaveOneOut() as your cv iterator, so each test fold contains a single sample. In that case R^2 is always reported as 0.
Looking at the formula for R^2 on Wikipedia:
R^2 = 1 - (SS_res / SS_tot)
where
SS_tot = sum((y_i - y_mean)^2)
For a single sample, y_mean equals the one y value, so the denominator SS_tot is 0 and R^2 is undefined (NaN). In this case scikit-learn sets the score to 0 instead of NaN (newer releases may warn and return NaN instead).
Changing LeaveOneOut() to another CV iterator such as KFold gives non-zero results, as you have already observed.
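To make the zero denominator concrete, here is a minimal sketch in plain Python. The helper r2_terms is hypothetical (it is not part of scikit-learn) and only evaluates the two sums from the formula; the predictions are made-up numbers for illustration.

```python
def r2_terms(y_true, y_pred):
    """Return (SS_res, SS_tot) for the R^2 formula; hypothetical helper."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return ss_res, ss_tot

# One leave-one-out test fold: a single true value and a single prediction.
ss_res_single, ss_tot_single = r2_terms([0.3008], [0.35])

# A fold with several samples has a non-zero SS_tot, so R^2 is defined.
ss_res_multi, ss_tot_multi = r2_terms([0.3008, 0.3288, 0.3304], [0.31, 0.32, 0.33])
r2_multi = 1 - ss_res_multi / ss_tot_multi

print(ss_tot_single, ss_tot_multi, r2_multi)
```

If per-sample scores are still wanted with LeaveOneOut, passing an error-based scorer such as scoring="neg_mean_absolute_error" to cross_val_score should sidestep the issue, since that metric does not divide by SS_tot.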

Passing pandas NumPy arrays as feature vectors in scikit learn?

I have a vector of 5 different values that I use as my sample value, and the label is a single integer of 0, 1, or 3. The machine learning algorithms work when I pass an array as a sample, but I get this warning. How do I pass feature vectors without getting this warning?
import numpy as np
from numpy import random
from sklearn import neighbors
from sklearn.model_selection import train_test_split
import pandas as pd
filepath = 'test.csv'
# example label values
index = [0,1,3,1,1,1,0,0]
# example sample arrays
data = []
for i in range(len(index)):
    d = []
    for i in range(6):
        d.append(random.randint(50,200))
    data.append(d)
feat1 = 'brightness'
feat2, feat3, feat4 = ['h', 's', 'v']
feat5 = 'median hue'
feat6 = 'median value'
features = [feat1, feat2, feat3, feat4, feat5, feat6]
df = pd.DataFrame(data, columns=features, index=index)
df.index.name = 'state'
with open(filepath, 'a') as f:
    df.to_csv(f, header=f.tell() == 0)
states = pd.read_csv(filepath, usecols=['state'])
df_partial = pd.read_csv(filepath, usecols=features)
states = states.astype(np.float32)
states = states.values
labels = states
samples = np.array([])
for i, row in df_partial.iterrows():
    r = row.values
    samples = np.vstack((samples, r)) if samples.size else r
n_neighbors = 5
test_size = .3
labels, test_labels, samples, test_samples = train_test_split(labels, samples, test_size=test_size)
clf1 = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
clf1 = clf1.fit(samples, labels)
score1 = clf1.score(test_samples, test_labels)
print("Here's how the models performed \nknn: %d %%" %(score1 * 100))
Warning:
"DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). clf1 = clf1.fit(samples, labels)"
sklearn documentation for fit(self, X, Y)
Try replacing
states = states.values with states = states.values.flatten()
OR
clf1 = clf1.fit(samples, labels) with clf1 = clf1.fit(samples, labels.flatten()).
states = states.values holds the correct labels from your pandas DataFrame, but as a column vector of shape (n_samples, 1), with one label per row. Using .flatten() turns it into a 1-D array of shape (n_samples,). (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.ndarray.flatten.html)
Sklearn's KNeighborsClassifier documentation
(https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) shows in its example that the labels must be a flat, one-dimensional sequence: y = [0, 0, 1, 1].
When you retrieve the data from the states DataFrame, it comes back as a column vector (one value per row), whereas a 1-D array is expected.
You can also use the ravel() function, which returns a contiguous flattened array.
numpy.ravel(array, order='C'): returns a contiguous flattened array (a 1-D array with all the input array's elements and the same dtype).
Try:
states = states.values.ravel() in place of states = states.values
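The shape difference behind the warning can be checked directly. The following is a small sketch using NumPy only; column_vector is a stand-in (an assumption) for what states.values returns for a single-column DataFrame.

```python
import numpy as np

# Stand-in for states.values: one label per row, shape (n_samples, 1).
column_vector = np.array([[0], [1], [3], [1], [1], [1], [0], [0]], dtype=np.float32)

flat_copy = column_vector.flatten()  # 1-D array, always a new copy
flat_view = column_vector.ravel()    # 1-D array, a view when the memory layout allows it

print(column_vector.shape)  # shape that triggers the DataConversionWarning
print(flat_copy.shape, flat_view.shape)  # shape that fit() expects for y
```

Either call gives the (n_samples,) shape the warning asks for; ravel() avoids a copy when it can, which only matters for large arrays.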

'tuple' object has no attribute 'reshape'

I used a dataset "ex1data1.txt", but when I am running it to convert, it is showing the following error:
AttributeError Traceback (most recent call last)
<ipython-input-52-7c523f7ba9e1> in <module>()
1 # Converting loaded dataset into numpy array
2
----> 3 X = np.concatenate((np.ones(len(population)).reshape(len(population), 1), population.reshape(len(population),1)), axis=1)
4
5
AttributeError: 'tuple' object has no attribute 'reshape'
The code is given below:
import csv
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas as pd
import numpy as np
# Loading Dataset
with open('ex1data1.txt') as csvfile:
    population, profit = zip(*[(float(row['Population']), float(row['Profit'])) for row in csv.DictReader(csvfile)])
# Creating DataFrame
df = pd.DataFrame()
df['Population'] = population
df['Profit'] = profit
# Plotting using Seaborn
sns.lmplot(x="Population", y="Profit", data=df, fit_reg=False, scatter_kws={'s':45})
# Converting loaded dataset into numpy array
X = np.concatenate((np.ones(len(population)).reshape(len(population), 1), population.reshape(len(population),1)), axis=1)
y = np.array(profit).reshape(len(profit), 1)
# Creating theta matrix , theta = [[0], [0]]
theta = np.zeros((2, 1))
# Learning rate
alpha = 0.1
# Iterations to be taken
iterations = 1500
# Updated theta and calculated cost
theta, cost = gradientDescent(X, y, theta, alpha, iterations)
I don't know how to solve this reshape problem. Can anyone tell me how to solve it?
From your definition, population is a tuple (zip(*...) returns tuples), and tuples have no reshape method. I'd suggest two options. The first is converting it to an array, i.e.
population = np.asarray(population)
Alternatively, you can use the DataFrame column's .values attribute, which is essentially a NumPy array:
X = np.concatenate((np.ones(len(population)).reshape(len(population), 1), df['Population'].values.reshape(len(population),1)), axis=1)
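For reference, here is a self-contained sketch of the first option. The three rows are illustrative values rather than the real file contents, and np.column_stack is swapped in for the concatenate/reshape combination from the question as a shorter way to build the same two-column matrix.

```python
import numpy as np

# zip(*rows) yields tuples, which have no .reshape method.
rows = [(6.1101, 17.592), (5.5277, 9.1302), (8.5186, 13.662)]
population, profit = zip(*rows)
was_tuple = isinstance(population, tuple)

# Convert to arrays first, then build the design matrix [1, population].
population = np.asarray(population)
X = np.column_stack((np.ones(len(population)), population))
y = np.asarray(profit).reshape(-1, 1)

print(X.shape, y.shape)
```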

Connected components on Pandas DataFrame with Networkx

Action
To cluster points based on distance and label using connected components.
Problem
The back and forth switching between NetworkX nodes storage of attributes and Pandas DataFrame
Seems too complex
Index/key errors when looking up nodes
Tried
Using different functions like Scikit NearestNeighbours, however resulting in the same back and forth moving of data.
Question
Is there a simpler way to perform this connected components operation?
Example
import numpy as np
import pandas as pd
import dask.dataframe as dd
import networkx as nx
from scipy import spatial
#generate example dataframe
pdf = pd.DataFrame({'x':[1.0,2.0,3.0,4.0,5.0],
                    'y':[1.0,2.0,3.0,4.0,5.0],
                    'z':[1.0,2.0,3.0,4.0,5.0],
                    'label':[1,2,1,2,1]},
                   index=[1, 2, 3, 4, 5])
df = dd.from_pandas(pdf, npartitions = 2)
object_id = 0
def cluster(df, object_id=object_id):
    # create kdtree
    tree = spatial.cKDTree(df[['x', 'y', 'z']])
    # get neighbours within distance for every point, store in dataframe as edges
    edges = pd.DataFrame({'src':[], 'tgt':[]}, dtype=int)
    for source, target in enumerate(tree.query_ball_tree(tree, r=2)):
        target.remove(source)
        if target:
            edges = edges.append(pd.DataFrame({'src':[source] * len(target), 'tgt':target}), ignore_index=True)
    # create graph for points using edges from Balltree query
    G = nx.from_pandas_dataframe(edges, 'src', 'tgt')
    for i in sorted(G.nodes()):
        G.node[i]['label'] = nodes.label[i]
        G.node[i]['x'] = nodes.x[i]
        G.node[i]['y'] = nodes.y[i]
        G.node[i]['z'] = nodes.z[i]
    # remove edges between points of different classes
    G.remove_edges_from([(u,v) for (u,v) in G.edges_iter() if G.node[u]['label'] != G.node[v]['label']])
    # find connected components, create dataframe and assign object id
    components = list(nx.connected_component_subgraphs(G))
    df_objects = pd.DataFrame()
    for c in components:
        df_object = pd.DataFrame([[i[0], i[1]['x'], i[1]['y'], i[1]['z'], i[1]['label']] for i in c.nodes(data=True)]
                                 , columns=['point_id', 'x', 'y', 'z', 'label']).set_index('point_id')
        df_object['object_id'] = object_id
        df_objects.append(df_object)
        object_id += 1
    return df_objects
meta = pd.DataFrame(np.empty(0, dtype=[('x',float),('y',float),('z',float), ('label',int), ('object_id', int)]))
df.apply(cluster, axis=1, meta=meta).head(10)
You can use DBSCAN from scikit-learn. With min_samples=1 every point is a core point, so it effectively finds connected components. The nearest-neighbor computation can use different algorithms, configured through the algorithm parameter (kd-tree is one of the options).
My other suggestion is to do the computation separately for different labels. This simplifies the implementation and allows for parallelization.
These two suggestions can be implemented as follows:
from sklearn.cluster import DBSCAN
def add_cluster(df, distance):
    db = DBSCAN(eps=distance, min_samples=1).fit(df[["x", "y", ...]])
    return df.assign(cluster=db.labels_)
df = df.groupby("label", group_keys=False).apply(add_cluster, distance)
It should work for both Pandas and Dask dataframes. Note that the cluster-id starts from 0 for each label, i.e. a cluster is uniquely identified by the tuple (label, cluster).
Here is a complete example with artificial data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
plt.rc("figure", dpi=100)
plt.style.use("ggplot")
# create fake data
centers = [[1, 1], [-1, -1], [1, -1], [-1, 1]]
XY, labels = make_blobs(n_samples=100, centers=centers, cluster_std=0.2, random_state=0)
inp = (
    pd.DataFrame(XY, columns=["x", "y"])
    .assign(label=labels)
    .replace({"label": {2: 0, 3: 1}})
)
def add_cluster(df, distance):
    db = DBSCAN(eps=distance, min_samples=1).fit(df[["x", "y"]])
    return df.assign(cluster=db.labels_)
out = inp.groupby("label", group_keys=False).apply(add_cluster, 0.5)
# visualize
label_marker = ["o", "s"]
ax = plt.gca()
ax.set_aspect('equal')
for (label, cluster), group in out.groupby(["label", "cluster"]):
    plt.scatter(group.x, group.y, marker=label_marker[label])
The resulting dataframe looks like this:
The plot of the clusters looks as follows. Labels are indicated by the marker shape and clusters by the color.
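To illustrate what "connected components within a distance, per label" means without any library, here is a rough pure-Python sketch using a union-find structure. The function connected_components is a hypothetical helper, not part of scikit-learn or NetworkX, and its quadratic pairwise loop is only reasonable for small inputs; the points and labels are taken from the question's example, while eps=4.0 is chosen here purely so that some points actually link up.

```python
from itertools import combinations
from math import dist

def connected_components(points, labels, eps):
    """Group point indices into components; hypothetical helper.

    Two points are linked when they share a label and lie within eps.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the sets of every pair of same-label points within eps.
    for i, j in combinations(range(len(points)), 2):
        if labels[i] == labels[j] and dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)

    # Collect the indices that share a root.
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

points = [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (3.0, 3.0, 3.0),
          (4.0, 4.0, 4.0), (5.0, 5.0, 5.0)]
labels = [1, 2, 1, 2, 1]
components = connected_components(points, labels, eps=4.0)
print(components)
```

With the question's radius of 2, no two same-label points are close enough, so every point ends up in its own component.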

One-hot encoding Tensorflow Strings

I have a list of strings as labels for training a neural network. Now I want to convert them via one_hot encoding so that I can use them for my tensorflow network.
My input list looks like this:
labels = ['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']
The requested outcome should be something like
one_hot [0,1,0,2,0]
What is the easiest way to do this? Any help would be much appreciated.
Cheers,
Andi
The desired outcome looks like what LabelEncoder in sklearn produces, not OneHotEncoder. In TensorFlow you could use CategoryEncoding, but note that it is a preprocessing layer which encodes integer features:
inp = layers.Input(shape=[X.shape[0]])
x0 = layers.CategoryEncoding(
    num_tokens=3, output_mode="multi_hot")(inp)
model = keras.Model(inputs=[inp], outputs=[x0])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=[tf.keras.metrics.CategoricalCrossentropy()])
print(model.summary())
This part produces the encoding of the unique values. You can add another branch to the model that takes your initial vector as input and fit it against the labels from this reference branch (similar to joining a reference table with a fact table in a database); the model then combines the reference data, your data, and the output.
Note that num_tokens=3 and output_mode="multi_hot" are given explicitly. The integer codes for the class names have to be computed before the model is used, as a feature-engineering step, for example with a pandas DataFrame:
import numpy as np
import pandas as pd
d = {'transport_col':['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']}
dataset_df = pd.DataFrame(data=d)
classes = dataset_df['transport_col'].unique().tolist()
print(f"Label classes: {classes}")
df= dataset_df['transport_col'].map(classes.index).copy()
print(df)
From the manual example (REF): Encode the categorical label into an integer.
Details: This stage is necessary if your classification label is represented as a string. Note: Keras expects classification labels to be integers.
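The same string-to-integer mapping can be sketched in plain Python, without pandas or Keras. This is only an illustration of the idea; the name label_to_id is made up for this example.

```python
labels = ['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']

# Assign each distinct label the next free integer, in order of first appearance.
label_to_id = {}
for label in labels:
    label_to_id.setdefault(label, len(label_to_id))

encoded = [label_to_id[label] for label in labels]
print(encoded)  # [0, 1, 0, 2, 0]
```

Note that sklearn's LabelEncoder numbers the classes in sorted order rather than in order of first appearance; for this particular list both orderings happen to give the same codes.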
In another architecture, you could perhaps use StringLookup:
vocab= np.array(np.unique(labels))
inp = tf.keras.Input(shape= labels.shape[0], dtype=tf.string)
x = tf.keras.layers.StringLookup(vocabulary=vocab)(inp)
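However, labels are usually dependent variables, as opposed to features, and should not be passed to an Input layer.
Everything is in the Keras docs.
A possible full example:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers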
X = np.array([['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']])
vocab= np.unique(X)
print(vocab)
y= np.array([[0,1,0,2,0]])
inp = layers.Input(shape=[X.shape[0]], dtype='string')
x0= tf.keras.layers.StringLookup(vocabulary=vocab, name='finish')(inp)
model = keras.Model(inputs=[inp], outputs=[x0])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=[tf.keras.metrics.categorical_crossentropy])
print(model.summary())
from tensorflow.keras import backend as K
for layerIndex, layer in enumerate(model.layers):
    print(layerIndex)
    func = K.function([model.get_layer(index=0).input], layer.output)
    layerOutput = func([X])  # X is a NumPy array of strings
    print(layerOutput)
    if layerIndex==1:  # the last layer here
        # StringLookup reserves index 0 for out-of-vocabulary tokens, so shift by 1
        scale = lambda x: x - 1
        print(scale(layerOutput))
Result:
[[0 1 0 2 0]]
Another possible solution for your case is layers.TextVectorization:
import numpy as np
import keras
from keras import layers
input_array = np.atleast_2d(np.array(['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']))
vocab= np.unique(input_array)
input_data = keras.Input(shape=(None,), dtype='string')
layer = layers.TextVectorization( max_tokens=None, standardize=None, split=None, output_mode="int", vocabulary=vocab)
int_data = layer(input_data)
model = keras.Model(inputs=input_data, outputs=int_data)
output_dataset = model.predict(input_array)
print(output_dataset) # indices start at 2: 0 is reserved for padding and 1 for out-of-vocabulary tokens
scale = lambda x: x - 2
print(scale(output_dataset))
result:
array([[0, 1, 0, 2, 0]])

Visualizing class labels in self-organizing map plot or iris dataset

I am trying to produce a visualization of the SOM mapping for the Iris dataset ( https://archive.ics.uci.edu/ml/datasets/Iris).
My code so far:
from sklearn.datasets import load_iris
from mvpa2.suite import *
import pandas as pd
import numpy as np
df = pd.read_csv(filepath_or_buffer='data/iris.data', header=None, sep=',')
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end
# split the data table into feature data x and class labels y
x = df.ix[:,0:4].values # the first 4 columns are the features
y = df.ix[:,4].values # the last column is the class label
t = np.zeros(len(y), dtype=int)
t[y == 'Iris-setosa'] = 0
t[y == 'Iris-versicolor'] = 1
t[y == 'Iris-virginica'] = 2
som = SimpleSOMMapper((240, 320), 100, learning_rate=0.05)
som.train(x)
pl.imshow(som.K, origin='lower')
mapped = som(x)
for i, m in enumerate(mapped):
    pl.text(m[1], m[0], t[i], ha='center', va='center',
            bbox=dict(facecolor='white', alpha=0.5, lw=0))
pl.show()
which produces this mapping:
Is there any way to customize the palette so it looks nicer, like this one (taken from https://github.com/JustGlowing/minisom)?
Basically I am trying to use a nicer palette (perhaps with fewer colors) and mark the class labels in a nicer way.
Thank you.
I will answer my own question: it turns out that I forgot to slice my data:
pl.imshow(som.K[:,:,0], origin='lower')
Everything looks fine now:
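A likely reason the slice matters: matplotlib's imshow treats an array of shape (M, N, 4) as an RGBA image, while a 2-D array is drawn as scalar data through a colormap. The sketch below uses a random array as a stand-in for som.K, assuming it has shape (240, 320, 4), i.e. one weight per Iris feature at every map unit.

```python
import numpy as np

# Stand-in for som.K: a 240 x 320 grid of units, each holding 4 feature weights.
K = np.random.default_rng(0).random((240, 320, 4))

# Keep only the weights of feature 0, giving one scalar per map unit.
first_feature = K[:, :, 0]

print(K.shape, first_feature.shape)
```

With a 2-D array, the palette can then be chosen through imshow's cmap argument, for example pl.imshow(first_feature, origin='lower', cmap='viridis').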