How to add a column to a numpy matrix? - numpy

Here is my code to add an additional column to x_vals, but I keep getting this error:
x_vals = np.array([x[0:4] for x in iris.data])
np.concatenate([x_vals, np.array([x[0] for x in iris.data])],1)
ValueError: all the input arrays must have same number of dimensions
Can anyone help me?
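A minimal sketch of one likely fix (assuming iris comes from sklearn.datasets, as the variable name suggests): np.concatenate needs both inputs to have the same number of dimensions, so the 1-D column has to be reshaped into an (n, 1) array first.
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
x_vals = np.array([x[0:4] for x in iris.data])
# Reshape the 1-D array of first-column values into an (n, 1) column
# so that both inputs to concatenate are 2-D.
extra = np.array([x[0] for x in iris.data]).reshape(-1, 1)
result = np.concatenate([x_vals, extra], axis=1)
print(result.shape)  # (150, 5)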

Related

How to build a numpy matrix one row at a time?

I'm trying to build a matrix one row at a time.
import numpy as np
f = np.matrix([])
f = np.vstack([ f, np.matrix([1]) ])
This is the error message.
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 1 has size 1
As you can see, np.matrix([]) is NOT an empty matrix: it has shape (1, 0), so vstack refuses to pair it with a (1, 1) matrix. I'm going to have to do this some other way. But what? I'd rather not do an ugly workaround kludge.
You have to give the initial matrix some dimensions. Either fill it with zeros or use np.empty():
f = np.empty(shape=[1, 1])  # note: the first row holds uninitialized values
f = np.vstack([f, np.matrix([1])])
You can use np.hstack instead for the first row, then use np.vstack iteratively.
arr = np.array([])
arr = np.hstack((arr, np.array([1, 1, 1])))
arr = np.vstack((arr, np.array([2, 2, 2])))
Now you can convert into a matrix.
mat = np.asmatrix(arr)
Good grief. It appears there is no way to do what I want. Kludgetown it is. I'll build an array with a bogus first entry, then when I'm done make a copy without the bogosity.
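A common alternative, sketched here rather than taken from the answers above: accumulate the rows in a plain Python list and stack them once at the end, which avoids the bogus first entry entirely.
import numpy as np

rows = []
for i in range(1, 4):
    rows.append([i, i, i])   # build each row as an ordinary list
f = np.vstack(rows)          # a single stacking call at the end
print(f.shape)               # (3, 3)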

iloc using scikit learn random forest classifier

I am trying to build a random forest classifier to determine the 'type' of an object based on different attributes. I am having trouble understanding iloc and separating the predictors from the classification. If the 50th column is the 'type' column, I am wondering why the commented-out iloc line does not work, but the line y = dataset["type"] does. I have attached the code below. Thank you!
X = dataset.iloc[:, 0:50].values
y = dataset["type"]
#y = dataset.iloc[:,50].values
Let's assume that the first column in your dataframe is named "0" and the following columns are named consecutively, like the result of the following lines:
import pandas as pd

last_col = 50
tab = pd.DataFrame([[x for x in range(last_col)] for c in range(10)])
Now try tab.iloc[:, 0:50] - it will work, because you used a slice to select column positions, and the slice 0:50 covers the existing positions 0 through 49.
But if you try tab.iloc[:, 50], it will not work, because there is no column at position 50.
Slicing and selecting a column by its position are just a bit different. From the pandas documentation:
.iloc[] is primarily integer position based (from 0 to length-1 of the axis)
I hope this helps.
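As a quick check, here is a sketch reusing the hypothetical 10x50 frame built above, showing that the slice succeeds while the scalar position fails:
import pandas as pd

tab = pd.DataFrame([[x for x in range(50)] for c in range(10)])  # columns 0..49

print(tab.iloc[:, 0:50].shape)  # (10, 50): the slice covers existing positions 0..49
# tab.iloc[:, 50]               # raises IndexError: the positional indexer is out of bounds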

Getting an error while performing undersampling for sklearn

I am trying to build a random forest classifier for binary classification. My data is imbalanced, hence I am performing undersampling.
from imblearn.under_sampling import RandomUnderSampler
from sklearn import metrics

train = data.drop(['Co_Name', 'Cust_ID', 'Phone', 'Shpr_ID', 'Resi_Cnt', 'Buz_Cnt', 'Nearby_Cnt', 'parseNumber', 'removeString', 'Qty', 'bins', 'Adj_Addr', 'Resi', 'Weight', 'Resi_Area', 'Lat', 'Lng'], axis=1)
Y = data['Resi']
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(train, Y)
I am getting the error below:
446 # make sure we actually converted to numeric:
447 if dtype_numeric and array.dtype.kind == "O":
--> 448 array = array.astype(np.float64)
449 if not allow_nd and array.ndim >= 3:
450 raise ValueError("Found array with dim %d. %s expected <= 2."
ValueError: setting an array element with a sequence.
How can I fix this?
Can you share the dataframe, or a sample of it?
This error can mean a lot of things. For example, if you try:
import numpy as np

# note: dtype=float (np.float is a deprecated alias removed in newer numpy)
np.asarray(
    [
        [1, 2],
        [2, 3, 4]
    ],
    dtype=float)
You will get:
ValueError: setting an array element with a sequence.
This happens because the nested lists have mismatched lengths: the second row has a different number of columns than the first, so numpy cannot build a rectangular array from them.
But your error is probably related to the shapes of train and Y, or to the column types in train: the undersampler's fit step performs a conversion to float64 that throws this error. Confirm that train contains only numeric, non-nested values before calling RandomUnderSampler.
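As a quick illustration (a hypothetical, self-contained frame, not the asker's data), an object-dtype column holding nested sequences is exactly the kind of content that triggers this ValueError:
import pandas as pd

train = pd.DataFrame({'a': [1, 2], 'b': [[1, 2], [3]]})
print(train.dtypes)  # 'b' shows up as object because it holds lists

# Encode or drop object-dtype columns before calling the undersampler.
object_cols = train.columns[train.dtypes == object]
print(list(object_cols))  # ['b']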

Sklearn and Sparse Matrices ValueError

I'm aware similar questions have been asked before, and I've tried everything suggested in them, but I'm still stumped. I have a dataset with 2 columns: the first holds vectors representing words, stored as 1x10000 sparse csr matrices (so a matrix in each cell), and the second contains integer ratings which I will use for classification. When I run the following code
for index, row in data.iterrows():
    print(row)
    print(row[0].shape)
I get the correct output for all the rows:
Vector    (0, 0)\t1.0\n (0, 1)\t1.0\n (0, 2)\t1.0\n ...
Rating    5
Name: 0, dtype: object
(1, 10000)
Now, when I try passing my data to any sklearn classifier like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vectors"], data["Ratings"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
What am I doing wrong? I've made sure all my sparse matrices are the same size and I've tried reshaping my data in various ways, but with no luck, and the Sklearn classifiers are supposed to be able to deal with csr matrices.
Update: Converting the entire "Vectors" column into one large 2-D matrix did the trick, but for completeness' sake, the following is the code I used to generate my dataframe, if anyone is curious and wants to try solving the original issue. Assume data is a pandas dataframe with rows that look like:
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
from scipy import sparse
import pandas as pd

def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table, return a binary
    vector representation of each feature."""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            vector[0, int(number) - 1] = 1
        except ValueError:
            pass
    return vector

def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the appropriate "num num num..." format, a specific
    vectorization function, and a vector size, return the dataset in vectorized form."""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # All the mixing up of decodings and encodings has made it so that Pandas
        # incorrectly parses EOF chars
        if isinstance(row[0], str):
            result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
            result_data.iat[index, 1] = data.loc[index][1]
    return result_data
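For reference, a minimal sketch of the fix described in the update, under the assumption that the rows have the same 1x10000 shape that vectorize() above produces: stack the per-row csr matrices into one sparse matrix before fitting.
from scipy import sparse
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in rows with the same 1 x 10000 shape as vectorize() returns.
rows = [sparse.csr_matrix((1, 10000)) for _ in range(2)]
X = sparse.vstack(rows)  # one (2, 10000) sparse matrix instead of a column of objects
y = [5, 3]

clf = DummyClassifier(strategy='uniform')
clf.fit(X, y)  # works: sklearn accepts a single sparse matrix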

error performing np.std for array

This is my code. I'm trying to calculate the standard deviation of an imported list, which is shown below:
import csv
import statistics

b = []
#time = []
with open('nt.txt') as csvfile:
    data = csv.reader(csvfile, delimiter='\t')
    index = 0
    for line in data:
        b.append(line[1])
        #out = line[0]
        #new = out.split(" ")
        #b.append(new[0])
        #else: break
x = statistics.stdev(b)
print(x)
With b = ['-0,002549', '-0,002040', '-0,001530'] as my output, I get:
raise TypeError(msg.format(type(x).__name__)) from None
TypeError: can't convert type 'str' to numerator/denominator
You have to set the type of the numpy array, not the list:
results = np.array([[x], [b]]).astype(np.float32)
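Note also that these strings use a comma as the decimal separator, so even astype on the array would fail on them. A sketch of a conversion that handles both issues:
import numpy as np

b = ['-0,002549', '-0,002040', '-0,001530']
# Replace the comma decimal separator before casting; a plain
# np.array(b).astype(np.float32) would raise a ValueError on these strings.
values = np.array([s.replace(',', '.') for s in b], dtype=np.float32)
print(values.std())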