train_test_split errors with two csv files - pandas

I am working with 2 csv files and I want to compare values from both using the train_test_split function.
My code is the following:
X = np.append(y1[:100])
X_train, X_test, y_train, y_test = train_test_split(X, y1)
I know that X and y1 are not of the same length and I was trying to fix this error:
ValueError: Found input variables with inconsistent numbers of samples: [4840242, 44898]
However, with the first line I am currently getting this error:
File "<array_function internals>", line 179, in append
TypeError: _append_dispatcher() missing 1 required positional argument: 'values'
How would I be able to fix this?

You are using NumPy's append function incorrectly. It expects two arguments: the array to append to and the values to append. You are passing only one of them. If the first 100 entries of y1 are supposed to be your X, simply writing X = y1[:100] will suffice. Note that train_test_split still requires X and y1 to contain the same number of samples.
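As a minimal sketch of the two points above, the snippet below uses two small made-up arrays, a and b, as stand-ins for the columns read from the two CSV files (the names and values are assumptions, not the asker's data). It shows np.append called with both arguments, and one possible way to bring the arrays to a common length before splitting. Whether truncating is appropriate depends on how the two files actually relate to each other.

```python
import numpy as np

# Hypothetical stand-ins for the columns loaded from the two CSV files.
a = np.arange(12)
b = np.arange(5)

# np.append takes the base array first and the values to add second.
extended = np.append(b, [100, 200])

# train_test_split needs the same number of samples in X and y.
# Truncating to the shorter length is an assumption made for illustration.
n = min(len(a), len(b))
X = a[:n]
y = b[:n]

print(extended)
print(X.shape, y.shape)
```

After this, X and y have equal length and can be passed to train_test_split together.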

Related

How to import a CSV file, split it 70/30 and then use first column as my 'y' value?

I am having an issue at the moment, and I think I'm making it far more complicated than it needs to be. My CSV file is 31 rows by 500. I need to import it, split it in a 70/30 ratio, and then use the first column as my 'y' value for a neural network, with the remaining 30 columns as my 'x' value.
I've implemented the code below to do this, but when I run it through my basic sigmoid and testing functions, it gives results in a weird format, e.g. [6.54694655e-06].
I believe this is due to how I split/import the data, which I think I have done wrong. I need to import the data into arrays that my functions can read, and separate my first column specifically as a 'y' value. How do I go about this?
df = pd.read_csv(r'data.csv', header=None)
df.to_numpy()
#splitting data 70/30
trainingdata= df[:329]
testingdata= df[:141]
#converting data to separate arrays for training and testing
training_features= trainingdata.loc[:, trainingdata.columns != 0].values.reshape(329,30)
training_labels = trainingdata[0]
training_labels = training_labels.values.reshape(329,1)
testing_features = testingdata[0]
testing_labels = testingdata.loc[:, testingdata.columns != 0]
To split a DataFrame into train and test data, I usually use sklearn.model_selection.train_test_split; see its documentation for details.
Other methods exist as well. I hope this helps!
Make your train/test split easy by using sklearn.model_selection.train_test_split.
If you don't have sklearn installed, first install it by running pip install -U scikit-learn.
Then:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv(r'data.csv', header=None)
# X is your features, y is your target column
X = df.loc[:,1:]
y = df.loc[:,0]
# Use train_test_split function with test size of 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
df = pd.read_csv(r'data.csv')
# to_numpy() returns a new array; it does not convert df in place
data = df.to_numpy()
print(data)
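For a manual split without scikit-learn, two things in the question's code stand out: df[:141] selects the first 141 rows again (so the test set overlaps the training set), and the testing features and labels are assigned the wrong way round. The sketch below uses a made-up NumPy array of 470 rows and 31 columns as a stand-in for the loaded CSV (the shape is inferred from the 329/141 split in the question, and the values are placeholders). It does not shuffle the rows, which train_test_split would do by default.

```python
import numpy as np

# Hypothetical stand-in for df.to_numpy(): 470 rows, 31 columns.
data = np.arange(470 * 31, dtype=float).reshape(470, 31)

# 70% of the rows go to training (integer arithmetic avoids rounding issues).
n_train = len(data) * 7 // 10

# Non-overlapping slices: first 329 rows for training, remaining 141 for testing.
train, test = data[:n_train], data[n_train:]

# Column 0 is the label; columns 1..30 are the features.
# Indexing with [0] (a list) keeps the labels as a 2-D column vector.
training_features, training_labels = train[:, 1:], train[:, [0]]
testing_features, testing_labels = test[:, 1:], test[:, [0]]

print(training_features.shape, training_labels.shape)
print(testing_features.shape, testing_labels.shape)
```

A value such as [6.54694655e-06] is simply a one-element array printed in scientific notation, so the output format on its own does not indicate a problem with the import.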

Python numpy: (IndexError: too many indices for array) How to choose specific index to my matrix?

I'm trying to build a model from an array with 572 rows and 8 columns loaded with NumPy. I define the sets by listing row ranges when indexing into the array:
train_x = x_vals[11:34, 46:98, 110:268, 280:342, 354:408, 420:428, 440:478, 490:538, 550:571]
test_x = x_vals[0:10, 35:45, 99:109, 269:279, 343:353, 409:419, 429:439, 479:489, 539:549]
train_y = y_vals[11:34, 46:98, 110:268, 280:342, 354:408, 420:428, 440:478, 490:538, 550:571]
test_y = y_vals[0:10, 35:45, 99:109, 269:279, 343:353, 409:419, 429:439, 479:489, 539:549]
I'm trying to test my model with 99 samples and calibrate with 473. Although the Spyder environment accepts the declarations of the lines above, at the time of running the program it appears:
train_x = x_vals[11:34, 46:98, 110:268, 280:342, 354:408, 420:428, 440:478, 490:538, 550:571]
IndexError: too many indices for array
What is missing in the declaration of the sets above?
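NumPy reads comma-separated slices inside the brackets as one slice per dimension, so nine slices on a 2-D array raise "too many indices". One way to select several row ranges at once is to build a single index array with np.r_, which concatenates slices. The sketch below is an illustration under assumptions: x_vals is filled with placeholder values, and only its shape (572 rows, 8 columns) comes from the question.

```python
import numpy as np

# Placeholder data with the shape described in the question.
x_vals = np.arange(572 * 8).reshape(572, 8)

# np.r_ joins the slices into one 1-D array of row indices.
train_idx = np.r_[11:34, 46:98, 110:268, 280:342, 354:408,
                  420:428, 440:478, 490:538, 550:571]
test_idx = np.r_[0:10, 35:45, 99:109, 269:279, 343:353,
                 409:419, 429:439, 479:489, 539:549]

# Indexing with an integer array selects those rows and keeps all columns.
train_x = x_vals[train_idx]
test_x = x_vals[test_idx]

print(train_x.shape, test_x.shape)
```

Python slices exclude their end index, so 0:10 covers rows 0 to 9. With the ranges as written, fewer rows are selected than the 473 and 99 mentioned in the question, so the end points may need to be increased by one. The same index arrays can be applied to y_vals.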

Stacking list of lists vertically using np.vstack is throwing an error

I am following this piece of code http://queirozf.com/entries/scikit-learn-pipeline-examples in order to develop a Multilabel OnevsRest classifier for text. I would like to compute the hamming_score and thus would need to binarize my test labels as well. I thus have:
X_train, X_test, labels_train, labels_test = train_test_split(meetings, labels, test_size=0.4)
Here, labels_train and labels_test are lists of lists:
[['dog', 'cat'], ['cat'], ['people'], ['nice', 'people']]
Now I need to binarize all my labels, I am therefore doing this...
all_labels = np.vstack([labels_train, labels_test])
mlb = MultiLabelBinarizer().fit(all_labels)
This follows the linked example, but it throws:
ValueError: all the input array dimensions except for the concatenation axis must match exactly
I used np.column_stack as directed here
numpy array concatenate: "ValueError: all the input arrays must have same number of dimensions"
but that throws the same error.
How can the dimensions be the same? Since I am splitting into train and test, I am bound to get different shapes, right? Please help, thank you.
MultiLabelBinarizer works on a list of lists directly, so you don't need to stack them using NumPy. Pass the combined list directly, without stacking.
all_labels = labels_train + labels_test
mlb = MultiLabelBinarizer().fit(all_labels)
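A short sketch of how the fitted binarizer could then be applied to each split. The label lists here are toy values modelled on the example in the question, not the asker's data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy labels standing in for the two splits from the question.
labels_train = [['dog', 'cat'], ['cat']]
labels_test = [['people'], ['nice', 'people']]

# Plain list concatenation works even though the inner lists differ in length.
mlb = MultiLabelBinarizer().fit(labels_train + labels_test)

# Each split is transformed with the same fitted binarizer,
# so both share the same column order (mlb.classes_).
y_train_bin = mlb.transform(labels_train)
y_test_bin = mlb.transform(labels_test)

print(mlb.classes_)
print(y_train_bin)
print(y_test_bin)
```

Fitting on the combined labels ensures that a label appearing only in the test split still gets its own column.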

PyPlot throws an error when DataFrame-Column has missing values

I have the following problem:
I would like to plot a variable from a DataFrame with missing values, which are denoted as "NA". However, if I just go ahead and plot it with PyPlot:
x = df[df[:country] .== "Belgium",:year]
y = df[df[:country] .== "Belgium",:hpNormLog]
plot(x, y, "b-", linewidth=2)
I get the following error message:
PyError (:PyObject_Call) <class 'TypeError'> TypeError("float() argument must be a string or a number, not 'PyCall.jlwrap'",)
File "C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 3154, in plot
ret = ax.plot(*args, **kwargs)
File "C:\Anaconda3\lib\site-packages\matplotlib\__init__.py", line 1811, in inner
return func(ax, *args, **kwargs)
File "C:\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py", line 1425, in plot
self.add_line(line)
File "C:\Anaconda3\lib\site-packages\matplotlib\axes\_base.py", line 1708, in add_line
self._update_line_limits(line)
File "C:\Anaconda3\lib\site-packages\matplotlib\axes\_base.py", line 1730, in _update_line_limits
path = line.get_path()
File "C:\Anaconda3\lib\site-packages\matplotlib\lines.py", line 925, in get_path
self.recache()
File "C:\Anaconda3\lib\site-packages\matplotlib\lines.py", line 621, in recache
y = np.asarray(yconv, np.float_)
File "C:\Anaconda3\lib\site-packages\numpy\core\numeri...
I would be very grateful for a solution or workaround.
Best,
Ilja
I found the following solution. I am not deep enough into how Julia works, so I can only say what works and what does not. Arrays with NaN can be plotted with the code written above, columns of DataFrames however do not permit the same thing. The column needs to be converted to an Array, before it can be plotted with missing values. The following code solves the problem:
x = df[df[:country] .== "Belgium",:year]
ytest = df[df[:country] .== "Belgium",:hpNormLog]
y = convert(Array,ytest,NaN)
plot(x, y, "b-", linewidth=2)
x does not contain missing values and therefore I can keep using the DataFrame, but y does contain missing values, so it needs to be converted to an Array. The third argument of convert specifies to what missing values should be converted, in this case to NaN.
Why don't you perform error handling?
try
    plot(x, y, "b-", linewidth=2)
catch
    # ignore the plotting error
end
This suppresses the error if your input works most of the time. Note that it skips the whole plot call when the error occurs, not just the "NA" values.

Tuple indices must be integers not tuple, matplot

I'm trying to code a program that will integrate a function in different ways (Euler, Runge...) and also with the built-in function scipy.integrate.odeint.
Everything works and I'm getting the right results, but I also need to create a graph of the results, and that's when everything goes wrong.
For the odeint function I can't draw the graph.
Here is my code and the error. I hope someone will be able to help me.
def odeint(phi, t0tf, Y0, N):
    T6=numpy.zeros((N+1))
    T6[0]=t0tf[0]
    h=(t0tf[1]-t0tf[0])/N
    for i in range (N):
        T6[i+1]=T6[i]+h
    def f(t,x):
        return phi(x,t)
    Y6 = scipy.integrate.odeint(f,Y0,T6, full_output=True)
    return Y6
Y6 = edo.odeint(phi, t0tf, Y0, N)
T6Y6 = numpy.hstack([Y6])
print("Solutions Scipy :")
print()
print(T6Y6)
print()
mpl.figure("Courbes")
mpl.plot(Y6[0:N,0],Y6[0:N,1],color="yellow",label="Isoda")
mpl.show()
And the error is :
mpl.plot(Y6[0:N,0],Y6[0:N,1],color="yellow",label="Isoda")
TypeError: tuple indices must be integers, not tuple
Thanks in advance (PS: I'm french so my sentences might be kinda shaky)
Y6 seems to be a tuple that you are indexing as if it were an array. It's difficult to say exactly what is wrong since you didn't provide the data, but the following example shows how to access elements of a tuple:
y = ((1,2,3,4,5),)
print('This one works: ',y[0][1:])
print(y[1:,0])
The result is this:
This one works: (2, 3, 4, 5)
Traceback (most recent call last):
File "E:\armatita\stackoverflow\test.py", line 9, in <module>
print(y[1:,0])
TypeError: tuple indices must be integers, not tuple
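A likely source of the tuple in the question is the full_output=True argument: with it, scipy.integrate.odeint returns a tuple of the solution array and an info dictionary rather than the array alone. The sketch below illustrates this with a made-up right-hand side (simple exponential decay), since the asker's phi and initial conditions are not shown.

```python
import numpy as np
from scipy.integrate import odeint

def f(y, t):
    # Simple decay dy/dt = -y, chosen only for illustration.
    return -y

t = np.linspace(0.0, 1.0, 11)

# With full_output=True the return value is a tuple: (solution, info dict).
result = odeint(f, [1.0], t, full_output=True)

# Unpack the tuple; the first element is the array that can be sliced and plotted.
y, info = result

print(type(result).__name__, y.shape)
print(y[0:5, 0])
```

Either unpacking the result as above or dropping full_output=True gives an array that supports indexing such as y[0:N, 0].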