Expected 2D array, got 1D array instead - pandas

I built a model with LinearRegression, tuning its parameters using GridSearchCV.
When I try to compute the score, I get this error:
array=[1 1 1 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1]
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The code I used:
model.score(y_pred, y_test)
What is the problem here, and what type of data does model.score() take?
y_pred is a one-dimensional array returned by the regressor, but the score function won't accept it. What can I do, and what is the solution?

It looks like you have switched the arguments.
The .score() function takes at least two arguments. The first is an array-like of the model's input features X, with shape (n_samples, n_features) in this case; it should not be the predictions, because .score() calls predict() internally. The second is an array-like of shape (n_samples,) containing the correct target values for those inputs.
When in doubt, look at the documentation. In this case, the docstring of .score() would have helped you locate the problem.
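A minimal sketch of the correct call, using toy data in place of the original train/test split (all variable names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy data standing in for the original train/test split.
rng = np.random.RandomState(0)
X_train, X_test = rng.rand(40, 2), rng.rand(10, 2)
y_train = X_train @ np.array([1.5, -2.0]) + 0.3
y_test = X_test @ np.array([1.5, -2.0]) + 0.3

model = LinearRegression().fit(X_train, y_train)

# .score() wants the input features X, not the predictions.
score = model.score(X_test, y_test)

# If you already have y_pred, r2_score gives the same number for a regressor:
y_pred = model.predict(X_test)
same = r2_score(y_test, y_pred)
```

Passing y_pred as the first argument triggers exactly the "Expected 2D array, got 1D array" error above, because .score() treats its first argument as a feature matrix.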

Related

Handling features with multiple values per instance in Python for Machine Learning model

I am trying to handle my dataset, which contains some features with multiple values per instance, as shown in the image
https://i.stack.imgur.com/D78el.png
I am trying to split the values on the '|' symbol so I can apply one-hot encoding, but I can't find a suitable solution to my problem.
My idea is to keep all the values in one row, that is, to convert each cell to a list of integers.
Maybe this is what you want:
import pandas as pd

df = pd.DataFrame(['465','444','465','864|857|850|843'],columns=['genre_ids'])
df
genre_ids
0 465
1 444
2 465
3 864|857|850|843
df['genre_ids'].str.get_dummies(sep='|')
444 465 843 850 857 864
0 0 1 0 0 0 0
1 1 0 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 1 1 1
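If you instead literally want each cell to become a list of integers, as the question mentions, a sketch using str.split on the same made-up frame would be:

```python
import pandas as pd

df = pd.DataFrame(['465', '444', '465', '864|857|850|843'], columns=['genre_ids'])

# Split each cell on '|' and cast every piece to int, giving one list per row.
df['genre_list'] = df['genre_ids'].str.split('|').apply(lambda ids: [int(i) for i in ids])
```

Row 3 then holds [864, 857, 850, 843] and the single-valued rows hold one-element lists.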

Select rows of dataframe based on column values

Problem
I am working on a machine learning project which aims to see on what kind of raw data (text) the classifiers tend to make mistakes and on what kind of data they have no consensus.
Now I have a dataframe with labels, prediction results of 2 classifiers and text data. I am wondering if there is a simple way I could select rows based on some set operations of those columns with predictions or labels.
Data might look like
score review svm_pred dnn_pred
0 0 I went and saw this movie last night after bei... 0 1
1 1 Actor turned director Bill Paxton follows up h... 1 1
2 1 As a recreational golfer with some knowledge o... 0 1
3 1 I saw this film in a sneak preview, and it is ... 1 1
4 1 Bill Paxton has taken the true story of the 19... 1 1
5 1 I saw this film on September 1st, 2005 in Indi... 1 1
6 1 Maybe I'm reading into this too much, but I wo... 0 1
7 1 I felt this film did have many good qualities.... 1 1
8 1 This movie is amazing because the fact that th... 1 1
9 0 "Quitting" may be as much about exiting a pre-... 1 1
For example, if I want to select the rows where both classifiers make mistakes, index 9 should be returned.
A made-up MWE is provided here:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3), columns=["score", "svm_pred", "dnn_pred"])
which returns
score svm_pred dnn_pred
0 0 1 0
1 0 0 1
2 0 0 0
3 1 0 0
4 0 0 1
5 0 1 1
6 1 0 1
7 0 1 1
8 1 1 1
9 1 1 1
What I Have Done
I know I could list all possible combinations, 000, 001, etc. However,
This is not doable when I want to compare more classifiers.
This will not work for multi-class classification problem.
Could someone help me? Thank you in advance.
Why This Question is Not a Duplicate
The existing answers only consider cases where the number of columns is limited. In my application, however, the number of prediction columns from the classifiers could be large, which makes those answers hard to apply.
At the same time, this appears to be the first use of the pd.Series.ne function in this particular application, which might shed some light for people with similar confusion.
Create a helper Series holding the number of incorrect classifiers per row, which you can then do logical operations on. This assumes the true score is in the first column and the prediction values are in the columns from the second onwards; you may need to update the slicing indices accordingly:
s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)
Example Usage:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3),
columns=["score", "svm_pred", "dnn_pred"])
s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)
# Return rows where all classifiers got it right
df[s.eq(0)]
score svm_pred dnn_pred
2 0 0 0
8 1 1 1
9 1 1 1
# Return rows where exactly 1 classifier got it wrong
df[s.eq(1)]
score svm_pred dnn_pred
0 0 1 0
1 0 0 1
4 0 0 1
6 1 0 1
# Return rows where all classifiers got it wrong
df[s.eq(2)]
score svm_pred dnn_pred
3 1 0 0
5 0 1 1
7 0 1 1
You can use set operations on the selection of rows:
# returns indexes of those rows where score is equal to svm prediction and dnn prediction
df[(df['score'] == df['svm_pred']) & (df['score'] == df['dnn_pred'])].index
# returns indexes of those rows where both predictions are wrong
df[(df['score'] != df['svm_pred']) & (df['score'] != df['dnn_pred'])].index
# returns indexes of those rows where either predictions are wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])].index
If you are not only interested in the index, but the complete row, omit the last part:
# returns rows where either predictions are wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])]
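Since the question explicitly asks about set operations, note that these selections return pd.Index objects, which support intersection, union, and difference directly. A sketch on the same made-up MWE frame:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3),
                  columns=["score", "svm_pred", "dnn_pred"])

# Index of rows each classifier got wrong.
svm_wrong = df[df['score'] != df['svm_pred']].index
dnn_wrong = df[df['score'] != df['dnn_pred']].index

both_wrong = svm_wrong.intersection(dnn_wrong)    # both classifiers wrong
either_wrong = svm_wrong.union(dnn_wrong)         # at least one wrong
only_svm_wrong = svm_wrong.difference(dnn_wrong)  # svm wrong, dnn right
```

On the MWE above, both_wrong recovers rows 3, 5 and 7, matching the s.eq(2) selection from the other answer.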

Encode categorical features with multiple categories per example - sklearn

I'm working on a movie dataset which contains genre as a feature. The examples in the dataset may belong to multiple genres at the same time. So, they contain a list of genre labels.
The data looks like this-
movieId genres
0 1 [Adventure, Animation, Children, Comedy, Fantasy]
1 2 [Adventure, Children, Fantasy]
2 3 [Comedy, Romance]
3 4 [Comedy, Drama, Romance]
4 5 [Comedy]
I want to vectorize this feature. I have tried LabelEncoder and OneHotEncoder, but they can't seem to handle these lists directly.
I could vectorize this manually, but I have other similar features that contain too many categories. For those I'd prefer some way to use the FeatureHasher class directly.
Is there some way to get these encoder classes to work on such a feature? Or is there a better way to represent such a feature that will make encoding easier? I'd gladly welcome any suggestions.
This SO question has some impressive answers. On your example data, the last answer by Teoretic (using sklearn.preprocessing.MultiLabelBinarizer) is 14 times faster than the solution by Paulo Alves (and both are faster than the accepted answer!):
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
result = pd.concat([df['movieId'], encoded], axis=1)
# Increase max columns to print the entire resulting DataFrame
pd.options.display.max_columns = 50
result
movieId Adventure Animation Children Comedy Drama Fantasy Romance
0 1 1 1 1 1 0 1 0
1 2 1 0 1 0 0 1 0
2 3 0 0 0 1 0 0 1
3 4 0 0 0 1 1 0 1
4 5 0 0 0 1 0 0 0
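As for the FeatureHasher class mentioned in the question: it can consume such lists directly when constructed with input_type='string', treating each sample as an iterable of string tokens. A sketch with an arbitrary n_features, on made-up data mirroring the question:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Made-up frame in the shape of the question's data.
df = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': [['Adventure', 'Comedy'], ['Comedy'], ['Drama', 'Romance']],
})

# Each row's list of genre strings is hashed into a fixed-width sparse vector.
hasher = FeatureHasher(n_features=16, input_type='string')
hashed = hasher.transform(df['genres'])
```

The result is a scipy sparse matrix of shape (n_rows, n_features), so this scales to features with very many categories, at the cost of possible hash collisions.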

TetGen generates tets for empty chambers/holes in a model

I am using tetgen to generate meshes for my research.
My models have empty internal chambers. For example, an empty box of size (5,5,5) inside a box of size (10,10,10). See image:
The problem is that TetGen generates tetrahedra inside the empty chamber. Why? Is there a way to avoid it?
I tried using -YY, -q, -CC, -c, and their combinations, but all had the same problem and gave no insight into the error. (http://wias-berlin.de/software/tetgen/1.5/doc/manual/manual005.html)
The way I solved it was to create a .poly file (http://wias-berlin.de/software/tetgen/fformats.poly.html). I created the .poly file from a .off file (https://en.wikipedia.org/wiki/OFF_(file_format)), which I could export from OpenSCAD.
A .poly file has four parts, of which the third specifies holes in the object. You need to tell TetGen where the holes are.
The way to do it is to specify one point inside each hole/chamber.
A possible .poly file would look like this:
part1 - vertices:
40 3 0 0
0 0.2 0 1
1 0.161803 0.117557 0
...
part2 - faces:
72 0
1
3 0 1 2
1
3 1 0 3
...
part3 - holes <============== the one I needed
1
1 0 0 0.5 <=== this is a point, which I know is inside my hole/chamber
So here is the file, without any breaks, just in case:
40 3 0 0
0 0.2 0 1
1 0.161803 0.117557 0
...
72 0
1
3 0 1 2
1
3 1 0 3
...
1
1 0 0 0.5

Conditional Logistic Regression in SPSS: data preparation

I need to perform a conditional logistic regression, and unfortunately must use SPSS in this case. The data should be composed of groups, in which there is a subject and 3 matched control. Each combined group should be numbered serially, so it could be used as strata. Suppose I have column A:
A. Diagnosis
1
0
0
0
..
How do I add to it column B with group numbers?
A.Diagnosis B.Group
1 1
0 1
0 1
0 1
1 2
0 2
0 2
0 2
..
Thanks a lot in advance
This will do it:
compute group=trunc(($casenum-0.1)/4)+1.
Note that this will not work if the order changes or if any group has fewer or more than 4 rows.
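For readers who want to check the arithmetic outside SPSS, the same formula expressed in Python would be (casenum here mirrors the 1-based $casenum system variable):

```python
import math

def group(casenum, group_size=4):
    # Mirrors compute group=trunc(($casenum-0.1)/4)+1.
    return math.trunc((casenum - 0.1) / group_size) + 1

# First 8 rows fall into two groups of 4.
groups = [group(n) for n in range(1, 9)]
```

The -0.1 offset keeps the 4th, 8th, ... rows in the preceding group rather than starting a new one.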