Conditional Logistic Regression in SPSS: data preparation

I need to perform a conditional logistic regression, and unfortunately must use SPSS in this case. The data should be composed of groups, each containing a subject and 3 matched controls. Each combined group should be numbered serially, so it can be used as strata. Suppose I have column A:
A. Diagnosis
1
0
0
0
..
How do I add to it column B with group numbers?
A.Diagnosis B.Group
1 1
0 1
0 1
0 1
1 2
0 2
0 2
0 2
..
Thanks a lot in advance

This will do it:
compute group=trunc(($casenum-0.1)/4)+1.
Note that this will not work if the order changes or if there are groups with fewer or more than 4 rows.
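Outside SPSS, the same idea can be sketched in pandas. As an alternative that tolerates varying group sizes, the group number can be derived from the Diagnosis column itself, since each case row starts a new group (a hypothetical minimal example):

```python
import pandas as pd

# Hypothetical example: a new group starts at every case row (Diagnosis = 1),
# so a cumulative sum of the Diagnosis column yields serial group numbers
# even when a group has fewer or more than 4 rows.
df = pd.DataFrame({'Diagnosis': [1, 0, 0, 0, 1, 0, 0, 0]})
df['Group'] = df['Diagnosis'].cumsum()
```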


Expected 2D array, got 1D array instead:

I built the model with Linear Regression, tuning the parameters using GridSearchCV.
When I try to compute a score, it fails. It shows:
array=[1 1 1 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1]
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I used the code.
model.score(y_pred,y_test)
What is the problem here, and what type of data does model.score take?
y_pred is a one-dimensional array returned by the regressor, but it cannot be passed into the score function. What can I do, and what is the solution?
It looks like you have switched the arguments.
The .score() function takes at least two arguments. The first is an array-like of whatever input format you need (in this case, shape (n_examples, n_features)); the second is an array-like of shape (n_examples,) containing the correct target outputs corresponding to those inputs.
When in doubt, look in the documentation. In this case, the docstring of .score() would have helped you locate your problem.
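A minimal sketch of the correct call, assuming scikit-learn's LinearRegression (the data here is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1)  # 2D feature matrix, shape (n_samples, n_features)
y = 2 * X.ravel() + 1             # 1D array of true targets

model = LinearRegression().fit(X, y)

# Correct order: features first, true targets second.
# .score() predicts from X internally and returns R^2 against y.
score = model.score(X, y)
```

Passing y_pred as the first argument fails because .score() expects a 2D feature matrix there, not a 1D vector of predictions.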

Select rows of dataframe based on column values

Problem
I am working on a machine learning project that aims to see on what kind of raw text data the classifiers tend to make mistakes and on what kind of data they disagree.
Now I have a dataframe with the labels, the prediction results of 2 classifiers, and the text data. I am wondering if there is a simple way to select rows based on set operations over the prediction or label columns.
Data might look like
score review svm_pred dnn_pred
0 0 I went and saw this movie last night after bei... 0 1
1 1 Actor turned director Bill Paxton follows up h... 1 1
2 1 As a recreational golfer with some knowledge o... 0 1
3 1 I saw this film in a sneak preview, and it is ... 1 1
4 1 Bill Paxton has taken the true story of the 19... 1 1
5 1 I saw this film on September 1st, 2005 in Indi... 1 1
6 1 Maybe I'm reading into this too much, but I wo... 0 1
7 1 I felt this film did have many good qualities.... 1 1
8 1 This movie is amazing because the fact that th... 1 1
9 0 "Quitting" may be as much about exiting a pre-... 1 1
For example, if I want to select the rows where both classifiers make mistakes, then index 9 will be returned.
A made-up minimal working example (MWE) is provided here:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3), columns=["score", "svm_pred", "dnn_pred"])
which returns
score svm_pred dnn_pred
0 0 1 0
1 0 0 1
2 0 0 0
3 1 0 0
4 0 0 1
5 0 1 1
6 1 0 1
7 0 1 1
8 1 1 1
9 1 1 1
What I Have Done
I know I could list all possible combinations (000, 001, etc.). However:
1. This is not doable when I want to compare more classifiers.
2. This will not work for multi-class classification problems.
Could someone help me, thank you in advance.
Why This Question is Not a Duplicate
The existing answers only consider the case where the number of columns is limited. However, in my application, the number of predictions given by classifiers (i.e. columns) could be large, and this makes the existing answers not quite applicable.
At the same time, the use of the pd.Series.ne function in this particular application does not appear in the existing answers, which might shed some light for people with similar confusion.
Create a helper Series of "number of incorrect classifiers" that you can do logical operations on. This assumes that the true score is in column 1 and the prediction values are in columns 2 onwards; you may need to update the slicing indices accordingly.
s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)
Example Usage:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3),
columns=["score", "svm_pred", "dnn_pred"])
s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)
# Return rows where all classifiers got it right
df[s.eq(0)]
score svm_pred dnn_pred
2 0 0 0
8 1 1 1
9 1 1 1
# Return rows where exactly 1 classifier got it wrong
df[s.eq(1)]
score svm_pred dnn_pred
0 0 1 0
1 0 0 1
4 0 0 1
6 1 0 1
# Return rows where all classifiers got it wrong
df[s.eq(2)]
score svm_pred dnn_pred
3 1 0 0
5 0 1 1
7 0 1 1
You can use set operations on the selection of rows:
# returns indexes of those rows where score is equal to svm prediction and dnn prediction
df[(df['score'] == df['svm_pred']) & (df['score'] == df['dnn_pred'])].index
# returns indexes of those rows where both predictions are wrong
df[(df['score'] != df['svm_pred']) & (df['score'] != df['dnn_pred'])].index
# returns indexes of those rows where either predictions are wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])].index
If you are not only interested in the index, but the complete row, omit the last part:
# returns rows where either predictions are wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])]

How to avoid dummy variable trap for multiple category in one column

I am working on a regression problem. I have a categorical column which has 24 categorical values. One-hot encoding produces too many dummy variables. Is there a way to avoid the dummy variable trap? Kindly guide me.
Thank you
You can use this:
df['column'] = df['column'].astype('category').cat.codes
Example:
import pandas as pd

df = pd.DataFrame(['a', 'b', 'c', 'd', 'a', 'c', 'a', 'd'], columns=['column'])
df['column'] = df['column'].astype('category').cat.codes
Output:
column
0 0
1 1
2 2
3 3
4 0
5 2
6 0
7 3
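If dummy variables are still wanted, a hedged alternative sketch is pd.get_dummies with drop_first=True, which keeps k-1 dummies for k categories (the usual way to avoid the dummy variable trap):

```python
import pandas as pd

df = pd.DataFrame(['a', 'b', 'c', 'd', 'a', 'c', 'a', 'd'], columns=['column'])

# drop_first=True drops the reference category ('a' here), leaving
# k-1 dummies for k categories and avoiding perfect collinearity.
dummies = pd.get_dummies(df['column'], prefix='column', drop_first=True)
```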

How to generate a list of contemporaneous loans from loan level data?

I am trying to detect multiple borrowing in a loan level data set that looks as follows:
d = {'start_month': [1, 2, 4, 1, 14],
     'customer': ['A', 'A', 'A', 'C', 'C'],
     'branch': [1, 2, 3, 2, 1],
     'maturity_month': [13, 14, 16, 13, 26]}
df = pd.DataFrame(data=d)
I want to reshape these data into a month/branch panel that indicates for each branch i the branches j that are also currently loaning to the same customer as branch i.
For branch i, loaning to the same customer as branch j in some month is defined as maturity_month_i >= maturity_month_j > start_month_i
d2 = {'start_month': [1, 1, 2, 4, 14],
      'branch': [1, 2, 2, 3, 1],
      'contemp_branch1': [0, 0, 1, 1, 0],
      'contemp_branch2': [0, 0, 0, 1, 0],
      'contemp_branch3': [0, 0, 0, 0, 0]}
df2 = pd.DataFrame(data=d2)
Desired output
I assume that I will need to (i) generate a long data set, which, for every loan, lists all contemporaneous loans and their respective branches, and then (ii) reshape. I am struggling primarily with (i), especially since my data set is very large and I need an efficient solution.
Thanks a lot!
Let's use get_dummies, add_prefix and assign:
df[['branch','start_month']].assign(**df.branch.astype(str).str.get_dummies()
.add_prefix('contemp_branch'))
Output:
branch start_month contemp_branch1 contemp_branch2 contemp_branch3
0 1 1 1 0 0
1 2 2 0 1 0
2 3 4 0 0 1
3 2 1 0 1 0
4 1 14 1 0 0
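The two steps the question outlines (build a long pairwise data set, then reshape) can be sketched with a self-join on customer, assuming pandas; the variable names here are made up to match the MWE:

```python
import pandas as pd

d = {'start_month': [1, 2, 4, 1, 14],
     'customer': ['A', 'A', 'A', 'C', 'C'],
     'branch': [1, 2, 3, 2, 1],
     'maturity_month': [13, 14, 16, 13, 26]}
df = pd.DataFrame(data=d)

# (i) self-join on customer to pair every loan i with every loan j,
# then keep distinct-branch pairs satisfying the overlap definition:
# maturity_month_i >= maturity_month_j > start_month_i
pairs = df.merge(df, on='customer', suffixes=('_i', '_j'))
mask = ((pairs['branch_i'] != pairs['branch_j']) &
        (pairs['maturity_month_i'] >= pairs['maturity_month_j']) &
        (pairs['maturity_month_j'] > pairs['start_month_i']))
long_df = pairs[mask]

# (ii) reshape to a wide indicator panel; reindex restores loans with
# no contemporaneous partner (all-zero rows) and missing branch columns
wide = (pd.crosstab([long_df['start_month_i'], long_df['branch_i']],
                    long_df['branch_j'])
        .reindex(index=pd.MultiIndex.from_frame(df[['start_month', 'branch']]),
                 columns=sorted(df['branch'].unique()), fill_value=0)
        .add_prefix('contemp_branch'))
```

On a very large data set the self-join can blow up; merging within customer groups (or sorting by customer first) keeps the intermediate frame to pairs of loans that share a customer.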

Sql Server Row Concatenation

I have a table (table variable in-fact) that holds several thousand (50k approx) rows of the form:
group (int) isok (bit) x y
20 0 1 1
20 1 2 1
20 1 3 1
20 0 1 2
20 0 2 1
21 1 1 1
21 0 2 1
21 1 3 1
21 0 1 2
21 1 2 2
And to pull this back to the client is a fairly hefty task (especially since isok is a bit). What I would like to do is transform this into the form:
group mask
20 01100
21 10101
And maybe go even a step further by encoding this into a long etc.
NOTE: The way in which the data is stored currently cannot be changed.
Is something like this possible in SQL Server 2005, and if possible even 2000 (quite important)?
EDIT: I forgot to make it clear that the original table already has an implicit ordering that needs to be maintained. There isn't one column that acts as a linear sequence; rather, the ordering is based on two other columns (integers), as above (x & y).
You can treat the bit as a string ('0', '1') and deploy one of the many string aggregate concatenation methods described here: http://www.simple-talk.com/sql/t-sql-programming/concatenating-row-values-in-transact-sql/