set column value based on distinct values in another column - sql

I am trying to do something very similar to this question: mysql - UPDATEing row based on other rows
I have a table, called modset, of the following form:
member year y1 y2 y3 y1y2 y2y3 y1y3 y1y2y3
a 1 0 0 0 0 0 0 0
a 2 0 0 0 0 0 0 0
a 3 0 0 0 0 0 0 0
b 1 0 0 0 0 0 0 0
b 2 0 0 0 0 0 0 0
c 1 0 0 0 0 0 0 0
c 3 0 0 0 0 0 0 0
d 2 0 0 0 0 0 0 0
Columns 3:9 are binary flags to indicate which combination of years the member has records in. So I wish the result of an SQL update to look as follows:
member year y1 y2 y3 y1y2 y2y3 y1y3 y1y2y3
a 1 0 0 0 0 0 0 1
a 2 0 0 0 0 0 0 1
a 3 0 0 0 0 0 0 1
b 1 0 0 0 1 0 0 0
b 2 0 0 0 1 0 0 0
c 1 0 0 0 0 0 1 0
c 3 0 0 0 0 0 1 0
d 2 0 1 0 0 0 0 0
The code in the question linked above does something very close but only when it is a count of the distinct years in which the member has records. I need to base the columns on the specific values of the years in which the member has records.
Thanks in advance!
SOLUTION
SELECT member,
case when min(distinct(year)) = 1 and max(distinct(year)) = 1 then 1 else 0 end y1,
case when min(distinct(year)) = 1 and max(distinct(year)) = 2 then 1 else 0 end y1y2,
case when min(distinct(year)) = 1 and max(distinct(year)) = 3 and count(distinct(year)) = 2 then 1 else 0 end y1y3,
case when min(distinct(year)) = 1 and max(distinct(year)) = 3 and count(distinct(year)) = 3 then 1 else 0 end y1y2y3,
case when min(distinct(year)) = 2 and max(distinct(year)) = 2 then 1 else 0 end y2,
case when min(distinct(year)) = 2 and max(distinct(year)) = 3 then 1 else 0 end y2y3,
case when min(distinct(year)) = 3 then 1 else 0 end y3
INTO temp5
FROM modset
GROUP BY member;
UPDATE modset M
SET y1 = T.y1, y2 = T.y2, y3 = T.y3, y1y2 = T.y1y2, y1y3 = T.y1y3, y2y3 = T.y2y3, y1y2y3 = T.y1y2y3
FROM temp5 T
WHERE T.member = M.member;
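
The same idea can be tried out end to end with a small script. This is only a sketch, assuming an in-memory SQLite database, years limited to 1, 2 and 3, and a hypothetical lookup table FLAG_FOR that maps each combination of years to its flag column. It is not the dialect used in the solution above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE modset (
        member TEXT, year INTEGER,
        y1 INT DEFAULT 0, y2 INT DEFAULT 0, y3 INT DEFAULT 0,
        y1y2 INT DEFAULT 0, y2y3 INT DEFAULT 0, y1y3 INT DEFAULT 0,
        y1y2y3 INT DEFAULT 0)
""")
rows = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2),
        ("c", 1), ("c", 3), ("d", 2)]
conn.executemany("INSERT INTO modset (member, year) VALUES (?, ?)", rows)

# One row per member: h1/h2/h3 are 1 if the member has a record in that year.
per_member = conn.execute("""
    SELECT member, MAX(year = 1), MAX(year = 2), MAX(year = 3)
    FROM modset
    GROUP BY member
""").fetchall()

# Hypothetical lookup: (has year 1, has year 2, has year 3) -> flag column.
FLAG_FOR = {
    (1, 0, 0): "y1", (0, 1, 0): "y2", (0, 0, 1): "y3",
    (1, 1, 0): "y1y2", (0, 1, 1): "y2y3", (1, 0, 1): "y1y3",
    (1, 1, 1): "y1y2y3",
}

for member, h1, h2, h3 in per_member:
    column = FLAG_FOR[(h1, h2, h3)]  # column name comes from the fixed dict
    conn.execute(f"UPDATE modset SET {column} = 1 WHERE member = ?", (member,))

result = conn.execute("""
    SELECT member, year, y1, y2, y3, y1y2, y2y3, y1y3, y1y2y3
    FROM modset ORDER BY member, year
""").fetchall()
```

Testing for the presence of each year separately, rather than comparing min, max and count, keeps the mapping explicit for every combination.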

What query are you using to return the indicators of the years in which the member has records?
It sounds like you want to take your query results and use them in your update:
http://dev.mysql.com/doc/refman/5.0/en/update.html
It may look something like this:
UPDATE targetTable t, sourceTable s
SET t.y1 = s.y1, t.y2 = s.y2 -- (and so on...)
WHERE t.member = s.member AND t.year = s.year;

Related

Get_dummies produces more columns than it's supposed to

I'm using get_dummies on a column of data that contains zeroes, 'D' or 'E'. Instead of producing 2 columns it produces 5: C, D, E, N, O. I'm not sure what they are or how to make it produce just 2, as it's supposed to.
When I pull just that column, it shows only 0, D and E, but get_dummies adds extra columns:
data[[2]]
0
0
D
0
0
0
0
D
0
0
When I do this:
dummy = pd.get_dummies(data[2], dummy_na = False)
dummy.head()
I get
0 C D E N O PreferredContactTime
0 0 0 0 0 0 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
0 0 1 0 0 0 0
1 0 0 0 0 0 0
What are C, N and O? I don't understand what it is displaying at all.
Setup
dtype = pd.CategoricalDtype([0, 'C', 'D', 'E', 'N', 'O', 'PreferredContactTime'])
data = pd.DataFrame({2: [
'PreferredContactTime', 0, 0, 'D', 0, 0, 0, 0, 'D', 0, 0
]}).astype(dtype)
Your result
dummy = pd.get_dummies(data[2], dummy_na=False )
dummy.head()
0 C D E N O PreferredContactTime
0 0 0 0 0 0 0 1
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 0 0 1 0 0 0 0
4 1 0 0 0 0 0 0
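
If the column really is categorical with categories that never occur, as the setup above assumes, one way to get only the observed values is to drop the stray header row and remove the unused categories before calling get_dummies. This is a sketch under that assumption; the variable names col and dummy_columns are illustrative.

```python
import pandas as pd

# Same setup as above: a categorical column whose first value is a stray header.
dtype = pd.CategoricalDtype([0, 'C', 'D', 'E', 'N', 'O', 'PreferredContactTime'])
data = pd.DataFrame({2: [
    'PreferredContactTime', 0, 0, 'D', 0, 0, 0, 0, 'D', 0, 0
]}).astype(dtype)

# Drop the header row, then forget the categories that no longer occur.
col = data[2].iloc[1:].cat.remove_unused_categories()

# get_dummies emits one column per remaining category.
dummy = pd.get_dummies(col)
dummy_columns = list(dummy.columns)
```

With the unused categories gone, the only columns left are the values that actually appear in the data.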

how to convert pandas dataframe to libsvm format?

I have pandas data frame like below.
df
Out[50]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 \
0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
1 0 1 1 1 0 0 1 1 1 1 ... 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
5 1 0 0 1 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
7 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
[8 rows x 100 columns]
I have target variable as an array as below.
[1, -1, -1, 1, 1, -1, 1, 1]
How can I map this target variable to the data frame and convert it into libsvm format?
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.index.map[(equi)]
d = df[np.setdiff1d(df.columns,['indx','labels'])]
e = df.label
dump_svmlight_file(d,e,'D:/result/smvlight2.dat')
ERROR:
File "D:/spyder/april.py", line 54, in <module>
df["labels"] = df.index.map[(equi)]
TypeError: 'method' object is not subscriptable
When I use
df["labels"] = df.index.list(map[(equi)])
ERROR:
AttributeError: 'RangeIndex' object has no attribute 'list'
Please help me to solve those errors.
I think you need to convert the index with to_series and then call map:
df["labels"] = df.index.to_series().map(equi)
Or use rename of index:
df["labels"] = df.rename(index=equi).index
For the difference of columns, pandas has Index.difference. All together:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.rename(index=equi).index
e = df["labels"]
d = df[df.columns.difference(['indx','labels'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
Also, it seems the labels column is not necessary:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
e = df.rename(index=equi).index
d = df[df.columns.difference(['indx'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
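
The mapping step on its own can be checked without scikit-learn. Below is a minimal sketch, assuming a small stand-in frame (the 8 x 2 data is hypothetical) with the default 0..7 index from the question.

```python
import pandas as pd

# Small stand-in for the 8 x 100 frame in the question (hypothetical data).
df = pd.DataFrame([[0, 1], [1, 0], [1, 1], [0, 0],
                   [0, 1], [1, 0], [0, 0], [0, 1]])
target = [1, -1, -1, 1, 1, -1, 1, 1]
equi = dict(enumerate(target))  # {0: 1, 1: -1, ...} as in the question

# map is a method, so it is called with parentheses, not indexed with [].
labels_from_map = df.index.to_series().map(equi)

# With a default 0..n-1 index the target list can also be wrapped directly.
labels_direct = pd.Series(target, index=df.index)
```

Both variants give the same labels here, so the dictionary is only needed when the index does not already line up with the target array by position.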

SQL: Count columns equal to a value

I have a table and I would like to write a query that shows the rows in which all columns but one are '0'.
T_CELKO_TEN_IN_ON_TIO(TIO_ID,TIO_1,TIO_2,TIO_3,TIO_4,TIO_5,TIO_6,TIO_7,TIO_8,TIO_9,TIO_10);
It contains numbers. For example, if I have:
1(id) 0 1 1 0 0 0 0 0 0 0
2(id) 0 0 0 0 0 0 0 1 0 0
3(id) 0 1 -2 3 -4 5 -6 7 -8 5
So the query should print:
2(id) 0 0 0 0 0 0 0 1 0 0
I have written this query:
Select * from T_CELKO_TEN_IN_ON_TIO where SUM (CASE WHEN TIO_1='0' THEN 1 ELSE 0 END OR
TIO_2='0' THEN 1 ELSE 0 END OR
TIO_3='0' THEN 1 ELSE 0 END OR
TIO_4='0' THEN 1 ELSE 0 END OR
TIO_5='0' THEN 1 ELSE 0 END OR
TIO_6='0' THEN 1 ELSE 0 END OR
TIO_7='0' THEN 1 ELSE 0 END OR
TIO_8='0' THEN 1 ELSE 0 END OR
TIO_9='0' THEN 1 ELSE 0 END OR
TIO_10='0' THEN 1 ELSE 0 END)=9;
I get an error: An expression of non-boolean type specified in a context where a condition is expected, near 'OR'
but I suspect the logic of my query is wrong anyway.
I hope I understand this correctly:
SUM is an aggregate function and cannot be used in this context. In my code I test whether each value (assuming they are numeric) is 1, 0 or something else. All other values are returned as 1000, so the sum of these values can only be 1 if there is exactly one 1 and all the rest are 0.
Select * from T_CELKO_TEN_IN_ON_TIO
where ( CASE TIO_1 WHEN 1 THEN 1 WHEN 0 THEN 0 ELSE 1000 END
+ CASE TIO_2 WHEN 1 THEN 1 WHEN 0 THEN 0 ELSE 1000 END
+ CASE TIO_3 WHEN 1 THEN 1 WHEN 0 THEN 0 ELSE 1000 END
+ CASE TIO_4 WHEN 1 THEN 1 WHEN 0 THEN 0 ELSE 1000 END
+ CASE TIO_5 WHEN 1 THEN 1 WHEN 0 THEN 0 ELSE 1000 END
+ CASE TIO_6 WHEN 1 THEN 1 WHEN 0 THEN 0 ELSE 1000 END
+ CASE TIO_7 WHEN 1 THEN 1 WHEN 0 THEN 0 ELSE 1000 END
+ CASE TIO_8 WHEN 1 THEN 1 WHEN 0 THEN 0 ELSE 1000 END
+ CASE TIO_9 WHEN 1 THEN 1 WHEN 0 THEN 0 ELSE 1000 END
+ CASE TIO_10 WHEN 1 THEN 1 WHEN 0 THEN 0 ELSE 1000 END)=1;
UPDATE
I got this wrong, as I thought you want to handle the "1" separately. This should be what you really needed:
Select * from T_CELKO_TEN_IN_ON_TIO
where ( CASE TIO_1 WHEN 0 THEN 1 ELSE 0 END
+ CASE TIO_2 WHEN 0 THEN 1 ELSE 0 END
+ CASE TIO_3 WHEN 0 THEN 1 ELSE 0 END
+ CASE TIO_4 WHEN 0 THEN 1 ELSE 0 END
+ CASE TIO_5 WHEN 0 THEN 1 ELSE 0 END
+ CASE TIO_6 WHEN 0 THEN 1 ELSE 0 END
+ CASE TIO_7 WHEN 0 THEN 1 ELSE 0 END
+ CASE TIO_8 WHEN 0 THEN 1 ELSE 0 END
+ CASE TIO_9 WHEN 0 THEN 1 ELSE 0 END
+ CASE TIO_10 WHEN 0 THEN 1 ELSE 0 END)=9;
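
The updated query can be checked against the three sample rows from the question. The sketch below assumes an in-memory SQLite database rather than the asker's server, and builds the ten CASE terms in a loop to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cols = [f"TIO_{i}" for i in range(1, 11)]
conn.execute(
    "CREATE TABLE T_CELKO_TEN_IN_ON_TIO (TIO_ID INTEGER, "
    + ", ".join(f"{c} INTEGER" for c in cols) + ")"
)

# Sample rows from the question: id followed by the ten TIO values.
rows = [
    (1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0),
    (2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
    (3, 0, 1, -2, 3, -4, 5, -6, 7, -8, 5),
]
placeholders = ", ".join("?" * 11)
conn.executemany(
    f"INSERT INTO T_CELKO_TEN_IN_ON_TIO VALUES ({placeholders})", rows
)

# One CASE per column, added together: the number of zero columns in the row.
zero_count = " + ".join(f"CASE {c} WHEN 0 THEN 1 ELSE 0 END" for c in cols)
query = f"SELECT TIO_ID FROM T_CELKO_TEN_IN_ON_TIO WHERE ({zero_count}) = 9"
matching_ids = [row[0] for row in conn.execute(query)]
```

Only the row with exactly nine zero columns is returned, which is row 2 in the sample data.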
Maybe you want something like this:
Select * from T_CELKO_TEN_IN_ON_TIO
where SUM (CASE WHEN TIO_1='0' THEN 1
WHEN TIO_2='0' THEN 1
WHEN TIO_3='0' THEN 1
WHEN TIO_4='0' THEN 1
WHEN TIO_5='0' THEN 1
WHEN TIO_6='0' THEN 1
WHEN TIO_7='0' THEN 1
WHEN TIO_8='0' THEN 1
WHEN TIO_9='0' THEN 1
WHEN TIO_10='0' THEN 1 ELSE 0 END) = 9;

Complex Excel Formula in Pandas

Excel Formulas I am trying to replicate in pandas:
Click here to download workbook
* Look at columns D, E and F
entsig and exsig are manual and can be changed. In real life they would be derived from the value of another column or a comparison of two other columns
ent = 1 if entsig previous = 1 and in = 0
in = 1 if ent previous = 1 or (in previous = 1 and ex = 0)
ex = 1 if exsig previous = 1 and in previous = 1
so either ent, in, or ex will always be = 1 but never more than one of them
import pandas as pd
df = pd.DataFrame(
[[0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,1,0,0,0], [0,1,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0],
[1,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0]],
columns=['entsig', 'exsig','ent', 'in', 'ex'])
for i in df.index:
    df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)]=1
    df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)]=1
    df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))]=1
for j in df.index:
    df['ent'][df['in'] == 1]=0
    df['in'][df['ex']==1]=0
    df['ex'][df['ex'].shift(1)==1]=0
df
results in
entsig exsig ent in ex
0 0 0 0 0 0
1 1 0 0 0 0
2 1 0 1 0 0
3 1 0 0 1 0
4 0 0 0 1 0
5 0 1 0 1 0
6 0 1 0 0 1
7 1 0 0 0 0
8 1 0 1 0 0
9 0 0 0 1 0
10 0 0 0 1 0
11 0 0 0 1 0
12 0 1 0 1 0
13 0 1 0 0 1
14 0 1 0 0 0
15 0 0 0 0 0
16 0 0 0 0 0
17 1 0 0 0 0
18 1 0 1 0 0
19 1 0 0 1 0
20 1 1 0 1 0
21 0 1 0 0 1
22 0 1 0 0 0
23 0 1 0 0 0
Question
How can I make this code faster? It runs slowly because it uses loops, but I have not been able to come up with a solution that avoids them. Any ideas or comments are appreciated.
If we can assume every group of 1's in entsig is followed by at least one 1 in
exsig, then you could compute ent, ex and in like this:
def ent_in_ex(df):
    entsig_mask = (df['entsig'].diff().shift(1) == 1)
    exsig_mask = (df['exsig'].diff().shift(1) == 1)
    df.loc[entsig_mask, 'ent'] = 1
    df.loc[exsig_mask, 'ex'] = 1
    df['in'] = df['ent'].shift(1).cumsum().subtract(df['ex'].cumsum(), fill_value=0)
    return df
If we can make this assumption, then ent_in_ex is significantly faster:
In [5]: %timeit orig(df)
10 loops, best of 3: 185 ms per loop
In [6]: %timeit ent_in_ex(df)
100 loops, best of 3: 2.23 ms per loop
In [95]: orig(df).equals(ent_in_ex(df))
Out[95]: True
where orig is the original code:
def orig(df):
    for i in df.index:
        df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)]=1
        df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)]=1
        df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))]=1
    for j in df.index:
        df['ent'][df['in'] == 1]=0
        df['in'][df['ex']==1]=0
        df['ex'][df['ex'].shift(1)==1]=0
    return df
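
If the assumption about entsig and exsig does not hold, the rules in the question can also be read as a small state machine and evaluated in one pass over the rows. The sketch below is plain Python and is one interpretation of those rules, not a drop-in replacement for either function above: a row is "ent" when the previous row was idle and had entsig = 1, "in" after "ent" or while no exit signal has arrived, and "ex" when the previous row was "in" and had exsig = 1. The function name ent_in_ex_states is hypothetical. For the data in the question it marks the same rows as the result table shown there.

```python
# (entsig, exsig) pairs copied from the DataFrame in the question.
signals = [(0, 0), (1, 0), (1, 0), (1, 0), (0, 0), (0, 1), (0, 1), (1, 0),
           (1, 0), (0, 0), (0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (0, 0),
           (0, 0), (1, 0), (1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 1)]


def ent_in_ex_states(signals):
    """Return one (ent, in, ex) tuple per row, computed in a single pass."""
    rows = []
    prev = (0, 0, 0)                 # state of the previous row
    prev_entsig = prev_exsig = 0     # signals of the previous row
    for entsig, exsig in signals:
        prev_ent, prev_in, _ = prev
        if prev_ent:
            cur = (0, 1, 0)          # an entry is always followed by "in"
        elif prev_in:
            # stay in until the previous row carried an exit signal
            cur = (0, 0, 1) if prev_exsig else (0, 1, 0)
        elif prev_entsig:
            cur = (1, 0, 0)          # idle before, entry signal seen
        else:
            cur = (0, 0, 0)          # stay idle
        rows.append(cur)
        prev, prev_entsig, prev_exsig = cur, entsig, exsig
    return rows


states = ent_in_ex_states(signals)
ent_rows = [i for i, s in enumerate(states) if s[0]]
in_rows = [i for i, s in enumerate(states) if s[1]]
ex_rows = [i for i, s in enumerate(states) if s[2]]
```

Each row is visited once, so the cost grows linearly with the number of rows instead of re-evaluating whole-column masks on every iteration.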

SKlearn metrics fails with expected y object and predicted y object

In scikit-learn I have created a few models with train and test data.
The models work fine, but when I try to compute any accuracy metric, it fails. I assume something is wrong with either my prediction object (pred y) or my expected object (true y).
For this test, I have looked at the pred y. It has dtype object and holds 119 0/1 values.
The true y also has dtype object and holds 119 0/1 values.
My code and the error are below, along with a comparison of the two objects. It is the error that I do not understand.
"expected" is my true y and "target_predicted" is the predicted y.
I have tried other metrics and other models; it always fails at this stage.
Any assistance?
#Basic Decision Tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(bank_train, bank_train_target)
print clf
DecisionTreeClassifier(compute_importances=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_density=None, min_samples_leaf=1, min_samples_split=2,
random_state=None, splitter='best')
#test model using test data
target_predicted = clf.predict(bank_test)
accuracy_score(expected,target_predicted)
#error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-23d1a990a192> in <module>()
1 #test model using test data
2 target_predicted = clf.predict(bank_test)
----> 3 accuracy_score(expected,target_predicted)
/Users/mpgartland1/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in accuracy_score(y_true, y_pred, normalize, sample_weight)
1295
1296 # Compute accuracy for each possible representation
-> 1297 y_type, y_true, y_pred = _check_clf_targets(y_true, y_pred)
1298 if y_type == 'multilabel-indicator':
1299 score = (y_pred != y_true).sum(axis=1) == 0
/Users/mpgartland1/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in _check_clf_targets(y_true, y_pred)
125 if (y_type not in ["binary", "multiclass", "multilabel-indicator",
126 "multilabel-sequences"]):
--> 127 raise ValueError("{0} is not supported".format(y_type))
128
129 if y_type in ["binary", "multiclass"]:
ValueError: unknown is not supported
Here is a comparison of the two objects.
print target_predicted.size
print expected.size
print target_predicted.dtype
print expected.dtype
print target_predicted
print expected
119
119
object
object
[1 0 0 1 0 0 1 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 1
0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0
0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1
0 1 0 1 0 0 0 1]
[1 0 0 1 0 0 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1
0 0 1 0 0 0 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 0 0 0 0
0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1
0 1 0 0 0 1 0 1]
It also fails when I try a confusion matrix or another metric, using very cookie-cutter code. So my guess is that the problem is in the object(s).
Thanks
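
A likely cause, given that both arrays print dtype object: scikit-learn's target type check appears to report "unknown" for object arrays that do not hold strings, which would explain the ValueError. Casting both arrays to an integer dtype before calling the metric is worth trying. The sketch below uses short hypothetical stand-ins for the two arrays and computes accuracy with NumPy only, so it shows the cast rather than the scikit-learn call itself.

```python
import numpy as np

# Hypothetical stand-ins for the 119-element arrays in the question.
expected = np.array([1, 0, 0, 1, 0], dtype=object)
target_predicted = np.array([1, 0, 1, 1, 0], dtype=object)

# Cast both arrays to a numeric dtype before handing them to a metric.
expected_int = expected.astype(int)
predicted_int = target_predicted.astype(int)

# Plain NumPy accuracy, to check the cast without depending on scikit-learn.
accuracy = float((expected_int == predicted_int).mean())
```

After the cast, passing expected_int and predicted_int to accuracy_score should no longer hit the type check, assuming the values really are all 0 or 1.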