Complex Excel Formula in Pandas - pandas

Excel Formulas I am trying to replicate in pandas:
Click here to download workbook
* Look at columns D, E and F
entsig and exsig are manual and can be changed. In real life they would be derived from the value of another column or a comparison of two other columns
ent = 1 if entsig previous = 1 and in = 0
in = 1 if ent previous = 1 or (in previous = 1 and ex = 0)
ex = 1 if exsig previous = 1 and in previous = 1
so either ent, in, or ex will always be = 1 but never more than one of them
import pandas as pd
df = pd.DataFrame(
[[0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,1,0,0,0], [0,1,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0],
[1,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0]],
columns=['entsig', 'exsig','ent', 'in', 'ex'])
for i in df.index:
df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)]=1
df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)]=1
df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))]=1
for j in df.index:
df['ent'][df['in'] == 1]=0
df['in'][df['ex']==1]=0
df['ex'][df['ex'].shift(1)==1]=0
df
results in
entsig exsig ent in ex
0 0 0 0 0 0
1 1 0 0 0 0
2 1 0 1 0 0
3 1 0 0 1 0
4 0 0 0 1 0
5 0 1 0 1 0
6 0 1 0 0 1
7 1 0 0 0 0
8 1 0 1 0 0
9 0 0 0 1 0
10 0 0 0 1 0
11 0 0 0 1 0
12 0 1 0 1 0
13 0 1 0 0 1
14 0 1 0 0 0
15 0 0 0 0 0
16 0 0 0 0 0
17 1 0 0 0 0
18 1 0 1 0 0
19 1 0 0 1 0
20 1 1 0 1 0
21 0 1 0 0 1
22 0 1 0 0 0
23 0 1 0 0 0
Question
How can I make this code faster? It runs slow because it's a loop but I have not been able to come up with a solution that does not use loops. Any ideas or comments are appreciated.

If we can assume every group of 1's in entsig is followed by at least one 1 in
exsig, then you could compute ent, ex and in like this:
def ent_in_ex(df):
entsig_mask = (df['entsig'].diff().shift(1) == 1)
exsig_mask = (df['exsig'].diff().shift(1) == 1)
df.loc[entsig_mask, 'ent'] = 1
df.loc[exsig_mask, 'ex'] = 1
df['in'] = df['ent'].shift(1).cumsum().subtract(df['ex'].cumsum(), fill_value=0)
return df
If we can make this assumption, then ent_in_ex is significantly faster:
In [5]: %timeit orig(df)
10 loops, best of 3: 185 ms per loop
In [6]: %timeit ent_in_ex(df)
100 loops, best of 3: 2.23 ms per loop
In [95]: orig(df).equals(ent_in_ex(df))
Out[95]: True
where orig is the original code:
def orig(df):
for i in df.index:
df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)]=1
df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)]=1
df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))]=1
for j in df.index:
df['ent'][df['in'] == 1]=0
df['in'][df['ex']==1]=0
df['ex'][df['ex'].shift(1)==1]=0
return df

Related

How to speed up Pandas' "iterrows"

I have a Pandas dataframe which I want to transform in the following way: I have some sensor data from an intelligent floor which is in column "CAPACITANCE" (split by ",") and that data comes from the device indicated in column "DEVICE". Now I want to have one row with a column per sensor - each device has 8 sensors, so I want to have devices x 8 columns and in that column I want the sensor data from exactly that sensor.
But my code seems to be super slow since I have about 90.000 rows in that dataframe! Does anyone have a suggestion how to speed it up?
BEFORE:
CAPACITANCE DEVICE TIMESTAMP \
0 0.00,-1.00,0.00,1.00,1.00,-2.00,13.00,1.00 01,07 2017/11/15 12:24:42
1 0.00,0.00,-1.00,-1.00,-1.00,0.00,-1.00,0.00 01,07 2017/11/15 12:24:42
2 0.00,-1.00,-2.00,0.00,0.00,1.00,0.00,-2.00 01,07 2017/11/15 12:24:43
3 2.00,0.00,-2.00,-1.00,0.00,0.00,1.00,-2.00 01,07 2017/11/15 12:24:43
4 1.00,0.00,-2.00,1.00,1.00,-3.00,5.00,1.00 01,07 2017/11/15 12:24:44
AFTER:
01,01-0 01,01-1 01,01-2 01,01-3 01,01-4 01,01-5 01,01-6 01,01-7 \
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
01,02-0 01,02-1 ... 05,07-1 05,07-2 05,07-3 05,07-4 05,07-5 \
0 0 0 ... 0 0 0 0 0
1 0 0 ... 0 0 0 0 0
2 0 0 ... 0 0 0 0 0
3 0 0 ... 0 0 0 0 0
4 0 0 ... 0 0 0 0 0
05,07-6 05,07-7 TIMESTAMP 01,07-8
0 0 0 2017-11-15 12:24:42 1.00
1 0 0 2017-11-15 12:24:42 0.00
2 0 0 2017-11-15 12:24:43 -2.00
3 0 0 2017-11-15 12:24:43 -2.00
4 0 0 2017-11-15 12:24:44 1.00
# creating new dataframe based on the old one
floor_df_resampled = floor_df.copy()
floor_device = ["01,01", "01,02", "01,03", "01,04", "01,05", "01,06", "01,07", "01,08", "01,09", "01,10",
"02,01", "02,02", "02,03", "02,04", "02,05", "02,06", "02,07", "02,08", "02,09", "02,10",
"03,01", "03,02", "03,03", "03,04", "03,05", "03,06", "03,07", "03,08", "03,09",
"04,01", "04,02", "04,03", "04,04", "04,05", "04,06", "04,07", "04,08", "04,09",
"05,06", "05,07"]
# creating new columns
floor_objects = []
for device in floor_device:
for sensor in range(8):
floor_objects.append(device + "-" + str(sensor))
# merging new columns
floor_df_resampled = pd.concat([floor_df_resampled, pd.DataFrame(columns=floor_objects)], ignore_index=True, sort=True)
# part that takes loads of time
for index, row in floor_df_resampled.iterrows():
obj = row["DEVICE"]
sensor_data = row["CAPACITANCE"].split(',')
for idx, val in enumerate(sensor_data):
col = obj + "-" + str(idx + 1)
floor_df_resampled.loc[index, col] = val
floor_df_resampled.drop(["DEVICE"], axis=1, inplace=True)
floor_df_resampled.drop(["CAPACITANCE"], axis=1, inplace=True)
Like commented, I'm not sure why you want that many columns, but the new columns can be created as follows:
def explode(x):
dev_name = x.DEVICE.iloc[0]
ret_df = x.CAPACITANCE.str.split(',', expand=True).astype(float)
ret_df.columns = [f'{dev_name}-{col}' for col in ret_df.columns]
return ret_df
new_df = df.groupby('DEVICE').apply(explode).fillna(0)
and then you can merge this with the old data frame:
df = df.join(new_df)

how to convert pandas dataframe to libsvm format?

I have pandas data frame like below.
df
Out[50]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 \
0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
1 0 1 1 1 0 0 1 1 1 1 ... 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
5 1 0 0 1 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
7 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
[8 rows x 100 columns]
I have target variable as an array as below.
[1, -1, -1, 1, 1, -1, 1, 1]
How can I map this target variable to a data frame and convert it into lib SVM format?.
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.index.map[(equi)]
d = df[np.setdiff1d(df.columns,['indx','labels'])]
e = df.label
dump_svmlight_file(d,e,'D:/result/smvlight2.dat')er code here
ERROR:
File "D:/spyder/april.py", line 54, in <module>
df["labels"] = df.index.map[(equi)]
TypeError: 'method' object is not subscriptable
When I use
df["labels"] = df.index.list(map[(equi)])
ERROR:
AttributeError: 'RangeIndex' object has no attribute 'list'
Please help me to solve those errors.
I think you need convert index to_series and then call map:
df["labels"] = df.index.to_series().map(equi)
Or use rename of index:
df["labels"] = df.rename(index=equi).index
All together:
For difference of columns pandas has difference:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.rename(index=equi).index
e = df["labels"]
d = df[df.columns.difference(['indx','labels'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
Also it seems label column is not necessary:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
e = df.rename(index=equi).index
d = df[df.columns.difference(['indx'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')

How to set (1) to max elements in pandas dataframe and (0) to everything else?

Let's say I have a pandas DataFrame.
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df:
a b c d e f
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330
how to I get set 1 to rowwise max and 0 to other elements?
I came up with:
>>> for i in range(len(df)):
... df.loc[i][df.loc[i].idxmax(axis=1)] = 1
... df.loc[i][df.loc[i] != 1] = 0
generates
df:
a b c d e f
0 0 0 0 0 0 1
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 0 0 0 1 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 1 0
9 0 0 0 1 0 0
Does anyone has a better way of doing it? May be by getting rid of the for loop or applying lambda?
Use max and check for equality using eq and cast the boolean df to int using astype, this will convert True and False to 1 and 0:
In [21]:
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df
Out[21]:
a b c d e f
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829
In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df
Out[22]:
a b c d e f
0 0 0 0 1 0 0
1 0 0 0 0 1 0
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 1 0 0 0 0
7 0 1 0 0 0 0
8 0 0 0 0 1 0
9 0 0 1 0 0 0
Timings
In [24]:
# #Raihan Masud's method
%timeit df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop
In [25]:
# #Nader Hisham's method
%%timeit
def max_binary(df):
binary = np.where( df == df.max() , 1 , 0 )
return binary
​
df.apply( max_binary , axis = 1)
100 loops, best of 3: 9.63 ms per loop
You can see that my method is over 12X faster than #Raihan's method
In [4]:
%%timeit
for i in range(len(df)):
df.loc[i][df.loc[i].idxmax(axis=1)] = 1
df.loc[i][df.loc[i] != 1] = 0
10 loops, best of 3: 21.1 ms per loop
The for loop is also significantly slower
import numpy as np
def max_binary(df):
binary = np.where( df == df.max() , 1 , 0 )
return binary
df.apply( max_binary , axis = 1)
Following Nader's pattern, this is a shorter version:
df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)

SKlearn metrics fails with expected y object and predicted y object

In Sci-kit learn have created a few models with train and test data.
The models work fine, but when I try to compute any accuracy metrics, it fails. I assume something is wrong with either my prediction object (pred y) or expected object (true y).
For this test, I have looked at the pred y. It is an object and have 119 0/1 values.
The true y is also an object and has 119 0/1 values.
My code and the error is below, as well as an object comparison. It is the error I do not understand.
"expected" is my true y and "target_predicted" is the predicted y.
I have tried other metrics and other models- it always fails when I am at this stage.
Any assistance?
#Basic Decsion Tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(bank_train, bank_train_target)
print clf
DecisionTreeClassifier(compute_importances=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_density=None, min_samples_leaf=1, min_samples_split=2,
random_state=None, splitter='best')
#test model using test data
target_predicted = clf.predict(bank_test)
accuracy_score(expected,target_predicted)
#error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-23d1a990a192> in <module>()
1 #test model using test data
2 target_predicted = clf.predict(bank_test)
----> 3 accuracy_score(expected,target_predicted)
/Users/mpgartland1/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in accuracy_score(y_true, y_pred, normalize, sample_weight)
1295
1296 # Compute accuracy for each possible representation
-> 1297 y_type, y_true, y_pred = _check_clf_targets(y_true, y_pred)
1298 if y_type == 'multilabel-indicator':
1299 score = (y_pred != y_true).sum(axis=1) == 0
/Users/mpgartland1/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in _check_clf_targets(y_true, y_pred)
125 if (y_type not in ["binary", "multiclass", "multilabel-indicator",
126 "multilabel-sequences"]):
--> 127 raise ValueError("{0} is not supported".format(y_type))
128
129 if y_type in ["binary", "multiclass"]:
ValueError: unknown is not supported
Here is a comparison of the two objects.
print target_predicted.size
print expected.size
print target_predicted.dtype
print expected.dtype
print target_predicted
print expected
119
119
object
object
[1 0 0 1 0 0 1 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 1
0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0
0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1
0 1 0 1 0 0 0 1]
[1 0 0 1 0 0 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1
0 0 1 0 0 0 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 0 0 0 0
0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1
0 1 0 0 0 1 0 1]
If also fails when I try a confusion matrix or other metric- using very cookie cutter code. So, my guess is in the object(s).
Thanks

set column value based on distinct values in another column

I am trying to do something very similar to this question: mysql - UPDATEing row based on other rows
I have a table, called modset, of the following form:
member year y1 y2 y3 y1y2 y2y3 y1y3 y1y2y3
a 1 0 0 0 0 0 0 0
a 2 0 0 0 0 0 0 0
a 3 0 0 0 0 0 0 0
b 1 0 0 0 0 0 0 0
b 2 0 0 0 0 0 0 0
c 1 0 0 0 0 0 0 0
c 3 0 0 0 0 0 0 0
d 2 0 0 0 0 0 0 0
Columns 3:9 are binary flags to indicate which combination of years the member has records in. So I wish the result of an SQL update to look as follows:
member year y1 y2 y3 y1y2 y2y3 y1y3 y1y2y3
a 1 0 0 0 0 0 0 1
a 2 0 0 0 0 0 0 1
a 3 0 0 0 0 0 0 1
b 1 0 0 0 1 0 0 0
b 2 0 0 0 1 0 0 0
c 1 0 0 0 0 0 1 0
c 3 0 0 0 0 0 1 0
d 2 0 1 0 0 0 0 0
The code in the question linked above does something very close but only when it is a count of the distinct years in which the member has records. I need to base the columns on the specific values of the years in which the member has records.
Thanks in advance!
SOLUTION
SELECT member,
case when min(distinct(year)) = 1 and max(distinct(year)) = 1 then 1 else 0 end y1,
case when min(distinct(year)) = 1 and max(distinct(year)) = 2 then 1 else 0 end y1y2,
case when min(distinct(year)) = 1 and max(distinct(year)) = 3 and count(distinct(year)) = 2 then 1 else 0 end y1y3,
case when min(distinct(year)) = 1 and max(distinct(year)) = 3 and count(distinct(year)) = 3 then 1 else 0 end y1y2y3,
case when min(distinct(year)) = 2 and max(distinct(year)) = 2 then 1 else 0 end y2,
case when min(distinct(year)) = 2 and max(distinct(year)) = 3 then 1 else 0 end y2y3,
case when min(distinct(year)) = 3 then 1 else 0 end y3
INTO temp5
FROM modset
GROUP BY member;
UPDATE modset M
SET y1 = T.y1, y2 = T.y1, y3 = T.y3, y1y2 = T.y1y2, y1y3 = T.y1y3, y2y3 = T.y2y3, y1y2y3 = T.y1y2y3
FROM temp5 T
WHERE T.member = M.member;
What is the query you are using to return the indicators of the years the member has records in?
It sounds like you would want take your query results and use it in your update:
http://dev.mysql.com/doc/refman/5.0/en/update.html
It may look something like this:
UPDATE targetTable t, sourceTable s
SET t.y1 = s.y1, t.y2 = s.y2 -- (and so on...)
WHERE t.member = s.member AND t.year = m.year;