generate date feature column using pandas

I have a timeseries data frame that has columns like these:
Date temp_data holiday day
01.01.2000 10000 0 1
02.01.2000 0 1 2
03.01.2000 2000 0 3
..
..
..
26.01.2000 200 0 26
27.01.2000 0 1 27
28.01.2000 500 0 28
29.01.2000 0 1 29
30.01.2000 200 0 30
31.01.2000 0 1 31
01.02.2000 0 1 1
02.02.2000 2500 0 2
Here, holiday = 0 when there is data present (a working day), and holiday = 1 when there is no data present (a non-working day).
I am trying to extract three new columns from this data: second_last_working_day_of_month, third_last_working_day_of_month, and fourth_last_wday.
The output data frame should look like this:
Date temp_data holiday day secondlast_wd thirdlast_wd fourthlast_wd
01.01.2000 10000 0 1 1 0 0
02.01.2000 0 1 2 0 0 0
03.01.2000 2000 0 3 0 0 0
..
..
25.01.2000 345 0 25 0 0 1
26.01.2000 200 0 26 0 1 0
27.01.2000 0 1 27 0 0 0
28.01.2000 500 0 28 1 0 0
29.01.2000 0 1 29 0 0 0
30.01.2000 200 0 30 0 0 0
31.01.2000 0 1 31 0 0 0
01.02.2000 0 1 1 0 0 0
02.02.2000 2500 0 2 0 0 0
Can anyone help me with this?

Example
data = [['26.01.2000', 200, 0, 26], ['27.01.2000', 0, 1, 27], ['28.01.2000', 500, 0, 28],
['29.01.2000', 0, 1, 29], ['30.01.2000', 200, 0, 30], ['31.01.2000', 0, 1, 31],
['26.02.2000', 200, 0, 26], ['27.02.2000', 0, 0, 27], ['28.02.2000', 500, 0, 28],['29.02.2000', 0, 1, 29]]
df = pd.DataFrame(data, columns=['Date', 'temp_data', 'holiday', 'day'])
df
Date temp_data holiday day
0 26.01.2000 200 0 26
1 27.01.2000 0 1 27
2 28.01.2000 500 0 28
3 29.01.2000 0 1 29
4 30.01.2000 200 0 30
5 31.01.2000 0 1 31
6 26.02.2000 200 0 26
7 27.02.2000 0 0 27
8 28.02.2000 500 0 28
9 29.02.2000 0 1 29
Code
For example, to make the secondlast_wd column (n = 2):
n = 2
s = pd.to_datetime(df['Date'])
result = df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(n)
result
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
Assign result to the secondlast_wd column:
df.assign(secondlast_wd=result.astype('int'))
output:
Date temp_data holiday day secondlast_wd
0 26.01.2000 200 0 26 0
1 27.01.2000 0 1 27 0
2 28.01.2000 500 0 28 1
3 29.01.2000 0 1 29 0
4 30.01.2000 200 0 30 0
5 31.01.2000 0 1 31 0
6 26.02.2000 200 0 26 0
7 27.02.2000 0 0 27 1
8 28.02.2000 500 0 28 0
9 29.02.2000 0 1 29 0
You can change n to get the third-last, fourth-last working day, and so on.
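If you need all three columns at once, a small loop over n works. This is a sketch, assuming df and s are defined as in the example above (grouping by month alone, as the original code does, assumes the data covers a single year):
rev_count = df.iloc[::-1]['holiday'].eq(0).groupby(s.dt.month).cumsum()
for n, col in [(2, 'secondlast_wd'), (3, 'thirdlast_wd'), (4, 'fourthlast_wd')]:
    # nth-last working day of the month: a working day whose reverse count is n
    df[col] = (df['holiday'].eq(0) & rev_count.eq(n)).astype(int)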
Update for comment
Check working days (reverse index):
df.iloc[::-1, 2].eq(0)  # 2 is the position of the 'holiday' column; df.loc[::-1, 'holiday'] works too
9 False
8 True
7 True
6 True
5 False
4 True
3 False
2 True
1 False
0 True
Name: holiday, dtype: bool
Reverse cumulative sum grouped by month: each working day adds 1 to the value above it, while each holiday keeps the same value as the row above (in reverse index order, of course).
df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum()
9 0
8 1
7 2
6 3
5 0
4 1
3 1
2 2
1 2
0 3
Name: holiday, dtype: int64
Find rows where holiday == 0 and the cumulative count equals 2; that is the second-last working day:
df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(2)
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
This operation returns the result with the original (not reversed) index.
Other Way
A more readable version is:
s = pd.to_datetime(df['Date'])
idx1 = df[df['holiday'].eq(0)].groupby(s.dt.month, as_index=False).nth(-2).index
df.loc[idx1, 'secondlast_wd'] = 1
df['secondlast_wd'] = df['secondlast_wd'].fillna(0).astype('int')
This produces the same result.
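The nth-based approach generalizes the same way; a sketch under the same assumptions (df and s as above, a single year of data):
for n, col in [(2, 'secondlast_wd'), (3, 'thirdlast_wd'), (4, 'fourthlast_wd')]:
    idx = df[df['holiday'].eq(0)].groupby(s.dt.month, as_index=False).nth(-n).index
    df[col] = 0
    df.loc[idx, col] = 1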

Related

How to populate a column with a value if condition is met across 2+ other columns

My dataframe is similar to the table below. I have 6 columns, each with 'Yes' or 'No' if a specific antibiotic was given.
AZITH CLIN CFTX METRO CFTN DOXY TREATED
Yes   Yes  No   No    No   No
No    Yes  No   Yes   No   No
Yes   Yes  No   No    No   No
No    No   No   No    No   No
Yes   Yes  Yes  Yes   Yes  Yes
No    Yes  Yes  Yes   No   Yes
No    No   No   No    No   Yes
No    No   No   No    No   No
Yes   Yes  Yes  No    No   No
Yes   No   Yes  Yes   No   No
Yes   No   No   No    Yes  No
No    No   Yes  Yes   No   Yes
No    No   No   No    Yes  Yes
No    No   No   Yes   No   Yes
I want to fill the column 'TREATED' with 'True' if specific combinations of antibiotic columns contain 'Yes'. If the conditions aren't met, then I would like to fill the 'TREATED' column with a 'False' value.
If ['AZITH'] & ['CLIN'] == 'Yes' |
['AZITH'] & ['CFTX'] & ['CLIN'] == 'Yes' |
['AZITH'] & ['CFTX'] & ['METRO']== 'Yes' |
['AZITH'] & ['CFTN'] == 'Yes' |
['CFTX'] & ['DOXY'] & ['METRO']== 'Yes' |
['CFTN'] & ['DOXY'] == 'Yes' |
['DOXY'] & ['METRO']== 'Yes' ,
Then return 'True' in column 'TREATED'
Else 'False'
What I had in mind was some sort of if statement or a lambda function; however, I am having trouble.
The logic must not be exclusive to the combinations above; it should also cover, for example, the case where all 6 medications were given. In that case 'True' should be returned, because the condition of giving at least 2 of the treatment medications has been met.
The desired output is below:
AZITH CLIN CFTX METRO CFTN DOXY TREATED
Yes   Yes  No   No    No   No   Yes
No    Yes  No   Yes   No   No   No
Yes   Yes  No   No    No   No   Yes
No    No   No   No    No   No   No
Yes   Yes  Yes  Yes   Yes  Yes  Yes
No    Yes  Yes  Yes   No   Yes  Yes
No    No   No   No    No   Yes  No
No    No   No   No    No   No   No
Yes   Yes  Yes  No    No   No   Yes
Yes   No   Yes  Yes   No   No   Yes
Yes   No   No   No    Yes  No   Yes
No    No   Yes  Yes   No   Yes  Yes
No    No   No   No    Yes  Yes  Yes
No    No   No   Yes   No   Yes  Yes
With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
{
"AZITH": [
"Yes",
"No",
"Yes",
"No",
"Yes",
"No",
"No",
"No",
"Yes",
"Yes",
"Yes",
"No",
"No",
"No",
],
"CLIN": [
"Yes",
"Yes",
"Yes",
"No",
"Yes",
"Yes",
"No",
"No",
"Yes",
"No",
"No",
"No",
"No",
"No",
],
"CFTX": [
"No",
"No",
"No",
"No",
"Yes",
"Yes",
"No",
"No",
"Yes",
"Yes",
"No",
"Yes",
"No",
"No",
],
"METRO": [
"No",
"Yes",
"No",
"No",
"Yes",
"Yes",
"No",
"No",
"No",
"Yes",
"No",
"Yes",
"No",
"Yes",
],
"CFTN": [
"No",
"No",
"No",
"No",
"Yes",
"No",
"No",
"No",
"No",
"No",
"Yes",
"No",
"Yes",
"No",
],
"DOXY": [
"No",
"No",
"No",
"No",
"Yes",
"Yes",
"Yes",
"No",
"No",
"No",
"No",
"Yes",
"Yes",
"Yes",
],
}
)
Here is one way to do it:
mask = (
((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes") & (df["CFTX"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CFTX"] == "Yes") & (df["METRO"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CFTN"] == "Yes"))
| ((df["CFTX"] == "Yes") & (df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
| ((df["CFTN"] == "Yes") & (df["DOXY"] == "Yes"))
| ((df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
)
df.loc[mask, "TREATED"] = "Yes"
df = df.fillna("No")
Then:
print(df)
# Output
AZITH CLIN CFTX METRO CFTN DOXY TREATED
0 Yes Yes No No No No Yes
1 No Yes No Yes No No No
2 Yes Yes No No No No Yes
3 No No No No No No No
4 Yes Yes Yes Yes Yes Yes Yes
5 No Yes Yes Yes No Yes Yes
6 No No No No No Yes No
7 No No No No No No No
8 Yes Yes Yes No No No Yes
9 Yes No Yes Yes No No Yes
10 Yes No No No Yes No Yes
11 No No Yes Yes No Yes Yes
12 No No No No Yes Yes Yes
13 No No No Yes No Yes Yes
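The repetition in that mask can also be factored out. A sketch equivalent to the expression above: list the valid combinations once, build one boolean check per combination, and OR them together.
combos = [('AZITH', 'CLIN'), ('AZITH', 'CLIN', 'CFTX'), ('AZITH', 'CFTX', 'METRO'),
          ('AZITH', 'CFTN'), ('CFTX', 'DOXY', 'METRO'), ('CFTN', 'DOXY'),
          ('DOXY', 'METRO')]
# True where every antibiotic in the combination is 'Yes'
checks = [(df[list(c)] == 'Yes').all(axis=1) for c in combos]
mask = pd.concat(checks, axis=1).any(axis=1)
mask can then be used exactly as above.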
This is a little abstract, but you can use bit flags to represent each 'Yes': assign each antibiotic a binary value and then compute TREATED with an if statement.
https://dietertack.medium.com/using-bit-flags-in-c-d39ec6e30f08
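For instance, a rough sketch of that idea on the dataframe above (the flag values are arbitrary powers of two, chosen here for illustration):
FLAGS = {'AZITH': 1, 'CLIN': 2, 'CFTX': 4, 'METRO': 8, 'CFTN': 16, 'DOXY': 32}
# Each valid combination is the bitwise OR of its members' flags
VALID = [FLAGS['AZITH'] | FLAGS['CLIN'],
         FLAGS['AZITH'] | FLAGS['CFTX'] | FLAGS['CLIN'],
         FLAGS['AZITH'] | FLAGS['CFTX'] | FLAGS['METRO'],
         FLAGS['AZITH'] | FLAGS['CFTN'],
         FLAGS['CFTX'] | FLAGS['DOXY'] | FLAGS['METRO'],
         FLAGS['CFTN'] | FLAGS['DOXY'],
         FLAGS['DOXY'] | FLAGS['METRO']]

def treated(row):
    # Encode the row as the sum of its 'Yes' flags (distinct powers of two, so sum == OR)
    code = sum(flag for col, flag in FLAGS.items() if row[col] == 'Yes')
    # A combination is met when all of its flag bits are present in the row
    return 'Yes' if any((code & v) == v for v in VALID) else 'No'

df['TREATED'] = df.apply(treated, axis=1)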
You can use set operations: first aggregate each row into the set of medications given, then check whether it is a superset of any of the valid combinations:
valid_treatments = [{'AZITH', 'CLIN'}, {'AZITH', 'CFTX', 'CLIN'},
                    {'AZITH', 'CFTX', 'METRO'}, {'AZITH', 'CFTN'},
                    {'CFTX', 'DOXY', 'METRO'}, {'CFTN', 'DOXY'},
                    {'DOXY', 'METRO'}]

def is_valid(row):
    combination = set(df.columns[row])
    return 'Yes' if any(combination.issuperset(v) for v in valid_treatments) else 'No'
out = df.assign(TREATED=df.eq('Yes').apply(is_valid, axis=1))
Output:
AZITH CLIN CFTX METRO CFTN DOXY TREATED
0 Yes Yes No No No No Yes
1 No Yes No Yes No No No
2 Yes Yes No No No No Yes
3 No No No No No No No
4 Yes Yes Yes Yes Yes Yes Yes
5 No Yes Yes Yes No Yes Yes
6 No No No No No Yes No
7 No No No No No No No
8 Yes Yes Yes No No No Yes
9 Yes No Yes Yes No No Yes
10 Yes No No No Yes No Yes
11 No No Yes Yes No Yes Yes
12 No No No No Yes Yes Yes
13 No No No Yes No Yes Yes
This answer attempts to vectorize the solution. I got the idea for the superset from Mozway; before that I couldn't figure out how to handle all the combos.
import numpy as np
import pandas as pd
import itertools
df = pd.DataFrame({ "AZITH": ["Yes","No","Yes","No","Yes","No","No","No","Yes","Yes","Yes","No","No","No",],
"CLIN": ["Yes","Yes","Yes","No","Yes","Yes","No","No","Yes","No","No","No","No","No",],
"CFTX": ["No","No","No","No","Yes","Yes","No","No","Yes","Yes","No","Yes","No","No",],
"METRO": ["No","Yes","No","No","Yes","Yes","No","No","No","Yes","No","Yes","No","Yes",],
"CFTN": ["No","No","No","No","Yes","No","No","No","No","No","Yes","No","Yes","No",],
"DOXY": ["No","No","No","No","Yes","Yes","Yes","No","No","No","No","Yes","Yes","Yes",]})
combos = np.array([[1,1,0,0,0,0],[1,1,1,0,0,0],[1,0,1,1,0,0],[1,0,0,0,1,0],[0,0,1,1,0,1],[0,0,0,0,1,1],[0,0,0,1,0,1]])
df = df.replace("Yes",1)
df = df.replace("No",0)
c = []
for l in range(len(combos)):
    c.extend(itertools.combinations(range(len(combos)), l))
all_combos = combos
for combo in c[1:]:
    combined = np.sum(combos[combo, :], axis=0)
    all_combos = np.vstack([all_combos, combined])
all_combos[all_combos!=0]=1
all_combos = np.unique(all_combos,axis=0)
combo_sum = all_combos.sum(axis=1)
all_combos[all_combos==0]=-1
new_df = df.dot(all_combos.transpose())
for i, x in enumerate(combo_sum):
    new_df.loc[new_df[i] < x, i] = 0
new_df[new_df>0]=1
new_df["res"] = new_df.sum(axis=1)
new_df.loc[new_df.res>0,"res"] = True
new_df.loc[new_df.res==0,"res"] = False
df["res"] = new_df["res"]
AZITH CLIN CFTX METRO CFTN DOXY res
0 1 1 0 0 0 0 True
1 0 1 0 1 0 0 False
2 1 1 0 0 0 0 True
3 0 0 0 0 0 0 False
4 1 1 1 1 1 1 True
5 0 1 1 1 0 1 False
6 0 0 0 0 0 1 False
7 0 0 0 0 0 0 False
8 1 1 1 0 0 0 True
9 1 0 1 1 0 0 True
10 1 0 0 0 1 0 True
11 0 0 1 1 0 1 True
12 0 0 0 0 1 1 True
13 0 0 0 1 0 1 True
The general idea of the code is that I create a NumPy array of all acceptable combos, including combinations of combos (the sum of two or more combos). I delete duplicate combinations, and this is what is left:
np.unique(all_combos,axis=0)
Out[38]:
array([[0, 0, 0, 0, 1, 1],
[0, 0, 0, 1, 0, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 1, 1, 0, 1],
[0, 0, 1, 1, 1, 1],
[1, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 1, 1],
[1, 0, 0, 1, 1, 1],
[1, 0, 1, 1, 0, 0],
[1, 0, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 0],
[1, 0, 1, 1, 1, 1],
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 1, 1],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 1, 0],
[1, 1, 1, 0, 1, 1],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 0, 1],
[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1]])
Any extra medications that are not part of the combo are penalized by setting the value to -1 in the combo list. (If extra medications are not to be penalized then the superset is not needed, you can just compare to a sum of the original combos variable.)
A dot product between the dataset and the set of all combos is then done and the value is compared against the sum of the combo (prior to replacing the 0s with -1s). This means if the value is 3 and the expected outcome of the combo is 3, then it is a valid combo. Below is the sum of the combos as an array
ipdb> combo_sum
array([2, 2, 3, 3, 4, 2, 3, 4, 3, 4, 4, 5, 2, 3, 4, 4, 5, 3, 4, 5, 4, 5,
5, 6])
ipdb> new_df
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
0 -2 -2 -2 -2 -2 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2
1 -2 0 0 0 0 -2 -2 0 0 0 ... 0 2 2 0 0 0 2 2 2 2
2 -2 -2 -2 -2 -2 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 -2 -2 0 0 2 -2 0 2 0 2 ... 2 2 4 0 2 4 2 4 4 6
5 -2 0 0 2 2 -4 -2 0 0 2 ... 0 2 2 0 0 2 2 4 2 4
6 1 1 1 1 1 -1 1 1 -1 1 ... 1 1 1 -1 -1 1 -1 1 -1 1
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 -3 -3 -3 -1 -1 -1 -1 -1 1 1 ... 1 1 1 3 3 3 3 3 3 3
9 -3 -1 -1 1 1 -1 -1 1 3 3 ... -1 1 1 1 1 1 3 3 3 3
10 0 -2 0 -2 0 2 2 2 0 0 ... 2 0 2 0 2 2 0 0 2 2
11 -1 1 1 3 3 -3 -1 1 1 3 ... -1 1 1 -1 -1 1 1 3 1 3
12 2 0 2 0 2 0 2 2 -2 0 ... 2 0 2 -2 0 2 -2 0 0 2
13 0 2 2 2 2 -2 0 2 0 2 ... 0 2 2 -2 -2 0 0 2 0 2
After the dot product, we replace the valid values with 1 and invalid (less than the expected sum) with 0. We sum on all the combinations to see if any are valid. If the sum of combinations >= 1, then at least one combo was valid. Else, all were invalid.
ipdb> new_df
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 res
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 1
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 1
9 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 1
10 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
11 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
12 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
13 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
Finally, replace the summed column with True or False and assign it back to the original dataframe.

add missing values in pandas dataframe

I have measurements stored in a data frame that looks like the one below.
These are particulate matter (PM) measurements. Sensors measure four PM types (pm1, pm2.5, pm5, pm10), listed in the indicator column, under conditions x1..x56, and report the measurement in the area and count columns. The problem is that under some conditions (columns x1..x56) the sensors didn't catch all the PMs. I want every combination of conditions (x1..x56) to have all 4 PM values in the indicator column; if the sensor didn't catch one (there is no PM value for some combination of Xs), I should add it, with the area and count columns set to 0.
x1 x2 x3 x4 x5 x6 .. x56 indicator area count
0 0 0 0 0 0 .. 0 pm1 10 56
0 0 0 0 0 0 .. 0 pm10 9 1
0 0 0 0 0 0 .. 0 pm5 1 454
.............................................
1 0 0 0 0 0 .. 0 pm1 3 4
ssl ax w 45b g g .. gb pm1 3 4
1 wdf sw d78 b fd .. b pm1 3 4
In this example, for the first combination (all zeros), pm2.5 is missing, so I should add it with its area and count set to 0. The same applies to the second combination (the one that starts with 1). So my dummy example should look like this after I finish:
x1 x2 x3 x4 x5 x6 .. x56 indicator area count
0 0 0 0 0 0 .. 0 pm1 10 56
0 0 0 0 0 0 .. 0 pm10 9 1
0 0 0 0 0 0 .. 0 pm5 1 454
0 0 0 0 0 0 .. 0 pm2.5 0 0
.............................................
1 0 0 0 0 0 .. 0 pm1 3 4
1 0 0 0 0 0 .. 0 pm10 0 0
1 0 0 0 0 0 .. 0 pm5 0 0
1 0 0 0 0 0 .. 0 pm2.5 0 0
ssl ax w 45b g g .. gb pm1 3 4
ssl ax w 45b g g .. gb pm10 0 0
ssl ax w 45b g g .. gb pm5 0 0
ssl ax w 45b g g .. gb pm2.5 0 0
1 wdf sw d78 b fd .. b pm1 3 4
1 wdf sw d78 b fd .. b pm10 0 0
1 wdf sw d78 b fd .. b pm5 0 0
1 wdf sw d78 b fd .. b pm2.5 0 0
How can I do that? Thanks in advance!
The key here is to create a MultiIndex from all combinations of x and indicator then fill missing records.
Step 1.
Create a vector of x columns:
df['x'] = df.filter(regex=r'^x\d+').apply(tuple, axis=1)
print(df)
# Output:
x1 x2 x3 x4 x5 x6 x56 indicator area count x
0 0 0 0 0 0 0 0 pm1 10 56 (0, 0, 0, 0, 0, 0, 0)
1 0 0 0 0 0 0 0 pm10 9 1 (0, 0, 0, 0, 0, 0, 0)
2 0 0 0 0 0 0 0 pm5 1 454 (0, 0, 0, 0, 0, 0, 0)
3 1 0 0 0 0 0 0 pm1 3 4 (1, 0, 0, 0, 0, 0, 0)
Step 2.
Create the MultiIndex from the x vector and the indicator list, then reindex your dataframe.
mi = pd.MultiIndex.from_product([df['x'].unique(),
['pm1', 'pm2.5', 'pm5', 'pm10']],
names=['x', 'indicator'])
out = df.set_index(['x', 'indicator']).reindex(mi, fill_value=0)
print(out)
# Output:
x1 x2 x3 x4 x5 x6 x56 area count
x indicator
(0, 0, 0, 0, 0, 0, 0) pm1 0 0 0 0 0 0 0 10 56
pm2.5 0 0 0 0 0 0 0 0 0
pm5 0 0 0 0 0 0 0 1 454
pm10 0 0 0 0 0 0 0 9 1
(1, 0, 0, 0, 0, 0, 0) pm1 1 0 0 0 0 0 0 3 4
pm2.5 *0* 0 0 0 0 0 0 0 0
pm5 *0* 0 0 0 0 0 0 0 0
pm10 *0* 0 0 0 0 0 0 0 0
# Need to be fixed ----^
Step 3.
Group by the x index to rebuild the x columns from the index tuples (each group's name holds the original x1..x56 values), then join area and count back and restore the column order.
out = out.filter(regex=r'^x\d+').groupby(level='x') \
         .apply(lambda x: pd.Series(dict(zip(x.columns, x.name)))) \
         .join(out[['area', 'count']]).reset_index()[df.columns[:-1]]
print(out)
# Output:
x1 x2 x3 x4 x5 x6 x56 indicator area count
0 0 0 0 0 0 0 0 pm1 10 56
1 0 0 0 0 0 0 0 pm2.5 0 0
2 0 0 0 0 0 0 0 pm5 1 454
3 0 0 0 0 0 0 0 pm10 9 1
4 1 0 0 0 0 0 0 pm1 3 4
5 1 0 0 0 0 0 0 pm2.5 0 0
6 1 0 0 0 0 0 0 pm5 0 0
7 1 0 0 0 0 0 0 pm10 0 0
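An alternative sketch of the same idea, assuming pandas 1.2+ (where merge supports how='cross'): build the full grid of unique condition rows times the four indicators, then left-join the measurements and fill the gaps with 0.
# Select only the x1..x56 columns, skipping the helper 'x' column from Step 1
x_cols = [c for c in df.columns if c.startswith('x') and c[1:].isdigit()]
indicators = pd.DataFrame({'indicator': ['pm1', 'pm2.5', 'pm5', 'pm10']})
grid = df[x_cols].drop_duplicates().merge(indicators, how='cross')
out = (grid.merge(df[x_cols + ['indicator', 'area', 'count']],
                  on=x_cols + ['indicator'], how='left')
           .fillna({'area': 0, 'count': 0}))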

how to convert pandas dataframe to libsvm format?

I have pandas data frame like below.
df
Out[50]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 \
0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
1 0 1 1 1 0 0 1 1 1 1 ... 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
5 1 0 0 1 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
7 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
[8 rows x 100 columns]
I have target variable as an array as below.
[1, -1, -1, 1, 1, -1, 1, 1]
How can I map this target variable to the data frame and convert it into libsvm format?
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.index.map[(equi)]
d = df[np.setdiff1d(df.columns,['indx','labels'])]
e = df.label
dump_svmlight_file(d,e,'D:/result/smvlight2.dat')
ERROR:
File "D:/spyder/april.py", line 54, in <module>
df["labels"] = df.index.map[(equi)]
TypeError: 'method' object is not subscriptable
When I use
df["labels"] = df.index.list(map[(equi)])
ERROR:
AttributeError: 'RangeIndex' object has no attribute 'list'
Please help me to solve those errors.
I think you need to convert the index with to_series and then call map:
df["labels"] = df.index.to_series().map(equi)
Or use rename of index:
df["labels"] = df.rename(index=equi).index
All together (note that pandas has difference for taking the difference of columns):
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.rename(index=equi).index
e = df["labels"]
d = df[df.columns.difference(['indx','labels'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
Also, it seems the labels column is not necessary:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
e = df.rename(index=equi).index
d = df[df.columns.difference(['indx'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
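To sanity-check the written file, you can read it back with load_svmlight_file (a sketch; the path is the one from the example above):
from sklearn.datasets import load_svmlight_file

# Labels should round-trip; the feature count may come back smaller than 100
# if trailing all-zero columns were dropped by the sparse writer.
X, y = load_svmlight_file('C:/result/smvlight2.dat')
print(X.shape, y)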

Complex Excel Formula in Pandas

Excel Formulas I am trying to replicate in pandas:
(The original question links to an example workbook; columns D, E and F are the relevant ones.)
entsig and exsig are manual and can be changed. In real life they would be derived from the value of another column or a comparison of two other columns
ent = 1 if entsig previous = 1 and in = 0
in = 1 if ent previous = 1 or (in previous = 1 and ex = 0)
ex = 1 if exsig previous = 1 and in previous = 1
so at most one of ent, in, or ex will be 1 on any row, never more than one of them
import pandas as pd
df = pd.DataFrame(
[[0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,1,0,0,0], [0,1,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0],
[1,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0]],
columns=['entsig', 'exsig','ent', 'in', 'ex'])
for i in df.index:
    df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)] = 1
    df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)] = 1
    df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))] = 1
for j in df.index:
    df['ent'][df['in'] == 1] = 0
    df['in'][df['ex']==1] = 0
    df['ex'][df['ex'].shift(1)==1] = 0
df
results in
entsig exsig ent in ex
0 0 0 0 0 0
1 1 0 0 0 0
2 1 0 1 0 0
3 1 0 0 1 0
4 0 0 0 1 0
5 0 1 0 1 0
6 0 1 0 0 1
7 1 0 0 0 0
8 1 0 1 0 0
9 0 0 0 1 0
10 0 0 0 1 0
11 0 0 0 1 0
12 0 1 0 1 0
13 0 1 0 0 1
14 0 1 0 0 0
15 0 0 0 0 0
16 0 0 0 0 0
17 1 0 0 0 0
18 1 0 1 0 0
19 1 0 0 1 0
20 1 1 0 1 0
21 0 1 0 0 1
22 0 1 0 0 0
23 0 1 0 0 0
Question
How can I make this code faster? It runs slowly because of the loops, but I have not been able to come up with a solution that avoids them. Any ideas or comments are appreciated.
If we can assume every group of 1's in entsig is followed by at least one 1 in
exsig, then you could compute ent, ex and in like this:
def ent_in_ex(df):
    entsig_mask = (df['entsig'].diff().shift(1) == 1)
    exsig_mask = (df['exsig'].diff().shift(1) == 1)
    df.loc[entsig_mask, 'ent'] = 1
    df.loc[exsig_mask, 'ex'] = 1
    df['in'] = df['ent'].shift(1).cumsum().subtract(df['ex'].cumsum(), fill_value=0)
    return df
If we can make this assumption, then ent_in_ex is significantly faster:
In [5]: %timeit orig(df)
10 loops, best of 3: 185 ms per loop
In [6]: %timeit ent_in_ex(df)
100 loops, best of 3: 2.23 ms per loop
In [95]: orig(df).equals(ent_in_ex(df))
Out[95]: True
where orig is the original code:
def orig(df):
    for i in df.index:
        df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)] = 1
        df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)] = 1
        df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))] = 1
    for j in df.index:
        df['ent'][df['in'] == 1] = 0
        df['in'][df['ex']==1] = 0
        df['ex'][df['ex'].shift(1)==1] = 0
    return df

SKlearn metrics fails with expected y object and predicted y object

In scikit-learn I have created a few models with train and test data.
The models work fine, but when I try to compute any accuracy metrics, it fails. I assume something is wrong with either my prediction object (pred y) or expected object (true y).
For this test, I have looked at the pred y. It is an object and has 119 0/1 values.
The true y is also an object and has 119 0/1 values.
My code and the error are below, as well as a comparison of the two objects. It is the error I do not understand.
"expected" is my true y and "target_predicted" is the predicted y.
I have tried other metrics and other models; it always fails at this stage.
Any assistance?
# Basic decision tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(bank_train, bank_train_target)
print clf
DecisionTreeClassifier(compute_importances=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_density=None, min_samples_leaf=1, min_samples_split=2,
random_state=None, splitter='best')
#test model using test data
target_predicted = clf.predict(bank_test)
accuracy_score(expected,target_predicted)
#error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-23d1a990a192> in <module>()
1 #test model using test data
2 target_predicted = clf.predict(bank_test)
----> 3 accuracy_score(expected,target_predicted)
/Users/mpgartland1/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in accuracy_score(y_true, y_pred, normalize, sample_weight)
1295
1296 # Compute accuracy for each possible representation
-> 1297 y_type, y_true, y_pred = _check_clf_targets(y_true, y_pred)
1298 if y_type == 'multilabel-indicator':
1299 score = (y_pred != y_true).sum(axis=1) == 0
/Users/mpgartland1/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in _check_clf_targets(y_true, y_pred)
125 if (y_type not in ["binary", "multiclass", "multilabel-indicator",
126 "multilabel-sequences"]):
--> 127 raise ValueError("{0} is not supported".format(y_type))
128
129 if y_type in ["binary", "multiclass"]:
ValueError: unknown is not supported
Here is a comparison of the two objects.
print target_predicted.size
print expected.size
print target_predicted.dtype
print expected.dtype
print target_predicted
print expected
119
119
object
object
[1 0 0 1 0 0 1 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 1
0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0
0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1
0 1 0 1 0 0 0 1]
[1 0 0 1 0 0 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1
0 0 1 0 0 0 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 0 0 0 0
0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1
0 1 0 0 0 1 0 1]
It also fails when I try a confusion matrix or another metric, using very cookie-cutter code. So my guess is the problem lies in the object(s).
Thanks
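A common cause of this error is that the label arrays have dtype object, so scikit-learn reports the target type as 'unknown'. A minimal sketch of the usual remedy, assuming both arrays really hold 0/1 values (an assumption; no fix appears in the thread itself), is to cast them to an integer dtype before scoring:
import numpy as np
from sklearn.metrics import accuracy_score

# Object-dtype 0/1 labels are reported as 'unknown'; a concrete integer
# dtype lets sklearn detect a binary target.
expected = np.asarray(expected).astype(int)
target_predicted = np.asarray(target_predicted).astype(int)
print(accuracy_score(expected, target_predicted))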