How to populate a column with a value if a condition is met across 2+ other columns - pandas

My dataframe is similar to the table below. I have 6 columns, each with 'Yes' or 'No' if a specific antibiotic was given.
AZITH  CLIN  CFTX  METRO  CFTN  DOXY  TREATED
Yes    Yes   No    No     No    No
No     Yes   No    Yes    No    No
Yes    Yes   No    No     No    No
No     No    No    No     No    No
Yes    Yes   Yes   Yes    Yes   Yes
No     Yes   Yes   Yes    No    Yes
No     No    No    No     No    Yes
No     No    No    No     No    No
Yes    Yes   Yes   No     No    No
Yes    No    Yes   Yes    No    No
Yes    No    No    No     Yes   No
No     No    Yes   Yes    No    Yes
No     No    No    No     Yes   Yes
No     No    No    Yes    No    Yes
I want to fill the column 'TREATED' with True if specific combinations of antibiotic columns contain 'Yes'. If the conditions aren't met, I would like to fill the 'TREATED' column with False.
If ['AZITH'] & ['CLIN'] == 'Yes' |
['AZITH'] & ['CFTX'] & ['CLIN'] == 'Yes' |
['AZITH'] & ['CFTX'] & ['METRO'] == 'Yes' |
['AZITH'] & ['CFTN'] == 'Yes' |
['CFTX'] & ['DOXY'] & ['METRO'] == 'Yes' |
['CFTN'] & ['DOXY'] == 'Yes' |
['DOXY'] & ['METRO'] == 'Yes',
then return True in column 'TREATED', else False.
What I had in mind was some sort of if statement or a lambda function, but I am having trouble.
The check must not be exclusive to the combinations above; it should also cover supersets, for example when all 6 medications were given. In that case True should be returned, because at least one valid combination of treatment medications has been given.
The desired output is below:
AZITH  CLIN  CFTX  METRO  CFTN  DOXY  TREATED
Yes    Yes   No    No     No    No    Yes
No     Yes   No    Yes    No    No    No
Yes    Yes   No    No     No    No    Yes
No     No    No    No     No    No    No
Yes    Yes   Yes   Yes    Yes   Yes   Yes
No     Yes   Yes   Yes    No    Yes   Yes
No     No    No    No     No    Yes   No
No     No    No    No     No    No    No
Yes    Yes   Yes   No     No    No    Yes
Yes    No    Yes   Yes    No    No    Yes
Yes    No    No    No     Yes   No    Yes
No     No    Yes   Yes    No    Yes   Yes
No     No    No    No     Yes   Yes   Yes
No     No    No    Yes    No    Yes   Yes

With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
    {
        "AZITH": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No"],
        "CLIN": ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"],
        "CFTX": ["No", "No", "No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No"],
        "METRO": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes"],
        "CFTN": ["No", "No", "No", "No", "Yes", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No"],
        "DOXY": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "Yes"],
    }
)
Here is one way to do it:
mask = (
((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes") & (df["CFTX"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CFTX"] == "Yes") & (df["METRO"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CFTN"] == "Yes"))
| ((df["CFTX"] == "Yes") & (df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
| ((df["CFTN"] == "Yes") & (df["DOXY"] == "Yes"))
| ((df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
)
df.loc[mask, "TREATED"] = "Yes"
df = df.fillna("No")
Then:
print(df)
# Output
AZITH CLIN CFTX METRO CFTN DOXY TREATED
0 Yes Yes No No No No Yes
1 No Yes No Yes No No No
2 Yes Yes No No No No Yes
3 No No No No No No No
4 Yes Yes Yes Yes Yes Yes Yes
5 No Yes Yes Yes No Yes Yes
6 No No No No No Yes No
7 No No No No No No No
8 Yes Yes Yes No No No Yes
9 Yes No Yes Yes No No Yes
10 Yes No No No Yes No Yes
11 No No Yes Yes No Yes Yes
12 No No No No Yes Yes Yes
13 No No No Yes No Yes Yes
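The question actually asks for True/False; since mask is already boolean, it can be assigned directly instead of the 'Yes'/'No' strings. A minimal sketch with just two of the columns:

```python
import pandas as pd

df = pd.DataFrame({"AZITH": ["Yes", "No"], "CLIN": ["Yes", "No"]})

# assigning the boolean mask itself yields True/False values
mask = (df["AZITH"] == "Yes") & (df["CLIN"] == "Yes")
df["TREATED"] = mask
```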

This is a little abstract, but you can use bit flags: represent each 'Yes' (True) with a binary value, one bit per antibiotic, and then compute 'TREATED' with bitwise math and an if statement.
https://dietertack.medium.com/using-bit-flags-in-c-d39ec6e30f08
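Sketched concretely (a minimal, hypothetical two-row frame; drug names follow the question): each antibiotic gets one bit, each row becomes an integer bitmask, and a row counts as treated when it contains every bit of at least one valid combo:

```python
import pandas as pd

df = pd.DataFrame({
    "AZITH": ["Yes", "No"], "CLIN": ["Yes", "Yes"], "CFTX": ["No", "No"],
    "METRO": ["No", "Yes"], "CFTN": ["No", "No"], "DOXY": ["No", "No"],
})

drugs = ["AZITH", "CLIN", "CFTX", "METRO", "CFTN", "DOXY"]
bit = {d: 1 << i for i, d in enumerate(drugs)}  # AZITH=1, CLIN=2, CFTX=4, ...

# each valid combination encoded as an integer bitmask
combos = [
    bit["AZITH"] | bit["CLIN"],
    bit["AZITH"] | bit["CFTX"] | bit["CLIN"],
    bit["AZITH"] | bit["CFTX"] | bit["METRO"],
    bit["AZITH"] | bit["CFTN"],
    bit["CFTX"] | bit["DOXY"] | bit["METRO"],
    bit["CFTN"] | bit["DOXY"],
    bit["DOXY"] | bit["METRO"],
]

# each row encoded as the bitwise OR of the drugs given
row_mask = sum((df[d] == "Yes") * bit[d] for d in drugs)

# treated if the row contains every bit of at least one combo (superset test)
df["TREATED"] = ["Yes" if any((m & c) == c for c in combos) else "No"
                 for m in row_mask]
```

This handles supersets for free: extra drugs only add bits, and `(m & c) == c` still holds.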

You can use set operations, first aggregate as the sets of given medications, then check over all possible combinations if you have a superset:
valid_treatments = [
    {'AZITH', 'CLIN'}, {'AZITH', 'CFTX', 'CLIN'},
    {'AZITH', 'CFTX', 'METRO'}, {'AZITH', 'CFTN'},
    {'CFTX', 'DOXY', 'METRO'}, {'CFTN', 'DOXY'},
    {'DOXY', 'METRO'},
]

def is_valid(row):
    combination = set(df.columns[row])
    return 'Yes' if any(combination.issuperset(v) for v in valid_treatments) else 'No'
out = df.assign(TREATED=df.eq('Yes').apply(is_valid, axis=1))
Output:
AZITH CLIN CFTX METRO CFTN DOXY TREATED
0 Yes Yes No No No No Yes
1 No Yes No Yes No No No
2 Yes Yes No No No No Yes
3 No No No No No No No
4 Yes Yes Yes Yes Yes Yes Yes
5 No Yes Yes Yes No Yes Yes
6 No No No No No Yes No
7 No No No No No No No
8 Yes Yes Yes No No No Yes
9 Yes No Yes Yes No No Yes
10 Yes No No No Yes No Yes
11 No No Yes Yes No Yes Yes
12 No No No No Yes Yes Yes
13 No No No Yes No Yes Yes

My answer attempts to vectorize this solution. I got the idea of using supersets from Mozway; before that I couldn't figure out how to handle all the combinations.
import numpy as np
import pandas as pd
import itertools
df = pd.DataFrame({ "AZITH": ["Yes","No","Yes","No","Yes","No","No","No","Yes","Yes","Yes","No","No","No",],
"CLIN": ["Yes","Yes","Yes","No","Yes","Yes","No","No","Yes","No","No","No","No","No",],
"CFTX": ["No","No","No","No","Yes","Yes","No","No","Yes","Yes","No","Yes","No","No",],
"METRO": ["No","Yes","No","No","Yes","Yes","No","No","No","Yes","No","Yes","No","Yes",],
"CFTN": ["No","No","No","No","Yes","No","No","No","No","No","Yes","No","Yes","No",],
"DOXY": ["No","No","No","No","Yes","Yes","Yes","No","No","No","No","Yes","Yes","Yes",]})
combos = np.array([[1,1,0,0,0,0],[1,1,1,0,0,0],[1,0,1,1,0,0],[1,0,0,0,1,0],[0,0,1,1,0,1],[0,0,0,0,1,1],[0,0,0,1,0,1]])
df = df.replace("Yes",1)
df = df.replace("No",0)
c = []
for l in range(len(combos)):
    c.extend(itertools.combinations(range(len(combos)), l))
all_combos = combos
for combo in c[1:]:
    combined = np.sum(combos[combo, :], axis=0)
    all_combos = np.vstack([all_combos, combined])
all_combos[all_combos!=0]=1
all_combos = np.unique(all_combos,axis=0)
combo_sum = all_combos.sum(axis=1)
all_combos[all_combos==0]=-1
new_df = df.dot(all_combos.transpose())
for i, x in enumerate(combo_sum):
    new_df.loc[new_df[i] < x, i] = 0
new_df[new_df>0]=1
new_df["res"] = new_df.sum(axis=1)
new_df.loc[new_df.res>0,"res"] = True
new_df.loc[new_df.res==0,"res"] = False
df["res"] = new_df["res"]
AZITH CLIN CFTX METRO CFTN DOXY res
0 1 1 0 0 0 0 True
1 0 1 0 1 0 0 False
2 1 1 0 0 0 0 True
3 0 0 0 0 0 0 False
4 1 1 1 1 1 1 True
5 0 1 1 1 0 1 False
6 0 0 0 0 0 1 False
7 0 0 0 0 0 0 False
8 1 1 1 0 0 0 True
9 1 0 1 1 0 0 True
10 1 0 0 0 1 0 True
11 0 0 1 1 0 1 True
12 0 0 0 0 1 1 True
13 0 0 0 1 0 1 True
The general idea of the code: I create a numpy array of all acceptable combos, including combinations of combos (the sum of two or more combos). I then delete duplicate combinations, and this is what is left:
np.unique(all_combos,axis=0)
Out[38]:
array([[0, 0, 0, 0, 1, 1],
[0, 0, 0, 1, 0, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 1, 1, 0, 1],
[0, 0, 1, 1, 1, 1],
[1, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 1, 1],
[1, 0, 0, 1, 1, 1],
[1, 0, 1, 1, 0, 0],
[1, 0, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 0],
[1, 0, 1, 1, 1, 1],
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 1, 1],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 1, 0],
[1, 1, 1, 0, 1, 1],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 0, 1],
[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1]])
Any extra medications that are not part of a combo are penalized by setting the corresponding value to -1 in the combo list. (If extra medications should not be penalized, the superset logic is unnecessary; you can just compare against the sum of the original combos variable.)
A dot product between the dataset and the set of all combos is then computed, and each value is compared against the sum of the combo (taken before replacing the 0s with -1s). If the value is 3 and the expected sum for that combo is 3, it is a valid combo. Below is the sum of the combos as an array:
ipdb> combo_sum
array([2, 2, 3, 3, 4, 2, 3, 4, 3, 4, 4, 5, 2, 3, 4, 4, 5, 3, 4, 5, 4, 5,
5, 6])
ipdb> new_df
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
0 -2 -2 -2 -2 -2 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2
1 -2 0 0 0 0 -2 -2 0 0 0 ... 0 2 2 0 0 0 2 2 2 2
2 -2 -2 -2 -2 -2 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 -2 -2 0 0 2 -2 0 2 0 2 ... 2 2 4 0 2 4 2 4 4 6
5 -2 0 0 2 2 -4 -2 0 0 2 ... 0 2 2 0 0 2 2 4 2 4
6 1 1 1 1 1 -1 1 1 -1 1 ... 1 1 1 -1 -1 1 -1 1 -1 1
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 -3 -3 -3 -1 -1 -1 -1 -1 1 1 ... 1 1 1 3 3 3 3 3 3 3
9 -3 -1 -1 1 1 -1 -1 1 3 3 ... -1 1 1 1 1 1 3 3 3 3
10 0 -2 0 -2 0 2 2 2 0 0 ... 2 0 2 0 2 2 0 0 2 2
11 -1 1 1 3 3 -3 -1 1 1 3 ... -1 1 1 -1 -1 1 1 3 1 3
12 2 0 2 0 2 0 2 2 -2 0 ... 2 0 2 -2 0 2 -2 0 0 2
13 0 2 2 2 2 -2 0 2 0 2 ... 0 2 2 -2 -2 0 0 2 0 2
After the dot product, we replace valid values with 1 and invalid values (less than the expected sum) with 0. We then sum across all combinations to see if any are valid: if the sum is >= 1, at least one combo was valid; otherwise none were.
ipdb> new_df
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 res
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 1
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 1
9 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 1
10 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
11 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
12 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
13 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
Replace the final summed column with True or False and assign it back to the original dataframe.

Related

generate date feature column using pandas

I have a timeseries data frame that has columns like these:
Date temp_data holiday day
01.01.2000 10000 0 1
02.01.2000 0 1 2
03.01.2000 2000 0 3
..
..
..
26.01.2000 200 0 26
27.01.2000 0 1 27
28.01.2000 500 0 28
29.01.2000 0 1 29
30.01.2000 200 0 30
31.01.2000 0 1 31
01.02.2000 0 1 1
02.02.2000 2500 0 2
Here, holiday = 0 when data is present, indicating a working day, and holiday = 1 when no data is present, indicating a non-working day.
I am trying to extract three new columns from this data: second_last_working_day_of_month, third_last_working_day_of_month, and fourth_last_working_day_of_month.
the output data frame should look like this
Date temp_data holiday day secondlast_wd thirdlast_wd fouthlast_wd
01.01.2000 10000 0 1 1 0 0
02.01.2000 0 1 2 0 0 0
03.01.2000 2000 0 3 0 0 0
..
..
25.01.2000 345 0 25 0 0 1
26.01.2000 200 0 26 0 1 0
27.01.2000 0 1 27 0 0 0
28.01.2000 500 0 28 1 0 0
29.01.2000 0 1 29 0 0 0
30.01.2000 200 0 30 0 0 0
31.01.2000 0 1 31 0 0 0
01.02.2000 0 1 1 0 0 0
02.02.2000 2500 0 2 0 0 0
Can anyone help me with this?
Example
data = [['26.01.2000', 200, 0, 26], ['27.01.2000', 0, 1, 27], ['28.01.2000', 500, 0, 28],
['29.01.2000', 0, 1, 29], ['30.01.2000', 200, 0, 30], ['31.01.2000', 0, 1, 31],
['26.02.2000', 200, 0, 26], ['27.02.2000', 0, 0, 27], ['28.02.2000', 500, 0, 28],['29.02.2000', 0, 1, 29]]
df = pd.DataFrame(data, columns=['Date', 'temp_data', 'holiday', 'day'])
df
Date temp_data holiday day
0 26.01.2000 200 0 26
1 27.01.2000 0 1 27
2 28.01.2000 500 0 28
3 29.01.2000 0 1 29
4 30.01.2000 200 0 30
5 31.01.2000 0 1 31
6 26.02.2000 200 0 26
7 27.02.2000 0 0 27
8 28.02.2000 500 0 28
9 29.02.2000 0 1 29
Code
For example, to make the secondlast_wd column (n=2):
n = 2
s = pd.to_datetime(df['Date'])
result = df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(n)
result
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
Make result into the secondlast_wd column:
df.assign(secondlast_wd=result.astype('int'))
output:
Date temp_data holiday day secondlast_wd
0 26.01.2000 200 0 26 0
1 27.01.2000 0 1 27 0
2 28.01.2000 500 0 28 1
3 29.01.2000 0 1 29 0
4 30.01.2000 200 0 30 0
5 31.01.2000 0 1 31 0
6 26.02.2000 200 0 26 0
7 27.02.2000 0 0 27 1
8 28.02.2000 500 0 28 0
9 29.02.2000 0 1 29 0
You can change n to get the third-last, fourth-last working day, and so on.
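The recipe above generalizes to a small helper; a sketch assuming, as in the answer, that months don't repeat across years (for multi-year data, group by year and month together):

```python
import pandas as pd

def nth_last_working_day(df, n):
    """1 for the n-th last working day (holiday == 0) of each month, else 0."""
    s = pd.to_datetime(df['Date'], format='%d.%m.%Y')
    # walk each month backwards, counting working days from the end
    rank_from_end = df['holiday'].eq(0)[::-1].groupby(s.dt.month).cumsum()
    return (df['holiday'].eq(0) & rank_from_end.eq(n)).astype(int)

df = pd.DataFrame({
    'Date': ['26.01.2000', '27.01.2000', '28.01.2000',
             '29.01.2000', '30.01.2000', '31.01.2000'],
    'temp_data': [200, 0, 500, 0, 200, 0],
    'holiday': [0, 1, 0, 1, 0, 1],
    'day': [26, 27, 28, 29, 30, 31],
})
for n, col in [(2, 'secondlast_wd'), (3, 'thirdlast_wd')]:
    df[col] = nth_last_working_day(df, n)
```

Here the working days are the 26th, 28th and 30th, so the second-last is the 28th and the third-last is the 26th.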
Update for comment
Check working days (reversed index):
df.iloc[::-1, 2].eq(0)  # 2 is the column position of 'holiday'; df.loc[::-1, "holiday"] also works
9 False
8 True
7 True
6 True
5 False
4 True
3 False
2 True
1 False
0 True
Name: holiday, dtype: bool
Reverse cumulative sum grouped by month: in reversed index order, each working day adds 1 to the value above it, while each holiday keeps the same value as the row above.
df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum()
9 0
8 1
7 2
6 3
5 0
4 1
3 1
2 2
1 2
0 3
Name: holiday, dtype: int64
Find rows where holiday == 0 and the cumulative count equals 2; that is the second-last working day:
df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(2)
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
This operation returns the index in its original order (not reversed).
Other Way
A more understandable code would be:
s = pd.to_datetime(df['Date'])
idx1 = df[df['holiday'].eq(0)].groupby(s.dt.month, as_index=False).nth(-2).index
df.loc[idx1, 'lastsecondary_wd'] = 1
df['lastsecondary_wd'] = df['lastsecondary_wd'].fillna(0).astype('int')
same result

Broadcasting multi-dimensional array indices of the same shape

I have a mask array which represents a 2-dimensional binary image. Let's say it's simply:
mask = np.zeros((9, 9), dtype=np.uint8)
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
Suppose I want to flip the elements in the middle left ninth:
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 1 1 1 | 0 0 0 | 0 0 0
# 1 1 1 | 0 0 0 | 0 0 0
# 1 1 1 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
My incorrect approach was something like this:
x = np.arange(mask.shape[0])
y = np.arange(mask.shape[1])
mask[np.logical_and(y >= 3, y < 6), x < 3] = 1
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 1 0 0 | 0 0 0 | 0 0 0
# 0 1 0 | 0 0 0 | 0 0 0
# 0 0 1 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
(This is a simplification of the constraints I'm really dealing with, which would not be easily expressed as something like mask[:3,3:6] = 1 as in this case. Consider the constraints arbitrary, like x % 2 == 0 && y % 3 == 0 if you will.)
Numpy's behavior when the two index arrays are the same shape is to take them pairwise, which ends up only selecting the 3 elements above, rather than 9 I would like.
How would I update the right elements with constraints that apply to different axes? Given that the constraints are independent, can I do this by only evaluating my constraints N+M times, rather than N*M?
You can't broadcast the boolean arrays, but you can construct the equivalent numeric indices with ix_:
In [330]: np.ix_((y>=3)&(y<6), x<3)
Out[330]:
(array([[3],
        [4],
        [5]]),
 array([[0, 1, 2]]))
Applying it:
In [331]: arr = np.zeros((9,9),int)
In [332]: arr[_330] = 1
In [333]: arr
Out[333]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]])
Attempting to broadcast the booleans directly raises an error (too many indices):
arr[((y>=3)&(y<6))[:,None], x<3]
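As an alternative sketch: the two 1-D conditions can be combined into a single 2-D boolean mask with an outer AND. Each constraint is still evaluated only once per axis (N+M evaluations) before the combine:

```python
import numpy as np

mask = np.zeros((9, 9), dtype=np.uint8)
y = np.arange(mask.shape[0])
x = np.arange(mask.shape[1])

# (9,1) & (9,) broadcasts to a single (9,9) boolean index
region = ((y >= 3) & (y < 6))[:, None] & (x < 3)
mask[region] = 1
```

A full-shape 2-D boolean is a valid index, unlike the pair of mismatched boolean arrays above.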
Per your comment, let's try this fancier example:
mask = np.zeros((90,90), dtype=np.uint8)
# criteria
def f(x,y): return ((x-20)**2 < 50) & ((y-20)**2 < 50)
# ranges
x,y = np.arange(90), np.arange(90)
# meshgrid
xx,yy = np.meshgrid(x,y)
zz = f(xx,yy)
# mask
mask[zz] = 1
import matplotlib.pyplot as plt
plt.imshow(mask, cmap='gray')
Output: (image of the mask: a white square of ones centred at (20, 20))

Get_dummies produces more columns than its supposed to

I'm using get_dummies on a column of data that contains zeroes, 'D', or 'E'. Instead of producing 2 columns it produces 5: C, D, E, N, O. I'm not sure what they are or how to make it produce just the columns it's supposed to.
When I just pull that column it shows only 0s, 'D', and 'E', but get_dummies adds extra columns:
data[[2]]
0
0
D
0
0
0
0
D
0
0
When I do this:
dummy = pd.get_dummies(data[2], dummy_na = False)
dummy.head()
I get
0 C D E N O PreferredContactTime
0 0 0 0 0 0 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
0 0 1 0 0 0 0
1 0 0 0 0 0 0
What are C , N and O? I don't understand what it is displaying at all.
Setup
dtype = pd.CategoricalDtype([0, 'C', 'D', 'E', 'N', 'O', 'PreferredContactTime'])
data = pd.DataFrame({2: [
'PreferredContactTime', 0, 0, 'D', 0, 0, 0, 0, 'D', 0, 0
]}).astype(dtype)
Your result
dummy = pd.get_dummies(data[2], dummy_na=False )
dummy.head()
0 C D E N O PreferredContactTime
0 0 0 0 0 0 0 1
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 0 0 1 0 0 0 0
4 1 0 0 0 0 0 0
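Following the setup above, the extra columns come from unused categories of a categorical dtype (plus a stray header row read in as data): get_dummies emits one column per category whether or not it is observed. A sketch showing the effect and `remove_unused_categories` as a fix:

```python
import pandas as pd

dtype = pd.CategoricalDtype([0, 'C', 'D', 'E', 'N', 'O'])
s = pd.Series([0, 0, 'D', 0, 0, 'E'], dtype=dtype)

# one dummy column per category, even unobserved ones: 0, C, D, E, N, O
all_cats = pd.get_dummies(s)

# dropping unused categories keeps only the observed values: 0, D, E
observed = pd.get_dummies(s.cat.remove_unused_categories())
```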

how to convert pandas dataframe to libsvm format?

I have pandas data frame like below.
df
Out[50]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 \
0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
1 0 1 1 1 0 0 1 1 1 1 ... 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
5 1 0 0 1 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
7 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
[8 rows x 100 columns]
I have target variable as an array as below.
[1, -1, -1, 1, 1, -1, 1, 1]
How can I map this target variable onto the data frame and convert it into libsvm format?
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.index.map[(equi)]
d = df[np.setdiff1d(df.columns,['indx','labels'])]
e = df.label
dump_svmlight_file(d,e,'D:/result/smvlight2.dat')
ERROR:
File "D:/spyder/april.py", line 54, in <module>
df["labels"] = df.index.map[(equi)]
TypeError: 'method' object is not subscriptable
When I use
df["labels"] = df.index.list(map[(equi)])
ERROR:
AttributeError: 'RangeIndex' object has no attribute 'list'
Please help me to solve those errors.
I think you need to convert the index with to_series and then call map:
df["labels"] = df.index.to_series().map(equi)
Or use rename of index:
df["labels"] = df.rename(index=equi).index
All together; for the difference of columns, pandas has difference:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.rename(index=equi).index
e = df["labels"]
d = df[df.columns.difference(['indx','labels'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
It also seems the label column is not necessary:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
e = df.rename(index=equi).index
d = df[df.columns.difference(['indx'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
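A tiny self-contained sketch (hypothetical 3-row frame) showing that both label-building approaches produce the same vector:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 0, 1]})
equi = {0: 1, 1: -1, 2: -1}

# map the RangeIndex through the dict...
labels_map = df.index.to_series().map(equi)
# ...or rename the index and read it back
labels_rename = df.rename(index=equi).index
```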

copy_blanks(df,column) should copy the value in the original column to the last column for all values where the original column is blank

def copy_blanks(df, column):
Something like this. Please suggest how to implement it.
Input:
e-mail,number
n#gmail.com,0
p#gmail.com,1
h#gmail.com,0
s#gmail.com,0
l#gmail.com,1
v#gmail.com,0
,0
But there is also a default_value option, which can be any value; when it is used, that value is filled in instead, like below:
e-mail,number
n#gmail.com,0
p#gmail.com,1
h#gmail.com,0
s#gmail.com,0
l#gmail.com,1
v#gmail.com,0
NA,0
My requirement is to have both default_value and skip_blanks options: when skip_blanks is True, the default value should not be applied; when skip_blanks is False, it should be.
my output:
e-mail,number,e-mail_clean
n#gmail.com,0,n#gmail.com
p#gmail.com,1,p#gmail.com
h#gmail.com,0,h#gmail.com
s#gmail.com,0,s#gmail.com
l#gmail.com,1,l#gmail.com
v#gmail.com,0,v#gmail.com
,0,
consider your sample df
df = pd.DataFrame([
['n#gmail.com', 0],
['p#gmail.com', 1],
['h#gmail.com', 0],
['s#gmail.com', 0],
['l#gmail.com', 1],
['v#gmail.com', 0],
['', 0]
], columns=['e-mail','number'])
print(df)
e-mail number
0 n#gmail.com 0
1 p#gmail.com 1
2 h#gmail.com 0
3 s#gmail.com 0
4 l#gmail.com 1
5 v#gmail.com 0
6 0
If I understand you correctly:
def copy_blanks(df, column, skip_blanks=False, default_value='NA'):
    df = df.copy()
    s = df[column]
    if not skip_blanks:
        s = s.replace('', default_value)
    df['{}_clean'.format(column)] = s
    return df
copy_blanks(df, 'e-mail', skip_blanks=False)
e-mail number e-mail_clean
0 n#gmail.com 0 n#gmail.com
1 p#gmail.com 1 p#gmail.com
2 h#gmail.com 0 h#gmail.com
3 s#gmail.com 0 s#gmail.com
4 l#gmail.com 1 l#gmail.com
5 v#gmail.com 0 v#gmail.com
6 0 NA
copy_blanks(df, 'e-mail', skip_blanks=True)
e-mail number e-mail_clean
0 n#gmail.com 0 n#gmail.com
1 p#gmail.com 1 p#gmail.com
2 h#gmail.com 0 h#gmail.com
3 s#gmail.com 0 s#gmail.com
4 l#gmail.com 1 l#gmail.com
5 v#gmail.com 0 v#gmail.com
6 0
copy_blanks(df, 'e-mail', skip_blanks=False, default_value='new#gmail.com')
e-mail number e-mail_clean
0 n#gmail.com 0 n#gmail.com
1 p#gmail.com 1 p#gmail.com
2 h#gmail.com 0 h#gmail.com
3 s#gmail.com 0 s#gmail.com
4 l#gmail.com 1 l#gmail.com
5 v#gmail.com 0 v#gmail.com
6 0 new#gmail.com