Broadcasting multi-dimensional array indices of the same shape - numpy

I have a mask array which represents a 2-dimensional binary image. Let's say it's simply:
mask = np.zeros((9, 9), dtype=np.uint8)
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
Suppose I want to flip the elements in the middle left ninth:
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 1 1 1 | 0 0 0 | 0 0 0
# 1 1 1 | 0 0 0 | 0 0 0
# 1 1 1 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
My incorrect approach was something like this:
x = np.arange(mask.shape[0])
y = np.arange(mask.shape[1])
mask[np.logical_and(y >= 3, y < 6), x < 3] = 1
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 1 0 0 | 0 0 0 | 0 0 0
# 0 1 0 | 0 0 0 | 0 0 0
# 0 0 1 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
(This is a simplification of the constraints I'm really dealing with, which would not be easily expressed as something like mask[:3,3:6] = 1 as in this case. Consider the constraints arbitrary, like x % 2 == 0 && y % 3 == 0 if you will.)
NumPy's behavior when the two index arrays have the same shape is to pair them element-wise, which selects only the 3 elements above rather than the 9 I would like.
How would I update the right elements with constraints that apply to different axes? Given that the constraints are independent, can I do this by only evaluating my constraints N+M times, rather than N*M?

You can't broadcast the boolean arrays, but you can construct the equivalent numeric indices with ix_:
In [330]: np.ix_((y>=3)&(y<6), x<3)
Out[330]:
(array([[3],
        [4],
        [5]]), array([[0, 1, 2]]))
Applying it:
In [331]: arr = np.zeros((9,9),int)
In [332]: arr[_330] = 1
In [333]: arr
Out[333]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])
Attempting to broadcast the booleans directly raises an error (too many indices):
arr[((y>=3)&(y<6))[:,None], x<3]
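For the arbitrary per-axis constraints mentioned in the question, the same idea could be sketched as below. Each constraint is evaluated once on a 1-D array (N+M evaluations in total), and np.ix_ combines the two boolean vectors. The names row_ok and col_ok and the modulo conditions are illustrative assumptions, not part of the original code.

```python
import numpy as np

mask = np.zeros((9, 9), dtype=np.uint8)
rows = np.arange(mask.shape[0])
cols = np.arange(mask.shape[1])

# Illustrative per-axis constraints, each evaluated on a 1-D array only.
row_ok = rows % 3 == 0      # rows 0, 3, 6
col_ok = cols % 2 == 0      # columns 0, 2, 4, 6, 8

# np.ix_ accepts boolean vectors and builds an open mesh of integer indices.
mask[np.ix_(row_ok, col_ok)] = 1

# Equivalent: broadcast the two booleans into one 2-D boolean mask first.
mask2 = np.zeros_like(mask)
mask2[row_ok[:, None] & col_ok[None, :]] = 1
```

The second form still performs an N*M boolean AND, but the constraints themselves are only computed on the 1-D ranges.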

Per your comment, let's try this fancier example:
import numpy as np
import matplotlib.pyplot as plt
mask = np.zeros((90,90), dtype=np.uint8)
# criteria
def f(x,y): return ((x-20)**2 < 50) & ((y-20)**2 < 50)
# ranges
x,y = np.arange(90), np.arange(90)
# meshgrid
xx,yy = np.meshgrid(x,y)
zz = f(xx,yy)
# mask
mask[zz] = 1
plt.imshow(mask, cmap='gray')


How to populate a column with a value if condition is met across 2+ other columns

My dataframe is similar to the table below. I have 6 columns, each with 'Yes' or 'No' if a specific antibiotic was given.
AZITH  CLIN  CFTX  METRO  CFTN  DOXY  TREATED
Yes    Yes   No    No     No    No
No     Yes   No    Yes    No    No
Yes    Yes   No    No     No    No
No     No    No    No     No    No
Yes    Yes   Yes   Yes    Yes   Yes
No     Yes   Yes   Yes    No    Yes
No     No    No    No     No    Yes
No     No    No    No     No    No
Yes    Yes   Yes   No     No    No
Yes    No    Yes   Yes    No    No
Yes    No    No    No     Yes   No
No     No    Yes   Yes    No    Yes
No     No    No    No     Yes   Yes
No     No    No    Yes    No    Yes
I want to fill the column 'TREATED' with 'True' if specific combinations of antibiotic columns contain 'Yes.' If the conditions aren't met, then I would like to fill the 'TREATED' column with a 'False' value.
If ['AZITH'] & ['CLIN'] == 'Yes' |
['AZITH'] & ['CFTX'] & ['CLIN'] == 'Yes' |
['AZITH'] & ['CFTX'] & ['METRO']== 'Yes' |
['AZITH'] & ['CFTN'] == 'Yes' |
['CFTX'] & ['DOXY'] & ['METRO']== 'Yes' |
['CFTN'] & ['DOXY'] == 'Yes' |
['DOXY'] & ['METRO']== 'Yes' ,
Then return 'True' in column 'TREATED'
Else 'False'
What I had in mind was some sort of if statement or use of lambda function, however, I am having trouble.
This must not be exclusive to the above combinations but also include for example if all 6 medications were given. If that's the case, then 'True' should be returned because the condition has been met to give at least 2 of the treatment medications.
The desired output is below:
AZITH  CLIN  CFTX  METRO  CFTN  DOXY  TREATED
Yes    Yes   No    No     No    No    Yes
No     Yes   No    Yes    No    No    No
Yes    Yes   No    No     No    No    Yes
No     No    No    No     No    No    No
Yes    Yes   Yes   Yes    Yes   Yes   Yes
No     Yes   Yes   Yes    No    Yes   Yes
No     No    No    No     No    Yes   No
No     No    No    No     No    No    No
Yes    Yes   Yes   No     No    No    Yes
Yes    No    Yes   Yes    No    No    Yes
Yes    No    No    No     Yes   No    Yes
No     No    Yes   Yes    No    Yes   Yes
No     No    No    No     Yes   Yes   Yes
No     No    No    Yes    No    Yes   Yes
With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
    {
        "AZITH": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No"],
        "CLIN": ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"],
        "CFTX": ["No", "No", "No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No"],
        "METRO": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes"],
        "CFTN": ["No", "No", "No", "No", "Yes", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No"],
        "DOXY": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "Yes"],
    }
)
Here is one way to do it:
mask = (
    ((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes"))
    | ((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes") & (df["CFTX"] == "Yes"))
    | ((df["AZITH"] == "Yes") & (df["CFTX"] == "Yes") & (df["METRO"] == "Yes"))
    | ((df["AZITH"] == "Yes") & (df["CFTN"] == "Yes"))
    | ((df["CFTX"] == "Yes") & (df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
    | ((df["CFTN"] == "Yes") & (df["DOXY"] == "Yes"))
    | ((df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
)
df.loc[mask, "TREATED"] = "Yes"
df = df.fillna("No")
Then:
print(df)
# Output
AZITH CLIN CFTX METRO CFTN DOXY TREATED
0 Yes Yes No No No No Yes
1 No Yes No Yes No No No
2 Yes Yes No No No No Yes
3 No No No No No No No
4 Yes Yes Yes Yes Yes Yes Yes
5 No Yes Yes Yes No Yes Yes
6 No No No No No Yes No
7 No No No No No No No
8 Yes Yes Yes No No No Yes
9 Yes No Yes Yes No No Yes
10 Yes No No No Yes No Yes
11 No No Yes Yes No Yes Yes
12 No No No No Yes Yes Yes
13 No No No Yes No Yes Yes
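If the list of combinations is likely to grow, the same mask could be built in a loop instead of being written out by hand. This is only a sketch on a small made-up sample with the same columns; the names valid_combos and given are assumptions introduced here.

```python
import pandas as pd

# Small made-up sample with the same columns as above.
df = pd.DataFrame({
    "AZITH": ["Yes", "No", "No", "No"],
    "CLIN":  ["Yes", "Yes", "Yes", "No"],
    "CFTX":  ["No", "No", "Yes", "No"],
    "METRO": ["No", "Yes", "Yes", "No"],
    "CFTN":  ["No", "No", "No", "No"],
    "DOXY":  ["No", "No", "Yes", "Yes"],
})

valid_combos = [
    ["AZITH", "CLIN"],
    ["AZITH", "CFTX", "CLIN"],
    ["AZITH", "CFTX", "METRO"],
    ["AZITH", "CFTN"],
    ["CFTX", "DOXY", "METRO"],
    ["CFTN", "DOXY"],
    ["DOXY", "METRO"],
]

given = df.eq("Yes")                      # boolean frame, one column per drug
mask = pd.Series(False, index=df.index)
for cols in valid_combos:
    mask |= given[cols].all(axis=1)       # True where every drug of this combination was given

df["TREATED"] = mask.map({True: "Yes", False: "No"})
```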
This is a little abstract, but you can use bit flags: assign each antibiotic column a distinct binary value, combine the flags of the columns that are Yes (True) into one number per row, and then fill TREATED with a bitwise test in an if statement.
https://dietertack.medium.com/using-bit-flags-in-c-d39ec6e30f08
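A rough Python sketch of the bit-flag idea, on a small made-up sample. The names FLAG, to_bits and VALID are hypothetical helpers introduced for illustration; a row counts as treated when all bits of at least one valid combination are set.

```python
import pandas as pd

COLS = ["AZITH", "CLIN", "CFTX", "METRO", "CFTN", "DOXY"]
FLAG = {name: 1 << i for i, name in enumerate(COLS)}   # AZITH=1, CLIN=2, CFTX=4, ...

def to_bits(names):
    """OR together the flags of the named drugs."""
    bits = 0
    for name in names:
        bits |= FLAG[name]
    return bits

VALID = [to_bits(c) for c in [
    ("AZITH", "CLIN"), ("AZITH", "CFTX", "CLIN"), ("AZITH", "CFTX", "METRO"),
    ("AZITH", "CFTN"), ("CFTX", "DOXY", "METRO"), ("CFTN", "DOXY"),
    ("DOXY", "METRO"),
]]

# Small made-up sample with the same columns as the question.
df = pd.DataFrame({
    "AZITH": ["Yes", "No", "No", "No"],
    "CLIN":  ["Yes", "Yes", "Yes", "No"],
    "CFTX":  ["No", "No", "Yes", "No"],
    "METRO": ["No", "Yes", "Yes", "No"],
    "CFTN":  ["No", "No", "No", "No"],
    "DOXY":  ["No", "No", "Yes", "Yes"],
})

# One integer per row: the sum of the flags of every drug marked "Yes".
given = sum(df[c].eq("Yes").astype(int) * FLAG[c] for c in COLS)

# A combination v is satisfied when all of its bits are present in the row.
df["TREATED"] = [
    "Yes" if any((int(g) & v) == v for v in VALID) else "No"
    for g in given
]
```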
You can use set operations, first aggregate as the sets of given medications, then check over all possible combinations if you have a superset:
valid_treatments = [{'AZITH', 'CLIN'}, {'AZITH', 'CFTX', 'CLIN'},
                    {'AZITH', 'CFTX', 'METRO'}, {'AZITH', 'CFTN'},
                    {'CFTX', 'DOXY', 'METRO'}, {'CFTN', 'DOXY'},
                    {'DOXY', 'METRO'},
                   ]
def is_valid(row):
    combination = set(df.columns[row])
    return 'Yes' if any(
        combination.issuperset(v)
        for v in valid_treatments
    ) else 'No'
out = df.assign(TREATED=df.eq('Yes').apply(is_valid, axis=1))
Output:
AZITH CLIN CFTX METRO CFTN DOXY TREATED
0 Yes Yes No No No No Yes
1 No Yes No Yes No No No
2 Yes Yes No No No No Yes
3 No No No No No No No
4 Yes Yes Yes Yes Yes Yes Yes
5 No Yes Yes Yes No Yes Yes
6 No No No No No Yes No
7 No No No No No No No
8 Yes Yes Yes No No No Yes
9 Yes No Yes Yes No No Yes
10 Yes No No No Yes No Yes
11 No No Yes Yes No Yes Yes
12 No No No No Yes Yes Yes
13 No No No Yes No Yes Yes
My answer attempts to vectorize this solution. I got the idea for the superset from Mozway; before that I couldn't figure out how to handle all the combos.
import numpy as np
import pandas as pd
import itertools
df = pd.DataFrame({ "AZITH": ["Yes","No","Yes","No","Yes","No","No","No","Yes","Yes","Yes","No","No","No",],
"CLIN": ["Yes","Yes","Yes","No","Yes","Yes","No","No","Yes","No","No","No","No","No",],
"CFTX": ["No","No","No","No","Yes","Yes","No","No","Yes","Yes","No","Yes","No","No",],
"METRO": ["No","Yes","No","No","Yes","Yes","No","No","No","Yes","No","Yes","No","Yes",],
"CFTN": ["No","No","No","No","Yes","No","No","No","No","No","Yes","No","Yes","No",],
"DOXY": ["No","No","No","No","Yes","Yes","Yes","No","No","No","No","Yes","Yes","Yes",]})
combos = np.array([[1,1,0,0,0,0],[1,1,1,0,0,0],[1,0,1,1,0,0],[1,0,0,0,1,0],[0,0,1,1,0,1],[0,0,0,0,1,1],[0,0,0,1,0,1]])
df = df.replace("Yes",1)
df = df.replace("No",0)
c = []
for l in range(len(combos)):
    c.extend(itertools.combinations(range(len(combos)),l))
all_combos = combos
for combo in c[1:]:
    combined = np.sum(combos[combo,:],axis=0)
    all_combos = np.vstack([all_combos,combined])
all_combos[all_combos!=0]=1
all_combos = np.unique(all_combos,axis=0)
combo_sum = all_combos.sum(axis=1)
all_combos[all_combos==0]=-1
new_df = df.dot(all_combos.transpose())
for i,x in enumerate(combo_sum):
    new_df.loc[new_df[i]<x,i] = 0
new_df[new_df>0]=1
new_df["res"] = new_df.sum(axis=1)
new_df.loc[new_df.res>0,"res"] = True
new_df.loc[new_df.res==0,"res"] = False
df["res"] = new_df["res"]
AZITH CLIN CFTX METRO CFTN DOXY res
0 1 1 0 0 0 0 True
1 0 1 0 1 0 0 False
2 1 1 0 0 0 0 True
3 0 0 0 0 0 0 False
4 1 1 1 1 1 1 True
5 0 1 1 1 0 1 False
6 0 0 0 0 0 1 False
7 0 0 0 0 0 0 False
8 1 1 1 0 0 0 True
9 1 0 1 1 0 0 True
10 1 0 0 0 1 0 True
11 0 0 1 1 0 1 True
12 0 0 0 0 1 1 True
13 0 0 0 1 0 1 True
The general explanation of the code is that I create a numpy array of all combos including combinations of combos (sum of two or more combos) that are acceptable. I delete duplicate combinations and this is what is left
np.unique(all_combos,axis=0)
Out[38]:
array([[0, 0, 0, 0, 1, 1],
[0, 0, 0, 1, 0, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 1, 1, 0, 1],
[0, 0, 1, 1, 1, 1],
[1, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 1, 1],
[1, 0, 0, 1, 1, 1],
[1, 0, 1, 1, 0, 0],
[1, 0, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 0],
[1, 0, 1, 1, 1, 1],
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 1, 1],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 1, 0],
[1, 1, 1, 0, 1, 1],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 0, 1],
[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1]])
Any extra medications that are not part of the combo are penalized by setting the value to -1 in the combo list. (If extra medications are not to be penalized then the superset is not needed, you can just compare to a sum of the original combos variable.)
A dot product between the dataset and the set of all combos is then done and the value is compared against the sum of the combo (prior to replacing the 0s with -1s). This means if the value is 3 and the expected outcome of the combo is 3, then it is a valid combo. Below is the sum of the combos as an array
ipdb> combo_sum
array([2, 2, 3, 3, 4, 2, 3, 4, 3, 4, 4, 5, 2, 3, 4, 4, 5, 3, 4, 5, 4, 5,
5, 6])
ipdb> new_df
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
0 -2 -2 -2 -2 -2 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2
1 -2 0 0 0 0 -2 -2 0 0 0 ... 0 2 2 0 0 0 2 2 2 2
2 -2 -2 -2 -2 -2 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 -2 -2 0 0 2 -2 0 2 0 2 ... 2 2 4 0 2 4 2 4 4 6
5 -2 0 0 2 2 -4 -2 0 0 2 ... 0 2 2 0 0 2 2 4 2 4
6 1 1 1 1 1 -1 1 1 -1 1 ... 1 1 1 -1 -1 1 -1 1 -1 1
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 -3 -3 -3 -1 -1 -1 -1 -1 1 1 ... 1 1 1 3 3 3 3 3 3 3
9 -3 -1 -1 1 1 -1 -1 1 3 3 ... -1 1 1 1 1 1 3 3 3 3
10 0 -2 0 -2 0 2 2 2 0 0 ... 2 0 2 0 2 2 0 0 2 2
11 -1 1 1 3 3 -3 -1 1 1 3 ... -1 1 1 -1 -1 1 1 3 1 3
12 2 0 2 0 2 0 2 2 -2 0 ... 2 0 2 -2 0 2 -2 0 0 2
13 0 2 2 2 2 -2 0 2 0 2 ... 0 2 2 -2 -2 0 0 2 0 2
After the dot product, we replace the valid values with 1 and invalid (less than the expected sum) with 0. We sum on all the combinations to see if any are valid. If the sum of combinations >= 1, then at least one combo was valid. Else, all were invalid.
ipdb> new_df
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 res
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 1
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 1
9 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 1
10 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
11 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
12 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
13 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
Replace the final summed column with True or False and apply it to the original dataframe.
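For the simpler case mentioned above, where extra medications are not penalized, a sketch of the comparison against the sums of the original combos might look like this. The array X holds a few rows of the data in 0/1 form; the names X, hits and treated are assumptions introduced here. Note that row 5 comes out True in this variant, unlike in the table above, because its extra medication is ignored.

```python
import numpy as np

# Columns: AZITH, CLIN, CFTX, METRO, CFTN, DOXY (1 = given).
combos = np.array([[1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0], [1, 0, 1, 1, 0, 0],
                   [1, 0, 0, 0, 1, 0], [0, 0, 1, 1, 0, 1], [0, 0, 0, 0, 1, 1],
                   [0, 0, 0, 1, 0, 1]])

X = np.array([
    [1, 1, 0, 0, 0, 0],   # row 0 of the data above
    [0, 1, 0, 1, 0, 0],   # row 1
    [0, 1, 1, 1, 0, 1],   # row 5
    [0, 0, 0, 0, 0, 1],   # row 6
])

# hits[i, j] = number of drugs from combo j that were given in row i.
hits = X @ combos.T

# A combo is satisfied when all of its drugs were given, extras allowed.
treated = (hits == combos.sum(axis=1)).any(axis=1)
```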

add missing values in pandas dataframe - datacleaning

I have measurements stored in a data frame that looks like the one below.
These are measurements of PMs. The sensors measure four of them (pm1, pm2.5, pm5, pm10), listed in the column indicator, under the conditions x1..x56, and report each measurement in the columns area and count. The problem is that under some conditions (combinations of the columns x1..x56) the sensors didn't catch all the PMs. For every combination of the condition columns (x1..x56) I want all 4 PM values to be present in the column indicator. If the sensor didn't catch one (there is no row for that PM under some combination of Xs), I should add it, with area and count set to 0.
x1   x2   x3  x4   x5  x6  ..  x56  indicator  area  count
0    0    0   0    0   0   ..  0    pm1        10    56
0    0    0   0    0   0   ..  0    pm10       9     1
0    0    0   0    0   0   ..  0    pm5        1     454
.............................................
1    0    0   0    0   0   ..  0    pm1        3     4
ssl  ax   w   45b  g   g   ..  gb   pm1        3     4
1    wdf  sw  d78  b   fd  ..  b    pm1        3     4
In this example for the first combination of all zeros, pm2.5 is missing so I should add it and put its area and count to be 0. Similar for the second combination (the one that starts with 1). So my dummy example should look like this after I finish:
x1   x2   x3  x4   x5  x6  ..  x56  indicator  area  count
0    0    0   0    0   0   ..  0    pm1        10    56
0    0    0   0    0   0   ..  0    pm10       9     1
0    0    0   0    0   0   ..  0    pm5        1     454
0    0    0   0    0   0   ..  0    pm2.5      0     0
.............................................
1    0    0   0    0   0   ..  0    pm1        3     4
1    0    0   0    0   0   ..  0    pm10       0     0
1    0    0   0    0   0   ..  0    pm5        0     0
1    0    0   0    0   0   ..  0    pm2.5      0     0
ssl  ax   w   45b  g   g   ..  gb   pm1        3     4
ssl  ax   w   45b  g   g   ..  gb   pm10       0     0
ssl  ax   w   45b  g   g   ..  gb   pm5        0     0
ssl  ax   w   45b  g   g   ..  gb   pm2.5      0     0
1    wdf  sw  d78  b   fd  ..  b    pm1        3     4
1    wdf  sw  d78  b   fd  ..  b    pm10       0     0
1    wdf  sw  d78  b   fd  ..  b    pm5        0     0
1    wdf  sw  d78  b   fd  ..  b    pm2.5      0     0
How can I do that? Thanks in advance!
The key here is to create a MultiIndex from all combinations of x and indicator then fill missing records.
Step 1.
Create a vector of x columns:
df['x'] = df.filter(regex=r'^x\d+').apply(tuple, axis=1)
print(df)
# Output:
x1 x2 x3 x4 x5 x6 x56 indicator area count x
0 0 0 0 0 0 0 0 pm1 10 56 (0, 0, 0, 0, 0, 0, 0)
1 0 0 0 0 0 0 0 pm10 9 1 (0, 0, 0, 0, 0, 0, 0)
2 0 0 0 0 0 0 0 pm5 1 454 (0, 0, 0, 0, 0, 0, 0)
3 1 0 0 0 0 0 0 pm1 3 4 (1, 0, 0, 0, 0, 0, 0)
Step 2.
Create the MultiIndex from vector x and the indicator list, then reindex your dataframe.
mi = pd.MultiIndex.from_product([df['x'].unique(),
['pm1', 'pm2.5', 'pm5', 'pm10']],
names=['x', 'indicator'])
out = df.set_index(['x', 'indicator']).reindex(mi, fill_value=0)
print(out)
# Output:
                                 x1  x2  x3  x4  x5  x6  x56  area  count
x                     indicator
(0, 0, 0, 0, 0, 0, 0) pm1         0   0   0   0   0   0    0    10     56
                      pm2.5       0   0   0   0   0   0    0     0      0
                      pm5         0   0   0   0   0   0    0     1    454
                      pm10        0   0   0   0   0   0    0     9      1
(1, 0, 0, 0, 0, 0, 0) pm1         1   0   0   0   0   0    0     3      4
                      pm2.5     *0*  0   0   0   0   0    0     0      0
                      pm5       *0*  0   0   0   0   0    0     0      0
                      pm10      *0*  0   0   0   0   0    0     0      0
# Need to be fixed ---------------^
Step 3.
Group by the x index level and rebuild the x columns from each group's key (the tuple stored in x), so that the rows added by reindex get the right condition values instead of the fill value 0.
out = out.filter(regex=r'^x\d+').groupby(level='x') \
         .apply(lambda x: pd.Series(dict(zip(x.columns, x.name)))) \
         .join(out[['area', 'count']]).reset_index()[df.columns[:-1]]
print(out)
# Output:
x1 x2 x3 x4 x5 x6 x56 indicator area count
0 0 0 0 0 0 0 0 pm1 10 56
1 0 0 0 0 0 0 0 pm2.5 0 0
2 0 0 0 0 0 0 0 pm5 1 454
3 0 0 0 0 0 0 0 pm10 9 1
4 1 0 0 0 0 0 0 pm1 3 4
5 1 0 0 0 0 0 0 pm2.5 0 0
6 1 0 0 0 0 0 0 pm5 0 0
7 1 0 0 0 0 0 0 pm10 0 0
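One possible variant, sketched here under the assumption that each (conditions, indicator) pair occurs at most once, is to keep the x columns themselves as index levels. Then fill_value=0 only touches area and count, and the fix-up of Step 3 is not needed. The miniature frame and the names x_cols, observed and full_index are made up for illustration.

```python
import pandas as pd

# Made-up miniature of the data: two condition columns instead of 56.
df = pd.DataFrame({
    "x1": [0, 0, 0, 1],
    "x2": [0, 0, 0, 0],
    "indicator": ["pm1", "pm10", "pm5", "pm1"],
    "area": [10, 9, 1, 3],
    "count": [56, 1, 454, 4],
})
x_cols = ["x1", "x2"]
indicators = ["pm1", "pm2.5", "pm5", "pm10"]

# Every observed condition combination crossed with every indicator.
observed = df[x_cols].drop_duplicates().itertuples(index=False, name=None)
full_index = pd.MultiIndex.from_tuples(
    [(*cond, ind) for cond in observed for ind in indicators],
    names=x_cols + ["indicator"],
)

# Only area and count remain as columns, so fill_value=0 affects nothing else.
out = (
    df.set_index(x_cols + ["indicator"])
      .reindex(full_index, fill_value=0)
      .reset_index()
)
```

With the real data, x_cols would be the list of all 56 condition columns, for example the column names matched by df.filter(regex=r'^x\d+').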

Get_dummies produces more columns than it's supposed to

I'm using get_dummies on a column of data that contains zeroes, 'D' or 'E'. Instead of producing 2 columns it produces 5: C, D, E, N, O. I'm not sure what they are or how to make it produce just 2, as it's supposed to.
When I just pull that column, it shows 0's, D and E, but when I pass it to get_dummies it adds extra columns:
data[[2]]
0
0
D
0
0
0
0
D
0
0
When I do this:
dummy = pd.get_dummies(data[2], dummy_na = False)
dummy.head()
I get
0 C D E N O PreferredContactTime
0 0 0 0 0 0 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
0 0 1 0 0 0 0
1 0 0 0 0 0 0
What are C , N and O? I don't understand what it is displaying at all.
Setup
dtype = pd.CategoricalDtype([0, 'C', 'D', 'E', 'N', 'O', 'PreferredContactTime'])
data = pd.DataFrame({2: [
'PreferredContactTime', 0, 0, 'D', 0, 0, 0, 0, 'D', 0, 0
]}).astype(dtype)
Your result
dummy = pd.get_dummies(data[2], dummy_na=False )
dummy.head()
   0  C  D  E  N  O  PreferredContactTime
0  0  0  0  0  0  0                     1
1  1  0  0  0  0  0                     0
2  1  0  0  0  0  0                     0
3  0  0  1  0  0  0                     0
4  1  0  0  0  0  0                     0
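Assuming the column really is categorical, as in the Setup above, the extra columns would be categories that are declared in the dtype but never occur in the data: get_dummies emits one column per category, used or not. A minimal sketch with made-up values, where dropping the unused categories first leaves only the values that are present:

```python
import pandas as pd

# Categorical dtype that declares more categories than the data uses.
dtype = pd.CategoricalDtype(["0", "C", "D", "E", "N", "O"])
s = pd.Series(["0", "0", "D", "0", "E"]).astype(dtype)

# One column per declared category, including the unused C, N and O.
all_cols = pd.get_dummies(s)

# Drop categories that never occur before encoding.
used_only = pd.get_dummies(s.cat.remove_unused_categories())

print(list(all_cols.columns))    # ['0', 'C', 'D', 'E', 'N', 'O']
print(list(used_only.columns))   # ['0', 'D', 'E']
```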

Create tensors where all elements up to a given index are 1s, the rest are 0s

I have a placeholder lengths = tf.placeholder(tf.int32, [10]). Each of the 10 values assigned to this placeholder is <= 25. I now want to create a 2-dimensional tensor, called masks, of shape [10, 25], where each of the 10 vectors of length 25 has its first n elements set to 1 and the rest set to 0, with n being the corresponding value in lengths.
What is the easiest way to do this using TensorFlow's built in methods?
For example:
lengths = [4, 6, 7, ...]
-> masks = [[1, 1, 1, 1, 0, 0, 0, 0, ..., 0],
[1, 1, 1, 1, 1, 1, 0, 0, ..., 0],
[1, 1, 1, 1, 1, 1, 1, 0, ..., 0],
...
]
You can reshape lengths to a (10, 1) tensor, then compare it with a sequence of indices 0, 1, 2, ..., 24. Due to broadcasting, the comparison is True where the index is smaller than the corresponding length and False otherwise; you can then cast the boolean result to 1s and 0s:
import tensorflow as tf

lengths = tf.constant([4, 6, 7])
n_features = 25

masks = tf.cast(tf.range(n_features) < tf.reshape(lengths, (-1, 1)), tf.int8)
with tf.Session() as sess:
    print(sess.run(masks))
#[[1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
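The broadcasting step can be checked without a TensorFlow session. The following is a NumPy analogue of the comparison above, not TensorFlow code: a row of indices with shape (25,) is compared against a column of lengths with shape (3, 1), which broadcasts to shape (3, 25).

```python
import numpy as np

lengths = np.array([4, 6, 7])
n_features = 25

# (25,) compared with (3, 1) broadcasts to (3, 25): True where index < length.
masks = (np.arange(n_features) < lengths[:, None]).astype(np.int8)

print(masks.shape)         # (3, 25)
print(masks.sum(axis=1))   # [4 6 7]
```

TensorFlow also provides tf.sequence_mask(lengths, maxlen), which builds this kind of mask directly; it returns booleans unless a dtype is passed.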

tensorflow 0.8 one hot encoding

The data that I want to encode looks as follows:
print (train['labels'])
[ 0 0 0 ..., 42 42 42]
There are 43 classes, numbered 0-42.
Now I read that TensorFlow 0.8 has a new feature for one-hot encoding, so I tried to use it as follows:
trainhot=tf.one_hot(train['labels'], 43, on_value=1, off_value=0)
The only problem is that I think the output is not what I need:
print (trainhot[1])
Tensor("strided_slice:0", shape=(43,), dtype=int32)
Can someone nudge me in the right direction please :)
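The output is correct and expected. trainhot[1] is the one-hot label of the second training sample (index 1, 0-based), which has 1-D shape (43,). Printing it shows the symbolic Tensor rather than its values; to see the values, evaluate it in a session. You can play with the code below to better understand tf.one_hot: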
onehot = tf.one_hot([0, 0, 41, 42], 43, on_value=1, off_value=0)
with tf.Session() as sess:
    onehot_v = sess.run(onehot)
    print("v: ", onehot_v)
    print("v shape: ", onehot_v.shape)
    print("v[1] shape: ", onehot[1])
output:
v: [[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1]]
v shape: (4, 43)
v[1] shape: Tensor("strided_slice:0", shape=(43,), dtype=int32)
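To see concrete values without building a graph, the same encoding can be reproduced in NumPy. This is a sketch of the idea, not of what tf.one_hot does internally: each label selects the matching row of an identity matrix.

```python
import numpy as np

labels = np.array([0, 0, 41, 42])
n_classes = 43

# Row k of the identity matrix is the one-hot vector for class k.
onehot = np.eye(n_classes, dtype=np.int32)[labels]

print(onehot.shape)            # (4, 43)
print(onehot[1].shape)         # (43,)
print(onehot.argmax(axis=1))   # [ 0  0 41 42]
```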