The dataframe looks something like this:
import pandas as pd

start = [0,2,4,5,1]
end = [3,5,5,5,2]
df = pd.DataFrame({'start': start,'end': end})
The result I want looks something like this:
Basically, I am marking a value from start to end across multiple columns. So if a row starts at 0 and ends at 3, I want to set the new columns 0 through 3 to a value (1) and the rest to 0.
start = [0,2,4,5,1]
end = [3,5,5,5,2]
diff = [3,3,1,0,1]
col_0 = [1,0,0,0,0]
col_1=[1,0,0,0,1]
col_2 = [1,1,0,0,1]
col_3=[1,1,0,0,0]
col_4=[0,1,1,0,0]
col_5=[0,1,1,1,0]
df = pd.DataFrame({'start': start,'end': end, 'col_0':col_0, 'col_1': col_1, 'col_2': col_2, 'col_3':col_3, 'col_4': col_4, 'col_5': col_5})
start end col_0 col_1 col_2 col_3 col_4 col_5
0 3 1 1 1 1 0 0
2 5 0 0 1 1 1 1
4 5 0 0 0 0 1 1
5 5 0 0 0 0 0 1
1 2 0 1 1 0 0 0
Use dict.fromkeys in a list comprehension for each row of the DataFrame and pass the result to the DataFrame constructor; this is a good choice if performance is important:
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]
df = df.join(pd.DataFrame(L, index=df.index).add_prefix('col_').fillna(0).astype(int))
print (df)
start end col_0 col_1 col_2 col_3 col_4 col_5
0 0 3 1 1 1 1 0 0
1 2 5 0 0 1 1 1 1
2 4 5 0 0 0 0 1 1
3 5 5 0 0 0 0 0 1
4 1 2 0 1 1 0 0 0
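For comparison, a broadcasting-based sketch of the same idea (my addition, not part of the answer): build a grid of column labels 0..max(end) and compare it against each row's bounds.
import numpy as np
#compare every column label against the per-row bounds via broadcasting
cols = np.arange(df['end'].max() + 1)
mask = (cols >= df['start'].to_numpy()[:, None]) & (cols <= df['end'].to_numpy()[:, None])
#`out` is a hypothetical name; rebuilt from the original two columns
out = df[['start', 'end']].join(pd.DataFrame(mask.astype(int), index=df.index).add_prefix('col_'))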
If some value of the range can be missing, like in the changed sample data below, add DataFrame.reindex:
#missing column 6
start = [0,2,4,7,1]
end = [3,5,5,8,2]
df = pd.DataFrame({'start': start,'end': end})
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]
df1 = (pd.DataFrame(L, index=df.index)
.reindex(columns=range(df['start'].min(), df['end'].max() + 1), fill_value=0)
.add_prefix('col_')
.fillna(0)
.astype(int))
df = df.join(df1)
print (df)
start end col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8
0 0 3 1 1 1 1 0 0 0 0 0
1 2 5 0 0 1 1 1 1 0 0 0
2 4 5 0 0 0 0 1 1 0 0 0
3 7 8 0 0 0 0 0 0 0 1 1
4 1 2 0 1 1 0 0 0 0 0 0
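To see why reindex is needed, inspect the columns of the intermediate frame; no interval covers 6, so that column is simply absent before reindexing:
print(pd.DataFrame(L, index=df.index).columns.tolist())
#[0, 1, 2, 3, 4, 5, 7, 8] - column 6 is missing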
EDIT: For counting hours, use:
start = pd.to_datetime([0,2,4,5,1], format='%H')
end = pd.to_datetime([3,5,5,5,2], format='%H')
df = pd.DataFrame({'start': start,'end': end})
df.loc[[0,1], 'end'] += pd.Timedelta(1, 'day')
#list of dicts keyed by hourly datetimes
L = [dict.fromkeys(pd.date_range(s, e, freq='H'), 1) for s, e in zip(df['start'], df['end'])]
df1 = pd.DataFrame(L, index=df.index)
#aggregate sum by hours in columns
df1 = df1.groupby(df1.columns.hour, axis=1).sum().astype(int)
print (df1)
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 \
0 2 2 2 2 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
1 1 1 2 2 2 2 1 1 1 1 ... 1 1 1 1 1 1 1
2 0 0 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0
4 0 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
21 22 23
0 1 1 1
1 1 1 1
2 0 0 0
3 0 0 0
4 0 0 0
[5 rows x 24 columns]
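To see what the dicts in L are keyed on, you can inspect the hourly range for a single row (a minimal check of my own; the 1900-01-01 base date comes from parsing with format='%H'):
#row 2 spans 4:00 to 5:00, so two hourly keys are generated
print(pd.date_range(df.loc[2, 'start'], df.loc[2, 'end'], freq='H'))
#DatetimeIndex(['1900-01-01 04:00:00', '1900-01-01 05:00:00'], dtype='datetime64[ns]', freq='H')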
Convert each range from start to end to a list of indices, then explode it. Finally, use indexing to set the values to 1:
import numpy as np
range_to_ind = lambda x: range(x['start'], x['end']+1)
(i, j) = df.apply(range_to_ind, axis=1).explode().astype(int).reset_index().values.T
a = np.zeros((df.shape[0], max(df['end'])+1), dtype=int)
a[i, j] = 1
df = df.join(pd.DataFrame(a).add_prefix('col_'))
Output:
>>> df
start end col_0 col_1 col_2 col_3 col_4 col_5
0 0 3 1 1 1 1 0 0
1 2 5 0 0 1 1 1 1
2 4 5 0 0 0 0 1 1
3 5 5 0 0 0 0 0 1
4 1 2 0 1 1 0 0 0
This is my dataframe:
0 1 0 1 1
1 0 1 0 1
I generate the sum for each column as below:
data.iloc[:,1:] = data.iloc[:,1:].sum(axis=0)
The result is:
0 1 1 1 2
1 1 1 1 2
But I only want to update values that are not zero:
0 1 0 1 2
1 0 1 0 2
As it is a large dataframe and I don't know which columns will contain zeros, I am having trouble getting the condition to work together with iloc.
Assuming the following input:
0 1 2 3 4
0 0 1 0 1 1
1 1 0 1 0 1
you can use the underlying numpy array and numpy.where:
import numpy as np
a = data.values[:, 1:]
data.iloc[:,1:] = np.where(a!=0, a.sum(0), a)
output:
0 1 2 3 4
0 0 1 0 1 2
1 1 0 1 0 2
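If you prefer to stay in pandas, a mask-based sketch should give the same result (an alternative of mine, not the answer's method; mask aligns the Series of column sums along axis=1):
sub = data.iloc[:, 1:]
#replace non-zero entries with their column's sum, keep zeros as-is
data.iloc[:, 1:] = sub.mask(sub != 0, sub.sum(), axis=1)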
The get_dummies method does not seem to work as expected when used with more than one column.
For example, if I have this dataframe...
shopping_list = [
["Apple", "Bread", "Fridge"],
["Rice", "Bread", "Milk"],
["Apple", "Rice", "Bread", "Milk"],
["Rice", "Milk"],
["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
If I use the get_dummies method, the items are repeated across columns like this:
pd.get_dummies(df)
0_Apple 0_Rice 1_Bread 1_Milk 1_Rice 2_Bread 2_Fridge 2_Milk 3_Milk
0 1 0 1 0 0 0 1 0 0
1 0 1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0 0 1
3 0 1 0 1 0 0 0 0 0
4 1 0 1 0 0 0 0 1 0
While the expected result is:
Apple Bread Fridge Milk Rice
0 1 1 1 0 0
1 0 1 0 1 1
2 1 1 0 1 1
3 0 0 0 1 1
4 1 1 0 1 0
Add the prefix and prefix_sep parameters to get_dummies, and then add max to avoid duplicated column names (it aggregates by max):
df = pd.get_dummies(df, prefix='', prefix_sep='').max(axis=1, level=0)
print(df)
Apple Rice Bread Milk Fridge
0 1 0 1 0 1
1 0 1 1 1 0
2 1 1 1 1 0
3 0 1 0 1 0
4 1 0 1 1 0
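Note that max(axis=1, level=0) relies on the deprecated level argument and no longer works in recent pandas. A stack-based sketch that avoids it: stack the frame into one long Series, one-hot encode it, then take the row-wise max with a groupby:
df = pd.DataFrame(shopping_list)
#stack drops the NaN padding; get_dummies encodes the single column of items
out = pd.get_dummies(df.stack()).groupby(level=0).max()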
I want to create a "choice" variable that indicates what choice was taken among some alternatives.
The alternatives in this example are: 123, 456, 789.
If there is no choice, assign a 0 value; if there are multiple choices, assign a 1 value;
otherwise, assign the choice name (taken from the column name).
Data illustration:
ID Date X 123 456 789
A 07/16/2019 .. 1 0 0
A 07/19/2019 .. 0 0 0
A 07/20/2019 .. 0 1 0
A 07/22/2019 .. 1 0 0
A 07/23/2019 .. 0 1 1
B 07/27/2019 .. 0 0 1
B 07/28/2019 .. 0 0 0
B 07/30/2019 .. 0 0 0
Expected result:
ID Date X 123 456 789 choice
A 07/16/2019 .. 1 0 0 123
A 07/19/2019 .. 0 0 0 0
A 07/20/2019 .. 0 1 0 456
A 07/22/2019 .. 1 0 0 123
A 07/23/2019 .. 0 1 1 1
B 07/27/2019 .. 0 0 1 789
B 07/28/2019 .. 0 0 0 0
B 07/30/2019 .. 0 0 0 0
Use numpy.select with DataFrame.idxmax:
import numpy as np
#select the last 3 columns
df1 = df.iloc[:, -3:]
#count the 1 values per row
s = df1.sum(axis=1)
#set values by conditions
df['choice'] = np.select([s == 1, s == 0], [df1.idxmax(axis=1), 0], default=1)
print (df)
ID Date X 123 456 789 choice
0 A 07/16/2019 .. 1 0 0 123
1 A 07/19/2019 .. 0 0 0 0
2 A 07/20/2019 .. 0 1 0 456
3 A 07/22/2019 .. 1 0 0 123
4 A 07/23/2019 .. 0 1 1 1
5 B 07/27/2019 .. 0 0 1 789
6 B 07/28/2019 .. 0 0 0 0
7 B 07/30/2019 .. 0 0 0 0
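For reference, an equivalent sketch (my variation) that obtains the chosen name with a dot product instead of idxmax: dotting the 0/1 flags with the column names concatenates the names of the selected columns, which is exactly the single column name whenever s == 1:
#dot of int flags with string labels: 1*'123' == '123', 0*'123' == ''
df['choice'] = np.select([s == 1, s == 0], [df1.dot(df1.columns), 0], default=1)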
Here is a way to do it by selection, using a custom function that does the job and pandas apply:
#list with the names of valid alternative columns
alternatives = ['123', '456', '789']
#custom function to do the selection
def pick_choice(row):
    ones = row[alternatives].loc[row[alternatives] == 1]
    if len(ones) == 0:
        return 0
    elif len(ones) > 1:
        return 1
    elif len(ones) == 1:
        return ones.index[0]
df['choice'] = df.apply(pick_choice, axis=1)
Resulting df is:
ID Date X 123 456 789 choice
0 A 07/16/2019 .. 1 0 0 123
1 A 07/19/2019 .. 0 0 0 0
2 A 07/20/2019 .. 0 1 0 456
3 A 07/22/2019 .. 1 0 0 123
4 A 07/23/2019 .. 0 1 1 1
5 B 07/27/2019 .. 0 0 1 789
6 B 07/28/2019 .. 0 0 0 0
7 B 07/30/2019 .. 0 0 0 0
Be careful: the dtype of the 'choice' column is object, not int, because the column names are strings (even if they look like integers).
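If a uniform integer column is wanted, the digit strings convert cleanly (my addition, not part of the answer; it works only because the column names are all digits):
print(df['choice'].dtype)  #object
df['choice'] = df['choice'].astype(int)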
I am new to Python. I have created dummy columns on a categorical column using pandas get_dummies. How do I create dummy columns on an ordinal column (say a column Rating with values 1, 2, 3, ..., 10)?
Consider the dataframe df
df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))
df
Cats Ords
0 a 3
1 b 2
2 c 1
3 d 0
4 c 1
5 b 2
6 a 3
pd.get_dummies works the same on either column
with df.Cats
pd.get_dummies(df.Cats)
a b c d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 0 0 1 0
5 0 1 0 0
6 1 0 0 0
with df.Ords
pd.get_dummies(df.Ords)
0 1 2 3
0 0 0 0 1
1 0 0 1 0
2 0 1 0 0
3 1 0 0 0
4 0 1 0 0
5 0 0 1 0
6 0 0 0 1
with both
pd.get_dummies(df)
Ords Cats_a Cats_b Cats_c Cats_d
0 3 1 0 0 0
1 2 0 1 0 0
2 1 0 0 1 0
3 0 0 0 0 1
4 1 0 0 1 0
5 2 0 1 0 0
6 3 1 0 0 0
Notice that it split out Cats but not Ords
Let's expand on this by adding another Cats2 column and calling pd.get_dummies
pd.get_dummies(df.assign(Cats2=df.Cats))
Ords Cats_a Cats_b Cats_c Cats_d Cats2_a Cats2_b Cats2_c Cats2_d
0 3 1 0 0 0 1 0 0 0
1 2 0 1 0 0 0 1 0 0
2 1 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 1
4 1 0 0 1 0 0 0 1 0
5 2 0 1 0 0 0 1 0 0
6 3 1 0 0 0 1 0 0 0
Interesting, it splits both object columns but not the numeric one.
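That is the key to the original question: pd.get_dummies only expands object (and category) dtype columns of a DataFrame, so cast the ordinal column to string (or category) first if you want it split as well. A minimal sketch:
#Ords becomes an object column, so it now gets one-hot encoded too
pd.get_dummies(df.assign(Ords=df.Ords.astype(str)))
#produces Cats_a..Cats_d plus Ords_0..Ords_3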