How do I add a color column to the following dataframe so that color='green' if Set == 'Z', and color='red' otherwise?
Type Set
1 A Z
2 B Z
3 B X
4 C Y
If you only have two choices to select from:
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)
yields
Set Type color
0 Z A green
1 Z B green
2 X B red
3 Y C red
If you have more than two conditions then use np.select. For example, if you want color to be
yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
otherwise purple when (df['Type'] == 'B')
otherwise black,
then use
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
(df['Set'] == 'Z') & (df['Type'] == 'A'),
(df['Set'] == 'Z') & (df['Type'] == 'B'),
(df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)
which yields
Set Type color
0 Z A yellow
1 Z B blue
2 X B purple
3 Y C black
List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.
Example list comprehension:
df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit tests:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop
The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.
Simple example using just the "Set" column:
def set_color(row):
if row["Set"] == "Z":
return "red"
else:
return "green"
df = df.assign(color=df.apply(set_color, axis=1))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C green
Example with more colours and more columns taken into account:
def set_color(row):
if row["Set"] == "Z":
return "red"
elif row["Type"] == "C":
return "blue"
else:
return "green"
df = df.assign(color=df.apply(set_color, axis=1))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C blue
Edit (21/06/2019): Using plydata
It is also possible to use plydata to do this kind of things (this seems even slower than using assign and apply, though).
from plydata import define, if_else
Simple if_else:
df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C green
Nested if_else:
df = define(df, color=if_else(
'Set=="Z"',
'"red"',
if_else('Type=="C"', '"green"', '"blue"')))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B blue
3 Y C green
Another way in which this could be achieved is
df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:
def map_values(row, values_dict):
return values_dict[row]
values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})
df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))
What's it look like:
df
Out[2]:
INDICATOR VALUE NEW_VALUE
0 A 10 1
1 B 9 2
2 C 8 3
3 D 7 4
This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).
And of course you could always do this:
df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)
But that approach is more than three times as slow as the apply approach from above, on my machine.
And you could also do this, using dict.get:
df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]
You can simply use the powerful .loc method and use one condition or several depending on your need (tested with pandas=1.0.5).
Code Summary:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"
#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
Explanation:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
# df so far:
Type Set
0 A Z
1 B Z
2 B X
3 C Y
add a 'color' column and set all values to "red"
df['Color'] = "red"
Apply your single condition:
df.loc[(df['Set']=="Z"), 'Color'] = "green"
# df:
Type Set Color
0 A Z green
1 B Z green
2 B X red
3 C Y red
or multiple conditions if you want:
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
You can read on Pandas logical operators and conditional selection here:
Logical operators for boolean indexing in Pandas
You can use pandas methods where and mask:
df['color'] = 'green'
df['color'] = df['color'].where(df['Set']=='Z', other='red')
# Replace values where the condition is False
or
df['color'] = 'red'
df['color'] = df['color'].mask(df['Set']=='Z', other='green')
# Replace values where the condition is True
Alternatively, you can use the method transform with a lambda function:
df['color'] = df['Set'].transform(lambda x: 'green' if x == 'Z' else 'red')
Output:
Type Set color
1 A Z green
2 B Z green
3 B X red
4 C Y red
Performance comparison from #chai:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
%timeit df['color1'] = 'red'; df['color1'].where(df['Set']=='Z','green')
%timeit df['color2'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color3'] = np.where(df['Set']=='Z', 'red', 'green')
%timeit df['color4'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green')
397 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
976 ms ± 241 ms per loop
673 ms ± 139 ms per loop
796 ms ± 182 ms per loop
if you have only 2 choices, use np.where()
df = pd.DataFrame({'A':range(3)})
df['B'] = np.where(df.A>2, 'yes', 'no')
if you have over 2 choices, maybe apply() could work
input
arr = pd.DataFrame({'A':list('abc'), 'B':range(3), 'C':range(3,6), 'D':range(6, 9)})
and arr is
A B C D
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
if you want the column E tobe if arr.A =='a' then arr.B elif arr.A=='b' then arr.C elif arr.A == 'c' then arr.D else something_else
arr['E'] = arr.apply(lambda x: x['B'] if x['A']=='a' else(x['C'] if x['A']=='b' else(x['D'] if x['A']=='c' else 1234)), axis=1)
and finally the arr is
A B C D E
0 a 0 3 6 0
1 b 1 4 7 4
2 c 2 5 8 8
One liner with .apply() method is following:
df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')
After that, df data frame looks like this:
>>> print(df)
Type Set color
0 A Z green
1 B Z green
2 B X red
3 C Y red
The case_when function from pyjanitor is a wrapper around pd.Series.mask and offers a chainable/convenient form for multiple conditions:
For a single condition:
df.case_when(
df.col1 == "Z", # condition
"green", # value if True
"red", # value if False
column_name = "color"
)
Type Set color
1 A Z green
2 B Z green
3 B X red
4 C Y red
For multiple conditions:
df.case_when(
df.Set.eq('Z') & df.Type.eq('A'), 'yellow', # condition, result
df.Set.eq('Z') & df.Type.eq('B'), 'blue', # condition, result
df.Type.eq('B'), 'purple', # condition, result
'black', # default if none of the conditions evaluate to True
column_name = 'color'
)
Type Set color
1 A Z yellow
2 B Z blue
3 B X purple
4 C Y black
More examples can be found here
If you're working with massive data, a memoized approach would be best:
# First create a dictionary of manually stored values
color_dict = {'Z':'red'}
# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}
# Next, merge the two
color_dict.update(color_dict_other)
# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)
This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4
E.x. Memoize in a case 10,000 rows with 2,500 or fewer distinct values.
A Less verbose approach using np.select:
a = np.array([['A','Z'],['B','Z'],['B','X'],['C','Y']])
df = pd.DataFrame(a,columns=['Type','Set'])
conditions = [
df['Set'] == 'Z'
]
outputs = [
'Green'
]
# conditions Z is Green, Red Otherwise.
res = np.select(conditions, outputs, 'Red')
res
array(['Green', 'Green', 'Red', 'Red'], dtype='<U5')
df.insert(2, 'new_column',res)
df
Type Set new_column
0 A Z Green
1 B Z Green
2 B X Red
3 C Y Red
df.to_numpy()
array([['A', 'Z', 'Green'],
['B', 'Z', 'Green'],
['B', 'X', 'Red'],
['C', 'Y', 'Red']], dtype=object)
%%timeit conditions = [df['Set'] == 'Z']
outputs = ['Green']
np.select(conditions, outputs, 'Red')
134 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df2 = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
%%timeit conditions = [df2['Set'] == 'Z']
outputs = ['Green']
np.select(conditions, outputs, 'Red')
188 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here is an easy one-liner you can use when you have one or several conditions:
df['color'] = np.select(condlist=[df['Set']=="Z", df['Set']=="Y"], choicelist=["green", "yellow"], default="red")
Easy and good to go!
See more here: https://numpy.org/doc/stable/reference/generated/numpy.select.html
Related
How do I add a color column to the following dataframe so that color='green' if Set == 'Z', and color='red' otherwise?
Type Set
1 A Z
2 B Z
3 B X
4 C Y
If you only have two choices to select from:
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)
yields
Set Type color
0 Z A green
1 Z B green
2 X B red
3 Y C red
If you have more than two conditions then use np.select. For example, if you want color to be
yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
otherwise purple when (df['Type'] == 'B')
otherwise black,
then use
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
(df['Set'] == 'Z') & (df['Type'] == 'A'),
(df['Set'] == 'Z') & (df['Type'] == 'B'),
(df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)
which yields
Set Type color
0 Z A yellow
1 Z B blue
2 X B purple
3 Y C black
List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.
Example list comprehension:
df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit tests:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop
The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.
Simple example using just the "Set" column:
def set_color(row):
if row["Set"] == "Z":
return "red"
else:
return "green"
df = df.assign(color=df.apply(set_color, axis=1))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C green
Example with more colours and more columns taken into account:
def set_color(row):
if row["Set"] == "Z":
return "red"
elif row["Type"] == "C":
return "blue"
else:
return "green"
df = df.assign(color=df.apply(set_color, axis=1))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C blue
Edit (21/06/2019): Using plydata
It is also possible to use plydata to do this kind of things (this seems even slower than using assign and apply, though).
from plydata import define, if_else
Simple if_else:
df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C green
Nested if_else:
df = define(df, color=if_else(
'Set=="Z"',
'"red"',
if_else('Type=="C"', '"green"', '"blue"')))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B blue
3 Y C green
Another way in which this could be achieved is
df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:
def map_values(row, values_dict):
return values_dict[row]
values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})
df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))
What's it look like:
df
Out[2]:
INDICATOR VALUE NEW_VALUE
0 A 10 1
1 B 9 2
2 C 8 3
3 D 7 4
This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).
And of course you could always do this:
df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)
But that approach is more than three times as slow as the apply approach from above, on my machine.
And you could also do this, using dict.get:
df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]
You can simply use the powerful .loc method and use one condition or several depending on your need (tested with pandas=1.0.5).
Code Summary:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"
#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
Explanation:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
# df so far:
Type Set
0 A Z
1 B Z
2 B X
3 C Y
add a 'color' column and set all values to "red"
df['Color'] = "red"
Apply your single condition:
df.loc[(df['Set']=="Z"), 'Color'] = "green"
# df:
Type Set Color
0 A Z green
1 B Z green
2 B X red
3 C Y red
or multiple conditions if you want:
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
You can read on Pandas logical operators and conditional selection here:
Logical operators for boolean indexing in Pandas
You can use pandas methods where and mask:
df['color'] = 'green'
df['color'] = df['color'].where(df['Set']=='Z', other='red')
# Replace values where the condition is False
or
df['color'] = 'red'
df['color'] = df['color'].mask(df['Set']=='Z', other='green')
# Replace values where the condition is True
Alternatively, you can use the method transform with a lambda function:
df['color'] = df['Set'].transform(lambda x: 'green' if x == 'Z' else 'red')
Output:
Type Set color
1 A Z green
2 B Z green
3 B X red
4 C Y red
Performance comparison from #chai:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
%timeit df['color1'] = 'red'; df['color1'].where(df['Set']=='Z','green')
%timeit df['color2'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color3'] = np.where(df['Set']=='Z', 'red', 'green')
%timeit df['color4'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green')
397 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
976 ms ± 241 ms per loop
673 ms ± 139 ms per loop
796 ms ± 182 ms per loop
if you have only 2 choices, use np.where()
df = pd.DataFrame({'A':range(3)})
df['B'] = np.where(df.A>2, 'yes', 'no')
if you have over 2 choices, maybe apply() could work
input
arr = pd.DataFrame({'A':list('abc'), 'B':range(3), 'C':range(3,6), 'D':range(6, 9)})
and arr is
A B C D
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
if you want the column E tobe if arr.A =='a' then arr.B elif arr.A=='b' then arr.C elif arr.A == 'c' then arr.D else something_else
arr['E'] = arr.apply(lambda x: x['B'] if x['A']=='a' else(x['C'] if x['A']=='b' else(x['D'] if x['A']=='c' else 1234)), axis=1)
and finally the arr is
A B C D E
0 a 0 3 6 0
1 b 1 4 7 4
2 c 2 5 8 8
One liner with .apply() method is following:
df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')
After that, df data frame looks like this:
>>> print(df)
Type Set color
0 A Z green
1 B Z green
2 B X red
3 C Y red
The case_when function from pyjanitor is a wrapper around pd.Series.mask and offers a chainable/convenient form for multiple conditions:
For a single condition:
df.case_when(
df.col1 == "Z", # condition
"green", # value if True
"red", # value if False
column_name = "color"
)
Type Set color
1 A Z green
2 B Z green
3 B X red
4 C Y red
For multiple conditions:
df.case_when(
df.Set.eq('Z') & df.Type.eq('A'), 'yellow', # condition, result
df.Set.eq('Z') & df.Type.eq('B'), 'blue', # condition, result
df.Type.eq('B'), 'purple', # condition, result
'black', # default if none of the conditions evaluate to True
column_name = 'color'
)
Type Set color
1 A Z yellow
2 B Z blue
3 B X purple
4 C Y black
More examples can be found here
If you're working with massive data, a memoized approach would be best:
# First create a dictionary of manually stored values
color_dict = {'Z':'red'}
# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}
# Next, merge the two
color_dict.update(color_dict_other)
# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)
This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4
E.x. Memoize in a case 10,000 rows with 2,500 or fewer distinct values.
A Less verbose approach using np.select:
a = np.array([['A','Z'],['B','Z'],['B','X'],['C','Y']])
df = pd.DataFrame(a,columns=['Type','Set'])
conditions = [
df['Set'] == 'Z'
]
outputs = [
'Green'
]
# conditions Z is Green, Red Otherwise.
res = np.select(conditions, outputs, 'Red')
res
array(['Green', 'Green', 'Red', 'Red'], dtype='<U5')
df.insert(2, 'new_column',res)
df
Type Set new_column
0 A Z Green
1 B Z Green
2 B X Red
3 C Y Red
df.to_numpy()
array([['A', 'Z', 'Green'],
['B', 'Z', 'Green'],
['B', 'X', 'Red'],
['C', 'Y', 'Red']], dtype=object)
%%timeit conditions = [df['Set'] == 'Z']
outputs = ['Green']
np.select(conditions, outputs, 'Red')
134 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df2 = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
%%timeit conditions = [df2['Set'] == 'Z']
outputs = ['Green']
np.select(conditions, outputs, 'Red')
188 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here is an easy one-liner you can use when you have one or several conditions:
df['color'] = np.select(condlist=[df['Set']=="Z", df['Set']=="Y"], choicelist=["green", "yellow"], default="red")
Easy and good to go!
See more here: https://numpy.org/doc/stable/reference/generated/numpy.select.html
Consider the following lists short_list and long_list
short_list = list('aaabaaacaaadaaac')
np.random.seed([3,1415])
long_list = pd.DataFrame(
np.random.choice(list(ascii_letters),
(10000, 2))
).sum(1).tolist()
How do I calculate the cumulative count by unique value?
I want to use numpy and do it in linear time. I want this to compare timings with my other methods. It may be easiest to illustrate with my first proposed solution
def pir1(l):
s = pd.Series(l)
return s.groupby(s).cumcount().tolist()
print(np.array(short_list))
print(pir1(short_list))
['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c']
[0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]
I've tortured myself trying to use np.unique because it returns a counts array, an inverse array, and an index array. I was sure I could these to get at a solution. The best I got is in pir4 below which scales in quadratic time. Also note that I don't care if counts start at 1 or zero as we can simply add or subtract 1.
Below are some of my attempts (none of which answer my question)
%%cython
from collections import defaultdict
def get_generator(l):
counter = defaultdict(lambda: -1)
for i in l:
counter[i] += 1
yield counter[i]
def pir2(l):
return [i for i in get_generator(l)]
def pir3(l):
return [i for i in get_generator(l)]
def pir4(l):
unq, inv = np.unique(l, 0, 1, 0)
a = np.arange(len(unq))
matches = a[:, None] == inv
return (matches * matches.cumsum(1)).sum(0).tolist()
setup
short_list = np.array(list('aaabaaacaaadaaac'))
functions
dfill takes an array and returns the positions where the array changes and repeats that index position until the next change.
# dfill
#
# Example with short_list
#
# 0 0 0 3 4 4 4 7 8 8 8 11 12 12 12 15
# [ a a a b a a a c a a a d a a a c]
#
# Example with short_list after sorting
#
# 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15
# [ a a a a a a a a a a a a b c c d]
argunsort returns the permutation necessary to undo a sort given the argsort array. The existence of this method became know to me via this post.. With this, I can get the argsort array and sort my array with it. Then I can undo the sort without the overhead of sorting again.
cumcount will take an array sort it, find the dfill array. An np.arange less dfill will give me cumulative count. Then I un-sort
# cumcount
#
# Example with short_list
#
# short_list:
# [ a a a b a a a c a a a d a a a c]
#
# short_list.argsort():
# [ 0 1 2 4 5 6 8 9 10 12 13 14 3 7 15 11]
#
# Example with short_list after sorting
#
# short_list[short_list.argsort()]:
# [ a a a a a a a a a a a a b c c d]
#
# dfill(short_list[short_list.argsort()]):
# [ 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15]
#
# np.range(short_list.size):
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
#
# np.range(short_list.size) -
# dfill(short_list[short_list.argsort()]):
# [ 0 1 2 3 4 5 6 7 8 9 10 11 0 0 1 0]
#
# unsorted:
# [ 0 1 2 0 3 4 5 0 6 7 8 0 9 10 11 1]
foo function recommended by #hpaulj using defaultdict
div function recommended by #Divakar (old, I'm sure he'd update it)
code
def dfill(a):
n = a.size
b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
return np.arange(n)[b[:-1]].repeat(np.diff(b))
def argunsort(s):
n = s.size
u = np.empty(n, dtype=np.int64)
u[s] = np.arange(n)
return u
def cumcount(a):
n = a.size
s = a.argsort(kind='mergesort')
i = argunsort(s)
b = a[s]
return (np.arange(n) - dfill(b))[i]
def foo(l):
n = len(l)
r = np.empty(n, dtype=np.int64)
counter = defaultdict(int)
for i in range(n):
counter[l[i]] += 1
r[i] = counter[l[i]]
return r - 1
def div(l):
a = np.unique(l, return_counts=1)[1]
idx = a.cumsum()
id_arr = np.ones(idx[-1],dtype=int)
id_arr[0] = 0
id_arr[idx[:-1]] = -a[:-1]+1
rng = id_arr.cumsum()
return rng[argunsort(np.argsort(l))]
demonstration
cumcount(short_list)
array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
time testing
code
functions = pd.Index(['cumcount', 'foo', 'foo2', 'div'], name='function')
lengths = pd.RangeIndex(100, 1100, 100, 'array length')
results = pd.DataFrame(index=lengths, columns=functions)
from string import ascii_letters
for i in lengths:
a = np.random.choice(list(ascii_letters), i)
for j in functions:
results.set_value(
i, j,
timeit(
'{}(a)'.format(j),
'from __main__ import a, {}'.format(j),
number=1000
)
)
results.plot()
Here's a vectorized approach using custom grouped range creating function and np.unique for getting the counts -
def grp_range(a):
idx = a.cumsum()
id_arr = np.ones(idx[-1],dtype=int)
id_arr[0] = 0
id_arr[idx[:-1]] = -a[:-1]+1
return id_arr.cumsum()
count = np.unique(A,return_counts=1)[1]
out = grp_range(count)[np.argsort(A).argsort()]
Sample run -
In [117]: A = list('aaabaaacaaadaaac')
In [118]: count = np.unique(A,return_counts=1)[1]
...: out = grp_range(count)[np.argsort(A).argsort()]
...:
In [119]: out
Out[119]: array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
For getting the count, few other alternatives could be proposed with focus on performance -
np.bincount(np.unique(A,return_inverse=1)[1])
np.bincount(np.fromstring('aaabaaacaaadaaac',dtype=np.uint8)-97)
Additionally, with A containing single-letter characters, we could get the count simply with -
np.bincount(np.array(A).view('uint8')-97)
Besides defaultdict there are a couple of other counters. Testing a slightly simpler case:
In [298]: from collections import defaultdict
In [299]: from collections import defaultdict, Counter
In [300]: def foo(l):
...: counter = defaultdict(int)
...: for i in l:
...: counter[i] += 1
...: return counter
...:
In [301]: short_list = list('aaabaaacaaadaaac')
In [302]: foo(short_list)
Out[302]: defaultdict(int, {'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [303]: Counter(short_list)
Out[303]: Counter({'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [304]: arr=[ord(i)-ord('a') for i in short_list]
In [305]: np.bincount(arr)
Out[305]: array([12, 1, 2, 1], dtype=int32)
I constructed arr because bincount only works with ints.
In [306]: timeit np.bincount(arr)
The slowest run took 82.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.63 µs per loop
In [307]: timeit Counter(arr)
100000 loops, best of 3: 13.6 µs per loop
In [308]: timeit foo(arr)
100000 loops, best of 3: 6.49 µs per loop
I'm guessing it would hard to improve on pir2 based on default_dict.
Searching and counting like this are not a strong area for numpy.
How do I add a color column to the following dataframe so that color='green' if Set == 'Z', and color='red' otherwise?
Type Set
1 A Z
2 B Z
3 B X
4 C Y
If you only have two choices to select from:
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)
yields
Set Type color
0 Z A green
1 Z B green
2 X B red
3 Y C red
If you have more than two conditions then use np.select. For example, if you want color to be
yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
otherwise purple when (df['Type'] == 'B')
otherwise black,
then use
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
(df['Set'] == 'Z') & (df['Type'] == 'A'),
(df['Set'] == 'Z') & (df['Type'] == 'B'),
(df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)
which yields
Set Type color
0 Z A yellow
1 Z B blue
2 X B purple
3 Y C black
List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.
Example list comprehension:
df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit tests:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop
The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.
Simple example using just the "Set" column:
def set_color(row):
if row["Set"] == "Z":
return "red"
else:
return "green"
df = df.assign(color=df.apply(set_color, axis=1))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C green
Example with more colours and more columns taken into account:
def set_color(row):
if row["Set"] == "Z":
return "red"
elif row["Type"] == "C":
return "blue"
else:
return "green"
df = df.assign(color=df.apply(set_color, axis=1))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C blue
Edit (21/06/2019): Using plydata
It is also possible to use plydata to do this kind of things (this seems even slower than using assign and apply, though).
from plydata import define, if_else
Simple if_else:
df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C green
Nested if_else:
df = define(df, color=if_else(
'Set=="Z"',
'"red"',
if_else('Type=="C"', '"green"', '"blue"')))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B blue
3 Y C green
Another way in which this could be achieved is
df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:
def map_values(row, values_dict):
return values_dict[row]
values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})
df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))
What's it look like:
df
Out[2]:
INDICATOR VALUE NEW_VALUE
0 A 10 1
1 B 9 2
2 C 8 3
3 D 7 4
This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).
And of course you could always do this:
df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)
But that approach is more than three times as slow as the apply approach from above, on my machine.
And you could also do this, using dict.get:
df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]
You can simply use the powerful .loc method and use one condition or several depending on your need (tested with pandas=1.0.5).
Code Summary:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"
#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
Explanation:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
# df so far:
Type Set
0 A Z
1 B Z
2 B X
3 C Y
add a 'color' column and set all values to "red"
df['Color'] = "red"
Apply your single condition:
df.loc[(df['Set']=="Z"), 'Color'] = "green"
# df:
Type Set Color
0 A Z green
1 B Z green
2 B X red
3 C Y red
or multiple conditions if you want:
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
You can read on Pandas logical operators and conditional selection here:
Logical operators for boolean indexing in Pandas
You can use pandas methods where and mask:
df['color'] = 'green'
df['color'] = df['color'].where(df['Set']=='Z', other='red')
# Replace values where the condition is False
or
df['color'] = 'red'
df['color'] = df['color'].mask(df['Set']=='Z', other='green')
# Replace values where the condition is True
Alternatively, you can use the method transform with a lambda function:
df['color'] = df['Set'].transform(lambda x: 'green' if x == 'Z' else 'red')
Output:
Type Set color
1 A Z green
2 B Z green
3 B X red
4 C Y red
Performance comparison from #chai:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
%timeit df['color1'] = 'red'; df['color1'].where(df['Set']=='Z','green')
%timeit df['color2'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color3'] = np.where(df['Set']=='Z', 'red', 'green')
%timeit df['color4'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green')
397 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
976 ms ± 241 ms per loop
673 ms ± 139 ms per loop
796 ms ± 182 ms per loop
if you have only 2 choices, use np.where()
df = pd.DataFrame({'A':range(3)})
df['B'] = np.where(df.A>2, 'yes', 'no')
if you have over 2 choices, maybe apply() could work
input
arr = pd.DataFrame({'A':list('abc'), 'B':range(3), 'C':range(3,6), 'D':range(6, 9)})
and arr is
A B C D
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
if you want the column E tobe if arr.A =='a' then arr.B elif arr.A=='b' then arr.C elif arr.A == 'c' then arr.D else something_else
arr['E'] = arr.apply(lambda x: x['B'] if x['A']=='a' else(x['C'] if x['A']=='b' else(x['D'] if x['A']=='c' else 1234)), axis=1)
and finally the arr is
A B C D E
0 a 0 3 6 0
1 b 1 4 7 4
2 c 2 5 8 8
One liner with .apply() method is following:
df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')
After that, df data frame looks like this:
>>> print(df)
Type Set color
0 A Z green
1 B Z green
2 B X red
3 C Y red
The case_when function from pyjanitor is a wrapper around pd.Series.mask and offers a chainable/convenient form for multiple conditions:
For a single condition:
df.case_when(
df.col1 == "Z", # condition
"green", # value if True
"red", # value if False
column_name = "color"
)
Type Set color
1 A Z green
2 B Z green
3 B X red
4 C Y red
For multiple conditions:
df.case_when(
df.Set.eq('Z') & df.Type.eq('A'), 'yellow', # condition, result
df.Set.eq('Z') & df.Type.eq('B'), 'blue', # condition, result
df.Type.eq('B'), 'purple', # condition, result
'black', # default if none of the conditions evaluate to True
column_name = 'color'
)
Type Set color
1 A Z yellow
2 B Z blue
3 B X purple
4 C Y black
More examples can be found here
If you're working with massive data, a memoized approach would be best:
# First create a dictionary of manually stored values
color_dict = {'Z':'red'}
# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}
# Next, merge the two
color_dict.update(color_dict_other)
# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)
This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4
E.x. Memoize in a case 10,000 rows with 2,500 or fewer distinct values.
A Less verbose approach using np.select:
a = np.array([['A','Z'],['B','Z'],['B','X'],['C','Y']])
df = pd.DataFrame(a,columns=['Type','Set'])
conditions = [
df['Set'] == 'Z'
]
outputs = [
'Green'
]
# conditions Z is Green, Red Otherwise.
res = np.select(conditions, outputs, 'Red')
res
array(['Green', 'Green', 'Red', 'Red'], dtype='<U5')
df.insert(2, 'new_column',res)
df
Type Set new_column
0 A Z Green
1 B Z Green
2 B X Red
3 C Y Red
df.to_numpy()
array([['A', 'Z', 'Green'],
['B', 'Z', 'Green'],
['B', 'X', 'Red'],
['C', 'Y', 'Red']], dtype=object)
%%timeit conditions = [df['Set'] == 'Z']
outputs = ['Green']
np.select(conditions, outputs, 'Red')
134 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df2 = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
%%timeit conditions = [df2['Set'] == 'Z']
outputs = ['Green']
np.select(conditions, outputs, 'Red')
188 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here is an easy one-liner you can use when you have one or several conditions:
df['color'] = np.select(condlist=[df['Set']=="Z", df['Set']=="Y"], choicelist=["green", "yellow"], default="red")
Easy and good to go!
See more here: https://numpy.org/doc/stable/reference/generated/numpy.select.html
Have a dataframe with location and column for lookup as follows:
import pandas as pd
import numpy as np
i = ['dog', 'cat', 'bird', 'donkey'] * 100000
df1 = pd.DataFrame(np.random.randint(1, high=380, size=len(i)),
['cat', 'bird', 'donkey', 'dog'] * 100000).reset_index()
df1.columns = ['animal', 'locn']
df1.head()
The dataframe to be looked at is as follows:
df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
columns=list('AB')).rename_axis('animal').sort_index(0).reset_index()
df
Looking for a faster way to assign a column with value of B, for every record in df1.
df1.assign(val=[df[df.animal == a].iloc[b].B for a, b in zip(df1.animal, df1['locn'])])
...is pretty slow.
Use GroupBy.cumcount for counter of animal column for positions, so possible use merge with left join:
df['locn'] = df.groupby('animal').cumcount()
df1['new'] = df1.merge(df.reset_index(), on=['animal','locn'], how='left')['B']
Verfify in smaller dataframe:
np.random.seed(2019)
i = ['dog', 'cat', 'bird', 'donkey'] * 100
df1 = pd.DataFrame(np.random.randint(1, high=10, size=len(i)),
['cat', 'bird', 'donkey', 'dog'] * 100).reset_index()
df1.columns = ['animal', 'locn']
print (df1)
df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
columns=list('AB')).rename_axis('animal').sort_index(0).reset_index()
df1 = df1.assign(val=[df[df.animal == a].iloc[b].B for a, b in zip(df1.animal, df1['locn'])])
df['locn'] = df.groupby('animal').cumcount()
df1['new'] = df1.merge(df.reset_index(), on=['animal','locn'], how='left')['B']
locn = df.groupby('animal').cumcount()
df1 = df1.assign(new1 = df1.merge(df.reset_index().assign(locn = locn),
on=['animal','locn'], how='left')['B'])
print (df1.head(10))
animal locn val new new1
0 cat 9 -0.535465 -0.535465 -0.535465
1 bird 3 0.296240 0.296240 0.296240
2 donkey 6 0.222638 0.222638 0.222638
3 dog 9 1.115175 1.115175 1.115175
4 cat 7 0.608889 0.608889 0.608889
5 bird 9 -0.025648 -0.025648 -0.025648
6 donkey 1 0.324736 0.324736 0.324736
7 dog 1 0.533579 0.533579 0.533579
8 cat 8 -1.818238 -1.818238 -1.818238
9 bird 9 -0.025648 -0.025648 -0.025648
Sorry if this question has been asked before, but I did not find it here nor somewhere else:
I want to fill some of the fields of a column with tuples. Currently I would have to resort to:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4]})
df['b'] = ''
df['b'] = df['b'].astype(object)
mytuple = ('x','y')
for l in df[df.a % 2 == 0].index:
df.set_value(l, 'b', mytuple)
with df being (which is what I want)
a b
0 1
1 2 (x, y)
2 3
3 4 (x, y)
This does not look very elegant to me and probably not very efficient. Instead of the loop, I would prefer something like
df.loc[df.a % 2 == 0, 'b'] = np.array([mytuple] * sum(df.a % 2 == 0), dtype=tuple)
which (of course) does not work. How can I improve my above method by using slicing?
In [57]: df.loc[df.a % 2 == 0, 'b'] = pd.Series([mytuple] * len(df.loc[df.a % 2 == 0])).values
In [58]: df
Out[58]:
a b
0 1
1 2 (x, y)
2 3
3 4 (x, y)