Adding and updating a pandas column based on conditions of other columns - pandas

I have a DataFrame with over 1 million rows.
One column, 'activity', contains numbers from 1 to 12.
I added a new empty column called 'label'.
The 'label' column needs to be filled with 0 or 1, based on the value of 'activity'.
If activity is 1, 2, 3, 6, 7, or 8, label should be 0; otherwise it should be 1.
Here is what I am currently doing:
import pandas as pd
df = pd.read_csv('data.csv')
df['label'] = ''
for index, row in df.iterrows():
    if (row['activity'] == 1 or row['activity'] == 2 or row['activity'] == 3 or row['activity'] == 6 or row['activity'] == 7 or row['activity'] == 8):
        df.loc[index, 'label'] = 0
    else:
        df.loc[index, 'label'] = 1
df.to_csv('data.csv', index=False)
This is very inefficient and takes too long to run. Are there any optimizations, possibly using numpy arrays? And is there any way to make the code cleaner?

Use numpy.where with Series.isin:
df['label'] = np.where(df['activity'].isin([1, 2, 3, 6, 7, 8]), 0, 1)
Or invert the boolean mask and convert True/False to 1/0:
df['label'] = (~df['activity'].isin([1, 2, 3, 6, 7, 8])).astype(int)
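To see that both variants give the same labels, here is a small self-contained sketch. The sample values and the column names label_where and label_astype are made up for illustration; only the 'activity' column and the list of values come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the 1M-row file
df = pd.DataFrame({"activity": [1, 4, 6, 12, 3, 9]})

# Activities that should be labelled 0 (taken from the question)
zero_activities = [1, 2, 3, 6, 7, 8]

# Boolean mask: True where the activity is in the list
mask = df["activity"].isin(zero_activities)

# Variant 1: pick 0 where the mask is True, 1 elsewhere
df["label_where"] = np.where(mask, 0, 1)

# Variant 2: invert the mask, then cast True/False to 1/0
df["label_astype"] = (~mask).astype(int)

print(df)
```

Both variants work on the whole column at once, so there is no Python-level loop over the rows.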

Related

Comparison of values in Dataframes with different size

I have a DataFrame in which I want to compare the speed of certain IDs at different conditions.
Boundary conditions:
IDs do not have to be represented in every condition,
ID is not represented in every condition with the same frequency.
My goal is to label, for each condition, whether the speed is
larger (speed > speed in Cond_A + 10%),
smaller (speed < speed in Cond_A - 10%), or
the same (speed < speed in Cond_A + 10% and speed > speed in Cond_A - 10%)
compared with Cond_A.
The data
import numpy as np
import pandas as pd
data1 = {
'ID' : [1, 1, 1, 2, 3, 3, 4, 5],
'Condition' : ['Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A','Cond_A','Cond_A', ],
'Speed' : [1.2, 1.05, 1.2, 1.3, 1.0, 0.85, 1.1, 0.85],
}
df1 = pd.DataFrame(data1)
data2 = {
'ID' : [1, 2, 3, 4, 5, 6],
'Condition' : ['Cond_B', 'Cond_B', 'Cond_B', 'Cond_B', 'Cond_B', 'Cond_B' ],
'Speed' : [0.8, 0.55, 0.7, 1.15, 1.2, 1.4],
}
df2 = pd.DataFrame(data2)
data3 = {
'ID' : [1, 2, 3, 4, 6],
'Condition' : ['Cond_C', 'Cond_C', 'Cond_C', 'Cond_C', 'Cond_C' ],
'Speed' : [1.8, 0.99, 1.7, 131, 0.2, ],
}
df3 = pd.DataFrame(data3)
lst_of_dfs = [df1,df2, df3]
# creating a Dataframe object
data = pd.concat(lst_of_dfs)
My goal is to achieve a result like this:
Condition ID Speed Category
0 Cond_A 1 1.150 NaN
1 Cond_A 2 1.300 NaN
2 Cond_A 3 0.925 NaN
3 Cond_A 4 1.100 NaN
4 Cond_A 5 0.850 NaN
5 Cond_B 1 0.800 faster
6 Cond_B 2 0.550 slower
7 Cond_B 3 0.700 slower
8 Cond_B 4 1.150 equal
...
My attempt:
Calculate average of speed for each ID per condition
data = data.groupby(["Condition", "ID"]).mean()["Speed"].reset_index()
Define the thresholds, assuming a band of 10 percent around the Cond_A values:
threshold_upper = data.loc[(data.Condition == 'CondA')]['Speed'] + (data.loc[(data.Condition == 'CondA')]['Speed']*10/100)
threshold_lower = data.loc[(data.Condition == 'CondA')]['Speed'] - (data.loc[(data.Condition == 'CondA')]['Speed']*10/100)
Mapping strings 'faster', 'equal', 'slower' based on condition using numpy select.
conditions = [
(data.loc[(data.Condition == 'CondB')]['Speed'] > threshold_upper), #check whether Speed of each ID in CondB is faster than Speed in CondA+10%
(data.loc[(data.Condition == 'CondC')]['Speed'] > threshold_upper), #check whether Speed of each ID in CondC is faster than Speed in CondA+10%
((data.loc[(data.Condition == 'CondB')]['Speed'] < threshold_upper) & (data.loc[(data.Condition == 'CondB')]['Speed'] > threshold_lower)), #check whether Speed of each ID in CondB is slower than Speed in CondA+10% AND faster than Speed in CondA-10%
((data.loc[(data.Condition == 'CondC')]['Speed'] < threshold_upper) & (data.loc[(data.Condition == 'CondC')]['Speed'] > threshold_lower)), #check whether Speed of each ID in CondC is slower than Speed in CondA+10% AND faster than Speed in CondA-10%
(data.loc[(data.Condition == 'CondB')]['Speed'] < threshold_upper), #check whether Speed of each ID in CondB is slower than Speed in CondA-10%
(data.loc[(data.Condition == 'CondC')]['Speed'] < threshold_upper), #check whether Speed of each ID in CondC is faster than Speed in CondA-10%
]
values = [
'faster',
'faster',
'equal',
'equal',
'slower',
'slower'
]
data['Category'] = np.select(conditions, values)
This produces the error: ValueError: Length of values (0) does not match length of index (16)
My data frames unfortunately have a different length (since not all IDs performed all trials to each condition). I appreciate any hint. Many thanks in advance.
# Dataframe created
data
ID Condition Speed
0 1 Cond_A 1.20
1 1 Cond_A 1.05
2 1 Cond_A 1.20
# Reset the index
data = data.reset_index(drop=True)
# Creating based on ID
data['group'] = data.groupby(['ID']).ngroup()
# Creating functions which returns the upper and lower limit of speed
def lowlimit(x):
    return x[x['Condition']=='Cond_A'].Speed.mean() * 0.9
def upperlimit(x):
    return x[x['Condition']=='Cond_A'].Speed.mean() * 1.1
# Calculate the upperlimit and lowerlimit for the groups
df = pd.DataFrame()
df['ul'] = data.groupby('group').apply(lambda x: upperlimit(x))
df['ll'] = data.groupby('group').apply(lambda x: lowlimit(x))
# reseting the index
# So that we can merge the values of 'group' column
df = df.reset_index()
# Merging the data and df dataframe
data_new = pd.merge(data,df,on='group',how='left')
data_new
ID Condition Speed group ul ll
0 1 Cond_A 1.20 0 1.2650 1.0350
1 1 Cond_A 1.05 0 1.2650 1.0350
2 1 Cond_A 1.20 0 1.2650 1.0350
3 2 Cond_A 1.30 1 1.4300 1.1700
Now we have to apply the conditions
data_new.loc[(data_new['Speed'] >= data_new['ul']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'larger'
data_new.loc[(data_new['Speed'] <= data_new['ll']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'smaller'
data_new.loc[(data_new['Speed'] < data_new['ul']) & (data_new['Speed'] > data_new['ll']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'Same'
You can now drop the helper columns if you no longer need them:
data_new = data_new.drop(columns=['group','ul','ll'])
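As a possible alternative to the groupby/apply/merge steps, the Cond_A baseline can be computed once per ID and aligned to every row with Series.map. This is only a sketch under assumptions: the small data frame below is invented for illustration, and the names baseline, ref and is_other are hypothetical helpers, not part of the answer above.

```python
import pandas as pd

# Made-up sample in the same shape as the question's data
data = pd.DataFrame({
    "ID": [1, 1, 2, 1, 2, 3],
    "Condition": ["Cond_A", "Cond_A", "Cond_A", "Cond_B", "Cond_B", "Cond_B"],
    "Speed": [1.0, 1.2, 2.0, 1.5, 2.05, 0.7],
})

# Mean Cond_A speed per ID; IDs without Cond_A rows are simply absent
baseline = data.loc[data["Condition"] == "Cond_A"].groupby("ID")["Speed"].mean()

# Align the baseline to every row by ID (NaN where no baseline exists)
ref = data["ID"].map(baseline)

# Only rows outside Cond_A get a category
is_other = data["Condition"] != "Cond_A"

data["Category"] = None
data.loc[is_other & (data["Speed"] >= ref * 1.1), "Category"] = "larger"
data.loc[is_other & (data["Speed"] <= ref * 0.9), "Category"] = "smaller"
data.loc[
    is_other & (data["Speed"] > ref * 0.9) & (data["Speed"] < ref * 1.1),
    "Category",
] = "same"

print(data)
```

Comparisons against NaN are False, so an ID that never appears in Cond_A (ID 3 in this sample) keeps an empty Category instead of raising an error.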

How to index the unique value count in numpy? [duplicate]

Consider the following lists short_list and long_list
from string import ascii_letters
import numpy as np
import pandas as pd
short_list = list('aaabaaacaaadaaac')
np.random.seed([3,1415])
long_list = pd.DataFrame(
    np.random.choice(list(ascii_letters),
                     (10000, 2))
).sum(1).tolist()
How do I calculate the cumulative count by unique value?
I want to use numpy and do it in linear time. I want this to compare timings with my other methods. It may be easiest to illustrate with my first proposed solution
def pir1(l):
    s = pd.Series(l)
    return s.groupby(s).cumcount().tolist()
print(np.array(short_list))
print(pir1(short_list))
['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c']
[0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]
I've tortured myself trying to use np.unique because it returns a counts array, an inverse array, and an index array. I was sure I could use these to get at a solution. The best I got is pir4 below, which scales in quadratic time. Also note that I don't care whether counts start at 1 or zero, as we can simply add or subtract 1.
Below are some of my attempts (none of which answer my question)
%%cython
from collections import defaultdict
def get_generator(l):
    counter = defaultdict(lambda: -1)
    for i in l:
        counter[i] += 1
        yield counter[i]
def pir2(l):
    return [i for i in get_generator(l)]
def pir3(l):
    return [i for i in get_generator(l)]
def pir4(l):
    unq, inv = np.unique(l, return_inverse=True)
    a = np.arange(len(unq))
    matches = a[:, None] == inv
    return (matches * matches.cumsum(1)).sum(0).tolist()
setup
short_list = np.array(list('aaabaaacaaadaaac'))
functions
dfill takes an array and returns the positions where the array changes and repeats that index position until the next change.
# dfill
#
# Example with short_list
#
# 0 0 0 3 4 4 4 7 8 8 8 11 12 12 12 15
# [ a a a b a a a c a a a d a a a c]
#
# Example with short_list after sorting
#
# 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15
# [ a a a a a a a a a a a a b c c d]
argunsort returns the permutation necessary to undo a sort given the argsort array. The existence of this method became known to me via this post. With this, I can get the argsort array and sort my array with it. Then I can undo the sort without the overhead of sorting again.
cumcount takes an array, sorts it, and finds the dfill array. An np.arange less dfill gives the cumulative count. Then I un-sort.
# cumcount
#
# Example with short_list
#
# short_list:
# [ a a a b a a a c a a a d a a a c]
#
# short_list.argsort():
# [ 0 1 2 4 5 6 8 9 10 12 13 14 3 7 15 11]
#
# Example with short_list after sorting
#
# short_list[short_list.argsort()]:
# [ a a a a a a a a a a a a b c c d]
#
# dfill(short_list[short_list.argsort()]):
# [ 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15]
#
# np.arange(short_list.size):
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
#
# np.arange(short_list.size) -
# dfill(short_list[short_list.argsort()]):
# [ 0 1 2 3 4 5 6 7 8 9 10 11 0 0 1 0]
#
# unsorted:
# [ 0 1 2 0 3 4 5 0 6 7 8 0 9 10 11 1]
foo function recommended by @hpaulj using defaultdict
div function recommended by @Divakar (old, I'm sure he'd update it)
code
import numpy as np
from collections import defaultdict
def dfill(a):
    # Start index of each run of equal values, repeated over the run
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))
def argunsort(s):
    # Inverse permutation of the argsort array s
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u
def cumcount(a):
    n = a.size
    # mergesort is stable, so equal values keep their original order
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]
def foo(l):
    n = len(l)
    r = np.empty(n, dtype=np.int64)
    counter = defaultdict(int)
    for i in range(n):
        counter[l[i]] += 1
        r[i] = counter[l[i]]
    return r - 1
def div(l):
    a = np.unique(l, return_counts=1)[1]
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    rng = id_arr.cumsum()
    return rng[argunsort(np.argsort(l))]
demonstration
cumcount(short_list)
array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
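The role of argunsort may be easier to see in isolation. The sketch below repeats the helper so it runs on its own; the names s, u and restored are illustrative. It checks two things: indexing the sorted array with the inverse permutation restores the original order, and the inverse permutation equals the argsort of the argsort array.

```python
import numpy as np

def argunsort(s):
    # Inverse permutation: u[s[k]] = k for every position k
    u = np.empty(s.size, dtype=np.int64)
    u[s] = np.arange(s.size)
    return u

a = np.array(list("aaabaaacaaadaaac"))

# Stable sort order of a, and the permutation that undoes it
s = a.argsort(kind="stable")
u = argunsort(s)

# Sorting and then un-sorting gives back the original array
restored = a[s][u]
print((restored == a).all())

# argunsort(s) matches argsort applied to s, without a second sort
print((u == s.argsort()).all())
```

Since argunsort only scatters an arange, it runs in linear time, whereas a second argsort would sort again.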
time testing
code
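from timeit import timeit
from string import ascii_letters
# 'foo2' is not defined in this post; remove it from the list unless you define it yourself
functions = pd.Index(['cumcount', 'foo', 'foo2', 'div'], name='function')
lengths = pd.RangeIndex(100, 1100, 100, name='array length')
results = pd.DataFrame(index=lengths, columns=functions)
for i in lengths:
    a = np.random.choice(list(ascii_letters), i)
    for j in functions:
        # DataFrame.set_value was removed from pandas; .at is the scalar setter
        results.at[i, j] = timeit(
            '{}(a)'.format(j),
            'from __main__ import a, {}'.format(j),
            number=1000
        )
# Cast to float so the object-dtype results can be plotted
results.astype(float).plot()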
Here's a vectorized approach using a custom grouped-range function and np.unique to get the counts:
def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    return id_arr.cumsum()
count = np.unique(A,return_counts=1)[1]
out = grp_range(count)[np.argsort(A).argsort()]
Sample run -
In [117]: A = list('aaabaaacaaadaaac')
In [118]: count = np.unique(A,return_counts=1)[1]
...: out = grp_range(count)[np.argsort(A).argsort()]
...:
In [119]: out
Out[119]: array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
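One caveat worth checking: this approach relies on equal values keeping their original relative order when sorted, and NumPy's default sort kind is not guaranteed to be stable. The sketch below is a variant that requests a stable sort explicitly and compares the result against pandas on random input. The name cumcount_stable and the random test data are assumptions made for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

def grp_range(counts):
    # Concatenated ranges 0..c-1 for each c in counts
    idx = counts.cumsum()
    id_arr = np.ones(idx[-1], dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -counts[:-1] + 1
    return id_arr.cumsum()

def cumcount_stable(values):
    arr = np.asarray(values)
    counts = np.unique(arr, return_counts=True)[1]
    # Stable sort keeps equal values in order of first appearance
    order = np.argsort(arr, kind="stable")
    # argsort of a permutation is its inverse: the rank of each element
    ranks = np.argsort(order)
    return grp_range(counts)[ranks]

# Compare against pandas on random letters
rng = np.random.default_rng(0)
letters = rng.choice(list("abcdefgh"), size=1000)
out = cumcount_stable(letters)
expected = pd.Series(letters).groupby(letters).cumcount().to_numpy()
print((out == expected).all())
```

The pandas comparison is only a sanity check; it says nothing about relative timings.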
For getting the counts, a few other alternatives could be proposed with a focus on performance:
np.bincount(np.unique(A,return_inverse=1)[1])
np.bincount(np.frombuffer(b'aaabaaacaaadaaac',dtype=np.uint8)-97)
Additionally, with A containing single lowercase letters, we could get the counts simply with:
np.bincount(np.array(A, dtype='S1').view('uint8')-97)
Besides defaultdict there are a couple of other counters. Testing a slightly simpler case:
In [298]: from collections import defaultdict
In [299]: from collections import defaultdict, Counter
In [300]: def foo(l):
...: counter = defaultdict(int)
...: for i in l:
...: counter[i] += 1
...: return counter
...:
In [301]: short_list = list('aaabaaacaaadaaac')
In [302]: foo(short_list)
Out[302]: defaultdict(int, {'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [303]: Counter(short_list)
Out[303]: Counter({'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [304]: arr=[ord(i)-ord('a') for i in short_list]
In [305]: np.bincount(arr)
Out[305]: array([12, 1, 2, 1], dtype=int32)
I constructed arr because bincount only works with ints.
In [306]: timeit np.bincount(arr)
The slowest run took 82.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.63 µs per loop
In [307]: timeit Counter(arr)
100000 loops, best of 3: 13.6 µs per loop
In [308]: timeit foo(arr)
100000 loops, best of 3: 6.49 µs per loop
I'm guessing it would be hard to improve on pir2, which is based on defaultdict.
Searching and counting like this are not a strong area for numpy.

find the first element in a list beyond some index and satisfying some condition

I have as Input:
A givenIndex
A list
I want to find the index of the first positive element in that list, ignoring all indices that are strictly smaller than givenIndex.
For example, if givenIndex=2 and the list is listOf(1, 0, 0, 0, 6, 8, 2), the expected output is 4 (where the value is 6).
The following code finds the first positive element, but it does not skip the indices that are smaller than givenIndex.
val numbers = listOf(1, 0, 0, 0, 6, 8, 2)
val output = numbers.indexOfFirst { it > 0 } //output is 0 but expected is 4
val givenIndex = 2
val output = numbers.withIndex().indexOfFirst { (index, value) -> index >= givenIndex && value > 0 }

Selecting values with Pandas multiindex using lists of tuples

I have a DataFrame with a MultiIndex with 3 levels:
id foo bar col1
0 1 a -0.225873
2 a -0.275865
2 b -1.324766
3 1 a -0.607122
2 a -1.465992
2 b -1.582276
3 b -0.718533
7 1 a -1.904252
2 a 0.588496
2 b -1.057599
3 a 0.388754
3 b -0.940285
Preserving the id index level, I want to sum along the foo and bar levels, but with different values for each id.
For example, for id = 0 I want to sum over foo = [1] and bar = [["a", "b"]], for id = 3 I want to sum over foo = [2] and bar = [["a", "b"]], and for id = 7 I want to sum over foo = [[1,2]] and bar = [["a"]]. Giving the result:
id col1
0 -0.225873
3 -3.048268
7 -1.315756
I have been trying something along these lines:
df.loc(axis = 0)[[(0, 1, ["a","b"]), (3, 2, ["a","b"]), (7, [1,2], "a")]].sum()
Not sure if this is even possible. Any elegant solution (possibly removing the MultiIndex?) would be much appreciated!
The list of tuples is not the problem. The problem is that each tuple does not correspond to a single index entry, since a list isn't a valid key. To index a DataFrame like this, you need to expand the lists inside each tuple into their own entries.
Define your options as the following list of dictionaries, then expand them with a list comprehension and index using all the individual entries.
d = [
{
'id': 0,
'foo': [1],
'bar': ['a', 'b']
},
{
'id': 3,
'foo': [2],
'bar': ['a', 'b']
},
{
'id': 7,
'foo': [1, 2],
'bar': ['a']
},
]
all_idx = [
(el['id'], i, j)
for el in d
for i in el['foo']
for j in el['bar']
]
# [(0, 1, 'a'), (0, 1, 'b'), (3, 2, 'a'), (3, 2, 'b'), (7, 1, 'a'), (7, 2, 'a')]
df.loc[all_idx].groupby(level=0).sum()
col1
id
0 -0.225873
3 -3.048268
7 -1.315756
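Note that the expanded list contains (0, 1, 'b'), which does not exist in the frame. Depending on the pandas version, passing absent labels to .loc may raise a KeyError instead of skipping them. A sketch that sidesteps this, assuming the frame from the question, is to build a boolean mask with Index.isin; the names wanted, mask and result are illustrative.

```python
import pandas as pd

# Rebuild the frame from the question
index = pd.MultiIndex.from_tuples(
    [(0, 1, "a"), (0, 2, "a"), (0, 2, "b"),
     (3, 1, "a"), (3, 2, "a"), (3, 2, "b"), (3, 3, "b"),
     (7, 1, "a"), (7, 2, "a"), (7, 2, "b"), (7, 3, "a"), (7, 3, "b")],
    names=["id", "foo", "bar"],
)
df = pd.DataFrame(
    {"col1": [-0.225873, -0.275865, -1.324766,
              -0.607122, -1.465992, -1.582276, -0.718533,
              -1.904252, 0.588496, -1.057599, 0.388754, -0.940285]},
    index=index,
)

# Expanded (id, foo, bar) entries; (0, 1, "b") is not in the index
wanted = [(0, 1, "a"), (0, 1, "b"), (3, 2, "a"), (3, 2, "b"), (7, 1, "a"), (7, 2, "a")]

# True for rows whose full index tuple is in the wanted list
mask = df.index.isin(wanted)

result = df[mask].groupby(level="id").sum()
print(result)
```

Entries that are missing from the index simply never match, so no label lookup can fail.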
A more succinct solution using slicers:
sections = [(0, 1, slice(None)), (3, 2, slice(None)), (7, slice(1,2), "a")]
pd.concat(df.loc[s] for s in sections).groupby("id").sum()
col1
id
0 -0.225873
3 -3.048268
7 -1.315756
Two things to note:
This may be less memory-efficient than the accepted answer since pd.concat creates a new DataFrame.
The slice(None)'s are mandatory, otherwise the index columns of the df.loc[s]'s mismatch when calling pd.concat.

Fill pandas fields with tuples as elements by slicing

Sorry if this question has been asked before, but I did not find it here nor somewhere else:
I want to fill some of the fields of a column with tuples. Currently I would have to resort to:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4]})
df['b'] = ''
df['b'] = df['b'].astype(object)
mytuple = ('x','y')
for l in df[df.a % 2 == 0].index:
    df.at[l, 'b'] = mytuple
with df being (which is what I want)
a b
0 1
1 2 (x, y)
2 3
3 4 (x, y)
This does not look very elegant to me and probably not very efficient. Instead of the loop, I would prefer something like
df.loc[df.a % 2 == 0, 'b'] = np.array([mytuple] * sum(df.a % 2 == 0), dtype=tuple)
which (of course) does not work. How can I improve my above method by using slicing?
In [57]: df.loc[df.a % 2 == 0, 'b'] = pd.Series([mytuple] * len(df.loc[df.a % 2 == 0])).values
In [58]: df
Out[58]:
a b
0 1
1 2 (x, y)
2 3
3 4 (x, y)
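Another way to avoid the explicit loop, offered only as a sketch and not as the approach from the answer above, is to build the whole column with Series.map. It is still evaluated element by element, so it is tidier rather than truly vectorized.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]})
mytuple = ("x", "y")

# Each even value of 'a' gets the tuple, every other row an empty string
df["b"] = df["a"].map(lambda v: mytuple if v % 2 == 0 else "")

print(df)
```

Because the function returns whole Python objects, the resulting column has object dtype and each tuple is stored as a single cell value.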