How can I increase the efficiency of my loop that replaces values? - pandas

import numpy as np

def assignGroup(row):
    if row['democ_bad'] == 1:
        return 4
    elif row['democ_bad'] == 2:
        return 3
    elif row['democ_bad'] == 3:
        return 2
    elif row['democ_bad'] == 4:
        return 1
    else:
        return np.nan

wvs['demo_better'] = wvs.apply(assignGroup, axis=1)

Try this:
mapping = {1: 4, 2: 3, 3: 2, 4: 1}
wvs['demo_better'] = wvs['democ_bad'].map(mapping)
Series.map returns NaN wherever a value is missing from the mapping, so the else branch is covered automatically.
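Since this particular mapping simply reverses a 1-4 scale, plain arithmetic is an alternative (a sketch of mine, assuming democ_bad holds only 1-4 or NaN):
# 5 - x maps 1->4, 2->3, 3->2, 4->1; values outside 1..4 are masked to NaN
wvs['demo_better'] = (5 - wvs['democ_bad']).where(wvs['democ_bad'].isin([1, 2, 3, 4]))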


How to index the unique value count in numpy? [duplicate]

Consider the following lists short_list and long_list:
import numpy as np
import pandas as pd
from string import ascii_letters

short_list = list('aaabaaacaaadaaac')

np.random.seed([3, 1415])
long_list = pd.DataFrame(
    np.random.choice(list(ascii_letters), (10000, 2))
).sum(1).tolist()
How do I calculate the cumulative count by unique value?
I want to use numpy and do it in linear time, so that I can compare timings against my other methods. It may be easiest to illustrate with my first proposed solution:
def pir1(l):
    s = pd.Series(l)
    return s.groupby(s).cumcount().tolist()
print(np.array(short_list))
print(pir1(short_list))
['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c']
[0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]
I've tortured myself trying to use np.unique because it returns a counts array, an inverse array, and an index array. I was sure I could use these to get at a solution. The best I got is in pir4 below, which scales in quadratic time. Also note that I don't care if counts start at 1 or zero, as we can simply add or subtract 1.
Below are some of my attempts (none of which answer my question)
%%cython
from collections import defaultdict

def get_generator(l):
    counter = defaultdict(lambda: -1)
    for i in l:
        counter[i] += 1
        yield counter[i]

def pir2(l):
    return [i for i in get_generator(l)]

def pir3(l):
    return [i for i in get_generator(l)]
def pir4(l):
    # builds a (n_unique, n) boolean matrix, hence the quadratic scaling
    unq, inv = np.unique(l, 0, 1, 0)
    a = np.arange(len(unq))
    matches = a[:, None] == inv
    return (matches * matches.cumsum(1)).sum(0).tolist()
setup
short_list = np.array(list('aaabaaacaaadaaac'))
functions
dfill takes an array and returns the positions where the array changes and repeats that index position until the next change.
# dfill
#
# Example with short_list
#
# 0 0 0 3 4 4 4 7 8 8 8 11 12 12 12 15
# [ a a a b a a a c a a a d a a a c]
#
# Example with short_list after sorting
#
# 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15
# [ a a a a a a a a a a a a b c c d]
argunsort returns the permutation necessary to undo a sort, given the argsort array. The existence of this method became known to me via this post. With this, I can get the argsort array, sort my array with it, and then undo the sort without the overhead of sorting again.
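A minimal illustration of the un-sort trick (my sketch, not from the post):
import numpy as np

a = np.array(list('baca'))
s = a.argsort(kind='mergesort')   # permutation that sorts a
u = np.empty(s.size, dtype=np.int64)
u[s] = np.arange(s.size)          # inverse permutation (argunsort)
assert (a[s][u] == a).all()       # sorting, then un-sorting, recovers a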
cumcount will take an array, sort it, and find the dfill array. Subtracting dfill from an np.arange gives me the cumulative count. Then I un-sort.
# cumcount
#
# Example with short_list
#
# short_list:
# [ a a a b a a a c a a a d a a a c]
#
# short_list.argsort():
# [ 0 1 2 4 5 6 8 9 10 12 13 14 3 7 15 11]
#
# Example with short_list after sorting
#
# short_list[short_list.argsort()]:
# [ a a a a a a a a a a a a b c c d]
#
# dfill(short_list[short_list.argsort()]):
# [ 0 0 0 0 0 0 0 0 0 0 0 0 12 13 13 15]
#
# np.arange(short_list.size):
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
#
# np.arange(short_list.size) -
# dfill(short_list[short_list.argsort()]):
# [ 0 1 2 3 4 5 6 7 8 9 10 11 0 0 1 0]
#
# unsorted:
# [ 0 1 2 0 3 4 5 0 6 7 8 0 9 10 11 1]
foo function recommended by @hpaulj using defaultdict
div function recommended by @Divakar (old; I'm sure he'd update it)
code
from collections import defaultdict

def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

def foo(l):
    n = len(l)
    r = np.empty(n, dtype=np.int64)
    counter = defaultdict(int)
    for i in range(n):
        counter[l[i]] += 1
        r[i] = counter[l[i]]
    return r - 1

def div(l):
    a = np.unique(l, return_counts=1)[1]
    idx = a.cumsum()
    id_arr = np.ones(idx[-1], dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1] + 1
    rng = id_arr.cumsum()
    return rng[argunsort(np.argsort(l))]
demonstration
cumcount(short_list)
array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
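As a quick sanity check (my addition, not in the original post), the three variants agree on short_list:
assert (cumcount(short_list) == foo(short_list)).all()
assert (cumcount(short_list) == div(short_list)).all()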
time testing
code
from timeit import timeit
from string import ascii_letters

# 'foo2' was timed in the original post but its definition is not shown in this excerpt
functions = pd.Index(['cumcount', 'foo', 'foo2', 'div'], name='function')
lengths = pd.RangeIndex(100, 1100, 100, name='array length')
results = pd.DataFrame(index=lengths, columns=functions)

for i in lengths:
    a = np.random.choice(list(ascii_letters), i)
    for j in functions:
        # set_value is deprecated in modern pandas; results.at[i, j] = ... is the replacement
        results.set_value(
            i, j,
            timeit(
                '{}(a)'.format(j),
                'from __main__ import a, {}'.format(j),
                number=1000
            )
        )

results.plot()
Here's a vectorized approach using a custom grouped-range-creating function and np.unique for getting the counts -
def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1], dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1] + 1
    return id_arr.cumsum()

count = np.unique(A, return_counts=1)[1]
out = grp_range(count)[np.argsort(A).argsort()]
Sample run -
In [117]: A = list('aaabaaacaaadaaac')
In [118]: count = np.unique(A,return_counts=1)[1]
...: out = grp_range(count)[np.argsort(A).argsort()]
...:
In [119]: out
Out[119]: array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
For getting the count, a few other alternatives could be proposed with a focus on performance -
np.bincount(np.unique(A,return_inverse=1)[1])
np.bincount(np.fromstring('aaabaaacaaadaaac',dtype=np.uint8)-97)
Additionally, with A containing single-letter characters, we could get the count simply with -
np.bincount(np.array(A).view('uint8')-97)
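Aside (my note, not from the answer): np.fromstring is deprecated in recent NumPy; np.frombuffer over the encoded bytes gives the same uint8 codes:
# assumes ASCII lowercase input, as in the answer's example
np.bincount(np.frombuffer(b'aaabaaacaaadaaac', dtype=np.uint8) - 97)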
Besides defaultdict there are a couple of other counters. Testing a slightly simpler case:
In [298]: from collections import defaultdict
In [299]: from collections import defaultdict, Counter
In [300]: def foo(l):
     ...:     counter = defaultdict(int)
     ...:     for i in l:
     ...:         counter[i] += 1
     ...:     return counter
     ...:
In [301]: short_list = list('aaabaaacaaadaaac')
In [302]: foo(short_list)
Out[302]: defaultdict(int, {'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [303]: Counter(short_list)
Out[303]: Counter({'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [304]: arr=[ord(i)-ord('a') for i in short_list]
In [305]: np.bincount(arr)
Out[305]: array([12, 1, 2, 1], dtype=int32)
I constructed arr because bincount only works with ints.
In [306]: timeit np.bincount(arr)
The slowest run took 82.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.63 µs per loop
In [307]: timeit Counter(arr)
100000 loops, best of 3: 13.6 µs per loop
In [308]: timeit foo(arr)
100000 loops, best of 3: 6.49 µs per loop
I'm guessing it would be hard to improve on pir2, which is based on defaultdict.
Searching and counting like this are not a strong area for numpy.

Receiving an IndexError when the range in the list should not be out of bounds

I'm new to programming and I can't figure out why I'm getting this error. This is a snippet of code from my battleship script. What I'm trying to do is have a function select spots that are empty to place ships. I've imported random, and I have mostly reused this code from the function that sets up the player's ships, which does work. Every once in a while I do not get this error and the game loop progresses as normal. I apologize if I have not given enough information; I will update with more and provide the full code if you wish to see it. Thank you for looking.
row_headers = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
column_headers = [' ', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
rows = columns = 10
grid_enemy = []
for row in range(row):
    grid_enemy.append([])
    for column in range(columns):
        grid_enemy[row].append('.')

def setup_ship_enemy(ship_enemy):
    random_coordinate = []
    random_coordinate.append(random.randint(1, 10))
    random_coordinate.append(random.randint(1, 10))
    random_direction = random.randint(1, 4)  # direction is: UP DOWN LEFT RIGHT
    valid_spot_e = 0
    for unit_e in range(ship_enemy.size):
        if random_direction == 1:
            spot_e = -unit_e + -1 + random_coordinate[0]
            if spot_e < 0:
                return False
            elif grid_enemy[spot_e][-1 + random_coordinate[1]] != '.':
                return False
            else:
                valid_spot_e += 1
                if valid_spot_e == ship.size:
                    for unit_e in range(ship_enemy.size):
                        ship_enemy.position.append([random_coordinate[0] + -unit_e, random_coordinate[1]])
        if random_direction == 2:
            spot_e = unit_e + -1 + random_coordinate[0]
            if spot_e >= 10:
                return False
            elif grid_enemy[spot_e][-1 + random_coordinate[1]] != '.':
                return False
            else:
                valid_spot_e += 1
                if valid_spot_e == ship_enemy.size:
                    for unit_e in range(ship_enemy.size):
                        ship_enemy.position.append([random_coordinate[0] + unit_e, random_coordinate[1]])
        if random_direction == 3:
            spot_e = -unit_e + -1 + random_coordinate[1]
            if spot_e < 0:
                return False
            elif grid_enemy[-1 + random_coordinate[0]][spot_e] != '.':
                return False
            else:
                valid_spot_e += 1
                if valid_spot_e == ship_enemy.size:
                    for unit_e in range(ship_enemy.size):
                        ship_enemy.position.append([random_coordinate[0], random_coordinate[1] - unit_e])
        if random_direction == 4:
            spot_e = unit_e + -1 + random_coordinate[1]
            if spot_e >= 10:
                return False
            elif grid_enemy[-1 + random_coordinate[0]][spot_e] != '.':
                return False
            else:
                valid_spot_e += 1
                if valid_spot_e == ship_enemy.size:
                    for unit_e in range(ship_enemy.size):
                        ship_enemy.position.append([random_coordinate[0], random_coordinate[1] + unit_e])
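For reference, a compact way to express the bounds-and-occupancy check that each of the four branches above repeats (a sketch of mine with made-up names, not a diagnosis of the error):
def cells_free(grid, cells):
    """True if every (row, col) in cells is on the board and unoccupied."""
    return all(
        0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == '.'
        for r, c in cells
    )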

Adding and updating a pandas column based on conditions of other columns

So I have a dataframe of over 1 million rows.
One column, called 'activity', has numbers from 1 to 12.
I added a new empty column called 'label'.
The column 'label' needs to be filled with 0 or 1, based on the values of the column 'activity':
if activity is 1, 2, 3, 6, 7, or 8, label will be 0; otherwise it will be 1.
Here is what I am currently doing:
df = pd.read_csv('data.csv')
df['label'] = ''
for index, row in df.iterrows():
    if (row['activity'] == 1 or row['activity'] == 2 or row['activity'] == 3 or
            row['activity'] == 6 or row['activity'] == 7 or row['activity'] == 8):
        df.loc[index, 'label'] = 0
    else:
        df.loc[index, 'label'] = 1
df.to_csv('data.csv', index=False)
This is very inefficient and takes too long to run. Are there any optimizations, possibly using numpy arrays? And is there any way to make the code cleaner?
Use numpy.where with Series.isin:
df['label'] = np.where(df['activity'].isin([1, 2, 3, 6, 7, 8]), 0, 1)
Or map the inverted mask's True/False to 1/0:
df['label'] = (~df['activity'].isin([1, 2, 3, 6, 7, 8])).astype(int)
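A quick demonstration on made-up data (my sketch, not from the answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'activity': [1, 4, 6, 12]})
df['label'] = np.where(df['activity'].isin([1, 2, 3, 6, 7, 8]), 0, 1)
print(df)
#    activity  label
# 0         1      0
# 1         4      1
# 2         6      0
# 3        12      1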

What should be done if getting error: can only assign an iterable while applying a function on a dataframe?

I am applying the function string_match_score to a dataframe and getting TypeError: can only assign an iterable. Can anyone please tell me what that means and how to solve it? Or what's the right way to apply the function? Below are the code and the dataframe.
def convert_to_list(string):
    list1=[]
    list1[:0]=string
    return list1

def string_match_score(s1,s2,country1,country2):
    list_1 = convert_to_list(s1)
    list_2 = convert_to_list(s2)
    if s1=="" or s2=="":
        return 0
    elif s1==s2:
        return 1
    elif country1 == "ax" and country2=="ax" and len(s1)==10 and len(s2)==13 and s1[:10]==s2[:10]:
        return 90
    elif country1 == "ax" and country2=="ax" and len(s2)==10 and len(s1)==13 and s1[:10]==s2[:10]:
        return 90
    elif len(list_1) != len(list_2):
        return 0
    elif len(list_1) == len(list_2):
        # index variable
        idx = 0
        # Result list
        res = []
        # With iteration
        for i in list_1:
            if i != list_2[idx]:
                res.append(idx)
            idx = idx + 1
        if len(res) >= 3:
            return 0
        elif len(res) == 1:
            return 89
        elif len(res) == 2:
            if res[1] - res[0] != 1:
                return 0
            elif res[1] - res[0] == 1:
                if list_1[res[0]] == list_2[res[1]] and list_1[res[1]] == list_2[res[0]]:
                    return 78
                else:
                    return 0

df['final_score']=df.apply(lambda x:
    string_match_score(x["ID1"],x["ID2"],x["Country1"],x["Country2"]),axis=1)
S.No  Country1  ID1       Country2  ID2
1     ax                  ax        99577A09
2     US        QWE       US        9957700B
3     Mexico    81231828  US        81231826
4     US        81321862  US        81231862
First of all, the indentation in your definitions is wrong. It should look like this:
def convert_to_list(string):
    list1=[]
    return list1

def string_match_score(s1,s2,country1,country2):
    list_1 = convert_to_list(s1)
    list_2 = convert_to_list(s2)
    if s1=="" or s2=="":
        return 0
    elif s1==s2:
        return 1
    elif country1 == "ax" and country2=="ax" and len(s1)==10 and len(s2)==13 and s1[:10]==s2[:10]:
        return 90
    elif country1 == "ax" and country2=="ax" and len(s2)==10 and len(s1)==13 and s1[:10]==s2[:10]:
        return 90
    elif len(list_1) != len(list_2):
        return 0
    elif len(list_1) == len(list_2):
        # index variable
        idx = 0
        # Result list
        res = []
        # With iteration
        for i in list_1:
            if i != list_2[idx]:
                res.append(idx)
            idx = idx + 1
        if len(res) >= 3:
            return 0
        elif len(res) == 1:
            return 89
        elif len(res) == 2:
            if res[1] - res[0] != 1:
                return 0
            elif res[1] - res[0] == 1:
                if list_1[res[0]] == list_2[res[1]] and list_1[res[1]] == list_2[res[0]]:
                    return 78
                else:
                    return 0
Secondly, the error you get comes from the list1 slice assignment in your first function: you can only assign an iterable to a slice; something with 0 or more values, not one specific value. Delete it. Then it works.
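To see the error in isolation (a minimal sketch of mine): slice assignment requires an iterable on the right-hand side, so a non-iterable value raises exactly this TypeError:
list1 = []
list1[:0] = "abc"   # fine: a string is iterable -> ['a', 'b', 'c']
list1[:0] = 3.14    # TypeError: can only assign an iterable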
Example: Take this df
   S.No Country1       ID1 Country2       ID2
0     1   Mexico   8554433       US   8554433
1     2       US  34893434       US  99593434
2     3   Mexico  81231828       US  81231826
3     4       US  81321862       US  81231862
and apply your function:
df['final_score']=df.apply(lambda x:
string_match_score(x["ID1"],x["ID2"],x["Country1"],x["Country2"]),axis=1)
returns
   S.No Country1       ID1 Country2       ID2  final_score
0     1   Mexico   8554433       US   8554433          1.0
1     2       US  34893434       US  99593434          NaN
2     3   Mexico  81231828       US  81231826          NaN
3     4       US  81321862       US  81231862          NaN
(The NaN rows arise because the function falls through every branch and implicitly returns None: with the stripped-down convert_to_list both lists are always empty, so res stays empty and none of the len(res) branches return a value.)

Fill pandas fields with tuples as elements by slicing

Sorry if this question has been asked before, but I did not find it here or elsewhere:
I want to fill some of the fields of a column with tuples. Currently I would have to resort to:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4]})
df['b'] = ''
df['b'] = df['b'].astype(object)
mytuple = ('x','y')
for l in df[df.a % 2 == 0].index:
    df.set_value(l, 'b', mytuple)
with df being (which is what I want)
   a       b
0  1
1  2  (x, y)
2  3
3  4  (x, y)
This does not look very elegant to me and is probably not very efficient. Instead of the loop, I would prefer something like
df.loc[df.a % 2 == 0, 'b'] = np.array([mytuple] * sum(df.a % 2 == 0), dtype=tuple)
which (of course) does not work. How can I improve my above method by using slicing?
In [57]: df.loc[df.a % 2 == 0, 'b'] = pd.Series([mytuple] * len(df.loc[df.a % 2 == 0])).values
In [58]: df
Out[58]:
   a       b
0  1
1  2  (x, y)
2  3
3  4  (x, y)
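Note that DataFrame.set_value (used in the question) was removed in pandas 1.0. A sketch of the same fill on modern pandas using .at (my addition; it assumes the column is already object dtype):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]})
df['b'] = ''
df['b'] = df['b'].astype(object)
mytuple = ('x', 'y')

# .at sets a single cell and happily stores a tuple in an object column
for i in df.index[df.a % 2 == 0]:
    df.at[i, 'b'] = mytuple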