How to find missing numbers between minimum and maximum - numpy

I want to make a NumPy array like the following:
Random values: 0~9 (0 <= value <= 9); random 1D size: 5~9 (5 <= size <= 9).
And I want to find the missing numbers between the min and the max, so I wrote this code:
import numpy as np
min_val = 0
max_val = 10
min_val_len = 5
max_val_len = 10
arr1 = [4,3,2,7,8,2,3]
a = list(arr1)
print(a)
diff = np.setdiff1d(range(min_val, max_val), arr1)
arr = np.arange(min_val_len, max_val_len)
if diff in arr:
    print(diff)
else:
    print("no missing")
For this input, the output I want is [5, 6].
And if the input is [1, 2, 3, 4, 5], the result should be 'no missing'.
But the code doesn't work as I expect.

I think you expect the in operator to work in a way it does not: you want to check every single element. Try:
b = [d in arr for d in diff]
Now b contains a boolean value for each value d of diff. If you want to find the actual numbers that are missing, you can do it with an intersection:
b = np.intersect1d(np.setdiff1d(range(min_val, max_val), arr1), arr)
Now b contains all the numbers of diff that are also in arr. But since you are already working with the notion of sets, you can do it in an even simpler way (also note that Python has built-in set types, so you do not strictly need NumPy for this):
print(np.setdiff1d(range(min_val, max_val), arr1))
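Putting it all together, here is a minimal sketch of the whole task as described in the question (random values 0-9, random length 5-9); it assumes the missing numbers should be taken between the array's own min and max, which is what the expected output [5, 6] suggests:
import numpy as np

rng = np.random.default_rng()
size = rng.integers(5, 10)               # random length, 5 <= size <= 9
arr1 = rng.integers(0, 10, size=size)    # random values, 0 <= value <= 9
print(arr1)

# numbers between arr1.min() and arr1.max() that are absent from arr1
diff = np.setdiff1d(np.arange(arr1.min(), arr1.max() + 1), arr1)

if diff.size:
    print(diff)
else:
    print("no missing")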

Related

How do you speed up a score calculation based on two rows in a Pandas Dataframe?

TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row  # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3)  # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row : add(row, target_row), axis = 1)
print(time.time()-start)
So I have a dataframe of size 30'000 and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often, I would like to optimize it for performance.
Any help is very much appreciated.
I already read "Optimization when using Pandas"; are there further resources you can recommend? Thanks.
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
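If you would rather not modify the converted array in place, a broadcast variant of the same idea gives the identical result list (a sketch, reusing df and target_row_index as defined in the question):
vals = df.to_numpy()
target = vals[target_row_index]            # the row to compare against

summed = vals + target                     # broadcasting adds target to every row at once
result = (np.sum(summed == 4, axis=1) - np.sum(summed == 3, axis=1)).tolist()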

Multiple operations iterating over dataframe columns (apply function?)

I have a pandas dataframe with thousands of columns and I would like to perform the following operations for each column of the dataframe:
check if the i-th and (i-1)-th values are in the range (between x and y);
if #1 is satisfied, then find log(value_i / value_(i-1)) ** 2 for the column;
if #1 is not satisfied, assume 0;
find the total of #2 for each column.
Here is a dataframe with a single column:
import numpy as np
import pandas as pd

d = {'col1': [10, 15, 23, 16, 5, 14, 11, 4]}
df = pd.DataFrame(data=d)
df
x = 10
y = 20
Here is what I can do for this single column:
df["IsIn"] = "NA"
for i in range(1, len(df.col1)):
    if (x < df.col1[i] < y) & (x < df.col1[i - 1] < y):
        df.IsIn[i] = 1
    else:
        df.IsIn[i] = 0
df["rets"] = np.log(df["col1"] / df["col1"].shift(1))
df["var"] = df["IsIn"] * df["rets"]**2
Total = df["var"].sum()
Total
Ideally, I would have a (1 by n-cols) dataframe of Totals for each column. How can I best achieve this? I would also appreciate if you can supplement your answer with detailed explanation.
Yes, this is an instance where apply works. You only need to wrap your logic in a function. Also, consider between and shift on the condition to eliminate the first loop:
def func(s, x=10, y=20):
    '''
    compute the value given a series
    '''
    # mask where values are between x and y
    valid = s.between(x, y)
    # shift `valid` and double check
    valid = valid & valid.shift(fill_value=False)
    # squared log, mask with `valid`, and sum
    return (np.log(s / s.shift())**2 * valid).sum()

# apply `func` on the columns
df.apply(func, x=10, y=20)
Output:
col1 0.222561
dtype: float64
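To see the per-column behaviour, here is a small usage sketch with a second, made-up column (col2 is hypothetical, not part of the question); apply runs func once per column and returns one total per column:
df2 = pd.DataFrame({
    'col1': [10, 15, 23, 16, 5, 14, 11, 4],
    'col2': [12, 13, 19, 25, 11, 15, 18, 17],   # hypothetical second column
})
print(df2.apply(func, x=10, y=20))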

bisect_right for numpy array

I wish to locate the index of the closest higher value to a query over a sorted NumPy array (where the query value is not in the array),
similar to bisect_right in the Python standard library, without converting the NumPy array to a Python list, and leveraging the fact that the array is sorted (i.e. the runtime should be O(log N), like NumPy's searchsorted).
Pandas has this option using get_loc with the 'bfill' option, but it seems a bit of an overkill to include it as a dependency just for this... I might have to resort to holding this array as both a Python list and a NumPy array, but I wanted to hear if there's a more reasonable solution.
Edit: It seems searchsorted does exactly what I need.
We can see the code for bisect_right on github:
def bisect_right(a, x, lo=0, hi=None):
    """Return the index where to insert item x in list a, assuming a is sorted.
    The return value i is such that all e in a[:i] have e <= x, and all e in
    a[i:] have e > x. So if x already appears in the list, a.insert(x) will
    insert just after the rightmost x already there.
    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        # Use __lt__ to match the logic in list.sort() and in heapq
        if x < a[mid]: hi = mid
        else: lo = mid + 1
    return lo
This is all NumPy-compatible:
import numpy as np
array = np.array([1,2,3,4,5,6])
print(bisect_right(array, 7))
>>> 6
print(bisect_right(array, 0))
>>> 0
To find the index of the closest higher value to a number given:
def closest_higher_value(array, value):
    idx = bisect_right(array, value)
    if idx < len(array):
        return idx
    print("value too large:", value, "is bigger than all elements of:")
    print(array)
print(closest_higher_value(array, 3))
>>> 3
print(closest_higher_value(array, 7))
>>> value too large: 7 is bigger than all elements of:
>>> [1 2 3 4 5 6]
>>> None
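As the edit in the question notes, np.searchsorted gives the same index directly, and side='right' matches bisect_right; a minimal sketch:
import numpy as np

array = np.array([1, 2, 3, 4, 5, 6])
print(np.searchsorted(array, 3, side='right'))   # 3, same as bisect_right(array, 3)
print(np.searchsorted(array, 7, side='right'))   # 6, i.e. past the last element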

Assert an integer is in list on pandas series

I have a DataFrame with two pandas Series as follows:
   value accepted_values
0      1    [1, 2, 3, 4]
1      2    [5, 6, 7, 8]
I would like to efficiently check whether each value is in the corresponding accepted_values list, using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (it took around 27 seconds on a 1-million-row DataFrame):
import pandas as pd
df = pd.DataFrame({"value":[1, 2], "accepted_values": [[1,2,3,4], [5, 6, 7, 8]]})
def check_first_in_second(values: pd.Series):
    return values[0] in values[1]

are_in_accepted_values = df[["value", "accepted_values"]].apply(
    check_first_in_second, axis=1
)

if not are_in_accepted_values.all():
    raise AssertionError("Not all values are in accepted_values")
I think if you create a DataFrame from the list column, you can compare with DataFrame.eq and test whether at least one value per row matches using DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"], axis=0).any(axis=1).all()
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a little optimisation to your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()).
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
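As a quick sanity check, on the two-row sample df from the question all of these variants agree; they evaluate to False there, since 2 is not in [5, 6, 7, 8] (a sketch):
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
print(df1.eq(df["value"], axis=0).any(axis=1).all())                           # False
print(all(v in a for v, a in df[["value", "accepted_values"]].to_numpy()))     # False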

Numpy index of the maximum with reduction - numpy.argmax.reduceat

I have a flat array a:
a = numpy.array([0, 1, 1, 2, 3, 1, 2])
And an array b of indices marking the start of each "chunk":
b = numpy.array([0, 4])
I know I can find the maximum in each "chunk" using a reduction:
m = numpy.maximum.reduceat(a,b)
>>> array([2, 3], dtype=int32)
But... is there a way to find the index of the maximum within a chunk (like numpy.argmax), with vectorized operations (no lists, no loops)?
Borrowing the idea from this post, the steps involved are:
Offset all elements in each group by a limit-offset, then sort the whole array globally; this keeps every group in place while sorting the elements within each group.
In the sorted order, the last element of each group is the group max. Taking those positions and subtracting the group start offsets gives the argmax within each group.
Thus, a vectorized implementation would be -
def numpy_argmax_reduceat(a, b):
    n = a.max() + 1  # limit-offset
    grp_count = np.append(b[1:] - b[:-1], a.size - b[-1])
    shift = n * np.repeat(np.arange(grp_count.size), grp_count)
    sortidx = (a + shift).argsort()
    grp_shifted_argmax = np.append(b[1:], a.size) - 1
    return sortidx[grp_shifted_argmax] - b
As a minor tweak and possibly faster one, we could alternatively create shift with cumsum and thus have a variation of the earlier approach, like so -
def numpy_argmax_reduceat_v2(a, b):
    n = a.max() + 1  # limit-offset
    id_arr = np.zeros(a.size, dtype=int)
    id_arr[b[1:]] = 1
    shift = n * id_arr.cumsum()
    sortidx = (a + shift).argsort()
    grp_shifted_argmax = np.append(b[1:], a.size) - 1
    return sortidx[grp_shifted_argmax] - b
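A quick usage sketch on the arrays from the question (the returned indices are relative to the start of each chunk):
import numpy as np

a = np.array([0, 1, 1, 2, 3, 1, 2])
b = np.array([0, 4])

print(numpy_argmax_reduceat(a, b))       # [3 0]: max 2 at offset 3 in the first chunk, max 3 at offset 0 in the second
print(numpy_argmax_reduceat_v2(a, b))    # [3 0]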