I often find myself holding an array not of indices, but of index bounds that effectively define multiple slices. A representative example is:
import numpy as np
rand = np.random.default_rng(seed=0)
sample = rand.integers(low=0, high=10, size=(10, 10))
y, x = np.mgrid[:10, :10]
bad_starts = rand.integers(low=0, high=10, size=(10, 1))
print(bad_starts)
sample[
    (x >= bad_starts) & (y < 5)
] = -1
print(sample)
[[4]
[7]
[3]
[2]
[7]
[8]
[0]
[0]
[6]
[3]]
[[ 8 6 5 2 -1 -1 -1 -1 -1 -1]
[ 6 9 5 6 9 7 6 -1 -1 -1]
[ 2 8 6 -1 -1 -1 -1 -1 -1 -1]
[ 8 1 -1 -1 -1 -1 -1 -1 -1 -1]
[ 4 0 0 1 0 6 5 -1 -1 -1]
[ 7 3 4 9 8 9 3 6 9 6]
[ 8 6 7 3 8 1 5 7 8 5]
[ 3 3 4 4 7 8 0 9 5 3]
[ 6 5 2 3 7 5 5 3 7 3]
[ 3 8 2 2 7 6 0 0 3 8]]
Is there a simpler way to accomplish the same thing with slices alone, avoiding having to call mgrid and avoiding an entire boolean predicate matrix?
With ogrid you get a 'sparse' grid:
In [488]: y,x
Out[488]:
(array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]]),
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]))
The mask is the same: (x >= bad_starts) & (y < 5)
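For completeness, a minimal sketch of the ogrid-based version (setup copied from the question; only the grid construction changes, and the mask itself is still a full boolean matrix):
import numpy as np

rand = np.random.default_rng(seed=0)
sample = rand.integers(low=0, high=10, size=(10, 10))
bad_starts = rand.integers(low=0, high=10, size=(10, 1))

# ogrid returns broadcastable (10, 1) and (1, 10) arrays instead of full (10, 10) grids
y, x = np.ogrid[:10, :10]
sample[(x >= bad_starts) & (y < 5)] = -1
The result matches the mgrid version; the saving is only in the size of the coordinate arrays.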
A single value for each row can be fetched (or set) with:
In [491]: sample[np.arange(5)[:,None],bad_starts[:5]]
Out[491]:
array([[-1],
[-1],
[-1],
[-1],
[-1]])
But there isn't a way of accessing all of the -1 values with simple slicing; each row has a different-length slice:
In [492]: [sample[i,bad_starts[i,0]:] for i in range(5)]
Out[492]:
[array([-1, -1, -1, -1, -1, -1]),
array([-1, -1, -1]),
array([-1, -1, -1, -1, -1, -1, -1]),
array([-1, -1, -1, -1, -1, -1, -1, -1]),
array([-1, -1, -1])]
So there is no way to access them all with a single slice.
The equivalent 'advanced indexing' arrays are:
In [494]: np.nonzero((x >= bad_starts) & (y < 5))
Out[494]:
(array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4]),
array([4, 5, 6, 7, 8, 9, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5, 6, 7,
8, 9, 7, 8, 9]))
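Those two arrays can be fed straight back in as fancy indices; a small self-contained sketch (same setup as above, using ogrid):
import numpy as np

rand = np.random.default_rng(seed=0)
sample = rand.integers(low=0, high=10, size=(10, 10))
bad_starts = rand.integers(low=0, high=10, size=(10, 1))
y, x = np.ogrid[:10, :10]

# np.nonzero turns the boolean mask into explicit (row, column) index arrays
rows, cols = np.nonzero((x >= bad_starts) & (y < 5))
sample[rows, cols] = -1        # same assignment as with the boolean mask
print(sample[rows, cols])      # fetches the same elements: all -1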
Related
I would like to create a new column that counts how many distinct sequences I have, where a sequence runs from one zero value until the next zero value, with ones in between.
I am using R to develop this.
I have two scenarios. In each I have the Conversions column and I'd like to create the new column.
First Scenario (when my Conversions Column starts with 1):
Conversions  New Column (The Sequence)
1            1
1            1
0            2
1            2
1            2
1            2
0            3
1            3
1            3
0            4
0            4
0            4
1            4
1            4
1            4
0            5
0            5
Second Scenario (when my Conversions Column starts with 0):
Conversions  New Column (The Sequence)
0            1
0            1
0            1
1            1
0            2
1            2
1            2
1            2
0            3
0            3
1            3
0            4
1            4
1            4
0            5
1            5
1            5
Thanks
library(dplyr)
dt1 <- tibble(
  conversion = c(1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0),
  sequence = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5),
  id = 1:17
)
dt2 <- tibble(
  conversion = c(0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1),
  sequence = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  id = 1:17
)
build_seq <- function(df) {
  df %>%
    mutate(
      # mark rows where conversion drops from 1 to 0 (start of a new sequence)
      new_col = ifelse((conversion - lag(conversion, 1)) == -1, id, NA),
      # turn the marker ids into consecutive ranks 1, 2, 3, ...
      new_col = as.numeric(as.factor(new_col))
    ) %>%
    # carry each rank forward until the next sequence start
    tidyr::fill(new_col, .direction = "down") %>%
    mutate(
      # rows before the first drop belong to sequence 1; shift the rest up by one
      new_col = ifelse(is.na(new_col), 1, new_col + 1)
    )
}
new_dt1 <- build_seq(dt1)
new_dt2 <- build_seq(dt2)
all(new_dt1$new_col == new_dt1$sequence)
all(new_dt2$new_col == new_dt2$sequence)
I want to add a new index in a new column e when b and c are the same.
At the same time, I need to respect the limit sum(d) <= 20: if the total of d for rows with the same b and c exceeds 20, a new index should be given.
The example input data is below:
a  b  c   d
0  0  2   9
1  2  1  10
2  1  0   9
3  1  0  11
4  2  1   9
5  0  1  15
6  2  0   9
7  1  0   8
I sort by b and c first to make the comparison easier, but then I got KeyError: 0 from temporary_size += df.loc[df[i], 'd'].
I hope it can look like this:
a  b  c   d  e
5  0  1  15  1
0  0  2   9  2
2  1  0   9  3
3  1  0  11  3
7  1  0   8  4
6  2  0   9  5
1  2  1  10  6
4  2  1   9  6
and here is my code:
import pandas as pd
d = {'a': [0, 1, 2, 3, 4, 5, 6, 7], 'b': [0, 2, 1, 1, 2, 0, 2, 1], 'c': [2, 1, 0, 0, 1, 1, 0, 0], 'd': [9, 10, 9, 11, 9, 15, 9, 8]}
df = pd.DataFrame(data=d)
print(df)
df.sort_values(['b', 'c'], ascending=[True, True], inplace=True, ignore_index=True)
e_id = 0
total_size = 20
temporary_size = 0
for i in range(0, len(df.index)-1):
    if df.loc[i, 'b'] == df.loc[i+1, 'b'] and df.loc[i, 'c'] != df.loc[i+1, 'c']:
        temporary_size = temporary_size + df.loc[i, 'd']
        if temporary_size <= total_size:
            df.loc['e', i] = e_id
        else:
            df.loc[i, 'e'] = e_id
            temporary_size = temporary_size + df.loc[i, 'd']
            e_id += 1
    else:
        df.loc[i, 'e'] = e_id
        temporary_size = temporary_size + df.loc[i, 'd']
print(df)
Finally, I can't get the new column e in my dataframe.
Thanks to all!
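For reference, a minimal sketch of one way the described index could be assigned; it assumes e should increase with every new (b, c) pair and again whenever the running total of d within a pair would exceed the limit of 20 from the question. It is an illustration of that reading, not a verified answer:
import pandas as pd

d = {'a': [0, 1, 2, 3, 4, 5, 6, 7],
     'b': [0, 2, 1, 1, 2, 0, 2, 1],
     'c': [2, 1, 0, 0, 1, 1, 0, 0],
     'd': [9, 10, 9, 11, 9, 15, 9, 8]}
df = pd.DataFrame(d).sort_values(['b', 'c'], ignore_index=True)

limit = 20                          # cap on sum(d) per index, from the question
e = pd.Series(0, index=df.index)
e_id = 0
for _, group in df.groupby(['b', 'c'], sort=True):
    e_id += 1                       # every new (b, c) pair starts a new index
    running = 0
    for idx, val in group['d'].items():
        if running + val > limit:   # cap would be exceeded: open a new index
            e_id += 1
            running = 0
        running += val
        e.loc[idx] = e_id
df['e'] = e
print(df)
With the sample data this reproduces the e values 1, 2, 3, 3, 4, 5, 6, 6 shown in the expected output above.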
I define an array as:
XRN = np.array([[[0,1,0,1,0,1,0,1,0,1],
                 [0,1,1,0,0,1,0,1,0,1],
                 [0,1,0,0,1,1,0,1,0,1],
                 [0,1,0,1,0,0,1,1,0,1]],
                [[0,1,0,1,0,1,1,0,0,1],
                 [0,1,0,1,0,1,0,1,1,0],
                 [1,1,1,0,0,0,0,1,0,1],
                 [0,1,0,1,0,0,1,1,0,1]],
                [[0,1,0,1,0,1,1,1,0,0],
                 [0,1,0,1,1,1,0,1,0,0],
                 [0,1,0,1,1,0,0,1,0,1],
                 [0,1,0,1,0,0,1,1,0,1]]])
print(XRN.shape,XRN)
XRN_LEN = XRN.shape[1]
I can obtain the sum of each inner matrix with:
XRN_UP = XRN.sum(axis=1)
print("XRN_UP",XRN_UP.shape,XRN_UP)
XRN_UP (3, 10) [[0 4 1 2 1 3 1 4 0 4]
[1 4 1 3 0 2 2 3 1 3]
[0 4 0 4 2 2 2 4 0 2]]
I want to get the sum of all diagonals, keeping the same shape (3, 10).
I tested this code:
RIGHT = [XRN.diagonal(i,axis1=0,axis2=1).sum(axis=1) for i in range(XRN_LEN)]
np_RIGHT = np.array(RIGHT)
print("np_RIGHT=",np_RIGHT.shape,np_RIGHT)
but got
np_RIGHT= (4, 10) [[0 3 0 3 1 2 0 3 1 2]
[1 3 2 1 0 1 1 3 0 3]
[0 2 0 1 1 1 1 2 0 2]
[0 1 0 1 0 0 1 1 0 1]]
I checked all combinations of axis1 and axis2 but never got shape (3, 10). How can I do this?
axis1 axis2 shape
0 1 (4,10)
0 2 (4,4)
1 0 (4,10)
1 2 (4,3)
2 0 (4,4)
2 1 (4,3)
If I understand correctly, you want to sum all possible diagonals of each of the three inner matrices separately. In that case, apply np.diagonal with axis1=1 and axis2=2. Each offset then gives one diagonal per inner matrix, which you sum down to a single value per matrix. With 10 offsets and 3 matrices, the resulting shape is (10, 3):
>>> np.array([XRN.diagonal(i, 1, 2).sum(1) for i in range(XRN.shape[-1])])
array([[2, 3, 2],
[2, 1, 2],
[1, 1, 2],
[3, 2, 3],
[2, 2, 2],
[2, 2, 2],
[2, 3, 3],
[2, 2, 2],
[1, 0, 0],
[1, 1, 0]])
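Since the question asks for shape (3, 10), note that transposing the stacked result gives that orientation; a small follow-up sketch:
import numpy as np

# Assumes XRN is the (3, 4, 10) array defined in the question.
diag_sums = np.array([XRN.diagonal(i, 1, 2).sum(1) for i in range(XRN.shape[-1])])
print(diag_sums.shape)    # (10, 3)
print(diag_sums.T.shape)  # (3, 10): one row of diagonal sums per inner matrix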
Edit: changed example df for clarity
I have a dataframe similar to the one given below (except that the real one has a few thousand rows and columns, and the values are floats):
df = pd.DataFrame([[6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5], [0, 1, 2, 7, 4], [2, 0, 0, 4, 0]])
0 1 2 3 4
0 6 5 4 3 8
1 6 5 4 3 6
2 1 1 3 9 5
3 0 1 2 7 4
4 2 0 0 4 0
From this dataframe, I would like to drop all rows for which all values are lower than or equal to the values of any other row. For this simple example, row 1 and row 3 should be deleted ('dominated' by row 0 and row 2, respectively):
filtered df:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0
It would be even better if the approach could take floating-point errors into account, since my real dataframe contains floats (i.e. instead of dropping rows where all values are lower/equal, allow a small tolerance such as 0.0001 in the comparison).
My initial idea to tackle this problem was as follows:
- Select the first row
- Compare the other rows with it using a list comprehension (see below)
- Drop all rows that returned True
- Repeat for the next row
List comprehension code:
selected_row = df.loc[0]
[(df.loc[r] <= selected_row).all() and (df.loc[r] < selected_row).any() for r in range(len(df))]
[False, True, False, False, False]
This hardly seems efficient, however. Any suggestions on how to tackle this problem efficiently would be greatly appreciated.
We can try with broadcasting:
import pandas as pd
df = pd.DataFrame([
[6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5],
[0, 1, 2, 7, 4], [2, 0, 0, 4, 0]
])
# Need to ensure only one of each row present since comparing to 1
# there needs to be one and only one of each row
df = df.drop_duplicates()
# Broadcasted comparison explanation below
cmp = (df.values[:, None] <= df.values).all(axis=2).sum(axis=1) == 1
# Filter using the results from the comparison
df = df[cmp]
df:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0
Intuition:
Broadcast the comparison operation over the DataFrame:
(df.values[:, None] <= df.values)
[[[ True True True True True]
[ True True True True False]
[False False False True False]
[False False False True False]
[False False False True False]] # df vs [6 5 4 3 8]
[[ True True True True True]
[ True True True True True]
[False False False True False]
[False False False True False]
[False False False True False]] # df vs [6 5 4 3 6]
[[ True True True False True]
[ True True True False True]
[ True True True True True]
[False True False False False]
[ True False False False False]] # df vs [1 1 3 9 5]
[[ True True True False True]
[ True True True False True]
[ True True True True True]
[ True True True True True]
[ True False False False False]] # df vs [0 1 2 7 4]
[[ True True True False True]
[ True True True False True]
[False True True True True]
[False True True True True]
[ True True True True True]]] # df vs [2 0 0 4 0]
Then we can check for all on axis=2:
(df.values[:, None] <= df.values).all(axis=2)
[[ True False False False False] # Rows le [6 5 4 3 8]
[ True True False False False] # Rows le [6 5 4 3 6]
[False False True False False] # Rows le [1 1 3 9 5]
[False False True True False] # Rows le [0 1 2 7 4]
[False False False False True]] # Rows le [2 0 0 4 0]
Then we can use sum to count, for each row, how many rows it is less than or equal to:
(df.values[:, None] <= df.values).all(axis=2).sum(axis=1)
[1 2 1 2 1]
The rows for which there is only 1 row that they are less than or equal to (the self-match) are the rows to keep. Because of drop_duplicates there are no duplicates in the dataframe, so the only True values in each row are the self-match plus any rows it is less than or equal to:
(df.values[:, None] <= df.values).all(axis=2).sum(axis=1) == 1
[ True False True False True]
This then becomes the filter for the DataFrame:
df = df[[True, False, True, False, True]]
df:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0
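The question also mentions tolerating small floating-point differences. Here is a sketch of one way to fold a tolerance into the same broadcasted comparison; the 1e-4 value comes from the question, and treating values within the tolerance as equal is an assumption about the intended behaviour:
import numpy as np
import pandas as pd

df = pd.DataFrame([[6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5],
                   [0, 1, 2, 7, 4], [2, 0, 0, 4, 0]], dtype=float)

tol = 1e-4
dedup = df.drop_duplicates()
vals = dedup.values

# Row i counts as dominated by row j if every value of row i is at most the
# corresponding value of row j plus the tolerance.
dominated_by = (vals[:, None] <= vals + tol).all(axis=2)
keep = dominated_by.sum(axis=1) == 1   # only the self-match remains
print(dedup[keep])
# Caveat: rows whose values all differ by less than tol will drop each other.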
What is the expected proportion of dominant rows?
What is the size of the datasets that you will handle and the available memory?
While a solution like the broadcasting approach is very clever and efficient (vectorized), it will not be able to handle large dataframes, because the size of the broadcast quickly exceeds the memory limit (for a 100,000×10 input the intermediate boolean array alone is 100,000 × 100,000 × 10 bytes, roughly 100 GB, which will not run on most computers).
Here is another approach that avoids testing all combinations and computing everything at once in memory. It is slower due to the loop, but it should be able to handle much larger arrays. It will also run faster when the proportion of dominated rows increases.
In summary, it compares the dataset with the first row, drops the dominated rows, shifts the first row to the end, and starts again until it has done a full loop. As rows get dropped over time, the number of comparisons decreases.
def get_dominants_loop(df):
    from tqdm import tqdm
    seen = []       # keep track of tested rows
    idx = df.index  # initial index
    for i in tqdm(range(len(df)+1)):
        x = idx[0]
        if x in seen:  # done a full loop
            return df.loc[idx]
        seen.append(idx[0])
        # check which rows are dominated and drop them from the index
        idx = (df.loc[idx] - df.loc[x]).le(0).all(axis=1)
        # put the tested row at the end
        idx = list(idx[~idx].index) + [x]
To drop the dominated rows:
df = get_dominants_loop(df)
NB. I used tqdm here to have a progress bar; it is not needed for the code to run.
Quick benchmarking in cases where the broadcast approach could not run: under 2 minutes for 100k×10 in a case where most rows are not dominated; about 4 s when most rows are dominated.
You can try:
df[df.shift(1)[0] >= df[1][0]]
Output:
   0  1  2  3  4
1  6  5  4  3  6
2  1  1  3  9  5
You can try something like this:
# Cartesian product
x = np.tile(df, df.shape[0]).reshape(-1, df.shape[1])
y = np.tile(df.T, df.shape[0]).T
# Remove same rows
#dups = np.all(x == y, axis=1)
#x = x[~dups]
#y = y[~dups]
x = np.delete(x, slice(None, None, df.shape[0]+1), axis=0)
y = np.delete(y, slice(None, None, df.shape[0]+1), axis=0)
# Keep dominant rows
m = x[np.all(x >= y, axis=1)]
>>> m
array([[6, 5, 4, 3, 8],
[1, 1, 3, 9, 5]])
# Before remove duplicates
# df1 = pd.DataFrame({'x': x.tolist(), 'y': y.tolist()})
>>> df1
x y
0 [6, 5, 4, 3, 8] [6, 5, 4, 3, 8] # dup
1 [6, 5, 4, 3, 8] [6, 5, 4, 3, 6] # DOMINANT
2 [6, 5, 4, 3, 8] [1, 1, 3, 9, 5]
3 [6, 5, 4, 3, 8] [0, 1, 2, 7, 4]
4 [6, 5, 4, 3, 6] [6, 5, 4, 3, 8]
5 [6, 5, 4, 3, 6] [6, 5, 4, 3, 6] # dup
6 [6, 5, 4, 3, 6] [1, 1, 3, 9, 5]
7 [6, 5, 4, 3, 6] [0, 1, 2, 7, 4]
8 [1, 1, 3, 9, 5] [6, 5, 4, 3, 8]
9 [1, 1, 3, 9, 5] [6, 5, 4, 3, 6]
10 [1, 1, 3, 9, 5] [1, 1, 3, 9, 5] # dup
11 [1, 1, 3, 9, 5] [0, 1, 2, 7, 4] # DOMINANT
12 [0, 1, 2, 7, 4] [6, 5, 4, 3, 8]
13 [0, 1, 2, 7, 4] [6, 5, 4, 3, 6]
14 [0, 1, 2, 7, 4] [1, 1, 3, 9, 5]
15 [0, 1, 2, 7, 4] [0, 1, 2, 7, 4] # dup
Here is a way using df.apply()
m = (pd.concat(df.apply(lambda x: df.ge(x, axis=1), axis=1).tolist(), keys=df.index)
       .all(axis=1)
       .groupby(level=0)
       .sum()
       .eq(1))
ndf = df.loc[m]
Output:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0
My goal is to create a new column c_list that contains a list after a groupby (without using a merge): df['c_list'] = df.groupby('a').agg({'c': lambda x: list(x)})
df = pd.DataFrame(
    {'a': ['x', 'y', 'y', 'x'],
     'b': [2, 0, 0, 0],
     'c': [8, 2, 5, 6]}
)
df
Initial dataframe
a b c
0 x 2 8
1 y 0 2
2 y 0 5
3 x 0 6
Looking for:
a b c d
0 x 2 8 [6, 8]
1 y 0 2 [2, 5]
2 y 0 5 [2, 5]
3 x 0 6 [6, 8]
Try with transform
df['d']=df.groupby('a').c.transform(lambda x : [x.values.tolist()]*len(x))
0 [8, 6]
1 [2, 5]
2 [2, 5]
3 [8, 6]
Name: c, dtype: object
Or
df['d']=df.groupby('a').c.agg(list).reindex(df.a).values
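An equivalent formulation, shown as a sketch, builds one list per group and then maps it back onto the rows via the a column:
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'y', 'x'],
                   'b': [2, 0, 0, 0],
                   'c': [8, 2, 5, 6]})

# Aggregate each group of c into a list, then align it to every row by mapping on 'a'.
df['d'] = df['a'].map(df.groupby('a')['c'].agg(list))
print(df)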