How can I create a new column to identify the sequence between zero values?

I would like to create a new column that identifies how many different sequences I have: each sequence runs from one zero value (or run of zeros) until the next one, with 1s in between.
I am using R to develop this code.
I have two scenarios. In both, I have the Conversions column and I'd like to create the new column.
First Scenario (when my Conversions Column starts with 1):
Conversions  New Column (The Sequence)
1            1
1            1
0            2
1            2
1            2
1            2
0            3
1            3
1            3
0            4
0            4
0            4
1            4
1            4
1            4
0            5
0            5
Second Scenario (when my Conversions Column starts with 0):
Conversions  New Column (The Sequence)
0            1
0            1
0            1
1            1
0            2
1            2
1            2
1            2
0            3
0            3
1            3
0            4
1            4
1            4
0            5
1            5
1            5
Thanks

library(dplyr)

# Example data for both scenarios, with the expected sequence column included
dt1 <- tibble(
  conversion = c(1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0),
  sequence = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5),
  id = 1:17
)
dt2 <- tibble(
  conversion = c(0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1),
  sequence = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  id = 1:17
)

build_seq <- function(df) {
  df %>%
    mutate(
      # mark the rows where conversion drops from 1 to 0 (start of a new zero run)
      new_col = ifelse((conversion - lag(conversion, 1)) == -1, id, NA),
      # renumber those markers 1, 2, 3, ...
      new_col = as.numeric(as.factor(new_col))
    ) %>%
    # carry each marker forward until the next one
    tidyr::fill(new_col, .direction = "down") %>%
    mutate(
      # rows before the first marker belong to sequence 1
      new_col = ifelse(is.na(new_col), 1, new_col + 1)
    )
}

new_dt1 <- build_seq(dt1)
new_dt2 <- build_seq(dt2)

# both checks return TRUE
all(new_dt1$new_col == new_dt1$sequence)
all(new_dt2$new_col == new_dt2$sequence)
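For a quick cross-check of the same rule (a new sequence starts whenever a 0 directly follows a 1), here is an illustrative pandas sketch; it is not the asker's R setup, just an assumption-level translation of the logic:

import pandas as pd

# hypothetical cross-check, not the original R code:
# a new sequence starts whenever a 0 directly follows a 1
conversions = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0]
s = pd.Series(conversions)
sequence = ((s == 0) & (s.shift(fill_value=0) == 1)).cumsum() + 1
print(sequence.tolist())
# [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5]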

Related

Add a new column if the index in the other 2 columns is the same

I want to add a new index in a new column e when b and c are the same.
At the same time, I need to respect the limit sum(d) <= 20: if the total d for rows with the same b and c exceeds 20, a new index should be given.
The example input data is below:
a  b  c   d
0  0  2   9
1  2  1  10
2  1  0   9
3  1  0  11
4  2  1   9
5  0  1  15
6  2  0   9
7  1  0   8
I sort by b and c first to make the comparison easier, but then I get a key error: KeyError: 0, temporary_size += df.loc[df[i], 'd']
I hope to get a result like this:
a  b  c   d  e
5  0  1  15  1
0  0  2   9  2
2  1  0   9  3
3  1  0  11  3
7  1  0   8  4
6  2  0   9  5
1  2  1  10  6
4  2  1   9  6
and here is my code:
import pandas as pd

d = {'a': [0, 1, 2, 3, 4, 5, 6, 7], 'b': [0, 2, 1, 1, 2, 0, 2, 1], 'c': [2, 1, 0, 0, 1, 1, 0, 0], 'd': [9, 10, 9, 11, 9, 15, 9, 8]}
df = pd.DataFrame(data=d)
print(df)

df.sort_values(['b', 'c'], ascending=[True, True], inplace=True, ignore_index=True)
e_id = 0
total_size = 20
temporary_size = 0
for i in range(0, len(df.index)-1):
    if df.loc[i, 'b'] == df.loc[i+1, 'b'] and df.loc[i, 'c'] != df.loc[i+1, 'c']:
        temporary_size = temporary_size + df.loc[i, 'd']
        if temporary_size <= total_size:
            df.loc['e', i] = e_id
        else:
            df.loc[i, 'e'] = e_id
            temporary_size = temporary_size + df.loc[i, 'd']
            e_id += 1
    else:
        df.loc[i, 'e'] = e_id
        temporary_size = temporary_size + df.loc[i, 'd']
print(df)
Finally, I can't get the column e in my dataframe.
THANKS FOR ALL!
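One possible approach (a sketch, not necessarily what the asker ended up with): walk the sorted rows once and keep a running total of d, starting a new index e whenever the (b, c) pair changes or adding the current row would push the total past the 20 limit. Variable names below are illustrative.

import pandas as pd

d = {'a': [0, 1, 2, 3, 4, 5, 6, 7],
     'b': [0, 2, 1, 1, 2, 0, 2, 1],
     'c': [2, 1, 0, 0, 1, 1, 0, 0],
     'd': [9, 10, 9, 11, 9, 15, 9, 8]}
df = pd.DataFrame(d).sort_values(['b', 'c'], ignore_index=True)

limit = 20        # maximum total of d allowed per index
e_id = 0          # running value for the new column e
running = 0       # cumulative d within the current index
prev_key = None   # (b, c) pair of the previous row
e_values = []

for row in df.itertuples(index=False):
    key = (row.b, row.c)
    # start a new index when the (b, c) pair changes, or when adding
    # this row's d would exceed the limit
    if key != prev_key or running + row.d > limit:
        e_id += 1
        running = 0
    running += row.d
    e_values.append(e_id)
    prev_key = key

df['e'] = e_values
print(df)

On the sample data this reproduces the expected e column (1, 2, 3, 3, 4, 5, 6, 6 in sorted order).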

Change every n-th element of a row in a 2d numpy array depending on the row number

I have a 2d array:
H = 12
a = np.ones([H, H])
print(a.astype(int))
[[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1]]
The goal is, for every row r, to substitute every (r+1)-th element of that row (starting with the 0th) with 0.
Namely, for the 0th row substitute every 'first' (i.e. all of them) element with 0. For the 1st row substitute every 2nd element with 0. And so on.
It can trivially be done in a loop (the printed array is the desired output):
for i in np.arange(H):
    a[i, ::i+1] = 0
print(a.astype(int))
[[0 0 0 0 0 0 0 0 0 0 0 0]
[0 1 0 1 0 1 0 1 0 1 0 1]
[0 1 1 0 1 1 0 1 1 0 1 1]
[0 1 1 1 0 1 1 1 0 1 1 1]
[0 1 1 1 1 0 1 1 1 1 0 1]
[0 1 1 1 1 1 0 1 1 1 1 1]
[0 1 1 1 1 1 1 0 1 1 1 1]
[0 1 1 1 1 1 1 1 0 1 1 1]
[0 1 1 1 1 1 1 1 1 0 1 1]
[0 1 1 1 1 1 1 1 1 1 0 1]
[0 1 1 1 1 1 1 1 1 1 1 0]
[0 1 1 1 1 1 1 1 1 1 1 1]]
Can I make use of the vectorisation power of numpy here and avoid looping, or is that not possible?
You can use np.arange and broadcast modulo over itself:
import numpy as np
H = 12
a = np.arange(H)
((a % (a+1)[:, None]) != 0).astype('int')
Output
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
[0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1],
[0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
[0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
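If the goal is to zero out entries of an existing array in place rather than build a fresh 0/1 matrix, the same modulo mask can be used as a boolean index (a small sketch along the same lines):

import numpy as np

H = 12
a = np.ones((H, H))
idx = np.arange(H)
# element (r, c) is zeroed when column c is a multiple of r + 1
mask = (idx % (idx + 1)[:, None]) == 0
a[mask] = 0
print(a.astype(int))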

Vectorised slicing of multiple rows in ndarray at arbitrary positions

I often find myself holding an array not of indices, but of index bounds that effectively define multiple slices. A representative example is
import numpy as np
rand = np.random.default_rng(seed=0)
sample = rand.integers(low=0, high=10, size=(10, 10))
y, x = np.mgrid[:10, :10]
bad_starts = rand.integers(low=0, high=10, size=(10, 1))
print(bad_starts)
sample[
    (x >= bad_starts) & (y < 5)
] = -1
print(sample)
[[4]
[7]
[3]
[2]
[7]
[8]
[0]
[0]
[6]
[3]]
[[ 8 6 5 2 -1 -1 -1 -1 -1 -1]
[ 6 9 5 6 9 7 6 -1 -1 -1]
[ 2 8 6 -1 -1 -1 -1 -1 -1 -1]
[ 8 1 -1 -1 -1 -1 -1 -1 -1 -1]
[ 4 0 0 1 0 6 5 -1 -1 -1]
[ 7 3 4 9 8 9 3 6 9 6]
[ 8 6 7 3 8 1 5 7 8 5]
[ 3 3 4 4 7 8 0 9 5 3]
[ 6 5 2 3 7 5 5 3 7 3]
[ 3 8 2 2 7 6 0 0 3 8]]
Is there a simpler way to accomplish the same thing with slices alone, avoiding having to call mgrid and avoiding an entire boolean predicate matrix?
With ogrid you get a 'sparse' grid:
In [488]: y,x
Out[488]:
(array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]]),
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]))
The mask is the same: (x >= bad_starts) & (y < 5)
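For completeness, here is a sketch of the same masking written with ogrid instead of mgrid, so only two broadcastable vectors are created rather than two dense 10x10 grids (same data setup as in the question):

import numpy as np

rand = np.random.default_rng(seed=0)
sample = rand.integers(low=0, high=10, size=(10, 10))
bad_starts = rand.integers(low=0, high=10, size=(10, 1))

# ogrid returns a (10, 1) column and a (1, 10) row that broadcast together
y, x = np.ogrid[:10, :10]
sample[(x >= bad_starts) & (y < 5)] = -1
print(sample)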
A single value for each row can be fetched (or set) with:
In [491]: sample[np.arange(5)[:,None],bad_starts[:5]]
Out[491]:
array([[-1],
[-1],
[-1],
[-1],
[-1]])
But there isn't a way of accessing all the -1 entries with simple slicing. Each row has a different-length slice:
In [492]: [sample[i,bad_starts[i,0]:] for i in range(5)]
Out[492]:
[array([-1, -1, -1, -1, -1, -1]),
array([-1, -1, -1]),
array([-1, -1, -1, -1, -1, -1, -1]),
array([-1, -1, -1, -1, -1, -1, -1, -1]),
array([-1, -1, -1])]
There isn't a way to access all with one slice.
The equivalent 'advanced indexing' arrays are:
In [494]: np.nonzero((x >= bad_starts) & (y < 5))
Out[494]:
(array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4]),
array([4, 5, 6, 7, 8, 9, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5, 6, 7,
8, 9, 7, 8, 9]))

Efficient way to compare values between two cells and assign a value based on a condition in NumPy

The objective is to count how often two nodes have the same value.
Say, for example, we have a vector
pd.DataFrame([0,4,1,1,1],index=['A','B','C','D','E'])
as below
0
A 0
B 4
C 1
D 1
E 1
The element Nij is equal to 1 if nodes i and j have the same value and zero otherwise.
N is then
A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 1 1
D 0 0 1 1 1
E 0 0 1 1 1
This simple example can be extended to 2D. For example, here is an array of shape (4, 5):
A B C D E
0 0 0 0 0 0
1 0 4 1 1 1
2 0 1 1 2 2
3 0 3 2 2 2
Similarly, we go row-wise and set the element Nij to 1 if nodes i and j have the same value and to zero otherwise. At every row iteration, we add the resulting matrix to a running total.
The frequency is then equal to
A B C D E
A 4.0 1.0 1.0 1.0 1.0
B 1.0 4.0 2.0 1.0 1.0
C 1.0 2.0 4.0 3.0 3.0
D 1.0 1.0 3.0 4.0 4.0
E 1.0 1.0 3.0 4.0 4.0
Based on this, the following code is proposed. However, the current implementation uses 3 for-loops and an if-else statement.
I am curious whether the code below can be enhanced further, or whether there is a built-in method within Pandas or NumPy that can be used to achieve the same objective.
import numpy as np

arr = [[0, 0, 0, 0, 0],
       [0, 4, 1, 1, 1],
       [0, 1, 1, 2, 2],
       [0, 3, 2, 2, 2]]
arr = np.array(arr)

# number of rows
npart = len(arr[:, 0])
# number of columns
m = len(arr[0, :])

X = np.zeros(shape=(m, m), dtype=np.double)
for i in range(npart):
    for k in range(m):
        for p in range(m):
            # check whether the pair have the same value or not
            if arr[i, k] == arr[i, p]:
                X[k, p] = X[k, p] + 1
            else:
                X[k, p] = X[k, p] + 0
Output:
4.00000,1.00000,1.00000,1.00000,1.00000
1.00000,4.00000,2.00000,1.00000,1.00000
1.00000,2.00000,4.00000,3.00000,3.00000
1.00000,1.00000,3.00000,4.00000,4.00000
1.00000,1.00000,3.00000,4.00000,4.00000
P.S. The index A, B, C, D, E and the use of pandas are for clarification purposes only.
With numpy, you can use broadcasting:
1D
a = np.array([0,4,1,1,1])
(a==a[:, None])*1
output:
array([[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0],
[0, 0, 1, 1, 1],
[0, 0, 1, 1, 1],
[0, 0, 1, 1, 1]])
2D
a = np.array([[0, 0, 0, 0, 0],
              [0, 4, 1, 1, 1],
              [0, 1, 1, 2, 2],
              [0, 3, 2, 2, 2]])
(a.T == a.T[:,None]).sum(2)
output:
array([[4, 1, 1, 1, 1],
[1, 4, 2, 1, 1],
[1, 2, 4, 3, 3],
[1, 1, 3, 4, 4],
[1, 1, 3, 4, 4]])
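If the A–E labels are wanted back for display, the result can be wrapped in a DataFrame; this is just a presentational sketch on top of the broadcasting answer above:

import numpy as np
import pandas as pd

a = np.array([[0, 0, 0, 0, 0],
              [0, 4, 1, 1, 1],
              [0, 1, 1, 2, 2],
              [0, 3, 2, 2, 2]])
labels = list("ABCDE")
freq = (a.T == a.T[:, None]).sum(2)
print(pd.DataFrame(freq, index=labels, columns=labels))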

Pandas index clause across multiple columns in a multi-column header

I have a data frame with multi-column headers.
import pandas as pd
headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], columns = headers)
f
A B
u v x y
0 1 1 0 1
1 1 0 0 0
2 0 0 1 1
3 1 0 1 0
I want to select the rows in which any of the A columns or any of the B columns are true.
I can do so explicitly.
f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
A B
u v x y
0 1 1 0 1
1 1 0 0 0
3 1 0 1 0
f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
A B
u v x y
0 1 1 0 1
2 0 0 1 1
3 1 0 1 0
I want to write a function select(f, top_level_name) where the indexing clause applies to all the columns under the same top level name such that
select(f, "A") == f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
select(f, "B") == f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
I want this function to work with arbitrary numbers of sub-columns with arbitrary names.
How do I write select?
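One way to write it (a sketch, assuming the intent is "any sub-column under the given top-level name is true", as in the explicit examples above):

import pandas as pd

def select(f, top_level_name):
    # take every sub-column under the top-level name, cast to bool,
    # and keep the rows where at least one of them is true
    block = f[top_level_name].astype(bool)
    return f[block.any(axis=1)]

With the frame above, select(f, "A") returns rows 0, 1 and 3, and select(f, "B") returns rows 0, 2 and 3, matching the explicit expressions.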