Here is my sample data, input and expected output:
df=pd.DataFrame({'A_flag': [1, 1,1], 'B_flag': [1, 1,0],'C_flag': [0, 1,0],'A_value': [5, 3,7], 'B_value': [2, 7,4],'C_value': [4, 2,5]})
df1=pd.DataFrame({'A_flag': [1, 1,1], 'B_flag': [1, 1,0],'C_flag': [0, 1,0],'A_value': [5, 3,7], 'B_value': [2, 7,4],'C_value': [4, 2,5], 'Final':[3.5,3,7]})
I want to generate another column called 'Final' conditional on A_flag, B_flag and C_flag:
(a) If all three flags equal 1, then 'Final' = median of (A_value, B_value, C_value)
(b) If exactly two flags equal 1, then 'Final' = mean of those two values
(c) If exactly one flag equals 1, then 'Final' = that one value
For example, in row 1, A_flag=1 and B_flag=1, so 'Final' = (A_value+B_value)/2 = (5+2)/2 = 3.5
In row 2, all three flags are 1, so 'Final' = median of (3, 7, 2) = 3
In row 3, only A_flag=1, so 'Final' = A_value = 7
I tried the following:
df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==3, "Final"]= df[['A_flag','B_flag','C_flag']].median(axis=1)
df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==2, "Final"]=
df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==1, "Final"]=
I don't know how to subset the value columns for the second and third scenarios.
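Assuming the order of the flag and value columns matches, you can first filter the flag-like and value-like columns, then mask the entries in the value columns where the flag is 0, and finally calculate the median along axis=1. A single median call is enough because the median of two values is their mean and the median of one value is the value itself.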
flag = df.filter(like='_flag')
value = df.filter(like='_value')
df['median'] = value.mask(flag.eq(0).to_numpy()).median(1)
   A_flag  B_flag  C_flag  A_value  B_value  C_value  median
0       1       1       0        5        2        4     3.5
1       1       1       1        3        7        2     3.0
2       1       0       0        7        4        5     7.0
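As a quick sanity check of the masked-median idea, here is a minimal sketch on hand-written rows that mirror the sample data. The array `masked` is an assumption made for illustration: NaN stands in for a value whose flag is 0, which is what `mask` produces.

```python
import numpy as np

# NaN marks a value whose flag is 0 (hand-written to mirror the sample rows)
masked = np.array([
    [5.0, 2.0, np.nan],     # two flags set: median equals the mean of 5 and 2
    [3.0, 7.0, 2.0],        # three flags set: ordinary median
    [7.0, np.nan, np.nan],  # one flag set: median is the value itself
])

# nanmedian ignores the NaN entries in each row
final = np.nanmedian(masked, axis=1)
print(final)
```

The printed values should match the 'Final' column requested in the question.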
When you need to apply custom logic to a dataframe, the easiest way is usually to define a function and then apply it to the dataframe, either column by column or row by row. I think in your case this might work:
import pandas as pd
df = pd.DataFrame(
    {
        "A_flag": [1, 1, 1],
        "B_flag": [1, 1, 0],
        "C_flag": [0, 1, 0],
        "A_value": [5, 3, 7],
        "B_value": [2, 7, 4],
        "C_value": [4, 2, 5],
    }
)
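from statistics import median

def make_final_column(row):
    # Pair each flag with its value
    pairs = [(row['A_flag'], row['A_value']), (row['B_flag'], row['B_value']), (row['C_flag'], row['C_value'])]
    # Keep only the values whose flag is 1
    met_condition = [value for flag, value in pairs if flag == 1]
    # The median of two values is their mean; of one value, the value itself
    return median(met_condition)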
df["Final"] = df.apply(make_final_column, axis=1)
df
With numpy:
import numpy as np
flags = df[["A_flag", "B_flag", "C_flag"]].to_numpy()
values = df[["A_value", "B_value", "C_value"]].to_numpy()
# Sort each row so that the 0 flags appear first
index = np.argsort(flags)
flags = np.take_along_axis(flags, index, axis=1)
# Rearrange the values to match the flags
values = np.take_along_axis(values, index, axis=1)
# Result
np.select(
    [
        flags[:, 0] == 1,  # when all flags are 1
        flags[:, 1] == 1,  # when two flags are 1
        flags[:, 2] == 1,  # when one flag is 1
    ],
    [
        np.quantile(values, 0.5, axis=1),  # median of all 3 values
        np.mean(values[:, -2:], axis=1),   # mean of the two 1-flag values
        values[:, 2],                      # value of the single 1-flag
    ],
    default=np.nan
)
Quite interesting solutions already. I have used a masked approach.
Explanation:
Since the flags are already given, you can pick out the relevant values simply by multiplying each value by its flag. Then mask the entries that are zero in each row and take the median along the axis. Note that this also masks a genuine value of 0 whose flag is 1.
>>> import numpy as np
>>> t_arr = np.array((df.A_flag * df.A_value, df.B_flag * df.B_value, df.C_flag * df.C_value)).T
>>> maskArr = np.ma.masked_array(t_arr, mask=t_arr == 0)
>>> df["Final"] = np.ma.median(maskArr, axis=1)
>>> df
   A_flag  B_flag  C_flag  A_value  B_value  C_value  Final
0       1       1       0        5        2        4    3.5
1       1       1       1        3        7        2    3.0
2       1       0       0        7        4        5    7.0
I am looking for an efficient way to compute, as an ndarray, the indices of the elements that fall into each bin of np.bincount.
To illustrate:
>>> x = np.array([0, 1, 1, 0, 2])
>>> b = np.bincount(x)
>>> b
[2 2 1]
I am now looking for an ndarray that represents the indices of the elements of each bin:
[0 3 1 2 4]
I am looking for a fast numpy solution that does not contain loops. Does anyone know how to implement this? Thanks very much in advance!
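One possible approach, sketched here rather than offered as the definitive answer: a stable argsort of `x` lists the indices grouped by bin, with the original order preserved inside each bin. The names `order` and `per_bin` are illustrative and not part of the question.

```python
import numpy as np

x = np.array([0, 1, 1, 0, 2])

# A stable sort keeps equal values in their original order,
# so indices come out grouped by bin and ascending within each bin.
order = np.argsort(x, kind="stable")
print(order)

# Optional: cut the flat index array into one array per bin,
# using the bincount totals as split points.
counts = np.bincount(x)
per_bin = np.split(order, np.cumsum(counts)[:-1])
print(per_bin)
```

The `kind="stable"` argument matters here: the default sort makes no promise about the relative order of equal elements.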
I start with a pd dataframe:
Node,prob
0 0 ,0.0035
1 1 ,0.0070
2 2 ,0.0025
3 3 ,0.0005
4 4 ,0.0105
5 5 ,0.0015
6 6 ,0.0085
7 7 ,0.0055
8 8 ,0.0060
9 9 ,0.0030
I have indices (nodes) for which I need the values (probs). The indices are:
array([0, 2, 4, 8, 9, 5, 3, 1])
I convert the dataframe to a dictionary and run a loop, which almost gets the desired result.
I've tried all the to_dict orient options; "index" and "records" were the ones that worked best:
nodes2 = nodes.to_dict("index")
for i in indices:
    print(nodes2[i])
Result:
{'Node,prob': '0,0.0035'}
{'Node,prob': '2,0.0025'}
{'Node,prob': '4,0.0105'}
{'Node,prob': '8,0.0060'}
{'Node,prob': '9,0.0030'}
{'Node,prob': '5,0.0015'}
{'Node,prob': '3,0.0005'}
{'Node,prob': '1,0.0070'}
Ideally, I would get a numpy array built from the values of the above dictionary, as printed below. How do I extract this from the dictionary? Hope this isn't too confusing! Thanks in advance.
[
[0 0.0035]
[2 0.0025]
[4 0.0105]
[8 0.0060]
[9 0.0030]
[5 0.0015]
[3 0.0005]
[1 0.0070]
]
If your example output is correct, you can split each string on the comma and convert both parts to float:
nodes2 = nodes.to_dict("index")
new_list = [list(map(float, nodes2[i]['Node,prob'].split(','))) for i in indices]
print(new_list)
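The dictionary round trip can probably be skipped altogether. Below is a minimal sketch that assumes, based on the printed result in the question, that `nodes` has a single string column named "Node,prob"; the frame is rebuilt by hand here, so treat it as a hypothetical reconstruction.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the frame from the question:
# one string column named "Node,prob"
nodes = pd.DataFrame({"Node,prob": [
    "0,0.0035", "1,0.0070", "2,0.0025", "3,0.0005", "4,0.0105",
    "5,0.0015", "6,0.0085", "7,0.0055", "8,0.0060", "9,0.0030",
]})
indices = np.array([0, 2, 4, 8, 9, 5, 3, 1])

# Split each string into two columns and convert both to float
parts = nodes["Node,prob"].str.split(",", expand=True).astype(float)

# Select the rows in the requested order and convert to a numpy array
result = parts.loc[indices].to_numpy()
print(result)
```

If the CSV was simply read with the wrong separator, fixing the read step so that Node and prob become separate columns would make the split unnecessary.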
I have one array whose values should be averaged up to the day given as a value in another array. The first array has 365 days along its first axis, and the second array holds Julian dates, ranging from 0 to 365, up to which the values from the first array should be averaged.
array1.shape = (365, 375, 700)
array2.shape = (375, 700)
The resultant array naturally will have the same shape as the second array that is used for averaging the first array. Is there an easy way to do this? Maybe with some for loops or with vectorization/broadcasting?
Thanks in advance!
You can use numpy.cumsum to calculate the cumulative sum along axis=0. Taking the entry at a given index and dividing it by that index plus one gives the average up to and including that index.
import numpy as np
def averages(a, b):
    return a.cumsum(axis=0)[
        b.ravel(),
        np.repeat(np.arange(b.shape[0]), b.shape[1]),
        np.tile(np.arange(b.shape[1]), b.shape[0]),
    ].reshape(b.shape) / (b + 1)
a = np.arange(12).reshape(3, 2, 2)
b = np.array([[0, 1], [1, 2]])
print(a)
# [[[ 0 1]
# [ 2 3]]
# [[ 4 5]
# [ 6 7]]
# [[ 8 9]
# [10 11]]]
print(b)
# [[0 1]
# [1 2]]
print(averages(a, b))
# [[0. 3.]
# [4. 7.]]
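The same lookup can also be written with np.take_along_axis, which avoids building the repeat/tile index arrays by hand. This is a sketch under the assumption that `b` holds integer indices strictly smaller than `a.shape[0]`; the function name `averages_take` is made up for this example.

```python
import numpy as np

def averages_take(a, b):
    # Running sum over the day axis
    csum = a.cumsum(axis=0)
    # For each cell (i, j), pick csum[b[i, j], i, j]; b gets a leading axis
    # so that it has the same number of dimensions as csum
    picked = np.take_along_axis(csum, b[np.newaxis, ...], axis=0)[0]
    # Index b covers b + 1 days, hence the divisor
    return picked / (b + 1)

a = np.arange(12).reshape(3, 2, 2)
b = np.array([[0, 1], [1, 2]])
result = averages_take(a, b)
print(result)
```

With a 365-day first axis, a date value of 365 would be out of bounds for either version, so such entries would need clipping or an extra day of data.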
I have a matrix:
Params =
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
For each row I want to select some elements using column indices:
col_indices =
[[0 1]
[1 2]
[2 3]]
In Numpy, I can create row indices:
row_indices =
[[0 0]
[1 1]
[2 2]]
and do params[row_indices, col_indices]
In TensorFlow, I did this:
tf_params = tf.constant(params)
tf_col_indices = tf.constant(col_indices, dtype=tf.int32)
tf_row_indices = tf.constant(row_indices, dtype=tf.int32)
tf_params[row_indices, col_indices]
But this raised an error:
ValueError: Shape must be rank 1 but is rank 3
What does it mean? How should I do this kind of indexing properly?
Thanks!
Tensor rank (sometimes referred to as order or degree or n-dimension) is the number of dimensions of the tensor. For example, the following tensor (defined as a Python list) has a rank of 2:
t = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
A rank two tensor is what we typically think of as a matrix; a rank one tensor is a vector. For a rank two tensor you can access any element with the syntax t[i, j]. For a rank three tensor you would need to address an element with t[i, j, k]. See this for more details.
ValueError: Shape must be rank 1 but is rank 3 means you are trying to create a 3-tensor (cube of numbers) instead of a vector.
To see how you can declare tensor constants of different shape, you can see this.
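As for the indexing itself, here is a hedged sketch of two ways this kind of per-row selection is commonly done in TensorFlow 2.x: tf.gather_nd with explicit (row, column) pairs, or tf.gather with batch_dims=1. The variable names `pairs`, `picked` and `picked_batch` are illustrative.

```python
import tensorflow as tf

params = tf.constant([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])
col_indices = tf.constant([[0, 1], [1, 2], [2, 3]], dtype=tf.int32)
row_indices = tf.constant([[0, 0], [1, 1], [2, 2]], dtype=tf.int32)

# Pair each row index with its column index: shape (3, 2, 2),
# where the last axis holds one (row, col) coordinate
pairs = tf.stack([row_indices, col_indices], axis=-1)
picked = tf.gather_nd(params, pairs)
print(picked.numpy())

# Alternative without explicit row indices: treat axis 0 as a batch axis
# and gather columns independently for each row
picked_batch = tf.gather(params, col_indices, batch_dims=1)
print(picked_batch.numpy())
```

Both calls should select the same elements as params[row_indices, col_indices] does in NumPy.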