How to create a column with the name of the largest among other columns AND some condition - pandas

I have a dataset with some columns, and I want to create another column whose values are the name of the column with the highest value BUT different from 1.
For Example:
import pandas as pd

df = pd.DataFrame({'A': [1, 0.2, 0.1, 0],
                   'B': [0.2, 1, 0, 0.5],
                   'C': [1, 0.4, 0.3, 1]},
                  index=['1', '2', '3', '4'])
df
index    A    B    C
1      1.0  0.2  1.0
2      0.2  1.0  0.4
3      0.1  0.0  0.3
4      0.0  0.5  1.0
Should give an output like
index    A    B    C  NEWCOL
1      1.0  0.2  1.0       B
2      0.2  0.3  0.1       C
3      0.1  0.4  0.2       B
4      0.0  0.5  1.0       B
df2['newcol'] = df2.idxmax(axis=1) if df2.max(index=1) != 1
but it didn't work

Here is one way to do it:
# filter out the data that is 1 and find the id of the max value using idxmax
df['newcol'] = df[~df.isin([1])].idxmax(axis=1)
df
A B C newcol
1 1.0 0.2 1.0 B
2 0.2 1.0 0.4 C
3 0.1 0.0 0.3 C
4 0.0 0.5 1.0 B
PS: your input (starting) data and expected output don't match. The above is based on the input DataFrame.
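An equivalent sketch using mask, if you prefer it over boolean indexing (same df as above; turning the 1s into NaN lets idxmax skip them):
df['newcol'] = df.mask(df.eq(1)).idxmax(axis=1)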

Related

groupby shows unobserved values of non-categorical columns

I created this simple example to illustrate my issue:
x = pd.DataFrame({"int_var1": range(3),
                  "int_var2": range(3, 6),
                  "cat_var": pd.Categorical(["a", "b", "a"]),
                  "value": [0.1, 0.2, 0.3]})
it yields this DataFrame:
   int_var1  int_var2 cat_var  value
0         0         3       a    0.1
1         1         4       b    0.2
2         2         5       a    0.3
where the first two columns are integers, the third column is categorical with two levels, and the fourth column is floats. The issue is that when I try to use groupby followed by agg it seems I only have two options, either I can show no unobserved values like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = True).agg({"value": "sum"}).fillna(0)
                           value
int_var1 int_var2 cat_var
0        3        a         0.1
1        4        b         0.2
2        5        a         0.3
or I can show unobserved values for all grouping variables like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = False).agg({"value": "sum"}).fillna(0)
                           value
int_var1 int_var2 cat_var
0        3        a         0.1
                  b         0.0
         4        a         0.0
                  b         0.0
         5        a         0.0
                  b         0.0
1        3        a         0.0
                  b         0.0
         4        a         0.0
                  b         0.2
         5        a         0.0
                  b         0.0
2        3        a         0.0
                  b         0.0
         4        a         0.0
                  b         0.0
         5        a         0.3
                  b         0.0
Is there a way to show unobserved values for the categorical variables only and not every possible permutation of all grouping variables?
You can unstack the level of interest, cat_var in this case:
(x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
   .agg({'value': 'sum'})
   .unstack('cat_var', fill_value=0)
)
Output:
                    value
cat_var                 a    b
int_var1 int_var2
0        3            0.1  0.0
1        4            0.0  0.2
2        5            0.3  0.0
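If the long layout is preferred, a possible follow-up sketch is to stack cat_var back after filling, which keeps zeros for the unobserved categories without expanding the integer groupers (note that stack's defaults are changing in newer pandas releases, so treat this as a sketch):
wide = (x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
          .agg({'value': 'sum'})
          .unstack('cat_var', fill_value=0))
long_form = wide.stack('cat_var')   # zeros for unobserved categories, only observed int_var pairs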

How to select row index and variable name based on value in a data frame?

I have a large data frame made of float numbers between -1.0 and 1.0. I would like to create a new list containing the index rows, the variable names and the values for all the cells having a number higher than 0.59.
Here is an example:
A B C D ... FD
0 0.34 -0.23 0.6 0.7 ... 0.3
1 -0.5 0.99 0.8 0.2 ... 0.8
...
45 0.8 0.13 0.34 0.4 ... -0.9
output:
0 C 0.6
0 D 0.7
1 B 0.99
1 C 0.8
...
1 FD 0.8
etc..
Thanks!
I am sure there must be a better solution than mine, as mine has awful performance (iterating cell by cell). But here is my attempt:
import numpy as np
import pandas as pd

# creating a sample df
df = pd.DataFrame(np.random.uniform(-1, 1, size=(10, 4)), columns=list('abcd'))

new_list = []
for tup in df.itertuples():
    for i in range(1, len(tup)):
        if tup[i] > 0.59:
            new_list.append([tup.Index, df.columns[i - 1], tup[i]])

new_df = pd.DataFrame(new_list, columns=['index', 'column', 'value'])
new_df = new_df.set_index('index')
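A vectorized sketch that avoids the cell-by-cell loop (same sample df and the 0.59 threshold): mask values at or below the threshold, stack to long format, and drop the NaNs.
new_df = (df.where(df > 0.59)     # values <= 0.59 become NaN
            .stack()              # long format: (row, column) -> value
            .dropna()
            .reset_index())
new_df.columns = ['index', 'column', 'value']
new_df = new_df.set_index('index')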

Excel SUMPRODUCT function in pandas dataframes

OK, as a Python beginner I have found that matrix multiplication in pandas DataFrames is very difficult to carry out.
I have two tables that look like:
df1
Id lifetime 0 1 2 3 4 5 .... 30
0 1 4 0.1 0.2 0.1 0.4 0.5 0.4... 0.2
1 2 7 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
2 3 8 0.5 0.2 0.1 0.4 0.5 0.4... 0.6
.......
9 6 10 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
df2
Group lifetime 0 1 2 3 4 5 .... 30
0 2 4 0.9 0.8 0.9 0.8 0.8 0.8... 0.9
1 2 7 0.8 0.9 0.9 0.9 0.8 0.8... 0.9
2 3 8 0.9 0.7 0.8 0.8 0.9 0.9... 0.9
.......
9 5 10 0.8 0.9 0.7 0.7 0.9 0.7... 0.9
I want to perform Excel's SUMPRODUCT function in my code, where the number of columns to be summed depends on the lifetime value in each row of both dfs, e.g.:
for row 0 in df1 & df2, lifetime=4:
    sumproduct(df1 row 0 from column 0 to column 3,
               df2 row 0 from column 0 to column 3)
for row 1 in df1 & df2, lifetime=7:
    sumproduct(df1 row 1 from column 0 to column 6,
               df2 row 1 from column 0 to column 6)
.......
How can I do this?
You can use .iloc to access rows and columns by integer position.
The row where lifetime == 4 is row 0, and counting column positions from Id, the column labeled 0 sits at position 2 and the column labeled 3 sits at position 5, so to get that interval you would slice 2:6.
Once you have the correct data from both data frames with .iloc[0, 2:6], run np.dot.
See below:
import numpy as np
np.dot(df1.iloc[0, 2:6], df2.iloc[0, 2:6])
Just to make sure you have the right data, try just running
df1.iloc[0,2:6]
Then try the np.dot product. You can read up on "pandas iloc" and "slicing" for more info.
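To apply this across every row, a loop-based sketch (assuming the value columns are labeled 0..30, start at position 2 in both frames, and df1/df2 line up row by row):
import numpy as np

out = []
for i in range(len(df1)):
    n = int(df1.iloc[i]['lifetime'])          # how many columns this row uses
    a = df1.iloc[i, 2:2 + n].astype(float)
    b = df2.iloc[i, 2:2 + n].astype(float)
    out.append(np.dot(a, b))
df1['sumproduct'] = out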

Grouping by and applying lambda with condition for the first row - Pandas

I have a data frame with IDs, and choices that have been made by those IDs.
The alternatives (choices) set is a list of integers: [10, 20, 30, 40].
Note: it's important to use this list. Let's call it 'choice_list'.
This is the data frame:
ID Choice
1 10
1 30
1 10
2 40
2 40
2 40
3 20
3 40
3 10
I want to create a variable for each alternative: '10_Var', '20_Var', '30_Var', '40_Var'.
At the first row of each ID, if the first choice was '10', for example, then the variable '10_Var' gets the value 0.6 (some parameter), and each of the other variables ('20_Var', '30_Var', '40_Var') gets the value (1 - 0.6) / 4.
The number 4 stands for the number of alternatives.
Expected result:
ID Choice 10_Var 20_Var 30_Var 40_Var
1 10 0.6 0.1 0.1 0.1
1 30
1 10
2 40 0.1 0.1 0.1 0.6
2 40
2 40
3 20 0.1 0.6 0.1 0.1
3 40
3 10
You can use np.where to do this. It is more efficient than df.where.
import numpy as np
import pandas as pd

df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40],
                   ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
choices = np.unique(df.Choice)
for choice in choices:
    df[f"var_{choice}"] = np.where(df.Choice == choice, 0.6, (1 - 0.6) / 4)
df
Result
ID Choice var_10 var_20 var_30 var_40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 0.1 0.1 0.6 0.1
2 1 10 0.6 0.1 0.1 0.1
3 2 40 0.1 0.1 0.1 0.6
4 2 40 0.1 0.1 0.1 0.6
5 2 40 0.1 0.1 0.1 0.6
6 3 20 0.1 0.6 0.1 0.1
7 3 40 0.1 0.1 0.1 0.6
8 3 10 0.6 0.1 0.1 0.1
Edit
To set values to 1st row of group only
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40],
                   ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
df = df.set_index("ID")
## create unique index for each row if not already
df = df.reset_index()
choices = np.unique(df.Choice)
## get unique id of 1st row of each group
grouped = df.loc[df.reset_index().groupby("ID")["index"].first()]
## set value for each new variable
for choice in choices:
    grouped[f"var_{choice}"] = np.where(grouped.Choice == choice, 0.6, (1 - 0.6) / 4)
pd.concat([df, grouped.iloc[:, -len(choices):]], axis=1)
We can use insert to create the new columns based on the unique Choice values obtained through Series.unique. We can also create a mask to fill only the first row of each group using np.where.
At the beginning, sort_values is used to sort the values based on the ID. You can skip this step if your data frame is already well sorted (like the one shown in the example):
df = df.sort_values('ID')
n = df['Choice'].nunique()
mask = df['ID'].ne(df['ID'].shift())
for choice in df['Choice'].sort_values(ascending=False).unique():
    df.insert(2, column=f'{choice}_Var', value=np.nan)
    df.loc[mask, f'{choice}_Var'] = np.where(df.loc[mask, 'Choice'].eq(choice), 0.6, 0.4 / n)
print(df)
ID Choice 10_Var 20_Var 30_Var 40_Var
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN
A mix of numpy and pandas solution:
rows = np.unique(df.ID.values, return_index=1)[1]
df1 = df.loc[rows].assign(val=0.6)
df2 = (pd.crosstab([df1.index, df1.ID, df1.Choice], df1.Choice, df1.val, aggfunc='first')
         .reindex(choice_list, axis=1)
         .fillna((1 - 0.6) / len(choice_list))
         .reset_index(level=[1, 2], drop=True))
pd.concat([df, df2], axis=1)
Out[217]:
ID Choice 10 20 30 40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN
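One more sketch, in case a get_dummies-based version reads more naturally; it assumes choice_list = [10, 20, 30, 40] and the 0.6 parameter from the question:
import numpy as np
import pandas as pd

first = df.groupby('ID').cumcount().eq(0)                    # True on the first row of each ID
vals = (pd.get_dummies(df['Choice'])
          .reindex(columns=choice_list, fill_value=0)
          .astype(float))
vals = vals * (0.6 - (1 - 0.6) / 4) + (1 - 0.6) / 4          # 0.6 where chosen, 0.1 elsewhere
vals.columns = [f'{c}_Var' for c in vals.columns]
vals.loc[~first, :] = np.nan                                 # keep values on the first row of each ID only
out = pd.concat([df, vals], axis=1)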

Python: How replace non-zero values in a Pandas dataframe with values from a series

I have a dataframe 'A' with 3 columns and 4 rows (X1..X4). Some of the elements in 'A' are non-zero. I have another dataframe 'B' with 1 column and 4 rows (X1..X4). I would like to create a dataframe 'C' so that where 'A' has a nonzero value, it takes the value from the equivalent row in 'B'.
I've tried a.where(a != 0, c)... obviously wrong, as c is not a scalar.
A = pd.DataFrame({'A':[1,6,0,0],'B':[0,0,1,0],'C':[1,0,3,0]},index=['X1','X2','X3','X4'])
B = pd.DataFrame({'A':{'X1':1.5,'X2':0.4,'X3':-1.1,'X4':5.2}})
These are the expected results:
C = pd.DataFrame({'A':[1.5,0.4,0,0],'B':[0,0,-1.1,0],'C':[1.5,0,-1.1,0]},index=['X1','X2','X3','X4'])
np.where():
If you want to assign back to A:
A[:]=np.where(A.ne(0),B,A)
For a new df:
final=pd.DataFrame(np.where(A.ne(0),B,A),columns=A.columns)
A B C
0 1.5 0.0 1.5
1 0.4 0.0 0.0
2 0.0 -1.1 -1.1
3 0.0 0.0 0.0
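Note that building a new frame from np.where this way drops the original row labels (the output above shows 0-3 instead of X1-X4); a small sketch that keeps them is to pass the index explicitly:
final = pd.DataFrame(np.where(A.ne(0), B, A), index=A.index, columns=A.columns)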
Usage of fillna
A=A.mask(A.ne(0)).T.fillna(B.A).T
A
Out[105]:
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
Or
A=A.mask(A!=0,B.A,axis=0)
Out[111]:
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
Use:
A.mask(A!=0,B['A'],axis=0,inplace=True)
print(A)
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
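For completeness, the where() call the question was reaching for also works once the condition is flipped (keep the zeros, otherwise take the value from B's matching row); a sketch:
C = A.where(A.eq(0), B['A'], axis=0)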