How to select row index and variable name based on value in a data frame? - dataframe

I have a large data frame of floats between -1.0 and 1.0. I would like to create a new list containing the row index, the variable name and the value of every cell with a number higher than 0.59.
Here is an example:
A B C D ... FD
0 0.34 -0.23 0.6 0.7 ... 0.3
1 -0.5 0.99 0.8 0.2 ... 0.8
...
45 0.8 0.13 0.34 0.4 ... -0.9
output:
0 C 0.6
0 D 0.7
1 B 0.99
1 C 0.8
...
1 FD 0.8
etc..
Thanks!

I am sure there must be a better solution than mine, as mine has awful performance (iterating cell by cell). But here is my attempt:
import numpy as np
import pandas as pd

# creating a sample df
df = pd.DataFrame(np.random.uniform(-1, 1, size=(10, 4)), columns=list('abcd'))
new_list = []
for tup in df.itertuples():
    # tup[0] is the row index, so the cell values start at position 1
    for i in range(1, len(tup)):
        if tup[i] > 0.59:
            new_list.append([tup.Index, df.columns[i-1], tup[i]])
new_df = pd.DataFrame(new_list, columns=['index', 'column', 'value'])
new_df = new_df.set_index('index')
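A vectorized alternative to the cell-by-cell loop, as a minimal sketch: stack the frame into a long Series keyed by (row label, column name) and filter it with a boolean mask. The small frame below is an assumption built from the first two rows of the question's example (columns A to D only), and the names stacked and result are only illustrative.

```python
import pandas as pd

# Assumed sample: the first two rows of the question's example, columns A-D only.
df = pd.DataFrame(
    {"A": [0.34, -0.5], "B": [-0.23, 0.99], "C": [0.6, 0.8], "D": [0.7, 0.2]}
)

# stack() turns the frame into a Series indexed by (row label, column name).
stacked = df.stack()

# Keep only the cells above the threshold, then flatten back into three columns.
result = (
    stacked[stacked > 0.59]
    .rename_axis(["index", "column"])
    .reset_index(name="value")
)
print(result)
```

If the row label should be the index, as in the loop version, result.set_index("index") can be applied afterwards.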

Related

how to create a column with the index of the biggest among other columns AND some condition

I have a dataset with several columns. I want to create another column whose values are the name of the column with the highest value in each row, ignoring values equal to 1.
For Example:
df = pd.DataFrame({'A': [1, 0.2, 0.1, 0],
                   'B': [0.2, 1, 0, 0.5],
                   'C': [1, 0.4, 0.3, 1]},
                  index=['1', '2', '3', '4'])
df
index    A    B    C
1      1.0  0.2  1.0
2      0.2  1.0  0.4
3      0.1  0.0  0.3
4      0.0  0.5  1.0
Should give an output like
index    A    B    C  NEWCOL
1      1.0  0.2  1.0  B
2      0.2  0.3  0.1  C
3      0.1  0.4  0.2  B
4      0.0  0.5  1.0  B
df2['newcol'] = df2.idxmax(axis=1) if df2.max(index=1) != 1
but it didn't work.
Here is one way to do it:
# filter out the data that is 1 and find the id of the max value using idxmax
df['newcol']=df[~df.isin([1])].idxmax(axis=1)
df
A B C newcol
1 1.0 0.2 1.0 B
2 0.2 1.0 0.4 C
3 0.1 0.0 0.3 C
4 0.0 0.5 1.0 B
PS: your input, starting and expected data don't match. The above is based on the input DF

replace value in a dataframe with values from another dataframe based on column names and values

So I have one dataframe with multiple variables, both categorical and numerical.
Var0 Var1 Var2
1 a 1
1 a 5
0 c 8
1 d 15
The second dataframe lists every column name from the first one in the Variable column, along with the value I want to bring in (BBB) and a cutoff, which can be either an interval such as (1.0, 2.0] or a category:
Variable Cutoff BBB
Var1 a 0.2
Var1 b 0.3
Var1 c 0.8
Var1 d 0.1
Var2 (1, 5] 0.8
Var2 (6, 10] 0.1
Var2 (11, 20] 0.3
Var0 1 0.3
Var0 0 0.5
I want to replace the values from the first dataframe with the corresponding values from the second dataframe (the ones that satisfy those conditions).
The expected output should look like this:
Var0 Var1 Var2
0.3 0.2 0.8
0.3 0.2 0.8
0.5 0.8 0.1
0.3 0.1 0.3
Regards,
Vlad

Excel sumproduct function in pandas dataframes

As a Python beginner, I find matrix multiplication on pandas dataframes very difficult to carry out.
I have two tables that look like this:
df1
Id lifetime 0 1 2 3 4 5 .... 30
0 1 4 0.1 0.2 0.1 0.4 0.5 0.4... 0.2
1 2 7 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
2 3 8 0.5 0.2 0.1 0.4 0.5 0.4... 0.6
.......
9 6 10 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
df2
Group lifetime 0 1 2 3 4 5 .... 30
0 2 4 0.9 0.8 0.9 0.8 0.8 0.8... 0.9
1 2 7 0.8 0.9 0.9 0.9 0.8 0.8... 0.9
2 3 8 0.9 0.7 0.8 0.8 0.9 0.9... 0.9
.......
9 5 10 0.8 0.9 0.7 0.7 0.9 0.7... 0.9
I want to reproduce Excel's sumproduct function in my code, where the number of columns to include is given by the lifetime column of both dfs, e.g.,
for row 0 in df1&df2, lifetime=4:
sumproduct(df1 row 0 from column 0 to column 3,
df2 row 0 from column 0 to column 3)
for row 1 in df1&df2, lifetime=7:
sumproduct(df1 row 1 from column 0 to column 6,
df2 row 1 from column 0 to column 6)
.......
How can I do this?
You can use .iloc to select rows and columns by integer position.
Row 0 is the row where lifetime == 4. Counting column positions from Id at position 0, the column labeled 0 is at position 2 and the column labeled 3 is at position 5, so the slice 2:6 covers them (the end of a slice is exclusive).
Once you select the same slice from both data frames with .iloc[0, 2:6], pass the two results to np.dot:
import numpy as np
np.dot(df1.iloc[0, 2:6], df2.iloc[0, 2:6])
To make sure you have the right data, first run just
df1.iloc[0, 2:6]
Then try the np.dot product. You can read up on "pandas iloc" and "slicing" for more info.
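To handle every row at once instead of slicing one row at a time, one possible sketch is to mask out the columns beyond each row's lifetime and sum the element-wise products. The two miniature frames below are made up for illustration, and the sketch assumes that row i of df1 pairs with row i of df2, that both share the same lifetime, and that the value columns are labelled 0, 1, 2, ... in order.

```python
import numpy as np
import pandas as pd

# Made-up miniature versions of df1 and df2 with value columns labelled 0..4.
value_cols = [0, 1, 2, 3, 4]
df1 = pd.DataFrame([[1, 2, 0.1, 0.2, 0.1, 0.4, 0.5],
                    [2, 4, 0.3, 0.2, 0.5, 0.4, 0.5]],
                   columns=["Id", "lifetime"] + value_cols)
df2 = pd.DataFrame([[2, 2, 0.9, 0.8, 0.9, 0.8, 0.8],
                    [2, 4, 0.8, 0.9, 0.9, 0.9, 0.8]],
                   columns=["Group", "lifetime"] + value_cols)

a = df1[value_cols].to_numpy()
b = df2[value_cols].to_numpy()

# mask[i, j] is True when column j lies within row i's lifetime.
lifetimes = df1["lifetime"].to_numpy()
mask = np.arange(len(value_cols)) < lifetimes[:, None]

# Zero out the products beyond each lifetime, then sum along each row.
sumproduct = (a * b * mask).sum(axis=1)
print(sumproduct)
```

The result has one sumproduct per row, so it can be assigned back as a new column, for example df1["sumproduct"] = sumproduct.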

How to apply subtraction to groupby object

I have a dataframe like this
test = pd.DataFrame({'category': [1, 1, 2, 2, 3, 3],
                     'type': ['new', 'old', 'new', 'old', 'new', 'old'],
                     'ratio': [0.1, 0.2, 0.2, 0.4, 0.4, 0.8]})
category ratio type
0 1 0.10000 new
1 1 0.20000 old
2 2 0.20000 new
3 2 0.40000 old
4 3 0.40000 new
5 3 0.80000 old
I would like to subtract each category's old ratio from the new ratio but not sure how to reshape the DF to do so
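Use DataFrame.pivot first so that old and new become columns; the subtraction is then straightforward (recent pandas versions require keyword arguments here):
df = test.pivot(index='category', columns='type', values='ratio')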
df['val'] = df['old'] - df['new']
print (df)
type new old val
category
1 0.1 0.2 0.1
2 0.2 0.4 0.2
3 0.4 0.8 0.4
Another approach, using groupby on the original frame:
df = (test.groupby('category')
          .apply(lambda x: x.loc[x['type'] == 'old', 'ratio'].iloc[0]
                           - x.loc[x['type'] == 'new', 'ratio'].iloc[0])
          .reset_index(name='val'))
Output
category val
0 1 0.1
1 2 0.2
2 3 0.4

Transpose table then set and rename index

I want to transpose a table and rename the index.
If I display the df with existing index Time I get
Time v1 v2
1 0.5 0.3
2 0.2 0.1
3 0.3 0.3
and after df.transpose() I'm at
Time 1 2 3
v1 0.5 0.2 0.3
v2 0.3 0.1 0.3
Interestingly if I do now df.Time I get
AttributeError: 'DataFrame' object has no attribute 'Time'
although it gets displayed in the output.
I can't find a way to easily rename the column Time to Variable and set that as the new index ..
I tried df.reset_index().set_index("index") but what I get is something that looks like this:
Time 1 2 3
index
v1 0.5 0.2 0.3
v2 0.3 0.1 0.3
You only need to rename the columns axis with rename_axis:
print (df.transpose().rename_axis('Variable', axis=1))
Variable 1 2 3
v1 0.5 0.2 0.3
v2 0.3 0.1 0.3
Or set the name of the columns axis by assignment:
df1 = df.transpose()
df1.columns.name = 'Var'
print (df1)
Var 1 2 3
v1 0.5 0.2 0.3
v2 0.3 0.1 0.3
But I think you need to create a new column from the index, rename that column from index to var, and reset the columns name to None:
df1 = df.transpose().reset_index().rename(columns={'index':'var'})
df1.columns.name = None
print (df1)
var 1 2 3
0 v1 0.5 0.2 0.3
1 v2 0.3 0.1 0.3
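If the goal is instead to keep v1 and v2 as the index and simply name that index Variable, as the question asks, a minimal sketch is shown below. The frame is rebuilt from the question's printed output, so its exact construction is an assumption.

```python
import pandas as pd

# Frame rebuilt from the question's output; the index is named "Time".
df = pd.DataFrame(
    {"v1": [0.5, 0.2, 0.3], "v2": [0.3, 0.1, 0.3]},
    index=pd.Index([1, 2, 3], name="Time"),
)

# After transposing, "Time" is the name of the columns axis, not a column,
# which is why df.Time raises AttributeError.
# Name the row index "Variable" and clear the leftover columns name.
out = df.transpose().rename_axis(index="Variable", columns=None)
print(out)
```

Here rename_axis only changes the axis names; the data and the labels v1, v2 and 1, 2, 3 stay as they are.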