Applying function to Pandas with GroupBy along direction of the grouping variable - pandas

I have a cohort of N people and I computed a correlation matrix of some quantities (q1_score,...q5_score)
df.groupby('participant_id').corr()
Out[130]:
q1_score q2_score q3_score q4_score q5_score
participant_id
11.0 q1_score 1.000000 -0.748887 -0.546893 -0.213635 -0.231169
q2_score -0.748887 1.000000 0.639649 0.324976 0.335596
q3_score -0.546893 0.639649 1.000000 0.154539 0.151233
q4_score -0.213635 0.324976 0.154539 1.000000 0.998752
q5_score -0.231169 0.335596 0.151233 0.998752 1.000000
14.0 q1_score 1.000000 -0.668781 -0.124614 -0.352075 -0.244251
q2_score -0.668781 1.000000 -0.175432 0.360183 0.184585
q3_score -0.124614 -0.175432 1.000000 -0.137993 -0.125115
q4_score -0.352075 0.360183 -0.137993 1.000000 0.968564
q5_score -0.244251 0.184585 -0.125115 0.968564 1.000000
17.0 q1_score 1.000000 -0.799223 -0.814424 -0.790587 -0.777318
q2_score -0.799223 1.000000 0.787238 0.658524 0.640786
q3_score -0.814424 0.787238 1.000000 0.702570 0.701440
q4_score -0.790587 0.658524 0.702570 1.000000 0.998996
q5_score -0.777318 0.640786 0.701440 0.998996 1.000000
18.0 q1_score 1.000000 -0.595545 -0.617691 -0.472409 -0.477523
q2_score -0.595545 1.000000 0.386705 0.148761 0.115068
q3_score -0.617691 0.386705 1.000000 0.806637 0.782345
q4_score -0.472409 0.148761 0.806637 1.000000 0.982617
q5_score -0.477523 0.115068 0.782345 0.982617 1.000000
I need to compute the median of each correlation across all participants. That is, for every pair of items J and K, I want to take the correlation between J and K from each participant and find the median of those values.
I am sure it is a one-liner, but I'm struggling to work it out (still learning pandas by examples).

Stack your data, and do another groupby:
df.groupby('participant_id').corr().stack().groupby(level = [1,2]).median()
Edit: Actually, you don't need to stack if you don't want to:
df.groupby('participant_id').corr().groupby(level = [1]).median()
works too.
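A minimal sketch of the second form on made-up data, in case it helps to see it end to end. The column names follow the question; the random scores and the number of rows per participant are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data (values are made up).
rng = np.random.default_rng(0)
cols = ["q1_score", "q2_score", "q3_score"]
df = pd.DataFrame(rng.normal(size=(40, 3)), columns=cols)
df["participant_id"] = np.repeat([11.0, 14.0, 17.0, 18.0], 10)

# One correlation matrix per participant, indexed by (participant_id, item).
corr = df.groupby("participant_id").corr()

# Group on the inner index level (the item name) so that each cell is the
# median, across participants, of the correlation between two items.
median_corr = corr.groupby(level=1).median()
print(median_corr)
```

The result is a single items-by-items matrix, with 1.0 on the diagonal because every participant's diagonal is 1.0.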

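IIUC, you want the mean of each participant's correlations across all questions (here df is the grouped correlation frame shown in the question):
df.where(df != 1).mean(axis=1).groupby(level=0).mean()
First, where masks the correlation of each question with itself (the 1s on the diagonal). Then mean(axis=1) averages across questions within each row, and grouping on level 0 (participant_id) averages those row means per participant. Older pandas versions also accepted .mean(level=0) for the last step, but that argument has since been removed.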
Output:
participant_id
11.0 0.086416
14.0 -0.031493
17.0 0.130800
18.0 0.105896
dtype: float64
Edit: I used mean instead of median; the same logic works with median:
df.where(df != 1).median(axis=1).groupby(level=0).median()
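A small sketch of this masking approach on a hand-built correlation frame, so the numbers can be checked by eye. The two participants and their correlation values (0.5 and -0.2) are made up. Note that this gives one summary number per participant, not one value per pair of items.

```python
import pandas as pd

# Hand-built stand-in for df.groupby('participant_id').corr() (made-up values).
index = pd.MultiIndex.from_product(
    [[11.0, 14.0], ["q1_score", "q2_score"]],
    names=["participant_id", None],
)
corr = pd.DataFrame(
    [[1.0, 0.5], [0.5, 1.0], [1.0, -0.2], [-0.2, 1.0]],
    index=index,
    columns=["q1_score", "q2_score"],
)

# Mask the self-correlations, average each row, then average per participant.
off_diag = corr.where(corr != 1)
per_participant = off_diag.mean(axis=1).groupby(level=0).mean()
print(per_participant)
```

One caveat: corr != 1 masks by value, so it relies on the diagonal being exactly 1.0 and would also mask any off-diagonal correlation that happens to equal 1 exactly.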

Related

Convert a sparse matrix to dataframe

I have a sparse matrix that stores computed similarities between a set of documents. The matrix is an ndarray.
0 1 2 3 4
0 1.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 1.000000 0.067279 0.000000 0.000000
2 0.000000 0.067279 1.000000 0.025758 0.012039
3 0.000000 0.000000 0.025758 1.000000 0.000000
4 0.000000 0.000000 0.012039 0.000000 1.000000
I would like to transform this data into a three-column dataframe as follows.
docA docB similarity
1 2 0.067279
2 3 0.025758
2 4 0.012039
This final result does not contain the matrix diagonal or zero values. It also lists each document pair only once (i.e. in one row only). Is there a built-in / efficient method to achieve this end result? Any pointers would be much appreciated.
Thanks!
Convert the dataframe to an array:
x = df.to_numpy()
Get a list of non-diagonal non-zero entries from the sparse symmetric distance matrix:
i, j = np.triu_indices_from(x, k=1)
v = x[i, j]
ijv = np.concatenate((i, j, v)).reshape(3, -1).T
ijv = ijv[v != 0.0]
Convert it back to a dataframe:
df_ijv = pd.DataFrame(ijv)
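Put together, the steps above could look like the sketch below, using the matrix from the question. Building the frame from a dict instead of a concatenated array is my own variation: it keeps docA and docB as integers and names the columns as in the desired output.

```python
import numpy as np
import pandas as pd

# Similarity matrix from the question.
x = np.array([
    [1.0, 0.0,      0.0,      0.0,      0.0],
    [0.0, 1.0,      0.067279, 0.0,      0.0],
    [0.0, 0.067279, 1.0,      0.025758, 0.012039],
    [0.0, 0.0,      0.025758, 1.0,      0.0],
    [0.0, 0.0,      0.012039, 0.0,      1.0],
])

# Indices strictly above the diagonal, so each pair appears exactly once.
i, j = np.triu_indices_from(x, k=1)
v = x[i, j]

# Drop the zero entries and label the columns.
keep = v != 0
pairs = pd.DataFrame({"docA": i[keep], "docB": j[keep], "similarity": v[keep]})
print(pairs)
```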
I'm not sure whether this is any faster, but an alternative for the middle step is to convert the numpy array to an ijv ("triplet", i.e. COO) sparse matrix:
from scipy import sparse
coo = sparse.coo_matrix(x)
ijv = np.concatenate((coo.row, coo.col, coo.data)).reshape(3, -1).T
Now given a symmetric distance matrix, all you need to do is to keep the non-zero elements on the upper right triangle. You could loop through these. Or you could pre-mask the array with np.triu_indices_from(x, k=1), but that kind of defeats the whole purpose of this supposedly faster method... hmmm.
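One possible way to keep only the upper triangle without looping is scipy.sparse.triu, sketched below on the matrix from the question. I have not timed it against the numpy version, so treat the speed question as open.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Similarity matrix from the question.
x = np.array([
    [1.0, 0.0,      0.0,      0.0,      0.0],
    [0.0, 1.0,      0.067279, 0.0,      0.0],
    [0.0, 0.067279, 1.0,      0.025758, 0.012039],
    [0.0, 0.0,      0.025758, 1.0,      0.0],
    [0.0, 0.0,      0.012039, 0.0,      1.0],
])

# coo_matrix stores only the non-zero entries; triu with k=1 then keeps
# the ones strictly above the diagonal.
upper = sparse.triu(sparse.coo_matrix(x), k=1).tocoo()

pairs = pd.DataFrame(
    {"docA": upper.row, "docB": upper.col, "similarity": upper.data}
)
# Sort so the row order does not depend on the sparse storage order.
pairs = pairs.sort_values(["docA", "docB"]).reset_index(drop=True)
print(pairs)
```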

pandas Selecting single value from df using .loc() is producing a df instead of a numeric

I have two dataframes, sarc and non. After running describe() on both I want to compare the mean value for a particular column in both dataframes. I used .loc() and tried saving the value as a float but it is saving as a dataframe, which prevents me from comparing the two values using the > operator. Here's my code:
sarc.describe()
label c_len c_s_l_len score
count 5092.0 5092.000000 5092.000000 5092.000000
mean 1.0 54.876277 33.123527 6.919874
std 0.0 37.536986 22.566558 43.616977
min 1.0 0.000000 0.000000 -96.000000
25% 1.0 29.000000 18.000000 1.000000
50% 1.0 47.000000 28.000000 2.000000
75% 1.0 71.000000 43.000000 5.000000
max 1.0 466.000000 307.000000 2381.000000
non.describe()
label c_len c_s_l_len score
count 4960.0 4960.000000 4960.000000 4960.000000
mean 0.0 55.044153 33.100806 6.912298
std 0.0 47.873732 28.738776 39.216049
min 0.0 0.000000 0.000000 -119.000000
25% 0.0 23.000000 14.000000 1.000000
50% 0.0 43.000000 26.000000 2.000000
75% 0.0 74.000000 44.000000 4.000000
max 0.0 594.000000 363.000000 1534.000000
non_c_len_mean = non.describe().loc[['mean'], ['c_len']].astype(np.float64)
sarc_c_len_mean = sarc.describe().loc[['mean'], ['c_len']].astype(np.float64)
if sarc_c_len_mean > non_c_len_mean:
# do stuff
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The variables are indeed of <class 'pandas.core.frame.DataFrame'> type, and each prints as a labeled 1-row, 1-col df instead of just the value. How can I select only the numeric value as a float?
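Remove the inner [] in .loc when you pick the row and the column. Passing lists of labels returns a DataFrame; passing a single row label and a single column label returns the scalar value: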
non.describe().loc['mean', 'c_len']
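A quick sketch of the difference, using two tiny made-up frames in place of sarc and non (only the c_len column name comes from the question):

```python
import pandas as pd

# Tiny stand-ins for the question's frames (values are made up).
sarc = pd.DataFrame({"c_len": [10, 20, 30]})
non = pd.DataFrame({"c_len": [5, 10, 15]})

# Lists of labels: the result is still a 1x1 DataFrame.
as_frame = sarc.describe().loc[["mean"], ["c_len"]]

# Scalar labels: the result is a plain float-like value.
sarc_c_len_mean = sarc.describe().loc["mean", "c_len"]
non_c_len_mean = non.describe().loc["mean", "c_len"]

if sarc_c_len_mean > non_c_len_mean:
    print("sarc comments are longer on average")
```

If only the mean is needed, sarc["c_len"].mean() gives the same number without computing the full describe() table.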

Pandas take only a cell from the describe function

I am using pandas describe function for the below result:
dt_d=dt.describe()
print(dt_d)
count 120.00000 120.000000 120.000000 120.000000
mean 5.89000 3.060000 3.795833 1.190833
std 0.84589 0.441807 1.792861 0.757372
min 4.30000 2.000000 1.000000 0.100000
25% 5.17500 2.800000 1.575000 0.300000
50% 5.80000 3.000000 4.450000 1.400000
75% 6.40000 3.325000 5.100000 1.800000
max 7.90000 4.400000 6.900000 2.500000
If I want to take a cell from the describe function, for example, from the mean row, the mean in the third column, how will I be able to call it on its own?
df.describe() returns a DataFrame so you can just index it as you would any other DataFrame, using .loc.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1,10,(10,3)))
df.describe()
# 0 1 2
#count 10.00000 10.000000 10.000000
#mean 4.30000 2.400000 5.400000
#std 2.58414 1.429841 2.458545
#min 1.00000 1.000000 1.000000
#25% 2.25000 1.000000 4.250000
#50% 4.00000 2.500000 6.000000
#75% 5.00000 3.000000 7.000000
#max 9.00000 5.000000 8.000000
df.describe().loc['mean', 2]
#5.4

Replace obj file geometric vertex with real coordinates 3D coordinates

I have an actual plane with known 3D coordinates of its four corners relative to a landmark. Its coordinates are:
Front left corner: -32.5100 128.2703 662.2551
Front right corner: 65.2244 131.0850 656.1088
Back left corner: -23.4983 129.0271 838.3724
Back right corner: 74.1135 131.4294 833.4199
I am now creating a 3D OBJ file of the plane in Blender, with an image mapped onto it as a texture. Following a tutorial on adding a texture to a plane in Blender, I got the OBJ and MTL files shown below. I tried to directly replace the geometric vertices in the OBJ file with my own coordinates, but the vertices are not connected in MeshLab. Any idea how to modify the OBJ file?
Thanks,
OBJ File:
# Blender v2.76 (sub 0) OBJ File: ''
# www.blender.org
mtllib planePhantom.mtl
o Plane
v -0.088000 0.000000 0.049250
v 0.088000 0.000000 0.049250
v -0.088000 0.000000 -0.049250
v 0.088000 0.000000 -0.049250
vt 0.000000 0.000000
vt 1.000000 0.000000
vt 1.000000 1.000000
vt 0.000000 1.000000
vn 0.000000 1.000000 0.000000
usemtl Material.001
s off
f 1/1/1 2/2/1 4/3/1 3/4/1
MTL file:
# Blender MTL File: 'None'
# Material Count: 1
newmtl Material.001
Ns 96.078431
Ka 1.000000 1.000000 1.000000
Kd 0.640000 0.640000 0.640000
Ks 0.500000 0.500000 0.500000
Ke 0.000000 0.000000 0.000000
Ni 1.000000
d 1.000000
illum 0
map_Kd IMG_0772_cropped_unsharpmask_100_4_0.jpeg
The plane shows in meshlab before replacing:
Okay, it turns out that the order in which the vertices were connected was wrong. I have already figured it out.
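The final file was not posted, so the following is only a sketch of one consistent ordering, not necessarily the fix that was used. It assumes the front corners correspond to v1 and v2 of the Blender export and the back corners to v3 and v4; the face then walks the perimeter as v1, v2, v4, v3, exactly like the exported f line. The function name quad_obj is made up for illustration, and the vn line is left out because the exported normal (0 1 0) would no longer match the new plane.

```python
# Corner coordinates from the question, listed in the same vertex order as
# the Blender export (assumed mapping: front = v1/v2, back = v3/v4).
corners = [
    (-32.5100, 128.2703, 662.2551),  # v1: front left
    (65.2244, 131.0850, 656.1088),   # v2: front right
    (-23.4983, 129.0271, 838.3724),  # v3: back left
    (74.1135, 131.4294, 833.4199),   # v4: back right
]


def quad_obj(corners, mtllib="planePhantom.mtl", material="Material.001"):
    """Return OBJ text for one textured quad (hypothetical helper)."""
    lines = [f"mtllib {mtllib}", "o Plane"]
    lines += [f"v {x:.6f} {y:.6f} {z:.6f}" for x, y, z in corners]
    lines += [
        "vt 0.000000 0.000000",
        "vt 1.000000 0.000000",
        "vt 1.000000 1.000000",
        "vt 0.000000 1.000000",
    ]
    lines += [f"usemtl {material}", "s off"]
    # Walk around the perimeter (v1 -> v2 -> v4 -> v3). Listing the
    # vertices as 1 2 3 4 instead would cross the quad diagonally.
    lines.append("f 1/1 2/2 4/3 3/4")
    return "\n".join(lines) + "\n"


obj_text = quad_obj(corners)
print(obj_text)
```

If the texture comes out mirrored or rotated, the pairing of vt indices with vertices in the f line is the place to adjust.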

How do you get modf or modff to work properly?

I am simply trying to cycle through a list of (10) names using an incrementing counter by taking the modulus of the counter with respect to the length of the list. However, the code seems to skip a number here and there. I have tried both modf() and modff() and different type castings, but no luck.
Here is an example of the code:
defaultNameList = [NSArray arrayWithObjects:@"RacerX",@"Speed",@"Sprittle",@"Chim-Chim",@"Pops",@"Dale",@"Junior",@"Chip",@"Fred",@"Barney", nil];
float intpart;
int pickName = (int)(modff(entryCount/10.0,&intpart) * 10.0);
NSLog(@"%ld %f %f %f %d %@",entryCount, entryCount/10.0, modff(entryCount/10.0,&intpart), modff(entryCount/10.0,&intpart) * 10.0 ,pickName, [defaultNameList objectAtIndex:pickName]);
The console gives:
0 0.000000 0.000000 0.000000 0 RacerX
1 0.100000 0.100000 1.000000 1 Speed
2 0.200000 0.200000 2.000000 2 Sprittle
3 0.300000 0.300000 3.000000 3 Chim-Chim
4 0.400000 0.400000 4.000000 4 Pops
5 0.500000 0.500000 5.000000 5 Dale
6 0.600000 0.600000 6.000000 6 Junior
7 0.700000 0.700000 7.000000 6 Junior
8 0.800000 0.800000 8.000000 8 Fred
9 0.900000 0.900000 9.000000 8 Fred
10 1.000000 0.000000 0.000000 0 RacerX
As far as I can tell it should not skip pickName = 7 or 9, but it does.
Casting to (int) truncates the value. That is, if the result cannot be represented exactly in the floating-point format in use and is stored as slightly less than the exact value, the cast rounds it toward zero (a stored 6.9999999 becomes 6, even though %f prints it as 7.000000). To solve this problem, round the number instead of truncating:
int pickName = (int)(modff(entryCount / 10.0, &intpart) * 10.0 + 0.5);
(This assumes that the number is not negative.)
However, since you're working with integers here, you can avoid floating point (and its rounding error) entirely by using the modulo operator, which operates on integers:
int pickName = entryCount % 10;
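To see the effect outside of Objective-C, here is a rough Python re-creation of the three variants. Python floats are doubles, so the helper to_float32 (my own name) uses struct to mimic the single-precision value that modff receives; treat it as an illustration of the rounding behaviour rather than an exact trace of the original program.

```python
import math
import struct


def to_float32(value):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def pick_truncating(entry_count):
    # Mirrors (int)(modff(entryCount / 10.0, &intpart) * 10.0).
    frac, _ = math.modf(to_float32(entry_count / 10.0))
    return int(frac * 10.0)


def pick_rounding(entry_count):
    # Mirrors the + 0.5 fix: round to nearest before truncating.
    frac, _ = math.modf(to_float32(entry_count / 10.0))
    return int(frac * 10.0 + 0.5)


def pick_modulo(entry_count):
    # Pure integer arithmetic, no rounding error possible.
    return entry_count % 10


for n in range(11):
    print(n, pick_truncating(n), pick_rounding(n), pick_modulo(n))
```

For counters 0 through 10 the truncating version gives 6 for 7 and 8 for 9, matching the console output in the question, while the other two columns agree with each other.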