Pandas sort by two column values - pandas

With this dataframe:
item XP_home XP_away
A 0.000000 5.229861
B 6.412500 0.000000
C 5.037361 0.000000
D 0.000000 3.394792
I can sort like so:
df = df.sort_values(by='XP_home', ascending=False).head(2)
and get:
B 6.412500 0.000000
C 5.037361 0.000000
or:
df = df.sort_values(by='XP_away', ascending=False).head(2)
and get:
A 0.000000 5.229861
D 0.000000 3.394792
But how can I sort by the highest of both column values, to get:
item XP_home XP_away
B 6.412500 0.000000
A 0.000000 5.229861
C 5.037361 0.000000
D 0.000000 3.394792

Let us try argsort
out = df.iloc[(-df.filter(like = 'XP').max(1)).argsort()]
item XP_home XP_away
1 B 6.412500 0.000000
0 A 0.000000 5.229861
2 C 5.037361 0.000000
3 D 0.000000 3.394792

You can sort on the max value across rows:
print(df.assign(val=df[["XP_home", "XP_away"]].max(axis=1))
        .sort_values("val", ascending=False).drop(columns="val"))
item XP_home XP_away
1 B 6.412500 0.000000
0 A 0.000000 5.229861
2 C 5.037361 0.000000
3 D 0.000000 3.394792

Since pandas 1.1.0 you can compute the sorting values with the key argument (the lambda here ignores the column named in by and returns the row-wise maximum instead):
df.sort_values('XP_home', key=lambda _: df[['XP_away', 'XP_home']].max(1), ascending=False)
Out:
item XP_home XP_away
1 B 6.412500 0.000000
0 A 0.000000 5.229861
2 C 5.037361 0.000000
3 D 0.000000 3.394792
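For reference, here is a minimal self-contained sketch of the same idea: compute the row-wise maximum once, sort it, and reorder the frame by the resulting index. The frame is rebuilt from the question's values; the names row_max and out are only illustrative.

```python
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame({
    "item": ["A", "B", "C", "D"],
    "XP_home": [0.0, 6.4125, 5.037361, 0.0],
    "XP_away": [5.229861, 0.0, 0.0, 3.394792],
})

# Row-wise maximum of the two XP columns.
row_max = df[["XP_home", "XP_away"]].max(axis=1)

# Sort the maxima in descending order, then reorder the frame by those index labels.
out = df.loc[row_max.sort_values(ascending=False).index]
print(out)
```

This avoids adding and dropping a helper column, and it works on pandas versions older than 1.1.0 as well.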

Related

Masked array assignment

I have an NxN array A, an NxN array B and an NxN mask (BitMatrix) M. I want to assign the values of B to A only at the indices where M is true. What is the best way to do that?
You can use logical indexing
julia> A = zeros(5,5); B = ones(5,5); M = rand(Bool, 5, 5)
5×5 Matrix{Bool}:
1 0 1 1 0
1 0 1 1 0
1 0 1 1 1
0 0 0 1 0
0 0 0 0 1
julia> A[M] = B[M]; A
5×5 Matrix{Float64}:
1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 1.0 1.0
0.0 0.0 0.0 1.0 0.0
0.0 0.0 0.0 0.0 1.0
or simply write a loop:
julia> for i in eachindex(A, B, M)
           if M[i]
               A[i] = B[i]
           end
       end
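For comparison only, the same logical-indexing idea can be sketched in NumPy. This is Python rather than Julia, and the small arrays are made up for illustration.

```python
import numpy as np

A = np.zeros((3, 3))
B = np.ones((3, 3))
M = np.array([[True, False, True],
              [False, False, True],
              [True, True, False]])

# Boolean-mask assignment: only positions where M is True are overwritten.
A[M] = B[M]

# Equivalent form that avoids building the temporary B[M].
A2 = np.zeros((3, 3))
np.copyto(A2, B, where=M)
print(A)
```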

replacing dictionary values into a dataframe

I have the following df on one side:
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 0 0 0 0
ADECCO 0 0 0 0 0
BANKIA 0 0 0 0 0
and the following dict on the other:
{'ADMIRAL': 1, 'ADECCO': -1, 'BANKIA': -1}
where the df.index values correspond to the dict keys.
I would like to insert the dict values into the df, one value per row, in the column whose name matches the row label, to obtain this output:
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 1 0 0 0
ADECCO 0 0 -1 0 0
BANKIA 0 0 0 -1 0
Loop over the dict items and set each value with at:
d = {'ADMIRAL': 1, 'ADECCO': -1, 'BANKIA': -1}
for k, v in d.items():
    df.at[k, k] = v
    #alternative
    #df.loc[k, k] = v
print(df)
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 1 0 0 0
ADECCO 0 0 -1 0 0
BANKIA 0 0 0 -1 0
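A self-contained version of the loop, as a sketch: the frame is rebuilt from the question, and the membership check is an added assumption that keys without a matching row and column should be skipped.

```python
import pandas as pd

cols = ["ACCOR SA", "ADMIRAL", "ADECCO", "BANKIA", "BANKINTER"]
df = pd.DataFrame(0, index=["ADMIRAL", "ADECCO", "BANKIA"], columns=cols)
d = {"ADMIRAL": 1, "ADECCO": -1, "BANKIA": -1}

for key, value in d.items():
    # Guard (an assumption): only touch cells whose row and column labels both exist.
    if key in df.index and key in df.columns:
        df.at[key, key] = value

print(df)
```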
Another solution is to create a DataFrame from the dict with MultiIndex.from_arrays and unstack:
s = pd.Series(list(d.values()), index=pd.MultiIndex.from_arrays([list(d), list(d)]))
df1 = s.unstack()
print (df1)
ADECCO ADMIRAL BANKIA
ADECCO -1.0 NaN NaN
ADMIRAL NaN 1.0 NaN
BANKIA NaN NaN -1.0
Then fill in the non-NaN values with combine_first:
df = df1.combine_first(df)
print (df)
ACCOR SA ADECCO ADMIRAL BANKIA BANKINTER
ADECCO 0.0 -1.0 0.0 0.0 0.0
ADMIRAL 0.0 0.0 1.0 0.0 0.0
BANKIA 0.0 0.0 0.0 -1.0 0.0

Get frequency of items in a pandas column in given intervals of values stored in another pandas column

My dataframe
class_lst = ["B","A","C","Z","H","K","O","W","L","R","M","Y","Q","X","X","G","G","G","G","G"]
value_lst = [1,0.999986,1,0.999358,0.999906,0.995292,0.998481,0.388307,0.99608,0.99829,1,0.087298,1,1,0.999993,1,1,1,1,1]
df = pd.DataFrame(
    {'class': class_lst,
     'val': value_lst
    })
For any interval of 'val' in ranges
ranges = np.arange(0.0, 1.1, 0.1)
I would like to get the frequency of 'val' items, as follows:
class range frequency
A (0, 0.10] 0
A (0.10, 0.20] 0
A (0.20, 0.30] 0
...
A (0.90, 1.00] 1
G (0, 0.10] 0
G (0.10, 0.20] 0
G (0.20, 0.30] 0
...
G (0.80, 0.90] 0
G (0.90, 1.00] 5
...
I tried
df.groupby(pd.cut(df.val, ranges)).count()
but the output looks like
class val
val
(0, 0.1] 1 1
(0.1, 0.2] 0 0
(0.2, 0.3] 0 0
(0.3, 0.4] 1 1
(0.4, 0.5] 0 0
(0.5, 0.6] 0 0
(0.6, 0.7] 0 0
(0.7, 0.8] 0 0
(0.8, 0.9] 0 0
(0.9, 1] 18 18
and does not match with the expected one
This might be a good start:
df["range"] = pd.cut(df['val'], ranges)
class val range
0 B 1.000000 (0.9, 1.0]
1 A 0.999986 (0.9, 1.0]
2 C 1.000000 (0.9, 1.0]
3 Z 0.999358 (0.9, 1.0]
4 H 0.999906 (0.9, 1.0]
5 K 0.995292 (0.9, 1.0]
6 O 0.998481 (0.9, 1.0]
7 W 0.388307 (0.3, 0.4]
8 L 0.996080 (0.9, 1.0]
9 R 0.998290 (0.9, 1.0]
10 M 1.000000 (0.9, 1.0]
11 Y 0.087298 (0.0, 0.1]
12 Q 1.000000 (0.9, 1.0]
13 X 1.000000 (0.9, 1.0]
14 X 0.999993 (0.9, 1.0]
15 G 1.000000 (0.9, 1.0]
16 G 1.000000 (0.9, 1.0]
17 G 1.000000 (0.9, 1.0]
18 G 1.000000 (0.9, 1.0]
19 G 1.000000 (0.9, 1.0]
and then
df.groupby(["class", "range"]).size()
class range
A (0.9, 1.0] 1
B (0.9, 1.0] 1
C (0.9, 1.0] 1
G (0.9, 1.0] 5
H (0.9, 1.0] 1
K (0.9, 1.0] 1
L (0.9, 1.0] 1
M (0.9, 1.0] 1
O (0.9, 1.0] 1
Q (0.9, 1.0] 1
R (0.9, 1.0] 1
W (0.3, 0.4] 1
X (0.9, 1.0] 2
Y (0.0, 0.1] 1
Z (0.9, 1.0] 1
This already gives the right bin for each class together with its frequency.
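The question also asked for the empty bins with a frequency of 0. One possible way to get them, sketched here on a shortened sample of the data, is to pass observed=False to groupby so that every category produced by pd.cut appears for every class. The column name frequency is taken from the expected output; the shortened sample is only for illustration.

```python
import numpy as np
import pandas as pd

# Shortened sample of the question's data.
df = pd.DataFrame({
    "class": ["A", "G", "G", "W", "Y"],
    "val": [0.999986, 1.0, 1.0, 0.388307, 0.087298],
})
ranges = np.arange(0.0, 1.1, 0.1)

# Bin the values; the result is a categorical column of intervals.
df["range"] = pd.cut(df["val"], ranges)

# observed=False keeps unobserved (class, interval) combinations, counted as 0.
freq = (
    df.groupby(["class", "range"], observed=False)
      .size()
      .reset_index(name="frequency")
)
print(freq)
```

The result has one row per class and interval, so it can grow quickly when there are many classes or bins.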

Pandas dataframe finding largest N elements of each row with row-specific N

I have a DataFrame:
>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 23.0 24.0 25.0
and a Series that specifies the number of top N values I want from each row
>>> n_max = pd.Series([2,3,4])
What is the pandas way of using df and n_max to find the largest N elements of each row (breaking ties by picking just one of the tied values, as .nlargest() would do)?
The desired output is
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:
>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
But the goal, again, is to have row-specific N. Looping through each row obviously doesn't count (for performance reasons). And I've tried using .rank() with a mask but tie breaking doesn't work there...
Based on @ScottBoston's comment on the OP, it is possible to use the following mask based on rank to solve this problem:
>>> n_max.index = df.index
>>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
>>> selected = df_rank.le(n_max, axis=0)
>>> df[selected]
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
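As a variation on the rank-based mask (not the code from the answer), the stack/groupby/unstack round trip can probably be avoided by ranking along the rows directly with rank(axis=1). A minimal sketch, with the frame rebuilt from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"row1": [1, 2, np.nan, 4, 5],
     "row2": [11, 12, 13, 14, np.nan],
     "row3": [22, 22, 23, 24, 25]},
    index="a b c d e".split(),
).T
n_max = pd.Series([2, 3, 4], index=df.index)

# Rank within each row, largest first; method="first" breaks ties by position.
rank = df.rank(axis=1, ascending=False, method="first")

# Keep a cell only if its rank is within that row's N (NaN ranks compare as False).
selected = rank.le(n_max, axis=0)
out = df.where(selected)
print(out)
```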
For performance, I would suggest NumPy -
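def mask_variable_largest_per_row(df, n_max):
    a = df.values  # may share memory with df, so the writes below can modify df
    m,n = a.shape

    # number of non-NaN values to reset per row (the smallest ones)
    nan_row_count = np.isnan(a).sum(1)
    n_reset = n-n_max.values-nan_row_count
    n_reset.clip(min=0, max=n-1, out = n_reset)

    # column indices sorted by value within each row (NaNs sort last)
    sidx = a.argsort(1)

    # pick the first n_reset sorted positions of each row and set them to NaN
    mask = n_reset[:,None] > np.arange(n)
    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r,c] = np.nan
    return df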
Sample run -
In [182]: df
Out[182]:
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 5.0 24.0 25.0
In [183]: n_max = pd.Series([2,3,2])
In [184]: mask_variable_largest_per_row(df, n_max)
Out[184]:
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 NaN NaN NaN 24.0 25.0
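The function above writes into df.values, so it can change the input frame, and depending on the pandas version and copy-on-write settings that array may be read-only. A non-mutating variant of the same argsort idea could look like the sketch below; the name keep_largest_per_row is hypothetical, and it assumes an all-numeric frame.

```python
import numpy as np
import pandas as pd

def keep_largest_per_row(df, n_max):
    # Work on a float copy so the caller's frame is left untouched.
    a = df.to_numpy(dtype=float, copy=True)
    m, n = a.shape

    # Per row: how many non-NaN values must be reset to NaN (the smallest ones).
    n_valid = (~np.isnan(a)).sum(axis=1)
    n_reset = np.clip(n_valid - np.asarray(n_max), 0, None)

    # Column indices in ascending value order per row; NaNs sort last.
    sidx = np.argsort(a, axis=1, kind="stable")

    # The first n_reset sorted positions of each row are the ones to blank out.
    mask = np.arange(n) < n_reset[:, None]
    rows = np.repeat(np.arange(m), n_reset)
    a[rows, sidx[mask]] = np.nan
    return pd.DataFrame(a, index=df.index, columns=df.columns)

df = pd.DataFrame(
    [[1, 2, np.nan, 4, 5], [11, 12, 13, 14, np.nan], [22, 22, 5, 24, 25]],
    index=["row1", "row2", "row3"], columns=list("abcde"),
)
out = keep_largest_per_row(df, pd.Series([2, 3, 2]))
print(out)
```

With a stable ascending sort, ties are resolved by resetting the earlier of two equal values first, which differs from the rank-based approach but still keeps exactly N values per row.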
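Further boost: replacing numpy.argsort with numpy.argpartition should help, as we don't care about the order of the indices to be reset to NaN. Note that partitioning at a single index N only guarantees that the first N positions of each row hold its N smallest values in arbitrary order, so rows that need fewer than N resets are not guaranteed to lose exactly their smallest values; compare the result against the argsort version before relying on it. A numpy.argpartition-based version would be -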
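def mask_variable_largest_per_row_v2(df, n_max):
    a = df.values  # may share memory with df, so the writes below can modify df
    m,n = a.shape

    # number of non-NaN values to reset per row
    nan_row_count = np.isnan(a).sum(1)
    n_reset = n-n_max.values-nan_row_count
    n_reset.clip(min=0, max=n-1, out = n_reset)

    # largest number of resets needed by any row, used as the partition index
    N = (n-n_max.values).max()
    N = np.clip(N, a_min=0, a_max=n-1)

    # partial sort: positions before N hold the N smallest values (unordered)
    sidx = a.argpartition(N, axis=1) #sidx = a.argsort(1)
    mask = n_reset[:,None] > np.arange(n)
    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r,c] = np.nan
    return df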
Runtime test
Other approaches -
def pandas_rank_based(df, n_max):
    n_max.index = df.index
    df_rank = df.stack(dropna=False).groupby(level=0).rank\
              (ascending=False, method='first').unstack()
    selected = df_rank.le(n_max, axis=0)
    return df[selected]
Verification and timings -
In [387]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
...: out1 = pandas_rank_based(df1, n_max)
...: out2 = mask_variable_largest_per_row(df2, n_max)
...: out3 = mask_variable_largest_per_row_v2(df3, n_max)
...: print(np.nansum(out1-out2)==0) # Verify
...: print(np.nansum(out1-out3)==0) # Verify
...:
True
True
In [388]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
In [389]: %timeit pandas_rank_based(df1, n_max)
1 loops, best of 3: 559 ms per loop
In [390]: %timeit mask_variable_largest_per_row(df2, n_max)
10 loops, best of 3: 34.1 ms per loop
In [391]: %timeit mask_variable_largest_per_row_v2(df3, n_max)
100 loops, best of 3: 5.92 ms per loop
Pretty good speedups there over the pandas rank-based approach: roughly 16x for the argsort version and 90x+ for the argpartition version!

Blender generate obj with two slashes for faces where

I'm trying to make a 3D scene with OpenGL ES 2, and I'm new to Xcode and Objective-C.
I followed this tutorial to convert a Blender-generated *.obj file into *.h and *.c files.
But the script expects an obj file like this:
v 1.000000 -1.000000 -1.000000
v 1.000000 -1.000000 1.000000
v -1.000000 -1.000000 1.000000
v -1.000000 -1.000000 -1.000000
v 1.000000 1.000000 -0.999999
v 0.999999 1.000000 1.000001
v -1.000000 1.000000 1.000000
v -1.000000 1.000000 -1.000000
vt 0.375624 0.500625
vt 0.624375 0.500624
vt 0.375625 0.749375
vt 0.375625 0.251875
vt 0.375624 0.003126
vt 0.624374 0.251874
vt 0.873126 0.749375
vt 0.873126 0.998126
vt 0.624375 0.749375
vt 0.624375 0.998126
vt 0.126874 0.998126
vt 0.126874 0.749375
vt 0.375625 0.998126
vt 0.624373 0.003126
vn 0.000000 -1.000000 0.000000
vn 0.000000 1.000000 0.000000
vn 1.000000 -0.000000 0.000000
vn 0.000000 -0.000000 1.000000
vn -1.000000 -0.000000 -0.000000
vn 0.000000 0.000000 -1.000000
vn 1.000000 0.000000 0.000001
s off
f 1/1/1 2/2/1 4/3/1
f 5/4/2 8/5/2 6/6/2
f 1/1/3 5/4/3 2/2/3
f 2/7/4 6/8/4 3/9/4
f 3/9/5 7/10/5 4/3/5
f 5/11/6 1/12/6 8/13/6
f 2/2/1 3/9/1 4/3/1
f 8/5/2 7/14/2 6/6/2
f 5/4/7 6/6/7 2/2/7
f 6/8/4 7/10/4 3/9/4
f 7/10/5 8/13/5 4/3/5
f 1/12/6 4/3/6 8/13/6
And when I create a new cube (or anything else) I obtain an obj like this:
v 1.000000 -1.000000 -1.000000
v 1.000000 -1.000000 1.000000
v -1.000000 -1.000000 1.000000
v -1.000000 -1.000000 -1.000000
v 1.000000 1.000000 -0.999999
v 0.999999 1.000000 1.000001
v -1.000000 1.000000 1.000000
v -1.000000 1.000000 -1.000000
vn 0.000000 -1.000000 0.000000
vn 0.000000 1.000000 0.000000
vn 1.000000 -0.000000 0.000000
vn 0.000000 -0.000000 1.000000
vn -1.000000 -0.000000 -0.000000
vn 0.000000 0.000000 -1.000000
vn 1.000000 0.000000 0.000001
s off
f 1//1 2//1 4//1
f 5//2 8//2 6//2
f 1//3 5//3 2//3
f 2//4 6//4 3//4
f 3//5 7//5 4//5
f 5//6 1//6 8//6
f 2//1 3//1 4//1
f 8//2 7//2 6//2
f 5//7 6//7 2//7
f 6//4 7//4 3//4
f 7//5 8//5 4//5
f 1//6 4//6 8//6
The face indices are separated by two slashes where I expected only one.
Is there a simpler way to generate .h and .c files for Xcode? Every script (for Blender) I tried failed. Or can anyone tell me how to get a clean obj file?
Thanks a lot
Your resulting file is a correct obj file. What you expected it to look like, however, would not be.
The format of the f command is f position_id/texture_coordinates_id/normal_id. Your mesh has no texture coordinates, so that field is empty, which leaves two consecutive slashes.
Your options are:
fix the parser so that it can load obj files without texture coordinates, or
just add a UV map to your object.
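To illustrate the first option, here is a minimal sketch of a face-line parser that accepts both v/vt/vn and v//vn tokens. It is written in Python rather than in the language of the tutorial's script, and the function names are hypothetical. Missing fields are returned as None.

```python
def parse_face_vertex(token):
    """Split one face token ('v', 'v/vt', 'v//vn' or 'v/vt/vn') into (v, vt, vn)."""
    parts = token.split("/")
    # Pad so that a bare 'v' or 'v/vt' still yields three fields.
    parts += [""] * (3 - len(parts))
    v, vt, vn = (int(p) if p else None for p in parts[:3])
    return v, vt, vn


def parse_face_line(line):
    """Parse an 'f ...' line into a list of (v, vt, vn) index tuples."""
    fields = line.split()
    if not fields or fields[0] != "f":
        raise ValueError("not a face line: %r" % line)
    return [parse_face_vertex(tok) for tok in fields[1:]]


print(parse_face_line("f 1//1 2//1 4//1"))
print(parse_face_line("f 1/1/1 2/2/1 4/3/1"))
```

A converter built this way can skip the texture-coordinate output whenever vt is None instead of failing on the empty field.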