Julia DataFrames - parallel join operations?

I am wondering if there is a way to parallelize leftjoin operations in Julia.
For example, I have the following DataFrames:
df1:
Id LDC LDR
a ldc1 ldr1
b ldc2 ldr2
c ldc3 ldr4
d ldc2 ldr3
df2:
LDC dc1 dc2 dc3
ldc1 0.5 0.4 0.2
ldc2 0.1 0.6 0.7
ldc3 0.4 0.9 0.3
df3:
LDR lap1 lap2 lap3
ldr1 0.05 0.06 0.07
ldr2 0.10 0.12 0.13
ldr3 0.01 0.01 0.02
ldr4 0.05 0.06 0.07
I currently perform the joins serially, as below:
df1 = leftjoin(df1, df2, on = "LDC")
df1 = leftjoin(df1, df3, on = "LDR")
which gives me the desired result:
Id LDC LDR dc1 dc2 dc3 lap1 lap2 lap3
a ldc1 ldr1 0.5 0.4 0.2 0.05 0.06 0.07
b ldc2 ldr2 0.1 0.6 0.7 0.10 0.12 0.13
d ldc2 ldr3 0.1 0.6 0.7 0.01 0.01 0.02
c ldc3 ldr4 0.4 0.9 0.3 0.05 0.06 0.07
My question is: is there a way to "populate" the initial DataFrame (df1) with a parallelized join operation and obtain the same result?
Thanks for any help you could provide.
UPDATE: Here is the code to generate df1, df2 and df3.
using DataFrames

df1 = DataFrame(Id = ["a","b","c","d"],
                LDC = ["ldc1","ldc2","ldc3","ldc2"],
                LDR = ["ldr1","ldr2","ldr4","ldr3"]
                )
df2 = DataFrame(LDC = ["ldc1","ldc2","ldc3"],
                dc1 = [0.5,0.1,0.4],
                dc2 = [0.4,0.6,0.9],
                dc3 = [0.2,0.7,0.3]
                )
df3 = DataFrame(LDR = ["ldr1","ldr2","ldr3","ldr4"],
                lap1 = [0.05,0.10,0.01,0.05],
                lap2 = [0.06,0.12,0.01,0.06],
                lap3 = [0.07,0.13,0.02,0.07]
                )
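One possibility worth trying (a hedged sketch, not a tested answer from this thread): the two lookups are independent, so they can run on separate tasks with Threads.@spawn and be combined afterwards. This assumes Julia was started with multiple threads (e.g. julia -t 4), and whether it beats the serial version will depend on the table sizes.

# Sketch: run the two independent joins on separate threads, then
# combine the partial results on Id and LDR.
t2 = Threads.@spawn leftjoin(df1, df2, on = "LDC")
t3 = Threads.@spawn leftjoin(df1[:, ["Id", "LDR"]], df3, on = "LDR")
res = leftjoin(fetch(t2), fetch(t3), on = ["Id", "LDR"])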


fill_between does not fill in the correct portion

I created a Planck curve with spectral colours for the visible light range.
Problem:
I would like to create a shaded area under the curve, but I failed to color the correct area under the curve. Thank you very much.
I tried:
1. plt.fill_between(wavelengths, 0, spectrum, color='w'), but this does not work.
2. I also tried plt.fill_between(wavelengths, 1e13-spectrum, color='w'); again this does not work.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors

h = 6.626e-34
c = 3.0e+8
k = 1.38e-23

def planck(wav, T):
    a = 2.0*h*c**2
    wav = wav*1e-9
    b = h*c/(wav*k*T)
    intensity = a / ((wav**5) * (np.exp(b) - 1.0))
    return intensity

def wavelength_to_rgb(wavelength, gamma=0.8):
    '''taken from http://www.noah.org/wiki/Wavelength_to_RGB_in_Python
    This converts a given wavelength of light to an
    approximate RGB color value. The wavelength must be given
    in nanometers in the range from 380 nm through 750 nm
    (789 THz through 400 THz).
    Based on code by Dan Bruton
    http://www.physics.sfasu.edu/astro/color/spectra.html
    Additionally alpha value set to 0.5 outside range'''
    wavelength = float(wavelength)
    if wavelength >= 380 and wavelength <= 750:
        A = 1.
    else:
        A = 0.5
    if wavelength < 380:
        wavelength = 380.
    if wavelength > 750:
        wavelength = 750.
    if wavelength >= 380 and wavelength <= 440:
        attenuation = 0.3 + 0.7 * (wavelength - 380) / (440 - 380)
        R = ((-(wavelength - 440) / (440 - 380)) * attenuation) ** gamma
        G = 0.0
        B = (1.0 * attenuation) ** gamma
    elif wavelength >= 440 and wavelength <= 490:
        R = 0.0
        G = ((wavelength - 440) / (490 - 440)) ** gamma
        B = 1.0
    elif wavelength >= 490 and wavelength <= 510:
        R = 0.0
        G = 1.0
        B = (-(wavelength - 510) / (510 - 490)) ** gamma
    elif wavelength >= 510 and wavelength <= 580:
        R = ((wavelength - 510) / (580 - 510)) ** gamma
        G = 1.0
        B = 0.0
    elif wavelength >= 580 and wavelength <= 645:
        R = 1.0
        G = (-(wavelength - 645) / (645 - 580)) ** gamma
        B = 0.0
    elif wavelength >= 645 and wavelength <= 750:
        attenuation = 0.3 + 0.7 * (750 - wavelength) / (750 - 645)
        R = (1.0 * attenuation) ** gamma
        G = 0.0
        B = 0.0
    else:
        R = 0.0
        G = 0.0
        B = 0.0
    return (R, G, B, A)

clim = (350, 780)
norm = plt.Normalize(*clim)
wl = np.arange(clim[0], clim[1] + 1, 2)
colorlist = list(zip(norm(wl), [wavelength_to_rgb(w) for w in wl]))
spectralmap = matplotlib.colors.LinearSegmentedColormap.from_list("spectrum", colorlist)

fig, axs = plt.subplots(1, 1, figsize=(8, 4), tight_layout=True)
wavelengths = np.linspace(100, 4000, 1000)
spectrum = planck(wavelengths, 5778)
plt.plot(wavelengths, spectrum, color='darkred')

y = np.linspace(0, 3e13, 10)
X, Y = np.meshgrid(wavelengths, y)
extent = (np.min(wavelengths), np.max(wavelengths), np.min(y), np.max(y))
plt.imshow(X, clim=clim, extent=extent, cmap=spectralmap, aspect='auto')

plt.xticks(fontsize=15)
plt.xlabel('Wavelength (nm)', fontsize=20)
plt.title('Shortwave radiation', fontsize=20)
plt.text(1000, 2.5e13, 'Sun 5778K', fontsize=20)
plt.fill_between(wavelengths, 0, spectrum, color='w')
plt.savefig('WavelengthColors.png', dpi=200)

ax = plt.gca()
#ax.axes.get_yaxis().set_visible(False)
right_side = ax.spines["right"]
right_side.set_visible(False)
left_side = ax.spines["left"]
left_side.set_visible(False)
top_side = ax.spines["top"]
top_side.set_visible(False)
plt.ylabel("Normalized spectral emittance", fontsize=20)
plt.show()
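A possible fix (an assumption, not from the original post): the imshow background covers the whole axes, so "shading under the curve" really means whiting out the region above the curve rather than below it. Replacing the fill_between call above with something along these lines should leave only the area under the curve coloured:

# Sketch: mask everything *above* the Planck curve with white, using the
# same upper y-limit (3e13) as the imshow extent.
plt.fill_between(wavelengths, spectrum, 3e13, color='w')
plt.ylim(0, 3e13)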

How To Loop Nested If-Then Statements in VBA

I am going to try to simplify my objective as well as add all of my VBA, as my original post was not clear.
I am writing a macro that is to be used to determine a commission percentage based on a particular strategy (Tier1, Tier2, BPO or Enterprise), a gross margin range and a contract year. This will need to be looped through about 5,000 rows of data in the final product. I have been trying to nest multiple If-Then statements to achieve my goal; however, it is not working.
Below is the table of the commission rates that apply to each of the strategies, and then the code that I wrote for this nested If-Then statement.
I am looking to make this simpler and loop it through the entirety of the rows with data. The goal is to have each cell in column J return a commission rate determined by the strategy in column I, the year in column D and the GM in column Z. The strategy can vary from row to row.
Would I be better off creating a custom Function?
Kind of a crazy task for a first-time macro writer. I appreciate all the feedback I have gotten already and look forward to any other ideas to come.
[Screenshots: commission rate tables for each strategy and sample data.]
My Code:
Where Column I = Strategy
Where Column D = Year
Where Column Z = Gross Margin
Where Column J = Result of If-Then
Where Column C is a defined data set that determines the number of rows in the workbook.
Sub Define_Comm_Rate()
    Dim LastRow As Long
    LastRow = Range("C" & Rows.Count).End(xlUp).Row
    If Sheet1.Range("I2") = "BPO" And Sheet1.Range("Z2") >= 0.24 Then
        If Sheet1.Range("D2") = 1 Then
            Sheet1.Range("J2") = 0.4
        Else
            If Sheet1.Range("D2") = 2 Then
                Sheet1.Range("J2") = 0.3
            Else: Sheet1.Range("J2") = 0.15
            End If
        End If
    End If
    If Sheet1.Range("I2") = "BPO" And Sheet1.Range("Z2") >= 0.21 And Sheet1.Range("Z2") < 0.24 Then
        If Sheet1.Range("D2") = 1 Then
            Sheet1.Range("J2") = 0.35
        Else
            If Sheet1.Range("D2") = 2 Then
                Sheet1.Range("J2") = 0.25
            Else: Sheet1.Range("J2") = 0.1
            End If
        End If
    End If
    If Sheet1.Range("I2") = "BPO" And Sheet1.Range("Z2") >= 0.18 And Sheet1.Range("Z2") < 0.21 Then
        If Sheet1.Range("D2") = 1 Then
            Sheet1.Range("J2") = 0.3
        Else
            If Sheet1.Range("D2") = 2 Then
                Sheet1.Range("J2") = 0.2
            Else: Sheet1.Range("J2") = 0.05
            End If
        End If
    End If
    If Sheet1.Range("I2") = "BPO" And Sheet1.Range("Z2") < 0.18 Then
        If Sheet1.Range("D2") = 1 Then
            Sheet1.Range("J2") = 0.25
        Else
            If Sheet1.Range("D2") = 2 Then
                Sheet1.Range("J2") = 0.15
            Else: Sheet1.Range("J2") = 0.05
            End If
        End If
    End If
    If Sheet1.Range("I2") = "Enterprise24" Then
        If Sheet1.Range("D2") = 1 Then
            Sheet1.Range("J2") = 0.4
        Else
            If Sheet1.Range("D2") = 2 Then
                Sheet1.Range("J2") = 0.3
            Else: Sheet1.Range("J2") = 0.15
            End If
        End If
    End If
    If Sheet1.Range("I2") = "Enterprise21" Then
        If Sheet1.Range("D2") = 1 Then
            Sheet1.Range("J2") = 0.35
        Else
            If Sheet1.Range("D2") = 2 Then
                Sheet1.Range("J2") = 0.25
            Else: Sheet1.Range("J2") = 0.1
            End If
        End If
    End If
    If Sheet1.Range("I2") = "Enterprise18" Then
        If Sheet1.Range("D2") = 1 Then
            Sheet1.Range("J2") = 0.3
        Else
            If Sheet1.Range("D2") = 2 Then
                Sheet1.Range("J2") = 0.2
            Else: Sheet1.Range("J2") = 0.05
            End If
        End If
    End If
    If Sheet1.Range("I2") = "Enterprise00" Then
        If Sheet1.Range("D2") = 1 Then
            Sheet1.Range("J2") = 0.25
        Else
            If Sheet1.Range("D2") = 2 Then
                Sheet1.Range("J2") = 0.15
            Else: Sheet1.Range("J2") = 0.05
            End If
        End If
    End If
    If Sheet1.Range("I2") = "Tier1" Then
        If Sheet1.Range("Z2") > 0.4 Then
            Sheet1.Range("J2") = 0.5
        Else
            If Sheet1.Range("Z2") <= 0.4 And Sheet1.Range("Z2") > 0.25 Then
                Sheet1.Range("J2") = (1 * Sheet1.Range("Z2")) + 0.1
            Else
                If Sheet1.Range("Z2") <= 0.25 And Sheet1.Range("Z2") > 0.075 Then
                    Sheet1.Range("J2") = (2 * Sheet1.Range("Z2")) - 0.15
                Else
                    If Sheet1.Range("Z2") <= 0.075 And Sheet1.Range("Z2") > 0 Then
                        Sheet1.Range("J2") = 0
                    Else: Sheet1.Range("J2") = 0.5
                    End If
                End If
            End If
        End If
    End If
    If Sheet1.Range("I2") = "Tier1-100" Then
        If Sheet1.Range("Z2") > 0.4 Then
            Sheet1.Range("J2") = 0.5
        Else
            If Sheet1.Range("Z2") <= 0.4 And Sheet1.Range("Z2") > 0.25 Then
                Sheet1.Range("J2") = (1 * Sheet1.Range("Z2")) + 0.1
            Else
                If Sheet1.Range("Z2") <= 0.25 And Sheet1.Range("Z2") > 0.075 Then
                    Sheet1.Range("J2") = (2 * Sheet1.Range("Z2")) - 0.15
                Else
                    If Sheet1.Range("Z2") <= 0.075 And Sheet1.Range("Z2") > 0 Then
                        Sheet1.Range("J2") = 0
                    Else: Sheet1.Range("J2") = 0.5
                    End If
                End If
            End If
        End If
    End If
    If Sheet1.Range("I2") = "Tier2" Then
        If Sheet1.Range("Z2") > 0.35 Then
            Sheet1.Range("J2") = 0.5
        Else
            If Sheet1.Range("Z2") <= 0.35 And Sheet1.Range("Z2") > 0.25 Then
                Sheet1.Range("J2") = (1 * Sheet1.Range("Z2")) + 0.15
            Else
                If Sheet1.Range("Z2") <= 0.25 And Sheet1.Range("Z2") > 0.05 Then
                    Sheet1.Range("J2") = (2 * Sheet1.Range("Z2")) - 0.1
                Else
                    If Sheet1.Range("Z2") <= 0.05 And Sheet1.Range("Z2") > 0 Then
                        Sheet1.Range("J2") = 0
                    Else: Sheet1.Range("J2") = 0.5
                    End If
                End If
            End If
        End If
    End If
    If Sheet1.Range("I2") = "Tier2-100" Then
        If Sheet1.Range("Z2") > 0.35 Then
            Sheet1.Range("J2") = 0.5
        Else
            If Sheet1.Range("Z2") <= 0.35 And Sheet1.Range("Z2") > 0.25 Then
                Sheet1.Range("J2") = (1 * Sheet1.Range("Z2")) + 0.15
            Else
                If Sheet1.Range("Z2") <= 0.25 And Sheet1.Range("Z2") > 0.05 Then
                    Sheet1.Range("J2") = (2 * Sheet1.Range("Z2")) - 0.1
                Else
                    If Sheet1.Range("Z2") <= 0.05 And Sheet1.Range("Z2") > 0 Then
                        Sheet1.Range("J2") = 0
                    Else: Sheet1.Range("J2") = 0.5
                    End If
                End If
            End If
        End If
    End If
    Sheet1.Range("J2").AutoFill Destination:=Sheet1.Range("J2:J" & LastRow)
    Application.Calculate
End Sub
I'm going to offer a non-VBA approach to this using INDIRECT, INDEX, MATCH, and a few tables. My thought is that instead of coding lots of nested IFs, with hard-coded values, in VBA, you should be able to do this with lookup tables. (Disclaimer: this was also a fun intellectual exercise.)
First, create a table similar to the Commissions Table you already have and name it your specific strategy, e.g. "BPO", under Formulas > Name Manager. I created mine on a separate sheet named "Tables". Note that I used 1 in row 1 as your max (and unrealistic) gross margin. I also added 1, 2, and 3 in cells B1, C1, and D1 respectively. You'd need to create similar tables for your other strategies, and put them under the BPO table.
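For reference, since the screenshots are not reproduced here, the BPO table would plausibly look as follows (a reconstruction from the rates in the question's code and from the MATCH example below; the 0.2099 and 0.1799 rows follow the same pattern as the 0.2399 row):

Gross Margin   1      2      3
1.0000         0.40   0.30   0.15
0.2399         0.35   0.25   0.10
0.2099         0.30   0.20   0.05
0.1799         0.25   0.15   0.05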
Then in column J on your data tab, enter this formula: =INDEX(INDIRECT(I2),MATCH(Z2,INDIRECT(I2&"["&Z$1&"]"),-1),MATCH(D2,Tables!$A$1:$D$1,1))
This INDEX formula has 3 main parts:
1. INDIRECT(I2) - this returns the array comprising the table you have named "BPO", so you know you're looking at the table appropriate to that particular strategy.
2. MATCH(Z2,INDIRECT(I2&"["&Z$1&"]"),-1) - this looks up your gross margin in column Z against the table (BPO), looking in its [Gross Margin] column (the structured reference is built from the column header stored in Z1). The last argument of MATCH (match type) is -1, meaning that it finds the smallest value that is greater than or equal to your gross margin (note that in the table, Gross Margin is sorted in descending order). So for example, if your Gross Margin is 0.22, the MATCH will land on the 0.2399 row.
3. MATCH(D2,Tables!$A$1:$D$1,1) - this looks up the year and finds the largest value that is less than or equal to it, returning its position in row 1. So if the year is 1, 2, or 3, the MATCH returns column 2, 3, or 4 respectively (column 1 holds the Gross Margin header), and if the year is greater than 3 it still returns the Year-3 column.
Columns AB and AC in the second screenshot are just the results of 2. and 3. above, included to show that the correct commission value is being returned. Note that the "Year Column" is not Year 2 or Year 3, but the 2nd or 3rd column in the BPO table, i.e. Year 1 or Year 2 respectively.
Thanks for all the input/feedback.
Due to the added complexity of additional sales plans needing to be incorporated, as well as needing the flexibility to add or remove sales plans at any time, I ended up writing a custom Function.
Function Commissions(Strategy As String, GM As Variant, YR As Variant) As Variant
    If Strategy = "BPO" And GM >= 0.24 And YR = 1 Then
        Commissions = 0.4
    ElseIf Strategy = "BPO" And GM >= 0.24 And YR = 2 Then
        Commissions = 0.3
    ElseIf Strategy = "BPO" And GM >= 0.24 And YR >= 3 Then
        Commissions = 0.15
    ElseIf Strategy = "BPO" And GM >= 0.21 And GM < 0.24 And YR = 1 Then
        Commissions = 0.35
    ElseIf Strategy = "BPO" And GM >= 0.21 And GM < 0.24 And YR = 2 Then
        Commissions = 0.25
    ElseIf Strategy = "BPO" And GM >= 0.21 And GM < 0.24 And YR >= 3 Then
        Commissions = 0.1
    ElseIf Strategy = "BPO" And GM >= 0.18 And GM < 0.21 And YR = 1 Then
        Commissions = 0.3
    ElseIf Strategy = "BPO" And GM >= 0.18 And GM < 0.21 And YR = 2 Then
        Commissions = 0.2
    ElseIf Strategy = "BPO" And GM >= 0.18 And GM < 0.21 And YR >= 3 Then
        Commissions = 0.05
    ElseIf Strategy = "BPO" And GM < 0.18 And YR = 1 Then
        Commissions = 0.25
    ElseIf Strategy = "BPO" And GM < 0.18 And YR = 2 Then
        Commissions = 0.15
    ElseIf Strategy = "BPO" And GM < 0.18 And YR >= 3 Then
        Commissions = 0.05
    ' all other strategies continued below...
    End If
End Function
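Presumably (this is not shown in the original post) the function is then entered as a worksheet formula in column J, e.g. =Commissions(I2, Z2, D2) in J2, and filled down through the last data row, so that each row picks up its own strategy, gross margin and year.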

Dataframe transposition and mapping

I need to perform a two-sample t-test, for which I have to transpose my sample file and map values from another csv file onto it. I am new to Python; so far I have tried this:
import pandas as pd

df = pd.read_csv('project.csv', delimiter=',', dtype='unicode',
                 error_bad_lines=False)
df.set_index('TaxID', inplace=True)
df_kraken = df.T
df_meta = pd.read_csv('Meta3251.csv', delimiter=',', dtype='unicode',
                      error_bad_lines=False, usecols=['SRA ID', '(0/1)'])
df_kraken['Meta'] = df_kraken['TaxID'].map(df_meta.set_index('SRA ID')['(0/1)'])  # raises KeyError: 'TaxID'
My sample file dataframe after transposition looks like this:
333046 1049 337090
PRJEB3251_ERR169499 0.05 0.03 0.01
PRJEB3251_ERR169500 0 0 0
PRJEB3251_ERR169501 0 0 0
PRJEB3251_ERR169502 0.05 0 0
PRJEB3251_ERR169503 0.03 1.9 0
PRJEB3251_ERR169507 0.01 0 0
PRJEB3251_ERR169508 0 0.1 0
PRJEB3251_ERR169509 0 0.05 0
The index has not been set as TaxID.
I have another csv file, which I have read into another dataframe so that I can map the values. It looks like:
SRA ID (0/1)
ERR169611 1
ERR169610 1
ERR169609 1
ERR169608 1
ERR169607 0
ERR169606 0
ERR169605 1
ERR169604 1
ERR169484 0
I need to map the zero/one values to the first column of the 1st dataframe. I'm stuck with the error KeyError: 'TaxID'.
Any help regarding this will be highly appreciated.
After your suggestion I have this:
import pandas as pd

df = pd.read_csv('project.csv').set_index('ID').T
df = df.reset_index().rename(columns={'index': 'Project ID'})
df_meta = pd.read_csv('Meta3251.csv', delimiter=',', dtype='unicode',
                      error_bad_lines=False, usecols=['SRA ID', '(0/1)'])
df['KEY'] = df['Project ID'].str.split('_').str[1]
df['Meta ID'] = df['KEY'].replace(dict(zip(df_meta['SRA ID'], df['Project ID'])))
df.to_csv('R.csv')
After this I have the following result:
Project ID 333046 1049 KEY Meta ID
0 PRJEB3251_ERR169499 0.05 0.03 ERR169499 PRJEB3251_ERR169636
1 PRJEB3251_ERR169500 0 0 ERR169500 PRJEB3251_ERR169635
2 PRJEB3251_ERR169501 0 0 ERR169501 PRJEB3251_ERR169626
3 PRJEB3251_ERR169502 0.05 0 ERR169502 PRJEB3251_ERR169625
I still have the index, but the good part is that now I am able to rename my column; the mapping is not working though.
Here is a solution that could work:
df = pd.read_csv('project.csv', delimiter=',', dtype='unicode', error_bad_lines=False)
df.set_index('TaxID', inplace=True)
df_kraken = df.T.reset_index().rename(columns={'index': 'TaxID'})  # Make sure 'TaxID' is a column
df_meta = pd.read_csv('Meta3251.csv', delimiter=',', dtype='unicode',
                      error_bad_lines=False, usecols=['SRA ID', '(0/1)'])
# In your example the second dataframe only matches what's after the '_',
# so you can isolate that part
df_kraken['KEY'] = df_kraken['TaxID'].str.split('_').str[1]
df_kraken['Meta'] = df_kraken['KEY'].replace(dict(zip(df_meta['SRA ID'], df_meta['(0/1)'])))
EDIT
The question has been edited.
After read_csv() (first line):
TaxID PRJEB3251_ERR169499 PRJEB3251_ERR169500 PRJEB3251_ERR169501
0 333046 0.05 0 0
1 1049 0.03 0 0
2 337090 0.01 0 0
3 288681 3.6 0 0
4 267889 0.02 0 0
...
Then
df = df.set_index('TaxID').T
print(df)
TaxID 333046 1049 337090
PRJEB3251_ERR169499 0.05 0.03 0.01
PRJEB3251_ERR169500 0.00 0.00 0.00
PRJEB3251_ERR169501 0.00 0.00 0.00
Note that at this point TaxID is the name of the columns index, not of the row index. If you want to have TaxID as a column:
df = df.reset_index().rename(columns={'index': 'TaxID'})
To avoid confusion you can remove TaxID from the column name:
df.columns.name = None
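Putting the pieces together, here is a minimal end-to-end sketch (column names are taken from the question; treat the exact file layout as an assumption, not something tested against the real data):

import pandas as pd

# Transpose the table so samples become rows, then map the 0/1 group
# labels onto each sample via the SRA part of its ID.
df = pd.read_csv('project.csv', dtype='unicode').set_index('TaxID').T
df = df.reset_index().rename(columns={'index': 'Sample'})
df.columns.name = None  # drop the leftover 'TaxID' columns name

df_meta = pd.read_csv('Meta3251.csv', usecols=['SRA ID', '(0/1)'])
df['KEY'] = df['Sample'].str.split('_').str[1]  # e.g. 'ERR169499'
df['Meta'] = df['KEY'].map(df_meta.set_index('SRA ID')['(0/1)'])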

Get frequency of items in a pandas column in given intervals of values stored in another pandas column

My dataframe:
import numpy as np
import pandas as pd

class_lst = ["B","A","C","Z","H","K","O","W","L","R","M","Y","Q","X","X","G","G","G","G","G"]
value_lst = [1,0.999986,1,0.999358,0.999906,0.995292,0.998481,0.388307,0.99608,0.99829,1,0.087298,1,1,0.999993,1,1,1,1,1]
df = pd.DataFrame(
    {'class': class_lst,
     'val': value_lst
    })
For any interval of 'val' in ranges
ranges = np.arange(0.0, 1.1, 0.1)
I would like to get the frequency of 'val' items, as follows:
class range frequency
A (0, 0.10] 0
A (0.10, 0.20] 0
A (0.20, 0.30] 0
...
A (0.90, 1.00] 1
G (0, 0.10] 0
G (0.10, 0.20] 0
G (0.20, 0.30] 0
...
G (0.80, 0.90] 0
G (0.90, 1.00] 5
...
I tried
df.groupby(pd.cut(df.val, ranges)).count()
but the output looks like
class val
val
(0, 0.1] 1 1
(0.1, 0.2] 0 0
(0.2, 0.3] 0 0
(0.3, 0.4] 1 1
(0.4, 0.5] 0 0
(0.5, 0.6] 0 0
(0.6, 0.7] 0 0
(0.7, 0.8] 0 0
(0.8, 0.9] 0 0
(0.9, 1] 18 18
and does not match the expected one.
This might be a good start:
df["range"] = pd.cut(df['val'], ranges)
class val range
0 B 1.000000 (0.9, 1.0]
1 A 0.999986 (0.9, 1.0]
2 C 1.000000 (0.9, 1.0]
3 Z 0.999358 (0.9, 1.0]
4 H 0.999906 (0.9, 1.0]
5 K 0.995292 (0.9, 1.0]
6 O 0.998481 (0.9, 1.0]
7 W 0.388307 (0.3, 0.4]
8 L 0.996080 (0.9, 1.0]
9 R 0.998290 (0.9, 1.0]
10 M 1.000000 (0.9, 1.0]
11 Y 0.087298 (0.0, 0.1]
12 Q 1.000000 (0.9, 1.0]
13 X 1.000000 (0.9, 1.0]
14 X 0.999993 (0.9, 1.0]
15 G 1.000000 (0.9, 1.0]
16 G 1.000000 (0.9, 1.0]
17 G 1.000000 (0.9, 1.0]
18 G 1.000000 (0.9, 1.0]
19 G 1.000000 (0.9, 1.0]
and then
df.groupby(["class", "range"]).size()
class range
A (0.9, 1.0] 1
B (0.9, 1.0] 1
C (0.9, 1.0] 1
G (0.9, 1.0] 5
H (0.9, 1.0] 1
K (0.9, 1.0] 1
L (0.9, 1.0] 1
M (0.9, 1.0] 1
O (0.9, 1.0] 1
Q (0.9, 1.0] 1
R (0.9, 1.0] 1
W (0.3, 0.4] 1
X (0.9, 1.0] 2
Y (0.0, 0.1] 1
Z (0.9, 1.0] 1
This already gives the right bin for each class and its frequency.
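To also get the empty intervals with frequency 0, as in the desired output, one possible extension (not part of the original answer) is to unstack the counts with a fill value and stack them back:

# Sketch: fill in the missing class/interval combinations with 0.
freq = (df.groupby(["class", "range"]).size()
          .unstack(fill_value=0)
          .stack()
          .rename("frequency")
          .reset_index())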

Pandas dataframe finding largest N elements of each row with row-specific N

I have a DataFrame:
>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 23.0 24.0 25.0
and a Series that specifies the number of top N values I want from each row
>>> n_max = pd.Series([2,3,4])
What is the pandas way of using df and n_max to find the largest N elements of each row (breaking ties with a random pick, just as .nlargest() would do)?
The desired output is
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:
>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
But the goal, again, is to have a row-specific N. Looping through each row obviously doesn't count (for performance reasons). And I've tried using .rank() with a mask, but tie-breaking doesn't work there...
Based on #ScottBoston's comment on the OP, it is possible to use the following rank-based mask to solve this problem:
>>> n_max.index = df.index
>>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
>>> selected = df_rank.le(n_max, axis=0)
>>> df[selected]
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
For performance, I would suggest NumPy -
def mask_variable_largest_per_row(df, n_max):
    a = df.values
    m, n = a.shape
    # Count of values to reset per row: everything except the top n_max
    # values and the NaNs already present.
    nan_row_count = np.isnan(a).sum(1)
    n_reset = n - n_max.values - nan_row_count
    n_reset.clip(min=0, max=n-1, out=n_reset)
    sidx = a.argsort(1)                     # column indices, ascending per row
    mask = n_reset[:, None] > np.arange(n)  # select the n_reset smallest per row
    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r, c] = np.nan                        # mask them in-place
    return df
Sample run -
In [182]: df
Out[182]:
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 5.0 24.0 25.0
In [183]: n_max = pd.Series([2,3,2])
In [184]: mask_variable_largest_per_row(df, n_max)
Out[184]:
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 NaN NaN NaN 24.0 25.0
Further boost: bringing in numpy.argpartition to replace numpy.argsort should help, as we don't care about the order of the indices to be reset as NaNs. Thus, a numpy.argpartition based one would be -
def mask_variable_largest_per_row_v2(df, n_max):
    a = df.values
    m, n = a.shape
    nan_row_count = np.isnan(a).sum(1)
    n_reset = n - n_max.values - nan_row_count
    n_reset.clip(min=0, max=n-1, out=n_reset)
    N = (n - n_max.values).max()
    N = np.clip(N, a_min=0, a_max=n-1)
    sidx = a.argpartition(N, axis=1)  # sidx = a.argsort(1)
    mask = n_reset[:, None] > np.arange(n)
    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r, c] = np.nan
    return df
Runtime test
Other approaches -
def pandas_rank_based(df, n_max):
    n_max.index = df.index
    df_rank = df.stack(dropna=False).groupby(level=0).rank(
        ascending=False, method='first').unstack()
    selected = df_rank.le(n_max, axis=0)
    return df[selected]
Verification and timings -
In [387]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
...: out1 = pandas_rank_based(df1, n_max)
...: out2 = mask_variable_largest_per_row(df2, n_max)
...: out3 = mask_variable_largest_per_row_v2(df3, n_max)
...: print np.nansum(out1-out2)==0 # Verify
...: print np.nansum(out1-out3)==0 # Verify
...:
True
True
In [388]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
In [389]: %timeit pandas_rank_based(df1, n_max)
1 loops, best of 3: 559 ms per loop
In [390]: %timeit mask_variable_largest_per_row(df2, n_max)
10 loops, best of 3: 34.1 ms per loop
In [391]: %timeit mask_variable_largest_per_row_v2(df3, n_max)
100 loops, best of 3: 5.92 ms per loop
Pretty good speedups there of 50x+ over the pandas built-in!