Get frequency of items in a pandas column in given intervals of values stored in another pandas column

My dataframe
class_lst = ["B","A","C","Z","H","K","O","W","L","R","M","Y","Q","X","X","G","G","G","G","G"]
value_lst = [1,0.999986,1,0.999358,0.999906,0.995292,0.998481,0.388307,0.99608,0.99829,1,0.087298,1,1,0.999993,1,1,1,1,1]
df = pd.DataFrame({'class': class_lst,
                   'val': value_lst})
For each interval of 'val' defined by
ranges = np.arange(0.0, 1.1, 0.1)
I would like to get the frequency of 'val' items per class, as follows:
class range frequency
A (0, 0.10] 0
A (0.10, 0.20] 0
A (0.20, 0.30] 0
...
A (0.90, 1.00] 1
G (0, 0.10] 0
G (0.10, 0.20] 0
G (0.20, 0.30] 0
...
G (0.80, 0.90] 0
G (0.90, 1.00] 5
...
I tried
df.groupby(pd.cut(df.val, ranges)).count()
but the output looks like
class val
val
(0, 0.1] 1 1
(0.1, 0.2] 0 0
(0.2, 0.3] 0 0
(0.3, 0.4] 1 1
(0.4, 0.5] 0 0
(0.5, 0.6] 0 0
(0.6, 0.7] 0 0
(0.7, 0.8] 0 0
(0.8, 0.9] 0 0
(0.9, 1] 18 18
which does not match the expected output.

This might be a good start:
df["range"] = pd.cut(df['val'], ranges)
class val range
0 B 1.000000 (0.9, 1.0]
1 A 0.999986 (0.9, 1.0]
2 C 1.000000 (0.9, 1.0]
3 Z 0.999358 (0.9, 1.0]
4 H 0.999906 (0.9, 1.0]
5 K 0.995292 (0.9, 1.0]
6 O 0.998481 (0.9, 1.0]
7 W 0.388307 (0.3, 0.4]
8 L 0.996080 (0.9, 1.0]
9 R 0.998290 (0.9, 1.0]
10 M 1.000000 (0.9, 1.0]
11 Y 0.087298 (0.0, 0.1]
12 Q 1.000000 (0.9, 1.0]
13 X 1.000000 (0.9, 1.0]
14 X 0.999993 (0.9, 1.0]
15 G 1.000000 (0.9, 1.0]
16 G 1.000000 (0.9, 1.0]
17 G 1.000000 (0.9, 1.0]
18 G 1.000000 (0.9, 1.0]
19 G 1.000000 (0.9, 1.0]
and then
df.groupby(["class", "range"]).size()
class range
A (0.9, 1.0] 1
B (0.9, 1.0] 1
C (0.9, 1.0] 1
G (0.9, 1.0] 5
H (0.9, 1.0] 1
K (0.9, 1.0] 1
L (0.9, 1.0] 1
M (0.9, 1.0] 1
O (0.9, 1.0] 1
Q (0.9, 1.0] 1
R (0.9, 1.0] 1
W (0.3, 0.4] 1
X (0.9, 1.0] 2
Y (0.0, 0.1] 1
Z (0.9, 1.0] 1
This already gives the right bin for each class together with its frequency.
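If you also want the empty bins listed for every class, as in the expected output, you can ask groupby to keep all of the categories produced by pd.cut. A minimal sketch, assuming a pandas version where groupby accepts observed=False:
# keep every class/range combination, including the empty bins
freq = (df.groupby(["class", "range"], observed=False)
          .size()
          .reset_index(name="frequency"))
print(freq.head())
This yields one row per class and bin, with frequency 0 for the empty combinations.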

Creating a dataset (data frame, df) in R consisting of all possible combinations of 10 different variables

Thanks in advance for any advice. As part of a study, I need to:
Part 1:
I need to create a .csv dataset (or R data frame?) that contains all possible combinations of 10 different variables. Each of the 10 variables has either 2 (i.e., binary 0, 1) or 4 levels. I think this should be easily possible in both Excel and R, but R would be preferable. The variables and their levels are defined in the code below:
For example, the first set of combinations would keep "DrugA_LIFE" at 0.5 and cycle through all combinations of the other variables; it would then fix "DrugA_LIFE" at 1 and cycle through all the other combinations again. Eventually, it would move on to fix "DrugA_NEED" at 0 while changing the other variables, then at 1, and so on.
The dataset should be a full set of combinations with no repeat combinations.
I understand there is a large number of possible combinations - this is as expected, but I don't think this should be too difficult to compute.
Part 2:
I then need to go through this dataset, selecting only the combinations where:
1) "DrugA_LIFE" is the same as "DrugB_LIFE"
AND
2) "DrugA_NEED" is the same as "DrugB_NEED"
I think this should be simple with the dplyr package in R.
I have created the df in R, but do not know how to begin cycling through the variables to produce all possible combinations.
#DATASET OF ALL POSSIBLE CHOICE SETS#
#Creating the Vectors of choices
DrugA_LIFE <- c(0.5, 1, 2, 5)
DrugA_NEED <- c(0, 1)
DrugA_CERT <- c(0, 0.2, 0.4, 0.6)
DrugA_RISK <- c(0.1, 0.2, 0.4, 0.6)
DrugA_WAIT <- c(0, 0.5, 1, 2)
DrugB_LIFE <- c(0.5, 1, 2, 5)
DrugB_NEED <- c(0, 1)
DrugB_CERT <- c(0, 0.2, 0.4, 0.6)
DrugB_RISK <- c(0.1, 0.2, 0.4, 0.6)
DrugB_WAIT <- c(0, 0.5, 1, 2)
#Create data frame
df <- data.frame(DrugA_LIFE, DrugA_NEED, DrugA_CERT, DrugA_RISK, DrugA_WAIT, DrugB_LIFE, DrugB_NEED, DrugB_CERT, DrugB_RISK, DrugB_WAIT)
All possible combinations? expand.grid or tidyr::expand_grid. We can apply either function to an already-made data frame using do.call.
Unique? Use R's unique or dplyr::distinct.
Filtering? Use dplyr::filter (or base R subset).
library(dplyr)
# library(tidyr) # expand_grid
do.call(tidyr::expand_grid, df) %>%
  distinct() %>%
  filter(DrugA_LIFE == DrugB_LIFE, DrugA_NEED == DrugB_NEED)
# # A tibble: 32,768 × 10
# DrugA_LIFE DrugA_NEED DrugA_CERT DrugA_RISK DrugA_WAIT DrugB_LIFE DrugB_NEED DrugB_CERT DrugB_RISK DrugB_WAIT
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.5 0 0 0.1 0 0.5 0 0 0.1 0
# 2 0.5 0 0 0.1 0 0.5 0 0 0.1 0.5
# 3 0.5 0 0 0.1 0 0.5 0 0 0.1 1
# 4 0.5 0 0 0.1 0 0.5 0 0 0.1 2
# 5 0.5 0 0 0.1 0 0.5 0 0 0.2 0
# 6 0.5 0 0 0.1 0 0.5 0 0 0.2 0.5
# 7 0.5 0 0 0.1 0 0.5 0 0 0.2 1
# 8 0.5 0 0 0.1 0 0.5 0 0 0.2 2
# 9 0.5 0 0 0.1 0 0.5 0 0 0.4 0
# 10 0.5 0 0 0.1 0 0.5 0 0 0.4 0.5
# # … with 32,758 more rows
# # ℹ Use `print(n = ...)` to see more rows
Data:
df <- structure(list(DrugA_LIFE = c(0.5, 1, 2, 5), DrugA_NEED = c(0, 1, 0, 1), DrugA_CERT = c(0, 0.2, 0.4, 0.6), DrugA_RISK = c(0.1, 0.2, 0.4, 0.6), DrugA_WAIT = c(0, 0.5, 1, 2), DrugB_LIFE = c(0.5, 1, 2, 5), DrugB_NEED = c(0, 1, 0, 1), DrugB_CERT = c(0, 0.2, 0.4, 0.6), DrugB_RISK = c(0.1, 0.2, 0.4, 0.6), DrugB_WAIT = c(0, 0.5, 1, 2)), class = "data.frame", row.names = c(NA, -4L))

Masked array assignment

I have an NxN array A, an NxN array B and an NxN mask (BitMatrix) M. Now I want to copy / assign the values of B to A only at the indices where M is true. What is the best way to do that?
You can use logical indexing:
julia> A = zeros(5,5); B = ones(5,5); M = rand(Bool, 5, 5)
5×5 Matrix{Bool}:
1 0 1 1 0
1 0 1 1 0
1 0 1 1 1
0 0 0 1 0
0 0 0 0 1
julia> A[M] = B[M]; A
5×5 Matrix{Float64}:
1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 1.0 1.0
0.0 0.0 0.0 1.0 0.0
0.0 0.0 0.0 0.0 1.0
or simply write a loop:
julia> for i in eachindex(A, B, M)
           if M[i]
               A[i] = B[i]
           end
       end

Julia Dataframe - parallel join DataFrame operations?

I am wondering if there is a way to parallelize leftjoin operations in Julia.
For example, I have the following DataFrames:
df1:
Id LDC LDR
a ldc1 ldr1
b ldc2 ldr2
c ldc3 ldr4
d ldc2 ldr3
df2:
LDC dc1 dc2 dc3
ldc1 0.5 0.4 0.2
ldc2 0.1 0.6 0.7
ldc3 0.4 0.9 0.3
df3:
LDR lap1 lap2 lap3
ldr1 0.05 0.06 0.07
ldr2 0.10 0.12 0.13
ldr3 0.01 0.01 0.02
ldr4 0.05 0.06 0.07
I currently perform the join operations serially, as below:
df1 = leftjoin(df1, df2, on = "LDC")
df1 = leftjoin(df1, df3, on = "LDR")
which gives me the desired result:
Id LDC LDR dc1 dc2 dc3 lap1 lap2 lap3
a ldc1 ldr1 0.5 0.4 0.2 0.05 0.06 0.07
b ldc2 ldr2 0.1 0.6 0.7 0.10 0.12 0.13
d ldc2 ldr3 0.1 0.6 0.7 0.01 0.01 0.02
c ldc3 ldr4 0.4 0.9 0.3 0.05 0.06 0.07
My question is: is there a way to "populate" the initial DataFrame (df1) with a parallelized join operation and obtain the same result?
Thanks for any help you could provide.
UPDATE: Here is the code to generate df1, df2 & df3.
df1 = DataFrame(Id = ["a","b","c","d"],
                LDC = ["ldc1","ldc2","ldc3","ldc2"],
                LDR = ["ldr1","ldr2","ldr4","ldr3"]
                )
df2 = DataFrame(LDC = ["ldc1","ldc2","ldc3"],
                dc1 = [0.5,0.1,0.4],
                dc2 = [0.4,0.6,0.9],
                dc3 = [0.2,0.7,0.3]
                )
df3 = DataFrame(LDR = ["ldr1","ldr2","ldr3","ldr4"],
                lap1 = [0.05,0.10,0.01,0.05],
                lap2 = [0.06,0.12,0.01,0.06],
                lap3 = [0.07,0.13,0.02,0.07]
                )

replacing dictionary values into a dataframe

I have the following df on one side:
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 0 0 0 0
ADECCO 0 0 0 0 0
BANKIA 0 0 0 0 0
and the following dict on the other:
{'ADMIRAL': 1, 'ADECCO': -1, 'BANKIA': -1}
where the df.index values correspond to the dict keys.
I would like to write the dict values into the df, one value per row at the matching row/column, to obtain this output:
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 1 0 0 0
ADECCO 0 0 -1 0 0
BANKIA 0 0 0 -1 0
Loop over the dict items and set the values with DataFrame.at:
d = {'ADMIRAL': 1, 'ADECCO': -1, 'BANKIA': -1}
for k, v in d.items():
    df.at[k, k] = v
    # alternative
    # df.loc[k, k] = v
print(df)
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 1 0 0 0
ADECCO 0 0 -1 0 0
BANKIA 0 0 0 -1 0
Another solution is to build a Series from the dict with a MultiIndex created by MultiIndex.from_arrays and then unstack it:
s = pd.Series(list(d.values()), index=pd.MultiIndex.from_arrays([list(d.keys()), list(d.keys())]))
df1 = s.unstack()
print (df1)
ADECCO ADMIRAL BANKIA
ADECCO -1.0 NaN NaN
ADMIRAL NaN 1.0 NaN
BANKIA NaN NaN -1.0
Then fill the NaN cells from the original df with combine_first:
df = df1.combine_first(df)
print (df)
ACCOR SA ADECCO ADMIRAL BANKIA BANKINTER
ADECCO 0.0 -1.0 0.0 0.0 0.0
ADMIRAL 0.0 0.0 1.0 0.0 0.0
BANKIA 0.0 0.0 0.0 -1.0 0.0
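One caveat: combine_first upcasts everything to float because of the NaN placeholders in df1, which is why the output above shows 0.0 and -1.0. If integer output matters, a cast at the end restores it; a small follow-up sketch, not part of the original answer:
# every cell is filled after combine_first, so casting back to int is safe here
df = df1.combine_first(df).astype(int)
print(df)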

Pandas dataframe finding largest N elements of each row with row-specific N

I have a DataFrame:
>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 23.0 24.0 25.0
and a Series that specifies the number of top N values I want from each row
>>> n_max = pd.Series([2,3,4])
What is the pandas way of using df and n_max to find the largest N elements of each row (breaking ties with a random pick, just as .nlargest() would do)?
The desired output is
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:
>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
But the goal, again, is to have row-specific N. Looping through each row obviously doesn't count (for performance reasons). And I've tried using .rank() with a mask but tie breaking doesn't work there...
Based on @ScottBoston's comment on the OP, it is possible to use the following rank-based mask to solve this problem:
>>> n_max.index = df.index
>>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
>>> selected = df_rank.le(n_max, axis=0)
>>> df[selected]
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
For performance, I would suggest NumPy -
def mask_variable_largest_per_row(df, n_max):
    a = df.values
    m, n = a.shape
    # NaNs already present in each row
    nan_row_count = np.isnan(a).sum(1)
    # number of smallest non-NaN values to blank out per row
    n_reset = n - n_max.values - nan_row_count
    n_reset.clip(min=0, max=n-1, out=n_reset)
    # column indices of each row sorted ascending (NaNs sort last)
    sidx = a.argsort(1)
    # keep the first n_reset[i] sorted positions of row i
    mask = n_reset[:, None] > np.arange(n)
    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r, c] = np.nan
    return df
Sample run -
In [182]: df
Out[182]:
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 5.0 24.0 25.0
In [183]: n_max = pd.Series([2,3,2])
In [184]: mask_variable_largest_per_row(df, n_max)
Out[184]:
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 NaN NaN NaN 24.0 25.0
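Note that the function writes the NaNs through df.values, so with a single-dtype frame like this one it typically modifies df in place, which is why the timings below run each version on its own copy. If the original frame should stay intact, hand it a copy; a usage sketch under that assumption:
# the helper mutates its input through df.values, so work on a copy
# if the original DataFrame is still needed afterwards
masked = mask_variable_largest_per_row(df.copy(), n_max)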
Further boost: Bringing in numpy.argpartition to replace the numpy.argsort should help, as we don't care about the order of indices to be reset as NaNs. Thus, a numpy.argpartition based one would be -
def mask_variable_largest_per_row_v2(df, n_max):
    a = df.values
    m, n = a.shape
    nan_row_count = np.isnan(a).sum(1)
    n_reset = n - n_max.values - nan_row_count
    n_reset.clip(min=0, max=n-1, out=n_reset)
    N = (n - n_max.values).max()
    N = np.clip(N, a_min=0, a_max=n-1)
    sidx = a.argpartition(N, axis=1)  # sidx = a.argsort(1)
    mask = n_reset[:, None] > np.arange(n)
    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r, c] = np.nan
    return df
Runtime test
Other approaches -
def pandas_rank_based(df, n_max):
    n_max.index = df.index
    df_rank = df.stack(dropna=False).groupby(level=0).rank(
        ascending=False, method='first').unstack()
    selected = df_rank.le(n_max, axis=0)
    return df[selected]
Verification and timings -
In [387]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
...: out1 = pandas_rank_based(df1, n_max)
...: out2 = mask_variable_largest_per_row(df2, n_max)
...: out3 = mask_variable_largest_per_row_v2(df3, n_max)
...: print np.nansum(out1-out2)==0 # Verify
...: print np.nansum(out1-out3)==0 # Verify
...:
True
True
In [388]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
In [389]: %timeit pandas_rank_based(df1, n_max)
1 loops, best of 3: 559 ms per loop
In [390]: %timeit mask_variable_largest_per_row(df2, n_max)
10 loops, best of 3: 34.1 ms per loop
In [391]: %timeit mask_variable_largest_per_row_v2(df3, n_max)
100 loops, best of 3: 5.92 ms per loop
Pretty good speedups there, roughly 16x and 90x over the pandas rank-based approach!