Is cut() style binning available in dplyr?

Is there a way to do something like a cut() function for binning numeric values in a dplyr table? I'm working on a large postgres table and can currently either write a case statement in the sql at the outset, or output unaggregated data and apply cut(). Both have pretty obvious downsides: case statements are not particularly elegant, and pulling a large number of records via collect() is not at all efficient.

Just so there's an immediate answer for others arriving here via search engine, the n-breaks form of cut is now implemented as the ntile function in dplyr:
> data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = ntile(x, 2))
x bin
1 5 2
2 1 1
3 3 2
4 2 1
5 2 1
6 3 2

I see this question was never updated with the tidyverse solution so I'll add it for posterity.
The function to use is cut_interval from the ggplot2 package. It works similarly to base::cut, but in my experience it does a better job of marking start and end points than the base function, because cut expands the range by 0.1% at each end.
data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = cut_interval(x, n = 2))
x bin
1 5 (3,5]
2 1 [1,3]
3 3 [1,3]
4 2 [1,3]
5 2 [1,3]
6 3 [1,3]
You can also specify the bin width with cut_width.
data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = cut_width(x, width = 2, center = 1))
x bin
1 5 (4,6]
2 1 [0,2]
3 3 (2,4]
4 2 [0,2]
5 2 [0,2]
6 3 (2,4]

The following works with dplyr, assuming x is the variable we wish to bin:
# Make n bins
df %>% mutate(x_bins = cut(x, breaks = n))
# Or make specific bins
df %>% mutate(x_bins = cut(x, breaks = c(0, 2, 6, 10)))


How to extract the list of values from one column in pandas

I wish to extract the list of values from one column in pandas, and then use those values to create additional columns, one for each value in the list.
My dataframe:
a = pd.DataFrame({"test":["","","","",[1,2,3,4,5,6,6],"","",[11,12,13,14,15,16,17]]})
Current output:
test
0
1
2
3
4 [1, 2, 3, 4, 5, 6, 6]
5
6
7 [11, 12, 13, 14, 15, 16, 17]
Expected output:
example_1 example_2 example_3 example_4 example_5 example_6 example_7
0
1
2
3
4 1 2 3 4 5 6 6
5
6
7 11 12 13 14 15 16 17
Let's say we expect each list to have 7 values; that covers most of my current cases, so if I can set that limit it will be a good solution. Thank you.
This should be what you're looking for. I replaced the NaN values with blank cells, but you can change that of course.
import pandas as pd

a = pd.DataFrame({"test":["","","","",[1,2,3,4,5,6,6],"","",[11,12,13,14,15,16,17]]})
ab = a.test.apply(pd.Series).fillna("")
ab.columns = ['example_' + str(i) for i in range(1, 8)]
Edit: using .add_prefix() as the other answer uses is prettier than setting the column names manually with a list comprehension.
Here's a one-liner:
(a.test.apply(pd.Series)
   .fillna("")
   .set_axis(range(1, a.test.str.len().max() + 1), axis=1)
   .add_prefix("example_")
)
The set_axis is just to make the columns 1-indexed, if you don't mind 0-indexing you can leave it out.
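Since the asker wants a hard cap of 7 columns, here is a small sketch that truncates or pads every list to exactly that width (the `pad` helper and the `N` constant are illustrative names, not from the answers above):

```python
import pandas as pd

N = 7  # the fixed number of columns the asker mentioned

a = pd.DataFrame({"test": ["", "", "", "", [1, 2, 3, 4, 5, 6, 6],
                           "", "", [11, 12, 13, 14, 15, 16, 17]]})

def pad(cell, n=N):
    # Non-list cells (the empty strings) become an all-blank row;
    # lists are truncated to n items, then padded with "" up to n
    vals = list(cell)[:n] if isinstance(cell, list) else []
    return vals + [""] * (n - len(vals))

wide = pd.DataFrame(a["test"].map(pad).tolist(), index=a.index)
wide.columns = [f"example_{i}" for i in range(1, N + 1)]
```

Because the width is fixed up front, this keeps exactly N columns even if some list is shorter or longer than expected.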

Filter rows from subsets of a Pandas DataFrame efficiently

I have a DataFrame consisting of medical data where the columns are ["Patient_ID", "Code", "Date"], where "Code" just represents some medical interaction patient "Patient_ID" had on "Date". Any patient will generally have more than one row, since they have more than one interaction. I want to apply two types of filtering to this data:
Remove any patients who have fewer than min_len interactions.
To each patient, apply a half-overlapping sliding window of length T days; within each window keep only the first of any duplicate codes, then shuffle the codes within the window.
So I need to modify subsets of the overall dataframe, where the modification changes the size of the subset. I have both of these implemented as part of a larger pipeline; however, they are a significant bottleneck in terms of time. I'm wondering if there's a more efficient way to achieve the same thing, as I really just threw together what worked and I'm not too familiar with the efficiency of pandas operations. Here is how I have them currently:
def Filter_by_length(df, min_len = 1):
    print("Filtering short sequences...")
    df = df.sort_values(axis = 0, by = ['ID', 'DATE']).copy(deep = True)
    new_df = []
    for sub_df in tqdm((df[df.ID == sub] for sub in df.ID.unique()), total = len(df.ID.unique()), miniters = 1):
        if len(sub_df) >= min_len:
            new_df.append(sub_df.copy(deep = True))
    if len(new_df) != 0:
        df = pd.concat(new_df, sort = False)
    else:
        df = pd.DataFrame({})
    print("Done")
    return df

def shuffle_col(df, col):
    df[col] = np.random.permutation(df[col])
    return df

def Filter_by_redundancy(df, T, min_len = 1):
    print("Filtering redundant concepts and short sequences...")
    df = df.sort_values(axis = 0, by = ['ID', 'DATE']).copy(deep = True)
    new_df = []
    for sub_df in tqdm((df[df.ID == sub] for sub in df.ID.unique()), total = len(df.ID.unique()), miniters = 1):
        start_date = sub_df.DATE.min()
        end_date = sub_df.DATE.max()
        next_date = start_date + dt.timedelta(days = T)
        while start_date <= end_date:
            sub_df = pd.concat([sub_df[sub_df.DATE < start_date],
                                shuffle_col(sub_df[(sub_df.DATE <= next_date) & (sub_df.DATE >= start_date)]
                                            .drop_duplicates(subset = ['CODE']), "CODE"),
                                sub_df[sub_df.DATE > next_date]], sort = False)
            start_date += dt.timedelta(days = int(T/2))
            next_date += dt.timedelta(days = int(T/2))
        if len(sub_df) >= min_len:
            new_df.append(sub_df.copy(deep = True))
    if len(new_df) != 0:
        df = pd.concat(new_df, sort = False)
    else:
        df = pd.DataFrame({})
    print("Done")
    return df
As you can see, in the second case I am actually applying both filters, because it is important to have the option to apply both together or either one on its own, but I am interested in any performance improvement that can be made to either one or both.
For the first part, instead of counting in your group-by like that, I would use this approach:
>>> d = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'q': [np.random.randint(1, 15, size=np.random.randint(1, 5)) for _ in range(5)]}).explode('q')
id q
0 1 1
0 1 9
1 2 9
1 2 10
1 2 4
2 3 3
2 3 6
2 3 2
2 3 10
3 4 11
3 4 5
4 5 5
4 5 6
4 5 3
4 5 2
>>> sizes = d.groupby('id').size()
>>> d[d['id'].isin(sizes[sizes >= 3].index)] # index is list of IDs meeting criteria
id q
1 2 9
1 2 10
1 2 4
2 3 3
2 3 6
2 3 2
2 3 10
4 5 5
4 5 6
4 5 3
4 5 2
I'm not sure why you want to shuffle your codes within some window. To avoid an X-Y problem, what are you in fact trying to do there?
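Applied to the column names from the question's code (ID, DATE, CODE), the same size-based idea replaces the whole per-patient loop in Filter_by_length with one vectorized mask; a sketch on toy data:

```python
import pandas as pd

# Toy data shaped like the asker's frame (ID/DATE/CODE column names
# taken from the question's code)
df = pd.DataFrame({
    "ID":   [1, 1, 1, 2, 3, 3],
    "DATE": pd.to_datetime(["2020-01-01"] * 6),
    "CODE": ["a", "b", "a", "c", "d", "e"],
})

min_len = 2
# transform('size') broadcasts each group's row count back onto its rows,
# so short sequences drop out with a single boolean mask, no Python loop
filtered = df[df.groupby("ID")["ID"].transform("size") >= min_len]
```

Here patient 2 has only one row and is dropped, while patients 1 and 3 survive.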

Pandas Groupby -- efficient selection/filtering of groups based on multiple conditions?

I am trying to filter dataframe groups in Pandas, based on multiple (any) conditions, but I cannot seem to get to a fast, Pandas-"native" one-liner.
Here I generate an example dataframe of 2*n*n rows and 4 columns:
import itertools
import random
import pandas as pd

n = 100
lst = range(0, n)
df = pd.DataFrame(
    {'A': list(itertools.chain.from_iterable(itertools.repeat(x, n*2) for x in lst)),
     'B': list(itertools.chain.from_iterable(itertools.repeat(x, 1*2) for x in lst)) * n,
     'C': random.choices(list(range(100)), k=2*n*n),
     'D': random.choices(list(range(100)), k=2*n*n)})
resulting in dataframes such as:
A B C D
0 0 0 26 49
1 0 0 29 80
2 0 1 70 92
3 0 1 7 2
4 1 0 90 11
5 1 0 19 4
6 1 1 29 4
7 1 1 31 95
I want to
select groups grouped by A and B, and
filter the groups down to those where, in both columns C and D, at least one value in the group is greater than 50.
A "native" Pandas one-liner would be the following:
df.groupby([df.A, df.B]).filter(lambda x: ((x.C > 50).any() & (x.D > 50).any()))
which produces
A B C D
2 0 1 70 92
3 0 1 7 2
This is all fine for small dataframes (say n < 20).
But this solution takes quite long (for example, 4.58 s when n = 100) for large dataframes.
I have an alternative, step-by-step solution which achieves the same result, but runs much faster (28.1 ms when n = 100):
df_g = df.assign(key_C = df.C > 50, key_D = df.D > 50).groupby([df.A, df.B])
df_C_bool = df_g.key_C.transform('any')
df_D_bool = df_g.key_D.transform('any')
df[df_C_bool & df_D_bool]
but arguably a bit more ugly. My questions are:
Is there a better "native" Pandas solution for this task? , and
Is there a reason for the sub-optimal performance of my version of the "native" solution?
Bonus question:
In fact I only want to extract the groups and not together with their data. I.e., I only need
A B
0 1
in the above example. Is there a way to do this with Pandas without going through the intermediate step I did above?
This is similar to your second approach, but chained together:
mask = (df[['C','D']].gt(50)           # use e.g. [50, 60] here if C and D have different thresholds
        .all(axis=1)                   # check both conditions hold on the same row
        .groupby([df['A'], df['B']])   # normal groupby
        .transform('max')              # 'any' instead of 'max' also works
        )
df.loc[mask]
If you don't want the data, you can forgo the transform:
mask = df[['C','D']].min(axis=1).gt(50).groupby([df['A'],df['B']]).any()
mask[mask].index
# out
# MultiIndex([(0, 1)],
# names=['A', 'B'])
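On the question's eight sample rows, the slow .filter() version and the vectorized two-transform version can be checked against each other; a self-contained sketch (note this uses the question's any-row-per-column condition, which can differ from a same-row .all(axis=1) test when C and D exceed 50 on different rows of a group):

```python
import pandas as pd

# The eight sample rows from the question
df = pd.DataFrame({
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 1, 1, 0, 0, 1, 1],
    "C": [26, 29, 70, 7, 90, 19, 29, 31],
    "D": [49, 80, 92, 2, 11, 4, 4, 95],
})

# Slow "native" version: a Python lambda runs once per group
slow = df.groupby(["A", "B"]).filter(
    lambda g: (g.C > 50).any() and (g.D > 50).any()
)

# Fast version: two vectorized transforms and one boolean mask
g = df.assign(key_C=df.C > 50, key_D=df.D > 50).groupby(["A", "B"])
fast = df[g.key_C.transform("any") & g.key_D.transform("any")]
```

Both keep only the (A=0, B=1) group, matching the output shown in the question.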

subset df by masking between specific rows

I'm trying to subset a pandas df by removing rows that fall between specific values. The problem is these values can be at different rows so I can't select fixed rows.
Specifically, I want to remove rows that fall between ABC xxx and the integer 5. These values could fall anywhere in the df and be of unequal length.
Note: The string ABC will be followed by different values.
I thought about returning all the indexes that contain these two values.
But mask could work better if I could return all rows between these two values?
df = pd.DataFrame({
'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,'ABC',1,4,5,'Y',1,2],
})
mask = (df['Val'].str.contains(r'ABC(?!$)')) & (df['Val'] == 5)
Intended Output:
Val
0 None
8 X
9 1
10 2
15 Y
16 1
17 2
If ABC always comes before 5 and they always form (ABC, 5) pairs: get the indices of both values with np.where, zip them, generate the index values in between, and finally filter with isin, inverting the mask with ~:
import numpy as np
import pandas as pd

# 2 occurrences of (ABC, 5) in the data
df = pd.DataFrame({
    'Val' : ['None','ABC','None',1,2,3,4,5,'None','None','None',
             'None','ABC','None',1,2,3,4,5,'None','None','None']
})
m1 = np.where(df['Val'].str.contains(r'ABC', na=False))[0]
m2 = np.where(df['Val'] == 5)[0]
print (m1)
[ 1 12]
print (m2)
[ 7 18]
idx = [x for y, z in zip(m1, m2) for x in range(y, z + 1)]
print (df[~df.index.isin(idx)])
Val
0 None
8 None
9 None
10 None
11 None
19 None
20 None
21 None
import numpy as np

a = df.index[df['Val'].str.contains('ABC') == True][0]
b = df.index[df['Val']==5][0]+1
c = np.array(range (a,b))
bad_df = df.index.isin(c)
df[~bad_df]
Output
Val
0 None
8 X
9 1
10 2
11 ABC
12 1
13 4
14 5
15 Y
16 1
17 2
If there is more than one 'ABC' and 5, then use the version below.
With this you get the df excluding everything from the first ABC to the last 5:
a = (df['Val'].str.contains('ABC')==True).idxmax()
b = df['Val'].where(df['Val']==5).last_valid_index()+1
c = np.array(range (a,b))
bad_df = df.index.isin(c)
df[~bad_df]
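A loop-free alternative that handles any number of ABC…5 blocks is to keep a running count of start and end markers: a row is inside a block while more starts than completed ends have been seen. A sketch on the question's data (the `inside` name is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Val': ['None', 'ABC', 'None', 1, 2, 3, 4, 5, 'X', 1, 2,
            'ABC', 1, 4, 5, 'Y', 1, 2],
})

# Start marker: any string beginning with "ABC"; end marker: the integer 5
is_start = df['Val'].astype(str).str.startswith('ABC')
is_end = df['Val'] == 5

# A row is inside a block when the number of starts seen so far exceeds
# the number of ends seen strictly before it (so the end row is dropped too)
inside = (is_start.cumsum() - is_end.cumsum().shift(fill_value=0)) > 0
out = df[~inside]
```

This reproduces the intended output from the question and needs no explicit pairing of the marker indices.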

Joining without matching all rows of 'y' using dplyr

The problems with the base function merge are well documented online, yet they still cause havoc. plyr::join solved many of these issues and works fantastically. The new kid on the block is dplyr. I'd like to know how to perform option 2 in the example below using dplyr. Does anyone know if that's possible, and should it be a feature request?
Reproducible example
df1 <- data.frame(nm = c("y", "x", "z"), v2 = 10:12)
df2 <- data.frame(nm = c("x", "x", "y", "z", "x"), v1 = c(1, 1, 2, 3, 1))
Option 1: merge
merge(df1, df2, by = "nm", all.x = T, all.y = F)
This doesn't provide what I want and messes with the order:
## nm v2 v1
## 1 x 11 1
## 2 x 11 1
## 3 x 11 1
## 4 y 10 2
## 5 z 12 3
Option 2: plyr - this is what I want, but it's a little slow
library(plyr)
join(df1, df2, match = "first")
Note: only rows from x are kept:
## nm v2 v1
## 1 y 10 2
## 2 x 11 1
## 3 z 12 3
Option 3: dplyr:
library(dplyr)
inner_join(df1, df2)
This changes the order and keeps rows from y.
## nm v2 v1
## 1 x 11 1
## 2 x 11 1
## 3 y 10 2
## 4 z 12 3
## 5 x 11 1
left_join(df1, df2)
The only difference here is the order:
## nm v2 v1
## 1 y 10 2
## 2 x 11 1
## 3 x 11 1
## 4 x 11 1
## 5 z 12 3
This is a really useful feature, so I'm surprised option 2 isn't even possible with dplyr, unless I've missed something.
I don't think what you are looking for is possible using dplyr. However, in this case you can get the desired output using the code below.
library(dplyr)
unique(inner_join(df1, df2))
Output:
nm v2 v1
1 x 11 1
3 y 10 2
4 z 12 3