Joining without matching all rows of 'y' using dplyr - sql

The problems with the base function merge are well documented online yet still cause havoc. plyr::join solved many of these issues and works fantastically. The new kid on the block is dplyr. I'd like to know how to perform option 2 in the example below using dplyr. Anyone know if that's possible, and should it be a feature request?
Reproducible example
df1 <- data.frame(nm = c("y", "x", "z"), v2 = 10:12)
df2 <- data.frame(nm = c("x", "x", "y", "z", "x"), v1 = c(1, 1, 2, 3, 1))
Option 1: merge
merge(df1, df2, by = "nm", all.x = T, all.y = F)
This doesn't provide what I want and messes with the order:
## nm v2 v1
## 1 x 11 1
## 2 x 11 1
## 3 x 11 1
## 4 y 10 2
## 5 z 12 3
Option 2: plyr - this is what I want, but it's a little slow
library(plyr)
join(df1, df2, match = "first")
Note: only rows from x are kept:
## nm v2 v1
## 1 y 10 2
## 2 x 11 1
## 3 z 12 3
Option 3: dplyr:
library(dplyr)
inner_join(df1, df2)
This changes the order and keeps rows from y.
## nm v2 v1
## 1 x 11 1
## 2 x 11 1
## 3 y 10 2
## 4 z 12 3
## 5 x 11 1
left_join(df1, df2)
The only difference here is the order:
## nm v2 v1
## 1 y 10 2
## 2 x 11 1
## 3 x 11 1
## 4 x 11 1
## 5 z 12 3
This is a really useful feature, so I'm surprised option 2 isn't possible with dplyr, unless I've missed something.

I don't think what you are looking for is possible using dplyr. However, in this case you can get the desired output using the code below.
library(dplyr)
unique(inner_join(df1, df2))
Output:
nm v2 v1
1 x 11 1
3 y 10 2
4 z 12 3
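For what it's worth, a dplyr-only sketch of the "first match" behaviour, assuming nm uniquely identifies the rows of df1 (as it does here), is to left-join and then keep the first row per key with distinct():
library(dplyr)
df1 %>%
  left_join(df2, by = "nm") %>%    # preserves df1's row order
  distinct(nm, .keep_all = TRUE)   # keeps the first match per nm
  nm v2 v1
1  y 10  2
2  x 11  1
3  z 12  3
Note that this would also drop genuine duplicate rows of df1, so it only approximates plyr's match = "first".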


df.apply(myfunc, axis=1) results in error but df.groupby(df.index).apply(myfunc) does not

I have a dataframe that looks like this:
a b c
0 x x x
1 y y y
2 z z z
I would like to apply a function to each row of dataframe. That function then creates a new dataframe with multiple rows from each input row and returns it. Here is my_func:
def my_func(df):
    dup_num = int(df.c - df.a)
    if isinstance(df, pd.Series):
        df_expanded = pd.concat([pd.DataFrame(df).transpose()] * dup_num,
                                ignore_index=True)
    else:
        df_expanded = pd.concat([pd.DataFrame(df)] * dup_num,
                                ignore_index=True)
    return df_expanded
The final dataframe will look something like this:
a b c
0 x x x
1 x x x
2 y y y
3 y y y
4 y y y
5 z z z
6 z z z
So I did:
df_expanded = df.apply(my_func, axis=1)
I inserted breakpoints inside the function, and for each row the dataframe created by my_func is correct. However, at the end, when the last row returns, I get an error stating:
ValueError: cannot copy sequence with size XX to array axis with dimension YY
It is as if apply were trying to return a Series rather than the group of DataFrames that the function created.
So instead of df.apply I did:
df_expanded = df.groupby(df.index).apply(my_func)
Which just creates groups of single rows and applies the same function. This on the other hand works.
Why?
With axis=1, DataFrame.apply tries to assemble the per-row return values back into a single Series or DataFrame, inferring a common shape for them, so returning variable-length DataFrames breaks that assembly step; GroupBy.apply, by contrast, simply concatenates whatever each group returns. As an alternative, perhaps we can take advantage of how pd.Series.explode and pd.Series.apply(pd.Series) work to simplify this process.
Given:
a b c
0 1 1 4
1 2 2 4
2 3 3 4
Doing:
new_df = (df.apply(lambda x: [x.tolist()] * (x.c - x.a), axis=1)
            .explode(ignore_index=True)
            .apply(pd.Series))
new_df.columns = df.columns
print(new_df)
Output:
a b c
0 1 1 4
1 1 1 4
2 1 1 4
3 2 2 4
4 2 2 4
5 3 3 4

How to *multiply* (for lack of a better term) two dataframes [duplicate]

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.
Given two simple DataFrames:
left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})
left
col1 col2
0 A 1
1 B 2
2 C 3
right
col1 col2
0 X 20
1 Y 30
2 Z 50
The cross product of these frames can be computed, and will look something like:
A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50
What is the most performant method of computing this result?
Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:
pandas <= 1.1.X
def cartesian_product_basic(left, right):
    return (
        left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))
cartesian_product_basic(left, right)
pandas >= 1.2
left.merge(right, how="cross") # implements the technique above
col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".
While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.
A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is #senderle's first implementation.
import numpy as np

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)
Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames
Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!
This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and then horizontally stack the two blocks of reindexed values.
def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:, 0]], right.values[idx[:, 1]]]))
cartesian_product_generalized(left, right)
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left, right))
True
And, along similar lines,
left2 = left.copy()
left2.index = ['s1', 's2', 's1']
right2 = right.copy()
right2.index = ['x', 'y', 'y']
left2
col1 col2
s1 A 1
s2 B 2
s1 C 3
right2
col1 col2
x X 20
y Y 30
y Z 50
np.array_equal(cartesian_product_generalized(left2, right2),
               cartesian_product_basic(left2, right2))
True
This solution can generalise to multiple DataFrames. For example,
def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:, i]] for i, df in enumerate(dfs)]))
cartesian_product_multi(*[left, right, left]).head()
0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 Y 30 A 1
4 A 1 Y 30 B 2
Further Simplification
A simpler solution not involving #senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
np.array_equal(cartesian_product_simplified(left, right),
               cartesian_product_basic(left, right))
True
Performance Comparison
Benchmarking these solutions on some contrived DataFrames with unique indices produces the relative timings plotted by the script below (the benchmark plot itself is omitted here).
Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.
Performance Benchmarking Code
This is the timing script. All functions called here are defined above.
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
    index=['cartesian_product_basic', 'cartesian_product_generalized',
           'cartesian_product_multi', 'cartesian_product_simplified'],
    columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        # print(f, c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Since pandas 1.2.0, merge has the option how='cross':
left.merge(right, how='cross')
Using itertools.product and rebuilding the rows into a dataframe:
import itertools

l = list(itertools.product(left.values.tolist(), right.values.tolist()))
pd.DataFrame(list(map(lambda x: sum(x, []), l)))
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
Here's an approach with triple concat:
m = pd.concat([pd.concat([left] * len(right)).sort_index().reset_index(drop=True),
               pd.concat([right] * len(left)).reset_index(drop=True)], axis=1)
m
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
One option is with expand_grid from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor as jn
others = {'left':left, 'right':right}
jn.expand_grid(others = others)
left right
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
I think the simplest way would be to add a dummy column to each data frame, do an inner merge on it and then drop that dummy column from the resulting cartesian dataframe:
left['dummy'] = 'a'
right['dummy'] = 'a'
cartesian = left.merge(right, how='inner', on='dummy')
del cartesian['dummy']

left outer join in R with conditions

Is there a way to merge (left outer join) data frames by multiple columns, but with OR condition?
Example: There are two data frames df1 and df2 with columns x, y, num. I would like to have a data frame with all rows from df1, but with only those rows from df2 which satisfy the condition: df1$x == df2$x OR df1$y == df2$y.
Here are sample data:
df1 <- data.frame(x = LETTERS[1:5],
                  y = 1:5,
                  num = rnorm(5), stringsAsFactors = F)
df1
x y num
1 A 1 0.4209480
2 B 2 0.4687401
3 C 3 0.3018787
4 D 4 0.0669793
5 E 5 0.9231559
df2 <- data.frame(x = LETTERS[3:7],
                  y = 3:7,
                  num = rnorm(5), stringsAsFactors = F)
df2$x[4] <- NA
df2$y[3] <- NA
df2
x y num
1 C 3 -0.7160824
2 D 4 -0.3283618
3 E NA -1.8775298
4 <NA> 6 -0.9821082
5 G 7 1.8726288
Then, the result is expected to be:
x y num x y num
1 A 1 0.4209480 <NA> NA NA
2 B 2 0.4687401 <NA> NA NA
3 C 3 0.3018787 C 3 -0.7160824
4 D 4 0.0669793 D 4 -0.3283618
5 E 5 0.9231559 E NA -1.8775298
The most obvious solution is to use the sqldf package:
mergedData <- sqldf::sqldf("SELECT * FROM df1
                            LEFT OUTER JOIN df2
                            ON df1.x = df2.x
                            OR df1.y = df2.y")
Unfortunately this simple solution is extremely slow, and it will take ages to merge data frames with more than 100k rows each.
Another option is to split the right data frame and merge by parts, but is there any more elegant or even "out of the box" solution?
Here's one approach using data.table. For each column, we perform a join, but only extract the matching indices (as opposed to materialising the entire join). Then, we can combine these indices from all the columns (this part would need some changes if there can be multiple matches; see the sketch at the end of this answer).
require(data.table)
setDT(df1)
setDT(df2)
foo <- function(dx, dy, cols) {
  ix = lapply(cols, function(col) {
    # for each row in dx, get the matching index of dy,
    # matching on the column specified in "col"
    dy[dx, on=col, which=TRUE]
  })
  # combine the per-column indices; NA (no match) loses to a real match
  ix = do.call(function(...) pmax(..., na.rm=TRUE), ix)
}
ix = foo(df1, df2, c("x", "y")) # obtain matching indices of df2 for each row in df1
df1[, paste0("col", 1:3) := df2[ix]] # update df1 by reference
df1
# x y num col1 col2 col3
# 1: A 1 2.09611034 NA NA NA
# 2: B 2 -1.06795571 NA NA NA
# 3: C 3 1.38254433 C 3 1.0173476
# 4: D 4 -0.09367922 D 4 -0.6379496
# 5: E 5 0.47552072 E NA -0.1962038
You can use setDF(df1) to convert it back to a data.frame, if necessary.
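For the multiple-matches caveat flagged above, one possible tweak (an untested sketch; it assumes data.table's mult argument combines with which = TRUE as documented) is to ask for only the first matching row of dy per row of dx:
ix = lapply(cols, function(col) {
  # mult = "first" returns only the first matching index per row of dx
  dy[dx, on=col, which=TRUE, mult="first"]
})
The pmax() combining step would still need a policy for rows of df1 that match different rows of df2 on different columns.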

r equivalent of sql update? [duplicate]

I've looked around and I can't find a simple answer to this.
How do I do what in SQL would be an UPDATE of a table?
For example:
> df1 = data.frame(id=seq(1:3), v1=c("a", "b", NA))
> df1
id v1
1 1 a
2 2 b
3 3 <NA>
> df2 = data.frame(id=seq(1:3), v2=c("z", "y", "c"))
> df2
id v2
1 1 z
2 2 y
3 3 c
How do I update df1 with values from v2 in v1, but only when id matches and when id > 2?
I've looked at data.table, but I can't figure out the := syntax, and I'm hoping there is something simple in base R. The desired output would be:
> df1
id v1
1 1 a
2 2 b
3 3 c
SQLite
One can use an update in SQLite via sqldf:
library(sqldf)
sqldf(c("update df1
set v1 = (select v2 from df2 where df2.id = df1.id)
where id > 2",
"select * from df1"))
which gives:
id v1
1 1 a
2 2 b
3 3 c
MySQL
This works in MySQL:
library(RMySQL)
library(sqldf)
sqldf(c("update df1
left join df2 on (df1.id = df2.id and df1.id > 2)
set df1.v1 = coalesce(df2.v2, df1.v1)",
"select * from df1")
)
giving:
id v1
1 1 a
2 2 b
3 3 c
base R
This also works. The first two lines just convert v1 and v2 to character; they can be skipped if v1 and v2 are already character:
df1c <- transform(df1, v1 = as.character(v1))
df2c <- transform(df2, v2 = as.character(v2))
transform(df1c, v1 = ifelse(id > 2, df2c[match(id, df2c$id), "v2"], v1))
Update: incorporated the comments above and added the base R solution.
Updated to work when there are ids in df1 that are not in df2, and also if the orders differ. This works so long as there is only one id column:
df1 <- data.frame(id=seq(1:5), v1=c("a", "b", NA, NA, NA), stringsAsFactors=F)
df2 <- data.frame(id=seq(1:3), v2=c("z", "y", "c"), stringsAsFactors=F)
df1[df1$id > 2, -1] <- df2[df1$id[df1$id > 2], -1]
df1
Produces:
id v1
1 1 a
2 2 b
3 3 c
4 4 <NA>
5 5 <NA>
Here is a simple solution that works so long as both data frames have the same id set:
df1[df1$id > 2, ] <- df2[df1$id > 2, ]
Produces:
id v1
1 1 a
2 2 b
3 3 c
Big note though: v1 and v2 need to be character, so run this first, as they are factors by default:
df1$v1 <- as.character(df1$v1)
df2$v2 <- as.character(df2$v2)
If you need to join on multiple columns or if the ids in one table don't all exist in the other you can use merge or data.table to get both variables on one table, and then construct the new column by combining the columns with ifelse.
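A minimal sketch of that merge-plus-ifelse route on the example data (assuming v1 and v2 are already character, as discussed above):
merged <- merge(df1, df2, by = "id", all.x = TRUE)      # left join on id
merged$v1 <- ifelse(merged$id > 2 & !is.na(merged$v2),  # update condition
                    merged$v2, merged$v1)
merged$v2 <- NULL                                       # drop the helper column
merged
  id v1
1  1  a
2  2  b
3  3  c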

Is cut() style binning available in dplyr?

Is there a way to do something like a cut() function for binning numeric values in a dplyr table? I'm working on a large postgres table and can currently either write a case statement in the sql at the outset, or output unaggregated data and apply cut(). Both have pretty obvious downsides... case statements are not particularly elegant, and pulling a large number of records via collect() is not at all efficient.
Just so there's an immediate answer for others arriving here via search engine, the n-breaks form of cut is now implemented as the ntile function in dplyr:
> data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = ntile(x, 2))
x bin
1 5 2
2 1 1
3 3 2
4 2 1
5 2 1
6 3 2
I see this question was never updated with the tidyverse solution so I'll add it for posterity.
The function to use is cut_interval from the ggplot2 package. It works similarly to base::cut, but in my experience it does a better job of marking start and end points than the base function, because cut increases the range by 0.1% at each end.
data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = cut_interval(x, n = 2))
x bin
1 5 (3,5]
2 1 [1,3]
3 3 [1,3]
4 2 [1,3]
5 2 [1,3]
6 3 [1,3]
You can also specify the bin width with cut_width.
data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = cut_width(x, width = 2, center = 1))
x bin
1 5 (4,6]
2 1 [0,2]
3 3 (2,4]
4 2 [0,2]
5 2 [0,2]
6 3 (2,4]
The following works with dplyr, assuming x is the variable we wish to bin:
# Make n bins
df %>% mutate(x_bins = cut(x, breaks = n))
# Or make specific bins
df %>% mutate(x_bins = cut(x, breaks = c(0, 2, 6, 10)))