This question already has answers here:
Conditional merge/replacement in R
(8 answers)
Closed 5 years ago.
I've looked around and I can't find a simple answer to this.
How do I do what in SQL would be an update table?
For example:
> df1 = data.frame(id=seq(1:3), v1=c("a", "b", NA))
> df1
id v1
1 1 a
2 2 b
3 3 <NA>
> df2 = data.frame(id=seq(1:3), v2=c("z", "y", "c"))
> df2
id v2
1 1 z
2 2 y
3 3 c
How do I update df1 with values from v2 in v1, but only when id matches and when id > 2?
I've looked at data.table, but can't figure out the := syntax, and hoping there is something simple in base R? Desired output would be:
> df1
id v1
1 1 a
2 2 b
3 3 c
SQLite One can use an update in sqlite via sqldf:
library(sqldf)
sqldf(c("update df1
set v1 = (select v2 from df2 where df2.id = df1.id)
where id > 2",
"select * from df1"))
which gives:
id v1
1 1 a
2 2 b
3 3 c
MySQL This works in MySQL:
library(RMySQL)
library(sqldf)
sqldf(c("update df1
left join df2 on (df1.id = df2.id and df1.id > 2)
set df1.v1 = coalesce(df2.v2, df1.v1)",
"select * from df1")
)
giving:
id v1
1 1 a
2 2 b
3 3 c
base R This also works. The first two lines are just to convert v1 and v2 to character and they can be avoided if v1 and v2 were already character:
df1c <- transform(df1, v1 = as.character(v1))
df2c <- transform(df2, v2 = as.character(v2))
transform(df1c, v1 = ifelse(id > 2, df2c[match(id, df2c$id), "v2"], v1))
Update Have incorporated comments and added base R solution.
Updated to work when there are ids present in df1 not in df2, and also if orders are different. This works so long as there is only one id column:
df1 <- data.frame(id=seq(1:5), v1=c("a", "b", NA, NA, NA), stringsAsFactors=F)
df2 <- data.frame(id=seq(1:3), v2=c("z", "y", "c"), stringsAsFactors=F)
df1[df1$id > 2, -1] <- df2[df1$id[df1$id > 2], -1]
df1
Produces:
id v1
1 1 a
2 2 b
3 3 c
4 4 <NA>
5 5 <NA>
Here is a simple solution that works so long as both data frames have the same id set:
df1[df1$id > 2, ] <- df2[df1$id > 2, ]
Produces:
id v1
1 1 a
2 2 b
3 3 c
Big note though, v1 and v2 need to be character, so run this before as they are factor by default:
df1$v1 <- as.character(df1$v1)
df2$v2 <- as.character(df2$v2)
If you need to join on multiple columns or if the ids in one table don't all exist in the other you can use merge or data.table to get both variables on one table, and then construct the new column by combining the columns with ifelse.
Related
I am looking for joining two data frame on one column and if there is a multi match then append the results to another column.
NB. using a different example as yours is not reproducible.
You can convert to str.lower, then explode and map the values to groupby.agg again as string:
mapper = df2.set_index('name')['ID'].astype(str)
df1['ID'] = (df1['name']
.str.upper().str.split(',')
.explode()
.map(mapper)
.groupby(level=0).agg(','.join)
)
Or, with a list comprehension:
mapper = df2.set_index('name')['ID'].astype(str)
df1['ID'] = [','.join([mapper[x] for x in s.split(',') if x in mapper])
for s in df1['name']]
output:
name ID
0 A 1
1 b 2
2 A,B 1,2
3 C,a 3,1
4 D 4
Used input:
# df1
name
0 A
1 b
2 A,B
3 C,a
4 D
# df2
name ID
0 A 1
1 B 2
2 C 3
3 D 4
I have 3 data frame:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
from the multiplication, using pandas and numpy, I want to the output in df1:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
the conditions are:
The value of the new column will be =
#its not a code
df1["w_ave"][1] = df3["w_ave"]["v"]+ df1["a"][1]*df2["a"]["q"]+df1["b"][1]*df2["b"]["q"]+df1["c"][1]*df2["c"]["q"]
for output["w_ave"][1]= 2 +(1*1)+(5*2)+(1*3)
df3["w_ave"]["v"]=2
df1["a"][1]=1, df2["a"]["q"]=1 ;
df1["b"][1]=5, df2["b"]["q"]=2 ;
df1["c"][1]=1, df2["c"]["q"]=3 ;
Which means:
- a new column will be added in df1, from the name of the column from df3.
- for each row of the df1, the value of a, b, c will be multiplied with the same-named q value from df2. and summed together with the corresponding value of df3.
-the column name of df1 , matched will column name of df2 will be multiplied. The other not matched column will not be multiplied, like df1[k].
- However, if there is any 0 in df1["a"], the corresponding output will be zero.
I am struggling with this. It was tough to explain also. My attempts are very silly. I know this attempt will not work. However, I have added this:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 =pd.read_csv(folder + data1)
df2 =pd.read_csv(folder + data2)
df3 =pd.read_csv(folder + data3)
df1= df2 * df1
Ok, so this will in no way resemble your desired output, but vectorizing the formula you provided:
df2=df2.set_index("name")
df3=df3.set_index("type")
df1["w_ave"] = df3.loc["v", "w_ave"]+ df1["a"].mul(df2.loc["q", "a"])+df1["b"].mul(df2.loc["q", "b"])+df1["c"].mul(df2.loc["q", "c"])
Outputs:
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13
Is there a way to merge (left outer join) data frames by multiple columns, but with OR condition?
Example: There are two data frames df1 and df2 with columns x, y, num. I would like to have a data frame with all rows from df1, but with only those rows from df2 which satisfy the conditions: df1$x == df2$x OR df2$y == df2y.
Here are sample data:
df1 <- data.frame(x = LETTERS[1:5],
y = 1:5,
num = rnorm(5), stringsAsFactors = F)
df1
x y num
1 A 1 0.4209480
2 B 2 0.4687401
3 C 3 0.3018787
4 D 4 0.0669793
5 E 5 0.9231559
df2 <- data.frame(x = LETTERS[3:7],
y = 3:7,
num = rnorm(5), stringsAsFactors = F)
df2$x[4] <- NA
df2$y[3] <- NA
df2
x y num
1 C NA -0.7160824
2 <NA> 4 -0.3283618
3 E 5 -1.8775298
4 F 6 -0.9821082
5 G 7 1.8726288
Then, the result is expected to be:
x y num x y num
1 A 1 0.4209480 <NA> NA NA
2 B 2 0.4687401 <NA> NA NA
3 C 3 0.3018787 C NA -0.7160824
4 D 4 0.0669793 <NA> 4 -0.3283618
5 E 5 0.9231559 E 5 -1.8775298
The most obvious solution is to use the sqldf package:
mergedData <- sqldf::sqldf("SELECT * FROM df1
LEFT OUTER JOIN df2
ON df1.x = df2.x
OR df1.y = df2.y")
Unfortunately this simple solution is extremely slow, and it will take ages to merge data frames with more than 100k rows each.
Another option is to split the right data frame and merge by parts, but it is there any more elegant or even "out of the box" solution?
Here's one approach using data.table. For each column, we perform a join, but only extract the indices (as opposed to materialising the entire join).. Then, we can combine these indices from all the columns (this part would need some changes if there can be multiple matches).
require(data.table)
setDT(df1)
setDT(df2)
foo <- function(dx, dy, cols) {
ix = lapply(cols, function(col) {
dy[dx, on=col, which=TRUE] # for each row in dx, get matching indices of dy
# by matching on column specified in "col"
})
ix = do.call(function(...) pmax(..., na.rm=TRUE), ix)
}
ix = foo(df1, df2, c("x", "y")) # obtain matching indices of df2 for each row in df1
df1[, paste0("col", 1:3) := df2[ix]] # update df1 by reference
df1
# x y num col1 col2 col3
# 1: A 1 2.09611034 NA NA NA
# 2: B 2 -1.06795571 NA NA NA
# 3: C 3 1.38254433 C 3 1.0173476
# 4: D 4 -0.09367922 D 4 -0.6379496
# 5: E 5 0.47552072 E NA -0.1962038
You can use setDF(df1) to convert it back to a data.frame, if necessary.
I have a DataFrame like this
DataFrame({"key":["a","b","c","d","e"], "value": [5,4,3,2,1]})
I am mainly interested in row "a", "b" and "c". I want to merge everything else into an "others" row like this
key value
0 a 5
1 b 4
2 c 3
3 others 3
I wonder how can this be done.
First create a dataframe without d and e:
df2 = df[df.key.isin(["a","b","c"])]
Then find the value that you want the other column to have (using the sum function in this example):
val = df[~df["key"].isin(["a","b","c"])].sum()["value"]
Finally, append this column to the second df:
df2.append({"key":"others", "value":val},ignore_index=True)
df2 is now:
key value
0 a 5
1 b 4
2 c 3
3 others 3
I have found a way to do it. Not sure if it is the best way.
In [3]: key_map = {"a":"a", "b":"b", "c":"c"}
In [4]: data['key1'] = data['key'].map(lambda k: key_map.get(k, "others"))
In [5]: data.groupby("key1").sum()
Out[5]:
value
key1
a 5
b 4
c 3
others 3
I am trying to transform my data.frame by calculating the log-differences of each column
and controlling for the rows id. So basically I like to calculate the growth rates for each id's variable.
So here is a random df with an id column, a time period colum p and three variable columns:
df <- data.frame (id = c("a","a","a","c","c","d","d","d","d","d"),
p = c(1,2,3,1,2,1,2,3,4,5),
var1 = rnorm(10, 5),
var2 = rnorm(10, 5),
var3 = rnorm(10, 5)
)
df
id p var1 var2 var3
1 a 1 5.375797 4.110324 5.773473
2 a 2 4.574700 6.541862 6.116153
3 a 3 3.029428 4.931924 5.631847
4 c 1 5.375855 4.181034 5.756510
5 c 2 5.067131 6.053009 6.746442
6 d 1 3.846438 4.515268 6.920389
7 d 2 4.910792 5.525340 4.625942
8 d 3 6.410238 5.138040 7.404533
9 d 4 4.637469 3.522542 3.661668
10 d 5 5.519138 4.599829 5.566892
Now I have written a function which does exactly what I want BUT I had to take a detour which is possibly unnecessary and can be removed. However, somehow I am not able to locate
the shortcut.
Here is the function and the output for the posted data frame:
fct.logDiff <- function (df) {
df.log <- dlply (df, "code", function(x) data.frame (p = x$p, log(x[, -c(1,2)])))
list.nalog <- llply (df.log, function(x) data.frame (p = x$p, rbind(NA, sapply(x[,-1], diff))))
ldply (list.nalog, data.frame)
}
fct.logDiff(df)
id p var1 var2 var3
1 a 1 NA NA NA
2 a 2 -0.16136569 0.46472004 0.05765945
3 a 3 -0.41216720 -0.28249264 -0.08249587
4 c 1 NA NA NA
5 c 2 -0.05914281 0.36999681 0.15868378
6 d 1 NA NA NA
7 d 2 0.24428771 0.20188025 -0.40279188
8 d 3 0.26646102 -0.07267311 0.47041227
9 d 4 -0.32372771 -0.37748866 -0.70417351
10 d 5 0.17405309 0.26683625 0.41891802
The trouble is due to the added NA-rows. I don't want to collapse the frame and reduce it, which would be automatically done by the diff() function. So I had 10 rows in my original frame and am keeping the same amount of rows after the transformation. In order to keep the same length I had to add some NAs. I have taken a detour by transforming the data.frame into a list, add the NAs to each id's first line, and afterwards transform the list back into a data.frame. That looks tedious.
Any ideas to avoid the data.frame-list-data.frame class transformation and optimize the function?
How about this?
nadiff <- function(x, ...) c(NA, diff(x, ...))
ddply(df, "code", colwise(nadiff, c("var1", "var2", "var3")))