How to make a smaller table grow and match the contents of a larger table in R?

I have three columns. The first is large and contains various letters. The second is the same length but contains fewer letters, padded out with NAs; each of its letters can be found in the larger column. The third, also the same length, contains numbers paired with the second column, with corresponding NAs.
My question is: how do I rearrange the second and third columns so that the second column matches the first column wherever possible?
I feel the answer has something to do with a left join, but I can't figure it out.
It is a bit awkward to explain in words, but the example below shows it easily.
# Original Situation
Large <- c("B", "D", "C", "A", "E")
Small <- c("D", "A", NA, NA, NA)
Number <- c(5, 12, NA, NA, NA)
data.frame(Large, Small, Number)
#>   Large Small Number
#> 1     B     D      5
#> 2     D     A     12
#> 3     C  <NA>     NA
#> 4     A  <NA>     NA
#> 5     E  <NA>     NA
# I want it to finish like this:
Large <- c("B", "D", "C", "A", "E")
Small <- c(NA, "D", NA, "A", NA)
Number <- c(NA, 5, NA, 12, NA)
data.frame(Large, Small, Number)
#>   Large Small Number
#> 1     B  <NA>     NA
#> 2     D     D      5
#> 3     C  <NA>     NA
#> 4     A     A     12
#> 5     E  <NA>     NA

library(dplyr)
# I find `tibble` generally better than `data.frame`
# If you want to use `data.frame`, remember to specify stringsAsFactors = FALSE
df_large <- tibble(large = Large)
df_small <- tibble(small = Small, number = Number)
left_join(df_large, df_small, by = c("large" = "small"))
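For reference, the join should print along these lines (the small column is absorbed as the join key, so only large and number remain):
# A tibble: 5 x 2
  large number
  <chr>  <dbl>
1 B         NA
2 D          5
3 C         NA
4 A         12
5 E         NA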
If you want to keep both large and small columns (I don't really see a reason to):
left_join(df_large, df_small, by = c("large" = "small")) %>%
  mutate(small = if_else(!is.na(number), large, NA_character_))
# A tibble: 5 x 3
  large number small
  <chr>  <dbl> <chr>
1 B         NA NA
2 D          5 D
3 C         NA NA
4 A         12 A
5 E         NA NA

Here is a base way (first rebuilding the question's data as df):
df <- data.frame(Large, Small, Number)
x <- df[1]                                # just the Large column
y <- setNames(df[c(2, 2, 3)], names(df))  # duplicate Small so the copy can serve as the join key
merge(x, y, all.x = TRUE)
#   Large Small Number
# 1     A     A     12
# 2     B  <NA>     NA
# 3     C  <NA>     NA
# 4     D     D      5
# 5     E  <NA>     NA
Use the same logic with right_join():
library(tidyverse)
df %>%
  mutate(Large = Small) %>%
  right_join(df[1])
#   Large Small Number
# 1     B  <NA>     NA
# 2     D     D      5
# 3     C  <NA>     NA
# 4     A     A     12
# 5     E  <NA>     NA
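A compact base-R alternative uses match() to find where each Large value sits in Small, then index the other columns with the result (a sketch, assuming the same df as above):
idx <- match(df$Large, df$Small)  # position of each Large value in Small; NA where absent
data.frame(Large = df$Large, Small = df$Small[idx], Number = df$Number[idx])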

Related

How to melt a dataframe with tidyverse, and create a new column

I have pet survey data from 6 households.
The households are split into levels (a, b).
I would like to melt the dataframe by animal name (id.var), household (variable.name), and abundance (value.name), whilst adding a new column ("level") for the levels a and b.
My dataframe looks like this:
[image: pet abundance data]
I can split it using reshape2::melt, but I don't know how to cut the a, b from the column names and make a new column out of them. Please help.
raw_data <- as.data.frame(raw_data)
melt(raw_data,
     id.vars = 'Animal', variable.name = 'Site', value.name = 'Abundance')
Having a go on some simulated data, pivot_longer is your best bet:
library(tidyverse)
df <- tibble(
  Animal = c("dog", "cat", "fish", "horse"),
  `1a` = sample(1:10, 4),
  `1b` = sample(1:10, 4),
  `2a` = sample(1:10, 4),
  `2b` = sample(1:10, 4),
  `3a` = sample(1:10, 4),
  `3b` = sample(1:10, 4)
)
df |>
  pivot_longer(
    -Animal,
    names_to = c("Site", "level"),
    values_to = "Abundance",
    names_pattern = "(.)(.)"
  ) |>
  arrange(Site, level)
#> # A tibble: 24 × 4
#>    Animal Site  level Abundance
#>    <chr>  <chr> <chr>     <int>
#>  1 dog    1     a             9
#>  2 cat    1     a             5
#>  3 fish   1     a             8
#>  4 horse  1     a             6
#>  5 dog    1     b             4
#>  6 cat    1     b             2
#>  7 fish   1     b             8
#>  8 horse  1     b            10
#>  9 dog    2     a             8
#> 10 cat    2     a             3
#> # … with 14 more rows
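One assumption baked into names_pattern = "(.)(.)" is that Site and level are exactly one character each. If the real column names can carry multi-digit site numbers (e.g. "12a"), a more explicit pattern is safer; this is a hypothetical variant of the same call:
df |>
  pivot_longer(
    -Animal,
    names_to = c("Site", "level"),
    values_to = "Abundance",
    names_pattern = "(\\d+)([a-z])"  # digits become Site, the trailing letter becomes level
  )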

Iteratively get the max of a data frame column, add one, and repeat for all rows in R

I need to perform a database operation where I'll be adding new data to an existing table and then assigning the new rows a unique id. I'm asking about this in R so I can get the logic straight before I attempt to rewrite it in SQL or PySpark.
Imagine that I've already added the new data to the existing data. Here's a simplified version of what it might look like:
library(tidyverse)
df <- tibble(id = c(1, 2, 3, NA, NA),
             descriptions = c("dodgers", "yankees", "giants", "orioles", "mets"))
# A tibble: 5 x 2
     id descriptions
  <dbl> <chr>
1     1 dodgers
2     2 yankees
3     3 giants
4    NA orioles
5    NA mets
What I want is:
# A tibble: 5 x 2
     id descriptions
  <dbl> <chr>
1     1 dodgers
2     2 yankees
3     3 giants
4     4 orioles
5     5 mets
And I can't use arrange with rowid_to_column, as the existing id's would be deleted.
To get a unique id for the NA rows while not changing the existing ones, I want to get the max of the id column, add one, replace the NA with that value, and then move to the next row. My instinct was to do something like this: df %>% mutate(new_id = max(id, na.rm = TRUE) + 1) but that only gets the max plus one, not a new max for each row. I feel like I could do this with a mapping function, but what I've tried returns a result identical to the input dataframe:
df %>%
  mutate(id = ifelse(is.na(id),
                     map_dbl(id, ~ max(.) + 1, na.rm = FALSE),
                     id))
# A tibble: 5 x 2
     id descriptions
  <dbl> <chr>
1     1 dodgers
2     2 yankees
3     3 giants
4    NA orioles
5    NA mets
Thanks in advance--now if someone can help me directly in SQL, that's also a plus!
SQL option, using sqldf for demo:
sqldf::sqldf("
  with cte as (
    select max(id) as maxid from df
  )
  select cte.maxid + row_number() over () as id, df.descriptions
  from df left join cte
  where df.id is null
  union
  select * from df where id is not null")
#   id descriptions
# 1  1      dodgers
# 2  2      yankees
# 3  3       giants
# 4  4      orioles
# 5  5         mets
Here is one method: add the max value to the cumulative sum of a logical vector based on the NA values, then coalesce() with the original 'id' column.
library(dplyr)
df <- df %>%
  mutate(id = coalesce(id, max(id, na.rm = TRUE) + cumsum(is.na(id))))
Output:
df
# A tibble: 5 x 2
     id descriptions
  <dbl> <chr>
1     1 dodgers
2     2 yankees
3     3 giants
4     4 orioles
5     5 mets
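The same fill is easy in base R too (a sketch, assuming the df from the question; it modifies df in place):
na_rows <- is.na(df$id)
# number the NA ids sequentially, starting from max(id) + 1, in row order
df$id[na_rows] <- max(df$id, na.rm = TRUE) + seq_len(sum(na_rows))
df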

Return rows whose elements are duplicates, not the logical vector

I know about base R's duplicated() function. The problem is that it only returns a logical vector indicating which elements (rows) are duplicates.
I want to get back the rows containing those elements, not the logical vector.
In the data below, I want to get back all the observations of A and B, because they have duplicated values for the key Name and Year.
I already have coded this:
df %>% group_by(Name) %>% filter(any( ????? ))
but I don't know how to write the last part of the code.
Anyone got any ideas?
Thanks :)
An option using dplyr: group on both Name and Year to calculate a count, then group on only Name and filter for groups having any count > 1 (meaning duplicates):
library(dplyr)
df %>%
  group_by(Name, Year) %>%
  mutate(count = n()) %>%
  group_by(Name) %>%
  filter(any(count > 1)) %>%
  select(-count)
# # A tibble: 7 x 3
# # Groups:   Name [2]
#   Name   Year Value
#   <chr> <int> <int>
# 1 A      1990     5
# 2 A      1990     3
# 3 A      1991     5
# 4 A      1995     5
# 5 B      2000     0
# 6 B      2000     4
# 7 B      1998     5
Data:
df <- read.table(text =
"Name Year Value
A 1990 5
A 1990 3
A 1991 5
A 1995 5
B 2000 0
B 2000 4
B 1998 5
C 1890 3
C 1790 2",
header = TRUE, stringsAsFactors = FALSE)
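A base-R equivalent of the same idea (a sketch, using the df above): flag every row whose Name/Year pair occurs more than once, then keep all rows of any Name that has such a pair.
pair_dup <- duplicated(df[c("Name", "Year")]) |
  duplicated(df[c("Name", "Year")], fromLast = TRUE)  # TRUE for every row of a repeated pair
df[df$Name %in% df$Name[pair_dup], ]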

left outer join in R with conditions

Is there a way to merge (left outer join) data frames by multiple columns, but with an OR condition?
Example: there are two data frames df1 and df2 with columns x, y, num. I would like to have a data frame with all rows from df1, but with only those rows from df2 which satisfy the condition: df1$x == df2$x OR df1$y == df2$y.
Here are sample data:
df1 <- data.frame(x = LETTERS[1:5],
                  y = 1:5,
                  num = rnorm(5), stringsAsFactors = FALSE)
df1
  x y       num
1 A 1 0.4209480
2 B 2 0.4687401
3 C 3 0.3018787
4 D 4 0.0669793
5 E 5 0.9231559
df2 <- data.frame(x = LETTERS[3:7],
                  y = 3:7,
                  num = rnorm(5), stringsAsFactors = FALSE)
df2$x[2] <- NA
df2$y[1] <- NA
df2
     x  y        num
1    C NA -0.7160824
2 <NA>  4 -0.3283618
3    E  5 -1.8775298
4    F  6 -0.9821082
5    G  7  1.8726288
Then, the result is expected to be:
  x y       num    x  y        num
1 A 1 0.4209480 <NA> NA         NA
2 B 2 0.4687401 <NA> NA         NA
3 C 3 0.3018787    C NA -0.7160824
4 D 4 0.0669793 <NA>  4 -0.3283618
5 E 5 0.9231559    E  5 -1.8775298
The most obvious solution is to use the sqldf package:
mergedData <- sqldf::sqldf("SELECT * FROM df1
                            LEFT OUTER JOIN df2
                            ON df1.x = df2.x
                            OR df1.y = df2.y")
Unfortunately this simple solution is extremely slow, and it will take ages to merge data frames with more than 100k rows each.
Another option is to split the right data frame and merge by parts, but is there any more elegant or even "out of the box" solution?
Here's one approach using data.table. For each column, we perform a join, but only extract the indices (as opposed to materialising the entire join). Then, we can combine these indices from all the columns (this part would need some changes if there can be multiple matches).
require(data.table)
setDT(df1)
setDT(df2)
foo <- function(dx, dy, cols) {
  ix = lapply(cols, function(col) {
    # for each row in dx, get the matching index of dy,
    # matching on the column specified in "col"
    dy[dx, on = col, which = TRUE]
  })
  ix = do.call(function(...) pmax(..., na.rm = TRUE), ix)
}
ix = foo(df1, df2, c("x", "y"))       # matching indices of df2 for each row in df1
df1[, paste0("col", 1:3) := df2[ix]]  # update df1 by reference
df1
#    x y         num col1 col2       col3
# 1: A 1  2.09611034 <NA>   NA         NA
# 2: B 2 -1.06795571 <NA>   NA         NA
# 3: C 3  1.38254433    C   NA  1.0173476
# 4: D 4 -0.09367922 <NA>    4 -0.6379496
# 5: E 5  0.47552072    E    5 -0.1962038
You can use setDF(df1) to convert it back to a data.frame, if necessary.
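The same index-combining idea can be sketched in base R with match(), preferring the x match and falling back to the y match (assuming the original df1/df2 data frames; note match() only ever returns the first hit, so this also assumes at most one match per row):
ix <- match(df1$x, df2$x)       # index of df2 matching on x; NA if none
iy <- match(df1$y, df2$y)       # index of df2 matching on y
ix[is.na(ix)] <- iy[is.na(ix)]  # fall back to the y match where x found nothing
cbind(df1, df2[ix, ])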

Calculating Growth-Rates by applying log-differences

I am trying to transform my data.frame by calculating the log-differences of each column while controlling for the row's id. So basically I'd like to calculate the growth rates for each id's variables.
Here is a random df with an id column, a time-period column p, and three variable columns:
df <- data.frame(id = c("a","a","a","c","c","d","d","d","d","d"),
                 p = c(1,2,3,1,2,1,2,3,4,5),
                 var1 = rnorm(10, 5),
                 var2 = rnorm(10, 5),
                 var3 = rnorm(10, 5))
df
   id p     var1     var2     var3
1   a 1 5.375797 4.110324 5.773473
2   a 2 4.574700 6.541862 6.116153
3   a 3 3.029428 4.931924 5.631847
4   c 1 5.375855 4.181034 5.756510
5   c 2 5.067131 6.053009 6.746442
6   d 1 3.846438 4.515268 6.920389
7   d 2 4.910792 5.525340 4.625942
8   d 3 6.410238 5.138040 7.404533
9   d 4 4.637469 3.522542 3.661668
10  d 5 5.519138 4.599829 5.566892
Now I have written a function which does exactly what I want, BUT I had to take a detour which is possibly unnecessary and can be removed. However, somehow I am not able to locate the shortcut.
Here is the function and the output for the posted data frame:
library(plyr)
fct.logDiff <- function(df) {
  df.log     <- dlply(df, "id", function(x) data.frame(p = x$p, log(x[, -c(1, 2)])))
  list.nalog <- llply(df.log, function(x) data.frame(p = x$p, rbind(NA, sapply(x[, -1], diff))))
  ldply(list.nalog, data.frame)
}
fct.logDiff(df)
   id p        var1        var2        var3
1   a 1          NA          NA          NA
2   a 2 -0.16136569  0.46472004  0.05765945
3   a 3 -0.41216720 -0.28249264 -0.08249587
4   c 1          NA          NA          NA
5   c 2 -0.05914281  0.36999681  0.15868378
6   d 1          NA          NA          NA
7   d 2  0.24428771  0.20188025 -0.40279188
8   d 3  0.26646102 -0.07267311  0.47041227
9   d 4 -0.32372771 -0.37748866 -0.70417351
10  d 5  0.17405309  0.26683625  0.41891802
The trouble is due to the added NA-rows. I don't want to collapse the frame and reduce it, which would be done automatically by the diff() function. So I had 10 rows in my original frame and am keeping the same amount of rows after the transformation. In order to keep the same length I had to add some NAs. I have taken a detour by transforming the data.frame into a list, adding the NAs to each id's first line, and afterwards transforming the list back into a data.frame. That looks tedious.
Any ideas to avoid the data.frame-list-data.frame class transformation and optimize the function?
How about this?
nadiff <- function(x, ...) c(NA, diff(log(x), ...))  # log first, then difference, padded with a leading NA
ddply(df, "id", colwise(nadiff, c("var1", "var2", "var3")))
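For comparison, a dplyr sketch of the same per-id log-difference (assuming the df posted above and dplyr >= 1.0 for across()):
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(across(var1:var3, ~ c(NA, diff(log(.x))))) %>%
  ungroup()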