Finding Identical Rows in Multiple Datasets - r

I am trying to find out if 3 datasets (df1, df2, df3) have any common rows (i.e. entire row is a duplicate).
I figured out how to do this for pairs of 2 datasets:
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"))
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"))
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"))
library(dplyr)
inner_join(df1, df2)
inner_join(df1, df3)
inner_join(df2, df3)
Is it possible to do this for 3 datasets all at once?
The straightforward way does not seem to work:
inner_join(df1, df2, df3)
Error in `[.data.frame`(by, c("x", "y")) : undefined columns selected
I thought I had found a way to do this:
library(plyr)
join_all(list(df1, df2, df3), type='inner')
But this is telling me that there are no common rows (i.e. same id, same name) between these 3 dataframes:
Joining by: id, names
Joining by: id, names
[1] id names
<0 rows> (or 0-length row.names)
This is not correct, seeing as in the example I created:
Row 3 of df1 and df2 is identical (id = 3, name = peter)
Row 2 of df2 and df3 is identical (id = 2, name = john)
I am trying to find a way to determine if these 3 datasets share any common rows. Can this be done in R?
Thank you!

Here is how you could achieve your task:
library(dplyr)
bind_rows(df1, df2, df3) %>%
  group_by(id, names) %>%
  filter(n() > 1) %>%
  unique()
id names
<dbl> <chr>
1 3 peter
2 2 john

You can do this using get_dupes from the janitor package.
library(tidyverse)
library(janitor)
# Added a new column 'df_id' to identify the data frame
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"), df_id = 1)
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"), df_id = 2)
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"), df_id = 3)
# Bind dataframes
# Get duplicates
df1 %>%
  bind_rows(df2) %>%
  bind_rows(df3) %>%
  get_dupes(c(id, names))
#> id names dupe_count df_id
#> 1 2 john 2 2
#> 2 2 john 2 3
#> 3 3 peter 2 1
#> 4 3 peter 2 2

Does this count?
dfall <- bind_rows(df1, df2, df3)
dfall[duplicated(dfall), ]
id names
6 3 peter
8 2 john
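One caveat (an assumption about your data): duplicated() only marks the second and later occurrences of a row, and it would also flag a row that happens to be repeated within a single data frame. If you want to see every occurrence of a repeated row, a small base R sketch using fromLast could look like this:
# keep every occurrence of any row that appears more than once
dfall <- bind_rows(df1, df2, df3)
dfall[duplicated(dfall) | duplicated(dfall, fromLast = TRUE), ]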

A possible solution (in case you want a dataframe as result, just pipe bind_rows at the end):
library(dplyr)
combn(paste0("df", 1:3), 2, simplify = F, \(x) inner_join(get(x[1]), get(x[2])))
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> [[1]]
#> id names
#> 1 3 peter
#>
#> [[2]]
#> [1] id names
#> <0 rows> (or 0-length row.names)
#>
#> [[3]]
#> id names
#> 1 2 john
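As mentioned above, to collapse this list into a single dataframe you can pipe bind_rows at the end, roughly like this:
combn(paste0("df", 1:3), 2, simplify = F, \(x) inner_join(get(x[1]), get(x[2]))) %>%
  bind_rows()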

Related

How to melt a dataframe with tidyverse, and create a new column

I have pet survey data from 6 households.
The households are split into levels (a,b).
I would like to melt the dataframe by animal name (id.var), household (var.name), abundance (value.name), whilst adding a new column ("level") for the levels a & b.
My dataframe looks like this:
[image: pet abundance data]
I can split it using reshape2::melt, but I don't know how to cut the a, b from the column names and make a new column of them. Please help.
raw_data = as.data.frame(raw_data)
melt(raw_data,
     id.vars = 'Animal', variable.name = 'Site', value.name = 'Abundance')
Having a go on some simulated data, pivot_longer is your best bet:
library(tidyverse)
df <- tibble(
  Animal = c("dog", "cat", "fish", "horse"),
  `1a` = sample(1:10, 4),
  `1b` = sample(1:10, 4),
  `2a` = sample(1:10, 4),
  `2b` = sample(1:10, 4),
  `3a` = sample(1:10, 4),
  `3b` = sample(1:10, 4)
)
df |>
  pivot_longer(
    -Animal,
    names_to = c("Site", "level"),
    values_to = "Abundance",
    names_pattern = "(.)(.)"
  ) |>
  arrange(Site, level)
#> # A tibble: 24 × 4
#> Animal Site level Abundance
#> <chr> <chr> <chr> <int>
#> 1 dog 1 a 9
#> 2 cat 1 a 5
#> 3 fish 1 a 8
#> 4 horse 1 a 6
#> 5 dog 1 b 4
#> 6 cat 1 b 2
#> 7 fish 1 b 8
#> 8 horse 1 b 10
#> 9 dog 2 a 8
#> 10 cat 2 a 3
#> # … with 14 more rows
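One design note: names_pattern = "(.)(.)" assumes each column name is a single site character followed by a single level character. If your real column names have multi-digit sites (an assumption, e.g. "12a"), a slightly more general pattern should still split them, for example:
df |>
  pivot_longer(
    -Animal,
    names_to = c("Site", "level"),
    values_to = "Abundance",
    names_pattern = "(\\d+)(\\D)"
  )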

Iteratively get the max of a data frame column, add one and repeat for all rows in r

I need to perform a database operation where I'll be adding new data to an existing table and then assigning the new rows a unique id. I'm asking about this in R so I can get the logic straight before I attempt to rewrite it in sql or pyspark.
Imagine that I've already added the new data to the existing data. Here's a simplified version of what it might look like:
library(tidyverse)
df <- tibble(id = c(1, 2, 3, NA, NA),
             descriptions = c("dodgers", "yankees", "giants", "orioles", "mets"))
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 NA orioles
5 NA mets
What I want is:
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 4 orioles
5 5 mets
And I can't just use arrange with rowid_to_column, as the existing id's would be deleted.
To get a unique id for the NA rows while not changing the existing ones, I want to get the max of the id column, add one, replace NA with that value and then move to the next row. My instinct was to do something like this: df %>% mutate(new_id = max(id, na.rm = TRUE) + 1) but that only gets the max plus one, not a new max for each row. I feel like I could do this with a mapping function but what I've tried returns a result identical to the input dataframe:
df %>%
  mutate(id = ifelse(is.na(id),
                     map_dbl(id, ~max(.) + 1, na.rm = FALSE),
                     id))
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 NA orioles
5 NA mets
Thanks in advance--now if someone can help me directly in sql, that's also a plus!
SQL option, using sqldf for demo:
sqldf::sqldf("
  with cte as (
    select max(id) as maxid from df
  )
  select cte.maxid + row_number() over () as id, df.descriptions
  from df
  left join cte where df.id is null
  union
  select * from df where id is not null")
# id descriptions
# 1 1 dodgers
# 2 2 yankees
# 3 3 giants
# 4 4 orioles
# 5 5 mets
Here is one method: we add the max value of 'id' to the cumulative sum of a logical vector flagging the NA values, and then coalesce the result with the original column 'id'.
library(dplyr)
df <- df %>%
  mutate(id = coalesce(id, max(id, na.rm = TRUE) + cumsum(is.na(id))))
-output
df
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 4 orioles
5 5 mets
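If it helps to see the same logic outside dplyr, a base R sketch of the idea (max of the existing ids plus a running counter over the NA rows) could look like this:
# fill the NA ids with max(id) + 1, max(id) + 2, ...
df$id[is.na(df$id)] <- max(df$id, na.rm = TRUE) + seq_len(sum(is.na(df$id)))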

Pandas find columns with wildcard names

I have a pandas dataframe with column names like this:
id ColNameOrig_x ColNameOrig_y
There are many such columns, the 'x' and 'y' came about because 2 datasets with similar column names were merged.
What I need to do:
df.ColName = df.ColNameOrig_x + df.ColNameOrig_y
I am now manually repeating this line for many columns (close to 50); is there a wildcard way of doing this?
You can use DataFrame.filter with DataFrame.groupby, passing a lambda function and axis=1 to group by column-name prefix and aggregate with sum, or use string functions like Series.str.split with indexing:
df1 = df.filter(like='_').groupby(lambda x: x.split('_')[0], axis=1).sum()
print (df1)
ColName1Orig ColName2Orig
0 3 7
1 11 15
df1 = df.filter(like='_').groupby(df.columns.str.split('_').str[0], axis=1).sum()
print (df1)
ColName1Orig ColName2Orig
0 3 7
1 11 15
df1 = df.filter(like='_').groupby(df.columns.str[:12], axis=1).sum()
print (df1)
ColName1Orig ColName2Orig
0 3 7
1 11 15
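For reference, the printed results above are consistent with a small sample frame like the one defined further down; a minimal setup (my reconstruction) that reproduces them is:
import pandas as pd

df = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8]
], columns=['ColName1Orig_x', 'ColName1Orig_y', 'ColName2Orig_x', 'ColName2Orig_y'])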
You can use the subscripting syntax to access column names dynamically:
col_groups = ['ColName1', 'ColName2']
for grp in col_groups:
    df[grp] = df[f'{grp}Orig_x'] + df[f'{grp}Orig_y']
Or you can aggregate by column group. For example
df = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8]
], columns=['ColName1Orig_x', 'ColName1Orig_y', 'ColName2Orig_x', 'ColName2Orig_y'])
# Here's your opportunity to define the wildcard
col_groups = df.columns.str.extract('(.+)Orig_[xy]')[0]
df.columns = [col_groups, df.columns]
df.groupby(level=0, axis=1).sum()
Input:
ColName1Orig_x ColName1Orig_y ColName2Orig_x ColName2Orig_y
1 2 3 4
5 6 7 8
Output:
ColName1 ColName2
3 7
11 15

Dataframe, keep only one column

I can't find the pandas function that returns a one-column DataFrame from a multi-column DataFrame.
I need the exact opposite of the drop(['']) function.
Any ideas?
You can use the following notation to return a single column dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.randint(1, 100, (10, 5)), columns=list('ABCDE'))
df_out = df[['C']]
Output:
C
0 65
1 48
2 1
3 41
4 85
5 55
6 45
7 10
8 44
9 11
Note: df['C'] returns a Series. You can use the to_frame method to convert that Series into a DataFrame, or use the double brackets, [[]].
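A quick sketch of that conversion, using the same df as above:
s = df['C']            # this is a Series
df_out = s.to_frame()  # back to a one-column DataFrame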
For completeness, I would like to show how we can use the parameter drop to obtain a one-column dataframe from a multi-column one. I also explain the result in terms of the tidyverse (see the tidyverse paper).
Working with a minimal example for a dataframe DF
library(tidyverse)
DF <- data.frame(a = 1:2, b = c("e", "f"))
str(DF)
#> 'data.frame': 2 obs. of 2 variables:
#> $ a: int 1 2
#> $ b: chr "e" "f"
By the way, note that in R versions lower than 4.0, column b would be a factor by default (unless we use stringsAsFactors = FALSE).
Operator [ returns a list (dataframe) as it preserves the original structure (dataframe)
DF[1]
#> a
#> 1 1
#> 2 2
DF['a']
#> a
#> 1 1
#> 2 2
On the other hand, operator [[ simplifies the result to the simplest structure possible, a vector for a one-column dataframe. All three of the following expressions return the simplified version (a vector):
DF[[1]]
#> [1] 1 2
DF[['a']]
#> [1] 1 2
DF$a
#> [1] 1 2
Finally, using [ with row and column dimension
DF[, 1]
#> [1] 1 2
also returns the simplified version because the parameter drop is set to TRUE by default. Setting it to FALSE, you preserve the structure and obtain a one-column dataframe
DF[, 1, drop = FALSE]
#> a
#> 1 1
#> 2 2
A good explanation of this point can be found in Advanced R by Hadley Wickham (CRC, 2015), section 3.2.1, or in section 4.2.5 of the online version of the book (June 2021).
Finally, within the tidyverse, you always obtain a dataframe (tibble) when selecting one column:
DF %>%
select(2)
#> b
#> 1 e
#> 2 f
DF %>%
select("a")
#> a
#> 1 1
#> 2 2
DF %>%
select(a)
#> a
#> 1 1
#> 2 2
Created on 2021-06-04 by the reprex package (v0.3.0)
It is very simple: just use double brackets to select the column. The result is returned as a DataFrame, which you can check with type(column).
# select the 'Risk' column as a one-column DataFrame (assumes df has a 'Risk' column)
column = df[['Risk']]
print(column)
print(type(column))

Pandas changing value in a column for selected rows

Trying to create a new dataframe by first splitting the original one in two:
df1 - contains only the rows from the original frame whose selected column has values from a given list
df2 - contains only the rows from the original whose selected column has other values, with these values then changed to a new given value.
Return the new dataframe as the concatenation of df1 and df2.
This works fine:
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
print(df)
cat val
0 a 1
1 b 2
2 c 3
3 d 4
4 a 5
5 b 6
df['cat'] = df['cat'].apply(lambda x: 'other')
print(df)
cat val
0 other 1
1 other 2
2 other 3
3 other 4
4 other 5
5 other 6
Yet when I define a function:
def create_df(df, select, vals, other):
    df1 = df.loc[df[select].isin(vals)]
    df2 = df.loc[~df[select].isin(vals)]
    df2[select] = df2[select].apply(lambda x: other)
    result = pd.concat([df1, df2])
    return result
and call it:
df3 = create_df(df, 'cat', ['a','b'], 'xxx')
print(df3)
Which results in what I actually need:
cat val
0 a 1
1 b 2
4 a 5
5 b 6
2 xxx 3
3 xxx 4
And for some reason in this case I get a warning:
..\usr\conda\lib\site-packages\ipykernel\__main__.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
So how is this case (assigning a value to a column inside a function) different from the first one, where I assign the value outside a function?
What is the right way to change a column's value?
Well, there are many ways this code could be optimized, but for it to work you could simply save copies of the input dataframe and concat those:
def create_df(df, select, vals, other):
    df1 = df.copy()[df[select].isin(vals)]   # boolean index
    df2 = df.copy()[~df[select].isin(vals)]  # boolean index
    df2[select] = other                      # this is sufficient
    result = pd.concat([df1, df2])
    return result
Alternative version:
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
# define a mask
mask = df['cat'].isin(list("ab"))
# concatenate mask, nonmask
df2 = pd.concat([df[mask], df[~mask]])
# change values to 'xxx'
df2.loc[~mask, ["cat"]] = "xxx"
Outputs
cat val
0 a 1
1 b 2
4 a 5
5 b 6
2 xxx 3
3 xxx 4
Or function:
def create_df(df, filter_, isin_, value):
    # define a mask
    mask = df[filter_].isin(isin_)
    # concatenate mask, non-mask rows
    df = pd.concat([df[mask], df[~mask]])
    # change values outside the mask to the given value
    df.loc[~mask, [filter_]] = value
    return df
df2 = create_df(df, 'cat', ['a','b'], 'xxx')
df2