How to melt a dataframe with tidyverse, and create a new column

I have pet survey data from 6 households.
The households are split into levels (a,b).
I would like to melt the dataframe by animal name (id.var), household (var.name) and abundance (value.name), whilst adding a new column ("level") for the levels a and b.
My dataframe looks like this:
(screenshot: pet abundance data)
I can split it using reshape2::melt, but I don't know how to cut the a/b off the column names and make a new column out of them. Please help.
raw_data <- as.data.frame(raw_data)
melt(raw_data,
     id.vars = 'Animal', variable.name = 'Site', value.name = 'Abundance')

Having a go on some simulated data, pivot_longer is your best bet:
library(tidyverse)
df <- tibble(
  Animal = c("dog", "cat", "fish", "horse"),
  `1a` = sample(1:10, 4),
  `1b` = sample(1:10, 4),
  `2a` = sample(1:10, 4),
  `2b` = sample(1:10, 4),
  `3a` = sample(1:10, 4),
  `3b` = sample(1:10, 4)
)
df |>
  pivot_longer(
    -Animal,
    names_to = c("Site", "level"),
    values_to = "Abundance",
    names_pattern = "(.)(.)"
  ) |>
  arrange(Site, level)
#> # A tibble: 24 × 4
#>    Animal Site  level Abundance
#>    <chr>  <chr> <chr>     <int>
#>  1 dog    1     a             9
#>  2 cat    1     a             5
#>  3 fish   1     a             8
#>  4 horse  1     a             6
#>  5 dog    1     b             4
#>  6 cat    1     b             2
#>  7 fish   1     b             8
#>  8 horse  1     b            10
#>  9 dog    2     a             8
#> 10 cat    2     a             3
#> # … with 14 more rows
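If the site part of the column names can ever be longer than one character (say 10a or 12b), the single-character pattern above would only grab the first two characters. A minimal sketch, reusing the simulated df from the chunk above, with a more explicit regex (assuming sites are digits and levels are a/b):

# Assumes library(tidyverse) and `df` from the chunk above.
# "(\\d+)([ab])" captures the digits as Site and the trailing a/b as level,
# so multi-digit site names also split cleanly.
df |>
  pivot_longer(
    -Animal,
    names_to = c("Site", "level"),
    values_to = "Abundance",
    names_pattern = "(\\d+)([ab])"
  ) |>
  arrange(Site, level)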

Related

groupby to show the same row's values from other columns

After grouping by the "Mode" column and taking the max and min of the "Indicator" column, how can I get the corresponding values from the other columns to show in the same dataframe, like below:
df = pd.read_csv(r'relative.csv')
Grouped = df.groupby('Mode')['Indicator'].agg(['max', 'min'])
print(Grouped)
(From googling, it seems something like a col_value or row_value function could be used, but that looks more complicated. Could someone help solve this in an easier way? Thank you.)
You can do it in two steps, using groupby and idxmin()/idxmax():
# Create a df with the min values of 'Indicator', renaming the column 'Value' to 'B'
min = df.loc[df.groupby('Mode')['Indicator'].idxmin()].reset_index(drop=True).rename(columns={'Indicator': 'min', 'Value': 'B'})
print(min)
# Mode min B
# 0 A 1 6
# 1 B 1 7
# Create a df with the max values of 'Indicator', renaming the column 'Value' to 'A'
max = df.loc[df.groupby('Mode')['Indicator'].idxmax()].reset_index(drop=True).rename(columns={'Indicator': 'max', 'Value': 'A'})
print(max)
# Mode max A
# 0 A 3 2
# 1 B 4 3
# Merge the dataframes together
result = pd.merge(min, max)
# reorder the columns to match expected output
print(result[['Mode', 'max','min','A', 'B']])
# Mode max min A B
# 0 A 3 1 2 6
# 1 B 4 1 3 7
The logic is unclear; there is no real reason why you would call your columns A/B, since the values in them (the 6 and 3, for example) do not come from modes A/B.
I assume you want to achieve:
(df.groupby('Mode')['Indicator'].agg(['idxmax', 'idxmin'])
   .rename(columns={'idxmin': 'min', 'idxmax': 'max'}).stack()
   .to_frame('x').merge(df, left_on='x', right_index=True)
   .drop(columns=['x', 'Mode']).unstack()
)
Output:
     Indicator     Value
      max  min  max  min
Mode
A       3    1    2    6
B       4    1    3    7
C      10   10   20   20
Used input:
Mode Indicator Value
0 A 1 6
1 A 2 5
2 A 3 2
3 B 4 3
4 B 3 6
5 B 2 8
6 B 1 7
7 C 10 20
With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
    {
        "Mode": ["A", "A", "A", "B", "B", "B", "B"],
        "Indicator": [1, 2, 3, 4, 3, 2, 1],
        "Value": [6, 5, 2, 3, 6, 8, 7],
    }
)
new_df = df.groupby("Mode")["Indicator"].agg(["max", "min"])
print(new_df)
# Output
      max  min
Mode
A       3    1
B       4    1
Here is one way to do it with product from the Python standard library's itertools module and the pandas at property:
from itertools import product
for row, (col, func) in product(["A", "B"], [("A", "max"), ("B", "min")]):
    new_df.at[row, col] = df.loc[
        (df["Mode"] == row) & (df["Indicator"] == new_df.loc[row, func]), "Value"
    ].values[0]
new_df = new_df.astype(int)
Then:
print(new_df)
# Output
      max  min  A  B
Mode
A       3    1  2  6
B       4    1  3  7

How to convert rows into columns by group?

I'd like to do association analysis using the apriori algorithm.
To do that, I have to make a dataset.
The data I have looks like this:
data.frame("order_number"=c("100145", "100155", "100155", "100155",
"500002", "500002", "500002", "500007"),
"order_item"=c("27684535","15755576",
"1357954","124776249","12478324","15755576","13577","27684535"))
order_number order_item
1 100145 27684535
2 100155 15755576
3 100155 1357954
4 100155 124776249
5 500002 12478324
6 500002 15755576
7 500002 13577
8 500007 27684535
and I want to transform the data into this:
data.frame("order_number"=c("100145","100155","500002","500007"),
"col1"=c("27684535","15755576","12478324","27684535"),
"col2"=c(NA,"1357954","15755576",NA),
"col3"=c(NA,"124776249","13577",NA))
order_number col1 col2 col3
1 100145 27684535 <NA> <NA>
2 100155 15755576 1357954 124776249
3 500002 12478324 15755576 13577
4 500007 27684535 <NA> <NA>
Thank you for your help.
This would be a case for pivot_wider (or other functions for changing column layout). The first step is creating a row id variable to note whether each item is the 1st, 2nd or 3rd within its order, then shaping this into the dataframe you want:
df <- data.frame("order_number" = c("100145", "100155", "100155", "100155",
                                    "500002", "500002", "500002", "500007"),
                 "order_item" = c("27684535", "15755576", "1357954", "124776249",
                                  "12478324", "15755576", "13577", "27684535"))

library(tidyr)
library(dplyr)

df |>
  group_by(order_number) |>
  mutate(rank = row_number()) |>
  pivot_wider(names_from = rank, values_from = order_item,
              names_prefix = "col")
#> # A tibble: 4 × 4
#> # Groups:   order_number [4]
#>   order_number col1     col2     col3
#>   <chr>        <chr>    <chr>    <chr>
#> 1 100145       27684535 <NA>     <NA>
#> 2 100155       15755576 1357954  124776249
#> 3 500002       12478324 15755576 13577
#> 4 500007       27684535 <NA>     <NA>
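As a side note: if the wide data frame is only a stepping stone towards apriori, a plain list of items per order may already be enough (packages such as arules can typically work from such a list). A minimal base-R sketch, assuming the df defined above:

# Assumes `df` from the chunk above. split() returns one character vector of
# order_item values per order_number, i.e. one "basket" per order.
baskets <- split(df$order_item, df$order_number)
baskets[["100155"]]
# [1] "15755576"  "1357954"   "124776249"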

Finding Identical Rows in Multiple Datasets

I am trying to find out if 3 datasets (df1, df2, df3) have any common rows (i.e. entire row is a duplicate).
I figured out how to do this for pairs of 2 datasets:
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"))
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"))
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"))
library(dplyr)
inner_join(df1, df2)
inner_join(df1, df3)
inner_join(df2, df3)
Is it possible to do this for 3 datasets all at once?
The straightforward way does not seem to work:
inner_join(df1, df2, df3)
Error in `[.data.frame`(by, c("x", "y")) : undefined columns selected
I thought I had found a way to do this:
library(plyr)
join_all(list(df1, df2, df3), type='inner')
But this is telling me that there are no common rows (i.e. same id, same name) between these 3 dataframes:
Joining by: id, names
Joining by: id, names
[1] id names
<0 rows> (or 0-length row.names)
This is not correct, seeing as in the example I created:
Row 3 of df1 and df2 is identical (id = 3, name = peter)
Row 2 of df2 and df3 is identical (id = 2, name = john)
I am trying to find a way to determine if these 3 datasets share any common rows. Can this be done in R?
Thank you!
Here is how you could achieve your task:
library(dplyr)
bind_rows(df1, df2, df3) %>%
  group_by(id, names) %>%
  filter(n() > 1) %>%
  unique()
id names
<dbl> <chr>
1 3 peter
2 2 john
You can do this using get_dupes from the janitor package.
library(tidyverse)
library(janitor)
# Added a new column 'df_id' to identify the data frame
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"), df_id = 1)
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"), df_id = 2)
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"), df_id = 3)
# Bind dataframes
# Get duplicates
df1 %>%
  bind_rows(df2) %>%
  bind_rows(df3) %>%
  get_dupes(c(id, names))
#> id names dupe_count df_id
#> 1 2 john 2 2
#> 2 2 john 2 3
#> 3 3 peter 2 1
#> 4 3 peter 2 2
Does this count?
dfall <- bind_rows(df1, df2, df3)
dfall[duplicated(dfall), ]
id names
6 3 peter
8 2 john
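One caveat: duplicated() only flags the second and later occurrences of a row, so each shared row shows up once and you cannot tell which data frames it came from. A small follow-up sketch, assuming dfall from above, that keeps every copy of a repeated row:

# fromLast = TRUE also flags the first occurrence, so all copies are kept
dfall[duplicated(dfall) | duplicated(dfall, fromLast = TRUE), ]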
A possible solution (in case you want a data frame as the result, just pipe bind_rows at the end, as sketched after the output):
library(dplyr)
combn(paste0("df", 1:3), 2, simplify = FALSE, \(x) inner_join(get(x[1]), get(x[2])))
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> [[1]]
#> id names
#> 1 3 peter
#>
#> [[2]]
#> [1] id names
#> <0 rows> (or 0-length row.names)
#>
#> [[3]]
#> id names
#> 1 2 john
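For completeness, the bind_rows variant mentioned above could look roughly like this (assuming df1, df2, df3 and dplyr are defined and loaded as above):

# Collapse the pairwise inner joins into a single data frame
combn(paste0("df", 1:3), 2, \(x) inner_join(get(x[1]), get(x[2])),
      simplify = FALSE) |>
  bind_rows()
#   id names
# 1  3 peter
# 2  2 john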

Iteratively get the max of a data frame column, add one and repeat for all rows in r

I need to perform a database operation where I'll be adding new data to an existing table and then assigning the new rows a unique id. I'm asking about this in R so I can get the logic straight before I attempt to rewrite it in SQL or pyspark.
Imagine that I've already added the new data to the existing data. Here's a simplified version of what it might look like:
library(tidyverse)
df <- tibble(id = c(1, 2, 3, NA, NA),
             descriptions = c("dodgers", "yankees", "giants", "orioles", "mets"))
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 NA orioles
5 NA mets
What I want is:
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 4 orioles
5 5 mets
And I can't use arrange with rowid_to_column, as the existing id's would be deleted.
To get a unique id for the NA rows while not changing the existing ones, I want to get the max of the id column, add one, replace the NA with that value and then move to the next row. My instinct was to do something like this: df %>% mutate(new_id = max(id, na.rm = TRUE) + 1) but that only gets the max plus one, not a new max for each row. I feel like I could do this with a mapping function, but what I've tried returns a result identical to the input dataframe:
df %>%
  mutate(id = ifelse(is.na(id),
                     map_dbl(id, ~ max(.) + 1, na.rm = FALSE),
                     id))
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 NA orioles
5 NA mets
Thanks in advance. Now if someone can help me directly in SQL, that's also a plus!
SQL option, using sqldf for demo:
sqldf::sqldf("
  with cte as (
    select max(id) as maxid from df
  )
  select cte.maxid + row_number() over () as id, df.descriptions
  from df
  left join cte
  where df.id is null
  union
  select * from df where id is not null")
# id descriptions
# 1 1 dodgers
# 2 2 yankees
# 3 3 giants
# 4 4 orioles
# 5 5 mets
Here is one method where we add the max value to the cumulative sum of a logical vector based on the NA values, and coalesce with the original column 'id':
library(dplyr)
df <- df %>%
  mutate(id = coalesce(id, max(id, na.rm = TRUE) + cumsum(is.na(id))))
Output:
df
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 4 orioles
5 5 mets
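If you'd rather prototype the same idea without dplyr before porting it to SQL or pyspark, a minimal base R sketch (assuming the df tibble from the question) could be:

# Fill the NA ids with max(id) + 1, max(id) + 2, ... in row order
na_rows <- is.na(df$id)
df$id[na_rows] <- max(df$id, na.rm = TRUE) + seq_len(sum(na_rows))
df
# the NA ids become 4 and 5 (orioles, mets), matching the dplyr result above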

Adding the lower levels of two Pandas MultiIndex columns

I have the following DataFrame:
import pandas as pd
columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'],
                                     ['p', 'm', 'p', 'm']])
values = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
df = pd.DataFrame(values, columns=columns)
   n1      n2
    p   m   p   m
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
Now I want to add another column (n3) to this DataFrame whose lower-level columns p and m should be the sums of the corresponding lower-level columns of n1 and n2:
   n1      n2      n3
    p   m   p   m   p   m
0   1   2   3   4   4   6
1   5   6   7   8  12  14
2   9  10  11  12  20  22
Here's the code I came up with:
n3 = df[['n1', 'n2']].sum(axis=1, level=1)
level1 = df.columns.levels[1]
n3.columns = pd.MultiIndex.from_arrays([['n3'] * len(level1), level1])
df = pd.concat([df, n3], axis=1)
This does what I want, but feels very cumbersome compared to code that doesn't use MultiIndex columns:
df['n3'] = df[['n1', 'n2']].sum(axis=1)
My current code also only works for a column MultiIndex consisting of two levels, and I'd be interested in doing this for arbitrary levels.
What's a better way of doing this?
One way to do this is with stack and unstack:
new_df = df.stack(level=1)
new_df['n3'] = new_df.sum(axis=1)
new_df.unstack(level=-1)
Output:
   n1      n2      n3
    m   p   m   p   m   p
0   2   1   4   3   6   4
1   6   5   8   7  14  12
2  10   9  12  11  22  20
If you build the structure like:
df['n3', 'p'] = 1
df['n3', 'm'] = 1
then you can write:
df['n3'] = df[['n1', 'n2']].sum(axis=1, level=1)
Here's another way that I just discovered which does not reorder the columns:
# Sum column-wise on level 1
s = df.loc[:, ['n1', 'n2']].sum(axis=1, level=1)
# Prepend a column level
s = pd.concat([s], keys=['n3'], axis=1)
# Add column to DataFrame
df = pd.concat([df, s], axis=1)