Joining two dataframes which have variable names as values in R / SQL

I am using R and trying to join two data frames that look like this:
DF1:
Name Species Value Variable_id
Jake Human 99 1
Jake Human 20 2
Mike Lizard 12 1
Mike Lizard 30 2
DF2:
Variable_id Varible_name
1 Height
2 Age
I need the result in this form:
Name Species Height Age
Jake Human 99 20
Mike Lizard 12 30

library(dplyr)
library(tidyr)
DF1 %>%
  left_join(DF2) %>%
  select(-Variable_id) %>%
  pivot_wider(names_from = Varible_name, values_from = Value)
# Joining, by = "Variable_id"
# # A tibble: 2 x 4
# Name Species Height Age
# <chr> <chr> <int> <int>
# 1 Jake Human 99 20
# 2 Mike Lizard 12 30
Using this data:
DF1 = read.table(text = 'Name Species Value Variable_id
Jake Human 99 1
Jake Human 20 2
Mike Lizard 12 1
Mike Lizard 30 2', header = TRUE)
DF2 = read.table(text = "Variable_id Varible_name
1 Height
2 Age", header = TRUE)
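For comparison, the same join-then-widen idea can be sketched in pandas. This is only a rough equivalent of the dplyr/tidyr pipeline above, not part of the original answer; the names df1, df2 and wide are made up for the example, and passing a list to pivot's index assumes pandas 1.1 or newer.

```python
import pandas as pd

# Same sample data as DF1 / DF2 above (the "Varible_name" spelling is kept on purpose)
df1 = pd.DataFrame({
    "Name": ["Jake", "Jake", "Mike", "Mike"],
    "Species": ["Human", "Human", "Lizard", "Lizard"],
    "Value": [99, 20, 12, 30],
    "Variable_id": [1, 2, 1, 2],
})
df2 = pd.DataFrame({"Variable_id": [1, 2], "Varible_name": ["Height", "Age"]})

wide = (
    df1.merge(df2, on="Variable_id", how="left")  # attach the variable names
       .pivot(index=["Name", "Species"], columns="Varible_name", values="Value")
       .reset_index()
)
wide.columns.name = None                           # drop the leftover axis label
wide = wide[["Name", "Species", "Height", "Age"]]  # pivot sorts the new columns
print(wide)
```

pivot raises an error if a Name/Species pair has the same variable twice; pivot_table with an aggregation function would be the fallback in that case.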

Related

Vlookup from the same pandas dataframe

I have a hierarchical dataset that looks like this:
emp_id emp_name emp_manager emp_org_lvl
1 John S Bob A 1
2 Bob A Paul P 2
3 Paul P Charles Y 3
What I want to do is extend this table to have the emp_name for each manager going up the org chart. E.g.
emp_id emp_name emp_manager emp_org_lvl lvl2_name lvl3_name
1 John S Bob A 1 Paul P Charles Y
In Excel, I would put a VLOOKUP in column lvl2_name to find Bob A's manager, something like vlookup(c2,B:C,2,False). In pandas, the suggested direction seems to be merge. The problem is that merge seems to require two separate dataframes, and you can't specify which column to return. Is there a better way than keeping a separate dataframe for each emp_org_lvl?
# Code to create table:
import pandas as pd
header = ['emp_id','emp_name','emp_manager','emp_org_lvl']
data = [[ 1,'John S' ,'Bob A', 1],[2, 'Bob A', 'Paul P', 2],[3, 'Paul P', 'Charles Y', 3]]
df = pd.DataFrame(data, columns=header)
You can try this:
# provide a lookup for employee to manager
manager_dict = dict(zip(df.emp_name, df.emp_manager))
# initialize the loop
levels_to_go_up = 3
employee_column_name = 'emp_manager'
# loop and keep adding columns to the dataframe
for i in range(2, levels_to_go_up + 1):
    new_col_name = f'lvl{i}_name'
    # create a new column by looking up employee_column_name's manager
    df[new_col_name] = df[employee_column_name].map(manager_dict)
    employee_column_name = new_col_name
>>>df
Out[67]:
emp_id emp_name emp_manager emp_org_lvl lvl2_name lvl3_name
0 1 John S Bob A 1 Paul P Charles Y
1 2 Bob A Paul P 2 Charles Y NaN
2 3 Paul P Charles Y 3 NaN NaN
Alternatively if you wanted to retrieve ALL managers in the tree, you could use a recursive function, and return the results as a list:
def retrieve_managers(name, manager_dict, manager_list=None):
    # start a fresh list on the first call (avoids a shared mutable default)
    if manager_list is None:
        manager_list = []
    manager = manager_dict.get(name)
    if manager:
        manager_list.append(manager)
        # keep walking up the tree from this manager
        return retrieve_managers(manager, manager_dict, manager_list)
    return manager_list
df['manager_list'] = df.emp_name.apply(lambda x: retrieve_managers(x, manager_dict))
>>> df
Out[71]:
emp_id emp_name emp_manager emp_org_lvl manager_list
0 1 John S Bob A 1 [Bob A, Paul P, Charles Y]
1 2 Bob A Paul P 2 [Paul P, Charles Y]
2 3 Paul P Charles Y 3 [Charles Y]
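If the hierarchy could be deep, or if the data might accidentally contain a loop (A reports to B, B reports to A), a loop-based variant avoids unbounded recursion. This is a minimal sketch under those assumptions; manager_chain is a made-up helper name, not something from pandas.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, "John S", "Bob A", 1], [2, "Bob A", "Paul P", 2], [3, "Paul P", "Charles Y", 3]],
    columns=["emp_id", "emp_name", "emp_manager", "emp_org_lvl"],
)
manager_dict = dict(zip(df["emp_name"], df["emp_manager"]))


def manager_chain(name, lookup):
    """Walk up the lookup from name; stop at the top or when a name repeats."""
    chain, seen = [], {name}
    current = lookup.get(name)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = lookup.get(current)
    return chain


df["manager_list"] = df["emp_name"].map(lambda n: manager_chain(n, manager_dict))
print(df)
```

The seen set is the only real difference from the recursive version: it stops the walk as soon as a name comes up a second time.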
Finally, you can in fact self-join a dataframe while subselecting columns.
df = df.merge(df[['emp_name', 'emp_manager']], left_on='emp_manager', right_on='emp_name', suffixes=("", "_joined"), how='left')
>>> df
Out[82]:
emp_id emp_name emp_manager emp_org_lvl emp_name_joined emp_manager_joined
0 1 John S Bob A 1 Bob A Paul P
1 2 Bob A Paul P 2 Paul P Charles Y
2 3 Paul P Charles Y 3 NaN NaN
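To get back only the one column you care about, VLOOKUP style, one option is to rename the two-column lookup before each merge so the key lines up and the returned column already has its final name. A rough sketch, assuming the same sample table; lookup and out are names invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, "John S", "Bob A", 1], [2, "Bob A", "Paul P", 2], [3, "Paul P", "Charles Y", 3]],
    columns=["emp_id", "emp_name", "emp_manager", "emp_org_lvl"],
)

# Two-column lookup table: employee -> manager
lookup = df[["emp_name", "emp_manager"]]

out = df.copy()
prev_col = "emp_manager"
for level in (2, 3):
    new_col = f"lvl{level}_name"
    # Rename so the key matches prev_col and the returned column is new_col
    renamed = lookup.rename(columns={"emp_name": prev_col, "emp_manager": new_col})
    out = out.merge(renamed, on=prev_col, how="left")
    prev_col = new_col

print(out)
```

Because the merge key has the same name on both sides, no suffixed duplicate columns are created, and a left merge keeps the original row order.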

Converting dataframe to long format based on unique pieces of information stored in current variable names (separated by "_") using R

I'm looking for a way to pull out pieces of information from existing variable names, create new variables to store that extracted information, and convert the data frame to long format based on the newly defined variables. The data are psychology treatment survey responses.
I can use R or SQL.
Information contained in the existing variable names:
episode = each time individual participates in program is an episode
subject = the individual filling out the survey (can be participant, mother, father, etc.)
type = the name of the current survey (note: some surveys have additional identifying information separated by "_")
instance = number of days since admission or discharge
description = question number or other information unique to that column
Currently, each piece of information is separated by "_".
Here is the format: episode_subject_type_instance_description
## Have data currently in this format, but with almost 5000 variables
tibble(case_name = c("Joe", "Mary", "Jane"),
       episode1_student_survey1_day0_Q1 = c(1, 2, 3),
       episode1_student_survey1_day0_Q2 = c("A", "B", "C"))
# A tibble: 3 x 3
case_name episode1_student_survey1_day0_Q1 episode1_student_survey1_day0_Q2
<chr> <dbl> <chr>
1 Joe 1 A
2 Mary 2 B
3 Jane 3 C
## Want to transform to long like this:
tibble(case_name = c("Joe", "Joe", "Mary", "Mary", "Jane", "Jane"),
       episode = "episode1",
       subject = "student",
       type = "survey1",
       instance = "day0",
       description = c("Q1", "Q2", "Q1", "Q2", "Q1", "Q2"),
       value = c(1, "A", 2, "B", 3, "C"))
# A tibble: 6 x 7
case_name episode subject type instance description value
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Joe episode1 student survey1 day0 Q1 1
2 Joe episode1 student survey1 day0 Q2 A
3 Mary episode1 student survey1 day0 Q1 2
4 Mary episode1 student survey1 day0 Q2 B
5 Jane episode1 student survey1 day0 Q1 3
6 Jane episode1 student survey1 day0 Q2 C
I'm assuming there is some way to pull each piece of information out at a time but not sure how to go about this.
Thanks for any and all help!!
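In R, convert every column except case_name to character, reshape to long format with pivot_longer, then split the name column into its parts with separate (its default separator matches any run of non-alphanumeric characters, so it splits on "_"):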
library(dplyr)
library(tidyr)
nm1 <- c("episode", "subject", "type", "instance", "description")
df1 %>%
  mutate(across(-case_name, as.character)) %>%
  pivot_longer(cols = -case_name) %>%
  separate(name, into = nm1)
Output:
# A tibble: 6 x 7
case_name episode subject type instance description value
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Joe episode1 student survey1 day0 Q1 1
2 Joe episode1 student survey1 day0 Q2 A
3 Mary episode1 student survey1 day0 Q1 2
4 Mary episode1 student survey1 day0 Q2 B
5 Jane episode1 student survey1 day0 Q1 3
6 Jane episode1 student survey1 day0 Q2 C
Data:
df1 <- tibble(case_name = c("Joe", "Mary", "Jane"),
              episode1_student_survey1_day0_Q1 = c(1, 2, 3),
              episode1_student_survey1_day0_Q2 = c("A", "B", "C"))
Here is a solution which uses melt() to reshape from wide to long format and the new measure() function to split the column names:
library(data.table) # development version 1.14.1 used
melt(setDT(df1),
     measure.vars = measure(episode, subject, type, instance, description, sep = "_"))
case_name episode subject type instance description value
1: Joe episode1 student survey1 day0 Q1 1
2: Mary episode1 student survey1 day0 Q1 2
3: Jane episode1 student survey1 day0 Q1 3
4: Joe episode1 student survey1 day0 Q2 A
5: Mary episode1 student survey1 day0 Q2 B
6: Jane episode1 student survey1 day0 Q2 C
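The same reshape can be sketched in pandas: melt to long format, then split the old column names on "_". This assumes every column name has exactly five "_"-separated pieces; names where the survey type itself contains extra underscores would need a regular expression instead. The variable names parts, long and split are invented for the example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "case_name": ["Joe", "Mary", "Jane"],
    "episode1_student_survey1_day0_Q1": [1, 2, 3],
    "episode1_student_survey1_day0_Q2": ["A", "B", "C"],
})
parts = ["episode", "subject", "type", "instance", "description"]

# Wide -> long: one row per (case_name, original column name)
long = df1.melt(id_vars="case_name", var_name="name", value_name="value")
long["value"] = long["value"].astype(str)  # mixed numeric/text answers -> text

# Split the old column names into their five pieces
split = long["name"].str.split("_", expand=True)
split.columns = parts  # fails loudly if a name does not have exactly five pieces

long = pd.concat([long[["case_name"]], split, long[["value"]]], axis=1)
print(long)
```

Rows come out grouped by original column (all Q1 rows, then all Q2 rows), like the melt() output above.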

Iteratively get the max of a data frame column, add one and repeat for all rows in r

I need to perform a database operation where I'll be adding new data to an existing table and then assigning the new rows a unique id. I'm asking about this in R so I can get the logic straight before I attempt to rewrite it in sql or pyspark.
Imagine that I've already added the new data to the existing data. Here's a simplified version of what it might look like:
library(tidyverse)
df <- tibble(id = c(1, 2, 3, NA, NA),
             descriptions = c("dodgers", "yankees", "giants", "orioles", "mets"))
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 NA orioles
5 NA mets
What I want is:
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 4 orioles
5 5 mets
And I can't use arrange with rowid_to_column, because ids may be deleted.
To get a unique id for the NA rows while not changing the existing ones, I want to get the max of the id column, add one, replace NA with that value and then move to the next row. My instinct was to do something like this: df %>% mutate(new_id = max(id, na.rm = TRUE) + 1) but that only gets the max plus one, not a new max for each row. I feel like I could do this with a mapping function, but what I've tried returns a result identical to the input dataframe:
df %>%
  mutate(id = ifelse(is.na(id),
                     map_dbl(id, ~max(.) + 1, na.rm = FALSE),
                     id))
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 NA orioles
5 NA mets
Thanks in advance--now if someone can help me directly in sql, that's also a plus!
SQL option, using sqldf for demo:
sqldf::sqldf("
  with cte as (
    select max(id) as maxid from df
  )
  select cte.maxid + row_number() over () as id, df.descriptions
  from df
    left join cte where df.id is null
  union
  select * from df where id is not null")
# id descriptions
# 1 1 dodgers
# 2 2 yankees
# 3 3 giants
# 4 4 orioles
# 5 5 mets
Here is one method: add the cumulative sum of is.na(id) (a running count of the missing ids) to the maximum existing id, then coalesce with the original id column so that existing ids are kept:
library(dplyr)
df <- df %>%
  mutate(id = coalesce(id, max(id, na.rm = TRUE) + cumsum(is.na(id))))
Output:
df
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 4 orioles
5 5 mets
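Since the question mentions porting the logic elsewhere, here is a rough pandas version of the same max-plus-running-count idea. It is a sketch, not part of the original answers; missing is a name invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, None, None],
    "descriptions": ["dodgers", "yankees", "giants", "orioles", "mets"],
})

missing = df["id"].isna()  # True for the rows that still need an id
# max() skips NaN; cumsum() numbers the missing rows 1, 2, ...
df["id"] = df["id"].fillna(df["id"].max() + missing.cumsum())
df["id"] = df["id"].astype(int)  # the NaNs had forced the column to float
print(df)
```

fillna aligns on the index, so existing ids are left untouched and only the missing rows pick up max + 1, max + 2, and so on.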

how to apply one hot encoding or get dummies on 2 columns together in pandas?

I have the dataframe below, which contains sample values like these:
df = pd.DataFrame([["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]], columns= ["city_1", "city_2", "id"])
city_1 city_2 id
London Cambridge 20
Cambridge London 10
Liverpool London 30
I need the output dataframe as below which is built while joining 2 city columns together and applying one hot encoding after that:
id London Cambridge Liverpool
20 1 1 0
10 1 1 0
30 1 0 1
Currently, I am using the code below, which encodes each column separately. Could you advise whether there is a pythonic way to get the output above?
output_df = pd.get_dummies(df, columns=['city_1', 'city_2'])
which results in
id city_1_Cambridge city_1_London (and so on, one column per column/city pair)
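You can pass prefix='' and prefix_sep='' to get_dummies so that both city columns produce the same column names, then aggregate the duplicated names: use max if you want only 1 or 0 values (dummy or indicator columns), or sum if you need counts:
# note: the level argument of DataFrame.max was removed in pandas 2.0, so this needs an older pandas
output_df = (pd.get_dummies(df, columns=['city_1', 'city_2'], prefix_sep='', prefix='')
               .max(axis=1, level=0))
print(output_df)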
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
Or, if you want to encode every column except id, first move the column(s) you don't want encoded into the index with DataFrame.set_index, then use get_dummies with max, and finally call DataFrame.reset_index:
# note: the level argument of DataFrame.max was removed in pandas 2.0, so this needs an older pandas
output_df = (pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
               .max(axis=1, level=0)
               .reset_index())
print(output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
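On recent pandas versions, where reductions no longer take a level argument, one way to collapse the duplicated column names is to transpose, group by the labels, and transpose back. A sketch under the assumption that id should stay out of the encoding; dummies is a name invented for the example, and dtype=int is passed because newer versions return booleans from get_dummies by default.

```python
import pandas as pd

df = pd.DataFrame(
    [["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]],
    columns=["city_1", "city_2", "id"],
)

# Empty prefixes give both city columns the same dummy names (so names repeat)
dummies = pd.get_dummies(df.set_index("id"), prefix="", prefix_sep="", dtype=int)

# Transpose, take the max per repeated label, transpose back
output_df = dummies.T.groupby(level=0).max().T.reset_index()
print(output_df)
```

Swapping max() for sum() gives counts instead of 0/1 flags, which matters if the same city can appear in both columns of a row.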

return rows which elements are duplicates, not the logical vector

I know the duplicated function (base R). The problem is that it only returns a logical vector indicating which elements (rows) are duplicates.
I want to get back the rows themselves rather than the logical vector.
Specifically, I want all observations for A and B, because each of them has duplicated values for the key Name and Year.
I have already coded this:
df %>% group_by(Name) %>% filter(any(( ?????)))
but I don't know how to write the last part of the code.
Any ideas?
Thanks :)
An option using dplyr can be achieved by grouping on both Name and Year to calculate count. Afterwards group on only Name and filter for groups having any count > 1 (meaning duplicate):
library(dplyr)
df %>%
  group_by(Name, Year) %>%
  mutate(count = n()) %>%
  group_by(Name) %>%
  filter(any(count > 1)) %>%
  select(-count)
# # A tibble: 7 x 3
# # Groups: Name [2]
# Name Year Value
# <chr> <int> <int>
# 1 A 1990 5
# 2 A 1990 3
# 3 A 1991 5
# 4 A 1995 5
# 5 B 2000 0
# 6 B 2000 4
# 7 B 1998 5
Data:
df <- read.table(text =
"Name Year Value
A 1990 5
A 1990 3
A 1991 5
A 1995 5
B 2000 0
B 2000 4
B 1998 5
C 1890 3
C 1790 2",
header = TRUE, stringsAsFactors = FALSE)
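The same two-step logic (flag duplicated Name/Year pairs, then keep every row of any Name that has one) can be sketched in pandas. This is only an illustration alongside the dplyr answer; is_dup and result are names invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": list("AAAABBBCC"),
    "Year": [1990, 1990, 1991, 1995, 2000, 2000, 1998, 1890, 1790],
    "Value": [5, 3, 5, 5, 0, 4, 5, 3, 2],
})

# keep=False marks every member of a duplicated (Name, Year) pair, not just the repeats
is_dup = df.duplicated(subset=["Name", "Year"], keep=False)

# Broadcast "this Name has at least one duplicated pair" back to all of its rows
result = df[is_dup.groupby(df["Name"]).transform("any")]
print(result)
```

Using df[is_dup] on its own would return only the duplicated rows themselves (A 1990 twice, B 2000 twice) rather than every observation for A and B.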