Match fields within one data frame with column names in another data frame

I have two data frames. In the last column ("Bill") of the first data frame, I want to apply a function (fixed price + Quantity * price per qty). To apply the function, R should match the values in the first column of df1 to the column names of df2.
I have solved the problem with a function and several ifelse statements, but I would like a statement that automatically matches the values in df1 with the column names in df2. My data set contains more than 2 million rows, and I will need to apply the same logic when building other similar functions, so I would prefer something that does not require a loop or take too long to process.
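One vectorised way to do this kind of lookup, shown here as a minimal sketch that assumes df1 and df2 are built as in the first answer below (df1$Code holds the codes, and df2 has one column per code with the rows fixed_price and price_per_qty):
# minimal sketch: turn df2 into a matrix so it can be indexed directly by the codes in df1
m2 <- as.matrix(df2)
df1$Bill <- m2["fixed_price", df1$Code] + df1$Quantity * m2["price_per_qty", df1$Code]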

### Set up your data frames like so ###
Code <- c("a1", "a2", "c3", "a1")
Name <- c("Dan", "David", "Anna", "Lisa")
Quantity <- c(30, 12, 10, 10)
df1 <- as.data.frame(cbind("Code" = Code, "Name" = Name, "Quantity" = Quantity), stringsAsFactors = F)
df1$Quantity <- as.numeric(df1$Quantity)
fixed_price <- c(12, 5, 23)
price_per_qty <- c(1, 4, 7)
df2 <- as.data.frame(rbind("fixed_price" = fixed_price, "price_per_qty" = price_per_qty))
colnames(df2) <- c("a1", "a2", "c3")
### Combine dataframe 1 and 2 into a single dataframe ###
# Code below pulls individual columns from df2 based on the
# index provided by the "Code" column in df1, transposes them
# so they'll line up with df1, then column binds them to df1
df3 <- cbind(df1, t(df2[,df1$Code]))
# the bill is calculated simply enough
bill <- df3[4] + df3[3] * df3[5]
colnames(bill) <- "bill"
# Finally, output the results as you wanted
cbind(df3, bill)

My answer is fairly similar to graggsd's, but here is what worked for me. I merged the two data frames on the key column "Code" into a single data frame, combined_data, and then applied a function (which I think is the one you defined above) to the relevant columns.
df2 <- t(data.frame(c(12,1),c(5,4),c(23,7)))
rownames(df2) <- c("a1","a2","c3")
test <- rownames(df2)
df2 <- cbind.data.frame(df2,test)
colnames(df2) <- c("fixed price","price/qty","Code")
df1 <- data.frame(c("a1","a2","c3","a1"), c("Dan","David","Anna","Lisa"),c(30,12,10,10))
colnames(df1) <- c("Code","Name","Quantity")
combined_data <- dplyr::inner_join(df1,df2, by = "Code")
f1 <- function(x,y,z){
x + y * z
}
bill <- f1(combined_data[,4],combined_data[,3],combined_data[,5])
finalDataSet <- cbind.data.frame(combined_data,bill)
The final data set:
  Code  Name Quantity fixed price price/qty bill
1   a1   Dan       30          12         1   42
2   a2 David       12           5         4   53
3   c3  Anna       10          23         7   93
4   a1  Lisa       10          12         1   22

Related

How to make dataframe from different parts of an Excel sheet given specific keywords?

I have one Excel file where multiple tables are placed in the same sheet. My requirement is to read certain tables based on a keyword. I have read the tables using the skiprows and nrows method, which works for now, but it will break in the future because the table lengths are dynamic.
Is there any other workaround, apart from skiprows and nrows, to read the tables shown in the picture?
I want to read data1 as one table and data2 as another table, and in particular I want the columns "RR", "FF" and "WW" as two different data frames.
I would appreciate it if someone could help or guide me on how to do this.
Method I have tried:
import glob
import pandas as pd

all_files = glob.glob(INPATH + "*sample*")  # INPATH is the folder that holds the Excel files
df1 = pd.read_excel(all_files[0], skiprows=11, nrows=3)
df2 = pd.read_excel(all_files[0], skiprows=23, nrows=3)
This works fine; the only problem is that the table length will vary every time.
With an Excel file identical to the one in your image, here is one way to do it:
import pandas as pd
df = pd.read_excel("file.xlsx").dropna(how="all").reset_index(drop=True)
# Setup
targets = ["Data1", "Data2"]
indices = [df.loc[df["Unnamed: 0"] == target].index.values[0] for target in targets]
dfs = []
for i in range(len(indices)):
    # Slice df from the current index up to (but not including) the next one
    try:
        data = df.loc[indices[i] : indices[i + 1] - 1, :]
    except IndexError:
        data = df.loc[indices[i] :, :]
    # Within one slice, keep only the rows from the one starting with 'rr'
    r_idx = data.loc[df["Unnamed: 0"] == "rr"].index.values[0]
    data = data.loc[r_idx:, :].reset_index(drop=True).dropna(how="all", axis=1)
    # Cleanup
    data.columns = data.iloc[0]
    data.columns.name = ""
    dfs.append(data.loc[1:, :].iloc[:, 0:3])
And so:
for item in dfs:
    print(item)
# Output
     rr       ff          ww
1  car1  1000000     sellout
2  car2  1500000  to be sold
3  car3  1300000     sellout
     rr       ff          ww
1  car1  1000000     sellout
2  car2  1500000  to be sold
3  car3  1300000     sellout

Creating new variables in a data frame

Maybe you can help. I want to create a new variable with a for loop on something like this:
df1 <- read.table(textConnection("Subject Module ID
History WW21 1
English Literature1 2
Maths Algebra1 3"),
stringsAsFactors=FALSE, header=TRUE)
df2 <- read.table(textConnection("Subject Module ID
History WW22 1
English Literature2 2
Maths Algebra2 3"),
stringsAsFactors=FALSE, header=TRUE)
df3 <- read.table(textConnection("Subject Module ID
History WW23 1
English Literature3 2
Maths Algebra3 3"),
stringsAsFactors=FALSE, header=TRUE)
df <- rbind(df1,df2,df3)
dframe <- dplyr::arrange(df, Subject)
Now I want to create a new variable (Allthreemodule) that concatenates the three modules of each Subject:
paste(unlist(dframe$Module), collapse=",")
** This has concatenated ALL Modules **
[1] "WW21,Literature1,Algebra1,WW22,Literature2,Algebra2,WW23,Literature3,Algebra3"
So I want something based on a for loop (like: for i in Subject -> concatenate that Subject's modules) to get the desired result.
I tried this without success:
for (i in Subject) {
  dframe$CONCA[i] <- paste(unlist(dframe$Module, Subject == i), collapse=",")
}
It is giving the wrong results (screenshot in the original post).
My final table should look like the screenshot in the original post.
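One possible way to get that per-Subject concatenation, shown as a minimal sketch that assumes dplyr is available and dframe is built as above (Allthreemodule is the column name the question asks for):
library(dplyr)
# for each Subject, paste all of its modules into one comma-separated string
dframe <- dframe %>%
  group_by(Subject) %>%
  mutate(Allthreemodule = paste(Module, collapse = ",")) %>%
  ungroup()
dframe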

How do I use df.add_suffix to add suffixes to duplicate column names in Pandas?

I have a large dataframe with 400 columns. 200 of the column names are duplicates of the first 200. How can I use df.add_suffix to add a suffix only to the duplicate column names?
Or is there a better way to do it automatically?
Here is my solution, starting with:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(1, -1), columns=['a', 'b', 'a', 'b'])
Output:
   a  b  a  b
0  0  1  2  3
Then I use a lambda function to append a suffix to the duplicated names:
df.columns += np.vectorize(lambda x: '_' if x else '')(df.columns.duplicated())
Output:
   a  b  a_  b_
0  0  1   2   3
If you have more than one duplicate then you can loop until there is none left. This works for duplicated indices too, it also keeps the index name.
If I understand your question correctly, you have each name twice. If so, it is possible to ask for duplicated values using df.columns.duplicated(). Then you can create a new list, only modifying duplicated values and adding your self-defined suffix. This is different from the other posted solution, which modifies all entries.
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
my_suffix = 'T'
df.columns = [name if not duplicated else name + my_suffix
              for duplicated, name in zip(df.columns.duplicated(), df.columns)]
df
>>>
a aT b bT
0 1 2 3 4
My answer has the disadvantage that the dataframe can have duplicated column names if one name is used three or more times.
You could do:
import pandas as pd
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3]], columns=list('aaa'))
# create unique identifier for each repeated column
identifier = df.columns.to_series().groupby(level=0).transform('cumcount')
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype('string')
print(df)
Output
a0 a1 a2
0 1 2 3
If there is only one duplicate column, you could do:
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
# create unique identifier for each repeated column
identifier = df.columns.duplicated().astype(int)
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype(str)
print(df)
Output (for only one duplicate)
a0 a1 b0 b1
0 1 2 3 4
Add a numbering suffix starting with '_1' from the first duplicated column onward, applied only to columns that appear more than once.
E.g. the column name list [a, b, c, a, b, a] becomes [a, b, c, a_1, b_1, a_2]:
from collections import Counter

counter = Counter()
empty_list = []
for x in range(df.shape[1]):
    counter.update([df.columns[x]])
    if counter[df.columns[x]] == 1:
        empty_list.append(df.columns[x])
    else:
        tx = counter[df.columns[x]] - 1
        empty_list.append(df.columns[x] + '_' + str(tx))
df.columns = empty_list
df.columns

Query if value in list using R and PostgreSQL

I have a dataframe like this
df1
ID value
1 c(YD11,DD22,EW23)
2 YD34
3 c(YD44,EW23)
4
And I want to query another database to tell me how many rows had these values in them. This will eventually be done in a loop through all rows but for now I just want to know how to do it for one row.
Let's say the database looks like this:
sql_database
value data
YD11 2222
WW20 4040
EW23 2114
YD44 3300
XH29 2040
So if I just looked at row 1, I would get:
dbGetQuery(con,
           sprintf("SELECT * FROM sql_database WHERE value IN %i",
                   df1$value[1])) %>%
  nrow()
OUTPUT:
2
And the other rows would be :
Row 2: 0
Row 3: 2
Row 4: 0
I don't need the loop written for me, but because my code doesn't work I would like to know how to query all rows of a table whose value appears in an R list.
You do not need a for loop for this.
library(tidyverse)
library(DBI)
library(dbplyr)
df1 <- tibble(
id = 1:4,
value = list(c("YD11","DD22","EW23"), "YD34", c("YD44","EW23"), NA)
)
# creating in memory database table
df2 <- tibble(
value = c("YD11", "WW20", "EW23", "YD44", "XH29"),
data = c(2222, 4040, 2114, 3300, 2040)
)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Add auxiliary schema
tmp <- tempfile()
DBI::dbExecute(con, paste0("ATTACH '", tmp, "' AS some_schema"))
copy_to(con, df2, in_schema("some_schema", "some_sql_table"), temporary = FALSE)
# counting rows
df1 %>%
  unnest(cols = c(value)) %>%
  left_join(tbl(con, dbplyr::in_schema("some_schema", "some_sql_table")) %>% collect(), by = "value") %>%
  mutate(data = if_else(is.na(data), 0, 1)) %>%
  group_by(id) %>%
  summarise(n = sum(data))
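If you would rather stay in plain SQL, a per-row count can also be built with a quoted IN clause. A minimal sketch, assuming con is a live DBI connection to a database that actually holds a table named sql_database as in the question (the in-memory example above uses some_schema.some_sql_table instead), and that df1$value is a list column as in the tibble above; count_matches is a hypothetical helper:
# hypothetical helper: count how many rows of sql_database match any of the supplied values
count_matches <- function(vals, con) {
  vals <- vals[!is.na(vals)]
  if (length(vals) == 0) return(0L)
  quoted <- paste(DBI::dbQuoteString(con, vals), collapse = ", ")
  sql <- sprintf("SELECT COUNT(*) AS n FROM sql_database WHERE value IN (%s)", quoted)
  DBI::dbGetQuery(con, sql)$n
}
sapply(df1$value, count_matches, con = con)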

How to update a SQL table with an R dataframe?

This seems like it should be simple, but as a complete beginner to SQL I've spent a large amount of time trying to figure this out without any luck.
Say I have a dataframe in R like this:
df1 <- data.frame(value = c(1, 2, 3, 4),
ID = c("foo", "bar", "baz", "waffle"))
# value ID
# 1 1 foo
# 2 2 bar
# 3 3 baz
# 4 4 waffle
I can load it into a SQLite database easily:
table_name <- "mytable"
library("RSQLite")
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, table_name, df1, row.names = FALSE)
dbListTables(con)
# [1] "mytable"
Now say I create another dataframe that has a mix of:
- entries which are exactly the same as in the first df
- entries which are present in the first df but have updated values
- entries which are not present in the first df
df2 <- data.frame(value = c(1, 3, 5, 6),
ID = c("foo", "bar", "abc", "zzz"))
# value ID
# 1 1 foo # present in df1
# 2 3 bar # updated from df1
# 3 5 abc # not in df1
# 4 6 zzz # not in df1
Now, I want to update my SQLite table with this new dataframe:
- currently existing entries which are exactly the same should be skipped
- currently existing entries which are different should be updated
- new entries should be appended
My best guess is that the code required would be structured like this:
if(!dbExistsTable(con, table_name)){
  # write df2 to the table, if the table doesn't already exist
  dbWriteTable(con, table_name, df2, row.names = FALSE)
} else {
  # update the entries in the table with the entries in df2
  for(i in seq(nrow(df2))){
    irow <- df2[i, ]
    # check if irow is already in the table
    # if it's already in the table, update the table's entry if it's different
    # otherwise append irow to the table
    # or skip it if irow is already present and identical
  }
}
But what code would be used to actually check whether the row (irow) is already present in the SQLite table, and then update it? It seems like it might be some usage of dbBind(), but I haven't been able to find a working example of how to use it for this purpose, and the docs are not clear. Is this kind of method going to be efficient for millions of entries and an arbitrary number of columns?
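Not necessarily the only way, but one common approach is to give the table a primary key and let SQLite do the merge with an upsert. A minimal sketch, assuming ID can serve as the primary key and that the bundled SQLite version supports INSERT ... ON CONFLICT (table and column names are taken from the question):
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# create the table with ID as the primary key so that conflicts can be detected
dbExecute(con, "CREATE TABLE mytable (value REAL, ID TEXT PRIMARY KEY)")
dbWriteTable(con, "mytable", df1, append = TRUE, row.names = FALSE)

# upsert df2 in one call: new IDs are inserted, existing IDs get their value updated
dbExecute(
  con,
  "INSERT INTO mytable (value, ID) VALUES (:value, :ID)
   ON CONFLICT(ID) DO UPDATE SET value = excluded.value",
  params = list(value = df2$value, ID = as.character(df2$ID))
)
dbReadTable(con, "mytable")
Rows of df2 that are identical to what is already stored are simply rewritten with the same values, which has the same effect as skipping them, and because the binding happens in a single dbExecute() call there is no R-level loop over rows.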