Add values from a column when two other columns match - vba

I have an ecology data table with about 12,000 rows. There are three columns: site, species, and value. I need to add up the values for each set of matching site and species - for example, all "red maple" values at "site A". I have the data sorted by site and species, so I can do it by hand, but it's slow going. The number of site/species matches varies, so I can't just add up the values in sets of three or anything.
Similar types of questions have talked about pivot tables, but none have needed to match two columns and add a third column, and I haven't been able to figure out how to extrapolate to my situation.
I'm reasonably comfortable coding and would like to do something that looks like this pseudocode, but I'm not clear on the syntax in VBA:
For each row
if a(x) = a(x+1) and b(x) = b(x+1) then
sum = sum + c(x)
else
d(x) = sum
sum = 0
next
Any ideas?

In a PivotTable, put site in Row Labels and species in Column Labels (or vice versa) and Sum of value in Σ Values:

Related

Subtract the mean of a group for a column away from a column value

I have a companies dataset with 35 columns. The companies can belong to one of 8 different groups. How do I for each group create a new dataframe which subtract the mean of the column for that group away from the original value?
Here is an example of part of the dataset.
So for example for row 1 I want to subtract the mean of BANK_AND_DEP for Consumer Markets away from the value of 7204.400207. I need to do this for each column.
I assume this is some kind of combination of a transform and a lambda - but cannot hit the syntax.
Although it might seem counter-intuitive for this to involve a loop at all, looping through the columns themselves allows you to do this as a vectorized operation, which will be quicker than .apply(). For what to subtract by, you'll combine .groupby() and .transform() to get the value you need to subtract from a column. Then, just subtract it.
for column in df.columns:
df['new_'+column] = df[column]-df.groupby('Cluster')['column'].transform('mean')

Need explanation on how pandas.drop is working here

I have a data frame, lets say xyz. I have written code to find out the % of null values each column possess in the dataframe. my code below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
let say i got following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% of null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
but , i did not understand how does this code works, can anyone please explain it?
the code is doing the following:
xyz.drop( [...], 1)
removes the specified elements for a given axis, either by row or by column. In this particular case, df.drop( ..., 1) means you're dropping by axis 1, i.e, column
xyz.loc[:, ... ].columns
will return a list with the column names resulting from your slicing condition
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
this instruction is counting the number of nulls, adding them up and normalizing by the number of rows, effectively computing the percentage of nan in each column. Then, the amount is rounded to have only 2 decimal positions and finally you return True is the number of nan is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you're first producing a Boolean array that marks which columns have more than 70% nan, then, using .loc you use Boolean indexing to look only at the columns you want to drop ( nan % > 70%), then using .columns you recover the name of such columns, which then are used by the .drop instruction.
Hopefully this clear things up!
If you code is hard to understand , you can just check dropna with thresh, since pandas already cover this case.
df=df.dropna(axis=1,thresh=round(len(df)*0.3))

One ggplot from two data frames (1 bar each)

I was looking for an answer everywhere, but I just couldn't find one to this problem (maybe I was just too stupid to use other answers, because I'm new to R).
I have two data frames with different numbers of rows. I want to create a plot containing a single bar per data frame. Both should have the same length and the count of different variables should be stacked over each other. For example: I want to compare the proportions of gender in those to data sets.
t1<-data.frame(cbind(c(1:6), factor(c(1,2,2,1,2,2))))
t2<-data.frame(cbind(c(1:4), factor(c(1,2,2,1))))
1 represents male, 2 represents female
I want to create two barplots next to each other that represent, that the proportions of gender in the first data frame is 2:4 and in the second one 2:2.
My attempt looked like this:
ggplot() + geom_bar(aes(1, t1$X2, position = "fill")) + geom_bar(aes(1, t2$X2, position = "fill"))
That leads to the error: "Error: stat_count() must not be used with a y aesthetic."
First I should merge the two dataframes. You need to add a variable that will identify the origin of the data, add in both dataframes a column with an ID (like t1 and t2). Keep in mind that your columnames are the same in both frames so you will be able to use the function rbind.
t1$data <- "t1"
t2$data <- "t2"
t <- (rbind(t1,t2))
Now you can make the plot:
ggplot(t[order(t$X2),], aes(data, X2, fill=factor(X2))) +
geom_bar(stat="identity", position="stack")

recoding multiple variables in the same way

I am looking for the shortest way to recode many variables in the same way.
For example I have data frame where columns a,b,c are names of items of survey and rows are observations.
d <- data.frame(a=c(1,2,3), b=c(1,3,2), c=c(1,2,1))
I want to change values of all observations for selected columns. For instance value 1 of column "a" and "c" should be replaced to string "low" and values 2,3 of these columns should be replaced to "high".
I do it often with many columns so I am looking for function which can do it in very simple way, like this:
recode2(data=d, columns=a,d, "1=low, 2,3=high").
Almost ok is function recode from package cars, but if I have 10 columns to recode I have to rewrite it 10 times and it is not as effective as I want.

R: Matching two tables on multiple columns and creating a matched/not matched flag

I'm a beginner to R from a SAS background trying to do a basic "case when" match on two tables to get a flag where I have and have not found a match. Please see the SAS code I have in mind below. I just need something analogous to this in R. Thanks in advance.
proc sql;
create table
x as
select
a.*,
b.*,
case when a.first_column=b.column_first and
a.second_column=b.column_second
then 1 else 0 end as matched_flag
from table1 as a
left join
table2 as b
on a.first_column=b.column_first and a.second_column=b.column_second;
quit;
I'm not familiar with SAS, but I think I understand what you are trying to do. To see how many rows/columns are similar between two tables, you can use %in% and the length function.
For example, initialize two matrices of different dimensions and given them similar row names and column names:
mat.a <- matrix(1, nrow=3, ncol = 2)
mat.b <- matrix(1, nrow=2, ncol = 3)
rownames(mat.a) <- c('a','b','c')
rownames(mat.b) <- c('a','d')
colnames(mat.a) <- c('g','h')
colnames(mat.b) <- c('h','i')
mat.a and mat.b now exist with different row and column names. To match the rows by names, you can use:
row.match <- rownames(mat.a)[rownames(mat.a) %in% rownames(mat.b)]
num.row.match <- length(row.match)
Note that row.match can now be used to index into both of the matrices. The %in% operator returns a logical of the same length of the first argument (in this case, rownames(mat.a)) that indicates if the ith element of the first argument was found anywhere in the elements of the second argument. This nature of %in% means that you have to be sensitive to how you order the arguments for your indexing.
If you simply want to quantify how many rows or columns are the same between the two matrices, then you can use the sum function with the %in% operator:
sum(rownames(mat.a) %in% rownames(mat.b))
With the sum function used like this, you do not need to be sensitive to how you order the arguments, because the number of row names of mat.a in row names of mat.b is equivalent to the number of row names of mat.b in row names of mat.a. That is to say that this usage of %in% is commutative.
I hope this helps!
You will want to use dataframe objects. These are like datasets in SAS. You can use bind to put two dataframe objects together side by side. Then you can select rows based on conditions and set the flag based on this. In the code below you will see that I did this twice: once to set the 1 flag and once to set the 0 flag.
To select the rows where all fields match you can do something similar, but instead of assigning a new column you can assign all the results back to the name of the table you are working on.
Here's the code:
# make up example a and b data frames
table1 <- data.frame(list(a.first_column=c(1,2,3),a.second_column=c(4,5,6)))
table2 <- data.frame(list(b.first_column=c(1,3,6),b.second_column=c(4,5,9)))
# Combine columns (horizontally)
x <- cbind(table1, table2)
print("Combined Data Frames")
print(x)
# create matched flag (1 when the first columns match)
x$matched_flag[x$a.first_column==x$b.first_column] <- 1
x$matched_flag[!x$a.first_column==x$b.first_column] <- 0
# only select records that match both data frames
x <- x[x$a.first_column==x$b.first_column & x$a.second_column==x$b.second_column,]
print("Matched Data Frames")
print(x)
BTW: since you are used to using SQL, you might want to try the sqldf package in R. It will let you use the same techniques that you are used to but in R and on data frames.