Create new data frame of percentages from values of an old data frame? - sql

So I want to create a new data frame adding the values of the Sometimes and Often column and dividing it by the values of the total column and multiplying it by 100 to get percentages (unless there is a function that automatically does this in R). How would I go about doing that?

You have added an "sql" tag to your question. Should you prefer SQL over R for reasons of experience and/or knowledge you might be interested in the fabulous sqldf package which allows you to use SQL syntax within R. You will have to download it first via install.packages("sqldf") and then you can use it as in
expl <- data.frame(sometimes = c(1, 2, 4), often = c(2, 2, 2), total =c(6, 9, 8))
library(sqldf)
sqldf("SELECT 100*(sometimes+often)/total FROM expl")

The far more often used way is to add a percent column to the same data.frame instead of introducing a new one. That way, all data are kept together and you do not loose the link to e. g. the week column.
One way to go about that would be the following one-liner:
expl <- data.frame(sometimes = c(1, 2, 4), often = c(2, 2, 2), total =c(6, 9, 8))
print(expl)
expl$percent = 100 * (expl$sometimes + expl$often)/expl$total
print(expl)

First, it looks as though Total, Sometimes, and Often are character because you have commas in them, so you would need to get rid of the commas and convert them to numeric. You can do that as follows (assuming your dataframe is called mydata):
for(i in c("Total","Sometimes","Often")) mydata[[i]] = as.numeric(gsub(",", "", mydata[[i]])
Then you can use the answer by Bernard:
mydata$percent = 100 * (mydata$Sometimes + mydata$Often)/mydata$Total

Another option using the tidyverse:
library(tidyverse)
newdataframe <- olddataframe %>%
mutate(percent = (Sometimes+Often)/Total*100) %>%
select(percent)
But as said before, better leave the percentage column with the other data. In that case, remove the %>% select(percent).

Related

How to apply a value from one row to all rows with the same ID in R

I currently have a large data set that is in long format. I am hoping to apply a value only given at baseline (repeat_instance = 0) to all follow up instances (repeat_instance = 1, 2, 3+) based on the record_id.
While I cannot share the actual data I have created a simplified example below to illustrate the quesiton.
record_id <- c(1,1,1,2,3,4,4,5,6,7,8,8,9,10,10,10)
repeat_instance <- c(0,1,2,0,0,0,1,0,0,0,0,1,0,0,1,2)
reason_for_visit <- c(1,NA,NA,1,2,1,NA,1,2,3,1,NA,1,1,NA,NA)
Current Format:
Desired Outcome:
I have seen solutions in Excel, however am not sure which formula may be useful in R.
We can use fill from tidyr
library(tidyr)
fill(df1, reason_for_visit)
data
df1 <- data.frame(record_id, repeat_instance, reason_for_visit)

Taking mean of N largest values of group by absolute value

I have some DataFrame:
d = {'fruit': ['apple', 'pear', 'peach'] * 6, 'values': np.random.uniform(-5,5,18), 'values2': np.random.uniform(-5,5,18)}
df = pd.DataFrame(data=d)
I can take the mean of each fruit group as such:
df.groupby('fruit').mean()
However, for each group of fruit, I'd like to take the mean of the N number of largest values as
ranked by absolute value.
So for example, if my values were as follows and N=3:
[ 0.7578507 , 3.81178045, -4.04810913, 3.08887538, 2.87999752, 4.65670954]
The desired outcome would be (4.65670954 + -4.04810913 + 3.81178045) / 3 = ~1.47
Edit - to clarify that sign is preserved in outcome:
(4.65670954 + -20.04810913 + 3.81178045) / 3 = -3.859
Updating with a new approach that I think is simpler. I was avoiding apply like the plague but maybe this is one of the more acceptable uses. Plus it fixes the fact that you want to mean the original values as ranked by their absolute values:
def foo(d):
return d[d.abs().nlargest(3).index].mean()
out = df.groupby('fruit')['values'].apply(foo)
So you index each group by the 3 largest absolute values, then mean.
And for the record my original, incorrect, and slower code was:
df['values'].abs().groupby(df['fruit']).nlargest(3).groupby("fruit").mean()

geom_nodelabel_repel() position for circular ggraph plot

I have a network diagram that looks like this:
I made it using ggraph and added the labels using geom_nodelabel_repel() from ggnetwork:
( ggraph_plot <- ggraph(layout) +
geom_edge_fan(aes(color = as.factor(responses), edge_width = as.factor(responses))) +
geom_node_point(aes(color = as.factor(group)), size = 10) +
geom_nodelabel_repel(aes(label = name, x=x, y=y), segment.size = 1, segment.color = "black", size = 5) +
scale_color_manual("Group", values = c("#2b83ba", "#d7191c", "#fdae61")) +
scale_edge_color_manual("Frequency of Communication", values = c("Once a week or more" = "#444444","Monthly" = "#777777",
"Once every 3 months" = "#888888", "Once a year" = "#999999"),
limits = c("Once a week or more", "Monthly", "Once every 3 months", "Once a year")) +
scale_edge_width_manual("Frequency of Communication", values = c("Once a week or more" = 3,"Monthly" = 2,
"Once every 3 months" = 1, "Once a year" = 0.25),
limits = c("Once a week or more", "Monthly", "Once every 3 months", "Once a year")) +
theme_void() +
theme(legend.text = element_text(size=16, face="bold"),
legend.title = element_text(size=16, face="bold")) )
I want to have the labels on the left side of the plot be off to the left, and the labels on the right side of the plot to be off to the right. I want to do this because the actual labels are quite long (organization names) and they get in the way of the lines in the actual plot.
How can I do this using geom_nodelabel_repel()? i've tried different combinations of box_padding and point_padding, as well as h_just and v_just but these apply to all labels and it doesn't seem like there is a way to subset or position specific points.
Apologies for not providing a reproducible example but I wasn't sure how to do this without compromising the identities of respondents from my survey.
Well, there is always the manually-intensive, yet effective method of separately adding the geom_node_label_repel function for the nodes on the "left" vs. the "right" of the plot. It's not at all elegant and probably bad coding practice, but I've done similar things myself when I can't figure out an elegant solution. It works really well when you don't have a very large dataset to begin with and if you are not planning to make the same plot over and over again. Basically, it would entail:
Identifying if there exists a property in your dataset that places points on the "left" vs. the "right". In this case, it doesn't look like it, so you would just have to create a list manually of those entries on the "left" vs. "right" of your plot.
Using separate calls to geom_node_label_repel with different nudge_x values. Use any reasonable method to subset the "left" and "right datapoints. You can create a new column in the dataset, or use formatting in-line like data = subset(your.data.frame, property %in% left.list)
For example, if you created a column called subset.side, being either "left" or "right" in your data.frame (here: your.data.frame), your calls to geom_node_label_repel might look something like:
geom_node_label_repel(
data=subset(your.data.frame, subset.side=='left'),
aes(label=name, x=x, y=y), segment.size=1, segment.color='black', size=5,
nudge_x=-10
) +
geom_node_label_repel(
data=subset(your.data.frame, subset.side=='right'),
aes(label=name, x=x, y=y), segment.size=1, segment.color='black', size=5,
nudge_x=10
) +
Alternatively, you can create a list based on the label name itself--let's say you called those lists names.left and names.right, where you can subset accordingly by swapping in as represented in the pseudo code below:
geom_node_label_repel(
data=subset(your.data.frame, name %in% names.left),...
nudge_x = -10, ...
) +
geom_node_label_repel(
data=subset(your.data.frame, name %in% names.right),...
nudge_x = 10, ...
)
To be fair, I have not worked with the node geoms before, so I am assuming here that the positioning of the labels will not affect the mapping (as it would not with other geoms).

Rstudio and ggplot, stacked bar graph: Two almost identical bits of code, fct_reorder works with one, and not the other?

I wrote this which perfectly reorders the variable mcr_variant by count on my bar graph.
mcrxgenus %>%
mutate(mcr_variant = fct_reorder(mcr_variant, count)) %>%
ggplot( aes(fill=isolate_genus, y=count, x=mcr_variant)) +
geom_bar(position="stack", stat="identity") +
coord_flip() +
labs(x="MCR variant", y="Count", fill="Isolate genus")
I wrote this to display the same dataset a bit differently.
mcrxgenus %>%
mutate(isolate_genus = fct_reorder(isolate_genus, count)) %>%
ggplot( aes(fill=mcr_variant, y=count, x=isolate_genus)) +
geom_bar(position="stack", stat="identity")+
coord_flip()+
labs(x="Isolate genus", y="Count", fill="MCR variant")
It does NOT reorder my bar graph by count. I have absolutely no idea what is going on. It seems to me there should be no reason for this. mcr_variant and isolate_genus are both categorical variables. mcr_variant has 12 levels and isolate_genus has 6 possible levels. That is the only difference I can think of. Anyone run in to this problem before? It's been driving me mad! I have no idea what is happening here!
When you stack bars up, you're adding their values. fct_reorder, by default, takes the median of the values. So if MCR Variant A has counts 1, 1, 1, 2, 1, its order will be determined by the median count, 1, while its height is the sum of the counts, 6. Meanwhile if MCR Variant B has counts 2, 3, its order will be the median 2.5, but its sum is 5.
You need to make fct_reorder use sum, just like your stacked bar graph. Replace fct_reorder(isolate_genus, count) with fct_reorder(isolate_genus, count, sum).
If this doesn't work, please share a reproducible sample of data, preferably with dput so the classes are preserved and everything is copy/pasteable e.g., dput(mcrxgenus[1:10, ]) for the first 10 rows. Pick a suitable sample to illustrate the problem.

R forecast output to SQL Server

I am doing database analysis using SQL Server and forecasting using R. I need to get the results from R back into the SQL Server database. One approach is to output the forecast data to a text file using write.table and import using BULK INSERT. Is there a better way?
You can use dbBulkCopy from rsqlserver package. It is a DBI extension that interfaces the Microsoft SQL Server popular command-line utility named bcp to quickly bulk copying large files into table.
dat <- matrix(round(rnorm(nrow * ncol), 2), nrow = nrow, ncol = ncol)
colnames(dat) <- cnames
id.file = "temp_file.csv"
write.csv(dat, file = id.file, row.names = FALSE)
dbBulkCopy(conn, "NEW_BP_TABLE", value = id.file)
Thanks for your comments and answers! I went with a solution based on the comment by nrussell. Below is my code. The specific command is the last line; I am providing the preceding lines to provide a little bit of context for anyone trying to use this answer.
data <- sqlQuery(myconn, query) # returns time series with year, month (both numeric), and value
data_ts <- ts(data$value,
start=c(data$year[1],data$month[1]), # start is first year and month
end=c(data$year[nrow(data)],data$month[nrow(data)]), # end is last year and month
frequency=12)
data_fit <- auto.arima(data_ts)
fct <- forecast(data_ts, 12)
sqlQuery(myconn, 'truncate table dgtForecast') # Pre-existing table
sqlSave(myconn, data.frame(fct), tablename='dgtForecast', rownames='MonthYear', append=TRUE)