I want to calculate a moving difference for my dataset (a dataframe) below.

How do I add another column with a moving difference of Column2?
For example, I want to add a column with the following values: (0, -372706.6, -284087.1, -119883.7, etc.)

Here's a way to go about it.
## For a small dataset
x <- data.frame(matrix(nrow=7,ncol=2,c(0,12,1,10,2,9.5,3,8,4,7,5,5,6,2),byrow = T))
names(x) <- c("Time","Count")
x[1,"Diff"] <- NA
x[2:nrow(x),"Diff"] <- rev(diff(rev(x$Count)))
There is a way to do it with the plyr package as well.
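A shorter base R variant, as a minimal sketch: diff() already returns the current value minus the previous one, so padding it with a leading NA is enough. Note that the rev(diff(rev(...))) version above returns the opposite sign (previous minus current), so which one you want depends on the sign convention you expect. The column names are the ones from the small example above.

```r
# Same small dataset as above
x <- data.frame(Time = 0:6, Count = c(12, 10, 9.5, 8, 7, 5, 2))

# diff() gives Count[i] - Count[i - 1]; the first row has no predecessor
x$Diff <- c(NA, diff(x$Count))

x$Diff
# NA -2.0 -0.5 -1.5 -1.0 -2.0 -3.0
```

If the first row should be 0 rather than NA, as in the question, use c(0, diff(x$Count)) instead.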


How to specify vartype for sqlSave() for multiple columns without manually typing in R?

Some reproducible code. Note that your database and server name may be different.
library(RODBC)
iris <- iris
connection <- odbcDriverConnect(
"Driver={SQL Server};
Server=localhost\\SQLEXPRESS;
Database=testdb;
Trusted_connection=true;"
)
# create table in sql and move dataframe values in
columnTypes <- list(
Sepal.Length="decimal(28,0)",
Sepal.Width="decimal(28,0)",
Petal.Length ="decimal(28,0)",
Petal.Width ="decimal(28,0)",
Species = "varchar(255)"
)
sqlSave(connection,iris,varTypes = columnTypes)
This is how I export a dataframe to SQL Server Management Studio as a table, and it works. But say I have one hundred new columns in iris. Do I have to set each column name to decimal(28,0) in my columnTypes variable?
# but what if I have many more columns?
iris$random1 <- rnorm(150)
iris$random2 <- rnorm(150)
# ... random3 through random999 created the same way ...
iris$random1000 <- rnorm(150)
By default the columns go in as floats, at least in my actual dataframe (iris is just the example), so that's why I need to change them manually in columnTypes. I want everything after the 5 original columns in iris to be in decimal(28,0) format without manually including them in columnTypes.
I did not look into sqlSave() itself or possible alternatives, so there might be a more appropriate solution. In any case, you can generate the list of wanted definitions in base R by repetition and by combining lists:
# dummy data
df <- data.frame(Sepal.Length = 1:5,Sepal.Width = 1:5,Petal.Length = 1:5,Petal.Width = 1:5,Species = 1:5,col6 = 1:5,col7 = 1:5)
# all column names after the 5th -> with fewer than 6 columns, 6:ncol(df) counts down and picks the wrong columns!
vec <- names(df)[6:ncol(df)]
# generate list with same definition for all columns after the 5th
list_after_5 <- as.list(rep("decimal(28,0)", length(vec)))
# name the list items according to the columns after the 5th
names(list_after_5) <- vec
# first 5 column definitions manually, plus the remaining columns with the same definition from the list generated above
columnTypes <- c(list(Sepal.Length="decimal(28,0)",
Sepal.Width="decimal(28,0)",
Petal.Length ="decimal(28,0)",
Petal.Width ="decimal(28,0)",
Species = "varchar(255)"),
list_after_5)
columnTypes
$Sepal.Length
[1] "decimal(28,0)"
$Sepal.Width
[1] "decimal(28,0)"
$Petal.Length
[1] "decimal(28,0)"
$Petal.Width
[1] "decimal(28,0)"
$Species
[1] "varchar(255)"
$col6
[1] "decimal(28,0)"
$col7
[1] "decimal(28,0)"
Since your example code seems to be working for you, this should work as well (judging by the output). I did not test it with a DB, as I have no test setup available at the moment.
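The same idea can be written a little more compactly, as a sketch: keep the hand-written types in a named character vector and fill in every other column by name with setdiff() and setNames(). This avoids the 6:ncol(df) indexing, so it also behaves when there are no extra columns. The names base_types and extra_cols are made up for this illustration, and like the answer above this was not run against a database.

```r
# dummy data with two extra columns after the original five
df <- data.frame(Sepal.Length = 1:5, Sepal.Width = 1:5, Petal.Length = 1:5,
                 Petal.Width = 1:5, Species = 1:5, col6 = 1:5, col7 = 1:5)

# types written out by hand for the original columns
base_types <- c(Sepal.Length = "decimal(28,0)",
                Sepal.Width  = "decimal(28,0)",
                Petal.Length = "decimal(28,0)",
                Petal.Width  = "decimal(28,0)",
                Species      = "varchar(255)")

# every column that has no hand-written type gets the default type
extra_cols  <- setdiff(names(df), names(base_types))
columnTypes <- c(base_types,
                 setNames(rep("decimal(28,0)", length(extra_cols)), extra_cols))

columnTypes
```

The result is a named character vector with one entry per column of df, which could then be passed as varTypes = columnTypes in the sqlSave() call from the question.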

How is writing a for loop in R different from Stata?

I am working on a project in R and, coming from Stata, am having a hard time adjusting to how for loops work.
When I run this code below, I get exactly what I want. I create a new dataframe "states_1" out of an already existing dataframe "data_frame1."
states_1 <- data_frame1[["interest_over_time"]]
states_1 <- states_1[-c(1:66),]
states_1 = dcast(states_1, date ~ keyword + geo, value.var = "hits")
However, I have almost 15 dataframes and this is a shared project, so I would rather use a for loop than repeat this for every dataframe. The code below, however, tells me that I am trying to treat a function like a dataframe. I would appreciate help with how to proceed.
for(i in 1:13) {
states[i] <- data_frame[i][["interest_over_time"]]
states[i] <- states[i][-c(1:66),]
states[i] = dcast(states[i], date ~ keyword + geo, value.var = "hits")
}
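One likely cause, stated as an assumption: unlike a Stata macro, R does not paste i onto a name, so data_frame[i] tries to index an object literally called data_frame, and if that name resolves to a function you get the "function" error. The usual R pattern is to put the objects in a list and index it with [[i]]. The sketch below uses made-up stand-in data (make_fake and process_one are hypothetical helpers), and the dcast step is left as a comment so the example runs without extra packages.

```r
# Hypothetical stand-in for one of the real objects: a list holding an
# "interest_over_time" data frame with the columns used in the question.
make_fake <- function(n) {
  list(interest_over_time = data.frame(
    date    = rep(seq_len(n), 2),
    keyword = "kw",
    geo     = rep(c("US-AL", "US-AK"), each = n),
    hits    = seq_len(2 * n)
  ))
}

# In the real project this would be list(data_frame1, data_frame2, ...)
data_frames <- list(make_fake(70), make_fake(80))

# The three steps from the question, wrapped in one function
process_one <- function(d) {
  s <- d[["interest_over_time"]]
  s <- s[-(1:66), ]
  # the reshaping step from the question would go here, e.g.
  # s <- reshape2::dcast(s, date ~ keyword + geo, value.var = "hits")
  s
}

# for loop over list elements; note [[i]], not [i]
states <- vector("list", length(data_frames))
for (i in seq_along(data_frames)) {
  states[[i]] <- process_one(data_frames[[i]])
}

# equivalent one-liner
states2 <- lapply(data_frames, process_one)
```

Each result is then available as states[[1]], states[[2]], and so on, rather than as separately named objects.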

Leave axis ticks for blank treatments in ggplot r

I have a bunch of dfs with 19 treatments that I'm plotting subsets of. Trying to figure out how to leave columns for all 19 treatments in the plots, even if they have no values that are being plotted in that specific plot. Smaller reproducible example below.
library(ggplot2)
set.seed(3)
df <- data.frame(matrix(ncol=2,nrow=200))
df$X1 <- rep(c("A","B","C","D","E"),each = 40)
df$X2 <- runif(200,20,50)
ggplot(df,aes(x=X1,y=X2,color=X1))+
geom_dotplot(binaxis="y",data= df[df$X2>48,])+
geom_boxplot(data=df[df$X2>48,],varwidth = T)
See how there are only columns for A, B, C, and E? How do I make sure it leaves a column for D? I would also need it to skip a color, so that A, B, C, D, and E always have consistent colors across all the different plots.
(It would also be preferable if I could put the subset code in the ggplot() call, so I wouldn't have to write the subsets over and over again.)
I tried adding
scale_x_date(drop=F)+
as a line, but nothing changed.
Thanks.
You can subset your data in the main ggplot() call. The limits of the x-axis and colours can be set manually:
fullvar <- unique(df$X1)
ggplot(df[df$X2 > 48,],aes(x=X1,y=X2,color=X1))+
geom_dotplot(binaxis="y")+
geom_boxplot(varwidth = T) +
scale_x_discrete(limits = fullvar) +
scale_color_discrete(limits = fullvar)
EDIT: Overlooked the colours question.
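An alternative sketch, assuming X1 can be stored as a factor: a factor keeps all its levels after subsetting, and the discrete scales have a drop argument that controls whether unused levels are removed. The requireNamespace() guard is only there so the factor part still runs where ggplot2 is not installed.

```r
set.seed(3)
df <- data.frame(X1 = rep(c("A", "B", "C", "D", "E"), each = 40),
                 X2 = runif(200, 20, 50))

# Store the treatment as a factor with the full set of levels
df$X1 <- factor(df$X1, levels = c("A", "B", "C", "D", "E"))

# Subsetting rows does not drop factor levels
sub <- df[df$X2 > 48, ]
levels(sub$X1)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # drop = FALSE keeps unused levels on the axis and in the colour mapping
  p <- ggplot(sub, aes(x = X1, y = X2, color = X1)) +
    geom_boxplot(varwidth = TRUE) +
    scale_x_discrete(drop = FALSE) +
    scale_color_discrete(drop = FALSE)
  print(p)
}
```

With the levels fixed once on the full data, every subset plot maps the same treatment to the same position and colour.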

How to add two boxplots in a same graph in ggplot2

I have this sample data.
sample <- data.frame(sample = 1:12,
site = c('A','A','A','B','B','B','A','A','A','B','B','B'),
month = c(rep('Feb', 6), rep('Aug', 6)),
Ar = c(7,8,9,8,9,9,4,5,7,5,8,9))
And created two boxplots
ggplot(sample, aes(x=factor(month), y=Ar)) +
geom_boxplot(aes(fill=site))
ggplot(sample, aes(x=factor(month), y=Ar)) +
geom_boxplot()
I wonder if there is a way to combine them in the same graph so that total, site A and site B are right next to each other per each month.
You could utilize dplyr (via the tidyverse package) and reshape2.
library(dplyr)
library(reshape2)
library(ggplot2)
sample %>%
dplyr::select(-sample) %>%
mutate(global = 'Global') %>%
melt(., id.vars=c("month", "Ar")) %>%
ggplot(aes(month, Ar)) + geom_boxplot(aes(fill=value))
This drops the sample column, as you aren't currently using it, adds the constant label 'Global' in a separate column, reshapes the data via the melt function, and generates a figure. Note that I changed the input code format in your original question. With the changes to the data.frame you no longer need to coerce the variables to factors.
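A base R sketch of the same idea, without dplyr or reshape2: stack the data on top of a copy of itself whose site is relabelled, so every observation appears once under its own site and once under the overall group. The label "Total" and the names sample_df and combined are made up for this illustration; the plotting call is guarded so the data step runs even without ggplot2.

```r
sample_df <- data.frame(sample = 1:12,
                        site = c('A','A','A','B','B','B','A','A','A','B','B','B'),
                        month = c(rep('Feb', 6), rep('Aug', 6)),
                        Ar = c(7,8,9,8,9,9,4,5,7,5,8,9))

# Copy of the data where every row belongs to the overall group
total <- transform(sample_df, site = "Total")

# Original rows plus the relabelled copy
combined <- rbind(sample_df, total)
table(combined$site)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # one box per site (A, B, Total) within each month
  p <- ggplot(combined, aes(x = month, y = Ar, fill = site)) +
    geom_boxplot()
  print(p)
}
```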

selecting every Nth column in using SQLDF or read.csv.sql

I am rather new to using SQL statements, and am having a little trouble using them to select the desired columns from a large table and pulling them into R.
I want to take a csv file and read selected columns into R, in particular every 9th and 10th column. In R, something like:
read.csv.sql("myfile.csv", sql(select * from file [EVERY 9th and 10th COLUMN])
My trawl of the internet suggests that selecting every nth row could be done with an SQL statement using MOD something like this (please correct me if I am wrong):
"SELECT *
FROM file
WHERE (ROWID,0) IN (SELECT ROWID, MOD(ROWNUM,9) OR MOD(ROWNUM,10)"
Is there a way to make this work for columns? Thanks in advance.
read.csv would be adequate for this:
# path to the csv file (file name taken from the question)
myfile <- "myfile.csv"
# determine number of columns
DF1 <- read.csv(myfile, nrows = 1)
nc <- ncol(DF1)
# create a list nc long where unwanted columns are NULL and wanted are NA
colClasses <- rep(rep(list("NULL", NA), c(8, 2)), length = nc)
# read in
DF <- read.csv(myfile, colClasses = colClasses)
With sqldf, replace the last line with these:
library(sqldf)
nms <- names(DF1)
vars <- toString(nms[is.na(colClasses)])
DF <- fn$read.csv.sql(myfile, "select $vars from file")
UPDATE: switched to read.csv.sql
UPDATE 2: correction.
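To try the colClasses approach end to end without a real file, here is a small self-contained sketch that writes a 20-column csv to a temporary file and reads back only columns 9, 10, 19 and 20. It uses a character vector for colClasses rather than a list, which is a small variation on the code above; the temporary file and the V1..V20 column names are just test data.

```r
# Write a small 20-column csv to a temporary file
tmp  <- tempfile(fileext = ".csv")
wide <- as.data.frame(matrix(1:60, nrow = 3, ncol = 20))  # columns V1..V20
write.csv(wide, tmp, row.names = FALSE)

# Number of columns, read from the first data row only
nc <- ncol(read.csv(tmp, nrows = 1))

# "NULL" skips a column, NA lets read.csv guess its type:
# skip 8, keep 2, repeated across all columns
colClasses <- rep(rep(c("NULL", NA), c(8, 2)), length.out = nc)

DF <- read.csv(tmp, colClasses = colClasses)
names(DF)
# "V9"  "V10" "V19" "V20"
```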