Selecting with sqldf values greater than median of all values - sql

I am trying to find all the stock records where the yield is greater than the median for all stocks using sqldf, but I am getting the error message below.
I have tried using the actual number 2.39 and it works, but I have not been successful substituting a variable to make it dynamic. Maybe a sub-select would be better?
mYd <- median(df3$Yield, na.rm = TRUE)
df4 <- sqldf("SELECT a.*
FROM df3 a
WHERE (a.Yield > mYd)
;")
Error in rsqlite_send_query(conn@ptr, statement) : no such column: mYd

The error stems from sqldf's inability to find a column in df3 called mYd. sqldf needs to find a data-frame column for every corresponding column referenced in your query. Try adding the mYd value to your df3 data frame as a proper column:
df3$mYd <- median(df3$Yield, na.rm=TRUE)
df4 <- sqldf("SELECT * FROM df3 WHERE Yield > mYd;")
Note that you don't really need to alias df3 here since it is the only table in the query, and you aren't generating any computed columns.
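If you would rather not add a throwaway column, sqldf also re-exports gsubfn's fn$ prefix, which interpolates R variables into the query string before it is sent to SQLite. A minimal sketch, assuming df3 is the data frame from the question:

```r
library(sqldf)

mYd <- median(df3$Yield, na.rm = TRUE)

# fn$ (from gsubfn, loaded with sqldf) substitutes $mYd with its value
# in the query text, so mYd can stay an ordinary R variable
df4 <- fn$sqldf("SELECT * FROM df3 WHERE Yield > $mYd")
```

This keeps the query dynamic without modifying df3.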

Related

SQL query output as a dataframe in R

I am very new to R. I wanted to use a sql query to get data with R. (I am using athena but I think this doesn't matter).
con <- dbConnect(RAthena::athena(),
s3_staging_dir = 's3://bla/bla/'
)
df <- tbl(con, sql("SELECT * FROM db.my_data"))
My problem is that df is not a data frame. So when I do names(df), for example, I don't get the columns (as I would with Python) but get "src" "ops" instead. I don't know what this means (these are not columns). How can I convert this df so it is a data frame in R?
You can use function dbGetQuery from package DBI
From documentation:
Description Returns the result of a query as a data frame.
dbGetQuery() comes with a default implementation (which should work
with most backends) that calls dbSendQuery(), then dbFetch(), ensuring
that the result is always freed by dbClearResult().
Implementation from your example:
df <- dbGetQuery(con, "SELECT * FROM db.my_data")
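Alternatively, if you want to keep the lazy tbl() approach from the question, dplyr's collect() executes the query and pulls the result into a local data frame. A sketch, assuming con is the connection object from the question:

```r
library(dplyr)

# tbl() only builds a lazy query (hence the "src" and "ops" slots);
# collect() runs it and materializes the result as a tibble
df <- tbl(con, sql("SELECT * FROM db.my_data")) %>%
  collect()

names(df)  # now returns the actual column names
```

The lazy form is useful when you want dplyr verbs translated to SQL before anything is fetched; collect() is the step that brings the data into R.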

Multiple column selection on a Julia DataFrame

Imagine I have the following DataFrame :
10 rows x 26 columns named A to Z
What I would like to do is to make a multiple subset of the columns by their name (not the index). For instance, assume that I want columns A to D and P to Z in a new DataFrame named df2.
I tried something like this, but it doesn't seem to work:
df2=df[:,[:A,:D ; :P,:Z]]
syntax: unexpected semicolon in array expression
top-level scope at Slicing.jl:1
Any idea how to do it?
Thanks for any help
df2 = select(df, Between(:A,:D), Between(:P,:Z))
or
df2 = df[:, All(Between(:A,:D), Between(:P,:Z))]
if you are sure your columns are only from :A to :Z you can also write:
df2 = select(df, Not(Between(:E, :O)))
or
df2 = df[:, Not(Between(:E, :O))]
Finally, you can easily find the index of a column using the columnindex function, e.g.:
columnindex(df, :A)
and later use column numbers, if that is something you would prefer.
In Julia you can also build Ranges with Chars and hence when your columns are named just by single letters yet another option is:
df[:, Symbol.(vcat('A':'D', 'P':'Z'))]

Conditional join using sqldf in R with time data

So I have a table (~2000 rows, call it df1) of when a particular subject received a medication on a particular date, and I have a large Excel file (>1 million rows) of weight data for subjects on different dates (call it df2).
AIM: I want to group by subject and find the weight in df2 that was recorded closest to the medication admin time in df1 using sqldf (because the tables are too big to load into R). Or alternatively, I can set up a time frame of interest (e.g. +/- 1 week of the medication being given) and find a row that falls within that timeframe.
Example:
df1 <- data.frame(
PtID = rep(c(1:5), each=2),
Dose = rep(seq(100,200,25),2),
ADMIN_TIME =seq.Date(as.Date("2016/01/01"), by = "month", length.out = 10)
)
df2 <- data.frame(
PtID = rep(c(1:5),each=10),
Weight = rnorm(50, 50, 10),
Wt_time = seq.Date(as.Date("2016/01/01"), as.Date("2016/10/31"), length.out = 50)
)
So I think I want to left_join df1 and df2, group by PtID, and set up some condition that identifies either the df2$Weight closest to df1$ADMIN_TIME or a df2$Weight within an acceptable range around df1$ADMIN_TIME, using SQL formatting.
So I tried creating a range and then querying the following:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
mutate(ADMIN_START = ADMIN_TIME - ddays(30),
ADMIN_END = ADMIN_TIME + ddays(30))
#df2.csv is the large spreadsheet saved in my working directory
result <- read.csv.sql("df2.csv", sql = "select Weight from file
left join df1
on file.Wt_time between df1.ADMIN_START and df1.ADMIN_END")
This will run but it never returns anything and I have to escape out of it. Any thoughts are appreciated.
Thanks!
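No answer appears above, but one plausible reason the query never returns is that the join condition never restricts on PtID, so every one of the >1 million weight rows is compared against every df1 row. A hedged sketch of the same query with the subject match added (assuming df2.csv has PtID, Weight and Wt_time columns as in the example; note that Date columns read from a CSV arrive as text, so both sides of the between comparison may need to be stored in a consistent, sortable format such as "YYYY-MM-DD"):

```r
library(sqldf)

# Restrict the join to the same subject AND the 30-day window;
# without the PtID equality this degenerates into a huge cross join
result <- read.csv.sql("df2.csv", sql =
  "select df1.PtID, df1.Dose, df1.ADMIN_TIME, file.Weight, file.Wt_time
   from file
   inner join df1
     on file.PtID = df1.PtID
    and file.Wt_time between df1.ADMIN_START and df1.ADMIN_END")
```

From there, picking the single closest weight per dose can be done in R on the much smaller joined result.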

Pandas - select dataframe columns if statistic is greater than certain value

I have a pandas dataframe df. I would like to select the columns which have a standard deviation greater than 1. Here is what I tried:
df2 = df[df.std() >1]
df2 = df.loc[df.std() >1]
Both generated errors. What am I doing wrong?
Use df.loc[:, df.std() > 1] and it will fix it.
The first part, :, refers to the rows, and the second part, df.std() > 1, refers to the columns.
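A minimal runnable illustration of the difference (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],   # std > 1  -> kept
    "b": [1.0, 1.1, 0.9, 1.0],   # std < 1  -> dropped
})

# Boolean mask over the columns, not the rows
mask = df.std() > 1

# df[mask] / df.loc[mask] try to align the mask with the row index
# (which is why the question's attempts fail); the column axis
# must be addressed explicitly:
df2 = df.loc[:, mask]
print(list(df2.columns))  # ['a']
```

All rows are kept; only the columns failing the condition are dropped.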

selecting every Nth column in using SQLDF or read.csv.sql

I am rather new to using SQL statements, and am having a little trouble using them to select the desired columns from a large table and pulling them into R.
I want to take a csv file and read selected columns into r, in particular, every 9th and 10th column. In R, something like:
read.csv.sql("myfile.csv", sql = "select * from file [EVERY 9th and 10th COLUMN]")
My trawl of the internet suggests that selecting every nth row could be done with an SQL statement using MOD something like this (please correct me if I am wrong):
"SELECT *
FROM file
WHERE (ROWID,0) IN (SELECT ROWID, MOD(ROWNUM,9) OR MOD(ROWNUM,10)"
Is there a way to make this work for columns? Thanks in advance.
read.csv: plain read.csv would be adequate for this:
# determine number of columns
DF1 <- read.csv(myfile, nrows = 1)
nc <- ncol(DF1)
# create a list nc long where unwanted columns are NULL and wanted are NA
colClasses <- rep(rep(list("NULL", NA), c(8, 2)), length.out = nc)
# read in
DF <- read.csv(myfile, colClasses = colClasses)
sqldf: To use sqldf, replace the last line with these:
nms <- names(DF1)
vars <- toString(nms[is.na(colClasses)])
DF <- fn$read.csv.sql(myfile, "select $vars from file")