R column values in SQL WHERE statement

I have a dataset and I am trying to pass the contents of a specific column into the WHERE clause of a SQL query.
For example, assuming iris is my dataset:
data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
I want to pass the contents of the Species column {setosa, setosa, setosa, ..., setosa} to the WHERE clause of my SQL query:
sqlQuery(abcd, paste("SELECT * FROM TestTableName1
                      WHERE DESCRIPTION
                      IN (values of Species column from R dataframe)"))
Need help here

Your question is really about string manipulation (it's incidental that your string will eventually be passed to sqlQuery), and the answer is that you paste it together, or use sprintf if you're feeling fancy:
vals = paste(paste0('"', levels(iris$Species), '"'), collapse = ", ")
vals
## [1] "\"setosa\", \"versicolor\", \"virginica\""
vals.paren = paste0("(", vals, ")")
qry = paste("select * from table where description in ", vals.paren)
qry
## [1] "select * from table where description in (setosa, versicolor, virginica)"
# sprintf makes the bounding parentheses cleaner
qry = sprintf("select * from table where description in (%s)", vals)
qry
## [1] "select * from table where description in (setosa, versicolor, virginica)"

Prefacing any function call with fn$ from the gsubfn package enables string interpolation on its arguments. See ?fn for more info. This is often used with sqldf from the sqldf package, but it can be used with any function, as we show here. In particular, inserting $variable into a string argument of the call substitutes the value of that variable into that string:
library(gsubfn)
lvls <- toString(shQuote(levels(iris$Species)))
fn$sqlQuery(abcd, "select * from TestTableName1 where DESCRIPTION in ($lvls)")
or if we want to examine the string first:
sql <- fn$identity("select * from TestTableName1 where DESCRIPTION in ($lvls)")
cat(sql, "\n") # look at sql string
sqlQuery(abcd, sql)
The output from the cat statement is:
select * from TestTableName1 where DESCRIPTION in ('setosa', 'versicolor', 'virginica')

Related

Pandas / How to insert variable number of lines inside a DataFrame?

Here is the structure of my dataframe
plan   ADO_ver_x  ADO_incr_x  ADO_ver_y  ADO_incr_y
3ABP3       25.0         4.0       25.0         7.0
I would like to add ADO_incr_y - ADO_incr_x rows, which in this case means the result would be:
plan   ADO_ver_x  ADO_incr_x  ADO_ver_y  ADO_incr_y
3ABP3       25.0         4.0       25.0         5.0
3ABP3       25.0         5.0       25.0         6.0
3ABP3       25.0         6.0       25.0         7.0
Is there a pandas/Pythonic way to do that?
I was thinking something like:
reps = [val2 - val1 for val2, val1 in zip(df_insert["ADO_incr_y"], df_insert["ADO_incr_x"])]
df_insert.loc[np.repeat(df_insert.index.values, reps)]
But I don't get the incremental progression:
4 -> 5, 5 -> 6, 6 -> 7
How can I get the index inside the list comprehension?
You can repeat the data, then modify with groupby.cumcount():
# number of rows to produce for each original row
repeats = df['ADO_incr_y'].sub(df['ADO_incr_x']).astype(int)
# repeat each row that many times
out = df.reindex(df.index.repeat(repeats))
# increment ADO_incr_x by 0, 1, 2, ... within each original row
out['ADO_incr_x'] += out.groupby(level=0).cumcount()
out['ADO_incr_y'] = out['ADO_incr_x'] + 1

Selecting Data Using Conditions Stored in a Variable

Pretend I have this table on a server:
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
iris$id = 1:nrow(iris)
dbWriteTable(con, "iris", iris)
I want to select some random rows from this dataset - suppose I create an R variable that contains the random rows that I want to select:
rows_to_select = sample.int(10, 5, replace = TRUE)
[1] 1 1 8 8 7
I then tried to select these rows from my table - but this "rows_to_select" variable is not being recognized for some reason:
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (rows_to_select) limit 100;")
Error: no such column: rows_to_select
This code works fine if I manually specify which rows I want (e.g. I want the first row, and the fifth row selected twice):
#works - but does not return the 5th row twice
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (1,5,5) limit 100;")
Does anyone know how to fix this?
Thank you!
In general, merely naming rows_to_select inside a query string is not going to make SQLite reach out of its own environment and "invade" the R environment (a completely different one!) to look for a variable. (For that matter, why doesn't select a.* ... find dplyr::select?) This is the case both for pragmatic reasons and for security (though mostly pragmatic).
You may want to consider parameterized queries instead of constructing query strings manually. Beyond security concerns about malicious SQL injection (e.g., XKCD's Exploits of a Mom, aka "Little Bobby Tables"), parameterization also guards against malformed strings and Unicode-vs-ANSI mistakes, even when it's a single data analyst running the query. DBI supports parameterized queries.
Long story short, try this:
set.seed(42)
rows_to_select = sample.int(10, 5, replace = TRUE)
rows_to_select
# [1] 1 5 1 9 10
qmarks <- paste(rep("?", length(rows_to_select)), collapse = ",")
qmarks
# [1] "?,?,?,?,?"
DBI::dbGetQuery(con, paste(
  "select a.*
   from (select *, row_number() over (order by id) as rnum from iris) a
   where a.rnum in (", qmarks, ") limit 100;"),
  params = as.list(rows_to_select))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species id rnum
# 1 5.1 3.5 1.4 0.2 setosa 1 1
# 2 5.0 3.6 1.4 0.2 setosa 5 5
# 3 4.4 2.9 1.4 0.2 setosa 9 9
# 4 4.9 3.1 1.5 0.1 setosa 10 10
In this case it is rather trivial, but if you have a more complicated query where you use question marks ("bindings") at different places in the query, the order must align perfectly with the elements of the list assigned to the params= argument of dbGetQuery.
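For instance, here is a minimal sketch reusing the iris table from above (the particular columns are just for illustration): the first two ?s are filled by the first two elements of params, and the third ? by the last one.
qry <- "select id, Species from iris where id in (?, ?) and Species = ?"
# placeholders are bound strictly in the order they appear in the statement
DBI::dbGetQuery(con, qry, params = list(1L, 5L, "setosa"))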
Alternative: insert a temp table with your candidate values, then left-join against it.
dbWriteTable(con, "mytemp", data.frame(rnum = rows_to_select), temporary = TRUE)
DBI::dbGetQuery(con,
"select i.* from mytemp m left join iris i on i.id=m.rnum")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
# 1 5.1 3.5 1.4 0.2 setosa 1
# 2 5.0 3.6 1.4 0.2 setosa 5
# 3 5.1 3.5 1.4 0.2 setosa 1
# 4 4.4 2.9 1.4 0.2 setosa 9
# 5 4.9 3.1 1.5 0.1 setosa 10
DBI::dbExecute(con, "drop table mytemp")
# [1] 0

R SQL: Is the Default Option Sampling WITH Replacement?

I want to sample a file WITH REPLACEMENT on a server using SQL with R:
Pretend that this is the file I am trying to sample:
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)
I want to sample with replacement 30 rows where species = setosa and 30 rows where species = virginica. I used the following code to do this:
rbind(
  DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (`Species` = 'setosa') ORDER BY RANDOM() LIMIT 30;"),
  DBI::dbGetQuery(con, "SELECT * FROM iris WHERE (`Species` = 'virginica') ORDER BY RANDOM() LIMIT 30;")
)
However, at this point I am not sure whether the random sampling being performed is WITH REPLACEMENT or WITHOUT REPLACEMENT.
Can someone please help me determine which it is - and if it is being done WITHOUT REPLACEMENT, how can I change this so that it's done WITH REPLACEMENT?
Thank you!
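For what it's worth: ORDER BY RANDOM() LIMIT 30 shuffles the matching rows and keeps the first 30, so it samples WITHOUT replacement (no row can appear twice). One way to sample WITH replacement is sketched below, along the lines of the temp-table join in the previous answer (the table name "picked" is arbitrary):
# draw 30 setosa rowids WITH replacement in R, then fetch them via a join;
# duplicated rowids in "picked" yield duplicated rows in the result
setosa_ids <- DBI::dbGetQuery(con, "SELECT rowid FROM iris WHERE Species = 'setosa'")$rowid
picked <- data.frame(rowid = sample(setosa_ids, 30, replace = TRUE))
dbWriteTable(con, "picked", picked, temporary = TRUE)
DBI::dbGetQuery(con, "SELECT i.* FROM picked p JOIN iris i ON i.rowid = p.rowid")
Repeat the same steps for virginica and rbind the two results.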

Python - Looping through dataframe using methods other than .iterrows()

Here is the simplified dataset:
Character x0 x1
0 T 0.0 1.0
1 h 1.1 2.1
2 i 2.2 3.2
3 s 3.3 4.3
5 i 5.5 6.5
6 s 6.6 7.6
8 a 8.8 9.8
10 s 11.0 12.0
11 a 12.1 13.1
12 m 13.2 14.2
13 p 14.3 15.3
14 l 15.4 16.4
15 e 16.5 17.5
16 . 17.6 18.6
The simplified dataset is generated by the following code:
import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))
    x1.append(round(x0[-1] + 1, 1))
df = pd.DataFrame(list(zip(ch, x0, x1)), columns=['Character', 'x0', 'x1'])
df = df.drop(df.loc[df['Character'] == ' '].index)
x0 and x1 represent the starting and ending positions of each Character, respectively. Assume that the distance between any two adjacent characters equals 0.1. In other words, if the difference between the x0 of a character and the x1 of the previous character is 0.1, the two characters belong to the same string. If the difference is larger than 0.1, the character is the start of a new string, etc. I need to produce a dataframe of strings and their respective x0 and x1, which is done by looping through the dataframe using .iterrows():
string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    else:
        if round(row['x0'] - x1[-1], 1) == 0.1:
            string[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            string.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns=['String', 'x0', 'x1'])
Here is the result:
String x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
Is there any other, faster way to achieve this?
You could use groupby + agg:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.at[0, 'x0'])).abs()
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
print(res)
Output
Character x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
The tricky part is this one:
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
The idea is to convert the column of diffs (same) into a True or False column, where every time a True appears it means a new group needs to be created. The cumsum will take care of assigning the same id to each group.
As suggested by @ShubhamSharma, you could do:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)
# create grouper column, had to use this because of problems with floating point
grouper = same.cumsum()
The other part remains the same.

Similar function to SQL 'WHERE' clause in R

I have a data set assigned to a variable named 'temps', which has the columns 'date', 'temperature', and 'country'.
I want to do something like this, which I can do in SQL:
SELECT * FROM temps WHERE country != 'mycountry'
How can I do similar selection in R?
We can use similar syntax in base R
temps[temps$country != "mycountry",]
Benchmarks
set.seed(24)
temps1 <- data.frame(country = sample(LETTERS, 1e7, replace=TRUE),
val = rnorm(1e7))
system.time(temps1[!temps1$country %in% "A",])
# user system elapsed
# 0.92 0.11 1.04
system.time(temps1[temps1$country != "A",])
# user system elapsed
# 0.70 0.17 0.88
If we are using package solutions
library(sqldf)
system.time(sqldf("SELECT * FROM temps1 WHERE country != 'A'"))
# user system elapsed
# 12.78 0.37 13.15
library(data.table)
system.time(setDT(temps1, key = 'country')[!("A")])
# user system elapsed
# 0.62 0.19 0.37
This should do it.
temps2 <- temps[!temps$country %in% "mycountry",]
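For completeness, a dplyr equivalent (a one-liner sketch, assuming the dplyr package is installed):
library(dplyr)
temps2 <- filter(temps, country != "mycountry")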
Here are sqldf and base R approaches, with sample output, based on the input shown in the Note below.
1) sqldf
library(sqldf)
sqldf("SELECT * FROM temps WHERE country != 'mycountry'")
## country value
## 1 other 2
2) base R
subset(temps, country != "mycountry")
## country value
## 2 other 2
Note: The test data used above are shown here. Next time please provide such reproducible sample data in the question.
# test data
temps <- data.frame(country = c("mycountry", "other"), value = 1:2)