Pretend I have this table on a server:
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
iris$id = 1:nrow(iris)
dbWriteTable(con, "iris", iris)
I want to select some random rows from this dataset - suppose I create an R variable that contains the random rows that I want to select:
rows_to_select = sample.int(10, 5, replace = TRUE)
[1] 1 1 8 8 7
I then tried to select these rows from my table - but this "rows_to_select" variable is not being recognized for some reason:
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (rows_to_select) limit 100;")
Error: no such column: rows_to_select
This code works fine if I manually specify which rows I want (e.g. I want the first row, and the fifth row selected twice):
#works - but does not return the 5th row twice
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (1,5,5) limit 100;")
Does anyone know how to fix this?
Thank you!
In general, just including rows_to_select in a query will not cause SQLite to reach out of its own environment and into the R environment (a completely different one!) to look for a variable. (For that matter, why doesn't select a.* ... find dplyr::select?) This is the case both for pragmatic reasons and for security (though mostly pragmatic).
You may want to consider parameterized queries instead of constructing query strings manually. Beyond the security concern of malicious SQL injection (e.g., XKCD's Exploits of a Mom, aka "Little Bobby Tables"), manually built strings are also prone to malformed quoting and Unicode-vs-ANSI mistakes, even when a single data analyst is the only one running the query. DBI supports parameterized queries.
Long story short, try this:
set.seed(42)
rows_to_select = sample.int(10, 5, replace = TRUE)
rows_to_select
# [1] 1 5 1 9 10
qmarks <- paste(rep("?", length(rows_to_select)), collapse = ",")
qmarks
# [1] "?,?,?,?,?"
DBI::dbGetQuery(con, paste(
  "select a.*
   from (select *, row_number() over (order by id) as rnum from iris) a
   where a.rnum in (", qmarks, ") limit 100;"),
  params = as.list(rows_to_select))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species id rnum
# 1 5.1 3.5 1.4 0.2 setosa 1 1
# 2 5.0 3.6 1.4 0.2 setosa 5 5
# 3 4.4 2.9 1.4 0.2 setosa 9 9
# 4 4.9 3.1 1.5 0.1 setosa 10 10
In this case it is rather trivial, but if you have a more complicated query that uses question marks ("bindings") in several places, their order must align perfectly with the elements of the list passed to the params= argument of dbGetQuery.
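For illustration, here is a minimal sketch with a second, invented condition (the Species filter is made up for this example and is not part of the original question); note how params= supplies values in the same left-to-right order as the question marks:

# Hypothetical extra filter: the first two ?s bind to Species, the
# remaining ?s bind to the rnum list, so params= must follow that order.
DBI::dbGetQuery(con, paste(
  "select a.*
   from (select *, row_number() over (order by id) as rnum from iris) a
   where a.Species in (?, ?)
     and a.rnum in (", qmarks, ")"),
  params = c(list("setosa", "versicolor"), as.list(rows_to_select)))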
Alternative: load your candidate values into a temp table, then join against it. Unlike in (...), the join also returns one row per appearance in rows_to_select (note that id 1 comes back twice below).
dbWriteTable(con, "mytemp", data.frame(rnum = rows_to_select), temporary = TRUE)
DBI::dbGetQuery(con,
"select i.* from mytemp m left join iris i on i.id=m.rnum")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
# 1 5.1 3.5 1.4 0.2 setosa 1
# 2 5.0 3.6 1.4 0.2 setosa 5
# 3 5.1 3.5 1.4 0.2 setosa 1
# 4 4.4 2.9 1.4 0.2 setosa 9
# 5 4.9 3.1 1.5 0.1 setosa 10
DBI::dbExecute(con, "drop table mytemp")
# [1] 0
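If the same lookup will run repeatedly, DBI's prepared-statement interface is another option; a minimal sketch of the same query using dbSendQuery()/dbBind(), reusing con, qmarks, and rows_to_select from above:

# prepare once, bind the parameters, fetch, then release the result set
rs <- DBI::dbSendQuery(con, paste(
  "select a.*
   from (select *, row_number() over (order by id) as rnum from iris) a
   where a.rnum in (", qmarks, ")"))
DBI::dbBind(rs, as.list(rows_to_select))
DBI::dbFetch(rs)
DBI::dbClearResult(rs)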
Here is the structure of my dataframe:

plan   ADO_ver_x  ADO_incr_x  ADO_ver_y  ADO_incr_y
3ABP3  25.0       4.0         25.0       7.0
I would like to add ADO_incr_y - ADO_incr_x lines, which means in this case the result would be:

plan   ADO_ver_x  ADO_incr_x  ADO_ver_y  ADO_incr_y
3ABP3  25.0       4.0         25.0       5.0
3ABP3  25.0       5.0         25.0       6.0
3ABP3  25.0       6.0         25.0       7.0
Is there a Pandas/Pythonic way to do that?
I was thinking something like :
reps = [val2 - val1 for val2, val1 in zip(df_insert["ADO_incr_y"], df_insert["ADO_incr_x"])]
df_insert.loc[np.repeat(df_insert.index.values, reps)]
But I don't get the incremental progression :
4 -> 5, 5 -> 6, 6 -> 7
How can I get the index inside the list comprehension ?
You can repeat the data, then modify with groupby.cumcount():
# number of copies needed for each row
repeats = df['ADO_incr_y'].sub(df['ADO_incr_x']).astype(int)
# repeat each row that many times
out = df.reindex(df.index.repeat(repeats))
# increment ADO_incr_x by 0, 1, 2, ... within each original row
out['ADO_incr_x'] += out.groupby(level=0).cumcount()
out['ADO_incr_y'] = out['ADO_incr_x'] + 1
Here is the simplified dataset:
Character x0 x1
0 T 0.0 1.0
1 h 1.1 2.1
2 i 2.2 3.2
3 s 3.3 4.3
5 i 5.5 6.5
6 s 6.6 7.6
8 a 8.8 9.8
10 s 11.0 12.0
11 a 12.1 13.1
12 m 13.2 14.2
13 p 14.3 15.3
14 l 15.4 16.4
15 e 16.5 17.5
16 . 17.6 18.6
The simplified dataset is generated by the following code:
import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))
    x1.append(round(x0[-1] + 1, 1))
df = pd.DataFrame(list(zip(ch, x0, x1)), columns=['Character', 'x0', 'x1'])
df = df.drop(df.loc[df['Character'] == ' '].index)
x0 and x1 represent the starting and ending positions of each Character, respectively. Assume that the distance between any two adjacent characters equals 0.1. In other words, if the difference between the x0 of a character and the x1 of the previous character is 0.1, the two characters belong to the same string. If the difference is larger than 0.1, the character starts a new string, and so on. I need to produce a dataframe of strings and their respective x0 and x1, which is done by looping through the dataframe using .iterrows():
string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        # first row starts the first string
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    else:
        if round(row['x0'] - x1[-1], 1) == 0.1:
            # adjacent to the previous character: extend the current string
            string[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            # gap larger than 0.1: start a new string
            string.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns=['String', 'x0', 'x1'])
Here is the result:
String x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
Is there a faster way to achieve this?
You could use groupby + agg:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.at[0, 'x0'])).abs()
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
print(res)
Output
Character x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
The tricky part is this one:
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
The idea is to convert the column of diffs (same) into a Boolean column, where each True marks the point at which a new group needs to be created. The cumsum then takes care of assigning the same id to every row of a group.
As suggested by #ShubhamSharma, you could do:
# boolean column: True marks the start of a new group;
# rounding sidesteps the floating point comparison problems
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)
# grouper column: cumsum assigns one id per group
grouper = same.cumsum()
The other part remains the same.
Is there a way in R using the sqldf package to select all columns except one?
Your call to sqldf based on some query should return a data frame, where each DF column corresponds to one of the columns appearing in the select clause of your SQL query. Consider the following example:
sql <- "SELECT * FROM yourTable WHERE <some conditions>"
df <- sqldf(sql)
drop <- c("some_column")
df <- df[, !(names(df) %in% drop)]
Note in the above I am doing a SELECT * to fetch all columns in the table (what I assume is your use case). I then subset off a column some_column from the resulting data frame.
Note that doing this from SQL directly generally is not possible. That is, once you do SELECT *, the cat is out of the bag, and you end up with all columns.
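As a small aside, base R's subset() can do the same drop by name (a sketch; some_column is the placeholder name from above):

# equivalent to the %in% subsetting above: drop some_column by name
df <- subset(df, select = -some_column)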
1) SQLite Using the default SQLite backend, suppose we want to return the first 3 rows of all columns in mtcars except for the cyl column. First create a comma-separated string, sel, of all such column names, then use fn$sqldf to enable string interpolation, referring to the string in the SQL statement as $sel. Add the verbose=TRUE argument to sqldf if you want to see the SQL statement that was generated.
library(sqldf)
sel <- toString(setdiff(names(mtcars), "cyl"))
fn$sqldf("select $sel from mtcars limit 3")
giving:
mpg disp hp drat wt qsec vs am gear carb
1 21.0 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 108 93 3.85 2.320 18.61 1 1 4 1
2) H2 The H2 backend supports alter table ... drop column ..., so we can write the following. Since alter does not return anything, we add a select that returns the altered table.
library(RH2)
library(sqldf)
sqldf(c("alter table mtcars drop column cyl",
"select * from mtcars limit 3"))
I'm trying to translate Python solutions into SQL code.
I would like to assign titles according to the grade in a new column.
For example:
A for the top 0.9% of the column
B for next 15% of the column
C for next 25% of the column
D for next 30% of the column
E for next 13% of the column
F for rest of the column
There is this column:
Grades
2.3
3
2
3.3
3.5
3.6
3.2
2.1
2.3
3.7
3.3
3.1
4.4
4.3
1.4
4.5
3.5
I don't know how SQLite can work with this, since it doesn't have a quantile function like the one R and other languages have.
Something I tried, though it's not even close, is this:
SELECT x
FROM MyTable
ORDER BY x
LIMIT 1
OFFSET (SELECT COUNT(*)
FROM MyTable) / 2
to get the value halfway down the column.
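One possible direction, assuming a SQLite build with window-function support (3.25+): cume_dist() gives each row's cumulative fraction when ordered by grade descending, which can then be bucketed against the cumulative cutoffs implied by the list above (0.9%, 15.9%, 40.9%, 70.9%, 83.9%). A minimal sketch via R and DBI, using the table and column names from the question:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "MyTable", data.frame(
  Grades = c(2.3, 3, 2, 3.3, 3.5, 3.6, 3.2, 2.1, 2.3,
             3.7, 3.3, 3.1, 4.4, 4.3, 1.4, 4.5, 3.5)))
# cume_dist() = fraction of rows at or above each grade (descending order),
# so the case buckets are the running totals of the percentages above
dbGetQuery(con, "
  select Grades,
         case when pct <= 0.009 then 'A'
              when pct <= 0.159 then 'B'
              when pct <= 0.409 then 'C'
              when pct <= 0.709 then 'D'
              when pct <= 0.839 then 'E'
              else 'F'
         end as Title
  from (select Grades,
               cume_dist() over (order by Grades desc) as pct
        from MyTable)")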
I have a dataset and I am trying to pass the contents of a specific column into the SQL where statement.
For example, assuming iris is my dataset
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
I want to pass the contents of the Species column { setosa, setosa, setosa.....setosa} to the WHERE clause of my SQL query:
sqlQuery(abcd, paste("Select * from TestTableName1
where WHERE DESCRIPTION
IN (values of Species column from r dataframe)");
Need help here
Your question is really about string manipulation (it's incidental that your string will eventually be passed to sqldf), and the answer is that you paste it together, or use sprintf if you're feeling fancy:
vals = paste(paste0('"', levels(iris$Species), '"'), collapse = ", ")
vals
## [1] "\"setosa\", \"versicolor\", \"virginica\""
vals.paren = paste0("(", vals, ")")
qry = paste("select * from table where description in ", vals.paren)
qry
## [1] "select * from table where description in (setosa, versicolor, virginica)"
# sprintf makes the bounding parentheses cleaner
qry = sprintf("select * from table where description in (%s)", vals)
qry
## [1] "select * from table where description in (setosa, versicolor, virginica)"
By prefacing any function call with fn$ from the gsubfn package, string interpolation is enabled on its arguments. See ?fn for more info. This is often used with sqldf from the sqldf package but can be used with any function, as we show here. In particular, inserting $variable into a string argument of the function call substitutes the value of that variable into that string:
library(gsubfn)
lvls <- toString(shQuote(levels(iris$Species)))
fn$sqlQuery(abcd, "select * from TestTableName1 where DESCRIPTION in ($lvls)")
or if we want to examine the string first:
sql <- fn$identity("select * from TestTableName1 where DESCRIPTION in ($lvls)")
cat(sql, "\n") # look at sql string
sqlQuery(abcd, sql)
The output from the cat statement is:
select * from TestTableName1 where DESCRIPTION in ("setosa", "versicolor", "virginica")
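Since abcd is an ODBC connection that can't be opened here, the same interpolation can be checked locally with sqldf itself (a sketch against the built-in iris data frame, which sqldf queries as if it were a table):

library(sqldf)
lvls <- toString(shQuote(levels(iris$Species)))
# fn$ interpolates $lvls into the query string before sqldf runs it
fn$sqldf("select * from iris where Species in ($lvls) limit 3")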