How to connect database with R - sql

I have a problem using sqlQuery to connect database with R.
library(RODBC)
res =sqlQuery(channel,
paste0("select pb.col1,pb.col2 from pb,
mp,fb,st
where fb.col10 in ('%s',input),
and fb.col20=mp.col30
and pb.col40=st.col50
and pb.col45=st.col60
and mp.col40=pb.col80 and
pb.col80=st.col90"),
believeNRows=F)
Here, input=c("abc","def","wvy","etz"), but the real input has more than 10,000 string elements.
Channel is already set up for connecting with the database.
It looks like there are some problems with where-clause but I do not know how to fix it.
Can anyone help me with this?

paste0 does not work the way you are using it. You would need to use:
sprintf("select pb.col1,pb.col2
from pb,mp,fb,st
where fb.col10 in %s
and fb.col20=mp.col30
and pb.col40=st.col50
and pb.col45=st.col60
and mp.col40=pb.col80 and
pb.col80=st.col90", input)
Next, the way you have this structured will result in the query argument being a vector. You should aim to have query be a single string.
You might be better off using RODBCext
library(RODBCext)
res =sqlExecute(channel,
"select pb.col1,pb.col2
from pb,mp,fb,st
where fb.col10 in ?,
and fb.col20=mp.col30
and pb.col40=st.col50
and pb.col45=st.col60
and mp.col40=pb.col80
and pb.col80=st.col90",
data = list(c("abc", "def", "wvy", "etz")),
fetch = TRUE,
stringsAsFactors = FALSE)
Lastly, I'm not sure this query is valid SQL syntax. Maybe I'm mistaken, but I don't think you can list multiple tables in the FROM clause like you have here. If you need multiple tables, there should be some way of joining them.
FROM Table1 LEFT JOIN Table2 ON Table1.ID = Table2.Ref
EDIT: I just saw that your input has over 10,000 elements, which will make sqlExecute pretty slow. Are you sure a LIKE is the best way to query these data. If possible, I would recommend some other approach to isolating the data that you need.

Related

R function that takes in CL argument and queries SQL database

Brand new to SQL and SQLite here. I'm trying to create a function in R studio that takes in an argument from the command line and queries an SQL database to see if the record exists in one specific column, displaying a message to the user whether the record was found or not (I have a table within this database, lets call it my_table, that has 2 columns, we'll name them column_1 and column_2. column_1 has ID numbers and column_2 has names that are associated with those ID numbers, and there are a total of about 700 rows).
So far I have something that looks like this:
my_function() <- function(CL_record) { db <- dbConnect(SQLite(), dbname = "mysql.db") dbGetQuery(db, sql("SELECT * FROM my_table WHERE column_2 == #CL_record")) }
But this is obviously not the right way to go about it and I keep getting errors thrown regarding invalid (NULL) left side of assignment.
Any help here would be much appreciated.
I recommend using parameterized queries, something like:
my_function <- function(CL_record) {
db <- dbConnect(SQLite(), dbname = "mysql.db")
on.exit(dbDisconnect(db), add = TRUE)
dbGetQuery(db, "SELECT * FROM my_table WHERE column_2 = ?",
params = list(CL_record))
}
The params= argument does not need to be a list(.), it works just fine here as params=CL_record, but if you have two or more, and especially if they are of different classes (strings vs numbers), then you really should list them.
You'll see many suggestions, or even howtos/tutorials that suggest using paste or similar for adding parameters to the query. There are at least two reasons for not using paste or sprintf:
Whether malicious or inadvertent (e.g., typos), sql injection is something that should be actively avoided. Even if your code is not public-facing, accidents happen.
Most (all?) DBMSes have query optimization/compiling. When it receives a query it has not seen before, it parses and attempts to optimize the query; when it sees a repeat query, it can re-use the previous optimization, saving time. When you paste or sprintf an argument into a query, each query is different from the previous, so it is overall less efficient. When using bound parameters, the query itself does not change (just its parameters), so we can take advantage of the compiling/optimization.

Select rows from database using varying multiple WHERE conditions

I have a map with lots drawn on it, when a person select one or more lots I want to grab the information of those lots from the database and return it. The lots are identified by IDs like the ones in "lots_list".
At the moment I'm using a for loop to iterate over the list and fetch the data, passing the ID to a where clause with a placeholder, but the execution is fairly slow this way.
def getLotInfo(lots_list):
lots = []
for lot in lots_list:
try:
connection = psycopg2.connect(user=user,
password=password,
host=host,
port=port,
database=database)
cursor = connection.cursor()
Psql_Query = '''SELECT setor, quadra, lote, area_ocupada FROM iptu_sql_completo WHERE sql
LIKE %s'''
cursor.execute(Psql_Query, (lot,))
lots.append(cursor.fetchone())
print(lots)
except (Exception, psycopg2.Error) as error:
print("Error fetching data from PostgreSQL table", error)
finally:
# closing database connection.
if (connection):
cursor.close()
connection.close()
print("PostgreSQL connection is closed")
return lots
lots_list = ["0830480002", "0830480003", "0830480004"]
Lots = getLotInfo(lots_list)
I tried to use psycopg2 execute_batch command
Psql_Query = '''SELECT setor, quadra, lote, area_ocupada FROM
iptu_sql_completo WHERE sql LIKE %s'''
ppgextra.execute_batch(cursor, Psql_Query, SQLs)
lots.append(cursor.fetchall())
print(lots)
but I get the following error "not all arguments converted during string formatting" I am imagining that's because I should use a placeholder in the query for every item in the list, but if the list is ever changing in size, would there be a way to fix this? The IDs are not always sequential.
My question is: Is there a way to achieve better performance than using the for loop?
Your current code is pretty much the worst case I had in mind here:
What are the pros and cons of performing calculations in sql vs. in your application
What is faster, one big query or many small queries?
Maurice already mentioned the repeated connection overhead. But even with a single connection, this is far from ideal. Instead, run a single query and pass the whole list lots_list as Postgres array:
SELECT setor, quadra, lote, area_ocupada
FROM iptu_sql_completo
WHERE sql = ANY (%s);
Related:
Pass array literal to PostgreSQL function
IN vs ANY operator in PostgreSQL

Oracle - allowing the LIKE clause being configured by another select

In a script I'm running the following query on an Oracle database:
select nvl(max(to_char(DATA.VALUE)), 'OK')
from DATA
where DATA.FILTER like (select DATA2.FILTER from DATA2 where DATA2.FILTER2 = 'WYZ')
In the actual script it's a bit more complicated, but you get the idea. ;-)
DATA2.FILTER contains the filter which needs to applied on DATA as a LIKE -clause. The idea is to have it as generic as possible, meaning it should be possible to filter on:
FILTER%
%FILTER
FI%LTER
%FILTER%
FILTER (as if the clause was DATA.FILTER = (select DATA2.FILTER from DATA2 where DATA2.FILTER2 = 'WYZ')
The system the script runs on does not allow stored procedures to run for this kind of task, also I can't make the scrip build the query directly before running it.
Whatever data needs to be fetched, it has to be done with one single query.
I've already tried numerous solutions I found online, but no matter what I do, I seem to be misisng the mark.
What am I doing wrong here?
This one?
select nvl(max(to_char(DATA.VALUE)), 'OK')
from DATA
JOIN DATA2 on DATA.FILTER LIKE DATA2.FILTER
where DATA2.FILTER2 = 'WYZ'
Note: The performance of this query is not optimal, because for table DATA Oracle will perform always a "Full Table Scan" - but it works.

why would you use WHERE 1=0 statement in SQL?

I saw a query run in a log file on an application. and it contained a query like:
SELECT ID FROM CUST_ATTR49 WHERE 1=0
what is the use of such a query that is bound to return nothing?
A query like this can be used to ping the database. The clause:
WHERE 1=0
Ensures that non data is sent back, so no CPU charge, no Network traffic or other resource consumption.
A query like that can test for:
server availability
CUST_ATTR49 table existence
ID column existence
Keeping a connection alive
Cause a trigger to fire without changing any rows (with the where clause, but not in a select query)
manage many OR conditions in dynamic queries (e.g WHERE 1=0 OR <condition>)
This may be also used to extract the table schema from a table without extracting any data inside that table. As Andrea Colleoni said those will be the other benefits of using this.
A usecase I can think of: you have a filter form where you don't want to have any search results. If you specify some filter, they get added to the where clause.
Or it's usually used if you have to create a sql query by hand. E.g. you don't want to check whether the where clause is empty or not..and you can just add stuff like this:
where := "WHERE 0=1"
if X then where := where + " OR ... "
if Y then where := where + " OR ... "
(if you connect the clauses with OR you need 0=1, if you have AND you have 1=1)
As an answer - but also as further clarification to what #AndreaColleoni already mentioned:
manage many OR conditions in dynamic queries (e.g WHERE 1=0 OR <condition>)
Purpose as an on/off switch
I am using this as a switch (on/off) statement for portions of my Query.
If I were to use
WHERE 1=1
AND (0=? OR first_name = ?)
AND (0=? OR last_name = ?)
Then I can use the first bind variable (?) to turn on or off the first_name search criterium. , and the third bind variable (?) to turn on or off the last_name criterium.
I have also added a literal 1=1 just for esthetics so the text of the query aligns nicely.
For just those two criteria, it does not appear that helpful, as one might thing it is just easier to do the same by dynamically building your WHERE condition by either putting only first_name or last_name, or both, or none. So your code will have to dynamically build 4 versions of the same query. Imagine what would happen if you have 10 different criteria to consider, then how many combinations of the same query will you have to manage then?
Compile Time Optimization
I also might add that adding in the 0=? as a bind variable switch will not work very well if all your criteria are indexed. The run time optimizer that will select appropriate indexes and execution plans, might just not see the cost benefit of using the index in those slightly more complex predicates. Hence I usally advice, to inject the 0 / 1 explicitly into your query (string concatenating it in in your sql, or doing some search/replace). Doing so will give the compiler the chance to optimize out redundant statements, and give the Runtime Executer a much simpler query to look at.
(0=1 OR cond = ?) --> (cond = ?)
(0=0 OR cond = ?) --> Always True (ignore predicate)
In the second statement above the compiler knows that it never has to even consider the second part of the condition (cond = ?), and it will simply remove the entire predicate. If it were a bind variable, the compiler could never have accomplished this.
Because you are simply, and forcedly, injecting a 0/1, there is zero chance of SQL injections.
In my SQL's, as one approach, I typically place my sql injection points as ${literal_name}, and I then simply search/replace using a regex any ${...} occurrence with the appropriate literal, before I even let the compiler have a stab at it. This basically leads to a query stored as follows:
WHERE 1=1
AND (0=${cond1_enabled} OR cond1 = ?)
AND (0=${cond2_enabled} OR cond2 = ?)
Looks good, easily understood, the compiler handles it well, and the Runtime Cost Based Optimizer understands it better and will have a higher likelihood of selecting the right index.
I take special care in what I inject. Prime way for passing variables is and remains bind variables for all the obvious reasons.
This is very good in metadata fetching and makes thing generic.
Many DBs have optimizer so they will not actually execute it but its still a valid SQL statement and should execute on all DBs.
This will not fetch any result, but you know column names are valid, data types etc. If it does not execute you know something is wrong with DB(not up etc.)
So many generic programs execute this dummy statement for testing and fetching metadata.
Some systems use scripts and can dynamically set selected records to be hidden from a full list; so a false condition needs to be passed to the SQL. For example, three records out of 500 may be marked as Privacy for medical reasons and should not be visible to everyone. A dynamic query will control the 500 records are visible to those in HR, while 497 are visible to managers. A value would be passed to the SQL clause that is conditionally set, i.e. ' WHERE 1=1 ' or ' WHERE 1=0 ', depending who is logged into the system.
quoted from Greg
If the list of conditions is not known at compile time and is instead
built at run time, you don't have to worry about whether you have one
or more than one condition. You can generate them all like:
and
and concatenate them all together. With the 1=1 at the start, the
initial and has something to associate with.
I've never seen this used for any kind of injection protection, as you
say it doesn't seem like it would help much. I have seen it used as an
implementation convenience. The SQL query engine will end up ignoring
the 1=1 so it should have no performance impact.
Why would someone use WHERE 1=1 AND <conditions> in a SQL clause?
If the user intends to only append records, then the fastest method is open the recordset without returning any existing records.
It can be useful when only table metadata is desired in an application. For example, if you are writing a JDBC application and want to get the column display size of columns in the table.
Pasting a code snippet here
String query = "SELECT * from <Table_name> where 1=0";
PreparedStatement stmt = connection.prepareStatement(query);
ResultSet rs = stmt.executeQuery();
ResultSetMetaData rsMD = rs.getMetaData();
int columnCount = rsMD.getColumnCount();
for(int i=0;i<columnCount;i++) {
System.out.println("Column display size is: " + rsMD.getColumnDisplaySize(i+1));
}
Here having a query like "select * from table" can cause performance issues if you are dealing with huge data because it will try to fetch all the records from the table. Instead if you provide a query like "select * from table where 1=0" then it will fetch only table metadata and not the records so it will be efficient.
Per user milso in another thread, another purpose for "WHERE 1=0":
CREATE TABLE New_table_name as select * FROM Old_table_name WHERE 1 =
2;
this will create a new table with same schema as old table. (Very
handy if you want to load some data for compares)
An example of using a where condition of 1=0 is found in the Northwind 2007 database. On the main page the New Customer Order and New Purchase Order command buttons use embedded macros with the Where Condition set to 1=0. This opens the form with a filter that forces the sub-form to display only records related to the parent form. This can be verified by opening either of those forms from the tree without using the macro. When opened this way all records are displayed by the sub-form.
In ActiveRecord ORM, part of RubyOnRails:
Post.where(category_id: []).to_sql
# => SELECT * FROM posts WHERE 1=0
This is presumably because the following is invalid (at least in Postgres):
select id FROM bookings WHERE office_id IN ()
It seems like, that someone is trying to hack your database. It looks like someone tried mysql injection. You can read more about it here: Mysql Injection

How bad is my query?

Ok I need to build a query based on some user input to filter the results.
The query basically goes something like this:
SELECT * FROM my_table ORDER BY ordering_fld;
There are four text boxes in which users can choose to filter the data, meaning I'd have to dynamically build a "WHERE" clause into it for the first filter used and then "AND" clauses for each subsequent filter entered.
Because I'm too lazy to do this, I've just made every filter an "AND" clause and put a "WHERE 1" clause in the query by default.
So now I have:
SELECT * FROM my_table WHERE 1 {AND filters} ORDER BY ordering_fld;
So my question is, have I done something that will adversely affect the performance of my query or buggered anything else up in any way I should be remotely worried about?
MySQL will optimize your 1 away.
I just ran this query on my test database:
EXPLAIN EXTENDED
SELECT *
FROM t_source
WHERE 1 AND id < 100
and it gave me the following description:
select `test`.`t_source`.`id` AS `id`,`test`.`t_source`.`value` AS `value`,`test`.`t_source`.`val` AS `val`,`test`.`t_source`.`nid` AS `nid` from `test`.`t_source` where (`test`.`t_source`.`id` < 100)
As you can see, no 1 at all.
The documentation on WHERE clause optimization in MySQL mentions this:
Constant folding:
(a<b AND b=c) AND a=5
-> b>5 AND b=c AND a=5
Constant condition removal (needed because of constant folding):
(B>=5 AND B=5) OR (B=6 AND 5=5) OR (B=7 AND 5=6)
-> B=5 OR B=6
Note 5 = 5 and 5 = 6 parts in the example above.
You can EXPLAIN your query:
http://dev.mysql.com/doc/refman/5.0/en/explain.html
and see if it does anything differently, which I doubt. I would use 1=1, just so it is more clear.
You might want to add LIMIT 1000 or something, when no parameters are used and the table gets large, will you really want to return everything?
WHERE 1 is a constant, deterministic expression which will be "optimized out" by any decent DB engine.
If there is a good way in your chosen language to avoid building SQL yourself, use that instead. I like Python and Django, and the Django ORM makes it very easy to filter results based on user input.
If you are committed to building the SQL yourself, be sure to sanitize user inputs against SQL injection, and try to encapsulate SQL building in a separate module from your filter logic.
Also, query performance should not be your concern until it becomes a problem, which it probably won't until you have thousands or millions of rows. And when it does come time to optimize, adding a few indexes on columns used for WHERE and JOIN goes a long way.
TO improve performance, use column indexes on fields listen in "WHERE"
Standard SQL Injection Disclaimers here...
One thing you could do, to avoid SQL injection since you know it's only four parameters is use a stored procedure where you pass values for the fields or NULL. I am not sure of mySQL stored proc syntax, but the query would boil down to
SELECT *
FROM my_table
WHERE Field1 = ISNULL(#Field1, Field1)
AND Field2 = ISNULL(#Field2, Field2)
...
ORDRE BY ordering_fld
We've been doing something similiar not too long ago and there're a few things that we observed:
Setting up the indexes on the columns we were (possibly) filtering, improved performance
The WHERE 1 part can be left out completely if the filters're not used. (not sure if it applies to your case) Doesn't make a difference, but 'feels' right.
SQL injection shouldn't be forgotten
Also, if you 'only' have 4 filters, you could build up a stored procedure and pass in null values and check for them. (just like n8wrl suggested in the meantime)
That will work - some considerations:
About dynamically built SQL in general, some databases (Oracle at least) will cache execution plans for queries, so if you end up running the same query many times it won't have to completely start over from scratch. If you use dynamically built SQL, you are creating a different query each time so to the database it will look like 100 different queries instead of 100 runs of the same query.
You'd probably just need to measure the performance to find out if it works well enough for you.
Do you need all the columns? Explicitly specifying them is probably better than using * anyways because:
You can visually see what columns are being returned
If you add or remove columns to the table later, they won't change your interface
Not bad, i didn't know this snippet to get rid of the 'is it the first filter 3' question.
Tho you should be ashamed of your code ( ^^ ), it doesn't do anything to performance as any DB Engine will optimize it.
The only reason I've used WHERE 1 = 1 is for dynamic SQL; it's a hack to make appending WHERE clauses easier by using AND .... It is not something I would include in my SQL otherwise - it does nothing to affect the query overall because it always evaluates as being true and does not hit the table(s) involved so there aren't any index lookups or table scans based on it.
I can't speak to how MySQL handles optional criteria, but I know that using the following:
WHERE (#param IS NULL OR t.column = #param)
...is the typical way of handling optional parameters. COALESCE and ISNULL are not ideal because the query is still utilizing indexes (or worse, table scans) based on a sentinel value. The example I provided won't hit the table unless a value has been provided.
That said, my experience with Oracle (9i, 10g) has shown that it doesn't handle [ WHERE (#param IS NULL OR t.column = #param) ] very well. I saw a huge performance gain by converting the SQL to be dynamic, and used CONTEXT variables to determine what to add. My impression of SQL Server 2005 is that these are handled better.
I have usually done something like this:
for(int i=0; i<numConditions; i++) {
sql += (i == 0 ? "WHERE " : "AND ");
sql += dbFieldNames[i] + " = " + safeVariableValues[i];
}
Makes the generated query a little cleaner.
One alternative i sometimes use is to build the where clause an an array and then join them together:
my #wherefields;
foreach $c (#conditionfields) {
push #wherefields, "$c = ?",
}
my $sql = "select * from table";
if(#wherefields) { $sql.=" WHERE " . join (" AND ", #wherefields); }
The above is written in perl, but most languages have some kind of join funciton.