Select rows from a database using a varying set of WHERE conditions - sql

I have a map with lots drawn on it. When a person selects one or more lots, I want to grab the information for those lots from the database and return it. The lots are identified by IDs like the ones in "lots_list".
At the moment I'm using a for loop to iterate over the list and fetch the data, passing each ID to a WHERE clause with a placeholder, but the execution is fairly slow this way.
import psycopg2

def getLotInfo(lots_list):
    lots = []
    for lot in lots_list:
        try:
            connection = psycopg2.connect(user=user,
                                          password=password,
                                          host=host,
                                          port=port,
                                          database=database)
            cursor = connection.cursor()
            Psql_Query = '''SELECT setor, quadra, lote, area_ocupada
                            FROM iptu_sql_completo WHERE sql LIKE %s'''
            cursor.execute(Psql_Query, (lot,))
            lots.append(cursor.fetchone())
            print(lots)
        except (Exception, psycopg2.Error) as error:
            print("Error fetching data from PostgreSQL table", error)
        finally:
            # closing database connection.
            if connection:
                cursor.close()
                connection.close()
                print("PostgreSQL connection is closed")
    return lots

lots_list = ["0830480002", "0830480003", "0830480004"]
Lots = getLotInfo(lots_list)
I tried to use psycopg2's execute_batch command:
Psql_Query = '''SELECT setor, quadra, lote, area_ocupada
                FROM iptu_sql_completo WHERE sql LIKE %s'''
ppgextra.execute_batch(cursor, Psql_Query, SQLs)
lots.append(cursor.fetchall())
print(lots)
but I get the error "not all arguments converted during string formatting". I imagine that's because I should use a placeholder in the query for every item in the list, but if the list is ever-changing in size, is there a way to fix this? The IDs are not always sequential.
My question is: Is there a way to achieve better performance than using the for loop?

Your current code is pretty much the worst case I had in mind here:
What are the pros and cons of performing calculations in sql vs. in your application
What is faster, one big query or many small queries?
Maurice already mentioned the repeated connection overhead. But even with a single connection, this is far from ideal. Instead, run a single query and pass the whole list lots_list as a Postgres array:
SELECT setor, quadra, lote, area_ocupada
FROM iptu_sql_completo
WHERE sql = ANY (%s);
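For instance, with psycopg2, which adapts a Python list to a Postgres array, the whole lookup becomes one round trip. A minimal sketch, reusing the connection parameters and names from the question:

import psycopg2

def get_lot_info(lots_list):
    # One connection and one query: psycopg2 adapts the Python list
    # to a Postgres array for the ANY(%s) placeholder.
    connection = psycopg2.connect(user=user, password=password,
                                  host=host, port=port, database=database)
    try:
        with connection.cursor() as cursor:
            cursor.execute(
                '''SELECT setor, quadra, lote, area_ocupada
                   FROM iptu_sql_completo
                   WHERE sql = ANY (%s)''',
                (lots_list,))
            return cursor.fetchall()
    finally:
        connection.close()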
Related:
Pass array literal to PostgreSQL function
IN vs ANY operator in PostgreSQL

Related

R function that takes in CL argument and queries SQL database

Brand new to SQL and SQLite here. I'm trying to create a function in RStudio that takes an argument from the command line and queries an SQLite database to see whether the record exists in one specific column, displaying a message to the user saying whether the record was found. I have a table within this database, let's call it my_table, that has two columns, column_1 and column_2: column_1 has ID numbers and column_2 has names associated with those ID numbers, and there are about 700 rows in total.
So far I have something that looks like this:
my_function() <- function(CL_record) {
  db <- dbConnect(SQLite(), dbname = "mysql.db")
  dbGetQuery(db, sql("SELECT * FROM my_table WHERE column_2 == #CL_record"))
}
But this is obviously not the right way to go about it and I keep getting errors thrown regarding invalid (NULL) left side of assignment.
Any help here would be much appreciated.
I recommend using parameterized queries, something like:
my_function <- function(CL_record) {
  db <- dbConnect(SQLite(), dbname = "mysql.db")
  on.exit(dbDisconnect(db), add = TRUE)
  dbGetQuery(db, "SELECT * FROM my_table WHERE column_2 = ?",
             params = list(CL_record))
}
The params= argument does not need to be a list(.); it works just fine here as params=CL_record. But if you have two or more parameters, and especially if they are of different classes (strings vs. numbers), then you really should use a list.
You'll see many suggestions, and even howtos/tutorials, that use paste or similar to add parameters to the query. There are at least two reasons not to use paste or sprintf (see the sketch after this list):
Whether malicious or inadvertent (e.g., typos), SQL injection is something that should be actively avoided. Even if your code is not public-facing, accidents happen.
Most (all?) DBMSes have query optimization/compiling. When the DBMS receives a query it has not seen before, it parses it and attempts to optimize it; when it sees a repeat query, it can reuse the previous optimization, saving time. When you paste or sprintf an argument into a query, each query differs from the previous one, so this caching is defeated. With bound parameters, the query text itself never changes (only its parameters do), so we can take advantage of the compiled plan.
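To make the contrast concrete, here is a small illustration, written in Python with the built-in sqlite3 module since the point is language-agnostic; the demo table is made up:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE demo (column_1 INTEGER, column_2 TEXT)")
db.execute("INSERT INTO demo VALUES (1, 'alice')")

record = "x' OR '1'='1"  # hostile (or simply malformed) input

# String-pasting splices the input into the SQL text itself:
pasted = "SELECT * FROM demo WHERE column_2 = '%s'" % record
print(db.execute(pasted).fetchall())  # returns every row: injection

# A bound parameter is sent separately from the (cacheable) query text:
bound = "SELECT * FROM demo WHERE column_2 = ?"
print(db.execute(bound, (record,)).fetchall())  # returns no rows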

Performance issue when using bind variable for a large list inside the IN clause

I'm using Sybase and had some code that looked like this:
String[] ids = ... an array containing 80-90k strings, which is retrieved from another table and varies.
for (String id : ids) {
    // wrap every id with single-quotes
}
String idsAsString = String.join(",", ids);
String query = String.format("select * from someTable where idName in (%s)", idsAsString);
getNamedParameterJDBCTemplate().query(query, resultSetExtractor -> {
    // do stuff with results
});
I've timed how long it took to get to the inner body of the resultSetExtractor and it never took longer than 4 seconds.
But to secure the code, I tried going the bind variable route. Thus, that code looked like the following:
String[] ids = ... an array containing 80-90k strings, which is retrieved from another table and varies.
String query = "select * from someTable where idName in (:ids)";
Map<String, Object> params = new HashMap<>();
params.put("ids", Arrays.asList(ids));
getNamedParameterJDBCTemplate().query(query, params, resultSetExtractor -> {
    // do stuff with results
});
But doing it this way will take up to 4-5 minutes to finally spew out the following exception:
21-10-2019 14:04:01 DEBUG DefaultConnectionTester:126 - Testing a Connection in response to an Exception:
com.sybase.jdbc4.jdbc.SybSQLException: The token datastream length was not correct. This is an internal protocol error.
I also have other bits of code where I pass in arrays of sizes 1-10 as bind variables and noticed that those queries went from being instantaneous to taking up to 10 seconds.
I'm surprised that the bind-variable way is at all different, let alone that drastically different. Can someone explain what is going on here? Do bind variables do something different under the hood compared to sending a formatted string through JDBC? Is there another way to secure my code without drastically slowing performance?
You should verify what's actually happening at the database end via a showplan/query plan, but an IN clause will at best usually do one index search for every value in it: ten values do ten searches, and 80k values do 80k of them, massively slower. Oracle actually prohibits putting more than 1000 values in an IN clause, and while Sybase is not so restrictive, that doesn't mean it's a good idea. You also risk stack and other issues in your database by passing in massive numbers of values this way; I've seen such a query take out a production database instance with a stack failure.
It's much better to create a temporary table, load the 80k values into it, and do an inner join between the temporary table and the main table on the column you previously searched with the IN clause.
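The thread is Java/Spring, but the pattern itself is short; a sketch with a generic DB-API connection in Python (the table and column names come from the question, and the temp-table syntax varies by DBMS):

def fetch_by_ids(connection, ids):
    cursor = connection.cursor()
    cursor.execute("CREATE TEMPORARY TABLE tmp_ids (id VARCHAR(40))")
    # One parameterized INSERT per id; most drivers batch this.
    cursor.executemany("INSERT INTO tmp_ids (id) VALUES (?)",
                       [(i,) for i in ids])
    # A single join replaces the 80-90k-element IN list.
    cursor.execute("""SELECT t.*
                      FROM someTable t
                      JOIN tmp_ids x ON x.id = t.idName""")
    return cursor.fetchall()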

How to connect to a database with R

I have a problem using sqlQuery to query a database from R.
library(RODBC)
res = sqlQuery(channel,
               paste0("select pb.col1, pb.col2 from pb,
                       mp, fb, st
                       where fb.col10 in ('%s', input),
                       and fb.col20 = mp.col30
                       and pb.col40 = st.col50
                       and pb.col45 = st.col60
                       and mp.col40 = pb.col80 and
                       pb.col80 = st.col90"),
               believeNRows = F)
Here, input = c("abc", "def", "wvy", "etz"), but the real input has more than 10,000 string elements.
channel is already set up for connecting to the database.
It looks like there are some problems with the WHERE clause, but I do not know how to fix them.
Can anyone help me with this?
paste0 does not work the way you are using it. You would need to use:
sprintf("select pb.col1,pb.col2
from pb,mp,fb,st
where fb.col10 in %s
and fb.col20=mp.col30
and pb.col40=st.col50
and pb.col45=st.col60
and mp.col40=pb.col80 and
pb.col80=st.col90", input)
Next, the way you have this structured will result in the query argument being a vector. You should aim to have query be a single string.
You might be better off using RODBCext
library(RODBCext)
res = sqlExecute(channel,
                 "select pb.col1, pb.col2
                  from pb, mp, fb, st
                  where fb.col10 in ?
                  and fb.col20 = mp.col30
                  and pb.col40 = st.col50
                  and pb.col45 = st.col60
                  and mp.col40 = pb.col80
                  and pb.col80 = st.col90",
                 data = list(c("abc", "def", "wvy", "etz")),
                 fetch = TRUE,
                 stringsAsFactors = FALSE)
Lastly, I'm not sure this query is valid SQL syntax. Maybe I'm mistaken, but I don't think you can list multiple tables in the FROM clause like you have here. If you need multiple tables, there should be some way of joining them.
FROM Table1 LEFT JOIN Table2 ON Table1.ID = Table2.Ref
EDIT: I just saw that your input has over 10,000 elements, which will make sqlExecute pretty slow. Are you sure this is the best way to query these data? If possible, I would recommend some other approach to isolating the data that you need.
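If the varying list size is the sticking point, one language-agnostic workaround is to generate one placeholder per element and query in bounded chunks. A minimal sketch in Python with sqlite3 (the table and column names echo the question but are placeholders):

def select_in(connection, values, chunk_size=500):
    # Fetch rows whose col10 is in `values`, chunked so a statement
    # never carries more than `chunk_size` bound parameters.
    results = []
    cursor = connection.cursor()
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        placeholders = ", ".join("?" for _ in chunk)
        cursor.execute(
            "SELECT col1, col2 FROM pb WHERE col10 IN (%s)" % placeholders,
            chunk)
        results.extend(cursor.fetchall())
    return results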

Groovy SQL Error - This ResultSet is closed

I am using Groovy's Sql object to perform queries on a postgres db. The queries are being executed as follows:
List<Map> results = sql.rows("select * from my_table")
List<Map> result2= sql.rows("select * from my_second_table")
I have a Groovy method that performs two queries and then does some processing to loop through the data and build a different dataset; however, on some occasions I receive a postgres exception with a "This ResultSet is closed" error.
Having searched, I originally thought it might be the issue described here: SQLException: This ResultSet is closed (running multiple queries and trying to access the data from the result sets after the fact). However, we only seem to get the exception under quite high load, which suggests it isn't as simple as the first dataset being closed on executing the second query; if that were the case I would expect it to happen consistently.
Can anyone shed any light on how Groovy's Sql object handles these situations or suggest what might be going wrong?
Groovy SQL is kind of a weird cat: easy to use for simple stuff, but for more complex scenarios you are probably better off using something else, IMHO.
I first suggest doing one query and storing the results into a collection, then doing the second query and storing those results into a collection, and then doing your operations between the two collections rather than between result sets. If your data is too large for that, find some way to store the data locally before you start doing your aggregation or whatever.
If you don't like that, you might need to check out the GDK source code to get a better idea of what is done with Sql.getInstance() and result sets. Then you can sidestep whatever land mine you are inadvertently stepping on.
Perhaps
List<Map> results = sql.rows("select * from my_table")
List<Map> result2= sql.rows("select * from my_second_table")
will not work even in plain Java (as already said in the answer you linked, when a second call is made on a statement, all resources dedicated to the previous call have to be released). As mentioned by @Todd W Crone, Groovy can optimize resources, e.g. release them dynamically or keep them, depending on the particular run.
Actually, I've tried it with only one query. E.g. I've tried to get a ResultSet and then iterate through it, like this (don't mind the table and field names; the query is rather simple, and the result is one row containing one column due to the LIMIT 1 clause):
def resultSet = sql.executeQuery("SELECT age FROM person WHERE id = 12345 LIMIT 1")
resultSet.next()
and got the "This ResultSet is closed" error. It seems that Groovy optimizes resources and closes the ResultSet immediately. I didn't look into the source code of the Groovy SDK. I found that eachRow and other methods with closure-style handlers work fine and don't throw the "This ResultSet is closed" error.
Perhaps methods with closure-style handlers can help you. For example, look at this excerpt from the article where the rows() method is used with a closure:
String query = 'select id as identifier, name as langName from languages'
def rows = db.rows(query, { meta ->
assert meta.tableName == 'languages'
assert meta.columnCount == 2
// ...
})

Why would you use a WHERE 1=0 statement in SQL?

I saw a query in an application's log file, and it contained:
SELECT ID FROM CUST_ATTR49 WHERE 1=0
What is the use of such a query, which is bound to return nothing?
A query like this can be used to ping the database. The clause:
WHERE 1=0
ensures that no data is sent back, so there is no CPU cost, network traffic, or other resource consumption.
A query like that can test for:
server availability
CUST_ATTR49 table existence
ID column existence
Keeping a connection alive
Causing a trigger to fire without changing any rows (with the WHERE clause, but not in a SELECT query)
Managing many OR conditions in dynamic queries (e.g. WHERE 1=0 OR <condition>)
This may also be used to extract the table schema from a table without extracting any data inside it. As Andrea Colleoni said, those are the other benefits of using this.
A use case I can think of: you have a filter form where you don't want any search results by default. If the user specifies some filter, its conditions get added to the WHERE clause.
It's also used when you have to build a SQL query by hand. E.g. if you don't want to check whether the WHERE clause is empty or not, you can just add conditions like this:
where := "WHERE 0=1"
if X then where := where + " OR ... "
if Y then where := where + " OR ... "
(if you connect the clauses with OR you need 0=1, if you have AND you have 1=1)
As an answer, but also as further clarification to what @AndreaColleoni already mentioned:
manage many OR conditions in dynamic queries (e.g. WHERE 1=0 OR <condition>)
Purpose as an on/off switch
I am using this as a switch (on/off) statement for portions of my Query.
If I were to use
WHERE 1=1
AND (0=? OR first_name = ?)
AND (0=? OR last_name = ?)
Then I can use the first bind variable (?) to turn the first_name search criterion on or off, and the third bind variable (?) to turn the last_name criterion on or off.
I have also added a literal 1=1 just for aesthetics, so the text of the query aligns nicely.
For just these two criteria it does not appear that helpful, as one might think it is easier to build the WHERE condition dynamically, including only first_name, or only last_name, or both, or neither. But then your code has to dynamically build 4 versions of the same query. Imagine what would happen if you had 10 different criteria to consider: how many combinations of the same query would you have to manage then?
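A small usage sketch of that switch, in Python DB-API style (the cursor and the table and parameter values are hypothetical):

# Turn the first_name criterion OFF (binding 0 makes 0=0 always true,
# so that predicate passes) and the last_name criterion ON (binding 1
# makes 0=1 false, so last_name = ? must match). A dummy value still
# has to be supplied for the disabled first_name placeholder.
query = """SELECT * FROM people
           WHERE 1=1
             AND (0=? OR first_name = ?)
             AND (0=? OR last_name = ?)"""
cursor.execute(query, (0, "", 1, "Smith"))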
Compile Time Optimization
I might also add that using 0=? as a bind-variable switch will not work very well if all your criteria are indexed. The run-time optimizer that selects appropriate indexes and execution plans might just not see the cost benefit of using the index with those slightly more complex predicates. Hence I usually advise injecting the 0 / 1 explicitly into your query (string-concatenating it into your SQL, or doing some search/replace). Doing so gives the compiler the chance to optimize out redundant statements, and gives the runtime executor a much simpler query to look at.
(0=1 OR cond = ?) --> (cond = ?)
(0=0 OR cond = ?) --> Always True (ignore predicate)
In the second statement above the compiler knows that it never has to even consider the second part of the condition (cond = ?), and it will simply remove the entire predicate. If it were a bind variable, the compiler could never have accomplished this.
Because you are simply, and forcedly, injecting a 0/1, there is zero chance of SQL injections.
In my SQL, as one approach, I typically write my injection points as ${literal_name}, and then simply search/replace any ${...} occurrence with the appropriate literal, using a regex, before I even let the compiler have a stab at it. This basically leads to a query stored as follows:
WHERE 1=1
AND (0=${cond1_enabled} OR cond1 = ?)
AND (0=${cond2_enabled} OR cond2 = ?)
Looks good, easily understood, the compiler handles it well, and the Runtime Cost Based Optimizer understands it better and will have a higher likelihood of selecting the right index.
I take special care in what I inject. The prime way of passing variables is, and remains, bind variables, for all the obvious reasons.
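A minimal sketch of that search/replace step (Python; the literals dict and the 0/1 whitelist are illustrative assumptions, not the poster's actual code):

import re

def inject_literals(sql, literals):
    # Replace each ${name} with a whitelisted 0/1 literal before the
    # statement is prepared; bind variables (?) are left untouched.
    def sub(match):
        value = literals[match.group(1)]
        if value not in (0, 1):  # only the on/off switch values
            raise ValueError("only 0/1 literals may be injected")
        return str(value)
    return re.sub(r"\$\{(\w+)\}", sub, sql)

sql = """WHERE 1=1
           AND (0=${cond1_enabled} OR cond1 = ?)
           AND (0=${cond2_enabled} OR cond2 = ?)"""
print(inject_literals(sql, {"cond1_enabled": 1, "cond2_enabled": 0}))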
This is very useful for metadata fetching and makes things generic.
Many DBs have an optimizer, so they will not actually execute the query, but it is still a valid SQL statement and should run on any DB.
It will not fetch any results, but you then know the column names, data types, etc. are valid. If it does not execute, you know something is wrong with the DB (it is not up, etc.).
So many generic programs execute this kind of dummy statement for testing and for fetching metadata.
Some systems use scripts and can dynamically set selected records to be hidden from a full list, so a false condition needs to be passed to the SQL. For example, three records out of 500 may be marked as private for medical reasons and should not be visible to everyone. A dynamic query controls this: all 500 records are visible to those in HR, while only 497 are visible to managers. A value is passed to the conditionally built SQL clause, i.e. 'WHERE 1=1' or 'WHERE 1=0', depending on who is logged into the system.
quoted from Greg
If the list of conditions is not known at compile time and is instead
built at run time, you don't have to worry about whether you have one
or more than one condition. You can generate them all like:
and <condition>
and concatenate them all together. With the 1=1 at the start, the
initial "and" has something to associate with.
I've never seen this used for any kind of injection protection, as you
say it doesn't seem like it would help much. I have seen it used as an
implementation convenience. The SQL query engine will end up ignoring
the 1=1 so it should have no performance impact.
Why would someone use WHERE 1=1 AND <conditions> in a SQL clause?
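A quick sketch of that run-time building pattern (Python; the filters dict is made up, and column names are assumed to come from a whitelist):

def build_where(filters):
    # Build a WHERE clause from 0..n conditions without special-casing
    # the first one: 1=1 gives the initial AND something to attach to.
    clause = "WHERE 1=1"
    params = []
    for column, value in filters.items():
        clause += " AND %s = ?" % column  # column names whitelisted
        params.append(value)              # values stay bound parameters
    return clause, params

print(build_where({"first_name": "Ada", "last_name": "Lovelace"}))
# ('WHERE 1=1 AND first_name = ? AND last_name = ?', ['Ada', 'Lovelace'])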
If the user intends only to append records, then the fastest method is to open the recordset without returning any existing records.
It can be useful when only table metadata is desired in an application. For example, if you are writing a JDBC application and want to get the column display size of columns in the table.
Pasting a code snippet here
String query = "SELECT * from <Table_name> where 1=0";
PreparedStatement stmt = connection.prepareStatement(query);
ResultSet rs = stmt.executeQuery();
ResultSetMetaData rsMD = rs.getMetaData();
int columnCount = rsMD.getColumnCount();
for (int i = 0; i < columnCount; i++) {
    System.out.println("Column display size is: " + rsMD.getColumnDisplaySize(i + 1));
}
Here, a query like "select * from table" can cause performance issues when you are dealing with huge data, because it will try to fetch all the records from the table. If you instead provide a query like "select * from table where 1=0", it will fetch only the table metadata and not the records, so it is efficient.
Per user milso in another thread, another purpose for "WHERE 1=0":
CREATE TABLE New_table_name AS
SELECT * FROM Old_table_name WHERE 1 = 2;
this will create a new table with the same schema as the old table. (Very
handy if you want to load some data for compares.)
An example of using a where condition of 1=0 is found in the Northwind 2007 database. On the main page the New Customer Order and New Purchase Order command buttons use embedded macros with the Where Condition set to 1=0. This opens the form with a filter that forces the sub-form to display only records related to the parent form. This can be verified by opening either of those forms from the tree without using the macro. When opened this way all records are displayed by the sub-form.
In ActiveRecord ORM, part of RubyOnRails:
Post.where(category_id: []).to_sql
# => SELECT * FROM posts WHERE 1=0
This is presumably because the following is invalid (at least in Postgres):
select id FROM bookings WHERE office_id IN ()
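The same guard is easy to reproduce by hand when building an IN clause; a small Python sketch (the column name echoes the example above):

def in_clause(column, values):
    # An empty list degrades to the always-false 1=0 instead of
    # producing the invalid fragment `column IN ()`.
    if not values:
        return "WHERE 1=0", []
    placeholders = ", ".join("?" for _ in values)
    return "WHERE %s IN (%s)" % (column, placeholders), list(values)

print(in_clause("office_id", []))      # ('WHERE 1=0', [])
print(in_clause("office_id", [1, 2]))  # ('WHERE office_id IN (?, ?)', [1, 2])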
It seems like someone is trying to hack your database; it looks like an attempted MySQL injection. You can read more about it here: MySQL Injection