find match between 2 sql queries - sql

I have two queries from v$sqlarea. For example
query 1: select * from employee emp where emp.eid = 5
query 2: select * from employee v where v.eid = 15
Both are exactly the same in structure. but they will be compiled separately each time..
I need to match such queries that vary only by alias names or bind variables.
The inbuilt function utl_match.jaro_winkler_similarity() in oracle provides a pretty good string match algorithm. But sql perspective match is not provided. Is there any other solution?

there is a script on Asktom that will find exactly this kind of statements (statements NOT using binds).
The output of that last query will show you statements that are identical in the shared
pool after all numbers and character string constants have been removed. These
statements -- and more importantly their counts -- are the potential bottlenecks. In
addition to causing the contention, they will be HUGE cpu consumers.

Oracle actually does this internally, to support cursor_sharing = similar. I am not aware that they expose this functionality anywhere.

The alias names is a tricky one. You could look for SQLs with the same PLAN_HASH_VALUE as a starting point.

Related

how to separate the parameters in the sql query and push it in to array to avoid sql injection

SELECT * FROM table1 WHERE year_month BETWEEN '2021-08' AND '2022-01';
update table2 set note_description = 'test #8:57am', patient_id = '5840', note_updated_by = '10000019', note_update_date = '2022-07-13 09:45:49' where note_id = '639'
now my backend queries can be attacked by sql injection so i want to avoid the sql injection
in the above queries I want to separate the parameters from queries and replace it with special characters so that I can avoid sql injection is there any package or anything to do it.
If you have received the SQL statement with the parameters already concatenated in, then this is the wrong place to fix your issue - there’s no way to safely parse the statement and separate out the parameters from the query.
You should find the place in the code where the parameters are concatenated into the Statement and leveraging Prepared Statements/Parameterized Queries to safely pass/bind the parameters.
If that’s not possible (for example because the code is structured to only pass along the statement) a less desirable alternative is to encode/enquote the parameters before concatenating them in, while ensuring they are all quoted in the statement. How you do that part will depend on the database / language being used.
I've seen one product that does this: pt-query-digest. It's a free tool that parses the MySQL query log, and produces reports of aggregate time spent running each query. To do this, it must establish a query "fingerprint" which allows it to group queries that are the same except for constant values. Like SELECT * FROM mytable WHERE id = 123 has the same fingerprint as SELECT * FROM mytable WHERE id = 456.
This means it must parse the queries and replace each constant value, like a numeric or string literal, with a placeholder ?. In cases of IN() predicates, it replaces the list of values with ?+. Also it reduces whitespace and removes comments.
It's a non-trivial amount of code, about 100 lines of Perl: https://github.com/percona/percona-toolkit/blob/3.x/lib/QueryRewriter.pm#L139-L248
In spite of this, the function is preceded by a comment that the developers acknowledge it is not perfect, and may miss some cases. Implementing a recursive-descent parser using regular expressions is not efficient or correct.
But this is probably not what you want to do anyway. You shouldn't be starting from a query with constant values and making them into a parameterized query. You should design parameterized queries yourself, as needed.
Not every constant value in an SQL query necessarily must be parameterized. Only the ones that aren't fixed values. That is, if you need to combine a variable from your client code into the SQL query string, and you can't guarantee that the variable is safe, then use a parameter. If a query has a constant value that is fixed (not interpolated from a variable), then it can remain in the query. If a query has a value that comes from a variable, but that variable is known to be safe, and never can be tainted by untrusted input, then it can remain in the query.
It's more reliable and economical for you to make these judgments. You know the code and the context much better than any automated system can.

SQL purpose of colon

My colleague wrote this SQL statement and I had a hard time understanding it. What exactly is the purpose of using a colon in the where clause?
WHERE MGM_YYMM like :AS_YYMM
Full Query:
SELECT A.MGM_YYMM,
A.MGM_DATE,
A.MGM_GB,
A.INDATE,
A.SUDATE,
A.EMPNUM,
FROM SE_MAGAM A(NOLOCK)
WHERE MGM_YYMM like :AS_YYMM
ORDER BY MGM_YYMM DESC
It is a bind variable.
The program (or whatever else is issuing the query) will assign a value to :AS_YYMM, in this case the pattern to match against the MGM_YYMM column.
These kind of parameterized queries are useful because they can be prepared/parsed/compiled/analyzed once and then be run multiple times for varying inputs with reduced overhead (compared to a new query each time). Also helps against SQL injection (compared to building a dynamic SQL statement from user input).

Performance of SQL comparison using substring vs like with wildcard

I am working on a join condition between 2 tables where one of the columns to match on is a concatentation of values. I need to join columnA from tableA to the first 2 characters of columnB from tableB.
I have developed 2 different statements to handle this and I have tried to analyze the performance of each method.
Method 1:
ON tB.columnB like tA.columnA || '%'
Method 2:
ON substr(tB.columnB,1,2) = tA.columnA
The query execution plan has a lot less steps using Method 1 compared to Method 2, however, it looks like Method 2 executes much faster. Also, the execution plan shows a recommended index for Method 2 that could improve its performance.
I am running this on an IBM iSeries, though would be interested in answers in a general sense to learn more about sql query optimization.
Does it make sense that Method 2 would execute faster?
This SO question is similar, but it looks like no one provided any concrete answers to the performance difference of these approaches: T-SQL speed comparison between LEFT() vs. LIKE operator.
PS: The table design that requires this type of join is not something that I can get changed at this time. I realize having the fields separated which hold different types of data would be preferrable.
I ran the following in the SQL Advisor in IBM Data Studio on one of the tables in my DB2 LUW 10.1 database:
SELECT *
FROM PDM.DB30
WHERE DB30_SYSTEM_ID = 'XXX'
AND DB30_VERSION_ID = 'YYY'
AND SUBSTR(DB30_REL_TABLE_NM, 1, 4) = 'ZZZZ'
and
SELECT *
FROM PDM.DB30
WHERE DB30_SYSTEM_ID = 'XXX'
AND DB30_VERSION_ID = 'YYY'
AND DB30_REL_TABLE_NM LIKE 'ZZZZ%'
They both had the exact same access path utilizing the same index, the same estimated IO cost and the same estimated cardinality, the only difference being the estimated total CPU cost for the LIKE was 178,343.75 while the SUBSTR was 197,518.48 (~10% difference).
The cumulative total cost for both were the same though, so this difference is negligible as per the advisor.
Yes, Method 2 would be faster. LIKE is not as efficient a function.
To compare performance of various techniques, try using Visual Explain. You will find it buried in System i Navigator. Under your system connection, expand databases, then click onyour RDB name. In the lower right pane you can then click on the option to Run an SQL Script. Enter in your SELECT statement, and choose the menu option for Visual Explain or Run and Explain. Visual explain will break down the execution plan for your statement and show you the cost for each part as estimated on your tables with the indexes available.
You can actually run with real examples in your database.
LIKE is always better at my run.
select count(*) from u_log where log_text like 'AUT%';
1 row(s) returned : 90ms taken
select count(*) from u_log where substr(log_text,1,3)='AUT';
1 row(s) returned : 493ms taken
I found this reference in an IBM redbook related to SQL performance. It sounds like the SUBSTR scalar function can be handled in an optimized manner by an iSeries.
If you search for the first character and want to use the SQE instead
of the CQE, you can use the scalar function substring on the left sign
of the equal sign. If you have to search for additional characters in
the string, you can additionally use the scalar function POSSTR. By
splitting the LIKE predicate into several scalar function, you can
affect the query optimizer to use the SQE.
http://publib-b.boulder.ibm.com/abstracts/sg246654.html?Open

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. But it appears that the substring call is being applied before the filtering happens, the query is failing.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this would give me an error, because when it reaches "X" it will try using a negative number for in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular Database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names, (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
Martin gave this link that pretty much explains what is going on - the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it i will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if i can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsfroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
Build a working table from all of
the table constructors in the FROM
clause.
Remove from the working table those
rows that do not satisfy the WHERE
clause.
Construct the expressions in the
SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called query execution plan. It's based on query optimization rules, indexes, temporaty buffers and execution time statistics. If you are using SQL Managment Studio you have toolbox over your query editor where you can look at estimated execution plan, it shows how your query will change to gain some speed. So if just used your Name table and it is in buffer, engine might first try to subquery your data, and then join it with other table.

Building Query from Multi-Selection Criteria

I am wondering how others would handle a scenario like such:
Say I have multiple choices for a user to choose from.
Like, Color, Size, Make, Model, etc.
What is the best solution or practice for handling the build of your query for this scneario?
so if they select 6 of the 8 possible colors, 4 of the possible 7 makes, and 8 of the 12 possible brands?
You could do dynamic OR statements or dynamic IN Statements, but I am trying to figure out if there is a better solution for handling this "WHERE" criteria type logic?
EDIT:
I am getting some really good feedback (thanks everyone)...one other thing to note is that some of the selections could even be like (40 of the selections out of the possible 46) so kind of large. Thanks again!
Thanks,
S
What I would suggest doing is creating a function that takes in a delimited list of makeIds, colorIds, etc. This is probably going to be an int (or whatever your key is). And splits them into a table for you.
Your SP will take in a list of makes, colors, etc as you've said above.
YourSP '1,4,7,11', '1,6,7', '6'....
Inside your SP you'll call your splitting function, which will return a table-
SELECT * FROM
Cars C
JOIN YourFunction(#models) YF ON YF.Id = C.ModelId
JOIN YourFunction(#colors) YF2 ON YF2.Id = C.ColorId
Then, if they select nothing they get nothing. If they select everything, they'll get everything.
What is the best solution or practice for handling the build of your query for this scenario?
Dynamic SQL.
A single parameter represents two states - NULL/non-existent, or having a value. Two more means squaring the number of parameters to get the number of total possibilities: 2 yields 4, 3 yields 9, etc. A single, non-dynamic query can contain all the possibilities but will perform horribly between the use of:
ORs
overall non-sargability
and inability to reuse the query plan
...when compared to a dynamic SQL query that constructs the query out of only the absolutely necessary parts.
The query plan is cached in SQL Server 2005+, if you use the sp_executesql command - it is not if you only use EXEC.
I highly recommend reading The Curse and Blessing of Dynamic SQL.
For something this complex, you may want a session table that you update when the user selects their criteria. Then you can join the session table to your items table.
This solution may not scale well to thousands of users, so be careful.
If you want to create dynamic SQL it won't matter if you use the OR approach or the IN approach. SQL Server will process the statements the same way (maybe with little variation in some situations.)
You may also consider using temp tables for this scenario. You can insert the selections for each criteria into temp tables (e.g., #tmpColor, #tmpSize, #tmpMake, etc.). Then you can create a non-dynamic SELECT statement. Something like the following may work:
SELECT <column list>
FROM MyTable
WHERE MyTable.ColorID in (SELECT ColorID FROM #tmpColor)
OR MyTable.SizeID in (SELECT SizeID FROM #tmpSize)
OR MyTable.MakeID in (SELECT MakeID FROM #tmpMake)
The dynamic OR/IN and the temp table solutions work fine if each condition is independent of the other conditions. In other words, if you need to select rows where ((Color is Red and Size is Medium) or (Color is Green and Size is Large)) you'll need to try other solutions.