Regular expression to match common SQL syntax? - sql

I was writing some Unit tests last week for a piece of code that generated some SQL statements.
I was trying to figure out a regex to match SELECT, INSERT and UPDATE syntax so I could verify that my methods were generating valid SQL, and after 3-4 hours of searching and messing around with various regex editors I gave up.
I managed to get partial matches but because a section in quotes can contain any characters it quickly expands to match the whole statement.
Any help would be appreciated, I'm not very good with regular expressions but I'd like to learn more about them.
By the way it's C# RegEx that I'm after.
Clarification
I don't want to need access to a database, as this is part of a unit test and I don't want to have to maintain a database to test my code, which may live longer than the project.

Regular expressions can only match languages that a finite-state automaton can recognize, which is a very limited class, whereas SQL needs a full grammar. It can be demonstrated that you can't validate SQL with a regex. So, you can stop trying.

SQL is a type-2 (context-free) grammar; it is too powerful to be described by regular expressions. It's the same as if you decided to generate C# code and then validate it without invoking a compiler. A database engine is, in general, too complex to be easily stubbed.
That said, you may try ANTLR's SQL grammars.

As far as I know this is beyond regex, and you're getting close to the dark arts of BNF and compilers.
http://savage.net.au/SQL/
The same thing happens to people who want to do correct syntax highlighting. You start cramming things into regex and then you end up writing a compiler...

I had the same problem - an approach that would work for all the more standard sql statements would be to spin up an in-memory Sqlite database and issue the query against it, if you get back a "table does not exist" error, then your query parsed properly.
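A minimal sketch of that idea in Python, using the standard library's sqlite3 module. The function name check_sql is made up for illustration, and the classification relies on the wording of SQLite's error messages, which is an assumption rather than a documented contract. It also only tells you whether SQLite's dialect accepts the statement, which may differ from your target database.

```python
import sqlite3


def check_sql(sql: str) -> str:
    """Classify a statement as 'syntax error', 'parsed, but ...' or 'ok'."""
    conn = sqlite3.connect(":memory:")  # empty throwaway database, no schema
    try:
        # EXPLAIN compiles the statement without running it.
        conn.execute("EXPLAIN " + sql)
    except sqlite3.Error as exc:
        message = str(exc)
        # Assumption: SQLite words parser failures as "... syntax error"
        # or "incomplete input"; other messages come from later stages.
        if "syntax error" in message or "incomplete input" in message:
            return "syntax error"
        return "parsed, but failed later: " + message
    finally:
        conn.close()
    return "ok"
```

A "no such table" result then counts as a pass for a pure syntax check, while "syntax error" fails the unit test.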

Off the top of my head: Couldn't you pass the generated SQL to a database and use EXPLAIN on them and catch any exceptions which would indicate poorly formed SQL?

Have you tried lazy quantifiers? Rather than match as much as possible, they match as little as possible, which is probably what you need for quotes.
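To illustrate the difference, here is a small sketch in Python (the question asks about C#, but .NET uses the same *? syntax for lazy matching). The sample statement is invented for the example.

```python
import re

sql = "SELECT * FROM t WHERE a = 'x' AND b = 'y'"

# Greedy: .* runs to the LAST quote, swallowing everything in between.
greedy = re.findall(r"'.*'", sql)

# Lazy: .*? stops at the FIRST closing quote it can.
lazy = re.findall(r"'.*?'", sql)
```

Here greedy holds a single match spanning both literals, while lazy holds the two literals separately.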

To validate the queries, just run them with SET NOEXEC ON; that is how Enterprise Manager does it when you parse a query without executing it.
Besides, if you are using regex to validate SQL queries, you can be almost certain that you will miss some corner cases, or that the query is not valid for other reasons, even if it's syntactically correct.

I suggest creating a database with the same schema, possibly using an embedded sql engine, and passing the sql to that.

I don't think that you even need to have the schema created to be able to validate the statement, because the system will not try to resolve object_name etc until it has successfully parsed the statement.
With Oracle as an example, you would certainly get an error if you did:
select * from non_existant_table;
In this case, "ORA-00942: table or view does not exist".
However if you execute:
select * frm non_existant_table;
Then you'll get a syntax error, "ORA-00923: FROM keyword not found where expected".
It ought to be possible to classify errors into parsing errors that indicate incorrect syntax and errors relating to table names, permissions, etc.
Add to that the problem of different RDBMSs and even different versions allowing different syntaxes and I think you really have to go to the db engine for this task.

There are ANTLR grammars to parse SQL. It's really a better idea to use an in memory database or a very lightweight database such as sqlite. It seems wasteful to me to test whether the SQL is valid from a parsing standpoint, and much more useful to check the table and column names and the specifics of your query.

The best way is to validate the parameters used to create the query, rather than the query itself. A function that receives the variables can check the length of the strings, valid numbers, valid emails or whatever. You can use regular expressions to do these validations.

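// requires: using System.Text.RegularExpressions;
public bool IsValid(string sql)
{
    // Only checks the rough shape SELECT ... FROM ... WHERE ...; it does not validate SQL.
    string pattern = @"SELECT\s.*FROM\s.*WHERE\s.*";
    Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
    return rgx.IsMatch(sql);
}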

I am assuming you used something like .* to match the quoted section; try [^"]* instead, which will keep you from eating the whole line. It will still give false positives in cases where you have \ inside your strings.
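A sketch of the negated-class approach in Python. It assumes SQL-standard single-quoted literals in which a quote is escaped by doubling it (''), which is a different convention from the double quotes and backslashes mentioned above; the helper name string_literals is hypothetical.

```python
import re

# A literal is: opening quote, then any run of non-quote characters
# or doubled quotes (''), then the closing quote.
QUOTED = re.compile(r"'(?:[^']|'')*'")


def string_literals(sql: str) -> list[str]:
    """Return every single-quoted literal found in the statement."""
    return QUOTED.findall(sql)
```

Because the character class can never cross an unescaped quote, the match cannot expand over the rest of the statement. Dialects that also accept backslash escapes would need a further alternative in the pattern.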

Related

How to skip the SQL(part of the SQL) parsing in antlr4?

Sorry, my earlier question was closed and cannot be reopened. Apologies also for my poor English; it was translated by a website. :)
https://stackoverflow.com/questions/70035964/how-to-skip-sql-parsing-in-antlr4
@BartKiers Thanks for being interested in this question; let me give a detailed example.
There are lots of SQL queries, such as select * from user or update user set field1 = 'value1' where condition = 'value' etc.; let's call them the original SQL queries.
There is a Java program which intercepts all the original SQL queries, parses them into parse tree nodes with ANTLR4, and then rewrites each query (the rewrite depends on the parse phase), so the original SQL queries may be parsed and rewritten as
select field1, field1_encrypted, field1_digest, field2 from user
or
update user
set field1 = value1,
field1_encrypted = encrypt_algorithm(value1),
field1_digest = digest_algorithm(value1)
where condition_digest = digest_algorithm(values)
etc.
Once the rewrite phase has finished, the queries are executed as SQLStatement: the SELECT is executed as SelectSQLStatement and the UPDATE as UpdateSQLStatement.
Now I think some of the original SQL queries should skip the parse phase, and with it the rewrite phase, but should still be executed exactly as written.
I thought of marking those with a comment such as
/* PARSE_PHASE_SKIPPED=TRUE */ originalSQL
or with a SKIP prefix such as
SKIP originalSQL
I would like ANTLR4 to parse the whole marked query, with the original SQL part kept intact, into parse tree nodes, and then execute it as ParsePhaseSkippedSQLStatement.
Can ANTLR4 support this situation, and how should the grammar be written? Thanks in advance.
====================
Thank you for your reply @Mike Cargal. Yes, almost.
Let me say it again and give a more detailed example.
There is a Java system that we call X. X has lots of SQL queries that the developers write and guarantee can be executed correctly by Ibatis / JPA etc.; let's call those the original SQL queries.
Using below original SQL queries as examples:
insert into user (username, id_no) values ('xyz', '123456')
select username, id_no from user u where u.id_no = '123456'
We say the column id_no on table user is sensitive data, so we should save ciphertext instead of plaintext. The original SQL queries would therefore be parsed by ANTLR and rewritten by Java code as below; let's call those the rewritten SQL queries. The rewritten SQL queries should also be executed correctly by Ibatis / JPA etc.
insert
into user (username, id_no, id_no_cipher, id_no_digest)
values ('xyz', '', 'encrypted_123456', 'digest_123456')
select username, id_no_cipher as id_no
from user u
where u.id_no_digest = 'digest_123456'
In this case:
1. We see that the rewrite phase depends on the parse phase: original SQL queries need to be parsed correctly before they can be rewritten by Java code.
2. All original SQL queries are parsed, but only the few matching the sensitive-data rules are rewritten.
But there are lots of original SQL queries that we clearly know do not need to be rewritten, and so do not need to be parsed either. Parsing them may throw exceptions in various complex situations, yet they would be executed correctly by Ibatis / JPA etc.
So I planned to use a SQL comment or a customized keyword annotation to "turn off" the parse phase for them.
If I understand your question correctly, you wish to use some sort of comment/annotation to "turn off" execution of the following SQL statement.
(NOTE: You can't really skip "parsing" part of the input. This will address ways in which you could skip processing part of the parsed input, which I believe is what you're ultimately wanting to accomplish.)
This would not be an ANTLR concern. ANTLR's responsibility is to parse your input stream and produce a parse tree (not technically an AST) that correctly represents the structure of your input.
Executing the SQL is not what ANTLR does. It does, however, generate utility Listener and Visitor classes that can be used to cleanly and efficiently navigate the resulting parse tree. There can be a lot of code involved in actually executing the SQL from the parse tree. Often, the first step is to produce an AST from the parse tree to make it easier to deal with.
You have a couple of alternatives (as you hint at).
1 - Using the current grammar and putting these annotations inside of comments (/* PARSE_PHASE_SKIPPED=TRUE */)
This can be done, but it's a bit "messy". It's most likely that COMMENT tokens are dropped with -> skip (or perhaps sent to -> channel(HIDDEN)). This makes it MUCH easier to write the parser rules, as you don't have to include optional COMMENTs everywhere a comment could appear. That said, if you send COMMENT tokens to the HIDDEN channel, they are still in the token stream even though they are ignored by the parser. The COMMENT tokens won't be in the rule Context objects that the listeners/visitors deal with, but you could look backwards/forwards in the token stream for COMMENT tokens.
2 - you could introduce some new syntax for annotations (similar to your SKIP idea). To do this you'll have to extend the syntax in the grammar to recognize these annotations. They'd have to be distinguishable from valid SQL, so a simple SKIP is probably not going to work.
The benefit of this approach is that, when you extend the grammar to recognize annotations, you can be very specific about where annotations are allowed. You'd be able to include them in your parse tree, making them easier to locate.
With either of these approaches, you would use a visitor or listener to go through your parse tree looking for the comment/annotation and then mark the subsequent statement with an indicator that you don't want to execute it. (You might use the information to simply skip the parse tree to AST transformation of the "skipped" nodes).
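For comparison, here is a sketch of a simpler alternative that does not involve ANTLR at all: a plain string check for the marker comment before the parser is ever invoked. This is a different technique from inspecting the HIDDEN channel described above, and it only recognises a marker at the very start of the query. The name split_marker and the tolerance for extra whitespace are assumptions made for the example.

```python
import re

# Matches a leading /* PARSE_PHASE_SKIPPED=TRUE */ comment, case-insensitively.
SKIP_MARKER = re.compile(
    r"^\s*/\*\s*PARSE_PHASE_SKIPPED\s*=\s*TRUE\s*\*/\s*",
    re.IGNORECASE,
)


def split_marker(sql: str) -> tuple[bool, str]:
    """Return (skip_requested, sql_without_marker)."""
    match = SKIP_MARKER.match(sql)
    if match:
        return True, sql[match.end():]
    return False, sql
```

A caller could then hand unmarked queries to the parser and pass marked ones straight through to execution.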
Let me see if I understand your question correctly. In your environment you run SQL queries (not "SQLs", btw.), which may contain data that must not reach the server as is. It doesn't matter if that is sensitive data or something else. All that matters is that you want to replace the text in the queries.
For that you parse the queries and rewrite them, before sending them to the server. However, you don't want to do that for all queries, but only for specific ones. And you came up with the idea to mark queries (or query parts) that must not be transformed, with a special comment. Does this match your intention so far?
Now I wonder why you want to accomplish that on the query (parsing) level. It's not the parsing you want to modify but the semantic handling of the parse result (here the parse tree, as Mike Cargal already mentioned). So, in my opinion you don't need to introduce special markup for your queries, but should instead define criteria that indicate which data must be replaced.
When you think about that you will probably realise that data for specific fields (columns) in specific tables need to be replaced. You can actually keep a list of schema/table/columns tuples, which tell your rewriter if a value must be rewritten. Everything else stays as it is.
ANTLR4 has nothing to do in this process. It's all to be done in the semantic phase (the processing of the parse tree using a parse tree listener). In this phase you have to collect all column references that are used in a query. Then you compare that list with the list for the rewriter. If a column reference matches, the rewriter has to rewrite the text for it in the query.
That task is however non-trivial, because of nested queries (subqueries, where inner queries can reference tables from an outer query). This is btw. pretty similar to the way code completion works, where you have to provide a possible column list for all mentioned tables in a query. That's why I have written (C++) code to collect such references in MySQL Workbench's SQL code completion.
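The lookup step described in this answer can be sketched as follows. The sketch assumes the column references have already been collected from the parse tree by a listener (the hard part, not shown) and are available as (table, column) tuples; the names SENSITIVE_COLUMNS and columns_to_rewrite are hypothetical.

```python
# Assumed configuration: which (table, column) pairs hold sensitive data.
SENSITIVE_COLUMNS = {("user", "id_no")}


def columns_to_rewrite(column_refs):
    """Return the collected references that the rewriter must handle."""
    return [ref for ref in column_refs if ref in SENSITIVE_COLUMNS]


def needs_rewrite(column_refs) -> bool:
    """True if at least one referenced column is on the sensitive list."""
    return bool(columns_to_rewrite(column_refs))
```

Queries for which needs_rewrite returns False are passed on unchanged, so no special markup in the query text is required.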

Custom, user-definable "wildcard" constants in SQL database search -- possible?

My client is making database searches using a django webapp that I've written. The query sends a regex search to the database and outputs the results.
Because the regex searches can be pretty long and unintuitive, the client has asked for certain custom "wildcards" to be created for the regex searches. For example:
Ω := [^aeiou] (all non-vowels)
etc.
This could be achieved with a simple permanent string substitution in the query, something like
query = query.replace("Ω", "[^aeiou]")
for all the elements in the substitution list. This seems like it should be safe, but I'm not really sure.
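One way to write the substitution over a whole list, sketched under the assumption that the wildcards live in a plain dict (the WILDCARDS table and the expand_wildcards name are made up here). It does the replacement in a single pass, so a replacement that happens to contain another symbol is not expanded a second time.

```python
import re

# Hypothetical wildcard table: symbol -> regex fragment.
WILDCARDS = {"Ω": "[^aeiou]"}


def expand_wildcards(query: str, wildcards=WILDCARDS) -> str:
    """Replace every wildcard symbol in one pass over the query."""
    if not wildcards:
        return query
    # Build an alternation of the (escaped) symbols.
    symbols = re.compile("|".join(re.escape(s) for s in wildcards))
    return symbols.sub(lambda m: wildcards[m.group(0)], query)
```

Whether this is safe depends on what happens to the expanded string afterwards: it is still user-controlled input and should reach the database as a bound parameter.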
He has also asked that it be possible for the user to define custom wildcards for their searches on the fly. So that there would be some other input box where a user could define
∫ := some other regex
And to store them you might create a model
class RegexWildcard(models.Model):
symbol = ...
replacement = ...
I'm personally a bit wary of this, because it does not seem to add a whole lot of functionality, but does seem to add a lot of complexity and potential problems to the code. Clients can now write their queries to a db. Can they overwrite each other's symbols?
That I haven't seen this done anywhere before also makes me kind of wary of the idea.
Is this possible? Desirable? A great idea? A terrible idea? Resources and any guidance appreciated.
Well, you're getting paid by the hour....
I don't see how involving the Greek alphabet is to anyone's advantage. If the queries are stored anywhere, everyone approaching the system would have to learn the new syntax to understand them. Plus, there's the problem of how to type the special symbols.
If the client creates complex regular expressions they'd like to be able to reuse, that's understandable. Your application could maintain a list of such expressions that the user could add to and choose from. Notionally, the user would "click on" an expression, and it would be inserted into the query.
The saved expressions could have user-defined names, to make them easier to remember and refer to. And you could define a syntax that referenced them, something otherwise invalid in SQL, such as ::name. Before submitting the query to the DBMS, you substitute the regex for the name.
You still have the problem of choosing good names, and training.
To prevent malformed SQL, I imagine you'll want to ensure the regex is valid. You wouldn't want your system to store a ; drop table CUSTOMERS; as a "regular expression"! You'll either have to validate the expression or, if you can, treat the regex as data in a parameterized query.
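Both precautions can be sketched together in Python with an in-memory SQLite database. SQLite has no built-in REGEXP implementation, so the sketch registers one backed by Python's re module; the words table and the function names are invented for the example, and Python's regex dialect may differ from your database's.

```python
import re
import sqlite3


def is_valid_regex(pattern: str) -> bool:
    """True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): pattern first.
    return re.search(pattern, value) is not None


def search(conn, pattern: str):
    if not is_valid_regex(pattern):
        raise ValueError(f"not a valid regular expression: {pattern!r}")
    # The regex travels as a bound parameter, never as SQL text.
    rows = conn.execute(
        "SELECT word FROM words WHERE word REGEXP ? ORDER BY word", (pattern,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("sky",), ("tree",), ("; drop table words;",)],
)
```

With the regex bound as data, a "pattern" such as ; drop table words; is merely matched against the column as text and cannot change the statement.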
The real question to me, though, is why you're in the vicinity of standardized regex queries. That need suggests a database design issue: it suggests the column being queried is composed of composite data, and should be represented as multiple columns that can be queried directly, without using regular expressions.

Is this method of building dynamic SQL vulnerable to SQL injection or bad for performance?

I would like to build a safe dynamic select statement that can handle multiple WHERE clauses.
For example the base SQL would look like:
SELECT * FROM Books Where Type='Novel'
I would pass the function something like:
SafeDynamicSQL("author,=,Herman Melville","pages,>,1000");
Which would sanitize inputs and concatenate like:
SELECT * FROM Books Where Type='Novel' AND author=@author AND pages>@pages
The function would sanitize the column name by checking against an array of predefined column names. The operator would only be allowed to be >, < or =. The value would be added as a normal parameter.
Would this still be vulnerable to SQL injection?
There will be some string manipulation and small loops which will affect performance, but my thinking is that this will only take a few milliseconds compared to the request, which on average takes 200ms. Would this tax the server more than I am thinking if these requests are made about once a second?
I know this isn't best practice by any means, but it will greatly speed up development. Please give me any other reasons why this could be a bad idea.
It looks like you're reinventing any number of existing ORM solutions which offer a similar API for creating WHERE clauses.
The answer to your question hinges on what you mean by "The value would be added as a normal parameter." If by that you mean performing string concatenation to produce the string you showed then yes, that would still be subject to SQL injection attack. If you mean using an actual parameterized query then you would be safe. In the latter case, you would produce something like
SELECT * FROM Books Where Type='Novel' AND author=? AND pages > ?
and then bind that to a list of values like ['Herman Melville', 1000]. Exactly what it would look like depends on what programming language you're using.
Finally, if you pursue this path I would strongly recommend changing from comma-delimited arguments to three separate arguments, you'd save yourself a lot of programming time.
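A minimal sketch of such a builder in Python, taking three separate values per condition as suggested. The allowed column list and the build_query name are assumptions for illustration, and the ? placeholder style depends on your database driver.

```python
ALLOWED_COLUMNS = {"author", "pages"}
ALLOWED_OPERATORS = {"=", "<", ">"}


def build_query(filters):
    """Build (sql, params) from an iterable of (column, operator, value)."""
    sql = "SELECT * FROM Books WHERE Type = 'Novel'"
    params = []
    for column, operator, value in filters:
        # Identifiers and operators cannot be bound, so they are
        # only accepted if they appear on a fixed allow-list.
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {column!r}")
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator!r}")
        sql += f" AND {column} {operator} ?"
        params.append(value)  # the value itself is never put into the SQL text
    return sql, params
```

The returned pair is what you would hand to the driver's execute call; only allow-listed text is ever concatenated.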
Pretty much any code that appends together (or interpolates) strings to create SQL is bad form from a security point of view, and is probably subject to some SQLi attack vector. Just use bound parameters and avoid the entire problem; avoiding SQL injection is super-easy.

ColdFusion adding extra quotes when constructing database queries in strings

I am coding in ColdFusion, but trying to stay in cfscript, so I have a function that allows me to pass in a query to run it with
<cfquery blah >
#query#
</cfquery>
Somehow though, when I construct my queries with sql = "SELECT * FROM a WHERE b='#c#'" and pass it in, ColdFusion has replaced the single quotes with 2 single quotes. so it becomes WHERE b=''c'' in the final query.
I have tried creating the strings a lot of different ways, but I cannot get it to leave just one quote. Even doing a string replace has no effect.
Any idea why this is happening? It is ruining my hopes of living in cfscript for the duration of this project
ColdFusion, by design, escapes single quotes when interpolating variables within <cfquery> tags.
To do what you want, you need to use the PreserveSingleQuotes() function.
<cfquery ...>#PreserveSingleQuotes(query)#</cfquery>
This doesn't address, however, the danger of SQL injection to which you are exposing yourself.
Using <cfqueryparam> also allows your database to cache the query, which in most cases will improve performance.
It might be helpful to read an old Ben Forta column and a recent post by Brad Wood for more information about the benefits of using <cfqueryparam>.
The answer to your question, as others have said, is using preserveSingleQuotes(...)
However, the solution you actually want, is not to dynamically build your queries in this fashion. It's Bad Bad Bad.
Put your SQL inside the cfquery tags, with any ifs/switches/etc as appropriate, and ensure all CF variables use the cfqueryparam tag.
(Note, if you use variables in the ORDER BY clause, you'll need to manually escape any variables; cfqueryparam can't be used in ORDER BY clauses)
ColdFusion automatically escapes single quotes in <cfquery> tags when you use the following syntax:
SELECT * FROM TABLE WHERE Foo='#Foo#'
In case you would want to preserve single quotes in #Foo# you must call #PreserveSingleQuotes(Foo)#.
Be aware that the automatic escaping works only for variable values, not for function results.
SELECT * FROM TABLE WHERE Foo='#LCase(Foo)#' /* Single quotes are retained! */
In that light, the function PreserveSingleQuotes() (see Adobe LiveDocs) is not much more than a "null operation" on the value - turning it into a function result to bypass auto-escaping.
I voted up Dave's answer since I thought he did a good job.
I'd like to add however that there are also several different tools designed for ColdFusion that can simplify a lot of the common SQL tasks you're likely to perform. There's a very light-weight tool called DataMgr written by Steve Bryant, as well as Transfer from Mark Mandel, Reactor which was originally created by Doug Hughes and one I developed called DataFaucet. Each of these has its own strengths and weaknesses. Personally I think you're apt to consider DataFaucet to be the one that will give you the best ability to stay in cfscript, with a variety of syntaxes for building different kinds of queries.
Here are a few examples:
qry = datasource.select_avg_price_as_avgprice_from_products(); //(requires CF8)
qry = datasource.select("avg(price) as avgprice","products");
qry = datasource.getSelect("avg(price) as avgprice","products").filter("categoryid",url.categoryid).execute();
qry = datasource.getSelect(table="products",orderby="productname").filter("categoryid",url.categoryid).execute();
The framework ensures that cfqueryparam is always used with these filter statements to prevent sql-injection attacks, and there are similar syntaxes for insert, update and delete statements. (There are a couple of simple rules to avoid sql-injection.)

Is there some way to inject SQL even if the ' character is deleted?

If I remove all the ' characters from a SQL query, is there some other way to do a SQL injection attack on the database?
How can it be done? Can anyone give me examples?
Yes, there is. An excerpt from Wikipedia
"SELECT * FROM data WHERE id = " + a_variable + ";"
It is clear from this statement that the author intended a_variable to be a number correlating to the "id" field. However, if it is in fact a string then the end user may manipulate the statement as they choose, thereby bypassing the need for escape characters. For example, setting a_variable to
1;DROP TABLE users
will drop (delete) the "users" table from the database, since the SQL would be rendered as follows:
SELECT * FROM DATA WHERE id=1;DROP TABLE users;
SQL injection is not a simple attack to fight. I would do very careful research if I were you.
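The same quote-free injection can be demonstrated with a small Python sketch against an in-memory SQLite database. It uses the payload 1 OR 1=1 rather than 1;DROP TABLE users, because Python's sqlite3 execute() refuses to run more than one statement per call; the table contents and function names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, secret TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [(1, "alpha"), (2, "beta")])


def unsafe_lookup(a_variable: str):
    # Vulnerable: the input becomes part of the SQL text.
    sql = "SELECT secret FROM data WHERE id = " + a_variable + " ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


def safe_lookup(a_variable):
    # The input is bound as a value and can never alter the statement.
    sql = "SELECT secret FROM data WHERE id = ? ORDER BY id"
    return [row[0] for row in conn.execute(sql, (a_variable,))]
```

Passing "1 OR 1=1" to unsafe_lookup returns every row without using a single apostrophe, while safe_lookup treats the same string as a value that matches no id.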
Yes, depending on the statement you are using. You are better off protecting yourself either by using Stored Procedures, or at least parameterised queries.
See Wikipedia for prevention samples.
I suggest you pass the variables as parameters, and not build your own SQL. Otherwise there will always be a way to do a SQL injection, in ways that we are currently unaware of.
The code you create is then something like:
// Not tested
var sql = "SELECT * FROM data WHERE id = @id";
var cmd = new SqlCommand(sql, myConnection);
cmd.Parameters.AddWithValue("@id", request.getParameter("id"));
If you have a name like mine, with an ' in it, it is very annoying when all ' characters are removed or marked as invalid.
You also might want to look at this Stackoverflow question about SQL Injections.
Yes, it is definitely possible.
If you have a form where you expect an integer to build your next SELECT statement, then anything like the following can be entered:
SELECT * FROM thingy WHERE attributeID=
5 (good answer, no problem)
5; DROP table users; (bad, bad, bad...)
The following website details further classical SQL injection techniques: SQL Injection cheat sheet.
Using parametrized queries or stored procedures is not any better. These are just pre-made queries using the passed parameters, which can be source of injection just as well. It is also described on this page: Attacking Stored Procedures in SQL.
Now, if you suppress the single quote, you prevent only a given set of attacks, but not all of them.
As always, do not trust data coming from the outside. Filter them at these 3 levels:
Interface level for obvious stuff (a drop down select list is better than a free text field)
Logical level for checks related to data nature (int, string, length), permissions (can this type of data be used by this user at this page)...
Database access level (escape single quotes...).
Have fun and don't forget to check Wikipedia for answers.
Parameterized inline SQL or parameterized stored procedures is the best way to protect yourself. As others have pointed out, simply stripping/escaping the single quote character is not enough.
You will notice that I specifically talk about "parameterized" stored procedures. Simply using a stored procedure is not enough either if you revert to concatenating the procedure's passed parameters together. In other words, wrapping the exact same vulnerable SQL statement in a stored procedure does not make it any safer. You need to use parameters in your stored procedure just like you would with inline SQL.
Also- even if you do just look for the apostrophe, you don't want to remove it. You want to escape it. You do that by replacing every apostrophe with two apostrophes.
But parameterized queries/stored procedures are so much better.
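For completeness, the doubling rule looks like this as a Python sketch. The helper name escape_sql_literal is hypothetical, and the rule shown is only the SQL-standard one; some databases also give backslashes a special meaning, which this does not handle, which is one more reason to prefer parameters.

```python
def escape_sql_literal(value: str) -> str:
    """Wrap a string in single quotes, doubling any apostrophe inside it."""
    return "'" + value.replace("'", "''") + "'"
```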
Since this is a relatively old question, I won't bother writing up a complete and comprehensive answer, since most aspects of that answer have been mentioned here by one poster or another.
I do find it necessary, however, to bring up another issue that was not touched on by anyone here - SQL Smuggling. In certain situations, it is possible to "smuggle" the quote character ' into your query even if you tried to remove it. In fact, this may be possible even if you used proper commands, parameters, Stored Procedures, etc.
Check out the full research paper at http://www.comsecglobal.com/FrameWork/Upload/SQL_Smuggling.pdf (disclosure, I was the primary researcher on this) or just google "SQL Smuggling".
. . . uh about 50000000 other ways
maybe something like 5; drop table employees; --
resulting sql may be something like:
select * from somewhere where number = 5; drop table employees; -- and sadfsf
(-- starts a comment)
Yes, absolutely: depending on your SQL dialect and such, there are many ways to achieve injection that do not use the apostrophe.
The only reliable defense against SQL injection attacks is using the parameterized SQL statement support offered by your database interface.
Rather than trying to figure out which characters to filter out, I'd stick to parametrized queries instead, and remove the problem entirely.
It depends on how you put together the query, but in essence yes.
For example, in Java if you were to do this (deliberately egregious example):
String query = "SELECT name_ from Customer WHERE ID = " + request.getParameter("id");
then there's a good chance you are opening yourself up to an injection attack.
Java has some useful tools to protect against these, such as PreparedStatements (where you pass in a string like "SELECT name_ from Customer WHERE ID = ?" and the JDBC layer handles escapes while replacing the ? tokens for you), but some other languages are not so helpful for this.
The thing is, apostrophes may be genuine input, and you have to escape them by doubling them up when you are using inline SQL in your code. What you are looking for is a regex pattern like:
;.*--
A semi colon used to prematurely end the genuine statement, some injected SQL followed by a double hyphen to comment out the trailing SQL from the original genuine statement. The hyphens may be omitted in the attack.
Therefore the answer is: No, simply removing apostrophes does not guarantee safety from SQL injection.
I can only repeat what others have said. Parametrized SQL is the way to go. Sure, it is a bit of a pain in the butt coding it, but once you have done it once, it isn't difficult to cut and paste that code and make the modifications you need. We have a lot of .Net applications that allow web site visitors to specify a whole range of search criteria, and the code builds the SQL Select statement on the fly, but everything that could have been entered by a user goes into a parameter.
When you are expecting a numeric parameter, you should always be validating the input to make sure it's numeric. Beyond helping to protect against injection, the validation step will make the app more user friendly.
If you ever receive id = "hello" when you expected id = 1044, it's always better to return a useful error to the user instead of letting the database return an error.
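A sketch of that validation step in Python. The function name parse_id and the error wording are made up; the check deliberately accepts only plain ASCII digits.

```python
import re


def parse_id(raw: str) -> int:
    """Return the id as an int, or raise ValueError with a readable message."""
    # [0-9]+ rejects signs, spaces, underscores and any injected SQL.
    if not re.fullmatch(r"[0-9]+", raw):
        raise ValueError(f"id must be a whole number, got {raw!r}")
    return int(raw)
```

The caller can turn the ValueError into a friendly message for the user, and the validated integer should still be passed to the database as a parameter.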