Custom, user-definable "wildcard" constants in SQL database search -- possible?

Custom, user-definable "wildcard" constants in SQL database search -- possible? - sql

My client is making database searches using a django webapp that I've written. The query sends a regex search to the database and outputs the results.
Because the regex searches can be pretty long and unintuitive, the client has asked for certain custom "wildcards" to be created for the regex searches. For example.
Ω := [^aeiou] (all non-vowels)
etc.
This could be achieved with a simple permanent string substitution in the query, something like
query = query.replace("Ω", "[^aeiou]")
for all the elements in the substitution list. This seems like it should be safe, but I'm not really sure.
He has also asked that it be possible for the user to define custom wildcards for their searches on the fly. So that there would be some other input box where a user could define
∫ := some other regex
And to store them you might create a model
class RegexWildcard(models.Model):
symbol = ...
replacement = ...
I'm personally a bit wary of this, because it does not seem to add a whole lot of functionality, but does seem to add a lot of complexity and potential problems to the code. Clients can now write their queries to a db. Can they overwrite each other's symbols?
That I haven't seen this done anywhere before also makes me kind of wary of the idea.
Is this possible? Desirable? A great idea? A terrible idea? Resources and any guidance appreciated.

Well, you're getting paid by the hour....
I don't see how involving the Greek alphabet is to anyone's advantage. If the queries are stored anywhere, everyone approaching the system would have to learn the new syntax to understand them. Plus, there's the problem of how to type the special symbols.
If the client creates complex regular expressions they'd like to be able to reuse, that's understandable. Your application could maintain a list of such expressions that the user could add to and choose from. Notionally, the user would "click on" an expression, and it would be inserted into the query.
The saved expressions could have user-defined names, to make them easier to remember and refer to. And you could define a syntax that referenced them, something otherwise invalid in SQL, such as ::name. Before submitting the query to the DBMS, you substitute the regex for the name.
You still have the problem of choosing good names, and training.
To prevent malformed SQL, I imagine you'll want to ensure the regex is valid. You wouldn't want your system to store a ; drop table CUSTOMERS; as a "regular expression"! You'll either have to validate the expression or, if you can, treat the regex as data in a parameterized query.
The real question to me, though, is why you're in the vicinity of standardized regex queries. That need suggests a database design issue: it suggests the column being queried is composed of composite data, and should be represented as multiple columns that can be queried directly, without using regular expressions.

Related

Does Laminas Framework DB Adapter automatically strip tags etc. like mysqli_real_escape_string?

I'm using Laminas framework (formerly ZEND Framework) and would like to know if the DB Adapter automatically strip tags etc. at Insert and Select Statements / Execution ($statement->prepare() / $statement->execute()->current()) to avoid SQL injection?
If not what is the best method to implement it while working with LAMINAS DB Adapter? A Wrapper function to get a clean SQL statement?
Before, I was using a wrapper function that cleaned the input with mysqli_real_escape_string

You'd need to provide some sample code in order to get a more accurate answer to this question, however the topic you're talking of here is parameterisation. If you use laminas-db properly, you won't have to call mysqli_real_escape_string or similar to protect against SQL injection. However, you should still be very wary of any potential cross site scripting (XSS) issues if any data written to your database eventually gets echoed back to the browser. See the section below on that.
How to NOT do it
If you do anything like this, laminas-db won't parameterise your query for you, and you'll be susceptible to SQL injection:
// DO NOT DO THIS
$idToSelect = $_POST['id'];
$myQuery = "select * from myTable where id = $idToSelect";
$statement->prepare($myQuery);
$results = $statement->execute();
How to do it properly
Personally, I make use of the Laminas TableGatewayInterface abstraction, because it generally makes life a lot easier. If you want to operate at the lower level, something like this is a much safer way of doing it; it'll ensure that you don't accidentally add a whole heap of SQL injection vulnerabilities to your code
Named Parameters
$idToSelect = $_POST['id'];
$myQuery = "select * from myTable where id = :id";
$statement->prepare($myQuery);
$results = $statement->execute(['id' => $idToSelect]);
Anonymous Parameters
$idToSelect = $_POST['id'];
$myQuery = "select * from myTable where id = ?";
$statement->prepare($myQuery);
$results = $statement->execute([$idToSelect]);
If you do it in a way similar to this, you won't need any additional wrapper functions or anything, the underlying components will take care of it for you. You can use either the named or anonymous parameter version, however I personally prefer the named parameters, as I find it aids code readability and generally makes for easier debugging.
Sanitising Input/Escaping Output
Be very careful with sanitising input. Not only can it lead to a false sense of security, it can also introduce subtle and very frustrating behaviour that the end user will perceive as a bug.
Take this example: Someone tries to sign up to a site under the name "Bob & Jane Smith", however the input sanitisation routine spots the ampersand character and strips it out. All of a sudden, Bob has gained a middle name they never knew they had.
One of the biggest problems with input sanitisation is context. Some characters can be unsafe in one given context, but entirely benign in another. Furthermore, if you need to support the entire unicode character set (as we generally should be unless we have a very good reason to not), trying to sanitise out control characters in data that will potentially pass through several different systems (html, javascript, html, php, sql, etc) becomes a very hard problem due to the way multibyte characters are handled.
When it comes to MySQL, parameterisation pretty much takes care of all of these issues, so we don't need to worry about the "Little Bobby Tables" scenario.
It's much easier to flip all of this on its head and escape output. In theory, the code that's rendering the output data knows in which context it's destined for, so can therefore escape the output for said context, replacing control characters with safe alternatives.
For example, let's say someone's put the following in their bio on a fictional social networking site:
My name is Bob, I like to write javascript
<script>window.postStatusUpdate("Bob is my hero ♥️");</script>
This would need to be escaped in different ways, depending upon the context in which this data is to be displayed. If it's to be displayed in a browser, the escaper needs to replace the angle brackets with an escaped version. If it's going out as JSON output, the double quotes need to be backslash escaped, etc.
I appreciate this is a rather windy answer to a seemingly simple question, however these things should certainly be considered earlier on in the application design; It can save you heaps of time and frustration further on down the line.

In vb.net, I need to split a string multiple ways to create a list of commands

Lets say I have a string that contains stored procedure names, their parameters, and values, like this:
Dim sprocs As String = "Procedure: NameOfProcedure, Parameters: #Param1(int), #Param2(varchar) Values: Something, SomethingElse ; Procedure: NameOfProcedure, Parameters: #Param1(int), #Param2(varchar) Values: Something, Something Else"
My point of the above is I need to be able to take a user defined string that can store multiple procedures separated by... something. In this case I used a ; semicolon.
What I need to be able to do is a FOR EACH on these in vb.net.
I know how to used stored procedures and set parameters... I'm just not sure of the best way to go about this as far as how to split up that string to match parameters and values, etc.
So a bad example:
Dim ConnectionString As String = "String...removed"
Dim conn As New System.Data.SqlClient.SqlConnection(ConnectionString)
' So I need a for each here
Dim cmd As New System.Data.SqlClient.SqlCommand(ProcedureName)
With cmd
.Connection = conn
.CommandType = CommandType.StoredProcedure
' and a for each here on the parameters, datatypes, and their values
.Parameters.Add(ParamName, DataType).Value = TheValue
.Connection.Open()
.BeginExecuteReader()
.Connection.Close()
End With
So the splitting and parsing out the string is my biggest roadblock on this. I'm not sure how to separate it out at this level, match parameters, datatypes and their values all together. If there's a better way to have the end-user organize the string to make parsing and splitting easier, please share.
Any suggestions or examples are greatly appreciated.
Thank you very much.

There's a temptation here to use String.Split() or a regular expression. Neither solution is actually very good for this. To understand what I mean, take a quick moment to search Google for the many thousands of posts of programmers working with CSV running into edge cases where their regex didn't quite cut it. For this purpose, your semi-colon delimiter is not significantly different than a fancy comma.
Depending on the regex engine, it may actually be possible to build an expression that does this, especially if you can put certain constraints on the data. However, such expressions tend to be very tough to build and maintain, and they tend to require backtracking, which hurts performance.
The next alternative to examine is a dedicated parser. For CSV data, this is the panacea. There a number of good ones available for most any platform that you can just plug in, and they're typically pretty easy to work with. However, what you have, while similar, goes a bit beyond the level of what these parsers are written to handle.
One related solution is write your own csv-like parser. This may be your best option, and if you choose this route all I can do is recommend that you build a state machine to go character by character through the input string.
The next rung up the ladder is to use a generic parser/interpretter that relies on something like ANTLR. I'm haven't personally dug very deep into that well, but I know the tools are out there.
One related option, that I think is probably your best option here overall, is a Domain Specific Language (DSL). This is something that may be a little easier to get started with than a tool like ANTLR, especially because it's already built into Visual Studio. However, this option requires that you be able to control your input format.
If all this sounds like more work than it should be, you're right. The good news is that is possible to skip all this. The trick is that you must first impose artificial constraints on your data. For example, a constraint that parameter values are not allowed to contain ; characters means you can get back to a String.Split() solution to divide out each separate procedure call. Now your function is no longer generally applicable, but then maybe is doesn't need to be. When you create these constraints, make sure they are documented, so that when something violates them and code explodes you'll have clear definitions for why.

Expanding an arbitrary Lucene Query

I am using Lucene 3.6.1. I receive a query from a user. This query may contain + or - operators, and may also contain phrases. In certain circumstances, I would like to expand the query by adding some extra terms that I compute. These terms are optional. However, any required include/exclude constraints specified by the user must be respected.
My initial strategy was to create a BooleanQuery, add a clause to it that contains the parsed user query, and then add further clauses that contain my expansion terms. The expansion terms would all be added as Occur.SHOULD. My question is how to constrain the user's query. I can imagine three possibilities:
The user's query contains no operators, which means I can include it as an Occur.SHOULD clause.
The user's query contains a + operator, so I need to include it as an Occur.MUST clause.
The user's query contains a - operator, but also other terms: Do I still include it as an Occur.MUST clause?
The question implicit in these three choices is how do I tell which condition is appropriate? I suppose I can rewrite the query and test for BooleanQuery instances, but that seems brittle.
I suppose can also try to tactic of creating a single string from the user's input and from my expansion terms, like this:
(fld1:userterm1 userterm2 -fld2:userterm3 +userterm4)^10 (fld1:expterm1)^8 (fld2:expterm2)^7 ...
Is this the best way to go? Or is there some elegant programmatic solution?

Okay, Not sure how useful this answer will be, but can't seem to come up with a hard and fast answer here, so I'll list a couple possibilities that come to mind:
First, a problem:
Modifying the query to look like:
(userquery) (other) (stuff)
I makes some sense to add the + with he rules you've shown, but a '-' prohibited term will be hard to respect correctly, since (query -prohibition) (other) will allow matches on other with prohibition present as well, and +(query -prohibition) (other) will require 'query' be matched.
The only way I see to really do that part right is to propagate the prohibited term into your automatically added terms as well, or extract it out to a parent query layer, more like (query -prohibition) --> (query) (other) -(prohibition).
And with user entered queries of arbitrary complexity, that may not be a great strategy.
If you want to tackle it by modifying the query string, then you should probably just add any terms to the end of the query. Nothing more to it.
I don't believe
(fld1:userterm1 userterm2 -fld2:userterm3 +userterm4)^10 (fld1:expterm1)^8 (fld2:expterm2)^7 ...
Is satisfactory, because userterm4 is only required within it's subquery, but a match Only on expterm1 is still acceptable. However, a query like:
fld1:userterm1 userterm2 -fld2:userterm3 +userterm4 (fld1:expterm1)^.8 (fld2:expterm2)^.7 ...
Should, I think, satisfy your needs, and prevents you from having to worry about the internals of your queryparser. I think this is the best approach.
I can also see logic in a query structured like
+(parsed userquery) (other stuff)
Effectively, always requiring a match on the user query. Lucene implicitly does this, in a sense, as it won't return a result that matches no term, even if no required fields are present in the query. This would then be using your added terms to impact scoring, rather than return a larger set of documents. This doesn't quite address what your asking, but might be worth considering.
If, despite the aforementioned problems of applying them, you still want to detect '+' and '-' operators, I think it can be reasonably assumed that a StandardQueryParser will return a BooleanQuery at base level for any query that you need to check for these operands on. You might have to worry about, for instance, DisjunctionMaxQueries, as well as what will happen when you have a simple query with an operator, like:
+myterm
I don't know if QueryParser would simply return a TermQuery, losing the plus (since it would be redundant without another term present). Concerns like that make me hesitant to address it in this way.
Similarly, attempting to detect these values from the query string must make assumptions about how things are parsed, and could become complicated.
To sumamrize, I think the best options are to, either: add terms to the end of the raw query string before doing any parsing, or treat the user query as atomic, and define the appropriate booleanclause independant of it's contents when adding to a boolean clause wrapping it with whatever other queries you need to include.

What must be escaped in SQL?

When using SQL in conjunction with another language what data must be escaped? I was just reading the question here and it was my understanding that only data from the user must be escaped.
Also must all SQL statements be escaped? e.g. INSERT, UPDATE and SELECT

EVERY type of query in SQL must be properly escaped. And not only "user" data. It's entirely possible to inject YOURSELF if you're not careful.
e.g. in pseudo-code:
$name = sql_get_query("SELECT lastname FROM sometable");
sql_query("INSERT INTO othertable (badguy) VALUES ('$name')");
That data never touched the 'user', it was never submitted by the user, but it's still a vulnerability - consider what happens if the user's last name is O'Brien.

Most programming languages provide code for connecting to databases in a uniform way (for example JDBC in Java and DBI in Perl). These provide automatic techniques for doing any necessary escaping using Prepared Statements.

All SQL queries should be properly sanitized and there are various ways of doing it.
You need to prevent the user from trying to exploit your code using SQL Injection.
Injections can be made in various ways, for example through user input, server variables and cookie modifications.
Given a query like:
"SELECT * FROM tablename WHERE username= <user input> "
If the user input is not escaped, the user could do something like
' or '1'='1
Executing the query with this input will actually make it always true, possibly exposing sensitive data to the attacker. But there are many other, much worse scenarios injection can be used for.
You should take a look at the OWASP SQL Injection Guide. They have a nice overview of how to prevent those situations and various ways of dealing with it.

I also think it largely depends on what you consider 'user data' to be or indeed orignate from. I personally consider user data as data available (even if this is only through exploitations) in the public domain, i.e. can be changed by 'a' user even if it's not 'the' user.
Marc B makes a good point however that in certain circumstances you may dirty your own data, so I guess it's always better to be safer than sorry in regards to sql data.
I would note that in regards to direct user input (i.e. from web forms, etc) you should always have an additional layer server side validation before the data even gets near a sql query.

Regular expression to match common SQL syntax?

I was writing some Unit tests last week for a piece of code that generated some SQL statements.
I was trying to figure out a regex to match SELECT, INSERT and UPDATE syntax so I could verify that my methods were generating valid SQL, and after 3-4 hours of searching and messing around with various regex editors I gave up.
I managed to get partial matches but because a section in quotes can contain any characters it quickly expands to match the whole statement.
Any help would be appreciated, I'm not very good with regular expressions but I'd like to learn more about them.
By the way it's C# RegEx that I'm after.
Clarification
I don't want to need access to a database as this is part of a Unit test and I don't wan't to have to maintain a database to test my code. which may live longer than the project.

Regular expressions can match languages only a finite state automaton can parse, which is very limited, whereas SQL is a syntax. It can be demonstrated you can't validate SQL with a regex. So, you can stop trying.

SQL is a type-2 grammar, it is too powerful to be described by regular expressions. It's the same as if you decided to generate C# code and then validate it without invoking a compiler. Database engine in general is too complex to be easily stubbed.
That said, you may try ANTLR's SQL grammars.

As far as I know this is beyond regex and your getting close to the dark arts of BnF and compilers.
http://savage.net.au/SQL/
Same things happens to people who want to do correct syntax highlighting. You start cramming things into regex and then you end up writing a compiler...

I had the same problem - an approach that would work for all the more standard sql statements would be to spin up an in-memory Sqlite database and issue the query against it, if you get back a "table does not exist" error, then your query parsed properly.

Off the top of my head: Couldn't you pass the generated SQL to a database and use EXPLAIN on them and catch any exceptions which would indicate poorly formed SQL?

Have you tried the lazy selectors. Rather than match as much as possible, they match as little as possible which is probably what you need for quotes.

To validate the queries, just run them with SET NOEXEC ON, that is how Entreprise Manager does it when you parse a query without executing it.
Besides if you are using regex to validate sql queries, you can be almost certain that you will miss some corner cases, or that the query is not valid from other reasons, even if it's syntactically correct.

I suggest creating a database with the same schema, possibly using an embedded sql engine, and passing the sql to that.

I don't think that you even need to have the schema created to be able to validate the statement, because the system will not try to resolve object_name etc until it has successfully parsed the statement.
With Oracle as an example, you would certainly get an error if you did:
select * from non_existant_table;
In this case, "ORA-00942: table or view does not exist".
However if you execute:
select * frm non_existant_table;
Then you'll get a syntax error, "ORA-00923: FROM keyword not found where expected".
It ought to be possible to classify errors into syntax parsing errors that indicate incorrect syntax and errors relating to tables name and permissions etc..
Add to that the problem of different RDBMSs and even different versions allowing different syntaxes and I think you really have to go to the db engine for this task.

There are ANTLR grammars to parse SQL. It's really a better idea to use an in memory database or a very lightweight database such as sqlite. It seems wasteful to me to test whether the SQL is valid from a parsing standpoint, and much more useful to check the table and column names and the specifics of your query.

The best way is to validate the parameters used to create the query, rather than the query itself. A function that receives the variables can check the length of the strings, valid numbers, valid emails or whatever. You can use regular expressions to do this validations.

public bool IsValid(string sql)
{
string pattern = #"SELECT\s.*FROM\s.*WHERE\s.*";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
return rgx.IsMatch(sql);
}

I am assuming you did something like .\* try instead [^"]* that will keep you from eating the whole line. It still will give false positives on cases where you have \ inside your strings.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas