SQL Concatenation filling up tempDB

We are attempting to concatenate possibly thousands of rows of text in SQL with a single query. The query that we currently have looks like this:
DECLARE @concatText NVARCHAR(MAX)
SET @concatText = ''
UPDATE TOP (SELECT MAX(PageNumber) + 1 FROM #OrderedPages) [#OrderedPages]
SET @concatText = @concatText + [ColumnText] + '
'
WHERE (RTRIM(LTRIM([ColumnText])) != '')
This is working perfectly fine from a functional standpoint. The only issue we're having is that sometimes the ColumnText can be a few kilobytes in length. As a result, we're filling up tempDB when we have thousands of these rows.
The best explanation we have come up with is that as we do these updates to @concatText, SQL Server is using implicit transactions, so the strings are effectively immutable.
We are trying to figure out a good way of solving this problem and so far we have two possible solutions:
1) Do the concatenation in .NET. This is an OK option, but that's a lot of data that may go back across the wire.
2) Use .WRITE, which operates in a similar fashion to .NET's StringBuilder.Append method. I can't figure out the syntax for this, as Books Online doesn't cover this level of SQL shenanigans.
This leads me to the question: Will .WRITE work? If so, what's the syntax? If not, are there any other ways to do this without sending data to .NET? We can't use FOR XML because our text may contain illegal XML characters.
Thanks in advance.

I'd look at using CLR integration, as suggested in @Martin's comment. A CLR aggregate function might be just the ticket.

What exactly is filling up tempdb? It cannot be @concatText = @concatText + [ColumnText]: there is no immutability involved, and the @concatText variable will be at worst 2 GB in size (I expect your tempdb is much larger than that; if not, increase it). It seems more likely that your query plan creates a spool for Halloween protection and that spool is the culprit.
As a generic answer, using UPDATE ... SET @var = @var + ... for concatenation is known to have correctness issues and is not supported. Alternative approaches that work more reliably are discussed in Concatenating Row Values in Transact-SQL.

First, from your post it isn't clear whether or why you need temp tables. Concatenation can be done inline in a query. If you show us more about the query that is filling up tempdb, we might be able to help you rewrite it. Second, an option that hasn't been mentioned is to do the string manipulation outside of T-SQL entirely: in your middle tier, query for the raw data, do the manipulation, and push it back to the database. Lastly, you can use XML such that the results handle escapes and entities properly. Again, we'd need to know more about what you are trying to accomplish and how.
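The middle-tier option mentioned above can be sketched roughly as follows. This is a minimal sketch, not the poster's code: sqlite3 stands in for SQL Server, and the table and column names mirror the question but are otherwise hypothetical.

```python
import sqlite3

# Hedged sketch: sqlite3 stands in for SQL Server; OrderedPages and its
# columns are assumed from the question, not real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OrderedPages (PageNumber INTEGER, ColumnText TEXT)")
conn.executemany(
    "INSERT INTO OrderedPages VALUES (?, ?)",
    [(1, "first page"), (2, "   "), (3, "third page")],
)

# Pull the raw rows ordered by page and concatenate in the middle tier,
# skipping blank rows just like the RTRIM/LTRIM filter in the T-SQL.
rows = conn.execute(
    "SELECT ColumnText FROM OrderedPages ORDER BY PageNumber"
).fetchall()
concat_text = "\n".join(text for (text,) in rows if text.strip() != "")
print(concat_text)
```

The trade-off is exactly the one the question names: all the row text crosses the wire, but tempdb never sees the intermediate concatenations.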

Agreed. A CLR user-defined function would be the best approach for what you're doing. You could read the text values into an object, join them all together (inside the CLR), and have the function return an NVARCHAR(MAX) result. If you need details on how to do this, let me know.

Related

Storing SQL code in the database

My code actually works; I don't need help with that. What I would like to know is whether what I have done is considered acceptable.
In one particular part of a T-SQL script I am writing, I have to run nearly identical INSERT statements about 20 times; only a portion of the WHERE clause is different in each case. Wanting to loop rather than have 20 almost identical INSERTs, I use a WHILE loop to run some dynamic SQL, and I store the portion of the WHERE clause that differs in the database. Works like a charm. It's worth noting that the INSERT statements in this case may vary in number or in content, and I felt this solution allowed a way to deal with that rather simply.
When I showed this solution to one of my peers at work, his eyebrow went up and he looked at me as though I were growing a new head. He suggested that there was a better way. That may be, and being the junior I'll humbly accept it. But I did want to ask the community whether this seems weird, unprofessional, or against general standards/best practices.
I can post the code if needed, but hopefully I have given you enough to comment one way or the other.
TIA
Edit--
OK, as requested here is the code. I won't try to explain it as it's a can of worms but here it is.
DECLARE @varOfferId INT = 1
DECLARE @MaxOfferId INT = (SELECT COUNT(DISTINCT offer_id) FROM obp.CellCodes_Offers)
DECLARE @SQLWhereClause VARCHAR(1000)
DECLARE @SQLStatement VARCHAR(1000)
WHILE @varOfferId <= @MaxOfferId
BEGIN
    SET @SQLWhereClause = (SELECT where_clause FROM obp.Offers WHERE offer_id = @varOfferId)
    SET @SQLStatement =
        'INSERT INTO obp.Offers_Contacts ' +
        'SELECT DISTINCT o.contact_id, ' + CONVERT(VARCHAR(2), @varOfferId) +
        ' FROM obp.Onboarding AS o
        WHERE ' + @SQLWhereClause +
        ' AND o2.contact_id = o.contact_id)
        AND ' + CONVERT(VARCHAR(2), @varOfferId) + ' IN (
            SELECT cc.offer_id
            FROM obp.CellCodes_Offers AS cc
            WHERE cc.cellcode = o.cellcode)'
    EXECUTE (@SQLStatement)
    SET @varOfferId = @varOfferId + 1
END
So, it seems the consensus thus far is that this is not a good idea. OK, I'm good with that. But I'm not sure I agree that it is easier from a maintenance standpoint. Right now my code looks at the Offers table, gets the row count, and loops that many times. If they add more offers going forward (or remove offers), all I have to do is an INSERT (or DELETE) that includes the offer with the appropriate WHERE clause and we are on our way. Alternatively, if I write all the individual INSERTs, then whenever they add or remove offers I've got to touch the code, which means testing/QA. Thoughts?
However, I do agree with several other points so I guess I'll be going back to the drawing board tomorrow!
Pros:
You've kept your code shorter, saved some time
Cons:
You are now susceptible to SQL Injection
Your DB code is now half in the stored procedure and half in a table, which will make maintenance harder for whoever maintains your code.
Debugging is going to be difficult.
If you have to write 20 different statements, it may be possible to autogenerate them using a query very similar to the loop you've already made.
e.g.
SELECT 'insert into mytable (x,y,z) select x, y, z from a join b on a.x = b.x where ' + wherecolumn
FROM wheretable
This would give you the code to paste into your stored procedure. You could even keep the statement above in the stored procedure, commented out, so others can re-use it in future if column structures change.
For the best post I've ever seen on dynamic SQL, check out Erland Sommarskog's page here.
I think storing the differences in a database table is less straightforward and less convenient to modify afterwards. I would just write a script to do this and put the conditions in the script directly.
For example, in Python you might write something like this.
import MySQLdb

field_value_pairs = {'f1': 'v1', 'f2': 'v2', 'f3': 'v3'}  # modify this to cover your different cases
db = MySQLdb.connect(host=host_name, user=user_name, passwd=password,
                     unix_socket=socket_info)
db.select_db(db_name)
cursor = db.cursor()
for field, value in field_value_pairs.items():
    # Column names cannot be bound as parameters, so the field name is
    # interpolated into the statement; the value is passed as a bound parameter.
    cursor.execute("INSERT INTO tbl_name (" + field + ") VALUES (%s)", (value,))
db.commit()
cursor.close()
db.close()

Dynamic Pivot Query without storing query as String

I am fully familiar with the method in the link below for performing a dynamic pivot query. Is there an alternative method to perform a dynamic pivot without storing the query as a string and inserting a column string inside it?
http://www.simple-talk.com/community/blogs/andras/archive/2007/09/14/37265.aspx
Short answer: no.
Long answer:
Well, that's still no. But I will try to explain why. As of today, when you run a query, the DB engine demands to know the result set structure (number of columns, column names, data types, etc.) that the query will return. Therefore, you have to define the structure of the result set when you ask the DB for data. Think about it: have you ever run a query where you would not know the result set structure beforehand?
That also applies even when you do SELECT *, which is just syntactic sugar. In the end, the returned structure is "all columns in such table(s)".
By assembling a string, you dynamically generate the structure that you desire, before asking for the result set. That's why it works.
Finally, you should be aware that assembling the string dynamically could, in theory, produce a result set with an unbounded number of columns. Of course, that's not actually possible and it would fail, but I'm sure you understand the implications.
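To make the string-assembly point concrete: a minimal sketch, with Python used only to build the query text. The sales table, year column, and category values are all made up for illustration; the point is that the column list is discovered at run time, so the query text must be generated before it can be executed.

```python
# The pivoted column list is only known at run time, which is exactly why
# a dynamic pivot has to be assembled as a string before execution.
categories = ["2019", "2020", "2021"]  # e.g. from SELECT DISTINCT year FROM sales
cols = ", ".join("[{}]".format(c) for c in categories)
sql = (
    "SELECT product, " + cols + " "
    "FROM (SELECT product, year, amount FROM sales) AS src "
    "PIVOT (SUM(amount) FOR year IN (" + cols + ")) AS p"
)
print(sql)  # the generated text now has a fixed, known column structure
```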
Update
I found this, which reinforces the reasons why it does not work.
Here:
SSIS relies on knowing the metadata of the dataflow in advance and a
dynamic pivot (which is what you are after) is not compatible with
that.
I'll keep looking and adding here.

Performance improvement to a big if clause in SQL Server function

I am maintaining a function in SQL Server 2005 that, based on an integer input parameter, needs to call different functions, e.g.
IF @rule_id = 1
    -- execute function 1
ELSE IF @rule_id = 2
    -- execute function 2
ELSE IF @rule_id = 3
... etc
The problem is that there are a fair few rules (about 100), and although the above is fairly readable, its performance isn't great. At the moment it's implemented as a series of IFs that do a binary chop, which is much faster but fairly unpleasant to read and maintain. Any alternative ideas for something that performs well and is fairly maintainable?
I would suggest you generate the code programmatically, e.g. via XML+XSLT. The resulting T-SQL will be the same as you have now, but maintaining it (adding/removing functions) would be much easier.
Inside a function you don't have much choice; using IFs is pretty much the only solution, since you can't use dynamic SQL in functions (you can't invoke EXEC). If it's a stored procedure, then you have much more liberty, as you can use dynamic SQL and tricks like a lookup table: SELECT @function = function FROM table WHERE rule_id = @rule_id; EXEC sp_executesql @function;.
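The lookup-table idea generalizes beyond T-SQL. As an illustration only (not the answerer's code, and with made-up rule handlers), the same dispatch shape in Python:

```python
# Hypothetical rule handlers; in the T-SQL version, a table would map
# rule_id to the name of the function or procedure to run.
def rule_1(x):
    return x + 1

def rule_2(x):
    return x * 2

RULES = {1: rule_1, 2: rule_2}  # one lookup instead of ~100 chained IFs

def execute_rule(rule_id, x):
    try:
        return RULES[rule_id](x)
    except KeyError:
        raise ValueError("unknown rule_id: %r" % rule_id)
```

The lookup is constant-time regardless of how many rules exist, which is the same property the binary-chop IF chain is approximating by hand.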
Can you change it so that it execs a function name built as a string? I'd normally recommend against this sort of dynamic SQL, and there may be better ways if you step back and look at the overall design... but with what is known here, you may have found one of the rare exceptions where it's better.
ex:
SET @functionCall = 'functionRootName' + CAST(@rule_id AS VARCHAR(10))
EXEC @functionCall
Whatever is calling the SQL function - why does it not choose the function?
This seems like a poorly chosen distribution of responsibility.

Sql Optimization: Xml or Delimited String

This is hopefully just a simple question about performance optimization of queries in SQL Server 2008.
I've worked for companies that use Stored Procs a lot for their ETL processes as well as some of their websites. I've seen the scenario where they need to retrieve specific records based on a finite set of key values. I've seen it handled in 3 different ways, illustrated via pseudo-code below.
Dynamic SQL that concatenates a string and executes it:
EXEC('SELECT * FROM TableX WHERE xId IN (' + @Parameter + ')')
Using a user-defined function to split a delimited string into a table:
SELECT * FROM TableY INNER JOIN SPLIT(@Parameter) ON yID = splitId
Using XML as the parameter instead of a delimited VARCHAR value:
SELECT * FROM TableZ JOIN @Parameter.nodes(xpath) AS x (y) ON ...
While I know creating the dynamic SQL in the first snippet is a bad idea for a large number of reasons, my curiosity comes from the last two examples. Is it more efficient to do the due diligence in my code to pass such lists via XML, as in snippet 3, or is it better to just delimit the values and use a UDF to take care of it?
There is now a fourth option: table-valued parameters, whereby you can pass a table of values into a sproc as a parameter and then use it as you would a table variable. I'd prefer this approach over the XML (or CSV-parsing) approach.
I can't quote performance figures for all the different approaches, but that's the one I'd try; I'd recommend doing some real performance tests on them.
Edit:
A little more on TVPs. To pass the values into your sproc, you just define a SqlParameter (SqlDbType.Structured); its value can be set to any IEnumerable, DataTable, or DbDataReader source. So presumably you already have the list of values in a list or array of some sort; you don't need to do anything to transform it into XML or CSV.
I think this also makes the sproc clearer, simpler, and more maintainable, providing a more natural way to achieve the end result. One of the main points is that SQL performs best at set-based activities, not looping or string manipulation.
That's not to say it will perform great with a large set of values passed in. But with smaller sets (up to ~1000) it should be fine.
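Where TVPs aren't available, a middle-tier alternative that avoids both CSV splitting and XML shredding is to generate one bound placeholder per value. A minimal sketch, using Python's sqlite3 as a stand-in and the made-up TableX from the question:

```python
import sqlite3

# Hedged sketch: sqlite3 stands in for SQL Server; TableX is assumed
# from the question's pseudo-code, not real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableX (xId INTEGER, name TEXT)")
conn.executemany("INSERT INTO TableX VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d")])

ids = [2, 4]  # the finite set of key values, already a list in the app tier
placeholders = ",".join("?" for _ in ids)
rows = conn.execute(
    "SELECT xId, name FROM TableX WHERE xId IN (%s) ORDER BY xId" % placeholders,
    ids,
).fetchall()
print(rows)  # -> [(2, 'b'), (4, 'd')]
```

Each value travels as a bound parameter, so nothing on the server has to parse a delimited string or shred XML.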
UDF invocation is a little bit more costly than splitting the XML using the built-in function.
However, this only needs to be done once per query, so the performance difference will be negligible.

Is there some way to inject SQL even if the ' character is deleted?

If I remove all the ' characters from a SQL query, is there some other way to do a SQL injection attack on the database?
How can it be done? Can anyone give me examples?
Yes, there is. An excerpt from Wikipedia:
"SELECT * FROM data WHERE id = " + a_variable + ";"
It is clear from this statement that the author intended a_variable to be a number correlating to the "id" field. However, if it is in fact a string then the end user may manipulate the statement as they choose, thereby bypassing the need for escape characters. For example, setting a_variable to
1;DROP TABLE users
will drop (delete) the "users" table from the database, since the SQL would be rendered as follows:
SELECT * FROM DATA WHERE id=1;DROP TABLE users;
SQL injection is not a simple attack to fight. I would do very careful research if I were you.
Yes, depending on the statement you are using. You are better off protecting yourself either by using Stored Procedures, or at least parameterised queries.
See Wikipedia for prevention samples.
I suggest you pass the variables as parameters and not build your own SQL. Otherwise there will always be a way to do a SQL injection, in ways we are currently unaware of.
The code you create is then something like:
// Not tested
var sql = "SELECT * FROM data WHERE id = @id";
var cmd = new SqlCommand(sql, myConnection);
cmd.Parameters.AddWithValue("@id", request.getParameter("id"));
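To see what binding buys you, here is a minimal sketch using Python's sqlite3 as a stand-in (any parameterized API behaves similarly); the data table mirrors the Wikipedia excerpt above:

```python
import sqlite3

# Hedged sketch: sqlite3 stands in for whatever database API you use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, payload TEXT)")
conn.execute("INSERT INTO data VALUES (1, 'hello')")

# The attacker-controlled value from the Wikipedia example.
a_variable = "1;DROP TABLE users"

# Bound as a parameter, the whole string is treated as one opaque value:
rows = conn.execute("SELECT * FROM data WHERE id = ?", (a_variable,)).fetchall()
print(rows)  # -> [] : nothing matches, and no second statement runs
```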
If you have a name like mine with an ' in it, it is very annoying that all '-characters are removed or marked as invalid.
You also might want to look at this Stackoverflow question about SQL Injections.
Yes, it is definitely possible.
If you have a form where you expect an integer to build your next SELECT statement, then someone can enter something like:
SELECT * FROM thingy WHERE attributeID=
5 (good answer, no problem)
5; DROP table users; (bad, bad, bad...)
The following website details further classical SQL injection techniques: SQL Injection Cheat Sheet.
Using parameterized queries or stored procedures is not automatically any better: if they build queries from the passed parameters internally, they can be a source of injection just as well. This is also described on this page: Attacking Stored Procedures in SQL.
Now, if you suppress the single quote, you prevent only a given set of attacks, not all of them.
As always, do not trust data coming from the outside. Filter them at these 3 levels:
Interface level for obvious stuff (a drop down select list is better than a free text field)
Logical level for checks related to data nature (int, string, length), permissions (can this type of data be used by this user at this page)...
Database access level (escape single quotes...).
Have fun and don't forget to check Wikipedia for answers.
Parameterized inline SQL or parameterized stored procedures is the best way to protect yourself. As others have pointed out, simply stripping/escaping the single quote character is not enough.
You will notice that I specifically talk about "parameterized" stored procedures. Simply using a stored procedure is not enough either if you revert to concatenating the procedure's passed parameters together. In other words, wrapping the exact same vulnerable SQL statement in a stored procedure does not make it any safer. You need to use parameters in your stored procedure just like you would with inline SQL.
Also, even if you do just look for the apostrophe, you don't want to remove it. You want to escape it, by replacing every apostrophe with two apostrophes.
But parameterized queries/stored procedures are so much better.
Since this is a relatively old question, I won't bother writing up a complete and comprehensive answer, since most aspects of such an answer have been mentioned here by one poster or another.
I do find it necessary, however, to bring up another issue that was not touched on by anyone here - SQL Smuggling. In certain situations, it is possible to "smuggle" the quote character ' into your query even if you tried to remove it. In fact, this may be possible even if you used proper commands, parameters, Stored Procedures, etc.
Check out the full research paper at http://www.comsecglobal.com/FrameWork/Upload/SQL_Smuggling.pdf (disclosure, I was the primary researcher on this) or just google "SQL Smuggling".
. . . uh about 50000000 other ways
maybe something like 5; drop table employees; --
The resulting SQL may be something like:
select * from somewhere where number = 5; drop table employees; -- and sadfsf
(-- starts a comment)
Yes, absolutely: depending on your SQL dialect and such, there are many ways to achieve injection that do not use the apostrophe.
The only reliable defense against SQL injection attacks is using the parameterized SQL statement support offered by your database interface.
Rather that trying to figure out which characters to filter out, I'd stick to parametrized queries instead, and remove the problem entirely.
It depends on how you put together the query, but in essence yes.
For example, in Java if you were to do this (deliberately egregious example):
String query = "SELECT name_ from Customer WHERE ID = " + request.getParameter("id");
then there's a good chance you are opening yourself up to an injection attack.
Java has some useful tools to protect against these, such as PreparedStatements (where you pass in a string like "SELECT name_ from Customer WHERE ID = ?" and the JDBC layer handles escapes while replacing the ? tokens for you), but some other languages are not so helpful for this.
The thing is, apostrophes may be genuine input, and you have to escape them by doubling them up when you are using inline SQL in your code. What you are looking for is a regex pattern like:
;.*--
A semicolon used to prematurely end the genuine statement, some injected SQL, followed by a double hyphen to comment out the trailing SQL from the original genuine statement. The hyphens may be omitted in the attack.
Therefore the answer is: no, simply removing apostrophes does not guarantee you safety from SQL injection.
I can only repeat what others have said. Parameterized SQL is the way to go. Sure, it is a bit of a pain in the butt to code, but once you have done it once, it isn't difficult to cut and paste that code and make the modifications you need. We have a lot of .NET applications that allow web site visitors to specify a whole range of search criteria, and the code builds the SQL SELECT statement on the fly, but everything that could have been entered by a user goes into a parameter.
When you are expecting a numeric parameter, you should always be validating the input to make sure it's numeric. Beyond helping to protect against injection, the validation step will make the app more user friendly.
If you ever receive id = "hello" when you expected id = 1044, it's always better to return a useful error to the user instead of letting the database return an error.
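A minimal sketch of that validation step (the function name and error message are made up for illustration):

```python
def parse_id(raw):
    """Validate a numeric parameter before it goes anywhere near SQL.

    Hypothetical helper: rejects non-numeric input up front so the
    application can return a friendly error instead of a database error.
    """
    try:
        return int(raw)
    except (TypeError, ValueError):
        raise ValueError("expected a numeric id, got %r" % (raw,))
```

Input like "1044" passes through as an integer, while "hello" (or a payload like "5; drop table employees; --") is rejected before any query is built.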