Extensive String Manipulation - Cursor or While-Loop? - sql

Quick Background
I have received a project from our marketing team to make bulk updates to the descriptions of products that will be displayed on our website (>500k items). They have decided to take many decades' worth of descriptions and make them as consistent as possible. (Ex. 'screw driver', 'screwdriver', and 'screw-driver' should all become 'Screwdriver'.)
I accomplished the task to about 95% of their satisfaction using a clunky, long, hard-to-maintain series of UPDATE statements, run against a table only I maintain, to modify the strings. I then pass the results on to our web deployment team, but I wasn't thinking they would want to maintain this indefinitely.
I can easily produce a table of substrings to find, the conditions under which to find them, and what to replace them with. I think something driven by a table like this would be easiest to maintain for 90% of the cases we encounter.
Now I'm uncertain about the best way to proceed to make this dependable and easy to maintain. I've received conflicting advice: some say a WHILE loop would be a good fit, others say a cursor would be just fine.
Now to the question
Given we will/may/could be adding somewhere around 1k new products a month, and I have a table of conditions like the following, what is the most efficient and dependable way to execute the manipulation regularly?
Condition, Find_substring, Replace_with
like '%screw driver%', 'screw driver', 'Screwdriver'
like '%screw-driver%', 'screw-driver', 'Screwdriver'
like '%screwdriver%', 'screwdriver', 'Screwdriver'
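To make that concrete, here is a minimal sketch of the rules table I have in mind (all names are made up):

CREATE TABLE dbo.Description_Fixes (
    condition_pattern NVARCHAR(200) NOT NULL,  -- used with LIKE to find candidate rows
    find_substring    NVARCHAR(200) NOT NULL,  -- the text to replace
    replace_with      NVARCHAR(200) NOT NULL   -- the standardized text
);

INSERT INTO dbo.Description_Fixes VALUES
    ('%screw driver%', 'screw driver', 'Screwdriver'),
    ('%screw-driver%', 'screw-driver', 'Screwdriver'),
    ('%screwdriver%',  'screwdriver',  'Screwdriver');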
Open to any and all ideas, suggestions and advice.

If your rules are really as simple as that, then simply having "old_value" and "new_value" columns in the table should suffice, with a single statement to fix all of the data:
UPDATE MT
SET description = REPLACE(description, old_value, new_value)
FROM dbo.My_Table MT
INNER JOIN dbo.Fix_Table FT
    ON MT.description LIKE '%' + FT.old_value + '%'
You might need to adjust the query if you expect multiple matches on a single product. Also, be careful of strings that might be part of another string. For example, fixing "ax" to "axe" might cause problems with a "fax machine". There are a lot of little details like this that might affect the exact approach.
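If you do expect multiple matches on a single product, one way to guarantee every rule gets applied is to loop over the fix rules and run one set-based UPDATE per rule. A sketch, reusing the dbo.My_Table and dbo.Fix_Table names from above:

DECLARE @old NVARCHAR(200), @new NVARCHAR(200);

DECLARE fix_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT old_value, new_value FROM dbo.Fix_Table;

OPEN fix_cursor;
FETCH NEXT FROM fix_cursor INTO @old, @new;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- one set-based UPDATE per rule: every row containing the
    -- substring is rewritten, so overlapping rules all get applied
    UPDATE dbo.My_Table
    SET description = REPLACE(description, @old, @new)
    WHERE description LIKE '%' + @old + '%';

    FETCH NEXT FROM fix_cursor INTO @old, @new;
END
CLOSE fix_cursor;
DEALLOCATE fix_cursor;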

Have a table with bad_val and good_val columns (call it tblMod). You can write a stored procedure that loops over tblMod, generates a SQL statement for each row, and executes the statement as dynamic SQL.
loop on tblMod
-- generate SQL statements like the following (good_val is pasted in
-- as-is, so it must itself be valid SQL: a quoted literal or an expression):
set @sqlText = 'update myTable set description = ' + good_val + ' where description = ''' + bad_val + ''''
exec sp_executesql @sqlText
This approach also allows you to use SQL functions, or any other expression, in the good_val field of tblMod. For instance, you can have the following in the good_val field: upper(description)
or substring(description, 1, 4).
Because you are generating dynamic SQL, those will work.
In that case your sqlText will be something like
'update myTable set description = substring(description, 1, 4) where description = ''some bad value'''
The example above might not be exactly right, but I hope you get the idea.
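For concreteness, here is a hedged T-SQL sketch of that loop (it assumes tblMod also has an integer key mod_id; that column name is made up):

DECLARE @id INT = 1, @max INT, @sqlText NVARCHAR(MAX);
SELECT @max = MAX(mod_id) FROM tblMod;

WHILE @id <= @max
BEGIN
    -- good_val is pasted in unquoted, so it must itself be valid SQL:
    -- either a quoted literal such as ''Screwdriver'' or an expression
    -- such as upper(description)
    SELECT @sqlText = N'UPDATE myTable SET description = ' + good_val
                    + N' WHERE description = ''' + REPLACE(bad_val, '''', '''''') + N''''
    FROM tblMod
    WHERE mod_id = @id;

    EXEC sp_executesql @sqlText;
    SET @id += 1;
END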

Related

SQL DB2 - How to SELECT or compare columns based on their name?

Thank you for checking my question out!
I'm trying to write a query for a very specific problem we're having at my workplace and I can't seem to get my head around it.
Short version: I need to be able to target columns by their name, and more specifically by a part of their name that will be consistent throughout all the columns I need to combine or compare.
More details:
We have (for example) 5 different surveys. They have many questions each, but SOME of the questions are part of the same metric, and we need to create a generic field that captures that shared metric. There's more background to the "why" of that, but it's pretty important for us at this point.
We were able to kind of solve this with either COALESCE() or CASE statements, but the challenge is that, as surveys and survey versions continue to grow, our vendor inevitably generates new columns for each survey and its questions.
Take this example, which is what we do currently and works well enough:
CASE
WHEN SURVEY_NAME = 'Service1' THEN SERV1_REC
WHEN SURVEY_NAME = 'Notice1' THEN FNOL1_REC
WHEN SURVEY_NAME = 'Status1' THEN STAT1_REC
WHEN SURVEY_NAME = 'Sales1' THEN SALE1_REC
WHEN SURVEY_NAME = 'Transfer1' THEN Null
ELSE Null
END REC
And also this alternative which works well:
COALESCE(SERV1_REC, FNOL1_REC, STAT1_REC, SALE1_REC) as REC
But as I mentioned, eventually we will have a "SALE2_REC" for example, and we'll need them BOTH on this same statement. I want to create something where having to come into the SQL and make changes isn't needed. Given that the columns will ALWAYS be named "something#_REC" for this specific metric, is there any way to achieve something like:
COALESCE(all columns named LIKE '%_REC') as REC
Bonus! Related, might be another way around this same problem:
Would there also be a way to achieve this?
SELECT (columns named LIKE '%_REC') FROM ...
Thank you very much in advance for all your time and attention.
-Kendall
Table and column information in Db2 are managed in the system catalog. The relevant views are SYSCAT.TABLES and SYSCAT.COLUMNS. You could write:
select colname, tabname
from syscat.columns
where colname like some_expression
and tabname = 'MYTABLE'
Note that the LIKE predicate supports expressions based on a variable or the result of a scalar function. So you could match it against some dynamic input.
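For example, you could build the COALESCE expression itself from the catalog and then run it as dynamic SQL. A sketch, assuming Db2 9.7+ for LISTAGG and made-up schema/table names (note the ESCAPE clause, since _ is itself a LIKE wildcard):

SELECT 'COALESCE(' || LISTAGG(COLNAME, ', ') || ') AS REC'
FROM SYSCAT.COLUMNS
WHERE TABSCHEMA = 'MYSCHEMA'
  AND TABNAME = 'MYTABLE'
  AND COLNAME LIKE '%\_REC' ESCAPE '\';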
Have you considered storing the more complicated properties in JSON or XML values? Db2 supports both and you can query those values with regular SQL statements.
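A tiny sketch of the JSON route (an assumption on my part: a reasonably recent Db2, 11.1+, and a hypothetical SURVEY_DOC column holding the survey answers as JSON):

SELECT JSON_VALUE(SURVEY_DOC, '$.rec') AS REC  -- '$.rec' is a made-up path
FROM MYTABLE;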

Storing SQL code in the database

My code actually works, I don't need help with that. What I would like to know is whether what I have done is considered acceptable.
In one particular part of a T-SQL script I am writing, I have to run about 20 nearly identical INSERT statements. Only a portion of the WHERE clause is different in each case. Rather than have 20 almost identical inserts, I use a WHILE loop to run some dynamic SQL, and I store the portion of the WHERE clause that differs in the database. Works like a charm. It's worth noting that the INSERT statements in this case may vary in number or in content, and I felt this solution allowed a way to deal with that rather simply.
When showing one of my peers at work this solution, his one eyebrow went up and he looked at me as though I was growing a new head. He suggested that there was a better way. That may be, and me being the junior, I'll humbly accept it. But I did want to ask the community whether this seems weird, unprofessional, or against general standards / best practices.
I can post the code if needed but for the purposes hopefully I have given you enough to comment one way or the other.
TIA
Edit--
OK, as requested here is the code. I won't try to explain it as it's a can of worms but here it is.
DECLARE @varOfferId INT = 1
DECLARE @MaxOfferId INT = (SELECT COUNT(DISTINCT offer_id) FROM obp.CellCodes_Offers)
DECLARE @SQLWhereClause VARCHAR(1000)
DECLARE @SQLStatement VARCHAR(1000)
WHILE @varOfferId <= @MaxOfferId
BEGIN
    SET @SQLWhereClause = (SELECT where_clause FROM obp.Offers WHERE offer_id = @varOfferId)
    -- NB: the stored where_clause fragment evidently opens a subquery
    -- (alias o2) that the lone ')' below closes
    SET @SQLStatement =
        'INSERT INTO obp.Offers_Contacts ' +
        'SELECT DISTINCT o.contact_id, ' + CONVERT(VARCHAR(2), @varOfferId) +
        ' FROM obp.Onboarding AS o
        WHERE ' + @SQLWhereClause +
        ' AND o2.contact_id = o.contact_id)
        AND ' + CONVERT(VARCHAR(2), @varOfferId) + ' IN (
            SELECT cc.offer_id
            FROM obp.CellCodes_Offers AS cc
            WHERE cc.cellcode = o.cellcode)'
    EXECUTE (@SQLStatement)
    SET @varOfferId = @varOfferId + 1
END
So, it seems that the consensus thus far is that this is not a good idea. OK, I'm good with that. But I'm not sure I agree that it is easier from a maintenance standpoint. Right now my code looks at the 'Offers' table, gets the row count, and loops that many times. If they add more offers going forward (or reduce the offers), all I have to do is an INSERT (or DELETE) and include the offer with the appropriate WHERE clause, and we are on our way. Alternatively, if I write all the individual INSERTs, then whenever they add or remove offers I've got to touch the code, which means testing/QA. Thoughts?
However, I do agree with several other points so I guess I'll be going back to the drawing board tomorrow!
Pros:
You've kept your code shorter, saved some time
Cons:
You are now susceptible to SQL Injection
Your code is now half in the stored procedure and half in a table - this will make maintenance harder for whoever maintains your code.
Debugging is going to be difficult.
If you have to write 20 different statements, it may be possible to autogenerate them using a very similar WHILE LOOP to the one you've already made.
e.g.
SELECT 'insert into mytable (x,y,z) select a.x, a.y, b.z from a join b on a.x = b.x where ' + wherecolumn
FROM wheretable
This would give you the code you need to paste into your stored procedure. You could even keep that statement above in the stored procedure, commented out, so others may re-use it in future if column structures change.
For the best post I've ever seen on dynamic SQL, check out Erland Sommarskog's page here.
I think recording the differences in a database table is less straightforward and less convenient to modify afterwards. I would just write a script to do this, and put the conditions in the script directly.
For example, in Python you may write something like this.
import MySQLdb

# the fields to insert and their values -- modify this dict
# to meet your different cases
field_value_pairs = {'f1': 'v1', 'f2': 'v2', 'f3': 'v3'}

db = MySQLdb.connect(host=host_name, user=user_name, passwd=password,
                     unix_socket=socket_info)
cursor = db.cursor()
db.select_db(db_name)
for field, value in field_value_pairs.items():
    # column names cannot be bound as parameters, so the field name is
    # interpolated into the statement; the value is a bound parameter
    cursor.execute("INSERT INTO tbl_name (%s) VALUES (%%s)" % field, (value,))
db.commit()
cursor.close()
db.close()

Identify Row as having changes excluding changes in certain columns

Within our business rules, we need to track when a row is designated as being changed. The table contains multiple columns designated as non-relevant per our business purposes (such as a date entered field, timestamp, reviewed bit field, or received bit field). The table has many columns and I'm trying to find an elegant way to determine if any of the relevant fields have changed and then record an entry in an auditing table (entering the PK value of the row - the PK cannot be edited). I don't even need to know which column actually changed (although it would be nice down the road).
I am able to accomplish it through a stored procedure, but it is an ugly SP using the following syntax for an update (OR statements shortened considerably for post):
INSERT INTO [TblSourceDataChange] (pkValue)
SELECT d.pkValue
FROM deleted d INNER JOIN inserted i ON d.pkValue=i.pkValue
WHERE ( i.[F440] <> d.[F440]
OR i.[F445] <> d.[F445]
OR i.[F450] <> d.[F450])
I'm trying to find a generic way where I could designate the ignore fields, and the stored proc would still work even if I added additional relevant fields into the table. The non-relevant fields do not change very often, whereas the relevant fields tend to be a little more dynamic.
Have a look at Change Data Capture. This is a new feature in SQL Server 2008.
First You enable CDC on the database:
EXEC sys.sp_cdc_enable_db
Then you can enable it on specific tables, and specify which columns to track:
EXEC sys.sp_cdc_enable_table
    @source_schema = 'dbo',
    @source_name = 'xxx',
    @supports_net_changes = 1,
    @role_name = NULL,
    @captured_column_list = N'xxx1,xxx2,xxx3'
This creates a change table named cdc.dbo_xxx. Any changes made to records in the table are recorded in that table.
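You can then read the changes back with the generated table-valued function; the capture instance name dbo_xxx follows from the schema and table used above:

DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_xxx');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

-- every captured change between the two LSNs
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_xxx(@from_lsn, @to_lsn, N'all');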
I object! The one word I cannot use to describe the options available is elegant. I have yet to find a satisfying way to accomplish what you want. There are options, but all of them feel a bit unsatisfactory. When and why you choose these options depends on some factors you didn't mention.
How often do you need to "ask" what fields changed? Meaning, do users only occasionally click on the "audit history" link? Or is it needed all the time, to sort out how your app should behave?
How much does disk space cost you? I'm not being flippant, but I've worked places where the storage strategy for our auditing was a million-dollar issue, based on what we were being charged for SAN space -- meaning "expensive for SQL Server to reconstitute" wasn't a consideration, storage size was. You may be the same, or the inverse.
Change Data Capture
As @TGnat mentioned you can use CDC. This method is great because you simply enable change tracking, then call the sproc to start tracking. CDC is nice because it's pretty efficient storage and horsepower wise. You also kind of set it and forget it---that is, until developers come along and want to change the shape of your tables. For developer sanity you'll want to generate a script that disables/enables tracking for your entities.
I noticed you want to exclude certain columns, rather than include them. You could accomplish this with a FOR XML PATH trick: write a query like the following, then pass the @capturedColList variable when calling sys.sp_cdc_enable_table.
SET @capturedColList = Substring((
    SELECT ',' + COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = '<YOUR_TABLE>' AND
          COLUMN_NAME NOT IN ('excludedA', 'excludedB')
    FOR XML PATH('')
), 2, 8000)
Triggers w/Cases
The second option I see is some sort of code generation. It could be an external harness or a SPROC that writes your triggers. Whatever your poison, it will need to be automated and generic. Basically, it's code that writes DDL for triggers that compare current values to INSERTED or DELETED, using tons of unwieldy CASE statements for each column.
There is a discussion of the style here.
Log Everything, Sort it out later
The last option is to use a trigger to log every row change. Then you write code (SPROCs/UDFs) that can look through your log data and recognize when a change has occurred. Why would you choose this option? Disk space isn't a concern, and while you need to be able to understand what changed, you only rarely ask the system this question.
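A minimal sketch of that last option (the source column names are borrowed from the question; the trigger and log table names are made up):

CREATE TRIGGER trg_TblSource_LogAll ON dbo.TblSource
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- dump the prior version of every updated row; figuring out which
    -- relevant columns actually changed happens later, at query time
    INSERT INTO dbo.TblSource_Log (pkValue, F440, F445, F450, logged_at)
    SELECT d.pkValue, d.F440, d.F445, d.F450, GETDATE()
    FROM deleted d;
END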
HTH,
-eric
Use a trigger and make sure it can handle multiple row inserts.
I found the answer in the post SQL Server Update, Get only modified fields and adapted the SQL to fit my needs (this sql is in a trigger). The SQL is posted below:
DECLARE @idTable INT
SELECT @idTable = T.id
FROM sysobjects P JOIN sysobjects T ON P.parent_obj = T.id
WHERE P.id = @@procid
IF EXISTS
    (SELECT * FROM syscolumns
     WHERE id = @idTable
     AND CONVERT(VARBINARY, REVERSE(COLUMNS_UPDATED())) & POWER(CONVERT(BIGINT, 2), colorder - 1) > 0
     AND name NOT IN ('timestamp', 'Reviewed'))
BEGIN
    --Do appropriate stuff here
END

SQL to filter by multiple criteria including containment in string list

So I have a table, let's call it tbl.Items, and there is a column title in tbl.Items. I want to loop through each row, and for each title in tbl.Items I want to do the following:
The column has the datatype nvarchar(max) and contains a string...
Filter the string to remove stopwords like 'in', 'out', 'where', etc.
Compare the rest of the string to a predefined list, and if there is a match, perform some action which involves inserting data into other tables as well.
The problem is I'm ignorant when it comes to writing T-SQL scripts. Please help and guide me on how I can achieve this.
Can it be achieved by writing a SQL script?
Or do I have to develop a console application in C# or another language?
I'm using MS SQL Server 2008.
Thanks in advance
You want a few things. First, look up SQL Server's syntax for functions, and write something like this:
-- Warning! Code written off the top of my head,
-- don't expect this to work w/copy-n-paste
create function removeStrings(@input nvarchar(4000))
returns nvarchar(4000)
as begin
    -- We're being kind of simple-minded and using strings
    -- instead of regular expressions, so we assume a
    -- space before and after each word. This makes this work better:
    set @input = ' ' + @input
    -- Big list of replaces; note the argument order is
    -- replace(string, pattern, replacement)
    set @input = replace(@input, ' in ', ' ')
    set @input = replace(@input, ' out ', ' ')
    -- more replaces...
    return @input
end
Then you need your list of matches in a table, call this "predefined" with a column "matchString".
Then you can retrieve the matching rows with:
select p.matchString
from items i
join predefined p
on dbo.removeStrings(i.title) = p.matchString
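And to drive the "perform some action" part, the same join can feed an INSERT (a sketch, with a made-up target table):

INSERT INTO dbo.MatchedTitles (title, matchString)
SELECT i.title, p.matchString
FROM items i
JOIN predefined p
    ON dbo.removeStrings(i.title) = p.matchString;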
Once you have those individual pieces working, I suggest a new question on what particular process you may be doing with them.
Warning: Not knowing how many rows you have or how often you have to do this (every time a user saves something? Once/day?), this will not exactly be zippy, if you know what I mean. So once you have these building blocks in hand, there may also be a follow-up question for how and when to do it.

SQL to search and replace in mySQL

I'm in the process of fixing a poorly imported database with issues caused by using the wrong database encoding, or something like that.
Anyways, coming back to my question: in order to fix these issues I'm using a query of this form:
UPDATE table_name SET field_name =
replace(field_name, 'search_text', 'replace_text');
And thus, if the table I'm working on has multiple columns, I have to call this query for each of the columns. And also, as there is not only one pair of things to run the find-and-replace on, I have to call the query for each of these pairs as well.
So as you can imagine, I end up running tens of queries just to fix one table.
What I was wondering is if there is a way to combine multiple find-and-replaces in one query: let's say, look for this set of things, and if found, replace with the corresponding pair from this other set of things.
Or if there would be a way to make a query of the form I've shown above, to run somehow recursively, for each column of a table, regardless of their name or number.
Thank you in advance for your support,
titel
Let's try and tackle each of these separately:
If the set of replacements is the same for every column in every table that you need to do this on (or there are only a couple patterns), consider creating a user-defined function that takes a varchar and returns a varchar, and just calls replace(replace(@input,'search1','replace1'),'search2','replace2') nested as appropriate.
To update multiple columns at the same time you should be able to do UPDATE table_name SET field_name1 = replace(field_name1,...), field_name2 = replace(field_name2,...) or something similar.
As for running something like that for every column in every table, I'd think it would be easiest to write some code which fetches a list of columns and generates the queries to execute from that.
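For that last point, the generation itself can be done in SQL against information_schema. A sketch, assuming a database named my_database and fixed search/replace text; each output row is a statement you then execute:

SELECT CONCAT('UPDATE `', TABLE_NAME, '` SET `', COLUMN_NAME,
              '` = REPLACE(`', COLUMN_NAME,
              '`, ''search_text'', ''replace_text'');') AS stmt
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'my_database'
  AND DATA_TYPE IN ('char', 'varchar', 'text');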
I don't know of a way to automatically run a search-and-replace on each column, however the problem of multiple pairs of search and replace terms in a single UPDATE query is easily solved by nesting calls to replace():
UPDATE table_name SET field_name =
    replace(
        replace(
            replace(
                field_name,
                'foo',
                'bar'
            ),
            'see',
            'what'
        ),
        'I',
        'mean?'
    )
If you have multiple replaces of different text in the same field, I recommend that you create a table with the current values and what you want them replaced with. (Could be a temp table of some kind if this is a one-time deal; if not, make it a permanent table.) Then join to that table and do the update.
Something like:
update t1
set field1 = t2.newvalue
from table1 t1
join mycrossreferencetable t2 on t1.field1 = t2.oldvalue
Sorry, I didn't notice this is MySQL; the code above is what I would use in SQL Server. MySQL's syntax is different, but the technique is similar.
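For reference, the MySQL flavor of the same join-update (a sketch with the same made-up names):

UPDATE table1 t1
JOIN mycrossreferencetable t2 ON t1.field1 = t2.oldvalue
SET t1.field1 = t2.newvalue;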
I wrote a stored procedure that does this. I use this on a per database level, although it would be easy to abstract it to operate globally across a server.
I would just paste this inline, but it would seem that I'm too dense to figure out how to use the markdown deal, so the code is here:
http://www.anovasolutions.com/content/mysql-search-and-replace-stored-procedure